Data parsing

Category: Data Parsing

102 USD

Individually configure A-Parser, Content Downloader, X-Parser, or any other parser to parse a list of article URLs from a single blog.

Input data:

- URLs of pages from an informational blog

Output data:

- text with HTML markup, saved as .txt files (one file per article, see example)

- images saved in a separate /images/ folder inside the folder that holds the .txt files
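
The output layout described above could be sketched roughly as follows. This is only an illustration, not a required implementation: the function name `save_article`, the `slug` used as the file name, and the `images` mapping of file name to bytes are all assumptions made for the example.

```python
from pathlib import Path


def save_article(out_dir: Path, slug: str, html_text: str,
                 images: dict[str, bytes]) -> Path:
    """Write one article as <slug>.txt and its images under out_dir/images/.

    `slug` and `images` (file name -> raw bytes) are hypothetical inputs;
    the posting does not prescribe how files should be named.
    """
    images_dir = out_dir / "images"
    # parents=True also creates out_dir itself if it is missing.
    images_dir.mkdir(parents=True, exist_ok=True)

    # One file per article, HTML markup kept as plain text.
    text_path = out_dir / f"{slug}.txt"
    text_path.write_text(html_text, encoding="utf-8")

    # Images go into the /images/ folder next to the .txt files.
    for name, data in images.items():
        (images_dir / name).write_bytes(data)
    return text_path
```

Called as `save_article(Path("out"), "rubcy", "<p>...</p>", {"1.jpg": data})`, this would produce `out/rubcy.txt` and `out/images/1.jpg`.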

Parameters:

- Save only text, images, and headings (we are interested only in the article body plus meta tags). Do not take: the table of contents at the beginning, the author, commercial and advertising inserts.
- From a slider, take only the first image.
- Keep these tags and attributes: title, description, h1 - h6, i, p, blockquote, ol, ul, alt, strong, b.
- Put the description text at the beginning, wrapped as {desc}text{/desc}.
- Keep hyperlinks inside the text that point to external sources.
- Save links to the site itself in relative form, keeping only the URL tail. For site.ru/category/url/, drop everything up to and including the slash before the tail, so the link looks like <a href="gripp/">anchor</a>, where "gripp/" is the URL tail (the beginning, site.ru/category/, is not needed).
- Because links are saved in relative form, the URL tail of each parsed page must be saved as well. For example, when parsing https://site.ru/rubrika/rubcy/, add a tag with the URL tail inside the text: [url]rubcy[/url] (only the tail, without slashes).
- Do not save: links with anchors, stray symbols such as curly or square brackets at the end of a sentence (for example [1]), authors, advertisements.
- Put each paragraph on its own line so that the parsed code is not one long line.
- Mark up highlighted text blocks with the <blockquote> tag, which WordPress shows as a quote.
- The last things to take from the article are the sources and the frequently asked questions.
- Save categories in tags, in this form:

[category]parent category[/category]

[category]category[/category]

take only the first (parent) and last (regular) category
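
The link, URL tail, description, and category rules above can be sketched with the Python standard library. This is a minimal sketch under stated assumptions, not the expected implementation: the function names are hypothetical, "internal" is taken to mean "same host as the parsed page" (www and non-www hosts are not unified), and placing the {desc}, [url], and [category] tags together at the top of the file is an assumption, since the posting only says the description goes at the beginning.

```python
from urllib.parse import urlparse


def url_tail(url: str) -> str:
    """Return the last non-empty path segment of a URL, without slashes."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return segments[-1] if segments else ""


def rewrite_href(href: str, page_url: str) -> str:
    """Turn internal links into 'tail/'; leave external links untouched."""
    target = urlparse(href)
    if target.scheme not in ("", "http", "https"):
        return href  # e.g. mailto: links are left as they are
    site = urlparse(page_url).netloc
    if target.netloc not in ("", site):
        return href  # external source: keep the full link
    tail = url_tail(href)
    return f"{tail}/" if tail else href


def header_tags(page_url: str, description: str, categories: list[str]) -> str:
    """Build the {desc}, [url] and [category] lines for one article."""
    lines = [
        f"{{desc}}{description}{{/desc}}",
        f"[url]{url_tail(page_url)}[/url]",
    ]
    # Only the first (parent) and the last category are kept.
    picked = categories[:1]
    if len(categories) > 1:
        picked.append(categories[-1])
    lines += [f"[category]{c}[/category]" for c in picked]
    return "\n".join(lines)


page = "https://site.ru/rubrika/rubcy/"
print(url_tail(page))                                        # rubcy
print(rewrite_href("https://site.ru/category/gripp/", page))  # gripp/
```

Removing the author, advertising blocks, and the table of contents depends on each site's markup, so those selectors are deliberately left out of this sketch.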
