Technical specification for the development of a parser (scraper) for Amazon
Task: Develop a reliable, fault-tolerant scraper that collects information from Amazon for a large number of products (millions of ASINs). The scraper must run stably 24/7 and minimize HTTP 503 errors (blocking or access restrictions).
Mandatory requirements:
Data parsing:
- Retrieve product information from the product page for a given list of ASINs: name, price, rating, number of reviews, stock availability, product description, images, and other available fields.
- Support for a large volume of requests (from 100,000 to several million products).
Stability and scalability:
- The system must operate around the clock (24/7), with no scheduled downtime and no need for manual restarts.
- Provide mechanisms for request balancing, use of proxy servers, IP address rotation, as well as request delay mechanisms to minimize the risk of blocks and HTTP 503 errors.
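The request delay mechanism mentioned above could be sketched roughly as follows. This is a minimal illustration, not a prescribed design: the class name RequestThrottle and its parameters (min_interval, max_jitter) are assumptions introduced here, and a production system would likely apply such a limit per proxy or per worker.

```python
import random
import time


class RequestThrottle:
    """Enforces a minimum interval (plus optional random jitter) between requests.

    Hypothetical helper: the name and parameters are illustrative only.
    """

    def __init__(self, min_interval, max_jitter=0.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between consecutive requests
        self.max_jitter = max_jitter      # upper bound of the random extra delay
        self._clock = clock               # injectable for testing
        self._sleep = sleep               # injectable for testing
        self._next_allowed = 0.0          # earliest time the next request may start

    def wait(self):
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay > 0:
            self._sleep(delay)
        jitter = random.uniform(0.0, self.max_jitter)
        # Schedule the following request relative to when this one may start.
        self._next_allowed = max(now, self._next_allowed) + self.min_interval + jitter
        return delay
```

A worker would call wait() immediately before each request; the clock and sleep arguments exist only so that the pacing logic can be tested without real delays.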
Bypassing Amazon's protection and restrictions:
- Provide methods for bypassing Amazon's anti-bot protection (CAPTCHA, IP blocking, User-Agent restrictions, etc.).
- Use mechanisms for automatic recognition and solving of CAPTCHA (for example, using anti-captcha services).
Proxy management:
- The system must integrate proxy servers, with automatic rotation and monitoring of their performance.
- Set up monitoring of proxy quality, excluding blocked and slow IPs.
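One possible shape for the proxy quality monitoring described above is a pool that records the outcome of each request and skips proxies that fail repeatedly or respond slowly. The sketch below is an assumption-laden illustration: ProxyPool, ProxyStats, and the thresholds max_failures and max_avg_latency are hypothetical names and values, not part of the specification.

```python
from dataclasses import dataclass, field


@dataclass
class ProxyStats:
    """Per-proxy health counters (hypothetical structure)."""
    consecutive_failures: int = 0
    latencies: list = field(default_factory=list)

    @property
    def avg_latency(self):
        # Average over the recorded successful requests; 0.0 if none yet.
        return sum(self.latencies) / len(self.latencies) if self.latencies else 0.0


class ProxyPool:
    """Round-robin rotation that skips blocked or slow proxies."""

    def __init__(self, proxies, max_failures=3, max_avg_latency=5.0):
        self._order = list(proxies)
        self._stats = {p: ProxyStats() for p in self._order}
        self._cursor = 0
        self.max_failures = max_failures        # assumed threshold
        self.max_avg_latency = max_avg_latency  # assumed threshold, in seconds

    def is_healthy(self, proxy):
        stats = self._stats[proxy]
        return (stats.consecutive_failures < self.max_failures
                and stats.avg_latency <= self.max_avg_latency)

    def next_proxy(self):
        """Return the next healthy proxy, or raise if none is left."""
        for _ in range(len(self._order)):
            proxy = self._order[self._cursor]
            self._cursor = (self._cursor + 1) % len(self._order)
            if self.is_healthy(proxy):
                return proxy
        raise RuntimeError("no healthy proxies available")

    def report(self, proxy, ok, latency):
        """Record the outcome of one request made through the proxy."""
        stats = self._stats[proxy]
        if ok:
            stats.consecutive_failures = 0
            # Keep only the 20 most recent latency samples.
            stats.latencies = (stats.latencies + [latency])[-20:]
        else:
            stats.consecutive_failures += 1
```

In this sketch an excluded proxy stays excluded; a real deployment would probably re-test excluded proxies after a cooldown period.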
Error management and logging:
- Implement logging of all scraper actions: successful requests, errors, blocks, and response times.
- Implement a system for automatic request retries in case of errors, with configurable retry counts and intervals between them.
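The retry and logging requirements above might be combined along these lines. This is a sketch under stated assumptions: fetch_with_retries, the logger name "scraper", and the parameters max_retries, base_delay, and backoff are illustrative, and fetch stands for whatever function actually performs the request.

```python
import logging
import time

logger = logging.getLogger("scraper")  # hypothetical logger name


def fetch_with_retries(fetch, asin, max_retries=3, base_delay=1.0,
                       backoff=2.0, sleep=time.sleep):
    """Call fetch(asin), retrying on any exception.

    max_retries is the number of retries after the first attempt;
    the pause starts at base_delay seconds and is multiplied by
    backoff after each failed attempt.
    """
    delay = base_delay
    for attempt in range(1, max_retries + 2):
        started = time.monotonic()
        try:
            result = fetch(asin)
        except Exception as exc:
            elapsed = time.monotonic() - started
            logger.warning("asin=%s attempt=%d failed after %.3fs: %s",
                           asin, attempt, elapsed, exc)
            if attempt > max_retries:
                raise  # retries exhausted: surface the last error
            sleep(delay)
            delay *= backoff
        else:
            elapsed = time.monotonic() - started
            logger.info("asin=%s attempt=%d ok in %.3fs", asin, attempt, elapsed)
            return result
```

Each attempt logs its outcome and response time, so successful requests, errors, and timings all end up in one log stream; blocks (for example HTTP 503) could be raised by fetch as exceptions and handled by the same path.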
Data format and storage:
- Ability to export data in convenient formats (CSV, JSON, databases).
- Implement a fast and efficient structure for storing the obtained data.
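As one possible reading of the storage and export requirements above, the sketch below keeps one row per ASIN in SQLite and exports the table to JSON or CSV. The table layout, the field list, and the function names are assumptions made for illustration; the specification does not fix a schema or a database engine.

```python
import csv
import io
import json
import sqlite3

# Assumed subset of the fields listed under "Data parsing".
FIELDS = ["asin", "title", "price", "rating", "review_count", "in_stock"]


def open_store(path=":memory:"):
    """Open (or create) the product table; ASIN is the primary key."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products ("
        "asin TEXT PRIMARY KEY, title TEXT, price REAL, "
        "rating REAL, review_count INTEGER, in_stock INTEGER)"
    )
    return conn


def save_product(conn, product):
    """Insert a product, replacing any earlier row with the same ASIN."""
    conn.execute(
        "INSERT OR REPLACE INTO products "
        "VALUES (:asin, :title, :price, :rating, :review_count, :in_stock)",
        product,
    )
    conn.commit()


def load_products(conn):
    """Return all stored products as a list of dicts, ordered by ASIN."""
    rows = conn.execute("SELECT * FROM products ORDER BY asin")
    return [dict(row) for row in rows]


def to_json(conn):
    return json.dumps(load_products(conn), ensure_ascii=False)


def to_csv(conn):
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(load_products(conn))
    return buffer.getvalue()
```

Keying the table on ASIN means that re-scraping a product updates its row instead of creating a duplicate. For millions of products, batching several inserts per commit would likely be needed; the per-row commit here is only for brevity.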
Management interface (optional):
- Ability to conveniently manage tasks, view statistics, and the status of the scraper through a web interface or API.
Requirements for the contractor:
- Experience with web scraping from Amazon.
- Knowledge of technologies and tools for bypassing protection (proxy, anti-captcha).
- Experience with large volumes of data and asynchronous requests.
Expected result: A working, stable, and scalable tool that can parse large amounts of data from Amazon around the clock while minimizing the likelihood of blocks and errors.