Technical assignment for the development of a parser (scraper) for Amazon
Task: Develop a reliable, fault-tolerant scraper that collects product information from Amazon for a large number of products (millions of ASINs). The scraper must run stably 24/7 and minimize HTTP 503 errors (blocking or access restrictions).
Mandatory requirements:
Data parsing:
- Collect product information from the product page for each ASIN in a supplied list: name, price, rating, number of reviews, stock availability, product description, images, and other available fields.
- Support for a large volume of requests (from 100,000 to several million products).
Stability and scalability:
- The system must operate around the clock (24/7), with no scheduled downtime and no need for manual restarts.
- Provide mechanisms for request balancing, use of proxy servers, IP address rotation, as well as request delay mechanisms to minimize the risk of blocks and HTTP 503 errors.
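The request delay mechanism mentioned above could be sketched roughly as follows. This is a minimal illustration, not a prescribed design: the `Throttle` class, its `min_delay` and `jitter` parameters, and the injectable `clock`, `sleep`, and `rng` hooks are all hypothetical names introduced here.

```python
import random
import time


class Throttle:
    """Enforce a minimum, randomly jittered gap between consecutive requests."""

    def __init__(self, min_delay, jitter,
                 clock=time.monotonic, sleep=time.sleep, rng=random.random):
        self.min_delay = min_delay      # base gap between requests, in seconds
        self.jitter = jitter            # maximum extra random delay, in seconds
        self._clock = clock             # injectable so the class can be tested
        self._sleep = sleep
        self._rng = rng
        self._next_allowed = 0.0        # earliest time the next request may start

    def wait(self):
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        pause = max(0.0, self._next_allowed - now)
        if pause > 0:
            self._sleep(pause)
        # Schedule the next slot: base delay plus a random share of the jitter.
        gap = self.min_delay + self._rng() * self.jitter
        self._next_allowed = max(now, self._next_allowed) + gap
        return pause
```

One instance per proxy (or per outgoing IP) would keep each address under its own request rate; suitable delay values have to be determined experimentally.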
Bypassing Amazon's protection and restrictions:
- Provide methods for bypassing Amazon's anti-bot protection (CAPTCHA, IP blocking, User-Agent restrictions, etc.).
- Use mechanisms for automatic recognition and solving of CAPTCHA (for example, using anti-captcha services).
Proxy management:
- The system must integrate proxy servers with automatic rotation and performance monitoring.
- Monitor proxy quality and automatically exclude blocked or slow IPs.
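One possible shape for rotation with quality monitoring is shown below. It is a sketch under assumptions: `ProxyPool`, `ProxyStats`, the `max_failures` and `max_avg_latency` thresholds, and the rule that a success resets the failure counter are illustrative choices, not requirements taken from this assignment.

```python
from dataclasses import dataclass


@dataclass
class ProxyStats:
    successes: int = 0
    failures: int = 0           # consecutive failures since the last success
    total_latency: float = 0.0  # summed latency of successful requests, seconds

    @property
    def avg_latency(self):
        return self.total_latency / self.successes if self.successes else 0.0


class ProxyPool:
    """Round-robin rotation that skips proxies judged blocked or slow."""

    def __init__(self, proxies, max_failures=3, max_avg_latency=5.0):
        self._order = list(proxies)
        self.stats = {p: ProxyStats() for p in self._order}
        self.max_failures = max_failures
        self.max_avg_latency = max_avg_latency
        self._cursor = 0

    def is_healthy(self, proxy):
        s = self.stats[proxy]
        return (s.failures < self.max_failures
                and s.avg_latency <= self.max_avg_latency)

    def next_proxy(self):
        """Return the next healthy proxy, or raise if none is left."""
        for _ in range(len(self._order)):
            proxy = self._order[self._cursor % len(self._order)]
            self._cursor += 1
            if self.is_healthy(proxy):
                return proxy
        raise RuntimeError("no healthy proxies available")

    def report(self, proxy, ok, latency=0.0):
        """Record the outcome of one request made through `proxy`."""
        s = self.stats[proxy]
        if ok:
            s.successes += 1
            s.total_latency += latency
            s.failures = 0
        else:
            s.failures += 1
```

A production system would probably also re-test excluded proxies after a cooldown rather than excluding them permanently.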
Error management and logging:
- Implement logging of all scraper actions: successful requests, errors, blocks, and response times.
- Implement a system for automatic request retries in case of errors, with configurable retry counts and intervals between them.
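The retry and logging requirements above might be combined along these lines. This is a hedged sketch: `with_retries`, `RetryableError`, and the parameters `max_retries`, `base_interval`, and `backoff` are hypothetical names, and exponential backoff is one assumed way to space the configurable intervals.

```python
import logging
import time

log = logging.getLogger("scraper")


class RetryableError(Exception):
    """Raised by a fetch function for failures worth retrying (e.g. HTTP 503)."""


def with_retries(fetch, max_retries=3, base_interval=1.0, backoff=2.0,
                 sleep=time.sleep):
    """Call fetch(); on RetryableError retry up to max_retries more times.

    The wait before retry n (counting from 0) is base_interval * backoff ** n.
    """
    attempt = 0
    while True:
        started = time.monotonic()
        try:
            result = fetch()
            log.info("success attempt=%d elapsed=%.3fs",
                     attempt + 1, time.monotonic() - started)
            return result
        except RetryableError as exc:
            elapsed = time.monotonic() - started
            if attempt >= max_retries:
                log.error("giving up attempt=%d elapsed=%.3fs error=%s",
                          attempt + 1, elapsed, exc)
                raise
            delay = base_interval * backoff ** attempt
            log.warning("retrying attempt=%d elapsed=%.3fs error=%s wait=%.1fs",
                        attempt + 1, elapsed, exc, delay)
            sleep(delay)
            attempt += 1
```

The `sleep` argument is injectable so the retry schedule can be verified without real waiting.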
Data format and storage:
- Ability to export data in convenient formats (CSV, JSON) or to a database.
- Implement a fast and efficient structure for storing the obtained data.
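As an illustration of the export requirement, the record below mirrors the fields listed under "Data parsing" and is written out as JSON Lines and CSV using only the standard library. The `Product` class, its field names, and the choice to join image URLs with `|` in CSV are assumptions made for this sketch.

```python
import csv
import json
from dataclasses import asdict, dataclass, field, fields
from typing import List, Optional


@dataclass
class Product:
    asin: str
    name: str
    price: Optional[float] = None
    rating: Optional[float] = None
    review_count: Optional[int] = None
    in_stock: Optional[bool] = None
    description: str = ""
    images: List[str] = field(default_factory=list)


def to_jsonl(products, out):
    """Write one JSON object per line to the text stream `out`."""
    for product in products:
        out.write(json.dumps(asdict(product), ensure_ascii=False) + "\n")


def to_csv(products, out):
    """Write a header row plus one row per product to the text stream `out`."""
    writer = csv.DictWriter(out, fieldnames=[f.name for f in fields(Product)])
    writer.writeheader()
    for product in products:
        row = asdict(product)
        row["images"] = "|".join(row["images"])  # flatten the list for CSV
        writer.writerow(row)
```

When writing CSV to a real file, open it with `newline=""` as the `csv` module documentation recommends. For millions of ASINs, an append-only format such as JSON Lines or batched database inserts avoids holding the full result set in memory.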
Management interface (optional):
- Ability to manage tasks and view scraper statistics and status through a web interface or API.
Requirements for the contractor:
- Experience with web scraping from Amazon.
- Knowledge of technologies and tools for bypassing protection (proxy, anti-captcha).
- Experience with large volumes of data and asynchronous requests.
Expected result: A working, stable, and scalable tool that can scrape large volumes of data from Amazon around the clock while minimizing the likelihood of blocks and errors.