
Scraping a protected SPA website: bypassing Cloudflare and anti-bot systems

Objective: Collect complete, accurate data on more than 1,000 exhibitors (name, country, booth number, hidden emails and phone numbers, categories) from the official Salone del Mobile website.

Main challenges:

Aggressive anti-bot protection (Cloudflare): Plain HTTP clients (requests, httpx) received 403 Forbidden. Headless browsers driven by Selenium or Playwright, and even undetected-chromedriver, were blocked immediately.

Complex SPA architecture (React / Next.js): The site had no standard HTML links. All navigation ran through React event handlers (onClick), so URLs could not be collected by scanning the page for links. In addition, contact details were hidden in non-semantic tags.

My solution:
To get accurate data and get past the protection, I built a custom hybrid approach:

Connection via Chrome DevTools Protocol (CDP): Instead of launching a new automated browser instance, the script used Playwright to attach to an already running, "live" Google Chrome session (http://localhost:9222). The session therefore looked like a legitimate user, with real cookies, history, and Canvas fingerprint, and Cloudflare let it through without a single captcha to solve.
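A minimal sketch of attaching to a running browser with Playwright's Sync API. It assumes Chrome was started by hand with remote debugging enabled on port 9222 (for example with the --remote-debugging-port=9222 flag). The helper name attached_page is hypothetical and is not taken from the original script.

```python
from contextlib import contextmanager

CDP_ENDPOINT = "http://localhost:9222"  # address quoted in the write-up


@contextmanager
def attached_page(endpoint: str = CDP_ENDPOINT):
    """Yield a page from an already running Chrome instead of launching one.

    Hypothetical helper: assumes Chrome is listening for CDP on `endpoint`.
    """
    # Imported here so the module still loads where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp(endpoint)
        # The first context of a CDP-attached browser is the existing profile,
        # so its cookies and history are the user's own.
        context = browser.contexts[0]
        page = context.pages[0] if context.pages else context.new_page()
        yield page
        # Leaving the block stops the Playwright driver and drops the connection.
```

It could be used as `with attached_page() as page: print(page.title())`, which reuses the first open tab if there is one.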

Intelligent navigation: The script mimicked human behavior by resolving dynamic locators, issuing real mouse clicks to trigger React state changes, and using the site's internal router to return to the list without losing the pagination position.
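The click-and-return loop can be sketched roughly as follows. This is an illustration under assumptions, not the original script: the two selectors are hypothetical placeholders, and browser history navigation (page.go_back) stands in for whatever router call the real script used to get back to the list.

```python
def collect_detail_html(page, card_selector, detail_selector):
    """Click every card on the current list page and return each detail view's HTML.

    `page` is a Playwright sync Page; both selectors are hypothetical.
    """
    snapshots = []
    total = page.locator(card_selector).count()
    for i in range(total):
        # Resolve the locator again on every pass, since the list is
        # re-rendered each time the script returns to it.
        page.locator(card_selector).nth(i).click()
        page.wait_for_selector(detail_selector)
        snapshots.append(page.content())
        # Going back through history lets the SPA restore the list view.
        page.go_back()
        page.wait_for_selector(card_selector)
    return snapshots
```

Returning raw HTML keeps navigation separate from parsing, so a failed extraction can be retried offline without touching the site again.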

HTML parsing: The captured page HTML was parsed with BeautifulSoup, and regular expressions extracted "broken" or poorly formatted links and phone numbers.
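As a rough illustration of this step, the sketch below gathers visible text and link targets with BeautifulSoup, then pulls emails and phone numbers out with regular expressions. The patterns and function names are assumptions made for the example. The original expressions are not shown in this write-up, and real pages would likely need stricter rules.

```python
import re

# Illustrative patterns only; not the expressions used in the original script.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def contact_candidates(html):
    """Return visible text plus href values, where contacts may be hiding."""
    from bs4 import BeautifulSoup  # third-party: beautifulsoup4

    soup = BeautifulSoup(html, "html.parser")
    parts = [soup.get_text(" ", strip=True)]
    # mailto: and tel: targets often carry data missing from the visible text.
    parts += [a["href"] for a in soup.find_all("a", href=True)]
    return " ".join(parts)


def extract_contacts(text):
    """Find emails and phone numbers in free text and normalize them."""
    emails = sorted({m.lower() for m in EMAIL_RE.findall(text)})
    phones = []
    for raw in PHONE_RE.findall(text):
        # Keep digits and a leading plus sign, drop spaces and punctuation.
        cleaned = re.sub(r"[^\d+]", "", raw)
        if cleaned not in phones:
            phones.append(cleaned)
    return {"emails": emails, "phones": phones}
```

Calling extract_contacts(contact_candidates(html)) on a saved detail page would give one record's contact fields. Normalizing phone numbers to digits makes duplicates easy to spot.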

Technologies used:

Python 3.12

Playwright (Sync API): interaction with the DOM and connection via CDP.

BeautifulSoup4 & Regex: precise searching and data extraction.

Pandas: structuring and exporting data into clean CSV (UTF-8 with BOM) and Excel.
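The export step might look like the sketch below. The column names and the function name are assumptions based on the fields listed in the objective. "utf-8-sig" is Python's name for UTF-8 with a byte order mark, which helps Excel detect the encoding of a CSV file.

```python
CSV_ENCODING = "utf-8-sig"  # UTF-8 with BOM

# Assumed column names, derived from the fields named in the objective.
COLUMNS = ["name", "country", "booth", "email", "phone", "categories"]


def export_exhibitors(rows, csv_path, xlsx_path):
    """Write a list of dicts to CSV (UTF-8 with BOM) and to an Excel workbook."""
    # Third-party import; to_excel also needs an engine such as openpyxl.
    import pandas as pd

    df = pd.DataFrame(rows, columns=COLUMNS)
    df.to_csv(csv_path, index=False, encoding=CSV_ENCODING)
    df.to_excel(xlsx_path, index=False)
    return df
```

Fixing the column order up front keeps the CSV and Excel outputs aligned even when some exhibitors lack a field.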

Result:
The script ran unattended and collected cleanly formatted data for more than 1,200 companies. Because all traffic comes from a real browser session, the same architecture can be scaled to larger jobs with little risk of an IP ban.
Work details
Added 12 March
Freelancer
Viktor Gayoha
Ukraine, Chernovtsy