PDF book parser (text + images)
TASK:
There is a library of PDF books containing car repair manuals. The sources are of print ("typographic") quality, not page-by-page scans of paper manuals. The total volume is up to 100 books, approximately 30,000 pages.
We will run OCR on all the PDF books on our side, as agreed with the contractor and according to the contractor's requirements for this task.
We need a system that automatically extracts (parses) all sections from these books, classifies them by topic, and stores them properly in a database.
WHAT THE SYSTEM CONSISTS OF:
1. PDF Parser
Reads the book, finds all sections, extracts text in the correct order, and retrieves all images and tables. The main difficulty is that each page is laid out in three columns, with text and images mixed. Standard line-by-line reading of the PDF yields garbage; the parser has to work with block coordinates.
2. Quality Check via Claude API
After parsing, each section is sent to the Claude API. Claude reads the text and issues a verdict. If all is good, the section goes to the database. If there are problems (text mixed from two columns, semantic breaks in the text, etc.), the section is automatically re-parsed, for up to three attempts. If it still fails, Claude states the reasons for stopping.
3. Administrator Interface
A simple interface for managing the system: add a new book, start parsing, view statistics, deal with sections that could not be parsed after three attempts, and view error logs. Platform — web browser.
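As an illustration of the quality-check loop in step 2, here is a minimal Python sketch. It is not a specification: the names `process_section`, `Verdict`, and `SectionResult` are hypothetical, `parse` is a placeholder for the real parser, and `validate` is a placeholder for the Claude API call, which is not shown.

```python
from dataclasses import dataclass, field
from typing import Callable, List

MAX_ATTEMPTS = 3  # the task asks for up to three parsing attempts


@dataclass
class Verdict:
    ok: bool
    reason: str = ""


@dataclass
class SectionResult:
    text: str
    accepted: bool
    attempts: int
    reasons: List[str] = field(default_factory=list)


def process_section(
    parse: Callable[[int], str],
    validate: Callable[[str], Verdict],
    max_attempts: int = MAX_ATTEMPTS,
) -> SectionResult:
    """Parse a section, ask the validator for a verdict, retry on failure.

    `parse(attempt)` is a hypothetical parser hook; the attempt number lets it
    switch parameters between tries. `validate(text)` stands in for the
    Claude API call and is deliberately left abstract.
    """
    reasons: List[str] = []
    text = ""
    for attempt in range(1, max_attempts + 1):
        text = parse(attempt)
        verdict = validate(text)
        if verdict.ok:
            return SectionResult(text, True, attempt, reasons)
        reasons.append(f"attempt {attempt}: {verdict.reason}")
    # Every attempt failed: keep the reasons so an administrator can review them.
    return SectionResult(text, False, max_attempts, reasons)
```

Sections that come back with `accepted=False` are the ones the administrator interface would list, together with the collected reasons.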
WHAT SHOULD BE ACHIEVED IN THE END:
You run a command specifying the book — the system parses everything by itself, checks via Claude, and organizes it in the database.
Each section in the database: clean text + photos linked to the text (PNG)
Working administrator interface with a dashboard, logs, and queue management
README in Russian
Important: before responding, show examples of your work with coordinate-based text extraction from PDF or with multi-column layouts. Experience with LLM APIs is welcome. A detailed technical specification will be sent after first contact.
-
✋ Hello! We are the IT company dZENcode.
We will implement a Python service for parsing PDFs with coordinate-based layout analysis, text and image extraction, section classification, validation through the Claude API, and a web admin panel, relying on our team's experience, best practices, and in-house tooling.
What is the structure of the sections and the rules for thematic classification?
Will there be coordinates for text blocks after OCR?
You can find detailed information about our services and rates on our website: Freelancehunt
Take a look, and write when you are ready to discuss the details of the work.
The final cost is formed only after clarifying the volume and requirements.
___________________
Sincerely,
Manager of dZENcode
Our strengths:
💎 10+ years providing IT services: Outsourcing, Outstaffing
🔥 90+ in-house specialists
🚀 Projects "from scratch" and for support
⚙️ SLA and post-production support
✅ Contract with the company, guaranteed results!
🔥 250+ public reviews since 2015.
-
Good day! I am ready to complete your task for a reasonable price in exchange for a good evaluation of my work.
-
Good day.
For the task of three-column layout and extracting blocks of text along with images, a custom coordinate parser can be written, but as a more reliable alternative, I suggest considering specialized APIs like AWS Textract or Google Document AI. They natively recognize complex multi-column layouts and provide a ready structure, which will significantly reduce the number of errors before sending the text for verification.
I will implement all server-side logic (routing, validation through the Claude API with up to three attempts, and saving results) in Node.js with TypeScript. The admin interface for managing the queue of books, displaying statistics, and viewing logs for problematic sections will be built on Next.js.
In private messages, I will show examples of scripts for extracting data from documents with complex structures and for integrating with LLM APIs. I would be happy to review the extended technical specification.
-
Hello,
I have experience working with the Tesseract library, and with text blocks in particular. I can implement the server side in Node.js, Python, or Go (depending on your preference) and the front end in Vue or React. I have also worked with LLMs and can create a universal interface for swapping agents if needed.
I would be happy to collaborate!
-
Hello, I have experience in creating systems and services for data parsing. I am ready to quickly and efficiently develop a parser for you, taking into account all requirements. I suggest discussing the details in private messages.
-
Good day!
The task is clear. I have relevant experience: I developed a system for automatic uploading and processing of PDF invoices via API (the project is available on GitHub). The system included a GUI, date range selection, auto-upload, and auto-processing of files.
For your project, I will implement:
PDF parsing with block coordinate processing (pymupdf/pdfplumber) for correct reading of three-column layouts
Quality checking through Claude API with auto-re-parsing
Celery + Redis for task queue (30,000 pages - a stable queue is needed)
Admin panel with dashboard and logs
PostgreSQL for storing sections + PNG
https://github.com/NazarShubeliak
I am ready to discuss the detailed technical specifications.
-
I understand the task: developing a reliable solution for parsing PDF car repair manuals and extracting text and images from a large volume of typographic sources. I have extensive experience in creating complex systems for extracting structured data from unstructured sources, including technical documentation and large document libraries. For data of this volume and specificity, an architecture that ensures extraction accuracy, error handling, and further scaling for analytics or display is critical. Please clarify the ultimate goal of using the extracted data: a search database, interactive documentation, or something else? I would be happy to discuss this in detail to propose an optimal solution and estimate the timeline and budget.
-
Hello! I can implement it. Write to me privately to discuss all the details. I will be glad to cooperate!
-
Hello.
In your technical specification, the key difficulty is not OCR, but the correct reconstruction of the structure: 3-column layout + mixed text/images. If you read the PDF "as is," you will get mixed text and a loss of logical sections.
I propose a different approach:
1. Parsing through coordinates (layout-aware)
I break the page into blocks → cluster the columns → restore the reading order. This eliminates the mixing of text between columns.
2. Content binding
Images and tables are bound to the nearest text blocks (by coordinates and context) so that the connection is stored in the database, not just a "set of files."
3. Claude as a quality gate, not a "hack"
After parsing, each section undergoes verification:
— whether the columns are stuck together
— whether the logic of the text is violated
— whether there are any breaks
In case of errors — automatic retry with different parameters.
4. Scaling to your volume
100 books / ~30k pages → I do batching + queues + logging to ensure the system works stably and does not crash in the middle.
5. An admin panel that really helps
I will show not just "status," but problematic areas: which pages/sections did not pass validation and why.
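The blocks → columns → reading order idea from point 1 can be sketched in plain Python. This is only an illustration under stated assumptions: blocks are `(x0, y0, x1, y1, text)` tuples (PyMuPDF's `page.get_text("blocks")` returns tuples that start with these five fields), columns are separated by a clear vertical gutter, and no block spans several columns. `cluster_columns`, `reading_order`, and `min_gap` are hypothetical names.

```python
from typing import List, Tuple

Block = Tuple[float, float, float, float, str]  # (x0, y0, x1, y1, text)


def cluster_columns(blocks: List[Block], min_gap: float = 10.0) -> List[List[Block]]:
    """Group blocks into columns separated by gutters at least min_gap wide."""
    columns: List[List[Block]] = []
    right_edge = float("-inf")  # right edge of the column being built
    for block in sorted(blocks, key=lambda b: b[0]):
        if not columns or block[0] - right_edge >= min_gap:
            # The block starts beyond the gutter: open a new column.
            columns.append([])
            right_edge = block[2]
        else:
            right_edge = max(right_edge, block[2])
        columns[-1].append(block)
    return columns


def reading_order(blocks: List[Block], min_gap: float = 10.0) -> List[str]:
    """Return block texts column by column, top to bottom within each column."""
    ordered: List[str] = []
    for column in cluster_columns(blocks, min_gap):
        for block in sorted(column, key=lambda b: b[1]):
            ordered.append(block[4])
    return ordered


# A toy three-column page, deliberately listed out of order.
page = [
    (410, 50, 590, 100, "C1"),
    (10, 200, 190, 300, "A2"),
    (210, 50, 390, 150, "B1"),
    (10, 50, 190, 180, "A1"),
    (210, 160, 390, 300, "B2"),
]
print(reading_order(page))  # ['A1', 'A2', 'B1', 'B2', 'C1']
```

Headings or images that span the full page width would merge the columns in this sketch, so a real parser would need to detect and handle them separately.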
To save your time — I propose:
I will create a prototype on 1 book (full cycle: parsing → Claude → structure in the database). You will immediately see if this is the level of quality you need.
If it works — we will scale without changing the architecture.
I am ready to start immediately after receiving a sample PDF.
-
Vladimir, hello!
An excellent and non-trivial task. Parsing multi-column PDFs is always a pain, but your approach with validating semantic gaps through the Claude API makes the system very smart and fault-tolerant.
Plus, the topic of manuals is very close to me personally: I service my own cars (from VAZ 2105 to Mercedes), so I understand the specifics of repair manuals perfectly. I will immediately notice in the tests if the parser confuses the assembly order of a unit from adjacent columns.
Here’s how I propose to technically implement the pipeline:
Parser (Coordinates): Use the PyMuPDF (fitz) or pdfplumber library. They allow extracting bounding boxes (exact x,y coordinates). We will write a heuristic that will read blocks strictly by columns (top-to-bottom, left-to-right), cut out headers, and separately save PNG diagrams linked to paragraphs.
Claude API: We will write a validation script with a system prompt that will analyze the text of the section for logical coherence. In case of an error — a trigger for a re-pass with modified indentation parameters.
Web interface: To save time and create a convenient dashboard, I will set up an admin panel on Streamlit or FastAPI + Jinja2. There will be convenient book uploads, error logs from Claude, and manual control of stuck sections.
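Linking diagrams to paragraphs, as mentioned in the parser step, might look like the sketch below. It assumes bounding boxes are already extracted as `(x0, y0, x1, y1)` tuples and uses a simple rule (same column, smallest vertical gap) that is an assumption for illustration, not a tested heuristic. `link_images` and the helper names are hypothetical.

```python
from typing import Dict, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def horizontal_overlap(a: Box, b: Box) -> float:
    """Width shared by two boxes on the x axis; 0 if they are side by side."""
    return max(0.0, min(a[2], b[2]) - max(a[0], b[0]))


def vertical_gap(a: Box, b: Box) -> float:
    """Vertical distance between two boxes; 0 if they overlap vertically."""
    return max(0.0, max(a[1], b[1]) - min(a[3], b[3]))


def link_images(
    paragraphs: Dict[str, Box], images: Dict[str, Box]
) -> Dict[str, Optional[str]]:
    """Map each image id to the id of the closest paragraph in the same column."""
    links: Dict[str, Optional[str]] = {}
    for image_id, image_box in images.items():
        candidates = [
            (vertical_gap(image_box, box), para_id)
            for para_id, box in paragraphs.items()
            if horizontal_overlap(image_box, box) > 0
        ]
        # No paragraph shares the image's column: leave it unlinked for review.
        links[image_id] = min(candidates)[1] if candidates else None
    return links


paragraphs = {
    "p1": (10, 50, 190, 95),
    "p2": (10, 260, 190, 300),
    "p3": (210, 100, 390, 200),
}
images = {"fig1": (10, 100, 190, 200), "fig2": (410, 50, 590, 100)}
print(link_images(paragraphs, images))  # {'fig1': 'p1', 'fig2': None}
```

Storing the returned mapping alongside the PNG paths keeps each image attached to a specific paragraph instead of a loose set of files.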
I am ready to look at a couple of pages of your manuals as a test sample and show the logic of block extraction. I am waiting for an extended technical specification in private messages!
-
Hello! I am ready to complete this project and have extensive experience in developing various applications.
-
Good day.
I am ready to implement a turnkey system: parsing PDF with coordinate analysis of multi-column layout, thematic classification of sections, saving to a database, and a web admin panel for managing the queue, logs, and problematic cases.
Technology stack used:
Backend: Python, FastAPI / Django, Celery, PostgreSQL
Integrations: PyMuPDF / pdfplumber, Claude API, OCR pipeline
Frontend: Django Admin or a separate web admin panel
Infrastructure: Docker, Redis
I have experience working with coordinate extraction of text from PDF, multi-column layout, and integration of LLM API for validation and classification of content.
I am ready to review the detailed technical specifications and provide an estimate regarding stages, timelines, and costs.
Best regards,
Andrii
-
Hello. I have been working with React/Node.js for over 8 years. I am ready for collaboration. Feel free to contact me.
-
Good day!
The task is clear. I solve the multi-column layout problem with coordinate extraction (PyMuPDF): the algorithm reads the X/Y coordinates of the blocks and collects text and images strictly top to bottom within each zone, rather than left to right across the page. Validation through the Claude API is a great solution.
To manage the entire pipeline, I will set up a separate web server (FastAPI or Flask). I will create a convenient admin panel in the browser, where you will be able to upload new PDFs, see a dashboard with Claude logs, and review rejected sections.
I am waiting for the extended technical specification and am ready to discuss the details.
-
Hello. I can do your project. I have experience. Write to me and we will come to an agreement.
-
Good day, Vladimir. I have experience working with parsers that extract text even from skewed scans in PDFs. I also have experience working with neural network APIs and their integration into bots. If the project is still relevant, I suggest we discuss the details of collaboration.
-
Good day!
I am ready to develop a system for parsing and classifying sections from your PDF books. I have extensive experience in coordinate text extraction from PDF and multi-column layout, as well as integrating LLM API for quality control and thematic classification.
Contact me to discuss the details and obtain an extended technical specification.
-
Good day! I can implement such a system in the form of a web application!!! Feel free to contact me!!!
-
Good day. I am ready to complete your task quickly and efficiently; I have extensive experience in developing various parsers. Write to me in private messages and we will discuss the details. I will be happy to help.
-
Good day. I have reviewed the task and can implement coordinate parsing of PDF, quality checking through Claude API, retry parsing attempts, and a web interface for managing books, logs, and problematic sections.
I have experience with PDF parsing and data verification (https://freelancehunt.com/project/parser-pdf-bankivskih-vipisok/1578814.html), and I have also worked with Azure OCR, so I understand the nuances of complex layouts and multi-column text.
I would like to see examples of books, especially those that are complex in structure, to better assess the approach and timelines. I am also interested in whether there are requirements for processing speed.
I am ready to discuss the details.
-
Hello. I have experience working with PDFs, and I understand both the task and its difficulties. Feel free to contact me, and we will discuss the details and budget.
-
Good day! 👋
The task is clear — this is not just PDF parsing, but building a full-fledged data processing pipeline with quality control through LLM. I have relevant experience in such systems.
Experience in similar projects
I have worked on:
— parsing complex PDFs (multi-column, tables, mixed blocks)
— extracting text through coordinates (pdfplumber / PyMuPDF)
— building pipelines: parsing → cleaning → validation → DB
— integration with LLM (Claude / GPT) for verification and classification
— systems with retry logic and data quality control
How I see the implementation
1. PDF Parser (key stage)
— using PyMuPDF / pdfplumber
— extracting blocks by coordinates (not line by line)
— restoring the correct structure:
— defining columns
— sorting blocks (left → right, top → bottom)
— separate parsing:
— text
— images (PNG with coordinate binding)
— tables
👉 This allows avoiding “mixed” text — the main problem of such PDFs.
2. Processing + classification
— segmentation into sections (by headings / structure)
— text normalization
— preparation for sending to Claude
3. Integration with Claude API
— text quality checking
— problem detection (mixed columns, breaks)
— retry logic (up to 3 attempts)
— logging reasons for failure
👉 This is essentially a “self-healing” pipeline.
4. Backend (Python priority)
— FastAPI
— task queue (Celery / asyncio workers)
— processing books in the background
— API for the admin panel
5. Database
— PostgreSQL
— structure:
— books
— sections
— media (images)
— statuses / logs
6. Admin panel
— simple web interface:
— uploading books
— starting parsing
— statuses / progress
— errors and retry
— can be implemented on:
— React or simpler (FastAPI + Jinja / admin panel)
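The database structure listed in point 5 (books, sections, media, statuses / logs) could be prototyped as follows. SQLite stands in for PostgreSQL here only so that the sketch runs without a server, and every table name, column name, and status value is an assumption made for illustration.

```python
import sqlite3

# Hypothetical schema: one row per book, section, extracted image, and log entry.
SCHEMA = """
CREATE TABLE books (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);
CREATE TABLE sections (
    id INTEGER PRIMARY KEY,
    book_id INTEGER NOT NULL REFERENCES books(id),
    title TEXT NOT NULL,
    body TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',
    attempts INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE media (
    id INTEGER PRIMARY KEY,
    section_id INTEGER NOT NULL REFERENCES sections(id),
    png_path TEXT NOT NULL
);
CREATE TABLE logs (
    id INTEGER PRIMARY KEY,
    section_id INTEGER NOT NULL REFERENCES sections(id),
    message TEXT NOT NULL
);
"""


def create_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database and create the illustrative tables."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def failed_sections(conn: sqlite3.Connection) -> list:
    """Sections that exhausted their attempts and need manual review."""
    query = "SELECT id, title FROM sections WHERE status = 'failed' ORDER BY id"
    return conn.execute(query).fetchall()
```

A query like `failed_sections` is the kind of view the admin panel would use to show sections that did not pass validation.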
What the result will look like
— you start processing a book
— the system automatically:
— parses
— checks through Claude
— saves in the DB
— in the database:
— clean structured text
— linked images
— there is an interface for control
Technologies
— Python (FastAPI, asyncio)
— PyMuPDF / pdfplumber
— PostgreSQL
— Claude API
— Docker
I have already worked with multi-column PDFs and know the main “pitfalls” — this is exactly the case where standard solutions do not work and custom logic needs to be built.
I am ready to look at examples of your PDFs and propose an exact architecture and implementation plan.
-
Hello
I have experience and developments in parsing complex PDFs that contain tables, graphs, and diagrams. I suggest using an approach with multiple tools. OCR on your side is questionable; it will most likely be more convenient to implement it together with the rest of the functionality, especially since you probably won't be using any unique tools that I don't know about.
For quality checking, there are a couple more options for VL models; we will need to test them.
Samples of books are needed, preferably the most complex in structure, for testing.
Another question regarding parsing speed - what are the minimum and maximum requirements, if any?