Freelance projects

PDF book parser (text + images)

Data Parsing, Python

TASK:

There is a library of PDF books that contain car repair manuals. The sources are print-quality ("typographic") PDFs, not page-by-page scans of paper manuals. The library totals up to 100 books, approximately 30,000 pages.

We will run OCR on all PDF books on our side, following requirements agreed with the contractor for this task.

A system needs to be written that will automatically extract (parse) all sections from these books, conduct thematic classification, and properly store them in a database.

WHAT THE SYSTEM WORK CONSISTS OF:

1. PDF Parser

Reads the book, finds all sections, extracts text in the correct order, and retrieves all images and tables. The main difficulty is that each page is laid out in three columns, with text and images mixed. Standard line-by-line reading of the PDF will yield garbage; the parser needs to work with block coordinates.
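A minimal sketch of what "working with block coordinates" could look like for a three-column page. It assumes each block arrives as an (x0, y0, x1, y1, text) tuple with the origin at the top left; PyMuPDF's page.get_text("blocks") returns tuples that begin with these five fields, but the function name reading_order, the fixed column count, and the equal column widths are illustrative assumptions, not part of the task description.

```python
from typing import List, Tuple

# (x0, y0, x1, y1, text). The shape is an assumption for this sketch.
Block = Tuple[float, float, float, float, str]


def reading_order(blocks: List[Block], page_width: float, n_columns: int = 3) -> List[Block]:
    """Order blocks column by column (left to right), top to bottom inside each column."""
    column_width = page_width / n_columns

    def sort_key(block: Block):
        x0, y0, x1, _y1, _text = block
        center_x = (x0 + x1) / 2
        # Clamp so blocks touching a page edge stay in the first or last column.
        column = max(0, min(int(center_x // column_width), n_columns - 1))
        return (column, y0, x0)

    return sorted(blocks, key=sort_key)


if __name__ == "__main__":
    page = [
        (410.0, 50.0, 590.0, 90.0, "C1"),
        (10.0, 50.0, 190.0, 90.0, "A1"),
        (210.0, 50.0, 390.0, 90.0, "B1"),
        (10.0, 100.0, 190.0, 140.0, "A2"),
    ]
    print([b[4] for b in reading_order(page, page_width=600.0)])
```

Blocks that span several columns (wide headings, full-width images or tables) would need separate handling; this sketch simply assigns them to the column that contains their center.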

2. Quality Check via Claude API

After parsing, each section is sent to the Claude API. Claude reads the text and issues a verdict. If all is good, the section goes to the database. If there are problems (text mixed from two columns, semantic breaks in the text, etc.), the section is automatically re-parsed, for up to three attempts. If every attempt fails, Claude states the reasons for stopping.
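The check-and-retry cycle can be sketched as a small loop. Everything named here is hypothetical: parse stands in for the PDF parser (called with the attempt number so it can vary its parameters), and validate stands in for a wrapper around the Claude API call. The real request and response format is not specified in the task, so the sketch keeps the LLM call behind a plain callable.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Verdict:
    ok: bool
    reason: str = ""


@dataclass
class SectionResult:
    text: Optional[str]  # accepted text, or None if every attempt failed
    attempts: int
    reasons: List[str] = field(default_factory=list)


def parse_with_validation(
    parse: Callable[[int], str],         # hypothetical: attempt number -> section text
    validate: Callable[[str], Verdict],  # hypothetical: wraps the LLM quality check
    max_attempts: int = 3,
) -> SectionResult:
    """Re-parse a section until the validator accepts it or attempts run out."""
    reasons: List[str] = []
    for attempt in range(1, max_attempts + 1):
        text = parse(attempt)
        verdict = validate(text)
        if verdict.ok:
            return SectionResult(text=text, attempts=attempt, reasons=reasons)
        # Keep every rejection reason so the admin interface can show them.
        reasons.append(f"attempt {attempt}: {verdict.reason}")
    return SectionResult(text=None, attempts=max_attempts, reasons=reasons)
```

A result with text set to None is the case the administrator interface has to surface: the section failed all attempts, and reasons holds one entry per attempt.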

3. Administrator Interface

A simple interface for managing the system: add a new book, start parsing, view statistics, deal with sections that could not be parsed after three attempts, and view error logs. Platform — web browser.

WHAT SHOULD BE ACHIEVED IN THE END:

You run a command specifying the book — the system parses everything by itself, checks via Claude, and organizes it in the database.

Each section in the database: clean text + photos linked to the text (PNG)

Working administrator interface with a dashboard, logs, and queue management

README in Russian
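One possible shape for "clean text + photos linked to the text" is sketched below. The table and column names are assumptions made for illustration, and SQLite is used only so the example runs without a server; the task does not name a database engine.

```python
import sqlite3

# Hypothetical schema: one row per book, section, and extracted PNG.
SCHEMA = """
CREATE TABLE books (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);
CREATE TABLE sections (
    id INTEGER PRIMARY KEY,
    book_id INTEGER NOT NULL REFERENCES books(id),
    heading TEXT NOT NULL,
    body TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending'
);
CREATE TABLE images (
    id INTEGER PRIMARY KEY,
    section_id INTEGER NOT NULL REFERENCES sections(id),
    png_path TEXT NOT NULL,
    position INTEGER NOT NULL  -- order of the image within the section
);
"""


def create_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database and create the illustrative tables."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def images_for_section(conn: sqlite3.Connection, section_id: int) -> list:
    """Return the PNG paths linked to a section, in reading order."""
    rows = conn.execute(
        "SELECT png_path FROM images WHERE section_id = ? ORDER BY position",
        (section_id,),
    )
    return [row[0] for row in rows]
```

Storing PNG files on disk and keeping only their paths in the database is a design choice of this sketch, not a requirement stated above.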

Important before responding! Show examples of working with coordinate text extraction from PDF or working with multi-column layouts. Experience with LLM API is welcome. A detailed technical specification will be sent after the first contact.

Proposals 32

Dmytro Derev'iankin

2 0

Projects -
Rating -
Rating 596

Budget: 10000 UAH Deadline: 1 day

✋ Hello! We are the IT company dZENcode.

We will implement a Python service for parsing PDFs with coordinate-based layout analysis, text and image extraction, section classification, validation through the Claude API, and a web admin panel, relying on the team's experience, best practices, and our own in-house tools.

What is the structure of the sections and the rules for thematic classification?
Will there be coordinates for text blocks after OCR?

You can find detailed information about our services and rates on our website: Freelancehunt
Take a look – we will discuss the details of the work further, write when you are ready.

The final cost is formed only after clarifying the volume and requirements.

___________________
Sincerely,
Manager of dZENcode

Our strengths:
💎 10+ years providing IT services: Outsourcing, Outstaffing
🔥 90+ in-house specialists
🚀 Projects "from scratch" and for support
⚙️ SLA and post-production support
✅ Contract with the company, guaranteed results!
🔥 250+ public reviews since 2015.

Andriy P.

0 0

Projects -
Rating -
Rating 247

Budget: 2000 UAH Deadline: 2 days

Good day! I am ready to complete your task for a reasonable price in exchange for a good evaluation of my work.

Danilo Manulyak

15 0

Budget: 15000 UAH Deadline: 10 days

Good day.
For the task of three-column layout and extracting blocks of text along with images, a custom coordinate parser can be written, but as a more reliable alternative, I suggest considering specialized APIs like AWS Textract or Google Document AI. They natively recognize complex multi-column layouts and provide a ready structure, which will significantly reduce the number of errors before sending the text for verification.

I will implement all server-side logic (routing, validation through the Claude API with up to three attempts, and saving results) in Node.js with TypeScript. The admin interface for managing the book queue, displaying statistics, and viewing logs for problematic sections will be built on Next.js.

In private messages, I will show examples of scripts for extracting data from documents with complex structures and integration with LLM API. I would be happy to review the extended technical assignment.

Dmitro M.

12 0

Budget: 15000 UAH Deadline: 7 days

Hello,
I have experience working with the Tesseract library, and with text blocks in particular. I will implement the server in Node.js, Python, or Go (depending on your preferences), with a front end in Vue or React. I have also worked with LLMs and can create a universal interface for swapping agents if needed.

I would be happy to collaborate!

Taras O.

4 0

Budget: 1000 UAH Deadline: 1 day

Hello, I have experience in creating systems and services for data parsing. I am ready to quickly and efficiently develop a parser for you, taking into account all requirements. I suggest discussing the details in private messages.

Nazar Shubeliak

1 0

Projects -
Rating -
Rating 358

Budget: 3999 UAH Deadline: 7 days

Good day!

The task is clear. I have relevant experience: I developed a system for automatic uploading and processing of PDF invoices via API (the project is available on GitHub). The system included a GUI, date range selection, auto-upload, and auto-processing of files.

For your project, I will implement:
PDF parsing with block coordinate processing (PyMuPDF/pdfplumber) for correct reading of three-column layouts
Quality checking through Claude API with auto-re-parsing
Celery + Redis for task queue (30,000 pages - a stable queue is needed)
Admin panel with dashboard and logs
PostgreSQL for storing sections + PNG
https://github.com/NazarShubeliak
I am ready to discuss the detailed technical specifications.

Andrii Tyupa

53 0

Budget: 4000 UAH Deadline: 1 day

I understand the task: developing a reliable solution for parsing PDF car repair manuals and extracting text and images from a large volume of typographic sources. I have extensive experience in creating complex systems for extracting structured data from unstructured sources, including technical documentation and large document libraries. For data of this volume and specificity, an architecture that ensures extraction accuracy, error handling, and room to scale for analytics or display is critical. Please clarify the ultimate goal for the extracted data: a search database, interactive documentation, or something else? I would be happy to discuss this in detail to propose an optimal solution and assess timelines and budget.

Oleksandr Z.

14 0

Budget: 2000 UAH Deadline: 1 day

Hello! I can implement it. Write to me privately to discuss all the details. I will be glad to cooperate!

Alisa S.

1 0

Projects -
Rating -
Rating 387

Budget: 10000 UAH Deadline: 2 days

Hello.

In your technical specification, the key difficulty is not OCR, but the correct reconstruction of the structure: 3-column layout + mixed text/images. If you read the PDF "as is," you will get mixed text and a loss of logical sections.

I propose a different approach:

1. Parsing through coordinates (layout-aware)
I break the page into blocks → cluster the columns → restore the reading order. This eliminates the mixing of text between columns.

2. Content binding
Images and tables are bound to the nearest text blocks (by coordinates and context) so that the connection is stored in the database, not just a "set of files."

3. Claude as a quality gate, not a "hack"
After parsing, each section undergoes verification:
— whether the columns are stuck together
— whether the logic of the text is violated
— whether there are any breaks
In case of errors — automatic retry with different parameters.

4. Scaling to your volume
100 books / ~30k pages → I do batching + queues + logging to ensure the system works stably and does not crash in the middle.

5. An admin panel that really helps
I will show not just "status," but problematic areas: which pages/sections did not pass validation and why.
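The binding step from item 2 can be sketched as follows. This is an illustration under stated assumptions, not the proposal author's code: boxes are (x0, y0, x1, y1) tuples with the origin at the top left, "nearest" is taken to mean the smallest vertical gap among text blocks that share horizontal extent with the image, and the "context" criterion mentioned above is not modeled. The names bind_images, horizontal_overlap, and vertical_gap are hypothetical.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def horizontal_overlap(a: Box, b: Box) -> float:
    """Width shared by two boxes on the x axis (0 if they are side by side)."""
    return max(0.0, min(a[2], b[2]) - max(a[0], b[0]))


def vertical_gap(a: Box, b: Box) -> float:
    """Empty vertical space between two boxes (0 if they overlap vertically)."""
    return max(0.0, b[1] - a[3], a[1] - b[3])


def bind_images(images: Dict[str, Box], text_blocks: Dict[str, Box]) -> Dict[str, str]:
    """Map each image id to the id of the closest text block."""
    bindings: Dict[str, str] = {}
    for image_id, image_box in images.items():
        # Prefer blocks in the same column; fall back to every block on the page.
        same_column = {
            block_id: box
            for block_id, box in text_blocks.items()
            if horizontal_overlap(image_box, box) > 0
        }
        candidates = same_column or text_blocks
        if not candidates:
            continue
        bindings[image_id] = min(
            candidates,
            key=lambda block_id: vertical_gap(image_box, candidates[block_id]),
        )
    return bindings
```

The resulting mapping is what would be stored in the database, so that each PNG stays linked to a specific paragraph rather than only to a page.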

To save your time — I propose:
I will create a prototype on 1 book (full cycle: parsing → Claude → structure in the database). You will immediately see if this is the level of quality you need.

If it works — we will scale without changing the architecture.

I am ready to start immediately after receiving a sample PDF.

Taras Hranychka

0 0

Projects -
Rating -
Rating 163

Budget: 15000 UAH Deadline: 10 days

Vladimir, hello!

An excellent and non-trivial task. Parsing multi-column PDFs is always a pain, but your approach with validating semantic gaps through the Claude API makes the system very smart and fault-tolerant.

Plus, the topic of manuals is very close to me personally: I service my own cars (from VAZ 2105 to Mercedes), so I understand the specifics of repair manuals perfectly. I will immediately notice in the tests if the parser confuses the assembly order of a unit from adjacent columns.

Here’s how I propose to technically implement the pipeline:

Parser (Coordinates): Use the PyMuPDF (fitz) or pdfplumber library. They allow extracting bounding boxes (exact x,y coordinates). We will write a heuristic that will read blocks strictly by columns (top-to-bottom, left-to-right), cut out headers, and separately save PNG diagrams linked to paragraphs.

Claude API: We will write a validation script with a system prompt that will analyze the text of the section for logical coherence. In case of an error — a trigger for a re-pass with modified indentation parameters.

Web interface: To save time and create a convenient dashboard, I will set up an admin panel on Streamlit or FastAPI + Jinja2. There will be convenient book uploads, error logs from Claude, and manual control of stuck sections.
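The "cut out headers" part of the parser step could be sketched as a margin filter. This assumes that "headers" means running page headers and footers, that blocks are (x0, y0, x1, y1, text) tuples with the origin at the top left and y growing downward, and that a fixed margin band is good enough; the function name and the 40-point defaults are hypothetical.

```python
from typing import List, Tuple

Block = Tuple[float, float, float, float, str]  # (x0, y0, x1, y1, text)


def strip_running_headers(
    blocks: List[Block],
    page_height: float,
    top_margin: float = 40.0,
    bottom_margin: float = 40.0,
) -> List[Block]:
    """Drop blocks that reach into the top or bottom margin band of the page."""
    body_bottom = page_height - bottom_margin
    return [
        block
        for block in blocks
        if block[1] >= top_margin and block[3] <= body_bottom
    ]


if __name__ == "__main__":
    page = [
        (10.0, 10.0, 100.0, 30.0, "Chapter 4. Brakes"),    # running header
        (10.0, 50.0, 100.0, 90.0, "Remove the caliper."),  # body text
        (10.0, 770.0, 100.0, 790.0, "12"),                 # page number
    ]
    print([b[4] for b in strip_running_headers(page, page_height=800.0)])
```

In practice the margins would be tuned per book, or replaced by detecting text that repeats at the same position across pages.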

I am ready to look at a couple of pages of your manuals as a test sample and show the logic of block extraction. I am waiting for an extended technical specification in private messages!

Dmytro Zmenkov

1 1

Projects -
Rating -
Rating 121

Budget: 4000 UAH Deadline: 1 day

Hello! I am ready to complete this project and have extensive experience in developing various applications.

Andrii Domashchenko

17 0

Budget: 15000 UAH Deadline: 14 days

Good day.

I am ready to implement a turnkey system: parsing PDF with coordinate analysis of multi-column layout, thematic classification of sections, saving to a database, and a web admin panel for managing the queue, logs, and problematic cases.

Technology stack used:

Backend: Python, FastAPI / Django, Celery, PostgreSQL
Integrations: PyMuPDF / pdfplumber, Claude API, OCR pipeline
Frontend: Django Admin or a separate web admin panel
Infrastructure: Docker, Redis

I have experience working with coordinate extraction of text from PDF, multi-column layout, and integration of LLM API for validation and classification of content.

I am ready to review the detailed technical specifications and provide an estimate regarding stages, timelines, and costs.

Best regards,
Andrii

Andrey K.

1 285 1

Budget: 27000 UAH Deadline: 7 days

Hello. I have been working with React/Node.js for over 8 years. I am ready to collaborate. Feel free to contact me.

Viktor Gayoha

2 0

Projects -
Rating -
Rating 816

Budget: 10000 UAH Deadline: 10 days

Good day!

The task is clear. I solve the multi-column layout problem through coordinate extraction (PyMuPDF): the algorithm reads the X/Y coordinates of the blocks and collects text and images strictly top to bottom within each column zone, rather than left to right across the page. Validation through the Claude API is a great solution.

To manage the entire pipeline, I will set up a separate web server (FastAPI or Flask). I will create a convenient admin panel in the browser, where you will be able to upload new PDFs, see a dashboard with Claude logs, and review rejected sections.

I am waiting for the extended technical specification and am ready to discuss the details.

Oleh Patrushev

21 0

Projects 21
Rating -
Rating 612

Budget: 10000 UAH Deadline: 10 days

Hello. I can do your project. I have experience. Write to me, we will agree.

Oleksandr Mikhov

1 0

Projects -
Rating -
Rating 332

Budget: 7000 UAH Deadline: 9 days

Good day, Vladimir. I have experience working with parsers that extract text even from skewed scans in PDFs. I also have experience working with neural network APIs and integrating them into bots. If the project is still relevant, I suggest we discuss the details of collaboration.

Nick Osipov

41 4

Budget: 1000 UAH Deadline: 3 days

Good day!

I am ready to develop a system for parsing and classifying sections from your PDF books. I have extensive experience in coordinate text extraction from PDF and multi-column layout, as well as integrating LLM API for quality control and thematic classification.

Contact me to discuss the details and obtain an extended technical specification.

Tetyana Shumeyko

73 4

Budget: 6000 UAH Deadline: 2 days

Good day! I can implement such a system in the form of a web application!!! Feel free to contact me!!!

Dmytro Parkhomenko

20 0

Budget: 15000 UAH Deadline: 5 days

Good day. I am ready to complete your task quickly and efficiently; I have extensive experience in developing various parsers. Write to me in private messages and we will discuss the details. I will be happy to help.

Ivan Danyleiko

20 0

Budget: 10000 UAH Deadline: 3 days

Good day. I have reviewed the task and can implement coordinate parsing of PDF, quality checking through Claude API, retry parsing attempts, and a web interface for managing books, logs, and problematic sections.

I have experience with PDF parsing and data verification (https://freelancehunt.com/project/parser-pdf-bankivskih-vipisok/1578814.html), and I have also worked with Azure OCR, so I understand the nuances of complex layouts and multi-column text.

I would like to see examples of books, especially those that are complex in structure, to better assess the approach and timelines. I am also interested in whether there are requirements for processing speed.

I am ready to discuss the details.

Vladimir B

35 1

Budget: 5000 UAH Deadline: 3 days

Hello. I have experience working with PDF, I understand what is being discussed and I understand the difficulties. Feel free to contact me, we will discuss the details and budget.

Volodymyr Mahdyk

0 0

Projects -
Rating -
Rating 390

Budget: 5000 UAH Deadline: 5 days

Good day! 👋

The task is clear — this is not just PDF parsing, but building a full-fledged data processing pipeline with quality control through LLM. I have relevant experience in such systems.

Experience in similar projects

I have worked on:

— parsing complex PDFs (multi-column, tables, mixed blocks)
— extracting text through coordinates (pdfplumber / PyMuPDF)
— building pipelines: parsing → cleaning → validation → DB
— integration with LLM (Claude / GPT) for verification and classification
— systems with retry logic and data quality control

How I see the implementation
1. PDF Parser (key stage)

— using PyMuPDF / pdfplumber
— extracting blocks by coordinates (not line by line)
— restoring the correct structure:
— defining columns
— sorting blocks (left → right, top → bottom)
— separate parsing:
— text
— images (PNG with coordinate binding)
— tables

👉 This allows avoiding “mixed” text — the main problem of such PDFs.

2. Processing + classification

— segmentation into sections (by headings / structure)
— text normalization
— preparation for sending to Claude

3. Integration with Claude API

— text quality checking
— problem detection (mixed columns, breaks)
— retry logic (up to 3 attempts)
— logging reasons for failure

👉 This is essentially a “self-healing” pipeline.

4. Backend (Python priority)

— FastAPI
— task queue (Celery / asyncio workers)
— processing books in the background
— API for the admin panel

5. Database

— PostgreSQL
— structure:
— books
— sections
— media (images)
— statuses / logs

6. Admin panel

— simple web interface:
— uploading books
— starting parsing
— statuses / progress
— errors and retry
— can be implemented on:
— React or simpler (FastAPI + Jinja / admin panel)

What the result will look like

— you start processing a book
— the system automatically:
— parses
— checks through Claude
— saves in the DB
— in the database:
— clean structured text
— linked images
— there is an interface for control

Technologies

— Python (FastAPI, asyncio)
— PyMuPDF / pdfplumber
— PostgreSQL
— Claude API
— Docker

I have already worked with multi-column PDFs and know the main “pitfalls” — this is exactly the case where standard solutions do not work and custom logic needs to be built.

I am ready to look at examples of your PDFs and propose an exact architecture and implementation plan.

Sergey Mironov

144 6

Budget: 10000 UAH Deadline: 10 days

Hello
I have experience and existing tooling for parsing complex PDFs that contain tables, graphs, and diagrams. I suggest an approach that combines multiple tools. Doing OCR on your side is questionable; it will most likely be more convenient to implement it together with the rest of the functionality, especially since you probably won't be using any unique tools that I don't know about.
For quality checking, there are a couple more options among vision-language (VL) models; we will need to test them.
Samples of books are needed, preferably the most complex in structure, for testing.
Another question regarding parsing speed - what are the minimum and maximum requirements, if any?

Vladimir Novikov
Kyiv, Ukraine

Projects 5
Rating -
Rating 543
