PDF book parser (text + images)

  1. 596
     2  0
    Work example:
    Car rental service
    1 day, 223 USD

    ✋ Hello! We are the IT company dZENcode.

    We are implementing a Python service for parsing PDFs with coordinate analysis of layouts, extracting text and images, classifying sections, validating through the Claude API, and a web admin panel, relying on the team's experience, best practices, and our own developments.

    What is the structure of the sections and the rules for thematic classification?
    Will there be coordinates for text blocks after OCR?

    You can find detailed information about our services and rates on our website: Freelancehunt
    Take a look – we will discuss the details of the work further, write when you are ready.

    The final cost is formed only after clarifying the volume and requirements.

    ___________________
    Sincerely,
    Manager of dZENcode

    Our strengths:
    💎 10+ years providing IT services: Outsourcing, Outstaffing
    🔥 90+ in-house specialists
    🚀 Projects "from scratch" and for support
    ⚙️ SLA and post-production support
    ✅ Contract with the company, guaranteed results!
    🔥 250+ public reviews since 2015.

  2. 271  
    2 days, 45 USD

    Good day! I am ready to complete your task for a reasonable price in exchange for a good evaluation of my work.

  3. 2964    14  0
    10 days, 334 USD

    Good day.
    For the task of three-column layout and extracting blocks of text along with images, a custom coordinate parser can be written, but as a more reliable alternative, I suggest considering specialized APIs like AWS Textract or Google Document AI. They natively recognize complex multi-column layouts and provide a ready structure, which will significantly reduce the number of errors before sending the text for verification.

    I will implement all server-side logic in Node.js with TypeScript: routing, validation through the Claude API with up to three attempts, and saving of results. The admin interface for managing the queue of books, displaying statistics, and viewing logs for problematic sections will be built on Next.js.

    In private messages, I will show examples of scripts for extracting data from documents with complex structures and integration with LLM API. I would be happy to review the extended technical assignment.

  4. 1390    12  0
    7 days, 334 USD

    Hello,
    I have experience working with the Tesseract library, and with text blocks in particular. I can implement the server in Node.js, Python, or Go (depending on your preferences), with a front end in Vue or React. I have also worked with LLMs and can create a universal interface for swapping agents if needed.

    I would be happy to collaborate!

  5. 1580    3  0
    1 day, 22 USD

    Hello, I have experience in creating systems and services for data parsing. I am ready to quickly and efficiently develop a parser for you, taking into account all requirements. I suggest discussing the details in private messages.

  6. 358    1  0
    7 days, 89 USD

    Good day!

    The task is clear. I have relevant experience: I developed a system for automatic uploading and processing of PDF invoices via API (the project is available on GitHub). The system included a GUI, date range selection, auto-upload, and auto-processing of files.

    For your project, I will implement:
    PDF parsing with block coordinate processing (pymupdf/pdfplumber) for correct reading of three-column layouts
    Quality checking through Claude API with auto-re-parsing
    Celery + Redis for task queue (30,000 pages - a stable queue is needed)
    Admin panel with dashboard and logs
    PostgreSQL for storing sections + PNG
    https://github.com/NazarShubeliak
    I am ready to discuss the detailed technical specifications.

  7. 7123    53  0
    1 day, 89 USD

    I understand the task of developing a reliable solution for parsing PDF manuals for car repairs, extracting text and images from a large volume of typographic sources. I have extensive experience in creating complex systems for extracting structured data from unstructured sources, including technical documentation and large document libraries. For such volume and specificity of data, the architecture that ensures extraction accuracy, error handling, and further scaling for analytics or display is critical. Please clarify what the ultimate goal of using the extracted data is: for creating a search database, interactive documentation, or something else? I would be happy to discuss this in detail to propose an optimal solution and assess timelines with the budget.

  8. 1495    13  0
    1 day, 45 USD

    Hello! I can implement it. Write to me privately to discuss all the details. I will be glad to cooperate!

  9. 387    1  0
    2 days, 223 USD

    Hello.

    In your technical specification, the key difficulty is not OCR, but the correct reconstruction of the structure: 3-column layout + mixed text/images. If you read the PDF "as is," you will get mixed text and a loss of logical sections.

    I propose a different approach:

    1. Parsing through coordinates (layout-aware)
    I break the page into blocks → cluster the columns → restore the reading order. This eliminates the mixing of text between columns.

    2. Content binding
    Images and tables are bound to the nearest text blocks (by coordinates and context) so that the connection is stored in the database, not just a "set of files."

    3. Claude as a quality gate, not a "hack"
    After parsing, each section undergoes verification:
    — whether the columns are stuck together
    — whether the logic of the text is violated
    — whether there are any breaks
    In case of errors — automatic retry with different parameters.

    4. Scaling to your volume
    100 books / ~30k pages → I do batching + queues + logging to ensure the system works stably and does not crash in the middle.

    5. An admin panel that really helps
    I will show not just "status," but problematic areas: which pages/sections did not pass validation and why.
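
    The layout-aware steps in point 1 (split the page into blocks, cluster the columns, restore the reading order) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the freelancer's actual code: the `Block` class, the `gap` threshold of 20 points, and both function names are hypothetical, and real block coordinates would have to come from a PDF library such as PyMuPDF or pdfplumber.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A text block with its bounding box (hypothetical structure)."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def cluster_columns(blocks, gap=20.0):
    """Group blocks into columns by their left edge.

    Blocks are sorted by x0; a new column starts whenever x0 jumps
    by more than `gap` relative to the previous block in the column.
    """
    columns = []
    for block in sorted(blocks, key=lambda b: b.x0):
        if columns and block.x0 - columns[-1][-1].x0 <= gap:
            columns[-1].append(block)
        else:
            columns.append([block])
    return columns


def reading_order(blocks, gap=20.0):
    """Return blocks column by column (left to right), top to bottom inside each."""
    ordered = []
    for column in cluster_columns(blocks, gap):
        ordered.extend(sorted(column, key=lambda b: b.y0))
    return ordered


# Example: a three-column page with blocks listed in arbitrary order.
page = [
    Block(210, 50, 390, 100, "B1"),
    Block(10, 200, 190, 250, "A2"),
    Block(410, 50, 590, 100, "C1"),
    Block(12, 50, 190, 100, "A1"),
    Block(212, 200, 390, 250, "B2"),
]
print([b.text for b in reading_order(page)])
```

    Blocks that span several columns (headings, wide figures) would break this simple left-edge clustering and need separate handling.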

    To save your time — I propose:
    I will create a prototype on 1 book (full cycle: parsing → Claude → structure in the database). You will immediately see if this is the level of quality you need.

    If it works — we will scale without changing the architecture.

    I am ready to start immediately after receiving a sample PDF.

  10. 139  
    10 days, 334 USD

    Vladimir, hello!

    An excellent and non-trivial task. Parsing multi-column PDFs is always a pain, but your approach with validating semantic gaps through the Claude API makes the system very smart and fault-tolerant.

    Plus, the topic of manuals is very close to me personally: I service my own cars (from VAZ 2105 to Mercedes), so I understand the specifics of repair manuals perfectly. I will immediately notice in the tests if the parser confuses the assembly order of a unit from adjacent columns.

    Here’s how I propose to technically implement the pipeline:

    Parser (Coordinates): Use the PyMuPDF (fitz) or pdfplumber library. They allow extracting bounding boxes (exact x,y coordinates). We will write a heuristic that will read blocks strictly by columns (top-to-bottom, left-to-right), cut out headers, and separately save PNG diagrams linked to paragraphs.
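
    As an illustration of linking a diagram to a paragraph by coordinates, one simple heuristic is to pick the paragraph whose bounding-box center lies closest to the image's center. The sketch below assumes that rule; the `Box` class and `nearest_paragraph` function are hypothetical names, and the proposal does not say which linking rule would actually be used.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Bounding box of a paragraph or an image (hypothetical structure)."""
    x0: float
    y0: float
    x1: float
    y1: float


def center(box):
    """Return the (x, y) center of a bounding box."""
    return ((box.x0 + box.x1) / 2, (box.y0 + box.y1) / 2)


def nearest_paragraph(image, paragraphs):
    """Return the index of the paragraph whose center is closest to the image center."""
    ix, iy = center(image)

    def squared_distance(i):
        px, py = center(paragraphs[i])
        return (px - ix) ** 2 + (py - iy) ** 2

    return min(range(len(paragraphs)), key=squared_distance)


# Example: the image sits directly below the first paragraph of the left column.
paragraphs = [
    Box(10, 50, 190, 100),    # left column, top
    Box(10, 300, 190, 350),   # left column, bottom
    Box(210, 50, 390, 100),   # middle column, top
]
image = Box(10, 110, 190, 200)
print(nearest_paragraph(image, paragraphs))
```

    A production heuristic would probably restrict candidates to the image's own column first, so that a diagram is never attached to text from an adjacent column.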

    Claude API: We will write a validation script with a system prompt that will analyze the text of the section for logical coherence. In case of an error — a trigger for a re-pass with modified indentation parameters.

    Web interface: To save time and create a convenient dashboard, I will set up an admin panel on Streamlit or FastAPI + Jinja2. There will be convenient book uploads, error logs from Claude, and manual control of stuck sections.

    I am ready to look at a couple of pages of your manuals as a test sample and show the logic of block extraction. I am waiting for an extended technical specification in private messages!

  11. 172    1  1
    1 day, 89 USD

    Hello! I am ready to complete this project and have extensive experience in developing various applications.

  12. 3700    17  0
    14 days, 334 USD

    Good day.

    I am ready to implement a turnkey system: parsing PDF with coordinate analysis of multi-column layout, thematic classification of sections, saving to a database, and a web admin panel for managing the queue, logs, and problematic cases.

    Technology stack used:

    Backend: Python, FastAPI / Django, Celery, PostgreSQL
    Integrations: PyMuPDF / pdfplumber, Claude API, OCR pipeline
    Frontend: Django Admin or a separate web admin panel
    Infrastructure: Docker, Redis

    I have experience working with coordinate extraction of text from PDF, multi-column layout, and integration of LLM API for validation and classification of content.

    I am ready to review the detailed technical specifications and provide an estimate regarding stages, timelines, and costs.

    Best regards,
    Andrii

  13. 95478    1271  1   10
    7 days, 601 USD

    Hello. I have been working with React/Node.js for over 8 years. I am ready for collaboration. Feel free to contact me.

  14. 807    2  0
    10 days, 223 USD

    Good day!

    The task is clear. I am solving the problem with multi-column layout through coordinate extraction (PyMuPDF): the algorithm reads the X/Y coordinates of the blocks and collects text with images strictly vertically within each zone, rather than left to right. Validation through Claude API is a great solution.

    To manage the entire pipeline, I will set up a separate web server (FastAPI or Flask). I will create a convenient admin panel in the browser, where you will be able to upload new PDFs, see a dashboard with Claude logs, and review rejected sections.

    I am waiting for the extended technical specification and am ready to discuss the details.

  15. 612    21  0
    10 days, 223 USD

    Hello. I can do your project. I have experience. Write to me and we will work out the details.

  16. 332    1  0
    9 days, 156 USD

    Good day, Vladimir. I have experience with parsers that extract content even from skewed scans in PDFs. I also have experience working with neural network APIs and integrating them into bots. If the project is still relevant, I suggest we discuss the details of collaboration.

  17. Nick Osipov Web4Business
    5011    41  4   1
    3 days, 22 USD

    Good day!

    I am ready to develop a system for parsing and classifying sections from your PDF books. I have extensive experience in coordinate text extraction from PDF and multi-column layout, as well as integrating LLM API for quality control and thematic classification.

    Contact me to discuss the details and obtain an extended technical specification.

  18. 3008    73  4   2
    2 days, 134 USD

    Good day! I can implement such a system in the form of a web application!!! Feel free to contact me!!!

  19. 2426    20  0
    5 days, 334 USD

    Good day. I am ready to complete your task quickly and efficiently; I have extensive experience in developing various parsers. Write to me in private messages and we will discuss the details. I will be happy to help.

  20. 9340    20  0   1
    3 days, 223 USD

    Good day. I have reviewed the task and can implement coordinate parsing of PDF, quality checking through Claude API, retry parsing attempts, and a web interface for managing books, logs, and problematic sections.

    I have experience with PDF parsing and data verification (https://freelancehunt.com/project/parser-pdf-bankivskih-vipisok/1578814.html), and I have also worked with Azure OCR, so I understand the nuances of complex layouts and multi-column text.

    I would like to see examples of books, especially those that are complex in structure, to better assess the approach and timelines. I am also interested in whether there are requirements for processing speed.

    I am ready to discuss the details.

  21. 1328    35  1
    3 days, 111 USD

    Hello. I have experience working with PDF, I understand what is being discussed and I understand the difficulties. Feel free to contact me, we will discuss the details and budget.

  22. 414  
    5 days, 111 USD

    Good day! 👋

    The task is clear — this is not just PDF parsing, but building a full-fledged data processing pipeline with quality control through LLM. I have relevant experience in such systems.

    Experience in similar projects

    I have worked on:

    — parsing complex PDFs (multi-column, tables, mixed blocks)
    — extracting text through coordinates (pdfplumber / PyMuPDF)
    — building pipelines: parsing → cleaning → validation → DB
    — integration with LLM (Claude / GPT) for verification and classification
    — systems with retry logic and data quality control

    How I see the implementation
    1. PDF Parser (key stage)

    — using PyMuPDF / pdfplumber
    — extracting blocks by coordinates (not line by line)
    — restoring the correct structure:
    — defining columns
    — sorting blocks (left → right, top → bottom)
    — separate parsing:
    — text
    — images (PNG with coordinate binding)
    — tables

    👉 This allows avoiding “mixed” text — the main problem of such PDFs.

    2. Processing + classification

    — segmentation into sections (by headings / structure)
    — text normalization
    — preparation for sending to Claude

    3. Integration with Claude API

    — text quality checking
    — problem detection (mixed columns, breaks)
    — retry logic (up to 3 attempts)
    — logging reasons for failure

    👉 This is essentially a “self-healing” pipeline.
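
    The retry logic described in step 3 could look roughly like the sketch below. It is an assumption-laden illustration, not the proposed implementation: `parse` and `validate` are hypothetical callables supplied by the caller, and `validate` stands in for a Claude API check that returns a pass/fail flag plus a reason. No real API call is made here.

```python
def parse_with_retries(parse, validate, param_sets, max_attempts=3):
    """Parse a section with successive parameter sets until validation passes.

    parse(params)   -> extracted text (hypothetical parser callable)
    validate(text)  -> (ok, reason); stand-in for an LLM quality check
    Returns (text or None, log), where log records every attempt.
    """
    log = []
    for attempt, params in enumerate(param_sets[:max_attempts], start=1):
        text = parse(params)
        ok, reason = validate(text)
        log.append({"attempt": attempt, "params": params, "ok": ok, "reason": reason})
        if ok:
            return text, log
    # All attempts failed: the section should be flagged for manual review.
    return None, log


# Example with fake parser and validator functions.
def fake_parse(params):
    return "clean text" if params["gap"] >= 20 else "mixed columns"


def fake_validate(text):
    if text == "clean text":
        return True, ""
    return False, "columns appear to be merged"


text, log = parse_with_retries(
    fake_parse, fake_validate, [{"gap": 10}, {"gap": 20}, {"gap": 30}]
)
print(text, len(log))
```

    Keeping the validator as a plain callable makes it easy to swap the LLM check for a cheaper heuristic during testing, and the returned log gives the admin panel the failure reasons it needs.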

    4. Backend (Python priority)

    — FastAPI
    — task queue (Celery / asyncio workers)
    — processing books in the background
    — API for the admin panel

    5. Database

    — PostgreSQL
    — structure:
    — books
    — sections
    — media (images)
    — statuses / logs
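
    The structure listed above (books, sections, media, statuses / logs) might map to tables along these lines. The sketch uses Python's built-in `sqlite3` module purely as a stand-in so that it runs without a server; the proposal itself names PostgreSQL, and every table and column name here is a hypothetical choice.

```python
import sqlite3

# Hypothetical schema: one row per book, section, extracted image, and validation attempt.
SCHEMA = """
CREATE TABLE books (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'queued'
);
CREATE TABLE sections (
    id INTEGER PRIMARY KEY,
    book_id INTEGER NOT NULL REFERENCES books(id),
    topic TEXT,
    body TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending'
);
CREATE TABLE media (
    id INTEGER PRIMARY KEY,
    section_id INTEGER NOT NULL REFERENCES sections(id),
    png_path TEXT NOT NULL,
    x0 REAL, y0 REAL, x1 REAL, y1 REAL
);
CREATE TABLE logs (
    id INTEGER PRIMARY KEY,
    section_id INTEGER NOT NULL REFERENCES sections(id),
    attempt INTEGER NOT NULL,
    passed INTEGER NOT NULL,
    reason TEXT
);
"""


def create_db(path=":memory:"):
    """Create the tables in a SQLite database (in-memory by default)."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def failed_sections(conn):
    """List (book title, section id, reason) for failed sections and their failed attempts."""
    rows = conn.execute(
        "SELECT b.title, s.id, l.reason FROM sections s "
        "JOIN books b ON b.id = s.book_id "
        "JOIN logs l ON l.section_id = s.id "
        "WHERE s.status = 'failed' AND l.passed = 0 "
        "ORDER BY s.id, l.attempt"
    )
    return rows.fetchall()


# Example: one book with one section that failed validation once.
conn = create_db()
conn.execute("INSERT INTO books (title) VALUES ('Manual A')")
conn.execute(
    "INSERT INTO sections (book_id, topic, body, status) "
    "VALUES (1, 'engine', 'some text', 'failed')"
)
conn.execute(
    "INSERT INTO logs (section_id, attempt, passed, reason) "
    "VALUES (1, 1, 0, 'columns merged')"
)
print(failed_sections(conn))
```

    A query like `failed_sections` is the kind of view an admin panel could use to show which sections did not pass validation and why.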

    6. Admin panel

    — simple web interface:
    — uploading books
    — starting parsing
    — statuses / progress
    — errors and retry
    — can be implemented on:
    — React or simpler (FastAPI + Jinja / admin panel)

    What the result will look like

    — you start processing a book
    — the system automatically:
    — parses
    — checks through Claude
    — saves in the DB
    — in the database:
    — clean structured text
    — linked images
    — there is an interface for control

    Technologies

    — Python (FastAPI, asyncio)
    — PyMuPDF / pdfplumber
    — PostgreSQL
    — Claude API
    — Docker

    I have already worked with multi-column PDFs and know the main “pitfalls” — this is exactly the case where standard solutions do not work and custom logic needs to be built.

    I am ready to look at examples of your PDFs and propose an exact architecture and implementation plan.

  23. 6276    144  6   4
    10 days, 223 USD

    Hello
    I have experience and developments in parsing complex PDFs that contain tables, graphs, and diagrams. I suggest using an approach with multiple tools. OCR on your side is questionable; it will most likely be more convenient to implement it together with the rest of the functionality, especially since you probably won't be using any unique tools that I don't know about.
    For quality checking, there are a couple more options for VL models; we will need to test them.
    Samples of books are needed, preferably the most complex in structure, for testing.
    Another question regarding parsing speed - what are the minimum and maximum requirements, if any?

  Another 9 proposals are hidden.

Client
Vladimir Novikov
Ukraine Kyiv  5  0
Project published
2 months 24 days ago
366 views
Tags
  • OCR
  • Web Interface
  • Claude API
  • PDF Parser