pdf_parser.py


Overview

pdf_parser.py is a PDF parsing module that extracts structured content (text, tables, and figures) from PDF documents. It combines OCR, layout recognition, and machine learning models to analyze the visual and textual layout of each page. The parser handles multi-page, multi-column documents and complex layouts, including nested tables and figures with captions.

This file provides three main parser classes: RAGFlowPdfParser, PlainParser, and VisionParser.


Classes and Methods

1. RAGFlowPdfParser

The core class for parsing PDFs into structured bounding boxes with text, tables, and figures.

Initialization

def __init__(self, **kwargs)

Usage Example

parser = RAGFlowPdfParser()

Internal Utility Methods


Core Functional Methods

__ocr(self, pagenum, img, chars, ZM=3, device_id=None)

Runs OCR detection and recognition on an image, merges detected text boxes with character-level data, and appends results to internal state.


__images__(self, fnm, zoomin=3, page_from=0, page_to=299, callback=None)

Loads the PDF, extracts page images and character-level information, and performs OCR asynchronously on pages.
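
The asynchronous per-page OCR step can be pictured with the following minimal sketch. It uses asyncio as a stand-in concurrency library, and `ocr_page`, `ocr_pages`, and `max_concurrency` are hypothetical names that do not come from the module. It also assumes `page_to` is exclusive, which the description above does not confirm.

```python
import asyncio

async def ocr_page(page_no, image):
    """Hypothetical stand-in for a real OCR call on one page image."""
    await asyncio.sleep(0)  # yield control, as a real I/O-bound or offloaded call would
    return page_no, f"text of {image}"

async def ocr_pages(images, page_from=0, page_to=299, max_concurrency=4):
    """Run OCR on pages in [page_from, page_to) with bounded concurrency."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(page_no, image):
        async with semaphore:  # at most max_concurrency pages in flight
            return await ocr_page(page_no, image)

    selected = list(enumerate(images))[page_from:page_to]
    # gather returns results in the order the tasks were passed in
    return await asyncio.gather(*(worker(n, img) for n, img in selected))

results = asyncio.run(ocr_pages(["p0", "p1", "p2"], page_from=1, page_to=3))
```

Bounding the number of in-flight pages keeps memory use predictable when each page image is large.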


_layouts_rec(self, ZM, drop=True)

Performs layout recognition on the OCR boxes to classify them into layout components (text, tables, figures).


_table_transformer_job(self, ZM)

Detects tables in the page layouts and extracts table structure components using TableStructureRecognizer.


_text_merge(self)

Merges horizontally adjacent text boxes that belong to the same layout and are close in vertical alignment.
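
As a rough illustration of this kind of merge, the sketch below joins neighbouring boxes that share a layout label and sit on nearly the same line. The box fields (`x0`, `x1`, `top`, `bottom`, `text`, `layout`), the thresholds, and the function name `merge_horizontal` are assumptions made for the example, not the module's actual data model. It also assumes the boxes arrive in reading order.

```python
def merge_horizontal(boxes, max_y_shift=3.0, max_x_gap=10.0):
    """Merge each box into its left neighbour when both lie on the same line.

    Assumes boxes are already in reading order (top to bottom, left to right).
    """
    merged = []
    for box in boxes:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev["layout"] == box["layout"]                      # same layout region
            and abs(prev["top"] - box["top"]) <= max_y_shift         # tops roughly aligned
            and abs(prev["bottom"] - box["bottom"]) <= max_y_shift   # bottoms roughly aligned
            and box["x0"] - prev["x1"] <= max_x_gap                  # small horizontal gap
        ):
            prev["text"] += box["text"]
            prev["x1"] = max(prev["x1"], box["x1"])
            prev["top"] = min(prev["top"], box["top"])
            prev["bottom"] = max(prev["bottom"], box["bottom"])
        else:
            merged.append(dict(box))  # copy so the caller's boxes are not mutated
    return merged

sample_line = [
    {"x0": 0, "x1": 40, "top": 10.0, "bottom": 20.0, "text": "Hello ", "layout": "text-0"},
    {"x0": 42, "x1": 80, "top": 10.5, "bottom": 20.5, "text": "world", "layout": "text-0"},
    {"x0": 0, "x1": 40, "top": 30.0, "bottom": 40.0, "text": "Next", "layout": "text-0"},
]
merged_line = merge_horizontal(sample_line)
```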


_naive_vertical_merge(self, zoomin=3)

Heuristically merges vertically adjacent boxes within the same column.
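
One plausible heuristic of this kind is sketched below: a box is appended to the previous one when the two overlap horizontally (a proxy for "same column") and the vertical gap is small relative to the line height. The field names, the `max_gap_ratio` threshold, and the function name are hypothetical; the module's real rules are not described here and may differ.

```python
def merge_vertical(boxes, max_gap_ratio=1.5):
    """Merge a box into the previous one if it continues the same column.

    Assumes boxes are in reading order and carry x0, x1, top, bottom, text.
    """
    merged = []
    for box in boxes:
        prev = merged[-1] if merged else None
        if prev is not None:
            line_height = box["bottom"] - box["top"]
            gap = box["top"] - prev["bottom"]
            # positive overlap means the two boxes share horizontal extent
            overlap = min(prev["x1"], box["x1"]) - max(prev["x0"], box["x0"])
            if overlap > 0 and -line_height <= gap <= max_gap_ratio * line_height:
                prev["text"] += " " + box["text"]
                prev["x0"] = min(prev["x0"], box["x0"])
                prev["x1"] = max(prev["x1"], box["x1"])
                prev["bottom"] = max(prev["bottom"], box["bottom"])
                continue
        merged.append(dict(box))  # start a new block
    return merged

sample_column = [
    {"x0": 0, "x1": 100, "top": 0, "bottom": 10, "text": "A paragraph that"},
    {"x0": 0, "x1": 90, "top": 12, "bottom": 22, "text": "continues here."},
    {"x0": 0, "x1": 100, "top": 60, "bottom": 70, "text": "New paragraph."},
    {"x0": 200, "x1": 300, "top": 72, "bottom": 82, "text": "Other column."},
]
merged_column = merge_vertical(sample_column)
```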


_concat_downward(self, concat_between_pages=True)

(Partially disabled: the code currently returns early.)

Intended to merge vertically adjacent boxes downward based on learned features and an XGBoost model.


_filter_forpages(self)

Filters out pages or boxes that look like a table of contents, acknowledgments, or pages with noisy patterns (e.g., many dots).
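
The "many dots" pattern can be detected with a simple regular expression. The sketch below is one possible approach and not the module's actual logic: it flags lines containing dot leaders and drops pages where such lines dominate. The names `DOT_LEADER`, `looks_like_toc_line`, `filter_noisy_pages`, and the `max_ratio` threshold are all hypothetical.

```python
import re

# Four or more consecutive dots, or four or more "dot + whitespace" pairs
DOT_LEADER = re.compile(r"\.{4,}|(?:\.\s){4,}")

def looks_like_toc_line(text):
    """Return True if the line contains a dot leader such as 'Intro ..... 3'."""
    return bool(DOT_LEADER.search(text))

def filter_noisy_pages(pages, max_ratio=0.5):
    """Keep pages whose share of dot-leader lines does not exceed max_ratio.

    pages is a list of pages, each page a list of text lines.
    """
    kept = []
    for lines in pages:
        if not lines:
            kept.append(lines)  # nothing to judge, keep the empty page
            continue
        hits = sum(looks_like_toc_line(line) for line in lines)
        if hits / len(lines) <= max_ratio:
            kept.append(lines)
    return kept

toc_page = ["Contents", "Introduction ........ 3", "Methods ........ 7", "Results . . . . . 12"]
body_page = ["This is body text.", "It ends here."]
kept_pages = filter_noisy_pages([toc_page, body_page])
```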


_merge_with_same_bullet(self)

Merges consecutive boxes starting with the same bullet character, combining their text content.
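
A minimal sketch of this idea follows, assuming each box is a dictionary with a `text` field and that a small fixed set of characters counts as bullets. Both assumptions, and the name `merge_same_bullet`, are illustrative only.

```python
BULLETS = set("•-*")  # assumed bullet characters, chosen for this example

def merge_same_bullet(boxes):
    """Combine runs of consecutive boxes that start with the same bullet."""
    merged = []
    for box in boxes:
        text = box["text"].strip()
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and text
            and prev["text"]
            and text[0] in BULLETS
            and prev["text"][0] == text[0]  # same bullet as the previous box
        ):
            prev["text"] += "\n" + text
        else:
            merged.append({**box, "text": text})
    return merged

bullet_boxes = [
    {"text": "• first"},
    {"text": "• second"},
    {"text": "Plain"},
    {"text": "- a"},
    {"text": "- b"},
]
bullet_texts = [b["text"] for b in merge_same_bullet(bullet_boxes)]
```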


_extract_table_figure(self, need_image, ZM, return_html, need_position, separate_tables_figures=False)

Extracts tables and figures from boxes, optionally returning cropped images and HTML representations of tables.


__call__(self, fnm, need_image=True, zoomin=3, return_html=False)

Main entry method to parse a PDF file into text bounding boxes and tables.


parse_into_bboxes(self, fnm, callback=None, zoomin=3)

Parses a PDF into bounding boxes, reporting stepwise progress through the optional callback.
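
The general shape of a stepwise progress callback is sketched below. The `callback(progress, message)` signature, the `run_stages` helper, and the stage names are assumptions for illustration; the callback contract is not specified in this document.

```python
def run_stages(stages, callback=None):
    """Run (name, function) stages in order, reporting progress after each."""
    results = []
    total = len(stages)
    for index, (name, stage) in enumerate(stages, start=1):
        results.append(stage())
        if callback is not None:
            # progress is the fraction of stages completed so far
            callback(index / total, f"{name} finished")
    return results

progress_log = []
stage_results = run_stages(
    [
        ("OCR", lambda: "ocr"),
        ("Layout analysis", lambda: "layout"),
        ("Table analysis", lambda: "tables"),
        ("Text merging", lambda: "merged"),
    ],
    callback=lambda progress, message: progress_log.append((progress, message)),
)
```

Passing `callback=None` skips reporting entirely, which mirrors the `callback=None` default in the method signature.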


crop(self, text, ZM=3, need_position=False)

Given a text string containing position tags, extracts and crops the corresponding image regions from the PDF pages.
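
To make the idea of position tags concrete, the sketch below assumes a tag of the form `@@page<TAB>x0<TAB>x1<TAB>top<TAB>bottom##` embedded in the text. This format is an assumption for the example, as are the helper names `parse_position_tags` and `strip_position_tags`. The image cropping itself is omitted.

```python
import re

# Assumed tag layout: @@<page>\t<x0>\t<x1>\t<top>\t<bottom>##
TAG = re.compile(r"@@(\d+)\t([0-9.]+)\t([0-9.]+)\t([0-9.]+)\t([0-9.]+)##")

def parse_position_tags(text):
    """Return a list of (page, x0, x1, top, bottom) tuples found in text."""
    positions = []
    for match in TAG.finditer(text):
        page = int(match.group(1))
        x0, x1, top, bottom = (float(v) for v in match.group(2, 3, 4, 5))
        positions.append((page, x0, x1, top, bottom))
    return positions

def strip_position_tags(text):
    """Remove every position tag, leaving only the readable text."""
    return TAG.sub("", text)

tagged = "Hello@@1\t10.0\t50.5\t100.0\t120.0## world@@2\t0.0\t5.0\t1.0\t2.0##"
positions = parse_position_tags(tagged)
plain = strip_position_tags(tagged)
```

Each parsed tuple identifies a rectangle on a page, which is the information a cropping step would need to cut the matching region out of the rendered page image.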


2. PlainParser

A lightweight parser that reads plain text from PDF pages without layout analysis.


3. VisionParser

Extends RAGFlowPdfParser to integrate with external vision models for enhanced text extraction from images.

def __init__(self, vision_model, *args, **kwargs)
def __call__(self, filename, from_page=0, to_page=100000, **kwargs)

Important Implementation Details


Interaction with Other Components


Visual Diagram

classDiagram
    class RAGFlowPdfParser {
        +__init__(**kwargs)
        +__call__(fnm, need_image=True, zoomin=3, return_html=False)
        +parse_into_bboxes(fnm, callback=None, zoomin=3)
        +crop(text, ZM=3, need_position=False)
        -__ocr(pagenum, img, chars, ZM=3, device_id=None)
        -__images__(fnm, zoomin=3, page_from=0, page_to=299, callback=None)
        -_layouts_rec(ZM, drop=True)
        -_table_transformer_job(ZM)
        -_text_merge()
        -_naive_vertical_merge(zoomin=3)
        -_concat_downward(concat_between_pages=True)
        -_filter_forpages()
        -_merge_with_same_bullet()
        -_extract_table_figure(need_image, ZM, return_html, need_position, separate_tables_figures=False)
        -proj_match(line)
        -_line_tag(bx, ZM)
        -__filterout_scraps(boxes, ZM)
        +remove_tag(txt)
        +extract_positions(txt)
        +get_position(bx, ZM)
    }

    class PlainParser {
        +__call__(filename, from_page=0, to_page=100000, **kwargs)
        +crop(ck, need_position)
        +remove_tag(txt)
    }

    class VisionParser {
        +__init__(vision_model, *args, **kwargs)
        +__call__(filename, from_page=0, to_page=100000, **kwargs)
        -__images__(fnm, zoomin=3, page_from=0, page_to=299, callback=None)
    }

    VisionParser --|> RAGFlowPdfParser

Summary

pdf_parser.py is the foundation for PDF document parsing within the InfiniFlow ecosystem. It combines OCR, layout analysis, and machine learning techniques to extract structured text and tables for downstream NLP and vision tasks.