pdf_parser.py

Overview

pdf_parser.py is a PDF parsing and processing module that extracts structured content (text, tables, and figures) from PDF documents. It combines OCR, layout recognition, and machine learning models to analyze the visual and textual layout of PDF pages. The parser supports multi-page, multi-column documents and handles complex layouts, including nested tables and figures with captions.

Key functionalities include:

Extraction of text with spatial bounding boxes.
Table and figure detection and structured extraction.
OCR-based recognition for pages with poor text extraction.
Layout analysis and merging of fragmented text boxes.
Support for parallelized processing on multiple devices.
Integration with advanced vision models for enhanced layout understanding.

This file provides three main parser classes:

RAGFlowPdfParser: The primary, feature-rich parser for complex PDF layout extraction.
PlainParser: A simpler parser extracting plain text lines without layout sophistication.
VisionParser: Extends RAGFlowPdfParser by integrating external vision models for enhanced text extraction from images.

Classes and Methods

1. `RAGFlowPdfParser`

The core class for parsing PDFs into structured bounding boxes with text, tables, and figures.

Initialization

def __init__(self, **kwargs)

Initializes OCR, layout recognizers, table structure recognizer, and an XGBoost model for layout merging decisions.
Supports parallel OCR processing if multiple devices are available (PARALLEL_DEVICES).
Loads a pre-trained XGBoost model for judging whether to merge vertical text boxes.
Sets initial page and column configurations.

Usage Example

parser = RAGFlowPdfParser()

Internal Utility Methods

__char_width(self, c): Calculate character width from a character bounding box dictionary.
__height(self, c): Calculate height of a bounding box.
_x_dis(self, a, b): Horizontal distance between two bounding boxes.
_y_dis(self, a, b): Vertical distance between two bounding boxes.
_match_proj(self, b): Checks whether a box's text matches predefined regex patterns for structural markers (e.g., chapter and section numbering, bullets).
_updown_concat_features(self, up, down): Extract features between two vertically adjacent boxes to decide if they should be concatenated.
sort_X_by_page(arr, threshold): Sort bounding boxes primarily by page number, then by horizontal and vertical positions with threshold-based adjustment.
_has_color(self, o): Heuristic to determine if a character box has relevant color (filters out greyscale or irrelevant text).
proj_match(self, line): Matches line text against a set of predefined regexes to classify structural elements (e.g., chapters, sections).
_line_tag(self, bx, ZM): Generates a position tag string for a bounding box for identification.
remove_tag(txt): Static method to remove position tags from text.
extract_positions(txt): Static method to extract position metadata from tagged text.
get_position(self, bx, ZM): Returns positional tuples for a bounding box, considering page splits.
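
The position-tag helpers above (_line_tag, remove_tag, extract_positions) can be illustrated with a minimal sketch. The tag layout used here (`@@<pages>\t<x0>\t<x1>\t<top>\t<bottom>##`) and the function names are assumptions made for illustration; the module's actual format may differ.

```python
import re

# Assumed tag layout (hypothetical): @@<pages>\t<x0>\t<x1>\t<top>\t<bottom>##
# where <pages> is one page number or several joined by "-".
TAG_RE = re.compile(r"@@([0-9-]+)\t([0-9.]+)\t([0-9.]+)\t([0-9.]+)\t([0-9.]+)##")


def line_tag(pages, x0, x1, top, bottom):
    """Build a position tag for a box spanning the given pages."""
    page_field = "-".join(str(p) for p in pages)
    return "@@{}\t{:.1f}\t{:.1f}\t{:.1f}\t{:.1f}##".format(
        page_field, x0, x1, top, bottom
    )


def remove_tag(txt):
    """Strip every position tag, leaving only the readable text."""
    return TAG_RE.sub("", txt)


def extract_positions(txt):
    """Return (pages, x0, x1, top, bottom) for each tag found in txt."""
    positions = []
    for m in TAG_RE.finditer(txt):
        pages = [int(p) for p in m.group(1).split("-")]
        x0, x1, top, bottom = (float(m.group(i)) for i in range(2, 6))
        positions.append((pages, x0, x1, top, bottom))
    return positions
```

Keeping the tag inline with the text means a chunk can later be mapped back to its page region without a separate lookup table.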

Core Functional Methods

`__ocr(self, pagenum, img, chars, ZM=3, device_id=None)`

Runs OCR detection and recognition on an image, merges detected text boxes with character-level data, and appends results to internal state.

Parameters:
- pagenum (int): 1-based page number.
- img (PIL.Image): Image of the page.
- chars (list): List of character bounding boxes.
- ZM (int): Zoom multiplier.
- device_id (int or None): Device ID for OCR processing (for parallelization).
Return: None
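
As a rough illustration of merging OCR boxes with character-level data, the sketch below assigns each character to the box that contains its center point. The function name and the single-line-box assumption are hypothetical; the dictionary keys (x0, x1, top, bottom, text) follow the naming pdfplumber uses for character objects.

```python
def assign_chars_to_boxes(boxes, chars):
    """Fill each box's text from the characters whose centers fall inside it.

    Assumes every box covers a single text line, so characters inside a box
    can be ordered by their left edge. Characters that match no box are
    returned separately.
    """
    buckets = [[] for _ in boxes]
    leftovers = []
    for ch in chars:
        cx = (ch["x0"] + ch["x1"]) / 2
        cy = (ch["top"] + ch["bottom"]) / 2
        for i, box in enumerate(boxes):
            inside_x = box["x0"] <= cx <= box["x1"]
            inside_y = box["top"] <= cy <= box["bottom"]
            if inside_x and inside_y:
                buckets[i].append(ch)
                break
        else:
            # No box contained this character's center.
            leftovers.append(ch)
    filled = []
    for box, bucket in zip(boxes, buckets):
        bucket.sort(key=lambda c: c["x0"])
        filled.append({**box, "text": "".join(c["text"] for c in bucket)})
    return filled, leftovers
```

Boxes that end up with empty text are the natural candidates for OCR recognition, since the PDF carried no usable characters for that region.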

`__images__(self, fnm, zoomin=3, page_from=0, page_to=299, callback=None)`

Loads the PDF, extracts page images and character-level information, and performs OCR asynchronously on pages.

Handles locking to prevent concurrent access to pdfplumber.
Extracts outlines (table of contents) using pypdf.
Determines if the document is primarily English based on sampled text.
Supports progress callbacks.
Parameters:
- fnm (str or bytes): PDF filename or binary content.
- zoomin (int): Image DPI zoom factor for processing.
- page_from (int): Start page index (0-based).
- page_to (int): End page index (exclusive).
- callback (callable): Optional function for progress reporting.
Return: None
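
The locking behavior can be sketched with a module-level lock that serializes access to a non-thread-safe extraction step. PDF_LOCK and extract_with_lock are hypothetical names, and the extract callable stands in for whatever opens the PDF; this is a sketch of the pattern, not the module's code.

```python
import threading

# Hypothetical stand-in for the module's global pdfplumber lock.
PDF_LOCK = threading.Lock()


def extract_with_lock(extract, source):
    """Run extract(source) while holding the global lock.

    Only one thread at a time can be inside extract, which protects
    libraries that are not safe to call concurrently.
    """
    with PDF_LOCK:
        return extract(source)
```

Holding the lock only around the PDF access keeps later stages, such as OCR, free to run in parallel.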

`_layouts_rec(self, ZM, drop=True)`

Performs layout recognition on the OCR boxes to classify them into layout components (text, tables, figures).

Uses LayoutRecognizer on page images and OCR boxes.
Adjusts bounding box coordinates with cumulative page height offsets.
Parameters:
- ZM (int): Zoom multiplier.
- drop (bool): Whether to drop uncertain layouts.

`_table_transformer_job(self, ZM)`

Detects tables in the page layouts and extracts table structure components using TableStructureRecognizer.

Crops table images from pages.
Runs table detection model on cropped images.
Annotates bounding boxes with table row, header, column, and spanning information.

`_text_merge(self)`

Merges horizontally adjacent text boxes that belong to the same layout and are close in vertical alignment.
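
A minimal sketch of this kind of horizontal merge follows. It assumes the boxes arrive in reading order and carry a layout identifier under the hypothetical key "layoutno"; the gap and alignment thresholds are illustrative values, not the ones the parser uses.

```python
def merge_horizontal(boxes, max_gap=5.0, max_top_diff=3.0):
    """Merge each box into its left neighbour when both share a layout,
    sit at nearly the same height, and are separated by a small gap."""
    merged = []
    for box in boxes:
        prev = merged[-1] if merged else None
        same_line = (
            prev is not None
            and prev.get("layoutno") == box.get("layoutno")
            and abs(prev["top"] - box["top"]) <= max_top_diff
            and 0 <= box["x0"] - prev["x1"] <= max_gap
        )
        if same_line:
            # Grow the previous box to cover both and join the text.
            prev["x1"] = box["x1"]
            prev["top"] = min(prev["top"], box["top"])
            prev["bottom"] = max(prev["bottom"], box["bottom"])
            prev["text"] = prev["text"] + " " + box["text"]
        else:
            merged.append(dict(box))  # copy so the input stays untouched
    return merged
```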

`_naive_vertical_merge(self, zoomin=3)`

Heuristically merges vertically adjacent boxes within the same column.

Detects multi-column layouts by estimating column width.
Sorts boxes accordingly.
Merges boxes unless separated by strong layout or page boundaries.

`_concat_downward(self, concat_between_pages=True)`

Intended to merge vertically adjacent boxes downward based on learned features and the XGBoost model. An early return in the current code partially disables this logic.

`_filter_forpages(self)`

Filters out pages or boxes that look like table of contents, acknowledgments, or pages with noisy patterns (e.g., many dots).
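
One way such a filter could work is shown below: boxes are dropped when their text is a known front-matter heading or contains a long run of dot leaders. The heading list, the regex, and the minimum run length of eight are assumptions chosen for illustration, and the sketch filters individual boxes rather than whole pages.

```python
import re

# Eight or more leader characters in a row, optionally space-separated,
# as in "1 Introduction ............ 3". The threshold is an assumption.
DOT_LEADER = re.compile(r"(?:[.·…]\s?){8,}")

# Hypothetical list of headings that mark front matter.
FRONT_MATTER = {
    "contents",
    "table of contents",
    "acknowledgments",
    "acknowledgements",
}


def filter_boxes(boxes):
    """Keep only boxes that do not look like front matter or TOC entries."""
    kept = []
    for box in boxes:
        text = box["text"].strip()
        if text.lower() in FRONT_MATTER:
            continue
        if DOT_LEADER.search(text):
            continue
        kept.append(box)
    return kept
```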

`_merge_with_same_bullet(self)`

Merges consecutive boxes starting with the same bullet character, combining their text content.
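
The idea can be sketched as follows. The set of bullet characters and the newline used to join texts are assumptions for illustration, and the sketch assumes the boxes are already in reading order.

```python
# Hypothetical set of characters treated as bullets.
BULLETS = set("•-*·●○■□▪◦")


def merge_same_bullet(boxes):
    """Merge each box into the previous one when both start with the
    same bullet character, joining their text with a newline."""
    merged = []
    for box in boxes:
        text = box["text"].strip()
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and text
            and text[0] in BULLETS
            and prev["text"].strip()[:1] == text[0]
        ):
            prev["text"] = prev["text"] + "\n" + box["text"]
            prev["bottom"] = max(prev["bottom"], box["bottom"])
        else:
            merged.append(dict(box))  # copy so the input stays untouched
    return merged
```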

`_extract_table_figure(self, need_image, ZM, return_html, need_position, separate_tables_figures=False)`

Extracts tables and figures from boxes, optionally returns cropped images and HTML representations of tables.

Merges tables across page boundaries.
Associates captions with tables and figures.
Crops out images corresponding to detected tables and figures.
Parameters:
- need_image (bool): Whether to extract images for tables and figures.
- ZM (int): Zoom multiplier.
- return_html (bool): Whether to return tables as HTML.
- need_position (bool): Whether to return positional metadata.
- separate_tables_figures (bool): Whether to separate tables and figures in output.
Returns: List of (Image, text_or_html) tuples, possibly with position metadata.

`__call__(self, fnm, need_image=True, zoomin=3, return_html=False)`

Main entry method to parse a PDF file into text bounding boxes and tables.

Runs image extraction, layout recognition, table detection, text merging, concatenation, filtering, and extraction.
Returns filtered text boxes and tables.
Parameters:
- fnm (str or bytes): PDF filename or binary.
- need_image (bool): Whether to extract images for tables/figures.
- zoomin (int): Zoom factor for image processing.
- return_html (bool): Whether to return tables in HTML format.
Returns: Tuple: (list of text bounding boxes, list of tables)

`parse_into_bboxes(self, fnm, callback=None, zoomin=3)`

Advanced parsing method with stepwise progress callbacks.

Similar to __call__, but with detailed callbacks for each major stage.
After extraction, inserts table and figure bounding boxes back into the text box list, preserving spatial order.
Adds cropped images to bounding boxes.
Parameters:
- fnm (str or bytes): PDF filename or binary.
- callback (callable): Function receiving (progress: float, message: str).
- zoomin (int): Zoom factor.
Returns: List of bounding boxes with layout and image data.
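
A callback matching the (progress, message) shape described above might look like the sketch below. make_recorder and run_stages are hypothetical helpers: the first builds a callback that stores what it receives, and the second simulates a staged pipeline so the example runs without a PDF.

```python
def make_recorder():
    """Return a progress callback plus the list it records events into."""
    events = []

    def callback(progress, message=""):
        # Clamp progress to [0, 1] so a misbehaving stage cannot overflow a UI bar.
        events.append((max(0.0, min(1.0, float(progress))), message))

    return callback, events


def run_stages(stages, callback=None):
    """Hypothetical driver: run each (name, func) stage in order and
    report the fraction of stages completed after each one."""
    for i, (name, stage) in enumerate(stages, start=1):
        stage()
        if callback is not None:
            callback(i / len(stages), f"{name} finished")
```

Assuming the signature documented above, the same callback object would be passed as the callback argument of parse_into_bboxes.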

`crop(self, text, ZM=3, need_position=False)`

Given a text string containing position tags, extracts and crops the corresponding image regions from the PDF pages.

Handles multi-page bounding boxes by stitching cropped images vertically.
Applies transparent overlays on edges for visual clarity.
Parameters:
- text (str): Text containing position tags.
- ZM (int): Zoom factor.
- need_position (bool): Whether to return positional metadata.
Returns: PIL Image object (and optionally positions) or None if no valid positions.

2. `PlainParser`

A lightweight parser that reads plain text from PDF pages without layout analysis.

Extracts text lines from pages.
Extracts outlines for navigation.
Does not implement cropping or tag removal (raises NotImplementedError).

3. `VisionParser`

Extends RAGFlowPdfParser by integrating external vision models for enhanced text extraction from images.

Overrides __images__ to only extract page images, without OCR.
Uses picture_vision_llm_chunk to process each page image with a vision language model.
Returns a list of text snippets with simple positional metadata.

Initialization

def __init__(self, vision_model, *args, **kwargs)

Calling

def __call__(self, filename, from_page=0, to_page=100000, **kwargs)

Important Implementation Details

Concurrency Control: Uses a global threading lock LOCK_KEY_pdfplumber to avoid simultaneous opening of PDFs with pdfplumber, ensuring thread safety.
OCR Integration: Combines character-level info from pdfplumber with OCR detected boxes to improve text recognition and positioning.
XGBoost Model: A gradient boosting model (updown_concat_xgb.model) is used to decide when to merge vertically adjacent text boxes, improving logical text flow.
Layout Models: Layout classification and table structure recognition are handled by imported LayoutRecognizer and TableStructureRecognizer from the deepdoc.vision module.
Position Tagging: Text lines and boxes are tagged with position metadata to enable precise cropping and image extraction for UI or further processing.
Multi-Page and Multi-Column Support: The parser handles multi-page spanning elements and detects multi-column layouts to reorder text properly.
Model Downloading: If the XGBoost model is not found locally, it is downloaded from HuggingFace Hub automatically.
Language Detection: Samples text to heuristically determine if the document is English to adapt OCR and tokenization accordingly.
Error Handling: Extensive try-except blocks log warnings or errors but allow processing to continue where possible.
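
The language-detection step can be approximated with a simple sampling heuristic. The sketch below is a simplified stand-in, not the module's actual check: it samples the leading non-whitespace characters of each page, votes "English" when most of them are ASCII, and takes the majority across pages. The function name, sample size, and threshold are all assumptions.

```python
def looks_english(page_texts, sample_size=200, threshold=0.8):
    """Return True when most pages consist mainly of ASCII characters.

    Each page votes based on the ASCII share of its first sample_size
    non-whitespace characters; empty pages do not vote.
    """
    votes = []
    for text in page_texts:
        sample = [ch for ch in text if not ch.isspace()][:sample_size]
        if not sample:
            continue
        ascii_ratio = sum(ch.isascii() for ch in sample) / len(sample)
        votes.append(ascii_ratio >= threshold)
    # Strict majority; no votes at all means "not English".
    return bool(votes) and sum(votes) > len(votes) / 2
```

A per-page vote keeps a single page in another language, such as a foreign-language abstract, from flipping the result for the whole document.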

Interaction with Other Components

Uses external libraries:
- pdfplumber and pypdf for PDF page and text extraction.
- numpy and PIL for image and numerical processing.
- xgboost for the merging decision model.
- trio for asynchronous parallel OCR processing.
- deepdoc.vision for OCR and layout recognition models.
- rag NLP and vision utilities for tokenization and prompts.
- huggingface_hub for model downloading.
Interacts with the api module for settings and base directory utility.
VisionParser uses a vision language model for enhanced page description.
Designed as part of a larger document analysis or retrieval-augmented generation (RAG) system.

Visual Diagram

classDiagram
    class RAGFlowPdfParser {
        +__init__(**kwargs)
        +__call__(fnm, need_image=True, zoomin=3, return_html=False)
        +parse_into_bboxes(fnm, callback=None, zoomin=3)
        +crop(text, ZM=3, need_position=False)
        -__ocr(pagenum, img, chars, ZM=3, device_id=None)
        -__images__(fnm, zoomin=3, page_from=0, page_to=299, callback=None)
        -_layouts_rec(ZM, drop=True)
        -_table_transformer_job(ZM)
        -_text_merge()
        -_naive_vertical_merge(zoomin=3)
        -_concat_downward(concat_between_pages=True)
        -_filter_forpages()
        -_merge_with_same_bullet()
        -_extract_table_figure(need_image, ZM, return_html, need_position, separate_tables_figures=False)
        -proj_match(line)
        -_line_tag(bx, ZM)
        -__filterout_scraps(boxes, ZM)
        +remove_tag(txt)
        +extract_positions(txt)
        +get_position(bx, ZM)
    }

    class PlainParser {
        +__call__(filename, from_page=0, to_page=100000, **kwargs)
        +crop(ck, need_position)
        +remove_tag(txt)
    }

    class VisionParser {
        +__init__(vision_model, *args, **kwargs)
        +__call__(filename, from_page=0, to_page=100000, **kwargs)
        -__images__(fnm, zoomin=3, page_from=0, page_to=299, callback=None)
    }

    VisionParser --|> RAGFlowPdfParser

Summary

pdf_parser.py provides an extensible foundation for PDF document parsing within the InfiniFlow ecosystem. It combines OCR, layout analysis, and machine learning techniques to produce structured text and table extraction for downstream NLP and vision tasks.
