pdf_parser.py
Overview
pdf_parser.py is a PDF parsing and processing module that extracts structured content (text, tables, and figures) from PDF documents. It combines OCR, layout recognition, and machine learning models to analyze the visual and textual layout of each page. The parser supports multi-page, multi-column documents and handles complex layouts, including nested tables and figures with captions.
Key functionalities include:
Extraction of text with spatial bounding boxes.
Table and figure detection and structured extraction.
OCR-based recognition for pages with poor text extraction.
Layout analysis and merging of fragmented text boxes.
Support for parallelized processing on multiple devices.
Integration with advanced vision models for enhanced layout understanding.
This file provides three main parser classes:
RAGFlowPdfParser: The primary, feature-rich parser for complex PDF layout extraction.
PlainParser: A simpler parser extracting plain text lines without layout sophistication.
VisionParser: Extends RAGFlowPdfParser with integration to external vision models for enhanced text extraction from images.
Classes and Methods
1. RAGFlowPdfParser
The core class for parsing PDFs into structured bounding boxes with text, tables, and figures.
Initialization
def __init__(self, **kwargs)
Initializes OCR, layout recognizers, table structure recognizer, and an XGBoost model for layout merging decisions.
Supports parallel OCR processing if multiple devices are available (PARALLEL_DEVICES).
Loads a pre-trained XGBoost model for judging whether to merge vertical text boxes.
Sets initial page and column configurations.
Usage Example
parser = RAGFlowPdfParser()
Internal Utility Methods
__char_width(self, c): Calculate character width from a character bounding box dictionary.
__height(self, c): Calculate height of a bounding box.
_x_dis(self, a, b): Horizontal distance between two bounding boxes.
_y_dis(self, a, b): Vertical distance between two bounding boxes.
_match_proj(self, b): Checks if a text matches predefined project-related regex patterns.
_updown_concat_features(self, up, down): Extract features between two vertically adjacent boxes to decide if they should be concatenated.
sort_X_by_page(arr, threshold): Sort bounding boxes primarily by page number, then by horizontal and vertical positions with threshold-based adjustment.
_has_color(self, o): Heuristic to determine if a character box has relevant color (filters out greyscale or irrelevant text).
proj_match(self, line): Matches line text against a set of predefined regexes to classify structural elements (e.g., chapters, sections).
_line_tag(self, bx, ZM): Generates a position tag string for a bounding box for identification.
remove_tag(txt): Static method to remove position tags from text.
extract_positions(txt): Static method to extract position metadata from tagged text.
get_position(self, bx, ZM): Returns positional tuples for a bounding box, considering page splits.
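The tagging helpers can be illustrated with a small round trip. This is a minimal sketch, assuming a tag of the form @@page, x0, x1, top, bottom## with tab-separated fields; the names TAG_RE, make_tag, strip_tags, and parse_tags are hypothetical stand-ins, and the tag layout and regexes used by the module itself may differ.

```python
import re

# Assumed tag layout: @@<page>\t<x0>\t<x1>\t<top>\t<bottom>##
TAG_RE = re.compile(r"@@([0-9]+)\t([0-9.]+)\t([0-9.]+)\t([0-9.]+)\t([0-9.]+)##")


def make_tag(page, x0, x1, top, bottom):
    """Build a position tag string for one bounding box (hypothetical helper)."""
    return "@@{}\t{:.1f}\t{:.1f}\t{:.1f}\t{:.1f}##".format(page, x0, x1, top, bottom)


def strip_tags(txt):
    """Remove every position tag, leaving only the readable text."""
    return TAG_RE.sub("", txt)


def parse_tags(txt):
    """Return a list of (page, x0, x1, top, bottom) tuples found in the text."""
    return [
        (int(p), float(x0), float(x1), float(t), float(b))
        for p, x0, x1, t, b in TAG_RE.findall(txt)
    ]


tagged = "Introduction" + make_tag(1, 72.0, 300.5, 90.0, 104.5)
plain = strip_tags(tagged)
positions = parse_tags(tagged)
```

Keeping the tag inline with the text means a chunk can later be mapped back to its page region without a separate index.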
Core Functional Methods
__ocr(self, pagenum, img, chars, ZM=3, device_id=None)
Runs OCR detection and recognition on an image, merges detected text boxes with character-level data, and appends results to internal state.
Parameters:
pagenum (int): 1-based page number.
img (PIL.Image): Image of the page.
chars (list): List of character bounding boxes.
ZM (int): Zoom multiplier.
device_id (int or None): Device ID for OCR processing (for parallelization).
Return: None
__images__(self, fnm, zoomin=3, page_from=0, page_to=299, callback=None)
Loads the PDF, extracts page images and character-level information, and performs OCR asynchronously on pages.
Handles locking to prevent concurrent access to pdfplumber.
Extracts outlines (table of contents) using pypdf.
Determines if the document is primarily English based on sampled text.
Supports progress callbacks.
Parameters:
fnm (str or bytes): PDF filename or binary content.
zoomin (int): Image DPI zoom factor for processing.
page_from (int): Start page index (0-based).
page_to (int): End page index (exclusive).
callback (callable): Optional function for progress reporting.
Return: None
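The English check described above can be approximated with a character-ratio heuristic. This is a minimal sketch under stated assumptions: the function name looks_english, the 0.8 threshold, and the ratio test itself are illustrative and not taken from the module, which may sample and match text differently.

```python
def looks_english(samples, threshold=0.8):
    """Return True when most alphabetic characters in the samples are ASCII letters."""
    letters = [ch for s in samples for ch in s if ch.isalpha()]
    if not letters:
        # No alphabetic content at all: do not claim the text is English.
        return False
    ascii_letters = sum(1 for ch in letters if ch.isascii())
    return ascii_letters / len(letters) >= threshold


english = looks_english(["The quick brown fox", "jumps over the lazy dog"])
chinese = looks_english(["这是一个中文文档", "page 1"])
```

Digits and punctuation are ignored, so page numbers or formulas in the sample do not skew the result.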
_layouts_rec(self, ZM, drop=True)
Performs layout recognition on the OCR boxes to classify them into layout components (text, tables, figures).
Uses LayoutRecognizer on page images and OCR boxes.
Adjusts bounding box coordinates with cumulative page height offsets.
Parameters:
ZM (int): Zoom multiplier.
drop (bool): Whether to drop uncertain layouts.
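The cumulative page height adjustment can be sketched as follows. This assumes boxes are dictionaries with a 0-based page index and page-local top and bottom values; page_offsets and to_global are hypothetical names used only for illustration.

```python
from itertools import accumulate


def page_offsets(page_heights):
    """Offset of each page's top edge in a single stacked coordinate space."""
    # Page 0 starts at 0; page i starts after the heights of pages 0..i-1.
    return [0.0] + list(accumulate(page_heights))[:-1]


def to_global(box, offsets):
    """Shift a page-local box (dict with page, top, bottom) into stacked coordinates."""
    shift = offsets[box["page"]]
    return {**box, "top": box["top"] + shift, "bottom": box["bottom"] + shift}


heights = [800.0, 800.0, 600.0]
offsets = page_offsets(heights)
global_box = to_global({"page": 2, "top": 50.0, "bottom": 70.0}, offsets)
```

With every box in one vertical coordinate space, elements that span a page break can be compared and sorted with plain numeric comparisons.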
_table_transformer_job(self, ZM)
Detects tables in the page layouts and extracts table structure components using TableStructureRecognizer.
Crops table images from pages.
Runs table detection model on cropped images.
Annotates bounding boxes with table row, header, column, and spanning information.
_text_merge(self)
Merges horizontally adjacent text boxes that belong to the same layout and are close in vertical alignment.
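A horizontal merge of this kind might look like the sketch below. The box fields (x0, x1, top, bottom, layout, text), the thresholds, and the function name merge_horizontal are assumptions made for the example; the sketch also assumes boxes on one line share nearly identical top values so that sorting by (top, x0) yields reading order.

```python
def merge_horizontal(boxes, max_gap=5.0, max_y_shift=3.0):
    """Merge left-to-right neighbours that share a layout id and sit on one line."""
    merged = []
    for box in sorted(boxes, key=lambda b: (b["top"], b["x0"])):
        prev = merged[-1] if merged else None
        same_line = (
            prev is not None
            and prev["layout"] == box["layout"]
            and abs(prev["top"] - box["top"]) <= max_y_shift
            and abs(box["x0"] - prev["x1"]) <= max_gap
        )
        if same_line:
            # Extend the previous box to cover the new fragment.
            prev["x1"] = box["x1"]
            prev["bottom"] = max(prev["bottom"], box["bottom"])
            prev["text"] += " " + box["text"]
        else:
            # Copy so the caller's input list is not mutated.
            merged.append(dict(box))
    return merged


boxes = [
    {"x0": 53, "x1": 90, "top": 100, "bottom": 112, "layout": "text-0", "text": "world"},
    {"x0": 10, "x1": 50, "top": 100, "bottom": 112, "layout": "text-0", "text": "Hello"},
    {"x0": 10, "x1": 60, "top": 130, "bottom": 142, "layout": "text-0", "text": "Next line"},
]
result = merge_horizontal(boxes)
```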
_naive_vertical_merge(self, zoomin=3)
Heuristically merges vertically adjacent boxes within the same column.
Detects multi-column layouts by estimating column width.
Sorts boxes accordingly.
Merges boxes unless separated by strong layout or page boundaries.
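The column detection and sorting steps can be sketched as below. This covers only the first two steps (estimating columns and ordering boxes), not the merge itself. The median-width estimate and the names estimate_columns and reading_order are assumptions for illustration; the module's own estimate may be computed differently.

```python
from statistics import median


def estimate_columns(boxes, page_width):
    """Guess the column count from the median box width (at least 1)."""
    widths = [b["x1"] - b["x0"] for b in boxes]
    if not widths or median(widths) <= 0:
        return 1
    return max(1, int(page_width // median(widths)))


def reading_order(boxes, page_width):
    """Sort boxes column by column, then top to bottom inside each column."""
    n_cols = estimate_columns(boxes, page_width)
    col_width = page_width / n_cols

    def key(b):
        # Clamp so a box starting at the right edge stays in the last column.
        column = min(n_cols - 1, int(b["x0"] // col_width))
        return (column, b["top"])

    return sorted(boxes, key=key)


page = [
    {"id": "A", "x0": 10, "x1": 290, "top": 100},
    {"id": "B", "x0": 310, "x1": 590, "top": 50},
    {"id": "C", "x0": 10, "x1": 290, "top": 300},
    {"id": "D", "x0": 310, "x1": 590, "top": 200},
]
order = [b["id"] for b in reading_order(page, page_width=600)]
```

Sorting by column first keeps the left column's text together even when a box in the right column sits higher on the page.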
_concat_downward(self, concat_between_pages=True)
(Partially disabled in the code by an early return.)
Intended to merge vertically adjacent boxes downward based on learned features and the XGBoost model.
_filter_forpages(self)
Filters out pages or boxes that look like table of contents, acknowledgments, or pages with noisy patterns (e.g., many dots).
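A filter of this kind can be sketched with two simple checks. The regexes, the 0.5 ratio, and the names is_toc_title and looks_like_toc_page are assumptions chosen for the example, not the patterns used by the module.

```python
import re

# Runs of dots or middle dots, as in "Chapter 1 ........ 12" (assumed pattern).
DOT_LEADER = re.compile(r"[.·]{4,}")
# Section titles treated as front matter (assumed list).
TOC_TITLE = re.compile(r"^(contents|table of contents|acknowledge?ments?)$", re.IGNORECASE)


def is_toc_title(text):
    """True when the whole line is a contents or acknowledgments heading."""
    return bool(TOC_TITLE.match(text.strip()))


def looks_like_toc_page(lines, min_ratio=0.5):
    """True when at least min_ratio of the non-empty lines contain dot leaders."""
    lines = [ln for ln in lines if ln.strip()]
    if not lines:
        return False
    hits = sum(1 for ln in lines if DOT_LEADER.search(ln))
    return hits / len(lines) >= min_ratio


toc = looks_like_toc_page(["1 Intro ...... 3", "2 Methods ..... 7", "Notes"])
prose = looks_like_toc_page(["This is prose.", "More prose..."])
```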
_merge_with_same_bullet(self)
Merges consecutive boxes starting with the same bullet character, combining their text content.
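The bullet merge can be illustrated with the sketch below. The set of bullet characters, the newline used to join text, and the name merge_same_bullet are assumptions for the example; the module may recognize bullets and join text differently.

```python
def merge_same_bullet(boxes):
    """Join consecutive boxes whose text starts with the same bullet character."""
    bullets = set("•●■◆-*")  # assumed bullet characters
    merged = []
    for box in boxes:
        text = box["text"].strip()
        if not text:
            continue  # skip empty boxes entirely
        prev = merged[-1] if merged else None
        if prev is not None and text[0] in bullets and prev["text"][0] == text[0]:
            # Same bullet as the previous box: fold this item into it.
            prev["text"] += "\n" + text
            prev["bottom"] = max(prev["bottom"], box["bottom"])
        else:
            merged.append({**box, "text": text})
    return merged


items = [
    {"text": "• one", "bottom": 10},
    {"text": "• two", "bottom": 20},
    {"text": "Paragraph", "bottom": 30},
    {"text": "- a", "bottom": 40},
]
merged_items = merge_same_bullet(items)
```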
_extract_table_figure(self, need_image, ZM, return_html, need_position, separate_tables_figures=False)
Extracts tables and figures from boxes, optionally returns cropped images and HTML representations of tables.
Merges tables across page boundaries.
Associates captions with tables and figures.
Crops out images corresponding to detected tables and figures.
Parameters:
need_image (bool): Whether to extract images for tables and figures.
ZM (int): Zoom multiplier.
return_html (bool): Whether to return tables as HTML.
need_position (bool): Whether to return positional metadata.
separate_tables_figures (bool): Whether to separate tables and figures in output.
Returns: List of (Image, text_or_html) tuples, possibly with position metadata.
__call__(self, fnm, need_image=True, zoomin=3, return_html=False)
Main entry method to parse a PDF file into text bounding boxes and tables.
Runs image extraction, layout recognition, table detection, text merging, concatenation, filtering, and extraction.
Returns filtered text boxes and tables.
Parameters:
fnm (str or bytes): PDF filename or binary.
need_image (bool): Whether to extract images for tables/figures.
zoomin (int): Zoom factor for image processing.
return_html (bool): Whether to return tables in HTML format.
Returns: Tuple: (list of text bounding boxes, list of tables)
parse_into_bboxes(self, fnm, callback=None, zoomin=3)
Advanced parsing method with stepwise progress callbacks.
Similar to __call__, but with detailed callbacks for each major stage.
After extraction, inserts table and figure bounding boxes back into the text box list, preserving spatial order.
Adds cropped images to bounding boxes.
Parameters:
fnm (str or bytes): PDF filename or binary.
callback (callable): Function receiving (progress: float, message: str).
zoomin (int): Zoom factor.
Returns: List of bounding boxes with layout and image data.
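A callback matching the (progress, message) signature could be built as in the sketch below. The factory name make_progress_logger, the stage messages, and the file name in the trailing comment are hypothetical; only the two-argument signature comes from the description above.

```python
def make_progress_logger(log):
    """Return a callback that records (progress, message) pairs in `log`."""

    def callback(progress, message=""):
        # Clamp to [0, 1] so an out-of-range value cannot distort a progress bar.
        progress = max(0.0, min(1.0, float(progress)))
        log.append((progress, message))
        print(f"[{progress:6.1%}] {message}")

    return callback


events = []
report = make_progress_logger(events)
report(0.4, "OCR finished")              # hypothetical stage message
report(0.63, "Layout analysis finished")  # hypothetical stage message
# With the parser installed, the callback would be passed along these lines:
#   boxes = parser.parse_into_bboxes("paper.pdf", callback=report)
```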
crop(self, text, ZM=3, need_position=False)
Given a text string containing position tags, extracts and crops the corresponding image regions from the PDF pages.
Handles multi-page bounding boxes by stitching cropped images vertically.
Applies transparent overlays on edges for visual clarity.
Parameters:
text (str): Text containing position tags.
ZM (int): Zoom factor.
need_position (bool): Whether to return positional metadata.
Returns: PIL Image object (and optionally positions) or None if no valid positions.
2. PlainParser
A lightweight parser that reads plain text from PDF pages without layout analysis.
Extracts text lines from pages.
Extracts outlines for navigation.
Does not implement cropping or tag removal (raises NotImplementedError).
3. VisionParser
Extends RAGFlowPdfParser to integrate with external vision models for enhanced text extraction from images.
Overrides __images__ to only extract images without OCR.
Uses picture_vision_llm_chunk to process each page image with a vision language model.
Returns a list of text snippets with simple positional metadata.
Initialization
def __init__(self, vision_model, *args, **kwargs)
Calling
def __call__(self, filename, from_page=0, to_page=100000, **kwargs)
Important Implementation Details
Concurrency Control: Uses a global threading lock LOCK_KEY_pdfplumber to avoid simultaneous opening of PDFs with pdfplumber, ensuring thread safety.
OCR Integration: Combines character-level info from pdfplumber with OCR-detected boxes to improve text recognition and positioning.
XGBoost Model: A gradient boosting model (updown_concat_xgb.model) is used to decide when to merge vertically adjacent text boxes, improving logical text flow.
Layout Models: Layout classification and table structure recognition are handled by LayoutRecognizer and TableStructureRecognizer, imported from the deepdoc.vision module.
Position Tagging: Text lines and boxes are tagged with position metadata to enable precise cropping and image extraction for UI or further processing.
Multi-Page and Multi-Column Support: The parser handles multi-page spanning elements and detects multi-column layouts to reorder text properly.
Model Downloading: If the XGBoost model is not found locally, it is downloaded from HuggingFace Hub automatically.
Language Detection: Samples text to heuristically determine if the document is English to adapt OCR and tokenization accordingly.
Error Handling: Extensive try-except blocks log warnings or errors but allow processing to continue where possible.
Interaction with Other Components
Uses external libraries:
pdfplumber and pypdf for PDF page and text extraction.
numpy and PIL for image and numerical processing.
xgboost for the merging decision model.
trio for asynchronous parallel OCR processing.
deepdoc.vision for OCR and layout recognition models.
rag NLP and vision utilities for tokenization and prompts.
huggingface_hub for model downloading.
Interacts with the api module for settings and base directory utility.
VisionParser uses a vision language model for enhanced page description.
Designed as part of a larger document analysis or retrieval-augmented generation (RAG) system.
Visual Diagram
classDiagram
class RAGFlowPdfParser {
+__init__(**kwargs)
+__call__(fnm, need_image=True, zoomin=3, return_html=False)
+parse_into_bboxes(fnm, callback=None, zoomin=3)
+crop(text, ZM=3, need_position=False)
-__ocr(pagenum, img, chars, ZM=3, device_id=None)
-__images__(fnm, zoomin=3, page_from=0, page_to=299, callback=None)
-_layouts_rec(ZM, drop=True)
-_table_transformer_job(ZM)
-_text_merge()
-_naive_vertical_merge(zoomin=3)
-_concat_downward(concat_between_pages=True)
-_filter_forpages()
-_merge_with_same_bullet()
-_extract_table_figure(need_image, ZM, return_html, need_position, separate_tables_figures=False)
-proj_match(line)
-_line_tag(bx, ZM)
-__filterout_scraps(boxes, ZM)
+remove_tag(txt)
+extract_positions(txt)
+get_position(bx, ZM)
}
class PlainParser {
+__call__(filename, from_page=0, to_page=100000, **kwargs)
+crop(ck, need_position)
+remove_tag(txt)
}
class VisionParser {
+__init__(vision_model, *args, **kwargs)
+__call__(filename, from_page=0, to_page=100000, **kwargs)
-__images__(fnm, zoomin=3, page_from=0, page_to=299, callback=None)
}
VisionParser --|> RAGFlowPdfParser
Summary
pdf_parser.py provides the PDF parsing layer within the InfiniFlow ecosystem. It combines OCR, layout analysis, and machine learning models to produce structured text and table output for downstream NLP and vision tasks.