parser.py
Overview
The parser.py file is a core component of the InfiniFlow system responsible for parsing various document types into structured or semi-structured data formats. It supports multiple input formats including PDF, spreadsheet files, Word documents, markdown files, and plain text. The parsing output can be customized into JSON, Markdown, HTML, or plain text depending on the content type and configuration.
This module integrates multiple parsing strategies and third-party libraries (e.g., Tika for Word parsing, Deepdoc for PDFs and spreadsheets) and also supports vision-language models (VLMs) for image-based text extraction. It is designed to be extensible and configurable, allowing different parsing methods and output formats per document type.
The Parser class extends a base processing framework (ProcessBase) and handles asynchronous invocation with Trio concurrency for efficient operation.
Classes and Functions
Class: ParserParam
Description
ParserParam extends ProcessParamBase and encapsulates configuration and validation logic for parsing input documents. It defines allowed output formats for different document types and setup parameters such as parsing methods, language preferences, and expected file suffixes.
Properties
allowed_output_format (dict[str, list[str]]): Defines the permitted output formats for each high-level document type (e.g., PDF supports json and markdown).
setups (dict[str, dict]): Contains configuration dictionaries for each document type specifying parsing methods, suffixes, language settings, and output formats.
Methods
__init__(self)
Initializes default configurations and allowed output formats.
check(self) -> None
Validates the current configuration in setups for correctness, including:
Valid parse methods for PDFs and images.
Presence of required fields such as language for PDFs and VLM name when applicable.
Output formats that conform to the allowed values.
get_input_form(self) -> dict[str, dict]
Returns an empty dictionary placeholder. This method could be extended to provide input forms or schemas for configuration UI.
Usage Example
param = ParserParam()
param.check() # Validates the default configuration or after manual adjustments
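The validation rules listed for check() can be sketched as a standalone function. This is a minimal illustration, not the module's actual code: the dictionary keys (parse_method, lang, vlm_name, output_format, suffix) and the default values are assumptions made for the example, and only the PDF and spreadsheet entries are shown.

```python
# Hypothetical, trimmed-down stand-ins for ParserParam.allowed_output_format
# and ParserParam.setups; key names and defaults are assumptions.
ALLOWED_OUTPUT_FORMAT = {
    "pdf": ["json", "markdown"],
    "spreadsheet": ["json", "markdown", "html"],
}

SETUPS = {
    "pdf": {
        "parse_method": "deepdoc",
        "lang": "English",
        "suffix": ["pdf"],
        "output_format": "json",
    },
    "spreadsheet": {"suffix": ["xls", "xlsx", "csv"], "output_format": "html"},
}

PDF_PARSE_METHODS = {"deepdoc", "plain_text", "vlm"}


def check_setups(setups, allowed):
    """Raise ValueError on the first invalid entry; return None otherwise."""
    pdf = setups.get("pdf")
    if pdf is not None:
        method = pdf.get("parse_method")
        if method not in PDF_PARSE_METHODS:
            raise ValueError(f"unknown PDF parse method: {method!r}")
        if not pdf.get("lang"):
            raise ValueError("PDF setup requires a language")
        if method == "vlm" and not pdf.get("vlm_name"):
            raise ValueError("parse method 'vlm' requires a VLM name")
    # Every document type must request an output format it is allowed to produce.
    for doc_type, cfg in setups.items():
        fmt = cfg.get("output_format")
        if fmt not in allowed.get(doc_type, []):
            raise ValueError(f"{doc_type}: output format {fmt!r} is not allowed")


check_setups(SETUPS, ALLOWED_OUTPUT_FORMAT)  # the sample configuration is valid
```

Raising on the first problem keeps the sketch short; a real implementation might collect all problems before reporting them.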
Class: Parser
Description
Parser extends ProcessBase and implements the core logic for parsing various document types. It dispatches parsing to specialized private methods based on file extensions and configured parsing methods. The class supports asynchronous operation with Trio and reports progress via callbacks.
Properties
component_name (str): Set to "Parser" to identify this component.
Private Methods (Parsing Handlers)
Each private method takes a ParserFromUpstream object containing the input blob (binary data) and metadata such as filename.
_pdf(self, from_upstream: ParserFromUpstream) -> None
Parses PDF files using one of three methods:
deepdoc: Uses RAGFlowPdfParser to extract bounding boxes.
plain_text: Uses PlainParser to extract lines of text.
vlm: Uses a Vision-Language Model via VisionParser for OCR-based text extraction.
Outputs in JSON or Markdown depending on configuration.
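The combination of a configurable parse method and a configurable output format could look roughly like the sketch below. The two handler functions are placeholders that return canned or trivially derived data; they do not call RAGFlowPdfParser, PlainParser, or VisionParser, and the box shape (a dict with a "text" key) is an assumption.

```python
def parse_deepdoc(blob: bytes) -> list[dict]:
    # Placeholder for layout-aware parsing; a real parser would derive
    # boxes (text plus position) from the PDF bytes.
    return [{"text": "Title", "page": 1}, {"text": "Body paragraph.", "page": 1}]


def parse_plain_text(blob: bytes) -> list[dict]:
    # Placeholder: treats the blob as UTF-8 text and keeps non-empty lines.
    text = blob.decode("utf-8", errors="ignore")
    return [{"text": line} for line in text.splitlines() if line.strip()]


# Hypothetical registry mapping a configured method name to its handler.
PDF_METHODS = {"deepdoc": parse_deepdoc, "plain_text": parse_plain_text}


def parse_pdf(blob: bytes, parse_method: str, output_format: str):
    """Pick a handler by name, then shape its boxes into the requested format."""
    if parse_method not in PDF_METHODS:
        raise ValueError(f"unsupported parse method: {parse_method!r}")
    boxes = PDF_METHODS[parse_method](blob)
    if output_format == "json":
        return boxes  # structured: one dict per box
    if output_format == "markdown":
        return "\n\n".join(box["text"] for box in boxes)  # flat text
    raise ValueError(f"unsupported output format: {output_format!r}")
```

Keeping the handlers in a dictionary means a new method only needs a new entry rather than another branch.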
_spreadsheet(self, from_upstream: ParserFromUpstream) -> None
Parses spreadsheet files (xls, xlsx, csv) using ExcelParser. Supports output as HTML, JSON, or Markdown.
_word(self, from_upstream: ParserFromUpstream) -> None
Parses Word documents using Apache Tika's Python binding. Extracts text content split into sections and outputs JSON.
_markdown(self, from_upstream: ParserFromUpstream) -> None
Parses markdown files using a naive markdown parser. Extracts text sections and tables, collects embedded images, and outputs JSON combining text and images.
_text(self, from_upstream: ParserFromUpstream) -> None
Parses plain text files by decoding binary content. Outputs JSON or plain text based on configuration, using the get_text utility.
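As an illustration of the simplest handler, the plain-text path can be approximated as below. This sketch does not use the get_text utility; it decodes with UTF-8 and replaces undecodable bytes, which is an assumption about reasonable behavior rather than a description of the real helper. The JSON shape (one dict per non-empty line) is also assumed.

```python
def parse_text(blob: bytes, output_format: str = "json"):
    """Decode a binary blob and return it as sections (json) or raw text."""
    # errors="replace" keeps the sketch from failing on non-UTF-8 input.
    text = blob.decode("utf-8", errors="replace")
    if output_format == "json":
        # One section per non-empty line; the real section logic may differ.
        return [{"text": line} for line in text.split("\n") if line.strip()]
    return text
```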
Public Methods
async _invoke(self, **kwargs) -> None
The main asynchronous entry point. It validates the input, selects the appropriate parsing method by matching the file suffix, and runs that method in a worker thread via Trio so the event loop is not blocked.
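The suffix matching step can be sketched in isolation. The spreadsheet suffixes come from the description above; the other suffix lists and the "suffix" key name are assumptions for the example.

```python
from pathlib import Path

# Hypothetical excerpt of ParserParam.setups, reduced to the suffix lists.
SETUPS = {
    "pdf": {"suffix": ["pdf"]},
    "spreadsheet": {"suffix": ["xls", "xlsx", "csv"]},
    "word": {"suffix": ["doc", "docx"]},
    "markdown": {"suffix": ["md", "markdown"]},
    "text": {"suffix": ["txt"]},
}


def select_doc_type(filename: str, setups: dict):
    """Return the first document type whose suffix list contains the file's extension."""
    suffix = Path(filename).suffix.lower().lstrip(".")
    for doc_type, cfg in setups.items():
        if suffix in cfg.get("suffix", []):
            return doc_type
    return None  # no handler configured for this extension


print(select_doc_type("Report.PDF", SETUPS))
```

A caller would then look up the handler for the returned type (for example, a method named after it) and treat None as an unsupported file.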
Usage Example
import trio
parser = Parser()
params = ParserParam()
parser._param = params
# Example input blob for a PDF file.
# pdf_binary_data is assumed to hold the file's raw bytes, loaded beforehand.
input_data = {
    "blob": pdf_binary_data,
    "name": "document.pdf",
}

async def run_parser():
    await parser._invoke(**input_data)

trio.run(run_parser)
Implementation Details and Algorithms
Parsing Strategy Selection:
The _invoke method dynamically selects the parsing method by matching the file extension against the suffixes configured in ParserParam.setups. This makes it straightforward to add new formats or parsing methods.
PDF Parsing Methods:
Deepdoc: Uses RAGFlowPdfParser to parse the PDF into bounding boxes, which likely represent text blocks with spatial metadata.
Plain Text: Extracts lines of text without layout information.
Vision-Language Model (VLM): Invokes a specialized external model to perform OCR and text extraction with positional metadata.
Spreadsheet Parsing:
Uses ExcelParser to convert spreadsheets into HTML, Markdown, or JSON, enabling different consumption patterns such as rendering or structured data extraction.
Markdown Parsing:
Utilizes a naive markdown parser to extract text and images, with logic to concatenate multiple images found in a section for better visual representation.
Threading and Async:
The actual parsing methods are synchronous (CPU- or IO-bound) but are called via trio.to_thread.run_sync inside the async _invoke, allowing the caller to remain asynchronous and scalable.
Progress Reporting:
Each parsing method reports progress via self.callback() with a small random increment simulating progress updates.
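The threading and progress-reporting pattern can be illustrated with a short sketch. Note the substitution: the module uses trio.to_thread.run_sync, whereas this example uses the standard library's asyncio.to_thread purely so that it runs without third-party packages. The callback signature (progress, message) and the function names are assumptions.

```python
import asyncio
import random
import time


def parse_sync(blob: bytes, callback) -> int:
    """Blocking stand-in for a parsing handler; reports progress, returns a byte count."""
    # A small random increment, mimicking the simulated progress updates.
    callback(random.randint(1, 5) / 100.0, "Start parsing.")
    time.sleep(0.01)  # stands in for CPU- or IO-bound work
    callback(1.0, "Done.")
    return len(blob)


async def invoke(blob: bytes) -> tuple[int, list]:
    events = []

    def callback(progress: float, message: str) -> None:
        # Called from the worker thread; list.append is safe for this use.
        events.append((progress, message))

    # The blocking handler runs in a worker thread; the event loop stays free.
    result = await asyncio.to_thread(parse_sync, blob, callback)
    return result, events


result, events = asyncio.run(invoke(b"hello"))
```

With Trio, the awaited call would be trio.to_thread.run_sync(parse_sync, blob, callback) inside a function started by trio.run; the structure is otherwise the same.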
Interactions with Other Parts of the System
Imports and Dependencies:
api.db and api.db.services.llm_service: For accessing LLM bundles and type information, especially for Vision-Language Model integration.
deepdoc.parser and deepdoc.parser.pdf_parser: Provide core parsing utilities for PDFs, spreadsheets, and images.
rag.flow.base: Base classes for processing flow (ProcessBase, ProcessParamBase).
rag.flow.parser.schema.ParserFromUpstream: Data model representing incoming document data and metadata.
rag.llm.cv_model.Base: Utilities for image processing and encoding.
Third-party libraries such as tika for Word document parsing and trio for asynchronous execution.
Output Handling:
The parser sets outputs using self.set_output(key, value), which integrates with the broader process framework to propagate parsed data downstream.
Error Handling:
Input validation errors set an error output, _ERROR, allowing the system to detect and handle parsing failures gracefully.
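The output and error convention can be sketched with a small stand-in class. MiniProcess is hypothetical and only mimics the set_output(key, value) call and the _ERROR key mentioned above; the validation rule and the other output keys are assumptions.

```python
class MiniProcess:
    """Hypothetical stand-in for ProcessBase that only stores outputs."""

    def __init__(self):
        self._outputs = {}

    def set_output(self, key, value):
        self._outputs[key] = value

    def run(self, **kwargs):
        name = kwargs.get("name")
        blob = kwargs.get("blob")
        # On invalid input, record the problem under _ERROR and stop early
        # instead of raising, so downstream steps can inspect the failure.
        if not name or not isinstance(blob, (bytes, bytearray)):
            self.set_output("_ERROR", "Input error: expected 'name' and a binary 'blob'")
            return
        self.set_output("output_format", "text")
        self.set_output("text", blob.decode("utf-8", errors="replace"))


failed = MiniProcess()
failed.run(name="a.txt")  # no blob supplied
succeeded = MiniProcess()
succeeded.run(name="a.txt", blob=b"hi")
```

A downstream consumer would check for the _ERROR key first and read the parsed outputs only when it is absent.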
Visual Diagram
classDiagram
class ParserParam {
+allowed_output_format: dict
+setups: dict
+__init__()
+check()
+get_input_form() dict
}
class Parser {
+component_name: str
+_pdf(from_upstream: ParserFromUpstream)
+_spreadsheet(from_upstream: ParserFromUpstream)
+_word(from_upstream: ParserFromUpstream)
+_markdown(from_upstream: ParserFromUpstream)
+_text(from_upstream: ParserFromUpstream)
+_invoke(**kwargs)
}
Parser --|> ProcessBase
ParserParam --|> ProcessParamBase
Summary
The parser.py file provides a flexible, extensible, and configurable parsing framework central to InfiniFlow's document ingestion pipeline. By supporting multiple document types and parsing strategies, it allows the system to convert heterogeneous inputs into structured data usable for downstream NLP, search, or analysis tasks. The design cleanly separates configuration (ParserParam) from execution (Parser), and leverages asynchronous execution patterns for efficient processing.