parser.py

Overview

The parser.py file is a core component of the InfiniFlow system responsible for parsing various document types into structured or semi-structured data formats. It supports multiple input formats including PDF, spreadsheet files, Word documents, markdown files, and plain text. The parsing output can be customized into JSON, Markdown, HTML, or plain text depending on the content type and configuration.

This module integrates multiple parsing strategies and third-party libraries (e.g., Tika for Word parsing, Deepdoc for PDFs and spreadsheets) and also supports vision-language models (VLMs) for image-based text extraction. It is designed to be extensible and configurable, allowing different parsing methods and output formats per document type.

The Parser class extends the base processing class ProcessBase. Its entry point is asynchronous and uses Trio to run the blocking parse handlers in worker threads, so the calling task is not blocked while a document is parsed.

Classes and Functions

Class: `ParserParam`

Description

ParserParam extends ProcessParamBase and encapsulates configuration and validation logic for parsing input documents. It defines allowed output formats for different document types and setup parameters such as parsing methods, language preferences, and expected file suffixes.

Properties

allowed_output_format (dict[str, list[str]]): Defines the permitted output formats for each high-level document type (e.g., PDF supports json and markdown).
setups (dict[str, dict]): Contains configuration dictionaries for each document type specifying parsing methods, suffixes, language settings, and output formats.

Methods

__init__(self): Initializes default configurations and allowed output formats.
check(self) -> None
Validates the current configuration in setups for correctness, including:
- Valid parse methods for PDFs and images.
- Presence of required fields such as language for PDFs and VLM name when applicable.
- Output formats conform to allowed values.
get_input_form(self) -> dict[str, dict]
Returns an empty dictionary placeholder. This method could be extended to provide input forms or schemas for configuration UI.
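
The kind of validation check() performs can be sketched as follows. This is a minimal illustration, not the module's actual code: the function name check_setups, the key names (parse_method, lang, vlm_name, output_format), and the constant tables are assumptions made for the example. Only the rules themselves (valid parse methods, required language and VLM name, allowed output formats) come from the description above.

```python
# Hypothetical stand-ins: the real ParserParam may use different keys and values.
ALLOWED_OUTPUT_FORMAT = {
    "pdf": ["json", "markdown"],
    "spreadsheet": ["html", "json", "markdown"],
}
PDF_PARSE_METHODS = {"deepdoc", "plain_text", "vlm"}


def check_setups(setups: dict) -> None:
    """Raise ValueError on the first invalid entry found in `setups`."""
    # Every configured document type must use one of its allowed output formats.
    for doc_type, cfg in setups.items():
        allowed = ALLOWED_OUTPUT_FORMAT.get(doc_type)
        if allowed is None:
            raise ValueError(f"unknown document type: {doc_type!r}")
        if cfg.get("output_format") not in allowed:
            raise ValueError(f"{doc_type}: output_format must be one of {allowed}")

    # PDF-specific rules: known parse method, a language, and a VLM name for 'vlm'.
    pdf = setups.get("pdf")
    if pdf is not None:
        if pdf.get("parse_method") not in PDF_PARSE_METHODS:
            raise ValueError(f"pdf: parse_method must be one of {sorted(PDF_PARSE_METHODS)}")
        if not pdf.get("lang"):
            raise ValueError("pdf: lang is required")
        if pdf["parse_method"] == "vlm" and not pdf.get("vlm_name"):
            raise ValueError("pdf: vlm_name is required when parse_method is 'vlm'")
```

Failing fast with a descriptive message keeps misconfiguration errors close to the configuration step instead of surfacing later inside a parsing handler.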

Usage Example

param = ParserParam()
param.check()  # Validates the default configuration or after manual adjustments

Class: `Parser`

Description

Parser extends ProcessBase and implements the core logic for parsing various document types. It dispatches parsing to specialized private methods based on file extensions and configured parsing methods. The class supports asynchronous operation with Trio and reports progress via callbacks.

Properties

component_name (str): Set to "Parser" to identify this component.

Private Methods (Parsing Handlers)

Each private method takes a ParserFromUpstream object containing the input blob (binary data) and metadata such as filename.

_pdf(self, from_upstream: ParserFromUpstream) -> None
Parses PDF files using one of three methods:
- deepdoc: Uses RAGFlowPdfParser to extract bounding boxes.
- plain_text: Uses PlainParser to extract lines of text.
- vlm: Uses a Vision-Language Model via VisionParser for OCR-based text extraction.
Outputs in JSON or Markdown depending on configuration.
_spreadsheet(self, from_upstream: ParserFromUpstream) -> None
Parses spreadsheet files (xls, xlsx, csv) using ExcelParser. Supports output as HTML, JSON, or Markdown.
_word(self, from_upstream: ParserFromUpstream) -> None
Parses Word documents using Apache Tika's Python binding. Extracts text content split into sections and outputs JSON.
_markdown(self, from_upstream: ParserFromUpstream) -> None
Parses markdown files using a naive markdown parser. Extracts text sections and tables, collects embedded images, and outputs JSON combining text and images.
_text(self, from_upstream: ParserFromUpstream) -> None
Parses plain text files by decoding binary content. Outputs JSON or plain text based on configuration using utility get_text.
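
To illustrate how a single handler can serve several output formats, here is a small sketch that renders tabular data as JSON, Markdown, or HTML. It does not use ExcelParser; it reads CSV with the standard library as a stand-in, and the function name render_rows and the exact output shapes are assumptions made for the example.

```python
import csv
import io
import json
from html import escape


def render_rows(blob: bytes, output_format: str) -> str:
    """Render a CSV blob as 'json', 'markdown', or 'html' (illustrative only)."""
    rows = list(csv.reader(io.StringIO(blob.decode("utf-8"))))
    if not rows:
        return ""
    header, body = rows[0], rows[1:]

    if output_format == "json":
        # One object per data row, keyed by the header cells.
        return json.dumps([dict(zip(header, row)) for row in body])
    if output_format == "markdown":
        lines = [
            "| " + " | ".join(header) + " |",
            "| " + " | ".join("---" for _ in header) + " |",
        ]
        lines += ["| " + " | ".join(row) + " |" for row in body]
        return "\n".join(lines)
    if output_format == "html":
        head = "".join(f"<th>{escape(cell)}</th>" for cell in header)
        trs = "".join(
            "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in row) + "</tr>"
            for row in body
        )
        return f"<table><tr>{head}</tr>{trs}</table>"
    raise ValueError(f"unsupported output format: {output_format!r}")
```

Keeping the format switch in one place means a new output format only needs one more branch, while the parsing step stays unchanged.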

Public Methods

async _invoke(self, **kwargs) -> None
The main asynchronous entry point. It validates the input, selects the parsing handler whose configured suffixes match the file name, and runs that synchronous handler in a worker thread via trio.to_thread.run_sync so the calling task is not blocked.
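
The suffix matching step can be sketched as a table lookup. This is a minimal illustration under stated assumptions: the SETUPS table, the function name select_handler, and the Word, markdown, and text suffixes are hypothetical; only the spreadsheet suffixes (xls, xlsx, csv) are named in this document.

```python
from pathlib import Path
from typing import Callable, Optional

# Hypothetical suffix table; the real values live in ParserParam.setups.
SETUPS = {
    "pdf": {"suffix": ["pdf"]},
    "spreadsheet": {"suffix": ["xls", "xlsx", "csv"]},
    "word": {"suffix": ["doc", "docx"]},
    "markdown": {"suffix": ["md", "markdown"]},
    "text": {"suffix": ["txt"]},
}


def select_handler(name: str, handlers: dict) -> Optional[Callable]:
    """Return the handler whose configured suffixes match `name`, else None."""
    # Normalize: drop the leading dot and ignore case ("Report.PDF" -> "pdf").
    suffix = Path(name).suffix.lstrip(".").lower()
    for doc_type, cfg in SETUPS.items():
        if suffix in cfg["suffix"]:
            return handlers[doc_type]
    return None
```

Here handlers is assumed to map each document type to a callable, for example {"pdf": parser._pdf, ...}. Returning None for an unknown suffix lets the caller decide how to report the unsupported file.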

Usage Example

import trio

parser = Parser()
params = ParserParam()
parser._param = params

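# Example input blob for a PDF file; "document.pdf" is a placeholder path
with open("document.pdf", "rb") as f:
    pdf_binary_data = f.read()

input_data = {
    "blob": pdf_binary_data,
    "name": "document.pdf"
}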

async def run_parser():
    await parser._invoke(**input_data)

trio.run(run_parser)

Implementation Details and Algorithms

Parsing Strategy Selection:
The _invoke method dynamically selects the parsing method based on the file extension suffix matching configured in ParserParam.setups. This allows flexible extension for new formats or parsing methods.
PDF Parsing Methods:
- Deepdoc: Uses RAGFlowPdfParser to parse PDF into bounding boxes, which likely represent text blocks with spatial metadata.
- Plain Text: Extracts lines of text without layout information.
- Vision-Language Model (VLM): Invokes a specialized external model to perform OCR and text extraction with positional metadata.
Spreadsheet Parsing:
Uses ExcelParser to convert spreadsheets into HTML, Markdown, or JSON, enabling different consumption patterns such as rendering or structured data extraction.
Markdown Parsing:
Utilizes a naive markdown parser to extract text and images, with logic to concatenate multiple images found in a section for better visual representation.
Threading and Async:
The actual parsing methods are synchronous (CPU or IO-bound) but are called via trio.to_thread.run_sync inside the async _invoke, allowing the caller to remain asynchronous and scalable.
Progress Reporting:
Each parsing method reports progress via self.callback() with a small random increment simulating progress updates.
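
The thread offloading and progress reporting described above can be sketched as follows. To keep the example free of third-party dependencies it uses asyncio.to_thread from the standard library as a stand-in for trio.to_thread.run_sync; the two calls play the same role. The names parse_sync and invoke, the callback signature (progress, message), and the size of the random increment are assumptions made for the example.

```python
import asyncio
import random


def parse_sync(blob: bytes, callback) -> str:
    """A blocking handler: report a small random progress value, parse, report done."""
    callback(random.randint(1, 5) / 100.0, "Start parsing.")
    text = blob.decode("utf-8", errors="replace")
    callback(1.0, "Done.")
    return text


async def invoke(blob: bytes):
    """Async entry point: run the blocking handler in a worker thread."""
    progress = []

    def callback(prog: float, msg: str) -> None:
        # Collect progress reports; a real component would forward them upstream.
        progress.append((prog, msg))

    # With Trio this line would be: await trio.to_thread.run_sync(parse_sync, blob, callback)
    text = await asyncio.to_thread(parse_sync, blob, callback)
    return text, progress
```

Calling asyncio.run(invoke(b"hello")) returns the decoded text together with the collected progress reports. The event loop stays free to run other tasks while the handler works in its thread.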

Interactions with Other Parts of the System

Imports and Dependencies:
- api.db and api.db.services.llm_service: For accessing LLM bundles and type information, especially for Vision-Language Model integration.
- deepdoc.parser and deepdoc.parser.pdf_parser: Provide core parsing utilities for PDFs, spreadsheets, and images.
- rag.flow.base: Base classes for processing flow (ProcessBase, ProcessParamBase).
- rag.flow.parser.schema.ParserFromUpstream: Data model representing incoming document data and metadata.
- rag.llm.cv_model.Base: Utilities for image processing and encoding.
- Third-party libraries such as tika for Word document parsing and trio for asynchronous execution.
Output Handling:
The parser sets outputs using self.set_output(key, value) which integrates with the broader process framework to propagate parsed data downstream.
Error Handling:
Input validation errors set an error output _ERROR, allowing the system to detect and handle parsing failures gracefully.
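
The output and error conventions can be illustrated with a toy component. This is a sketch, not the ProcessBase implementation: the class MiniComponent, its output accessor, the output keys other than _ERROR, and the error message text are hypothetical.

```python
class MiniComponent:
    """Toy stand-in for a process component that records outputs by key."""

    def __init__(self):
        self._outputs = {}

    def set_output(self, key, value):
        self._outputs[key] = value

    def output(self, key):
        return self._outputs.get(key)

    def invoke(self, **kwargs):
        blob = kwargs.get("blob")
        name = kwargs.get("name")
        # Invalid input: record the problem under "_ERROR" and stop early.
        if not isinstance(blob, (bytes, bytearray)) or not name:
            self.set_output("_ERROR", "Input error: 'blob' and 'name' are required")
            return
        self.set_output("output_format", "text")
        self.set_output("text", bytes(blob).decode("utf-8", errors="replace"))
```

A downstream consumer can then test for the _ERROR key before reading any parsed output, which keeps failure handling uniform across components.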

Visual Diagram

classDiagram
    class ParserParam {
        +allowed_output_format: dict
        +setups: dict
        +__init__()
        +check()
        +get_input_form() dict
    }

    class Parser {
        +component_name: str
        -_pdf(from_upstream: ParserFromUpstream)
        -_spreadsheet(from_upstream: ParserFromUpstream)
        -_word(from_upstream: ParserFromUpstream)
        -_markdown(from_upstream: ParserFromUpstream)
        -_text(from_upstream: ParserFromUpstream)
        +_invoke(**kwargs)
    }

    Parser --|> ProcessBase
    ParserParam --|> ProcessParamBase

Summary

The parser.py file provides a flexible, extensible, and configurable parsing framework central to InfiniFlow's document ingestion pipeline. By supporting multiple document types and parsing strategies, it allows the system to convert heterogeneous inputs into structured data usable for downstream NLP, search, or analysis tasks. The design cleanly separates configuration (ParserParam) from execution (Parser), and leverages asynchronous execution patterns for efficient processing.
