parser.py


Overview

The parser.py file is a core component of the InfiniFlow system responsible for parsing various document types into structured or semi-structured data formats. It supports multiple input formats including PDF, spreadsheet files, Word documents, markdown files, and plain text. The parsing output can be customized into JSON, Markdown, HTML, or plain text depending on the content type and configuration.

This module integrates several parsing backends (e.g., Tika for Word documents, Deepdoc for PDFs and spreadsheets) and also supports vision-language models (VLMs) for image-based text extraction. It is designed to be extensible: the parsing method and output format can be configured separately for each document type.

The Parser class extends the base processing class ProcessBase and is invoked asynchronously under Trio.


Classes and Functions

Class: ParserParam

Description

ParserParam extends ProcessParamBase and encapsulates the configuration and validation logic for parsing input documents. It defines the output formats allowed for each document type (allowed_output_format) and the per-type setup parameters (setups), such as the parsing method, language preference, and expected file suffixes.
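A minimal sketch of the kind of validation described above, assuming that `allowed_output_format` maps each document type to its permitted formats and that `setups` holds one settings dict per type. The keys (`parse_method`, `output_format`, `suffix`), the example values, and the `check_setups` helper are hypothetical illustrations, not the actual ParserParam implementation.

```python
# Hypothetical data; the real ParserParam defaults may differ.
ALLOWED_OUTPUT_FORMAT = {
    "pdf": ["json", "markdown"],
    "spreadsheet": ["json", "markdown", "html"],
    "word": ["json"],
    "markdown": ["json", "text"],
    "text": ["json", "text"],
}

SETUPS = {
    "pdf": {"parse_method": "deepdoc", "output_format": "json", "suffix": ["pdf"]},
    "word": {"parse_method": "tika", "output_format": "json", "suffix": ["doc", "docx"]},
    "text": {"output_format": "text", "suffix": ["txt"]},
}


def check_setups(setups, allowed):
    """Raise ValueError if a document type or its output format is not allowed."""
    for doc_type, setup in setups.items():
        if doc_type not in allowed:
            raise ValueError(f"Unknown document type: {doc_type!r}")
        fmt = setup.get("output_format")
        if fmt not in allowed[doc_type]:
            raise ValueError(
                f"Output format {fmt!r} is not allowed for {doc_type!r}; "
                f"expected one of {allowed[doc_type]}"
            )


# Passes silently for a valid configuration.
check_setups(SETUPS, ALLOWED_OUTPUT_FORMAT)
```

Validating the whole configuration up front, as `check()` is described as doing, surfaces a bad format choice before any document is parsed.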

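Properties

allowed_output_format (dict): the output formats permitted for each document type.
setups (dict): per-document-type settings such as the parsing method, language preference, and expected file suffixes.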

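Methods

__init__(): sets up the default configuration.
check(): validates the configuration, either the defaults or values adjusted manually.
get_input_form(): returns a dict (the input form).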

Usage Example

param = ParserParam()
param.check()  # Validates the default configuration or after manual adjustments

Class: Parser

Description

Parser extends ProcessBase and implements the core parsing logic. It dispatches each input to a specialized private handler based on the file extension and the parsing method configured for that document type. The class runs asynchronously under Trio and reports progress via callbacks.
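The dispatch described above can be sketched as follows. This is an illustration under stated assumptions, not the actual Parser code: the `SUFFIX_TO_TYPE` table, the `DispatchSketch` class, and its `route` method are hypothetical, and the handlers only return placeholders instead of parsing anything.

```python
from pathlib import Path

# Hypothetical suffix table; in the real module the expected suffixes
# come from the ParserParam configuration.
SUFFIX_TO_TYPE = {
    "pdf": "pdf",
    "xls": "spreadsheet",
    "xlsx": "spreadsheet",
    "doc": "word",
    "docx": "word",
    "md": "markdown",
    "txt": "text",
}


class DispatchSketch:
    """Routes a named blob to a per-type handler, mirroring the handler names."""

    def _pdf(self, blob):
        return ("pdf", len(blob))

    def _spreadsheet(self, blob):
        return ("spreadsheet", len(blob))

    def _word(self, blob):
        return ("word", len(blob))

    def _markdown(self, blob):
        return ("markdown", len(blob))

    def _text(self, blob):
        return ("text", len(blob))

    def route(self, name, blob):
        # Normalize the extension: ".PDF" -> "pdf"
        suffix = Path(name).suffix.lower().lstrip(".")
        doc_type = SUFFIX_TO_TYPE.get(suffix)
        if doc_type is None:
            raise ValueError(f"Unsupported file suffix: {suffix!r}")
        # Look up the handler by naming convention, e.g. "_pdf"
        handler = getattr(self, f"_{doc_type}")
        return handler(blob)


result = DispatchSketch().route("document.pdf", b"abc")
```

Keeping the suffix table as data rather than a chain of `if` statements means a new document type needs only one table entry and one handler method.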

Properties

component_name (str): the component's name.

Private Methods (Parsing Handlers)

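Each private handler takes a ParserFromUpstream object containing the input blob (binary data) and metadata such as the filename.

_pdf(from_upstream): PDF files.
_spreadsheet(from_upstream): spreadsheet files.
_word(from_upstream): Word documents.
_markdown(from_upstream): Markdown files.
_text(from_upstream): plain text.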
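To make the shape of the handler input concrete, here is a minimal stand-in for the upstream payload. It assumes only the two fields named in this document, `blob` and `name`; the class `UpstreamPayloadSketch` is hypothetical and uses a standard-library dataclass, whereas the real ParserFromUpstream may carry more metadata and be built differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpstreamPayloadSketch:
    """Hypothetical stand-in for ParserFromUpstream: a filename plus raw bytes."""

    name: str
    blob: bytes

    def __post_init__(self):
        # Reject text passed where binary data is expected.
        if not isinstance(self.blob, (bytes, bytearray)):
            raise TypeError("blob must be binary data (bytes)")
        if not self.name:
            raise ValueError("name must be a non-empty filename")


# Built from keyword arguments, in the same shape as the usage example's input dict.
payload = UpstreamPayloadSketch(**{"blob": b"%PDF-1.7", "name": "document.pdf"})
```

Validating the payload once at the boundary lets each handler assume it receives bytes and a usable filename.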

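Public Methods

_invoke(**kwargs): asynchronous entry point, awaited under Trio, that receives the upstream input as keyword arguments.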

Usage Example

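import trio

# Parser and ParserParam are assumed to be imported from this module.
parser = Parser()
params = ParserParam()
params.check()  # Validate the configuration before use
parser._param = params  # Sets a private attribute directly; for illustration only

# Example input blob for a PDF file, read as raw bytes
with open("document.pdf", "rb") as f:
    pdf_binary_data = f.read()

input_data = {
    "blob": pdf_binary_data,
    "name": "document.pdf"
}

async def run_parser():
    await parser._invoke(**input_data)

trio.run(run_parser)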

Implementation Details and Algorithms


Interactions with Other Parts of the System


Visual Diagram

classDiagram
    class ParserParam {
        +allowed_output_format: dict
        +setups: dict
        +__init__()
        +check()
        +get_input_form() dict
    }

    class Parser {
        +component_name: str
        -_pdf(from_upstream: ParserFromUpstream)
        -_spreadsheet(from_upstream: ParserFromUpstream)
        -_word(from_upstream: ParserFromUpstream)
        -_markdown(from_upstream: ParserFromUpstream)
        -_text(from_upstream: ParserFromUpstream)
        +_invoke(**kwargs)
    }

    Parser --|> ProcessBase
    ParserParam --|> ProcessParamBase

Summary

The parser.py file provides the configurable parsing layer of InfiniFlow's document ingestion pipeline. It converts PDF, spreadsheet, Word, Markdown, and plain-text inputs into structured output (JSON, Markdown, HTML, or plain text) for downstream NLP, search, or analysis tasks. The design separates configuration (ParserParam) from execution (Parser), and parsing runs asynchronously under Trio.

