chunker.py


Overview

The chunker.py file is a core component of the InfiniFlow system responsible for dividing large textual or structured documents into manageable "chunks" for downstream processing. This chunking process enables efficient handling of documents for tasks such as information retrieval, question answering, keyword extraction, and summarization within the Retrieval-Augmented Generation (RAG) framework.

This module supports multiple chunking methods tailored to different document types and use cases (e.g., general text, Q&A documents, resumes, books, papers). It can also call external large language models (LLMs) to extract keywords and propose questions automatically, which enriches the chunked content. Both enrichment steps are optional.

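Key functionalities include method-specific chunking strategies, validated chunking parameters, and optional LLM-based keyword and question enrichment.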


Classes and Functions

Class: ChunkerParam

Description:
Defines and validates the parameters controlling chunking behavior.

Inheritance:
ProcessParamBase

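Properties:
method_options: list[str]
method: str
chunk_token_size: int
delimiter: str
overlapped_percent: float
page_rank: int
auto_keywords: int
auto_questions: int
tag_sets: list
llm_setting: dict

Methods:
check()
get_input_form(): returns a dict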

Usage Example:

param = ChunkerParam()
param.method = "general"
param.chunk_token_size = 256
param.check()  # Raises error if invalid
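
To illustrate what "defines and validates" can mean in practice, here is a minimal sketch of a parameter object with a check() method. SketchChunkerParam is a hypothetical stand-in, not the real ChunkerParam: the field names follow the class diagram in this document, but the defaults, the list of method names, and the validation rules are assumptions.

```python
from dataclasses import dataclass, field
from typing import List

# Assumed method names, derived from the strategy methods in the class diagram.
METHOD_OPTIONS = ["general", "q_and_a", "resume", "manual", "table",
                  "paper", "book", "laws", "presentation", "one"]


@dataclass
class SketchChunkerParam:
    """Hypothetical stand-in for ChunkerParam (defaults are assumptions)."""
    method: str = "general"
    chunk_token_size: int = 512
    delimiter: str = "\n"
    overlapped_percent: float = 0.0
    method_options: List[str] = field(default_factory=lambda: list(METHOD_OPTIONS))

    def check(self) -> None:
        """Raise ValueError on the first invalid field."""
        if self.method not in self.method_options:
            raise ValueError(f"Unknown chunking method: {self.method!r}")
        if self.chunk_token_size <= 0:
            raise ValueError("chunk_token_size must be a positive integer")
        if not 0.0 <= self.overlapped_percent < 1.0:
            raise ValueError("overlapped_percent must be in [0, 1)")
```

Validating once, before any document is processed, lets a misconfigured pipeline fail early instead of partway through a long chunking run.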

Class: Chunker

Description:
Main processing class that implements document chunking workflows. Derives from ProcessBase to integrate into the RAG flow.

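Properties:
component_name: str

Internal Methods (Chunking Strategies):
_general, _q_and_a, _resume, _manual, _table, _paper, _book, _laws, _presentation, _one

Each method receives a ChunkerFromUpstream object encapsulating the incoming document data and returns a list of chunk dictionaries.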
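
One common way to organize strategies that share this contract is a name-to-function dispatch table. The sketch below is an assumption about the general pattern, not a description of how Chunker selects its method: a plain dict stands in for ChunkerFromUpstream, and general, one, STRATEGIES, and chunk are hypothetical names.

```python
from typing import Callable, Dict, List

Chunk = Dict[str, str]


def general(doc: dict) -> List[Chunk]:
    # Hypothetical strategy: one chunk per non-empty paragraph.
    return [{"text": p.strip()} for p in doc["text"].split("\n\n") if p.strip()]


def one(doc: dict) -> List[Chunk]:
    # Hypothetical strategy: the whole document as a single chunk.
    return [{"text": doc["text"]}]


# Dispatch table mapping a method name to its strategy function.
STRATEGIES: Dict[str, Callable[[dict], List[Chunk]]] = {
    "general": general,
    "one": one,
}


def chunk(doc: dict, method: str) -> List[Chunk]:
    """Look up the strategy for `method` and apply it to `doc`."""
    try:
        strategy = STRATEGIES[method]
    except KeyError:
        raise ValueError(f"Unknown chunking method: {method!r}") from None
    return strategy(doc)
```

Because every strategy takes the same input and returns the same shape, a new document type can be supported by adding a function and one table entry.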

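Core Async Method:
_invoke(**kwargs)

Parameters:
**kwargs: keyword arguments describing the incoming document; the usage example below passes json_result.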

Returns:

Usage Example:

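import asyncio

async def main():
    chunker = Chunker()
    chunker._param.method = "general"
    # _invoke is a coroutine, so it has to be awaited inside an async function
    await chunker._invoke(json_result=my_json_doc)
    # Outputs chunk list accessible via chunker outputs

asyncio.run(main())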
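
The optional keyword enrichment mentioned in the overview fits naturally into an asynchronous method, because the per-chunk LLM calls can run concurrently. The following is a sketch under stated assumptions only: fake_extract_keywords is a deterministic placeholder for a real LLM call, and enrich, the "keywords" field, and the reading of auto_keywords as "number of keywords per chunk" are all hypothetical.

```python
import asyncio
from typing import Dict, List


async def fake_extract_keywords(text: str, top_n: int) -> List[str]:
    # Placeholder for an LLM request: yields control, then returns the
    # longest distinct words (ties broken alphabetically).
    await asyncio.sleep(0)
    words = sorted(set(text.lower().split()), key=lambda w: (-len(w), w))
    return words[:top_n]


async def enrich(chunks: List[Dict], auto_keywords: int) -> List[Dict]:
    """Attach keywords to every chunk; skip entirely when auto_keywords <= 0."""
    if auto_keywords <= 0:
        return chunks
    # Issue all requests at once and wait for them together.
    results = await asyncio.gather(
        *(fake_extract_keywords(c["text"], auto_keywords) for c in chunks)
    )
    return [{**c, "keywords": kws} for c, kws in zip(chunks, results)]
```

With a real model behind the placeholder, gathering the calls means total latency is closer to the slowest single request than to the sum of all requests.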

Implementation Details and Algorithms
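
This document does not spell out the chunking algorithm itself. As one illustration of how the chunk_token_size, delimiter, and overlapped_percent parameters could interact, the sketch below splits text on a delimiter, packs the pieces into chunks under a token budget, and carries a fraction of each finished chunk into the next. It is an assumption, not the module's implementation: chunk_by_tokens is a hypothetical helper, and whitespace-separated words stand in for real tokenizer output.

```python
from typing import List


def chunk_by_tokens(text: str, chunk_token_size: int,
                    delimiter: str = "\n",
                    overlapped_percent: float = 0.0) -> List[str]:
    """Pack delimiter-separated pieces into chunks of roughly chunk_token_size words."""
    pieces = [p.strip() for p in text.split(delimiter) if p.strip()]
    chunks: List[str] = []
    current: List[str] = []  # words of the chunk being built
    for piece in pieces:
        words = piece.split()
        # Close the current chunk if this piece would exceed the budget.
        if current and len(current) + len(words) > chunk_token_size:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk forward as overlap.
            keep = int(len(current) * overlapped_percent)
            current = current[-keep:] if keep else []
        # A single piece longer than the budget is kept whole, not split.
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

For example, chunk_by_tokens("a b c\nd e f\ng h", 4) yields three chunks, one per line, while the same call with overlapped_percent=0.5 starts the second chunk with the last word of the first.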


Interaction With Other System Components


Mermaid Class Diagram

classDiagram
    class ChunkerParam {
        +method_options: list[str]
        +method: str
        +chunk_token_size: int
        +delimiter: str
        +overlapped_percent: float
        +page_rank: int
        +auto_keywords: int
        +auto_questions: int
        +tag_sets: list
        +llm_setting: dict
        +check()
        +get_input_form() dict
    }

    class Chunker {
        +component_name: str
        -_general(from_upstream)
        -_q_and_a(from_upstream)
        -_resume(from_upstream)
        -_manual(from_upstream)
        -_table(from_upstream)
        -_paper(from_upstream)
        -_book(from_upstream)
        -_laws(from_upstream)
        -_presentation(from_upstream)
        -_one(from_upstream)
        +async _invoke(**kwargs)
    }

    ProcessParamBase <|-- ChunkerParam
    ProcessBase <|-- Chunker
    Chunker ..> ChunkerParam : uses

Summary

The chunker.py file is a foundational module that partitions documents into coherent chunks based on configurable methods and parameters. It supports multiple input formats and integrates advanced LLM-based enrichment for keywords and questions. Its asynchronous design and caching optimize throughput within the InfiniFlow RAG pipeline. The file serves as a key preparatory step enabling efficient downstream knowledge retrieval and generation tasks.