txt_parser.py

Overview

The txt_parser.py file is part of the InfiniFlow project and provides functionality to parse plain text files into token-based chunks. It is specifically designed to support retrieval-augmented generation (RAG) workflows by splitting input text into manageable pieces based on token counts and user-defined delimiters. The core component is the RAGFlowTxtParser class, which reads text from files, segments it into chunks constrained by token count limits, and returns these chunks in a format suitable for downstream NLP or information retrieval tasks.

Detailed Explanation

Imports

re: Python standard library module used for regular expression operations.
get_text (from deepdoc.parser.utils): Utility function to read and decode text from a file.
num_tokens_from_string (from rag.nlp): Function that counts the tokens in a given string. The tokenizer is chosen by that function, not by this file, so chunk sizes are measured in whatever token unit it defines.

Class: `RAGFlowTxtParser`

This class encapsulates the logic for parsing text files into token-limited chunks, respecting custom delimiters.

Usage Example

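# Import path assumed from the module living next to deepdoc.parser.utils.
from deepdoc.parser.txt_parser import RAGFlowTxtParser

parser = RAGFlowTxtParser()
# Calling the instance reads the file and returns [chunk_text, ""] pairs.
chunks = parser("path/to/textfile.txt", chunk_token_num=128, delimiter="\n!?;。；！？")
for chunk, _ in chunks:
    print(chunk)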

Methods

`__call__(self, fnm, binary=None, chunk_token_num=128, delimiter="\n!?;。；！？")`

Purpose: Enables the class instance to be called like a function to parse a file directly.
Parameters:
- fnm (str): File name or path to the text file to be parsed.
- binary (optional): Raw file content passed through to get_text, presumably so that the text can be decoded from bytes already in memory instead of being read from fnm.
- chunk_token_num (int, default=128): Maximum number of tokens allowed per chunk.
- delimiter (str, default="\n!?;。；！？"): String containing delimiter characters used to split the text.
Returns: List[List[str]]. A list of two-element lists, each holding a chunk of text and an empty string (format: [[chunk, ""], ...]).
Description: Reads the text from the given file, then calls the class method parser_txt to process and split the text.
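
The delegation described above can be sketched as follows. This is a minimal illustration, not the project's code: TxtParserSketch and read_text are hypothetical stand-ins (read_text plays the role of get_text and simply assumes UTF-8), and parser_txt is reduced to a placeholder that returns the whole text as one chunk.

```python
def read_text(fnm, binary=None):
    """Hypothetical stand-in for get_text: prefer in-memory bytes, else read the file."""
    if binary is not None:
        return binary.decode("utf-8", errors="ignore")
    with open(fnm, encoding="utf-8") as f:
        return f.read()


class TxtParserSketch:
    """Hypothetical class showing the callable-instance pattern only."""

    def __call__(self, fnm, binary=None, chunk_token_num=128, delimiter="\n!?;。；！？"):
        # Step 1: obtain the text; step 2: hand it to the class method.
        txt = read_text(fnm, binary)
        return self.parser_txt(txt, chunk_token_num, delimiter)

    @classmethod
    def parser_txt(cls, txt, chunk_token_num=128, delimiter="\n!?;。；！？"):
        # Placeholder: the real method splits txt into token-limited chunks.
        return [[txt, ""]]


parser = TxtParserSketch()
# With binary supplied, the file name is never opened in this sketch.
result = parser("ignored.txt", binary="hello\nworld".encode("utf-8"))
```

Because parser_txt is a class method, it can also be called on a string directly, without creating an instance or touching the file system.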

`parser_txt(cls, txt, chunk_token_num=128, delimiter="\n!?;。；！？")` (Class Method)

Purpose: Parses a raw text string into token-limited chunks split according to delimiters.
Parameters:
- txt (str): The raw input text to parse.
- chunk_token_num (int, default=128): Max tokens allowed in each chunk.
- delimiter (str, default="\n!?;。；！？"): Delimiter characters used to split text.
Returns: List[List[str]]. A list of chunks, each element being [chunk_text, ""].
Raises: TypeError if txt is not a string.
Description:
1. Validates that txt is a string.
2. Initializes an empty chunk list (cks) and a matching token-count list (tk_nums).
3. Normalizes the delimiter string so that escape sequences and Unicode characters are interpreted correctly, then uses a regular expression to expand it into the individual delimiters to split on.
4. Splits the text into sections using compiled delimiters.
5. Iterates over each section:
  - If the section is a delimiter itself, it is skipped.
  - Otherwise, it is appended to the last chunk if the combined token count stays below chunk_token_num; if not, it starts a new chunk.
6. Returns the chunks, each wrapped as [chunk_text, ""].
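
The steps above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the actual implementation: count_tokens is a hypothetical stand-in for num_tokens_from_string that counts whitespace-separated words, every character of delimiter is treated as a single-character delimiter, and empty sections and empty chunks are dropped.

```python
import re


def count_tokens(text):
    # Hypothetical stand-in for num_tokens_from_string: one token per word.
    return len(text.split())


def chunk_text(txt, chunk_token_num=128, delimiter="\n!?;。；！？"):
    # Step 1: validate the input type.
    if not isinstance(txt, str):
        raise TypeError("txt must be a str")
    # Step 3 (simplified): one alternation branch per delimiter character.
    pattern = "|".join(re.escape(ch) for ch in delimiter)
    # Step 4: the capturing group keeps delimiters as their own sections.
    sections = re.split(f"({pattern})", txt)
    # Step 2: running chunk list and per-chunk token counts.
    chunks, token_counts = [""], [0]
    for sec in sections:
        # Step 5: skip empty strings and sections that are delimiters.
        if not sec or re.fullmatch(pattern, sec):
            continue
        n = count_tokens(sec)
        if chunks[-1] and token_counts[-1] + n >= chunk_token_num:
            # Combined count would reach the limit: start a new chunk.
            chunks.append(sec)
            token_counts.append(n)
        else:
            # Still below the limit: grow the current chunk.
            chunks[-1] += sec
            token_counts[-1] += n
    # Step 6: wrap every non-empty chunk as [chunk_text, ""].
    return [[c, ""] for c in chunks if c]


sample = "alpha beta gamma!delta epsilon?zeta eta theta;iota"
small = chunk_text(sample, chunk_token_num=4)
large = chunk_text(sample, chunk_token_num=6)
```

Note that the delimiters themselves are discarded in this sketch, so sections merged into one chunk are joined without the character that originally separated them.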

Important Implementation Details

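Delimiter Handling: The delimiter string passes through several encode/decode steps and regex operations so that escape sequences and non-ASCII delimiter characters are interpreted correctly before splitting.
Token Counting: Uses num_tokens_from_string to measure token length and keep chunks within the chunk_token_num threshold. Because text is only split at delimiters, a single section that is longer than the threshold still ends up as one oversized chunk.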
Chunk Aggregation Logic: Instead of splitting strictly on delimiters, the parser attempts to group text sections together without exceeding the token count, which improves contextual integrity of chunks.
Output Format: Each chunk is wrapped in a list with an empty string as the second element. This may be designed to accommodate future metadata or labels.
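
One way such delimiter handling could look is sketched below. It assumes the delimiter may arrive with escape sequences spelled out literally (for example a backslash followed by n from a configuration field). The names decode_escapes and build_delimiter_pattern are hypothetical, and the real parser may differ in detail.

```python
import re


def decode_escapes(delimiter):
    """Turn literal escapes such as backslash-n into real characters,
    while leaving non-ASCII characters such as '。' intact."""
    raw = delimiter.encode("utf-8")            # text -> UTF-8 bytes
    # unicode_escape interprets backslash escapes and maps every other
    # byte to the Latin-1 character with the same code point.
    escaped = raw.decode("unicode_escape")
    # Encoding as Latin-1 recovers the original bytes, which are then
    # decoded as UTF-8 to restore the non-ASCII characters.
    return escaped.encode("latin1").decode("utf-8")


def build_delimiter_pattern(delimiter):
    """Build a regex alternation with one escaped branch per character."""
    decoded = decode_escapes(delimiter)
    # re.escape keeps characters like '?' from acting as regex operators.
    return "|".join(re.escape(ch) for ch in decoded)


pattern = build_delimiter_pattern("\\n?。")
parts = re.split(pattern, "line one\nline two?第三句。end")
print(parts)
```

A limitation of this particular round trip is that an escape producing a character above U+00FF (such as a \uXXXX sequence for a CJK character) cannot be encoded as Latin-1 and raises UnicodeEncodeError, so such delimiters are better supplied as the characters themselves.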

Interaction with Other System Components

deepdoc.parser.utils.get_text: Reads and decodes text files, abstracting file I/O and encoding concerns.
rag.nlp.num_tokens_from_string: Provides tokenization awareness crucial for chunk sizing in RAG workflows.
Downstream Usage: The output chunks can be fed into retrieval-augmented generation pipelines, indexing modules, or other NLP components that require segmented text inputs.
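
As an illustration of downstream use, the sketch below turns the [chunk_text, ""] pairs into simple records that an indexing step could consume. The record layout and the to_records helper are hypothetical and not part of the project.

```python
def to_records(chunks, source):
    """Convert [chunk_text, ""] pairs into dicts, skipping blank chunks."""
    records = []
    for i, (text, _unused) in enumerate(chunks):
        if not text.strip():
            continue  # nothing worth indexing in a whitespace-only chunk
        # The id keeps the chunk's original position in the parser output.
        records.append({"id": f"{source}#{i}", "text": text})
    return records


records = to_records([["first chunk", ""], ["   ", ""], ["second chunk", ""]], "doc.txt")
```

Unpacking each pair as (text, _unused) keeps the code working if the second element later carries metadata instead of an empty string.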

Mermaid Diagram

classDiagram
    class RAGFlowTxtParser {
        +__call__(fnm, binary=None, chunk_token_num=128, delimiter="\n!?;。；！？")
        +parser_txt(txt, chunk_token_num=128, delimiter="\n!?;。；！？")
    }

Summary

txt_parser.py implements a focused utility class, RAGFlowTxtParser, to read text files and split their content into token-limited chunks based on customizable delimiters. It balances splitting granularity with token count constraints to produce coherent chunks for retrieval-augmented NLP workflows. The parser relies on external utilities for token counting and file reading, making it a modular and reusable component within the InfiniFlow system.