txt_parser.py
Overview
The txt_parser.py file is part of the InfiniFlow project and provides functionality to parse plain text files into token-based chunks. It is specifically designed to support retrieval-augmented generation (RAG) workflows by splitting input text into manageable pieces based on token counts and user-defined delimiters. The core component is the RAGFlowTxtParser class, which reads text from files, segments it into chunks constrained by token count limits, and returns these chunks in a format suitable for downstream NLP or information retrieval tasks.
Detailed Explanation
Imports
re: Python standard library module used for regular expression operations.
get_text (from deepdoc.parser.utils): Utility function to read and decode text from a file.
num_tokens_from_string (from rag.nlp): Function to count the number of tokens in a given string, presumably based on a tokenizer compatible with the RAG model.
Class: RAGFlowTxtParser
This class encapsulates the logic for parsing text files into token-limited chunks, respecting custom delimiters.
Usage Example
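# Import path assumed from the file name and the deepdoc.parser package; adjust to your checkout.
from deepdoc.parser.txt_parser import RAGFlowTxtParser

parser = RAGFlowTxtParser()
# Calling the instance reads the file and returns [[chunk, ""], ...].
chunks = parser("path/to/textfile.txt", chunk_token_num=128, delimiter="\n!?;。;!?")
for chunk, _ in chunks:
    print(chunk)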
Methods
__call__(self, fnm, binary=None, chunk_token_num=128, delimiter="\n!?;。;!?")
Purpose: Enables the class instance to be called like a function to parse a file directly.
Parameters:
fnm (str): File name or path to the text file to be parsed.
binary (optional): Passed to get_text for reading the file as binary if needed.
chunk_token_num (int, default=128): Maximum number of tokens allowed per chunk.
delimiter (str, default="\n!?;。;!?"): String containing delimiter characters used to split the text.
Returns:
List[List[str, str]]: A list of lists, each containing a chunk of text and an empty string (format: [[chunk, ""]]).
Description: Reads the text from the given file, then calls the class method parser_txt to process and split the text.
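The delegation described above (a callable instance that reads text and hands it to a class method) can be sketched as follows. This is a minimal illustration, not the project's code: SketchTxtParser and read_text are hypothetical names, read_text is only a stand-in for get_text, and the UTF-8 decoding is an assumption.

```python
from pathlib import Path


def read_text(fnm, binary=None):
    """Hypothetical stand-in for get_text (assumption: UTF-8 content)."""
    if binary is not None:
        # Bytes were supplied, so decode them instead of touching the disk.
        return binary.decode("utf-8", errors="ignore")
    return Path(fnm).read_text(encoding="utf-8")


class SketchTxtParser:
    """Hypothetical class that mirrors the documented call flow only."""

    def __call__(self, fnm, binary=None, chunk_token_num=128, delimiter="\n!?;。;!?"):
        # Step 1: obtain the text. Step 2: delegate to the class method.
        txt = read_text(fnm, binary)
        return self.parser_txt(txt, chunk_token_num, delimiter)

    @classmethod
    def parser_txt(cls, txt, chunk_token_num=128, delimiter="\n!?;。;!?"):
        if not isinstance(txt, str):
            raise TypeError("txt must be a str")
        # Placeholder: a real implementation would split and group here.
        return [[txt, ""]]


result = SketchTxtParser()("unused.txt", binary=b"hello")
print(result)  # [['hello', '']]
```

Because parser_txt is a class method, text that is already in memory can be chunked without creating an instance or touching the file system.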
parser_txt(cls, txt, chunk_token_num=128, delimiter="\n!?;。;!?") (Class Method)
Purpose: Parses a raw text string into token-limited chunks split according to delimiters.
Parameters:
txt (str): The raw input text to parse.
chunk_token_num (int, default=128): Max tokens allowed in each chunk.
delimiter (str, default="\n!?;。;!?"): Delimiter characters used to split text.
Returns:
List[List[str, str]]: List of chunks, each element of the form [chunk_text, ""].
Raises:
TypeError if txt is not a string.
Description:
Validates that txt is a string.
Initializes an empty chunk list (cks) and a token count list (tk_nums).
Processes the delimiter string to correctly handle escape sequences and Unicode characters, then expands delimiters based on a custom logic with regex.
Splits the text into sections using compiled delimiters.
Iterates over each section:
If the section is a delimiter itself, it is skipped.
Otherwise, it is added to the last chunk if the combined token count is less than chunk_token_num; else, it starts a new chunk.
Returns the list of chunks wrapped inside lists with empty strings.
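The steps above can be sketched as follows. This is an illustration written from the description, not the project's source: chunk_text and count_tokens are hypothetical names, count_tokens is a whitespace word counter standing in for num_tokens_from_string, and every character of the delimiter string is assumed to act as a single-character separator.

```python
import re


def count_tokens(text):
    """Hypothetical stand-in for num_tokens_from_string: counts words."""
    return len(text.split())


def chunk_text(txt, chunk_token_num=128, delimiter="\n!?;。;!?"):
    if not isinstance(txt, str):
        raise TypeError("txt must be a str")

    # Assumption: each delimiter character is its own separator.
    # The capturing group makes re.split return the separators as well.
    pattern = "([" + re.escape(delimiter) + "])"

    chunks = [""]
    token_counts = [0]
    for section in re.split(pattern, txt):
        # Skip empty pieces and pieces that are delimiters themselves.
        if not section or re.fullmatch(pattern, section):
            continue
        n = count_tokens(section)
        if token_counts[-1] + n < chunk_token_num:
            # Still under the budget: extend the last chunk.
            chunks[-1] += section
            token_counts[-1] += n
        else:
            # Budget would be reached: start a new chunk.
            chunks.append(section)
            token_counts.append(n)

    # Wrap each non-empty chunk with an empty string, as documented.
    return [[c, ""] for c in chunks if c]


print(chunk_text("one two! three four? five six seven", chunk_token_num=5))
# [['one two three four', ''], [' five six seven', '']]
```

In this sketch a single section larger than chunk_token_num still becomes one oversized chunk, since sections are never split further; whether the real parser behaves the same way should be checked against its source.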
Important Implementation Details
Delimiter Handling: The delimiter string is processed with multiple encodings and regex operations to handle Unicode escape sequences and special characters robustly.
Token Counting: Uses num_tokens_from_string to measure token length, ensuring chunks do not exceed a token threshold.
Chunk Aggregation Logic: Instead of splitting strictly on delimiters, the parser attempts to group text sections together without exceeding the token count, which improves contextual integrity of chunks.
Output Format: Each chunk is wrapped in a list with an empty string as the second element. This may be designed to accommodate future metadata or labels.
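To make the delimiter handling more concrete, the sketch below shows one common idiom for turning literal escape sequences (for example a backslash followed by n, as typed into a configuration field) into real characters while leaving non-ASCII delimiters such as 。 intact, and then building a split pattern. Whether the parser uses exactly this chain is an assumption; decode_escapes and build_split_pattern are hypothetical names.

```python
import re


def decode_escapes(delimiter):
    """Turn literal escapes such as backslash-n into real characters.

    The UTF-8 / latin1 round trip keeps non-ASCII characters intact, because
    unicode_escape on its own would misread their UTF-8 bytes.
    """
    return (
        delimiter.encode("utf-8")
        .decode("unicode_escape")
        .encode("latin1")
        .decode("utf-8")
    )


def build_split_pattern(delimiter):
    """Build an alternation regex with one branch per delimiter character."""
    chars = [re.escape(ch) for ch in decode_escapes(delimiter)]
    if not chars:
        raise ValueError("delimiter must not be empty")
    # The capturing group keeps the separators in the re.split output.
    return "(" + "|".join(chars) + ")"


pattern = build_split_pattern("\\n。!")
print(re.split(pattern, "a\nb。c!d"))
# ['a', '\n', 'b', '。', 'c', '!', 'd']
```

Escaping each character with re.escape matters because several plausible delimiters, such as ? and ., are regex metacharacters.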
Interaction with Other System Components
deepdoc.parser.utils.get_text: Reads and decodes text files, abstracting file I/O and encoding concerns.
rag.nlp.num_tokens_from_string: Provides tokenization awareness crucial for chunk sizing in RAG workflows.
Downstream Usage: The output chunks can be fed into retrieval-augmented generation pipelines, indexing modules, or other NLP components that require segmented text inputs.
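As a rough illustration of downstream use, the sketch below converts the [[chunk, ""], ...] output into simple records that an indexing step could consume. The record layout, the to_records name, and the id scheme are all hypothetical and are not part of the InfiniFlow code.

```python
def to_records(chunks, source):
    """Convert [[chunk, ""], ...] into a list of dicts (hypothetical layout)."""
    records = []
    for i, (text, _unused) in enumerate(chunks):
        text = text.strip()
        if not text:
            # Drop chunks that contain only whitespace.
            continue
        records.append({"id": f"{source}#{i}", "text": text})
    return records


print(to_records([["first chunk ", ""], ["   ", ""], ["second chunk", ""]], "doc.txt"))
# [{'id': 'doc.txt#0', 'text': 'first chunk'}, {'id': 'doc.txt#2', 'text': 'second chunk'}]
```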
Mermaid Diagram
classDiagram
    class RAGFlowTxtParser {
        +__call__(fnm, binary=None, chunk_token_num=128, delimiter="\n!?;。;!?")
        +parser_txt(txt, chunk_token_num=128, delimiter="\n!?;。;!?")
    }
Summary
txt_parser.py implements a focused utility class, RAGFlowTxtParser, to read text files and split their content into token-limited chunks based on customizable delimiters. It balances splitting granularity with token count constraints to produce coherent chunks for retrieval-augmented NLP workflows. The parser relies on external utilities for token counting and file reading, making it a modular and reusable component within the InfiniFlow system.