__init__.py
Overview
This __init__.py file is part of the InfiniFlow project and provides core utility functions for text processing, document chunking, and hierarchical content merging. It focuses on text segmentation and tokenization for complex document structures such as PDFs and Word documents (docx), and for multilingual content that mixes Chinese, English, and several numeral representations (including Roman numerals).
Key functionalities include:
Detecting text encoding and handling diverse character sets.
Recognizing and categorizing bullet points and question markers in text.
Tokenizing text chunks and tables with optional image support.
Merging text sections hierarchically or naively, considering token limits for chunking.
Utility functions to analyze language content (Chinese/English).
Handling images associated with text chunks (concatenation and alignment).
Extracting substrings between tags and managing delimiters for text splitting.
This file acts as a foundational utility module used by document parsers and other components in the system for preparing text content for downstream NLP or retrieval tasks.
Detailed Documentation
Global Constants and Variables
all_codecs: A comprehensive list of text encoding names to attempt when detecting the encoding of a byte blob.
QUESTION_PATTERN: List of regex patterns to identify question bullets or markers in text (supports Chinese numerals, Arabic numerals, English words, Roman numerals).
BULLET_PATTERN: Nested list of regex patterns to detect various levels of bullet points or section markers in documents (supports multiple numbering styles and languages).
Functions
find_codec(blob) -> str
Purpose:
Detects the most probable text encoding for a given byte string blob.
Parameters:
blob (bytes): The raw byte content of a text.
Returns:
str: The detected encoding name (defaults to "utf-8" if detection is uncertain).
Details:
Uses chardet to guess the encoding from the first 1024 bytes.
Attempts to decode with the known codecs if chardet's confidence is low.
Overrides an ASCII detection to UTF-8.
Usage Example:
encoding = find_codec(raw_bytes)
text = raw_bytes.decode(encoding)
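The fallback step described above can be sketched with the standard library alone. This is a minimal sketch, not the module's implementation: the chardet pass is left out, and CANDIDATE_CODECS and guess_codec are hypothetical names standing in for all_codecs and find_codec.

```python
# Hypothetical short candidate list; the module's all_codecs is much longer.
CANDIDATE_CODECS = ["utf-8", "gbk", "big5"]


def guess_codec(blob: bytes) -> str:
    """Return the first candidate codec that decodes blob without error."""
    for codec in CANDIDATE_CODECS:
        try:
            blob.decode(codec)
            return codec
        except UnicodeDecodeError:
            continue
    # No candidate decoded cleanly, so use the documented default.
    return "utf-8"
```

Because the candidates are tried in order, putting "utf-8" first means plain ASCII input is reported as UTF-8, which matches the override described above.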
index_int(index_str) -> int
Purpose:
Converts various numeral representations (Arabic numbers, English words, Chinese numerals, Roman numerals) into an integer.
Parameters:
index_str (str): A numeral string in one of the supported forms.
Returns:
int: Parsed integer value, or -1 if parsing failed.
Implementation Details:
Tries int conversion first.
Falls back to word2number, cn2an (Chinese numeral to Arabic), and roman_numbers parsing.
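The try-then-fall-back chain can be illustrated without the third-party packages. In this sketch the tiny CN_DIGITS table and the hand-written roman_to_int are hypothetical stand-ins for cn2an and roman_numbers, the English-word step (word2number) is skipped, and to_index is an illustrative name rather than the module's function.

```python
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
# Hypothetical stand-in for cn2an: single Chinese numerals only.
CN_DIGITS = {"一": 1, "二": 2, "三": 3, "四": 4, "五": 5,
             "六": 6, "七": 7, "八": 8, "九": 9, "十": 10}


def roman_to_int(s: str) -> int:
    """Convert a Roman numeral; raises KeyError on non-Roman characters."""
    total = 0
    for ch, nxt in zip(s, s[1:] + " "):
        value = ROMAN[ch]
        # A smaller value before a larger one is subtracted (IV = 4).
        total += -value if ROMAN.get(nxt, 0) > value else value
    return total


def to_index(index_str: str) -> int:
    """Try each numeral form in turn; return -1 when none applies."""
    if not index_str:
        return -1
    try:
        return int(index_str)
    except ValueError:
        pass
    if index_str in CN_DIGITS:
        return CN_DIGITS[index_str]
    try:
        return roman_to_int(index_str.upper())
    except KeyError:
        return -1
```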
has_qbullet(reg, box, last_box, last_index, last_bull, bull_x0_list) -> (re.Match | None, int)
Purpose:
Determines if a text box contains a question bullet matching a regex pattern and enforces spatial and ordering constraints.
Parameters:
reg (str): Regex pattern for bullet detection.
box (dict): Current text box dict with keys like 'text', 'x0', 'top'.
last_box (dict): Previous text box dict.
last_index (int): Index of the last detected bullet.
last_bull (bool): Whether the last bullet was detected.
bull_x0_list (list): List of x0 positions of detected bullets.
Returns:
Tuple of (re.Match or None, int): Regex match object if a bullet is detected, and the parsed index.
Usage:
Used in parsing documents to detect question bullet points and maintain bullet ordering and alignment.
qbullets_category(sections) -> (int, str)
Purpose:
Categorizes a list of text sections by matching question bullet patterns.
Parameters:
sections (list[str]): List of text sections.
Returns:
Tuple (int, str): Index of the matched question pattern and the pattern string.
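One plausible way to categorize sections is to count how many of them each pattern matches and keep the pattern with the most hits. The sketch below assumes that voting scheme; PATTERNS is a hypothetical three-entry list, not the module's QUESTION_PATTERN, and the (-1, "") result for no match is an assumption of this example.

```python
import re

# Hypothetical patterns for illustration only.
PATTERNS = [
    r"([0-9]{1,2})[\.、]",   # "1." or "1、"
    r"\(([0-9]{1,2})\)",     # "(1)"
    r"([IVX]+)[\.、]",       # "IV."
]


def pick_pattern(sections):
    """Return (index, pattern) of the pattern matching the most sections."""
    hits = [sum(1 for s in sections if re.match(p, s)) for p in PATTERNS]
    best = max(range(len(PATTERNS)), key=hits.__getitem__)
    if hits[best] == 0:
        return -1, ""  # assumption: signal "no category" this way
    return best, PATTERNS[best]
```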
random_choices(arr, k) -> list
Wrapper for random.choices limited by the array length.
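A wrapper of this kind can be written in two lines. The sketch below uses the hypothetical name bounded_choices and assumes the limit is applied by capping k at len(arr); note that random.choices samples with replacement, so repeated elements are possible.

```python
import random


def bounded_choices(arr, k):
    """Draw at most len(arr) items from arr (with replacement)."""
    k = min(len(arr), k)
    return random.choices(arr, k=k)
```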
not_bullet(line) -> bool
Checks if a line matches patterns that should NOT be considered bullets.
bullets_category(sections) -> int
Similar to qbullets_category but for general bullet patterns.
is_english(texts) -> bool
Determines if the input text or list of texts is mostly English characters.
is_chinese(text) -> bool
Determines if the text contains a significant portion of Chinese characters.
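A character-ratio heuristic is one way to implement such a check. The sketch below is an assumption-laden illustration: looks_chinese, the CJK range used, and the 0.2 threshold are choices made for this example, not values confirmed by the source.

```python
def chinese_ratio(text: str) -> float:
    """Fraction of characters in the CJK Unified Ideographs block."""
    if not text:
        return 0.0
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    return cjk / len(text)


def looks_chinese(text: str, threshold: float = 0.2) -> bool:
    """True when the share of Chinese characters exceeds threshold (assumed value)."""
    return chinese_ratio(text) > threshold
```

The same shape works for an English check by counting ASCII letters instead of CJK characters.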
tokenize(d, t, eng) -> None
Tokenizes text t and updates dict d with token lists:
content_with_weight: original text with weights.
content_ltks: tokenized list.
content_sm_ltks: fine-grained tokens.
Removes certain HTML-like tags before tokenization.
tokenize_chunks(chunks, doc, eng, pdf_parser=None) -> list
Tokenizes multiple text chunks, optionally cropping images via a PDF parser, and returns a list of document dicts.
tokenize_chunks_with_images(chunks, doc, eng, images) -> list
Like tokenize_chunks but uses a provided list of images aligned with chunks.
tokenize_table(tbls, doc, eng, batch_size=10) -> list
Tokenizes tables represented as pairs of (image, rows), batching rows for tokenization.
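The row batching can be sketched as follows. batch_rows and the sep joiner are hypothetical; the description above does not say how rows inside a batch are joined, so the separator is left as a parameter.

```python
def batch_rows(rows, batch_size=10, sep="; "):
    """Join rows into strings of at most batch_size rows each."""
    return [
        sep.join(rows[i:i + batch_size])
        for i in range(0, len(rows), batch_size)
    ]
```

Each returned string would then be tokenized as one unit, so a 25-row table yields three batches of 10, 10, and 5 rows.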
add_positions(d, poss) -> None
Adds positional metadata (page number, bounding box coordinates) to document dict d from poss list of tuples.
remove_contents_table(sections, eng=False) -> None
Removes table of contents or acknowledgment sections from the list of sections based on matching keywords.
make_colon_as_title(sections) -> list
Splits sections where a colon (ASCII ":" or full-width "：") indicates a title, inserting a new title section accordingly.
title_frequency(bull, sections) -> (int, list)
Calculates the frequency of section titles by bullet pattern category bull and returns:
Most common title level.
List of levels per section.
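The level counting can be illustrated with collections.Counter. This is a simplified sketch: LEVEL_PATTERNS is a hypothetical pattern set for one bullet category, sections are plain strings, and unmatched sections get a "body" level equal to the number of patterns, which is an assumption of this example.

```python
import re
from collections import Counter

# Hypothetical level patterns, ordered from shallowest to deepest.
LEVEL_PATTERNS = [r"Chapter [0-9]+", r"Section [0-9]+", r"[0-9]+\.[0-9]+"]


def level_frequency(sections):
    """Return (most common title level, level assigned to each section)."""
    body_level = len(LEVEL_PATTERNS)  # assumption: marks non-title text
    levels = [
        next((i for i, p in enumerate(LEVEL_PATTERNS)
              if re.match(p, text.strip())), body_level)
        for text in sections
    ]
    title_levels = [lv for lv in levels if lv < body_level]
    if not title_levels:
        return body_level, levels
    return Counter(title_levels).most_common(1)[0][0], levels
```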
not_title(txt) -> bool
Determines if a text line is not considered a title based on regex and length heuristics.
hierarchical_merge(bull, sections, depth) -> list
Merges sections hierarchically based on bullet patterns up to a given depth.
Uses a binary search to find hierarchical relations.
Groups sections into nested chunks respecting bullet levels.
Limits chunk size based on token counts.
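The binary-search step can be pictured as finding, for each lower-level section, the nearest higher-level title that precedes it. The sketch below assumes each level is a sorted list of section positions and uses the standard bisect module; nearest_parent and attach_children are hypothetical names, and token-based size limits are left out.

```python
from bisect import bisect_left


def nearest_parent(parent_positions, child_position):
    """Index of the last parent located before child_position, or -1."""
    return bisect_left(parent_positions, child_position) - 1


def attach_children(parents, children):
    """Group child positions under the closest preceding parent position."""
    groups = {p: [] for p in parents}
    orphans = []  # children that appear before any parent
    for child in children:
        i = nearest_parent(parents, child)
        if i < 0:
            orphans.append(child)
        else:
            groups[parents[i]].append(child)
    return groups, orphans
```

Because the position lists are sorted, each lookup costs O(log n) instead of a linear scan over all titles.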
naive_merge(sections, chunk_token_num=128, delimiter="\n。;!?", overlapped_percent=0) -> list
Performs a simple merge of text sections into chunks with a maximum token count.
Splits sections by delimiters.
Supports overlapped content between chunks to maintain context.
Uses token counts for chunk size control.
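The three behaviours listed above can be combined in a short sketch. It is not the module's code: count_tokens is a whitespace-based stand-in for num_tokens_from_string, simple_merge is a hypothetical name, the default delimiter is an ASCII-only example, and closing a chunk only after it has exceeded the budget is an assumption of this example.

```python
import re


def count_tokens(text: str) -> int:
    """Stand-in token counter: whitespace-separated words."""
    return len(text.split())


def simple_merge(sections, chunk_token_num=128, delimiter="\n.;!?",
                 overlapped_percent=0):
    keep = (100 - overlapped_percent) / 100
    splitter = re.compile("([" + re.escape(delimiter) + "])")
    chunks, counts = [], []
    for section in sections:
        parts = splitter.split(section)
        # Re-attach each delimiter to the piece that precedes it.
        pieces = [a + b for a, b in zip(parts[0::2], parts[1::2] + [""])]
        for piece in pieces:
            if not piece.strip():
                continue
            n = count_tokens(piece)
            # Start a new chunk once the current one has passed its budget,
            # so a chunk may overshoot by at most one piece.
            if not chunks or counts[-1] > chunk_token_num * keep:
                tail = chunks[-1][int(len(chunks[-1]) * keep):] if chunks else ""
                chunks.append(tail + piece)  # tail is the overlapped context
                counts.append(n)
            else:
                chunks[-1] += piece
                counts[-1] += n
    return chunks
```

With overlapped_percent=0 the tail is empty, so chunks do not share text; a larger value copies the end of the previous chunk into the next one.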
naive_merge_with_images(texts, images, chunk_token_num=128, delimiter="\n。;!?", overlapped_percent=0) -> (list, list)
Same as naive_merge but handles corresponding images, concatenating images for merged chunks.
docx_question_level(p, bull=-1) -> (int, str)
Determines heading level in a docx paragraph p based on style or bullet pattern.
concat_img(img1, img2) -> Image | None
Concatenates two PIL images vertically.
Returns one image if the other is None or identical.
Creates a new image stacking img2 below img1.
naive_merge_docx(sections, chunk_token_num=128, delimiter="\n。;!?") -> (list, list)
Naively merges docx sections that have associated images, chunking by token count.
extract_between(text, start_tag, end_tag) -> list[str]
Extracts all substrings between start_tag and end_tag in text.
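A non-greedy regular expression is a common way to do this. The sketch below is one such approach under the hypothetical name extract_between_sketch; escaping the tags and letting the match span newlines are assumptions of this example.

```python
import re


def extract_between_sketch(text, start_tag, end_tag):
    """Return every substring found between start_tag and end_tag."""
    pattern = re.escape(start_tag) + r"(.*?)" + re.escape(end_tag)
    # DOTALL lets a match continue across line breaks.
    return re.findall(pattern, text, flags=re.DOTALL)
```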
get_delimiters(delimiters) -> str
Processes a delimiter string that may contain backtick-quoted substrings, returning a regex pattern string representing all delimiters.
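The backtick handling can be sketched like this: text inside backticks is kept as one multi-character delimiter, and every other character becomes a delimiter on its own. delimiter_pattern is a hypothetical name, and sorting longer delimiters first is an assumption made so that they win over their own prefixes in the alternation.

```python
import re


def delimiter_pattern(delimiters: str) -> str:
    """Build a regex alternation from a delimiter specification string."""
    tokens = []
    last = 0
    for m in re.finditer(r"`([^`]+)`", delimiters):
        tokens.append(m.group(1))                  # quoted, multi-character
        tokens.extend(delimiters[last:m.start()])  # unquoted, one per char
        last = m.end()
    tokens.extend(delimiters[last:])
    tokens.sort(key=len, reverse=True)
    return "|".join(re.escape(t) for t in tokens if t)
```

The returned string can be passed straight to re.split, for example re.split(delimiter_pattern("\n`##`;"), text).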
Implementation Details and Algorithms
Tokenization: Uses rag_tokenizer for tokenizing text and fine-grained tokenization to prepare text for retrieval-augmented generation (RAG) workflows.
Bullet and Question Detection: Uses extensive regex patterns to identify bullets and question markers in multiple languages and numeral systems.
Hierarchical Merging: Implements a binary search to determine hierarchical relationships among sections based on bullet pattern levels, supporting nested document structures.
Chunking: Merges sections into chunks constrained by token limits, with optional overlapping to preserve context between chunks.
Image Handling: Supports cropping images from PDFs and concatenating images vertically for merged chunks.
Encoding Detection: Employs chardet as a first pass, with fallbacks to decoding attempts across a broad array of character encodings.
Interaction with Other System Components
rag_tokenizer: Imported tokenization utilities, presumably specialized for the RAG pipeline.
rag.utils.num_tokens_from_string: Used to count tokens for chunk size management.
deepdoc.parser.pdf_parser.RAGFlowPdfParser: Referenced in merging functions for removing tags and cropping images from PDFs.
Image (PIL): Used for image concatenation.
word2number (w2n), cn2an, roman_numbers: Provide numeral parsing from various numeral formats.
This file serves as a utility backbone for document parsing and preprocessing modules in InfiniFlow, enabling robust text chunking, tokenization, and content structure recognition, essential for downstream NLP models and search.
Visual Diagram
flowchart TD
A[__init__.py] --> B[Encoding Detection]
A --> C[Bullet & Question Detection]
A --> D[Tokenization]
A --> E[Section Merging]
A --> F[Image Handling]
A --> G[Utility Functions]
B --> find_codec
C --> has_qbullet
C --> index_int
C --> qbullets_category
C --> bullets_category
D --> tokenize
D --> tokenize_chunks
D --> tokenize_chunks_with_images
D --> tokenize_table
E --> hierarchical_merge
E --> naive_merge
E --> naive_merge_with_images
E --> naive_merge_docx
F --> concat_img
G --> is_english
G --> is_chinese
G --> add_positions
G --> remove_contents_table
G --> make_colon_as_title
G --> title_frequency
G --> not_bullet
G --> not_title
G --> extract_between
G --> get_delimiters
Usage Examples
Example 1: Detect encoding and decode text
with open("document.txt", "rb") as f:
    raw = f.read()
encoding = find_codec(raw)
text = raw.decode(encoding)
print(text)
Example 2: Tokenize text chunks with images
chunks = ["Section 1 text...", "Section 2 text..."]
images = [image1, image2]  # existing PIL.Image objects, one per chunk
doc = {"doc_id": "abc123"}
tokenized_docs = tokenize_chunks_with_images(chunks, doc, eng=True, images=images)
# Use a separate loop variable so the template dict `doc` is not shadowed.
for chunk_doc in tokenized_docs:
    print(chunk_doc["content_ltks"])
Example 3: Merge sections hierarchically
sections = [
    ("第1章 介绍", "title"),
    ("第1节 背景", ""),
    ("正文内容...", ""),
]
# bullets_category returns a single int (the bullet pattern category).
bull = bullets_category([s[0] for s in sections])
merged = hierarchical_merge(bull, sections, depth=2)
for chunk in merged:
    print(chunk)