__init__.py


Overview

This __init__.py file is part of the InfiniFlow project. It provides core utility functions for text processing, document chunking, and hierarchical content merging. It focuses on text segmentation and tokenization for complex document structures such as PDFs and Word documents (docx), and for multilingual content that mixes Chinese, English, and several numeral systems (Arabic digits, English number words, Chinese numerals, and Roman numerals).

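Key functionalities include encoding detection, bullet and question detection, tokenization, section merging, and image handling.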

This file acts as a foundational utility module used by document parsers and other components in the system for preparing text content for downstream NLP or retrieval tasks.


Detailed Documentation


Global Constants and Variables


Functions

find_codec(blob) -> str

Purpose:
Detects the most probable text encoding for a given byte string blob.

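Parameters:
blob: the byte string whose encoding should be detected.

Returns:
The name of the most probable encoding, as a string.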

Details:

Usage Example:

encoding = find_codec(raw_bytes)
text = raw_bytes.decode(encoding)
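
One simple way to approach this kind of detection is to try a short list of candidate codecs in order and keep the first one that decodes the bytes without error. The sketch below illustrates only that idea. The name detect_codec and its candidate list are hypothetical, and it is not the module's actual find_codec logic.

```python
def detect_codec(blob: bytes, candidates=("utf-8", "gb18030", "latin-1")) -> str:
    """Return the first candidate codec that decodes blob without errors."""
    for name in candidates:
        try:
            blob.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    # latin-1 accepts every byte value, so this line is only reached when the
    # caller passes a candidate list that has no such catch-all (assumption:
    # fall back to utf-8 in that case).
    return "utf-8"


raw_bytes = "中文".encode("gb18030")
encoding = detect_codec(raw_bytes)
text = raw_bytes.decode(encoding)
```

Candidate order matters: strict codecs such as utf-8 should come before permissive ones, because a permissive codec will "succeed" on almost any input.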

index_int(index_str) -> int

Purpose:
Converts various numeral representations (Arabic numbers, English words, Chinese numerals, Roman numerals) into an integer.

Parameters:
index_str: the numeral to convert, given as a string.

Returns:
The integer value represented by index_str.

Implementation Details:
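
A fallback chain is one way to handle mixed numeral systems: try each converter in turn and return the first one that succeeds. The sketch below is a simplified stand-in, not the module's code. The helper names, the small lookup tables (English one to ten, Chinese up to 99), and the -1 result for unrecognized input are all assumptions made for illustration.

```python
ENGLISH = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
           "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}
CN_DIGITS = {"一": 1, "二": 2, "三": 3, "四": 4, "五": 5,
             "六": 6, "七": 7, "八": 8, "九": 9}
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def english_to_int(text: str) -> int:
    try:
        return ENGLISH[text.lower()]
    except KeyError:
        raise ValueError(text) from None


def chinese_to_int(text: str) -> int:
    """Handle 一 through 九十九 only; larger numbers are out of scope here."""
    tens, sep, ones = text.partition("十")
    if not sep:
        if text in CN_DIGITS:
            return CN_DIGITS[text]
        raise ValueError(text)
    if (tens and tens not in CN_DIGITS) or (ones and ones not in CN_DIGITS):
        raise ValueError(text)
    # A bare 十 means 10, so a missing tens digit counts as 1.
    return CN_DIGITS.get(tens, 1) * 10 + CN_DIGITS.get(ones, 0)


def roman_to_int(text: str) -> int:
    values = [ROMAN.get(ch) for ch in text.upper()]
    if not values or None in values:
        raise ValueError(text)
    total = 0
    # A symbol smaller than its right-hand neighbour is subtracted (IV = 4).
    for cur, nxt in zip(values, values[1:] + [0]):
        total += -cur if cur < nxt else cur
    return total


def to_index(text: str) -> int:
    """Try each numeral system in turn; return -1 if none applies."""
    text = text.strip()
    for convert in (int, english_to_int, chinese_to_int, roman_to_int):
        try:
            return convert(text)
        except ValueError:
            continue
    return -1
```

Each converter signals failure by raising ValueError, which keeps the dispatch loop uniform and makes it easy to add another numeral system later.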


has_qbullet(reg, box, last_box, last_index, last_bull, bull_x0_list) -> (re.Match | None, int)

Purpose:
Determines if a text box contains a question bullet matching a regex pattern and enforces spatial and ordering constraints.

Parameters:

Returns:

Usage:
Used in parsing documents to detect question bullet points and maintain bullet ordering and alignment.


qbullets_category(sections) -> (int, str)

Purpose:
Categorizes a list of text sections by matching question bullet patterns.

Parameters:

Returns:


random_choices(arr, k) -> list

Wrapper for random.choices limited by the array length.
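
The description above can be sketched in a few lines. The name bounded_choices is hypothetical; the only assumption is that "limited by the array length" means the sample size is capped at len(arr).

```python
import random


def bounded_choices(arr, k):
    """Sample with replacement, never requesting more items than arr holds."""
    k = min(len(arr), k)
    return random.choices(arr, k=k)


sample = bounded_choices(["a", "b", "c"], 10)  # at most 3 items come back
```

Because random.choices samples with replacement, the result may still contain duplicates even when k is capped.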


not_bullet(line) -> bool

Checks if a line matches patterns that should NOT be considered bullets.


bullets_category(sections) -> int

Similar to qbullets_category but for general bullet patterns.
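
As a rough illustration of pattern-based categorization, the sketch below counts how many sections match each family of bullet regexes and returns the index of the family with the most hits. The pattern families, the function name guess_bullet_family, and the -1 result when nothing matches are assumptions, not the module's real tables or behavior.

```python
import re

# Hypothetical pattern families; each inner list is one numbering style.
BULLET_FAMILIES = [
    [r"第[一二三四五六七八九十百0-9]+章", r"第[一二三四五六七八九十百0-9]+节"],
    [r"[0-9]+\.[0-9]+", r"[0-9]+\."],
    [r"Chapter [0-9IVX]+", r"Section [0-9]+"],
]


def guess_bullet_family(sections):
    """Return the index of the family matching the most sections, or -1."""
    hits = [0] * len(BULLET_FAMILIES)
    for i, family in enumerate(BULLET_FAMILIES):
        for text in sections:
            if any(re.match(p, text.strip()) for p in family):
                hits[i] += 1
    best = max(range(len(hits)), key=hits.__getitem__)
    return best if hits[best] > 0 else -1
```

Picking a single dominant family for the whole document keeps later title detection consistent, at the cost of ignoring documents that mix numbering styles.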


is_english(texts) -> bool

Determines if the input text or list of texts is mostly English characters.
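
A ratio test is one plausible way to implement a check like this: count how many entries consist only of English letters, digits, and common punctuation, then compare that share against a threshold. The function name, the character class, and the 0.8 threshold below are assumptions for illustration.

```python
import re

ENGLISH_LINE = re.compile(r"[ a-zA-Z0-9.,':;/?<>!()-]+")


def mostly_english(texts, threshold=0.8):
    """Return True when most non-empty entries look like English text."""
    if isinstance(texts, str):
        texts = [texts]
    texts = [t.strip() for t in texts if t.strip()]
    if not texts:
        return False
    hits = sum(1 for t in texts if ENGLISH_LINE.fullmatch(t))
    return hits / len(texts) > threshold
```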


is_chinese(text) -> bool

Determines if the text contains a significant portion of Chinese characters.
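
The same ratio idea works for Chinese: count the characters that fall in the main CJK Unified Ideographs block and compare their share to a threshold. The function name and the 0.2 threshold are assumptions; "significant portion" is not quantified above.

```python
def mostly_chinese(text, threshold=0.2):
    """Return True when CJK ideographs exceed the given share of characters."""
    if not text:
        return False
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    return cjk / len(text) > threshold
```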


tokenize(d, t, eng) -> None

Tokenizes text t and stores the resulting token lists in dict d (for example, the content_ltks field).

Removes certain HTML-like tags before tokenization.
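
The two steps described here, stripping tags and then storing tokens on the dict, can be sketched as follows. The tag list, the whitespace tokenizer, and the function name simple_tokenize are assumptions; the module presumably uses its own tokenizer. Only the content_ltks key comes from the usage examples in this document.

```python
import re

# Assumed tag set: table markup that sometimes survives in extracted text.
TAG_RE = re.compile(r"</?(table|td|caption|tr|th)( [^<>]{0,12})?>")


def simple_tokenize(d, text):
    """Strip table-like tags, then store whitespace tokens on d in place."""
    cleaned = TAG_RE.sub(" ", text)
    d["content_ltks"] = cleaned.split()


doc = {"doc_id": "abc123"}
simple_tokenize(doc, "<table><tr><td>cell one</td></tr></table>")
```

Mutating the dict in place, rather than returning a new one, matches the None return type shown in the signature above.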


tokenize_chunks(chunks, doc, eng, pdf_parser=None) -> list

Tokenizes multiple text chunks, optionally cropping images via a PDF parser, and returns a list of document dicts.


tokenize_chunks_with_images(chunks, doc, eng, images) -> list

Like tokenize_chunks but uses a provided list of images aligned with chunks.


tokenize_table(tbls, doc, eng, batch_size=10) -> list

Tokenizes tables represented as pairs of (image, rows), batching rows for tokenization.
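
Row batching can be illustrated on its own, independent of images and tokenization. In this sketch the function name batch_rows and the "; " joiner are assumptions; it only shows how rows might be grouped batch_size at a time before each group is tokenized.

```python
def batch_rows(rows, batch_size=10, joiner="; "):
    """Join consecutive rows into one string per batch of batch_size rows."""
    return [
        joiner.join(rows[i:i + batch_size])
        for i in range(0, len(rows), batch_size)
    ]


batches = batch_rows(["r1", "r2", "r3"], batch_size=2)
```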


add_positions(d, poss) -> None

Adds positional metadata (page number, bounding box coordinates) to document dict d from poss list of tuples.
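
A minimal sketch of attaching positions, assuming each tuple in poss is laid out as (page, left, right, top, bottom). That layout, the function name attach_positions, and the key names page_num_int and position_int are assumptions made for illustration.

```python
def attach_positions(d, poss):
    """Store integer page numbers and bounding boxes on d in place."""
    if not poss:
        return
    d["page_num_int"] = [int(page) for page, *_ in poss]
    d["position_int"] = [tuple(int(v) for v in pos) for pos in poss]


doc = {}
attach_positions(doc, [(0, 10.5, 200.2, 30.0, 60.9)])
```

Casting to int truncates fractional coordinates, which is usually acceptable for indexing but loses sub-pixel precision.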


remove_contents_table(sections, eng=False) -> None

Removes table of contents or acknowledgment sections from the list of sections based on matching keywords.


make_colon_as_title(sections) -> list

Splits sections where a colon (ASCII ":" or full-width "：") indicates a title, inserting a new title section accordingly.


title_frequency(bull, sections) -> (int, list)

Calculates the frequency of section titles for bullet pattern category bull and returns an (int, list) pair.


not_title(txt) -> bool

Determines if a text line is not considered a title based on regex and length heuristics.


hierarchical_merge(bull, sections, depth) -> list

Merges sections hierarchically based on bullet patterns up to a given depth.


naive_merge(sections, chunk_token_num=128, delimiter="\n。；！？", overlapped_percent=0) -> list

Performs a simple merge of text sections into chunks with a maximum token count.
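
The core idea, greedy merging under a token budget, can be sketched as below. This is a simplified illustration: it counts tokens by whitespace splitting, ignores the delimiter and overlapped_percent parameters, and uses the hypothetical name greedy_merge. The module's actual merge rules may differ.

```python
def greedy_merge(sections, chunk_token_num=128):
    """Append each section to the current chunk while the budget allows."""
    chunks, sizes = [], []
    for text in sections:
        n = len(text.split())  # crude token count (assumption)
        if chunks and sizes[-1] + n <= chunk_token_num:
            chunks[-1] += "\n" + text
            sizes[-1] += n
        else:
            # Start a new chunk; an oversized section becomes its own chunk.
            chunks.append(text)
            sizes.append(n)
    return chunks


merged = greedy_merge(["a b", "c d", "e f g"], chunk_token_num=4)
```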


naive_merge_with_images(texts, images, chunk_token_num=128, delimiter="\n。；！？", overlapped_percent=0) -> (list, list)

Same as naive_merge but handles corresponding images, concatenating images for merged chunks.


docx_question_level(p, bull=-1) -> (int, str)

Determines heading level in a docx paragraph p based on style or bullet pattern.


concat_img(img1, img2) -> Image | None

Concatenates two PIL images vertically.


naive_merge_docx(sections, chunk_token_num=128, delimiter="\n。；！？") -> (list, list)

Naively merges docx sections that have associated images, chunking by token count.


extract_between(text, start_tag, end_tag) -> list[str]

Extracts all substrings between start_tag and end_tag in text.
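
One straightforward way to do this is a non-greedy regex between the escaped tags. The sketch below uses the hypothetical name between and assumes that matches may span multiple lines.

```python
import re


def between(text, start_tag, end_tag):
    """Return every shortest substring enclosed by start_tag and end_tag."""
    pattern = re.escape(start_tag) + r"(.*?)" + re.escape(end_tag)
    # DOTALL lets "." match newlines so multi-line spans are captured.
    return re.findall(pattern, text, flags=re.DOTALL)


parts = between("<a>x</a> and <a>y\nz</a>", "<a>", "</a>")
```

Escaping the tags matters: without re.escape, a tag containing characters such as "[" or "*" would be interpreted as regex syntax.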


get_delimiters(delimiters) -> str

Processes a delimiter string that may contain backtick-quoted substrings, returning a regex pattern string representing all delimiters.
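
A sketch of how such a delimiter string might be turned into a regex, assuming that backtick-quoted substrings are multi-character delimiters and every other character is a delimiter on its own. The function name delimiters_to_pattern is hypothetical.

```python
import re


def delimiters_to_pattern(delimiters: str) -> str:
    """Build an alternation pattern from single chars and `quoted` strings."""
    tokens, pos = [], 0
    for m in re.finditer(r"`([^`]+)`", delimiters):
        tokens.extend(delimiters[pos:m.start()])  # unquoted chars, one each
        tokens.append(m.group(1))                 # quoted multi-char delimiter
        pos = m.end()
    tokens.extend(delimiters[pos:])
    # Longest first, so a multi-char delimiter wins over its own prefix.
    tokens.sort(key=len, reverse=True)
    return "|".join(re.escape(t) for t in tokens)


pattern = delimiters_to_pattern("\n`##`;")
pieces = re.split(pattern, "a##b;c\nd")
```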


Implementation Details and Algorithms


Interaction with Other System Components

This file serves as a utility backbone for document parsing and preprocessing modules in InfiniFlow, enabling robust text chunking, tokenization, and content structure recognition, essential for downstream NLP models and search.


Visual Diagram

flowchart TD
    A[__init__.py] --> B[Encoding Detection]
    A --> C[Bullet & Question Detection]
    A --> D[Tokenization]
    A --> E[Section Merging]
    A --> F[Image Handling]
    A --> G[Utility Functions]

    B --> find_codec
    C --> has_qbullet
    C --> index_int
    C --> qbullets_category
    C --> bullets_category
    D --> tokenize
    D --> tokenize_chunks
    D --> tokenize_chunks_with_images
    D --> tokenize_table
    E --> hierarchical_merge
    E --> naive_merge
    E --> naive_merge_with_images
    E --> naive_merge_docx
    F --> concat_img
    G --> is_english
    G --> is_chinese
    G --> add_positions
    G --> remove_contents_table
    G --> make_colon_as_title
    G --> title_frequency
    G --> not_bullet
    G --> not_title
    G --> extract_between
    G --> get_delimiters

Usage Examples

Example 1: Detect encoding and decode text

with open("document.txt", "rb") as f:
    raw = f.read()

encoding = find_codec(raw)
text = raw.decode(encoding)
print(text)

Example 2: Tokenize text chunks with images

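# Assumes tokenize_chunks_with_images has been imported from this package.
from PIL import Image

chunks = ["Section 1 text...", "Section 2 text..."]
# Placeholder images; in practice these come from the document parser.
images = [Image.new("RGB", (64, 64)), Image.new("RGB", (64, 64))]
doc = {"doc_id": "abc123"}

tokenized_docs = tokenize_chunks_with_images(chunks, doc, eng=True, images=images)
for d in tokenized_docs:  # use a new name so the template doc is not shadowed
    print(d["content_ltks"])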

Example 3: Merge sections hierarchically

sections = [
    ("第1章 介绍", "title"),
    ("第1节 背景", ""),
    ("正文内容...", "")
]
# bullets_category returns a single int, the matched bullet pattern category.
bull = bullets_category([s[0] for s in sections])
merged = hierarchical_merge(bull, sections, depth=2)
for chunk in merged:
    print(chunk)
