naive.py

Overview

The naive.py file provides classes and functions that parse various document formats (e.g., DOCX, PDF, Markdown, Excel, JSON, HTML, TXT) and chunk them into manageable text segments and, optionally, image segments. It implements "naive" chunking methods that split documents on delimiters and token limits, supports tokenization, and includes basic image extraction and concatenation.

This file acts as a core utility within the InfiniFlow system for document ingestion and preprocessing, preparing content for downstream tasks such as indexing, searching, or language model consumption. It supports enhanced figure extraction using vision-based LLM bundles when available.

Classes and Functions

Class `Docx(DocxParser)`

Extends DocxParser to provide DOCX-specific parsing capabilities with enhanced image extraction and table title extraction.

Methods

__init__(self)
Initializes the Docx parser instance.
get_picture(self, document, paragraph) -> Image or None
Extracts and concatenates all images embedded within a given paragraph of a DOCX document.
- Parameters:
  - document (Document): The docx.Document instance.
  - paragraph (Paragraph): A paragraph element from which to extract images.
- Returns:
  - PIL.Image.Image object if images found and successfully processed; otherwise None.
- Usage example:
```
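from docx import Document  # python-docx
# Assumption: Docx has already been imported from this module;
# the import path depends on your project layout.

doc = Document("example.docx")
docx_parser = Docx()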
image = docx_parser.get_picture(doc, doc.paragraphs[0])
if image:
    image.show()
```
__clean(self, line: str) -> str
Cleans a text line by replacing full-width spaces (\u3000) with normal spaces and stripping leading/trailing whitespace.
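
The cleaning step can be sketched as a standalone function. This is a minimal illustration of the behavior described above, not the method itself; the name `clean_line` is hypothetical, since the real method is private to `Docx`.

```python
def clean_line(line: str) -> str:
    """Replace full-width spaces (U+3000) with ASCII spaces, then trim."""
    return line.replace("\u3000", " ").strip()


# A line that starts with a full-width space and ends with a newline.
print(repr(clean_line("\u3000Title\u3000text \n")))
```
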
__get_nearest_title(self, table_index: int, filename: str) -> str
Finds the hierarchical heading titles preceding a given table index in the DOCX document. This provides context for the table by returning a breadcrumb-like string of document title and heading hierarchy.
- Parameters:
  - table_index (int): Index of the target table in the document.
  - filename (str): The document file name (used to extract document base name).
- Returns:
  - A string representing the heading hierarchy, e.g., "DocumentName > Heading1 > Heading2".
  - Returns empty string if no heading found.
- Implementation details:
  - Iterates document blocks (paragraphs and tables) in order.
  - Finds the nearest heading paragraph before the target table.
  - Recursively finds parent headings up the hierarchy (up to level 7).
  - Sorts and concatenates headings to form hierarchy.
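
The heading lookup described above can be sketched without python-docx by modeling the document as an ordered list of blocks. Everything here is an assumption made for illustration: the `Block` tuple shape, the helper names, and the backward walk that keeps only headings of a strictly higher rank than the one already collected. The actual method may order and select headings differently.

```python
import os
import re
from typing import List, Optional, Tuple

# Hypothetical block model: ("p", style_name, text) for a paragraph,
# ("t", None, None) for a table.
Block = Tuple[str, Optional[str], Optional[str]]


def heading_level(style_name: Optional[str]) -> Optional[int]:
    """Return 1..7 for styles named 'Heading N', otherwise None."""
    match = re.fullmatch(r"Heading ([1-7])", style_name or "")
    return int(match.group(1)) if match else None


def nearest_title(blocks: List[Block], table_index: int, filename: str) -> str:
    # Find the position of the table_index-th table in document order.
    seen = -1
    table_pos = None
    for pos, (kind, _, _) in enumerate(blocks):
        if kind == "t":
            seen += 1
            if seen == table_index:
                table_pos = pos
                break
    if table_pos is None:
        return ""

    # Walk backwards: take the nearest heading first, then only headings
    # with a smaller level number (its ancestors in the hierarchy).
    chain: List[str] = []
    current_level = 8  # one past the deepest supported level
    for kind, style, text in reversed(blocks[:table_pos]):
        level = heading_level(style) if kind == "p" else None
        if level is not None and level < current_level:
            chain.append((text or "").strip())
            current_level = level
            if level == 1:
                break
    if not chain:
        return ""

    doc_name = os.path.splitext(os.path.basename(filename))[0]
    return " > ".join([doc_name] + chain[::-1])
```

For a table that follows "Heading 1: Intro" and "Heading 2: Scope" in a file named report.docx, this sketch yields `report > Intro > Scope`.
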
__call__(self, filename: str, binary: bytes=None, from_page: int=0, to_page: int=100000) -> (list, list)
Parses a DOCX file or binary stream and returns processed text lines with associated images and parsed tables with captions based on nearest titles.
- Parameters:
  - filename (str): Path to DOCX file.
  - binary (bytes, optional): Binary content of DOCX instead of filename.
  - from_page (int, optional): Starting page number to parse (default 0).
  - to_page (int, optional): Ending page number to parse (default very large number).
- Returns:
  - new_line: List of tuples (cleaned_text, PIL.Image or None) representing paragraphs and concatenated images.
  - tbls: List of tuples ((None, html_table_string), "") representing parsed tables as HTML with captions.
- Usage example:
```
docx_parser = Docx()
lines, tables = docx_parser("sample.docx")
for text, image in lines:
    print(text)
    if image:
        image.show()
```

Class `Pdf(PdfParser)`

Extends PdfParser to provide PDF parsing with OCR, layout analysis, table and figure extraction, and optional visual model enhancement.

Methods

__init__(self)
Initializes the PDF parser instance.
__call__(self, filename: str, binary: bytes=None, from_page: int=0, to_page: int=100000, zoomin: int=3, callback: callable=None, separate_tables_figures: bool=False) -> tuple
Parses a PDF document and performs OCR, layout and table analysis, text merging, and optionally separates tables and figures.
- Parameters:
  - filename (str): Path to PDF file.
  - binary (bytes, optional): Binary content of PDF.
  - from_page (int): Start page number.
  - to_page (int): End page number.
  - zoomin (int): Zoom factor for OCR/image resolution.
  - callback (callable): Function to report progress/status messages.
  - separate_tables_figures (bool): Whether to extract tables and figures separately.
- Returns:
  - If separate_tables_figures is True: Tuple of (text_blocks, tables, figures).
  - Else: Tuple of (text_blocks, tables).
  - text_blocks: List of tuples (text, line_tag) representing parsed text.
  - tables and figures: Extracted table and figure data.
- Usage example:
```
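def report(progress=None, msg=""):
    # Assumed callback shape (progress, msg), mirroring the chunk() example
    # in this document; defaults tolerate calls that pass only one of them.
    print(progress, msg)

pdf_parser = Pdf()
text_blocks, tables = pdf_parser("sample.pdf", from_page=0, to_page=5, callback=report)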
for text, tag in text_blocks:
    print(text)
```
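
Because the length of the returned tuple depends on separate_tables_figures, callers may find it convenient to normalize the result. A minimal sketch, assuming only the two return shapes listed above; the helper name `unpack_pdf_result` is hypothetical and not part of naive.py.

```python
from typing import Any, List, Sequence, Tuple


def unpack_pdf_result(result: Sequence[Any]) -> Tuple[List[Any], List[Any], List[Any]]:
    """Normalize a 2- or 3-tuple into (text_blocks, tables, figures)."""
    if len(result) == 3:
        text_blocks, tables, figures = result
    elif len(result) == 2:
        text_blocks, tables = result
        figures = []  # figures were not separated out
    else:
        raise ValueError(f"unexpected result length: {len(result)}")
    return list(text_blocks), list(tables), list(figures)
```
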

Class `Markdown(MarkdownParser)`

Extends MarkdownParser to parse Markdown files, extract sections, tables, and images.

Methods

get_picture_urls(self, sections) -> list
Parses Markdown text or sections to extract URLs of images embedded in the content.
- Parameters:
  - sections: Markdown text string or list of strings representing sections.
- Returns:
  - List of image URLs as strings, or empty list if none found.
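
As an illustration of URL extraction, the sketch below uses a regular expression over inline Markdown image syntax. This is a swapped-in technique chosen to keep the example dependency-free; per the imports listed later in this document, the file itself relies on markdown and BeautifulSoup. The function name `picture_urls` is hypothetical.

```python
import re
from typing import List, Union

# Matches inline Markdown images: ![alt](url "optional title")
IMAGE_PATTERN = re.compile(r'!\[[^\]]*\]\(\s*([^)\s]+)(?:\s+"[^"]*")?\s*\)')


def picture_urls(sections: Union[str, List[str]]) -> List[str]:
    """Return image URLs in order of appearance, or an empty list."""
    text = sections if isinstance(sections, str) else "\n".join(sections)
    return IMAGE_PATTERN.findall(text)
```

Reference-style images (`![alt][ref]`) and raw `<img>` tags are not covered by this pattern, which is one reason an HTML-based approach is more robust.
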
get_pictures(self, text) -> list or None
Downloads or loads images from URLs found in the Markdown text, returning a list of PIL.Image objects.
- Parameters:
  - text (str): Markdown text content.
- Returns:
  - List of PIL.Image objects or None if no images found.
- Implementation details:
  - Supports HTTP/HTTPS URLs and local file paths.
  - Logs warnings/errors when images cannot be loaded.
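
The split between remote URLs and local paths can be sketched with the standard library. This is an assumption about how the two cases might be told apart, not a description of the actual code; `is_remote` and `split_sources` are hypothetical names.

```python
from typing import List, Tuple
from urllib.parse import urlparse


def is_remote(url: str) -> bool:
    """True for http(s) URLs; anything else is treated as a local path."""
    return urlparse(url).scheme.lower() in ("http", "https")


def split_sources(urls: List[str]) -> Tuple[List[str], List[str]]:
    """Partition image sources into (remote_urls, local_paths)."""
    remote = [u for u in urls if is_remote(u)]
    local = [u for u in urls if not is_remote(u)]
    return remote, local
```
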
__call__(self, filename: str, binary: bytes=None, separate_tables: bool=True) -> (list, list)
Parses a Markdown file or binary content and extracts text sections and tables.
- Parameters:
  - filename (str): Path to Markdown file.
  - binary (bytes, optional): Raw file content.
  - separate_tables (bool): Whether to parse and separate tables.
- Returns:
  - sections: List of tuples (section_text, "").
  - tbls: List of tuples ((None, html_table_string), "").
- Usage example:
```
md_parser = Markdown()
sections, tables = md_parser("sample.md")
for section, _ in sections:
    print(section)
```

Function `load_from_xml_v2(baseURI, rels_item_xml) -> _SerializedRelationships`

Loads and returns a _SerializedRelationships instance from given XML relationships data, filtering out invalid targets.

Parameters:
- baseURI: Base URI for relationship resolution.
- rels_item_xml: XML string or element containing relationships.
Returns:
- _SerializedRelationships instance with parsed relationship entries.
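
The filtering idea can be sketched with the standard library alone. This example parses an OPC relationships part and skips entries whose target looks unusable. The set of "invalid" targets is an assumption, and the function returns plain tuples rather than a `_SerializedRelationships` instance, so treat it as an illustration of the idea only.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Namespace used by OPC relationship parts (.rels files).
RELS_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"

# Assumption: these targets count as invalid; the real filter may differ.
INVALID_TARGETS = {"NULL", "../NULL"}

SAMPLE_RELS = (
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="image" Target="media/image1.png"/>'
    '<Relationship Id="rId2" Type="image" Target="../NULL"/>'
    "</Relationships>"
)


def valid_relationships(rels_item_xml: str) -> List[Tuple[str, str, str]]:
    """Return (id, type, target) for each relationship with a usable target."""
    if not rels_item_xml:
        return []
    root = ET.fromstring(rels_item_xml)
    kept = []
    for rel in root.findall(f"{RELS_NS}Relationship"):
        target = rel.get("Target", "")
        if not target or target in INVALID_TARGETS:
            continue  # skip relationships that point nowhere
        kept.append((rel.get("Id", ""), rel.get("Type", ""), target))
    return kept
```
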

Function `chunk(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", callback=None, **kwargs) -> list`

High-level function to parse and chunk various supported file types using naive chunking methods. Supports DOCX, PDF, Excel, TXT, Markdown, HTML, JSON, and legacy DOC formats.

Parameters:
- filename (str): File path.
- binary (bytes, optional): File content bytes.
- from_page (int): Starting page number (for paged formats).
- to_page (int): Ending page number.
- lang (str): Language for tokenization (English or Chinese).
- callback (callable): Callback function for progress reporting.
- **kwargs: Additional parser configuration options.
Returns:
- List of tokenized chunks suitable for downstream processing.
Implementation details:
- Determines file type by extension and dispatches to appropriate parser.
- For DOCX and PDF, attempts to enhance figure extraction using a vision model if available.
- Uses various internal parsers from deepdoc.parser and rag.nlp.
- Performs naive chunk merges and tokenization using delimiters and max token count.
- Supports optional "section only" mode to return raw chunks without tokenization.

Usage example:

```
def progress_callback(progress, msg):
    print(f"{progress*100:.1f}%: {msg}")

chunks = chunk("document.docx", from_page=0, to_page=10, callback=progress_callback)
for chunk_text in chunks:
    print(chunk_text)
```

Important Implementation Details and Algorithms

Image Extraction & Concatenation:
- The Docx class extracts images embedded in paragraphs by navigating the underlying XML elements (pic:pic), safely handling corrupted or unrecognized image streams.
- Multiple images per paragraph are concatenated horizontally using concat_img to form a single combined image.
Heading Hierarchy Extraction:
- To provide context to tables in DOCX, the nearest preceding heading paragraphs with styles matching Heading 1 to Heading 7 are found and used to build a breadcrumb title string.
PDF Parsing Workflow:
- The Pdf class implements a multi-step process:
  - OCR extraction of images and text.
  - Layout analysis to identify text blocks and tables.
  - Table analysis via transformer-based models.
  - Text merging to combine fragmented text pieces.
  - Optionally separates tables and figures.
- Progress is reported via callbacks.
Naive Chunking:
- Text is split using configurable delimiter characters and merged into chunks respecting a maximum token count.
- Tokenization leverages rag_tokenizer for both coarse and fine-grained token splits.
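
The split-then-merge idea can be sketched as follows. This is a simplified stand-in, not the routine from rag.nlp: token counting here is whitespace-based instead of using rag_tokenizer, and the function names, default delimiters, and the exact rule for closing a chunk are assumptions.

```python
import re
from typing import List


def count_tokens(text: str) -> int:
    """Stand-in tokenizer: counts whitespace-separated words (an assumption)."""
    return len(text.split())


def naive_merge(text: str, max_tokens: int = 128, delimiters: str = "\n!?.;") -> List[str]:
    # Split on any delimiter character, keeping each delimiter with its piece.
    pattern = "([" + re.escape(delimiters) + "])"
    parts = re.split(pattern, text)
    pieces = ["".join(pair) for pair in zip(parts[0::2], parts[1::2] + [""])]

    chunks: List[str] = []
    current = ""
    for piece in pieces:
        if not piece.strip():
            continue
        # Close the current chunk if adding this piece would exceed the cap.
        if current and count_tokens(current) + count_tokens(piece) > max_tokens:
            chunks.append(current.strip())
            current = ""
        current += piece
    if current.strip():
        chunks.append(current.strip())
    return chunks
```

A single piece longer than `max_tokens` is kept whole in this sketch rather than being split further, so the cap is a target, not a hard limit.
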
Visual Model Integration for Figures:
- When available, vision-based LLM bundles are used to enhance figure parsing for DOCX and PDF files.
- This is handled by VisionFigureParser which can boost figure extraction quality.
Flexible File Type Dispatch:
- The chunk function supports a wide range of file types and selects appropriate parsers based on file extension.
- Non-supported types raise NotImplementedError.
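
Extension-based dispatch of this kind can be sketched with a small pattern table. The patterns and parser labels below are assumptions chosen to match the formats listed in this document; the real function calls the corresponding parser classes rather than returning a label.

```python
import re

# Hypothetical mapping from filename pattern to a parser label.
DISPATCH = [
    (r"\.docx$", "docx"),
    (r"\.pdf$", "pdf"),
    (r"\.xlsx?$", "excel"),
    (r"\.(md|markdown)$", "markdown"),
    (r"\.(htm|html)$", "html"),
    (r"\.json$", "json"),
    (r"\.doc$", "doc"),
    (r"\.txt$", "txt"),
]


def pick_parser(filename: str) -> str:
    """Return the label of the first pattern that matches the filename."""
    for pattern, label in DISPATCH:
        if re.search(pattern, filename, re.IGNORECASE):
            return label
    raise NotImplementedError(f"file type not supported yet: {filename}")
```

Anchoring each pattern with `$` keeps `.doc` from matching `.docx` files, so the order of the table does not matter for those two entries.
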

Interaction with Other System Components

Imports from other modules:
- deepdoc.parser and its submodules provide underlying parsing engines for DOCX, PDF, Excel, HTML, JSON, Markdown, and TXT.
- rag.nlp provides tokenization and chunk merging utilities.
- api.db.services.llm_service is used to instantiate vision-based LLM bundles for enhanced figure extraction.
- docx library is used for DOCX document manipulation.
- tika.parser is used for legacy DOC format parsing.
- PIL.Image is used for image processing and concatenation.
- markdown and BeautifulSoup facilitate Markdown to HTML conversion and image URL extraction.
Outputs:
- This file returns structured text chunks and associated images and tables, ready for indexing, embedding, or language model processing by downstream modules.

Visual Diagram: Class Structure

```mermaid
classDiagram
    class Docx {
        +__init__()
        +get_picture(document, paragraph) Image|None
        -__clean(line) str
        -__get_nearest_title(table_index, filename) str
        +__call__(filename, binary=None, from_page=0, to_page=100000) tuple
    }
    Docx --|> DocxParser

    class Pdf {
        +__init__()
        +__call__(filename, binary=None, from_page=0, to_page=100000, zoomin=3, callback=None, separate_tables_figures=False) tuple
    }
    Pdf --|> PdfParser

    class Markdown {
        +get_picture_urls(sections) list
        +get_pictures(text) list|None
        +__call__(filename, binary=None, separate_tables=True) tuple
    }
    Markdown --|> MarkdownParser
```

Summary

The naive.py file parses and chunks multiple document formats. It combines text and image extraction with naive chunk merging based on delimiters and token limits, and optionally uses vision-based models to improve figure handling. The design pairs simple chunking with an extension point for visual models, and it serves as the document preprocessing step in the InfiniFlow system.
