naive.py


Overview

The naive.py file provides classes and functions that parse documents in several formats (DOCX, PDF, Markdown, Excel, JSON, HTML, TXT) and split them into text chunks, optionally paired with images. Its "naive" chunking splits content on delimiters and merges the pieces up to a token limit. The file also handles tokenization and basic image extraction and concatenation.

This file acts as a core utility within the InfiniFlow system for document ingestion and preprocessing, preparing content for downstream tasks such as indexing, searching, or language model consumption. It supports enhanced figure extraction using vision-based LLM bundles when available.
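The delimiter-and-token-limit splitting mentioned in this overview can be illustrated with a small, self-contained sketch. It is not the code from naive.py: the function names (count_tokens, naive_merge_sketch), the whitespace-based token count, and the default delimiter set are all assumptions made for illustration.

```python
import re


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def naive_merge_sketch(text: str, max_tokens: int = 128,
                       delimiters: str = "\n.;!?") -> list[str]:
    """Split text on delimiter characters, then greedily pack the pieces
    into chunks that stay within max_tokens where possible."""
    # A capturing group makes re.split return [piece, delim, piece, delim, ...].
    pattern = "([" + re.escape(delimiters) + "])"
    parts = re.split(pattern, text)

    # Re-attach each delimiter to the piece that precedes it.
    segments = []
    for i in range(0, len(parts), 2):
        delim = parts[i + 1] if i + 1 < len(parts) else ""
        segment = parts[i] + delim
        if segment.strip():
            segments.append(segment)

    chunks, current, current_tokens = [], "", 0
    for segment in segments:
        n = count_tokens(segment)
        # Start a new chunk when adding this segment would exceed the limit.
        if current and current_tokens + n > max_tokens:
            chunks.append(current)
            current, current_tokens = "", 0
        current += segment
        current_tokens += n
    if current:
        chunks.append(current)
    return chunks
```

In this sketch a single segment longer than max_tokens is kept whole as its own chunk rather than being split mid-sentence, which is one plausible reading of "naive" chunking.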


Classes and Functions

Class Docx(DocxParser)

Extends DocxParser to provide DOCX-specific parsing capabilities with enhanced image extraction and table title extraction.

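Methods

__init__()
get_picture(document, paragraph) -> Image|None
__clean(line) -> str
__get_nearest_title(table_index, filename) -> str
__call__(filename, binary=None, from_page=0, to_page=100000) -> tuple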


Class Pdf(PdfParser)

Extends PdfParser to provide PDF parsing with OCR, layout analysis, table and figure extraction, and optional visual model enhancement.

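Methods

__init__()
__call__(filename, binary=None, from_page=0, to_page=100000, zoomin=3, callback=None, separate_tables_figures=False) -> tuple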


Class Markdown(MarkdownParser)

Extends MarkdownParser to parse Markdown files, extract sections, tables, and images.
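As an illustration of the image-extraction step, the sketch below pulls image URLs out of Markdown text with a regular expression. It is a minimal approximation and not the parser's actual logic: the name extract_image_urls and the pattern are hypothetical, and only the inline form `![alt](url "title")` is handled.

```python
import re

# Matches inline Markdown images: ![alt](url) or ![alt](url "title").
# Reference-style images and HTML <img> tags are deliberately not covered.
IMAGE_PATTERN = re.compile(r'!\[[^\]]*\]\(\s*([^)\s]+)(?:\s+"[^"]*")?\s*\)')


def extract_image_urls(markdown_text: str) -> list[str]:
    """Return the URL of every inline image, in document order."""
    return IMAGE_PATTERN.findall(markdown_text)
```

Ordinary links such as `[text](url)` are skipped because the pattern requires the leading `!`.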

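Methods

get_picture_urls(sections) -> list
get_pictures(text) -> list|None
__call__(filename, binary=None, separate_tables=True) -> tuple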


Function load_from_xml_v2(baseURI, rels_item_xml) -> _SerializedRelationships

Loads the given relationships XML and returns a _SerializedRelationships instance, skipping relationships whose targets are invalid.
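The filtering idea can be sketched with the standard library alone. This is not the python-docx based implementation: it returns a plain dict instead of a _SerializedRelationships object, the function name load_valid_relationships is hypothetical, and the set of targets treated as invalid is an assumption.

```python
import xml.etree.ElementTree as ET

# Namespace used by relationship parts (.rels files) in OPC packages.
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"

# Assumption: which targets count as "invalid" is chosen here for illustration.
INVALID_TARGETS = {"NULL", "../NULL"}


def load_valid_relationships(rels_xml: str) -> dict[str, str]:
    """Map relationship Id -> Target, skipping empty or invalid targets."""
    root = ET.fromstring(rels_xml)
    relationships = {}
    for rel in root.findall(f"{{{RELS_NS}}}Relationship"):
        target = rel.get("Target", "")
        if not target or target in INVALID_TARGETS:
            continue  # drop dangling references instead of failing later
        relationships[rel.get("Id")] = target
    return relationships
```

Dropping such entries at load time means later steps, such as image lookup by relationship Id, never see a target that cannot be resolved.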


Function chunk(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", callback=None, **kwargs) -> list

High-level function to parse and chunk various supported file types using naive chunking methods. Supports DOCX, PDF, Excel, TXT, Markdown, HTML, JSON, and legacy DOC formats.
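One way such a function could choose a parser is by matching the file extension. The sketch below shows only that dispatch step, under stated assumptions: detect_format and EXTENSION_PATTERNS are hypothetical names, and the exact extensions checked in naive.py may differ.

```python
import re
from typing import Optional

# Hypothetical extension-to-format table covering the formats listed above.
EXTENSION_PATTERNS = [
    (r"\.docx$", "docx"),
    (r"\.pdf$", "pdf"),
    (r"\.xlsx?$", "excel"),
    (r"\.txt$", "txt"),
    (r"\.(md|markdown)$", "markdown"),
    (r"\.html?$", "html"),
    (r"\.json$", "json"),
    (r"\.doc$", "doc"),
]


def detect_format(filename: str) -> Optional[str]:
    """Return a format label for filename, or None if it is unsupported."""
    for pattern, label in EXTENSION_PATTERNS:
        if re.search(pattern, filename, re.IGNORECASE):
            return label
    return None
```

Because every pattern is anchored with `$`, a name ending in `.docx` never matches the legacy `.doc` rule, and an unrecognized extension yields None so the caller can decide how to report it.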


Important Implementation Details and Algorithms


Interaction with Other System Components


Visual Diagram: Class Structure

classDiagram
    class Docx {
        +__init__()
        +get_picture(document, paragraph) Image|None
        -__clean(line) str
        -__get_nearest_title(table_index, filename) str
        +__call__(filename, binary=None, from_page=0, to_page=100000) tuple
    }
    Docx --|> DocxParser

    class Pdf {
        +__init__()
        +__call__(filename, binary=None, from_page=0, to_page=100000, zoomin=3, callback=None, separate_tables_figures=False) tuple
    }
    Pdf --|> PdfParser

    class Markdown {
        +get_picture_urls(sections) list
        +get_pictures(text) list|None
        +__call__(filename, binary=None, separate_tables=True) tuple
    }
    Markdown --|> MarkdownParser

Summary

The naive.py file parses and chunks documents in multiple formats. It combines text and image extraction with naive chunk merging based on token limits and delimiters, and can optionally use vision-based models for figure handling. The design keeps the chunking logic simple while leaving room for extension through visual model integration, and it serves as a document preprocessing layer in the InfiniFlow ecosystem.