naive.py
Overview
The naive.py file provides classes and functions that parse and chunk various document formats (e.g., DOCX, PDF, Markdown, Excel, JSON, HTML, TXT) into manageable text segments and, optionally, image segments. It implements "naive" chunking methods that split documents on delimiters and merge the pieces under a token limit, supports tokenization, and includes basic image extraction and concatenation.
This file acts as a core utility within the InfiniFlow system for document ingestion and preprocessing, preparing content for downstream tasks such as indexing, searching, or language model consumption. It supports enhanced figure extraction using vision-based LLM bundles when available.
Classes and Functions
Class Docx(DocxParser)
Extends DocxParser to provide DOCX-specific parsing capabilities with enhanced image extraction and table title extraction.
Methods
__init__(self)
Initializes the Docx parser instance.
get_picture(self, document, paragraph) -> Image or None
Extracts and concatenates all images embedded within a given paragraph of a DOCX document.
Parameters:
document (Document): The docx.Document instance.
paragraph (Paragraph): A paragraph element from which to extract images.
Returns:
PIL.Image.Image object if images are found and successfully processed; otherwise None.
Usage example:
from docx import Document

# Docx is the class defined in naive.py
doc = Document("example.docx")
docx_parser = Docx()
image = docx_parser.get_picture(doc, doc.paragraphs[0])
if image:
    image.show()
__clean(self, line: str) -> str
Cleans a text line by replacing full-width spaces (\u3000) with normal spaces and stripping leading/trailing whitespace.
__get_nearest_title(self, table_index: int, filename: str) -> str
Finds the hierarchical heading titles preceding a given table index in the DOCX document. This gives the table context by returning a breadcrumb-like string made of the document title and the heading hierarchy.
Parameters:
table_index (int): Index of the target table in the document.
filename (str): The document file name (used to extract the document base name).
Returns:
A string representing the heading hierarchy, e.g., "DocumentName > Heading1 > Heading2".
Returns an empty string if no heading is found.
Implementation details:
Iterates document blocks (paragraphs and tables) in order.
Finds the nearest heading paragraph before the target table.
Recursively finds parent headings up the hierarchy (up to level 7).
Sorts and concatenates headings to form hierarchy.
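The steps above can be sketched on plain data. This is a minimal illustration, not the file's actual implementation: it assumes the blocks preceding the table have already been reduced to (level, text) pairs, with level 0 for ordinary paragraphs, and the helper name build_breadcrumb is hypothetical.

```python
import os


def build_breadcrumb(blocks_before_table, filename):
    """Return 'DocName > H1 > H2' for the headings that enclose a table.

    blocks_before_table: list of (level, text) pairs in document order,
    where level is 1-7 for headings and 0 for ordinary paragraphs.
    (Hypothetical input shape, chosen for illustration.)
    """
    chain = []
    wanted_below = 8  # accept any heading level 1-7 at first
    # Walk backwards from the table: the first heading found is the nearest
    # one; after that, only headings with a smaller level number qualify
    # as parents.
    for level, text in reversed(blocks_before_table):
        if 1 <= level < wanted_below:
            chain.append((level, text))
            wanted_below = level
            if level == 1:
                break
    if not chain:
        return ""
    chain.sort(key=lambda item: item[0])  # top-level heading first
    doc_name = os.path.splitext(os.path.basename(filename))[0]
    return " > ".join([doc_name] + [text for _, text in chain])


blocks = [
    (1, "Introduction"),
    (0, "Some text."),
    (1, "Results"),
    (2, "Benchmarks"),
    (0, "The table below lists timings."),
]
print(build_breadcrumb(blocks, "reports/annual.docx"))
# prints: annual > Results > Benchmarks
```

Note that "Introduction" is skipped: it precedes the table but is not an ancestor of the nearest heading.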
__call__(self, filename: str, binary: bytes=None, from_page: int=0, to_page: int=100000) -> (list, list)
Parses a DOCX file or binary stream and returns the processed text lines with their associated images, plus the parsed tables with captions based on the nearest titles.
Parameters:
filename (str): Path to DOCX file.
binary (bytes, optional): Binary content of the DOCX, used instead of filename.
from_page (int, optional): Starting page number to parse (default 0).
to_page (int, optional): Ending page number to parse (default 100000).
Returns:
new_line: List of tuples (cleaned_text, PIL.Image or None) representing paragraphs and their concatenated images.
tbls: List of tuples ((None, html_table_string), "") representing parsed tables as HTML with captions.
Usage example:
docx_parser = Docx()
lines, tables = docx_parser("sample.docx")
for text, image in lines:
    print(text)
    if image:
        image.show()
Class Pdf(PdfParser)
Extends PdfParser to provide PDF parsing with OCR, layout analysis, table and figure extraction, and optional visual model enhancement.
Methods
__init__(self)
Initializes the PDF parser instance.
__call__(self, filename: str, binary: bytes=None, from_page: int=0, to_page: int=100000, zoomin: int=3, callback: callable=None, separate_tables_figures: bool=False) -> tuple
Parses a PDF document: performs OCR, layout and table analysis, and text merging, and optionally separates tables and figures.
Parameters:
filename (str): Path to PDF file.
binary (bytes, optional): Binary content of PDF.
from_page (int): Start page number.
to_page (int): End page number.
zoomin (int): Zoom factor for OCR/image resolution.
callback (callable): Function to report progress/status messages.
separate_tables_figures (bool): Whether to extract tables and figures separately.
Returns:
If separate_tables_figures is True: Tuple of (text_blocks, tables, figures).
Else: Tuple of (text_blocks, tables).
text_blocks: List of tuples (text, line_tag) representing parsed text.
tables and figures: Extracted table and figure data.
Usage example:
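# The callback is assumed to receive a progress value and a message,
# positionally or as keywords, as in the chunk() example.
def report(progress=None, msg=""):
    print(progress, msg)

pdf_parser = Pdf()
text_blocks, tables = pdf_parser("sample.pdf", from_page=0, to_page=5, callback=report)
for text, tag in text_blocks:
    print(text)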
Class Markdown(MarkdownParser)
Extends MarkdownParser to parse Markdown files, extract sections, tables, and images.
Methods
get_picture_urls(self, sections) -> list
Parses Markdown text or sections to extract the URLs of images embedded in the content.
Parameters:
sections: Markdown text string or list of strings representing sections.
Returns:
List of image URLs as strings, or empty list if none found.
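As a rough illustration of this kind of URL extraction, the sketch below uses only regular expressions from the standard library. That is a swapped-in technique rather than the method used in naive.py, which relies on markdown and BeautifulSoup. The function name extract_image_urls and both patterns are assumptions made for this example.

```python
import re

# Matches inline Markdown images: ![alt](url "optional title")
_MD_IMAGE = re.compile(r'!\[[^\]]*\]\(\s*([^)\s]+)(?:\s+"[^"]*")?\s*\)')
# Matches raw HTML <img ... src="url" ...> tags embedded in Markdown
_HTML_IMAGE = re.compile(r'<img\b[^>]*?\bsrc=["\']([^"\']+)["\']', re.IGNORECASE)


def extract_image_urls(sections):
    """Collect image URLs from a Markdown string or a list of strings.

    Markdown-style images are listed first, then HTML <img> sources,
    so the result is not strictly in document order.
    """
    if not sections:
        return []
    text = sections if isinstance(sections, str) else "\n".join(sections)
    return _MD_IMAGE.findall(text) + _HTML_IMAGE.findall(text)


md = 'Intro ![logo](https://example.com/logo.png "Logo") and <img src="figs/a.jpg" alt="a">'
print(extract_image_urls(md))
# prints: ['https://example.com/logo.png', 'figs/a.jpg']
```

A regex approach misses reference-style images and unusual quoting, which is one reason an HTML parser is the more robust choice.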
get_pictures(self, text) -> list or None
Downloads or loads images from URLs found in the Markdown text, returning a list of PIL.Image objects.
Parameters:
text(str): Markdown text content.
Returns:
List of PIL.Image objects, or None if no images are found.
Implementation details:
Supports HTTP/HTTPS URLs and local file paths.
Logs warnings/errors when images cannot be loaded.
__call__(self, filename: str, binary: bytes=None, separate_tables: bool=True) -> (list, list)
Parses a Markdown file or binary content and extracts text sections and tables.
Parameters:
filename (str): Path to Markdown file.
binary (bytes, optional): Raw file content.
separate_tables (bool): Whether to parse and separate tables.
Returns:
sections: List of tuples (section_text, "").
tbls: List of tuples ((None, html_table_string), "").
Usage example:
md_parser = Markdown()
sections, tables = md_parser("sample.md")
for section, _ in sections:
    print(section)
Function load_from_xml_v2(baseURI, rels_item_xml) -> _SerializedRelationships
Loads and returns a _SerializedRelationships instance from given XML relationships data, filtering out invalid targets.
Parameters:
baseURI: Base URI for relationship resolution.
rels_item_xml: XML string or element containing relationships.
Returns:
_SerializedRelationships instance with parsed relationship entries.
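The filtering idea can be sketched with the standard library alone. This is not the function itself: the real one returns the docx library's _SerializedRelationships, whereas the hypothetical load_valid_relationships below returns plain (Id, Target) tuples. Which targets count as invalid is not stated above, so the set INVALID_TARGETS is an assumption.

```python
import xml.etree.ElementTree as ET

# Namespace used by OPC relationship parts (.rels files) inside a DOCX package.
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"

# Assumption for illustration: these target values mark a broken relationship.
INVALID_TARGETS = {"NULL", "../NULL"}


def load_valid_relationships(rels_item_xml):
    """Return (Id, Target) pairs, skipping relationships with invalid targets."""
    if rels_item_xml is None:
        return []
    root = ET.fromstring(rels_item_xml)
    kept = []
    for rel in root.findall(f"{{{RELS_NS}}}Relationship"):
        target = rel.get("Target")
        if target in INVALID_TARGETS:
            continue  # drop entries that point nowhere
        kept.append((rel.get("Id"), target))
    return kept


IMAGE_TYPE = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
sample = (
    f'<Relationships xmlns="{RELS_NS}">'
    f'<Relationship Id="rId1" Type="{IMAGE_TYPE}" Target="media/image1.png"/>'
    f'<Relationship Id="rId2" Type="{IMAGE_TYPE}" Target="../NULL"/>'
    "</Relationships>"
)
print(load_valid_relationships(sample))
# prints: [('rId1', 'media/image1.png')]
```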
Function chunk(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", callback=None, **kwargs) -> list
High-level function to parse and chunk various supported file types using naive chunking methods. Supports DOCX, PDF, Excel, TXT, Markdown, HTML, JSON, and legacy DOC formats.
Parameters:
filename (str): File path.
binary (bytes, optional): File content bytes.
from_page (int): Starting page number (for paged formats).
to_page (int): Ending page number.
lang (str): Language for tokenization (English or Chinese).
callback (callable): Callback function for progress reporting.
**kwargs: Additional parser configuration options.
Returns:
List of tokenized chunks suitable for downstream processing.
Implementation details:
Determines file type by extension and dispatches to appropriate parser.
For DOCX and PDF, attempts to enhance figure extraction using a vision model if available.
Uses various internal parsers from deepdoc.parser and rag.nlp.
Performs naive chunk merges and tokenization using delimiters and max token count.
Supports optional "section only" mode to return raw chunks without tokenization.
Usage example:
def progress_callback(progress, msg):
    print(f"{progress*100:.1f}%: {msg}")

chunks = chunk("document.docx", from_page=0, to_page=10, callback=progress_callback)
for chunk_text in chunks:
    print(chunk_text)
Important Implementation Details and Algorithms
Image Extraction & Concatenation:
The Docx class extracts images embedded in paragraphs by navigating the underlying XML elements (pic:pic), safely handling corrupted or unrecognized image streams.
Multiple images per paragraph are concatenated horizontally using concat_img to form a single combined image.
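The internals of concat_img are not shown here, so the following only sketches the geometry a horizontal concatenation needs: canvas width is the sum of the image widths, canvas height is the tallest image, and each image is placed where the previous one ended. The helper name horizontal_layout is hypothetical.

```python
def horizontal_layout(sizes):
    """Compute canvas size and paste offsets for side-by-side images.

    sizes: list of (width, height) tuples, one per image.
    Returns ((canvas_width, canvas_height), [x_offset, ...]).
    """
    offsets = []
    x = 0
    for width, _height in sizes:
        offsets.append(x)  # each image starts where the previous one ended
        x += width
    canvas_height = max((h for _w, h in sizes), default=0)
    return (x, canvas_height), offsets


canvas, offsets = horizontal_layout([(200, 100), (50, 300), (120, 80)])
print(canvas, offsets)
# prints: (370, 300) [0, 200, 250]
```

With PIL, one would create a new canvas of that size and paste each image at (x_offset, 0).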
Heading Hierarchy Extraction:
To provide context for tables in DOCX, the nearest preceding heading paragraphs with styles matching Heading 1 to Heading 7 are found and used to build a breadcrumb title string.
PDF Parsing Workflow:
The Pdf class implements a multi-step process:
OCR extraction of images and text.
Layout analysis to identify text blocks and tables.
Table analysis via transformer-based models.
Text merging to combine fragmented text pieces.
Optionally separates tables and figures.
Progress is reported via callbacks.
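The staged structure with progress reporting can be sketched generically. The stage functions below are placeholders and not the real OCR or layout code; run_pipeline is a hypothetical helper, and the callback(progress, msg) shape is borrowed from the chunk usage example.

```python
def run_pipeline(stages, callback=None):
    """Run stages in order, reporting progress after each one.

    stages: list of (name, function) pairs; each function takes the
    previous stage's result and returns the next one.
    callback: optional callable(progress, msg), progress in [0, 1].
    """
    result = None
    for index, (name, stage) in enumerate(stages, start=1):
        result = stage(result)
        if callback is not None:
            callback(index / len(stages), f"{name} finished")
    return result


# Placeholder stages standing in for OCR, layout analysis,
# table analysis and text merging.
stages = [
    ("OCR", lambda _: ["Hel", "lo", "wor", "ld"]),
    ("Layout analysis", lambda boxes: [boxes[:2], boxes[2:]]),
    ("Table analysis", lambda blocks: blocks),
    ("Text merging", lambda blocks: ["".join(b) for b in blocks]),
]
messages = []
text = run_pipeline(stages, callback=lambda p, m: messages.append((p, m)))
print(text)
print(messages[-1])
```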
Naive Chunking:
Text is split using configurable delimiter characters and merged into chunks respecting a maximum token count.
Tokenization leverages rag_tokenizer for both coarse and fine-grained token splits.
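A minimal sketch of delimiter-based splitting followed by greedy merging is shown below. It is an approximation under stated assumptions: tokens are counted by whitespace splitting instead of rag_tokenizer, and the default delimiters and token limit are arbitrary. The names count_tokens and naive_chunk are hypothetical.

```python
import re


def count_tokens(text):
    # Stand-in tokenizer: whitespace-separated words. The real file relies on
    # rag_tokenizer, whose API is not reproduced here.
    return len(text.split())


def naive_chunk(text, delimiters="\n.!?", max_tokens=8):
    """Split text at delimiter characters, then greedily merge the pieces
    into chunks of at most max_tokens tokens. A single piece that exceeds
    the limit is kept whole rather than cut."""
    cls = re.escape(delimiters)
    # Keep each delimiter attached to the piece it terminates.
    pieces = [p.strip() for p in re.findall(f"[^{cls}]+[{cls}]*", text)]
    pieces = [p for p in pieces if p]

    chunks, current, current_tokens = [], [], 0
    for piece in pieces:
        n = count_tokens(piece)
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))  # limit reached: close the chunk
            current, current_tokens = [], 0
        current.append(piece)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks


sample_text = (
    "Chunking is simple. Split on delimiters! "
    "Then merge pieces until the token limit is reached. Done?"
)
for c in naive_chunk(sample_text, max_tokens=8):
    print(repr(c))
```

Merging whole pieces keeps sentences intact, at the cost of occasionally exceeding the limit when one sentence is longer than max_tokens.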
Visual Model Integration for Figures:
When available, vision-based LLM bundles are used to enhance figure parsing for DOCX and PDF files.
This is handled by VisionFigureParser, which can improve figure extraction quality.
Flexible File Type Dispatch:
The chunk function supports a wide range of file types and selects the appropriate parser based on file extension.
Unsupported types raise NotImplementedError.
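Extension-based dispatch of this kind might look like the sketch below. The extension table and handler names are assumptions for illustration; the real chunk function dispatches to parser classes such as Docx, Pdf and Markdown, and its exact extension list is not given here.

```python
import os

# Hypothetical extension-to-handler table (assumed, not taken from naive.py).
HANDLERS = {
    ".docx": "docx",
    ".pdf": "pdf",
    ".xlsx": "excel",
    ".md": "markdown",
    ".markdown": "markdown",
    ".txt": "txt",
    ".html": "html",
    ".htm": "html",
    ".json": "json",
    ".doc": "doc",
}


def pick_parser(filename):
    """Return the handler key for a file, based on its lowercased extension."""
    ext = os.path.splitext(filename)[1].lower()
    try:
        return HANDLERS[ext]
    except KeyError:
        raise NotImplementedError(
            f"file type not supported yet: {ext or filename}"
        ) from None


print(pick_parser("Report.PDF"))
# prints: pdf
```

A lookup table keeps the supported formats in one place, so adding a format means adding one entry.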
Interaction with Other System Components
Imports from other modules:
deepdoc.parser and its submodules provide the underlying parsing engines for DOCX, PDF, Excel, HTML, JSON, Markdown, and TXT.
rag.nlp provides tokenization and chunk merging utilities.
api.db.services.llm_service is used to instantiate vision-based LLM bundles for enhanced figure extraction.
The docx library is used for DOCX document manipulation.
tika.parser is used for legacy DOC format parsing.
PIL.Image is used for image processing and concatenation.
markdown and BeautifulSoup facilitate Markdown to HTML conversion and image URL extraction.
Outputs:
This file returns structured text chunks and associated images and tables, ready for indexing, embedding, or language model processing by downstream modules.
Visual Diagram: Class Structure
classDiagram
class Docx {
+__init__()
+get_picture(document, paragraph) Image|None
-__clean(line) str
-__get_nearest_title(table_index, filename) str
+__call__(filename, binary=None, from_page=0, to_page=100000) tuple
}
Docx --|> DocxParser
class Pdf {
+__init__()
+__call__(filename, binary=None, from_page=0, to_page=100000, zoomin=3, callback=None, separate_tables_figures=False) tuple
}
Pdf --|> PdfParser
class Markdown {
+get_picture_urls(sections) list
+get_pictures(text) list|None
+__call__(filename, binary=None, separate_tables=True) tuple
}
Markdown --|> MarkdownParser
Summary
The naive.py file is a comprehensive document parsing and chunking utility tailored for multiple document formats. It combines text and image extraction, naive chunk merging based on token limits and delimiters, and optionally integrates vision-based models for enhanced figure handling. The design balances simplicity (naive chunking) with extensibility (visual model integration), providing a solid foundation for document preprocessing in the InfiniFlow ecosystem.