one.py

Overview

The one.py file is a document parsing utility designed to handle multiple file formats including PDF, DOCX, DOC, Excel, TXT, and HTML. It provides functionality to extract and organize text and table data while maintaining the original structure and order of the content. The primary interface exposed is the chunk function, which acts as a unified entry point for parsing different document types into structured "chunks" suitable for downstream processing, such as natural language processing (NLP) or information retrieval.

A specialized Pdf class extends a base PDF parser (PdfParser) to perform detailed OCR, layout analysis, table extraction, and text merging specifically tailored for PDF documents. The file also integrates with external libraries and internal modules for tokenization, text extraction, and document parsing.

Classes and Functions

Class: `Pdf`

Inheritance

Inherits from PdfParser (imported from deepdoc.parser)

Purpose

Pdf enhances the base PDF parsing capabilities by adding a detailed OCR step, layout and table analysis, and text merging to produce structured text and table data from PDF files.

Method: `__call__`

__call__(filename, binary=None, from_page=0, to_page=100000, zoomin=3, callback=None)

Parameters

filename (str): Path to the PDF file. When binary is provided, the content is read from binary instead of from this path.
binary (bytes or None): Optional raw binary content of the PDF. If provided, parsing uses this instead of loading from filename.
from_page (int): Starting page number to parse (default 0).
to_page (int): Ending page number to parse (default 100,000).
zoomin (int): Zoom factor for OCR and layout processing to improve accuracy (default 3).
callback (callable or None): Optional function to receive progress updates. Should accept either a float progress value and a message, or just a message via keyword argument msg.
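
The two calling conventions described for callback can be illustrated from the caller's side. The following is a minimal sketch, not the parser's actual code: report_progress, recorder, and the sample messages are hypothetical names introduced here for illustration.

```python
from typing import Callable, Optional


def report_progress(callback: Optional[Callable],
                    progress: Optional[float] = None,
                    msg: str = "") -> None:
    """Invoke a progress callback using either convention (hypothetical helper)."""
    if callback is None:
        return  # progress reporting is optional
    if progress is None:
        callback(msg=msg)  # message-only update, passed by keyword
    else:
        callback(progress, msg)  # progress fraction plus message, positional


events = []


def recorder(progress=None, msg=""):
    # Works with both conventions because both parameters have defaults.
    events.append((progress, msg))


report_progress(recorder, msg="OCR started")
report_progress(recorder, 0.5, "Layout analysis finished")
report_progress(None, 1.0, "ignored because no callback was given")
```

A callable such as the built-in print would fail on the message-only form, because it does not accept a msg keyword argument.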

Returns

Tuple:
- List of tuples (text, "") where text is extracted text or table content, maintaining order based on their positions in the document.
- None (reserved for future metadata or error information)

Description

This method performs the following workflow:

Starts an OCR process on the PDF pages, reporting progress via callback.
Performs layout analysis to identify text blocks and their positions.
Runs a table analysis job to detect and extract tables.
Merges fragmented text blocks into coherent text segments.
Extracts tables and figures, appending them to the text sections.
Sorts all extracted sections by their position within the document.
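
The final sorting step can be sketched as follows. This is an illustration under stated assumptions, not the class's actual implementation: the position layout (page, left, right, top, bottom), the sort key, and the name order_sections are all hypothetical.

```python
from typing import List, Tuple

# Assumed position layout: (page, left, right, top, bottom).
Position = Tuple[int, float, float, float, float]
Section = Tuple[str, Position]


def order_sections(sections: List[Section]) -> List[Tuple[str, str]]:
    """Sort sections by page, then vertical offset, then horizontal offset."""

    def sort_key(section: Section):
        page, left, _right, top, _bottom = section[1]
        return (page, top, left)

    # Positions are dropped after sorting, leaving (text, "") pairs
    # in the shape described in the Returns section.
    return [(text, "") for text, _ in sorted(sections, key=sort_key)]


sections = [
    ("table", (1, 50.0, 200.0, 300.0, 400.0)),
    ("intro", (0, 50.0, 200.0, 100.0, 120.0)),
    ("body", (1, 50.0, 200.0, 100.0, 120.0)),
]
ordered = order_sections(sections)
```

Sorting text blocks and tables with one shared key is what lets a table land between the paragraphs that surround it on the page.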

Usage Example

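# print cannot serve as the callback: it rejects the msg keyword argument.
def on_progress(progress=None, msg=""):
    print(progress, msg)

pdf_parser = Pdf()
sections, _ = pdf_parser("sample.pdf", from_page=0, to_page=5, callback=on_progress)
for text, _ in sections:
    print(text)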

Function: `chunk`

chunk(filename, binary=None, from_page=0, to_page=100000,
      lang="Chinese", callback=None, **kwargs)

Purpose

The chunk function is a unified document parser that supports multiple file formats: DOCX, PDF, Excel, TXT, HTML, and DOC. It extracts text content in a structured chunked format while maintaining the original order and basic layout features. It uses specialized parsers depending on the file type and provides progress updates via a callback.

Parameters

filename (str): Path to the document file.
binary (bytes or None): Optional binary content of the file.
from_page (int): Starting page number for parsing (applicable mainly to PDFs).
to_page (int): Ending page number for parsing (applicable mainly to PDFs).
lang (str): Language of the document, defaults to "Chinese". Used to determine tokenization behavior.
callback (callable or None): Function to receive progress updates. Signature typically (float progress, str message).
**kwargs: Additional keyword arguments. Recognized key:
- "parser_config" (dict): Configuration options for parsing, e.g., chunk sizes, delimiters, and layout recognition method.

Returns

List containing a single dictionary representing the parsed document with keys:
- "docnm_kwd": Original filename.
- "title_tks": Tokenized document title (filename without extension).
- "title_sm_tks": Fine-grained tokenization of the title.
- Additional tokenization results from the document text.
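
To make the title-related keys concrete, here is a rough sketch of how they could be derived from a filename. It does not call rag_tokenizer: the two tokenization steps are simple stand-ins, the space-joined token format is an assumption, and build_title_fields is a hypothetical name.

```python
import re


def build_title_fields(filename: str) -> dict:
    """Build the filename-derived keys of the result dictionary (illustrative)."""
    # Title is the filename without its extension.
    title = re.sub(r"\.[a-zA-Z]+$", "", filename)
    # Stand-in for coarse tokenization: lowercase and split on whitespace.
    coarse = " ".join(title.lower().split())
    # Stand-in for fine-grained tokenization: split further into ASCII
    # alphanumeric runs. A real tokenizer would also handle other scripts.
    fine = " ".join(re.findall(r"[a-z0-9]+", coarse))
    return {
        "docnm_kwd": filename,
        "title_tks": coarse,
        "title_sm_tks": fine,
    }


doc = build_title_fields("Annual_Report-2023.pdf")
```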

Detailed Behavior

Detects file type using regex on the filename extension.
For .docx: Uses rag.app.naive.Docx parser to extract sections and tables.
For .pdf: Uses Pdf class or a plain text PDF parser based on parser_config.
For .xls and .xlsx (matched by the pattern .xlsx?): Uses ExcelParser to extract HTML sections.
For text files (.txt, .md, .markdown): Reads plain text and splits into lines.
For HTML files (.htm, .html): Uses HtmlParser.
For .doc: Uses Apache Tika parser to extract text content.
Raises NotImplementedError on unsupported file types.
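
The extension-based routing above can be sketched as a small lookup. This is a simplified illustration rather than the function's actual code: the routing table, the returned labels, and the name detect_kind are hypothetical, and case-insensitive matching is an assumption.

```python
import re

# Hypothetical routing table: (extension pattern, illustrative label).
_ROUTES = [
    (r"\.docx$", "docx"),
    (r"\.pdf$", "pdf"),
    (r"\.xlsx?$", "excel"),  # matches both .xls and .xlsx
    (r"\.(txt|md|markdown)$", "text"),
    (r"\.(htm|html)$", "html"),
    (r"\.doc$", "doc"),
]


def detect_kind(filename: str) -> str:
    """Return a label for the parser family that would handle the file."""
    for pattern, kind in _ROUTES:
        if re.search(pattern, filename, re.IGNORECASE):
            return kind
    raise NotImplementedError(f"file type not supported: {filename}")
```

Anchoring each pattern with $ keeps .doc from matching .docx files, so the order of the entries does not affect the result.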

After extraction, it tokenizes the filename and document text using rag_tokenizer and tokenize.

Usage Example

def progress_callback(progress=None, msg=""):
    if progress is not None:
        print(f"{progress*100:.1f}% - {msg}")
    else:
        print(msg)

docs = chunk("example.pdf", from_page=0, to_page=10, callback=progress_callback)
print(docs[0]["docnm_kwd"])  # Prints filename
print(docs[0]["title_tks"])  # Prints tokenized title

Important Implementation Details and Algorithms

PDF Parsing Workflow: The __call__ method of the Pdf class orchestrates several steps:
- OCR with zoom factor to improve text recognition quality.
- Layout analysis to detect text blocks and their spatial positions.
- Table recognition and extraction.
- Text merging to unify fragmented text areas.
- Sorting extracted content based on positional metrics to preserve document reading order.
Chunking Strategy: The chunk function ensures that extracted text maintains its original document order and grouping, suitable for downstream NLP tasks that require coherent context windows.
Tokenization: Uses rag_tokenizer and tokenize from the rag.nlp module to generate both coarse and fine-grained token lists. The lang parameter determines tokenization behavior, for example for Chinese versus English text.
File Type Handling: The function matches the filename extension against regular expressions to route each file to the appropriate parser, so supporting a new format means adding another branch.
Progress Reporting: Both the Pdf parser and the chunk function accept a callback parameter to report progress updates, which is useful for long-running parsing jobs in UI or asynchronous environments.

Interaction with Other Modules

deepdoc.parser: Provides base parsers such as PdfParser, ExcelParser, PlainParser, and HtmlParser used for specific file-type parsing.
deepdoc.parser.utils.get_text: Utility to extract raw text from files.
rag.app.naive: Contains the Docx parser used for DOCX documents.
rag.nlp: Supplies tokenizers and tokenization functions (rag_tokenizer, tokenize) used to process text chunks.
tika.parser: Apache Tika integration used for DOC file parsing.
io.BytesIO: Used for in-memory binary stream handling, especially for DOC files.

This file primarily acts as a bridge that integrates these components into a cohesive document chunking pipeline.

Visual Diagram

classDiagram
    class PdfParser {
        <<abstract>>
        +__call__(filename, binary, from_page, to_page, zoomin, callback)
        +_layouts_rec()
        +_table_transformer_job()
        +_text_merge()
        +_extract_table_figure()
        +_concat_downward()
    }

    class Pdf {
        +__call__(filename, binary=None, from_page=0, to_page=100000, zoomin=3, callback=None)
        -__images__()
    }

    class Docx {
        +__call__(filename, binary)
    }

    class ExcelParser {
        +html(binary, max_size)
    }

    class PlainParser {
        +__call__(filename, binary, from_page, to_page, zoomin, callback)
    }

    class HtmlParser {
        +__call__(filename, binary)
    }

    Pdf --|> PdfParser
    note for Pdf "Reports progress through callback. Workflow: OCR, layout analysis, table analysis, text merging"

    chunk ..> Pdf : uses for PDF files
    chunk ..> Docx : uses for DOCX files
    chunk ..> ExcelParser : uses for Excel files
    chunk ..> PlainParser : optional PDF plain text parser
    chunk ..> HtmlParser : uses for HTML files
    chunk ..> tika.parser : uses for DOC files
    chunk ..> rag_tokenizer : tokenization
    chunk ..> tokenize : tokenization

Summary

one.py provides a flexible, multi-format document parsing utility.
The Pdf class extends PDF parsing with OCR and layout-aware processing.
The chunk function is the main entry point, supporting several common document formats.
It integrates multiple external and internal parsers and NLP tokenizers.
Progress callbacks support interactive or asynchronous processing.
Suitable for preprocessing documents for NLP, search indexing, or document understanding systems.
