one.py


Overview

The one.py file is a document parsing utility designed to handle multiple file formats including PDF, DOCX, DOC, Excel, TXT, and HTML. It provides functionality to extract and organize text and table data while maintaining the original structure and order of the content. The primary interface exposed is the chunk function, which acts as a unified entry point for parsing different document types into structured "chunks" suitable for downstream processing, such as natural language processing (NLP) or information retrieval.

A specialized Pdf class extends a base PDF parser (PdfParser) to perform detailed OCR, layout analysis, table extraction, and text merging specifically tailored for PDF documents. The file also integrates with external libraries and internal modules for tokenization, text extraction, and document parsing.


Classes and Functions

Class: Pdf

Inheritance

Pdf extends PdfParser, the base PDF parser.

Purpose

Pdf enhances the base PDF parsing capabilities by adding a detailed OCR step, layout and table analysis, and text merging to produce structured text and table data from PDF files.

Method: __call__

__call__(filename, binary=None, from_page=0, to_page=100000, zoomin=3, callback=None)
Parameters
Returns
Description

This method performs the following workflow:

  1. Starts an OCR process on the PDF pages, reporting progress via callback.

  2. Performs layout analysis to identify text blocks and their positions.

  3. Runs a table analysis job to detect and extract tables.

  4. Merges fragmented text blocks into coherent text segments.

  5. Extracts tables and figures, appending them to the text sections.

  6. Sorts all extracted sections by their position within the document.
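
Steps 5 and 6 can be sketched in a few lines. The snippet below is an illustration only, not the code in one.py: it assumes each section is a (text, (page, top)) pair, and order_sections is a hypothetical helper name introduced here.

```python
from typing import List, Tuple

# Hypothetical section shape: (text, (page_index, top_offset)).
Section = Tuple[str, Tuple[int, float]]


def order_sections(text_sections: List[Section],
                   table_sections: List[Section]) -> List[Section]:
    """Append table sections to the text sections, then sort by position."""
    combined = list(text_sections) + list(table_sections)
    # Tuples compare element by element: page first, then offset on the page.
    return sorted(combined, key=lambda section: section[1])


texts = [("Conclusion", (2, 40.0)), ("Introduction", (0, 10.0))]
tables = [("<table>...</table>", (1, 5.0))]
ordered = order_sections(texts, tables)
print([text for text, _ in ordered])
```

Sorting on a (page, offset) tuple keeps tables in reading order relative to the surrounding text, which is the property step 6 describes.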

Usage Example
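# Assumes callback is called with progress/msg keywords, as in the chunk example.
def report(progress=None, msg=""):
    print(progress, msg)

pdf_parser = Pdf()
sections, _ = pdf_parser("sample.pdf", from_page=0, to_page=5, callback=report)
for text, _ in sections:
    print(text)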

Function: chunk

chunk(filename, binary=None, from_page=0, to_page=100000,
      lang="Chinese", callback=None, **kwargs)

Purpose

The chunk function is a unified document parser that supports multiple file formats: DOCX, PDF, Excel, TXT, HTML, and DOC. It extracts text content in a structured chunked format while maintaining the original order and basic layout features. It uses specialized parsers depending on the file type and provides progress updates via a callback.
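
Choosing a parser by file type might look roughly like the sketch below. It is not taken from one.py: the extension patterns, the PARSER_BY_PATTERN table, and pick_parser are assumptions made for illustration, and the function returns parser names as strings rather than calling the real parsers.

```python
import re

# Hypothetical extension patterns. The labels are the parser names used in
# this document; "plain text" is a placeholder because no TXT parser is named.
PARSER_BY_PATTERN = [
    (r"\.docx$", "Docx"),
    (r"\.pdf$", "Pdf"),
    (r"\.xlsx?$", "ExcelParser"),
    (r"\.txt$", "plain text"),
    (r"\.html?$", "HtmlParser"),
    (r"\.doc$", "tika.parser"),
]


def pick_parser(filename: str) -> str:
    """Return the name of the parser that would handle filename."""
    for pattern, parser_name in PARSER_BY_PATTERN:
        if re.search(pattern, filename, re.IGNORECASE):
            return parser_name
    raise NotImplementedError(f"unsupported file type: {filename}")


print(pick_parser("report.PDF"))
print(pick_parser("notes.docx"))
```

Anchoring each pattern with `$` keeps `.doc` and `.docx` from matching each other, so the order of the table does not affect the result.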

Parameters

Returns

Detailed Behavior

After extraction, it tokenizes the filename and document text using rag_tokenizer and tokenize.
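
As a rough sketch of this tokenization step, the snippet below builds one record with the docnm_kwd and title_tks fields. The toy_tokenize function is a stand-in for rag_tokenizer, whose real behavior is not described here; build_doc and the content_tks field name are likewise hypothetical.

```python
import re
from typing import Dict, List


def toy_tokenize(text: str) -> str:
    """Stand-in tokenizer: lowercase word tokens joined by spaces."""
    return " ".join(re.findall(r"\w+", text.lower()))


def build_doc(filename: str, sections: List[str]) -> Dict[str, str]:
    """Build one chunk record from a filename and its extracted sections."""
    doc = {
        "docnm_kwd": filename,
        # Tokenize the filename without its extension to form the title.
        "title_tks": toy_tokenize(re.sub(r"\.[a-zA-Z]+$", "", filename)),
    }
    # "content_tks" is a hypothetical field name for the tokenized body text.
    doc["content_tks"] = toy_tokenize("\n".join(sections))
    return doc


doc = build_doc("Annual Report.pdf", ["Revenue grew.", "Costs fell."])
print(doc["docnm_kwd"])
print(doc["title_tks"])
```

Keeping the raw filename alongside its tokenized form lets downstream code match on the exact name or on individual title terms.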

Usage Example

def progress_callback(progress=None, msg=""):
    if progress is not None:
        print(f"{progress*100:.1f}% - {msg}")
    else:
        print(msg)

docs = chunk("example.pdf", from_page=0, to_page=10, callback=progress_callback)
print(docs[0]["docnm_kwd"])  # Prints filename
print(docs[0]["title_tks"])  # Prints tokenized title

Important Implementation Details and Algorithms


Interaction with Other Modules

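This file acts as a bridge: it combines the format-specific parsers (Pdf, Docx, ExcelParser, PlainParser, HtmlParser, and tika.parser) with the tokenization helpers (rag_tokenizer and tokenize) into a single document chunking pipeline.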


Visual Diagram

classDiagram
    class PdfParser {
        <<abstract>>
        +__call__(filename, binary, from_page, to_page, zoomin, callback)
        +_layouts_rec()
        +_table_transformer_job()
        +_text_merge()
        +_extract_table_figure()
        +_concat_downward()
    }

    class Pdf {
        +__call__(filename, binary=None, from_page=0, to_page=100000, zoomin=3, callback=None)
        -__images__()
    }

    class Docx {
        +__call__(filename, binary)
    }

    class ExcelParser {
        +html(binary, max_size)
    }

    class PlainParser {
        +__call__(filename, binary, from_page, to_page, zoomin, callback)
    }

    class HtmlParser {
        +__call__(filename, binary)
    }

    Pdf --|> PdfParser
    note for Pdf "Accepts a callback function for progress reporting"
    note for Pdf "Workflow: OCR, layout, table, text merging"

    chunk ..> Pdf : uses for PDF files
    chunk ..> Docx : uses for DOCX files
    chunk ..> ExcelParser : uses for Excel files
    chunk ..> PlainParser : optional PDF plain text parser
    chunk ..> HtmlParser : uses for HTML files
    chunk ..> tika.parser : uses for DOC files
    chunk ..> rag_tokenizer : tokenization
    chunk ..> tokenize : tokenization

Summary

