manual.py


Overview

manual.py is a document parsing and chunking utility designed primarily for processing PDF and DOCX files. It implements specialized parsers to extract textual and structural content, including tables and images, and converts the content into tokenized chunks suitable for downstream NLP tasks such as retrieval-augmented generation (RAG). The file supports layout and table recognition, OCR for PDFs, and hierarchical question-answer style extraction for DOCX files. It provides a unified chunk function that serves as an entry point to parse supported document types and produce tokenized representations enriched with layout and positional metadata.


Classes and Functions

Class: Pdf(PdfParser)

Purpose

Extends the base PdfParser class from deepdoc.parser to implement manual parsing logic for PDF files, including OCR, layout analysis, table extraction, and text merging.
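To make the text-merging step more concrete, here is a minimal sketch. It is not the deepdoc implementation: the box format (dicts with `top`, `bottom`, and `text` keys), the function name `merge_boxes`, and the `max_gap` threshold are all assumptions chosen for illustration. It joins vertically adjacent boxes whose gap is small enough to suggest they belong to the same paragraph.

```python
def merge_boxes(boxes, max_gap=5.0):
    """Merge boxes, ordered top to bottom, when the vertical gap is small.

    Hypothetical box format: {"top": float, "bottom": float, "text": str}.
    """
    merged = []
    for box in sorted(boxes, key=lambda b: b["top"]):
        if merged and box["top"] - merged[-1]["bottom"] <= max_gap:
            # Close enough to the previous box: treat it as a continuation.
            prev = merged[-1]
            prev["text"] = prev["text"] + " " + box["text"]
            prev["bottom"] = max(prev["bottom"], box["bottom"])
        else:
            merged.append(dict(box))  # copy so the input is not mutated
    return merged


boxes = [
    {"top": 0.0, "bottom": 10.0, "text": "Install the"},
    {"top": 12.0, "bottom": 22.0, "text": "filter cartridge."},
    {"top": 60.0, "bottom": 70.0, "text": "Warning"},
]
result = merge_boxes(boxes)
```

A real parser would likely also compare horizontal positions, page numbers, and layout types before merging; the sketch only considers vertical distance.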

Methods


Class: Docx(DocxParser)

Purpose

A specialized DOCX parser extending DocxParser to extract paragraphs, images, and tables from DOCX documents while preserving question-answer style hierarchies and embedded images.
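One way to picture a question-answer style hierarchy is to treat each heading chain as a question and the body text beneath it as the answer. The sketch below illustrates that idea only; the `(level, text)` input format and the `qa_pairs` helper are hypothetical and do not describe how the Docx class reads a document.

```python
def qa_pairs(paragraphs):
    """Group body text under the chain of headings that precedes it.

    `paragraphs` is a list of (level, text) tuples: level 0 is body text,
    levels 1, 2, ... are headings (hypothetical input format).
    """
    path = []    # current heading chain as (level, text)
    pairs = []   # collected (question, answer) tuples
    answer = []  # body lines seen under the current heading chain

    def flush():
        # Emit a pair only when there is both a heading and some body text.
        if path and answer:
            question = " > ".join(text for _, text in path)
            pairs.append((question, "\n".join(answer)))
        answer.clear()

    for level, text in paragraphs:
        if level == 0:
            answer.append(text)
            continue
        flush()
        # Drop headings at the same or a deeper level before descending.
        while path and path[-1][0] >= level:
            path.pop()
        path.append((level, text))
    flush()
    return pairs


doc = [
    (1, "Maintenance"),
    (2, "How do I replace the filter?"),
    (0, "Open the cover."),
    (0, "Swap the cartridge."),
    (2, "How often?"),
    (0, "Every six months."),
]
pairs = qa_pairs(doc)
```

Keeping the full heading path in the question preserves context that a chunk would otherwise lose when it is retrieved on its own.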

Methods


Function: chunk(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", callback=None, **kwargs)

Purpose

Main utility function to parse and tokenize documents of supported types (PDF and DOCX). It handles configuration, parser selection, tokenization, and layout analysis.
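Parser selection can be sketched as a dispatch on the file extension. This is an illustration under stated assumptions, not the code in manual.py: `parse_pdf`, `parse_docx`, `PARSERS`, and `select_parser` are hypothetical stand-ins, and the choice of `NotImplementedError` for unsupported types is likewise an assumption.

```python
import os


def parse_pdf(filename):
    # Stand-in for the Pdf parser; returns a tag instead of real content.
    return ("pdf", filename)


def parse_docx(filename):
    # Stand-in for the Docx parser.
    return ("docx", filename)


# Map lowercase extensions to the (hypothetical) parser callables.
PARSERS = {".pdf": parse_pdf, ".docx": parse_docx}


def select_parser(filename):
    """Pick a parser from the file extension, case-insensitively."""
    ext = os.path.splitext(filename)[1].lower()
    try:
        return PARSERS[ext]
    except KeyError:
        raise NotImplementedError(f"unsupported file type: {filename}") from None


parser = select_parser("Manual.PDF")
```

A lookup table keeps the supported formats in one place, so adding a format means adding one entry rather than another branch.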

Parameters

Returns

Detailed Behavior

Usage Example

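from manual import chunk  # assumption: manual.py is on the import path

chunks = chunk("sample.pdf", from_page=0, to_page=5, lang="English")
for c in chunks:  # a loop variable named `chunk` would shadow the function
    print(c["text"])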

Important Implementation Details


Interactions with Other Components


Visual Diagram

classDiagram
    class Pdf {
        +__init__()
        +__call__(filename, binary, from_page, to_page, zoomin, callback)
        -__images__(...)
        -_layouts_rec(zoomin)
        -_table_transformer_job(zoomin)
        -_text_merge()
        -_extract_table_figure(...)
        -_concat_downward()
        -_filter_forpages()
        +get_position(box, zoomin)
        -boxes: list
        -outlines: list
    }
    Pdf --|> PdfParser

    class Docx {
        +__init__()
        +get_picture(document, paragraph)
        +concat_img(img1, img2)
        +__call__(filename, binary, from_page, to_page, callback)
        -doc: Document
    }
    Docx --|> DocxParser

    class chunk {
        <<function>>
        +chunk(filename, binary, from_page, to_page, lang, callback, **kwargs)
    }

Summary

The manual.py file provides advanced document parsing capabilities tailored for PDF and DOCX formats, focusing on extracting structured text, tables, and images with layout and semantic information. It is designed for integration into NLP pipelines that require tokenized, chunked document data enriched with positional and hierarchical metadata, supporting complex workflows like OCR, layout recognition, and question-answer extraction. The file interfaces with various modules for parsing, tokenization, and image processing and exposes a simple chunk function as the main API for document processing.