manual.py

Overview

manual.py is a document parsing and chunking utility designed primarily for processing PDF and DOCX files. It implements specialized parsers to extract textual and structural content, including tables and images, and converts the content into tokenized chunks suitable for downstream NLP tasks such as retrieval-augmented generation (RAG). The file supports layout and table recognition, OCR for PDFs, and hierarchical question-answer style extraction for DOCX files. It provides a unified chunk function that serves as an entry point to parse supported document types and produce tokenized representations enriched with layout and positional metadata.

Classes and Functions

Class: `Pdf(PdfParser)`

Purpose

Extends the base PdfParser class from deepdoc.parser to implement manual parsing logic for PDF files, including OCR, layout analysis, table extraction, and text merging.

Methods

`__init__(self)`
- Initializes the parser type as ParserType.MANUAL.
- Calls the superclass initializer.
`__call__(self, filename, binary=None, from_page=0, to_page=100000, zoomin=3, callback=None)`
Parameters:
- filename (str): Path to the PDF file.
- binary (bytes, optional): Binary content of the PDF, if already loaded.
- from_page (int): Starting page number for parsing (inclusive).
- to_page (int): Ending page number for parsing (exclusive).
- zoomin (int): Zoom factor for image rendering during OCR.
- callback (function): Optional progress callback accepting (progress, msg) or msg only.
Returns:
- Tuple:
  - List of tuples (text, layoutno, position) representing extracted text boxes.
  - List of extracted tables with layout information.
Functionality:
- Runs OCR on selected pages.
- Performs layout analysis and table recognition.
- Merges text boxes and filters by pages.
- Cleans extracted text by removing excessive whitespace.
- Returns structured text and tables.
Usage Example:
```
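def report(progress=None, msg=""):  # accepts (progress, msg) or msg alone, also as a keyword
    print(progress, msg)

pdf_parser = Pdf()
text_boxes, tables = pdf_parser("example.pdf", from_page=0, to_page=10, callback=report)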
```
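The whitespace cleanup listed under Functionality can be illustrated in isolation. The sketch below is not taken from manual.py: `clean_text` is a hypothetical helper, and the choice to treat spaces, tabs, and ideographic spaces (U+3000) as "excessive whitespace" is an assumption.

```python
import re

# Two or more consecutive spaces, tabs, or ideographic spaces (U+3000).
# Which characters count as "excessive whitespace" is an assumption here.
_EXCESS_SPACE = re.compile(r"[ \t\u3000]{2,}")


def clean_text(text: str) -> str:
    """Trim both ends and collapse each run of excess whitespace to one space."""
    return _EXCESS_SPACE.sub(" ", text.strip())


print(clean_text("  Step 1:\t\tpress   the  button  "))
```

Single spaces and line breaks are left untouched, so only runs of two or more whitespace characters are collapsed.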

Class: `Docx(DocxParser)`

Purpose

A specialized DOCX parser extending DocxParser to extract paragraphs, images, and tables from DOCX documents while preserving question-answer style hierarchies and embedded images.

Methods

`__init__(self)`
- Empty initializer, no special setup.
get_picture(self, document, paragraph)
Parameters:
- document (docx.Document): The loaded DOCX document.
- paragraph (docx.paragraph.Paragraph): The paragraph from which to extract images.
Returns:
- PIL.Image.Image object if an image is found, else None.
Description:
Extracts the first embedded image from the paragraph using XML parsing and returns it as a PIL Image.
concat_img(self, img1, img2)
Parameters:
- img1 (PIL.Image.Image or None): First image.
- img2 (PIL.Image.Image or None): Second image.
Returns:
- Combined vertical concatenation of img1 and img2, or whichever is not None.
Description:
Vertically concatenates two images, handling None cases gracefully.
`__call__(self, filename, binary=None, from_page=0, to_page=100000, callback=None)`
Parameters:
- filename (str): Path to DOCX file.
- binary (bytes, optional): Binary content of the DOCX file.
- from_page (int): Starting page number (used heuristically).
- to_page (int): Ending page number.
- callback (function): Optional progress reporting callback.
Returns:
- Tuple:
  - List of tuples (text_with_question_hierarchy, image) representing text blocks with associated images.
  - List of tables as HTML representations with placeholders.
Functionality:
- Iterates through paragraphs, identifying questions by hierarchy level using docx_question_level.
- Builds a stack of questions and accumulates answers and images.
- Extracts and converts DOCX tables to simple HTML tables.
- Returns hierarchical question-answer pairs and associated images, plus tables.
Usage Example:
```
docx_parser = Docx()
qa_list, tables = docx_parser("example.docx")
for text, image in qa_list:
    print(text)
    if image:
        image.show()
```
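The question stack described under Functionality can be sketched with plain data. This is an illustration only, not the code in manual.py: `build_qa_pairs` is a hypothetical function, and the convention that level 0 marks answer text while level 1 and above marks a question at that depth is an assumption about what docx_question_level reports.

```python
from typing import List, Tuple


def build_qa_pairs(paragraphs: List[Tuple[int, str]]) -> List[Tuple[str, str]]:
    """Pair each run of answer text with the question path above it.

    `paragraphs` holds (level, text) tuples; level 0 marks answer text and
    level >= 1 marks a question at that depth (an assumed convention).
    Answer text that appears before any question is discarded.
    """
    stack: List[Tuple[int, str]] = []  # open questions, outermost first
    answer: List[str] = []             # answer lines for the innermost question
    pairs: List[Tuple[str, str]] = []

    def flush() -> None:
        # Emit the accumulated answer under the current question path.
        if stack and answer:
            path = " > ".join(text for _, text in stack)
            pairs.append((path, "\n".join(answer)))
        answer.clear()

    for level, text in paragraphs:
        if level == 0:
            answer.append(text)
            continue
        flush()
        # Close every open question at the same or a deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, text))
    flush()
    return pairs


doc = [
    (1, "Installation"),
    (2, "How do I install on Linux?"),
    (0, "Run the installer script."),
    (2, "How do I install on Windows?"),
    (0, "Download the MSI package."),
    (0, "Then reboot."),
    (1, "Usage"),
    (0, "Start the service."),
]
for question, text in build_qa_pairs(doc):
    print(f"{question}\n  {text!r}")
```

Keeping the full path of open questions, rather than only the innermost one, lets each answer carry its parent headings as context.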

Function: `chunk(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", callback=None, **kwargs)`

Purpose

Main utility function to parse and tokenize documents of supported types (PDF and DOCX). It handles configuration, parser selection, tokenization, and layout analysis.

Parameters

- filename (str): Path to the document file.
- binary (bytes, optional): Binary content for in-memory parsing.
- from_page (int): Start page number to parse.
- to_page (int): End page number to parse.
- lang (str): Language of the document (default "Chinese").
- callback (function): Optional callback for progress updates.
- `**kwargs`:
  - parser_config (dict): Configuration for parsing, e.g., chunk size, delimiter, layout recognition mode.

Returns

List of dictionaries representing tokenized document chunks enriched with metadata and layout info.

Detailed Behavior

Determines parser based on file extension (.pdf or .docx).
For PDF:
- Selects between Pdf (manual) or PlainParser depending on config.
- Extracts sections and tables.
- Performs title frequency analysis and bullet categorization to identify section pivots.
- Merges and chunks text based on token counts and section boundaries.
- Tokenizes tables and text chunks using RAG utilities.
For DOCX:
- Uses Docx class to extract question-answer pairs and tables.
- Tokenizes extracted text and images.
Raises NotImplementedError for unsupported types.
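The extension-based parser selection can be pictured as a small dispatch table. The sketch below is an assumption-laden stand-in: `parse_pdf`, `parse_docx`, `PARSERS`, and `dispatch` are hypothetical names, and the placeholder parsers only return a label instead of doing any parsing.

```python
import re
from typing import Callable, Dict, List


def parse_pdf(filename: str) -> List[str]:
    return [f"pdf sections from {filename}"]   # placeholder, no real parsing


def parse_docx(filename: str) -> List[str]:
    return [f"docx sections from {filename}"]  # placeholder, no real parsing


# Map an extension pattern to the parser that handles it.
PARSERS: Dict[str, Callable[[str], List[str]]] = {
    r"\.pdf$": parse_pdf,
    r"\.docx$": parse_docx,
}


def dispatch(filename: str) -> List[str]:
    """Pick a parser by file extension, ignoring case."""
    for pattern, parser in PARSERS.items():
        if re.search(pattern, filename, re.IGNORECASE):
            return parser(filename)
    raise NotImplementedError(f"unsupported file type: {filename}")


print(dispatch("manual.PDF"))
print(dispatch("faq.docx"))
```

Failing loudly on an unknown extension mirrors the documented NotImplementedError, so callers can tell "unsupported" apart from "parsed but empty".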

Usage Example

```
chunks = chunk("sample.pdf", from_page=0, to_page=5, lang="English")
for item in chunks:  # avoid shadowing the chunk function
    print(item["text"])
```
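The merging step described under Detailed Behavior (combining text by token count and section boundary) can be sketched as a greedy pass. This is a simplified illustration, not the logic in manual.py: `merge_sections` and `count_tokens` are hypothetical, a whitespace split stands in for the real token counter, and a boolean flag stands in for the title and bullet analysis.

```python
from typing import List, Tuple


def count_tokens(text: str) -> int:
    """Stand-in token counter: whitespace-separated words."""
    return len(text.split())


def merge_sections(sections: List[Tuple[str, bool]], max_tokens: int = 8) -> List[str]:
    """Greedily merge (text, is_title) sections into chunks.

    A new chunk starts at every title, or when adding the next section
    would push the current chunk past `max_tokens`.
    """
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for text, is_title in sections:
        n = count_tokens(text)
        if current and (is_title or used + n > max_tokens):
            chunks.append("\n".join(current))
            current, used = [], 0
        current.append(text)
        used += n
    if current:
        chunks.append("\n".join(current))
    return chunks


sections = [
    ("1 Safety", True),
    ("Unplug the device before cleaning.", False),
    ("Keep away from water.", False),
    ("2 Setup", True),
    ("Attach the stand.", False),
]
for piece in merge_sections(sections):
    print(repr(piece))
```

Breaking at titles keeps a heading together with the text it introduces, while the token budget caps how much text any one chunk carries.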

Important Implementation Details

PDF OCR and Layout Analysis:
The Pdf class performs OCR with zooming, layout recognition, table extraction, and merges text boxes, providing detailed positional metadata.
DOCX Question-Answer Extraction:
The Docx class uses paragraph-level heuristics to detect question hierarchy and associates following paragraphs and images as answers, supporting multi-level question nesting.
Tokenization and Chunking:
The chunk function tokenizes text and tables with the rag.nlp helpers rag_tokenizer, tokenize_table, and tokenize_chunks, taking document structure and layout into account when preparing data for downstream models.
Table Handling:
Tables are extracted as HTML and tokenized separately to preserve tabular structure in the tokenized representation.
Progress Callbacks:
The parsers support callbacks to report progress messages and percentages, useful for UI or logging.
Image Handling:
Embedded images in DOCX paragraphs are extracted and concatenated vertically when multiple images are found in sequence.
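As a rough picture of the table handling above, the following sketch renders rows of cell text as a simple HTML table. It is a minimal stand-in under stated assumptions: `rows_to_html` is a hypothetical helper, the input is a plain list of string rows rather than a DOCX table object, and the exact markup produced by manual.py may differ.

```python
from html import escape
from typing import List


def rows_to_html(rows: List[List[str]]) -> str:
    """Render rows of cell text as a minimal HTML table, escaping each cell."""
    lines = ["<table>"]
    for row in rows:
        cells = "".join(f"<td>{escape(cell)}</td>" for cell in row)
        lines.append(f"<tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


print(rows_to_html([["Part", "Qty"], ["Bolt <M4>", "8"]]))
```

Escaping the cell text keeps characters such as `<` from being read as markup when the table is tokenized or displayed later.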

Interactions with Other Components

Imports from other modules:
- api.db.ParserType: Enum for parser types.
- rag.nlp: NLP utilities for tokenization and document structure analysis.
- rag.utils: Utility for token counting.
- deepdoc.parser: Base classes for PDF and DOCX parsing (PdfParser, PlainParser, DocxParser).
- docx.Document: For reading DOCX files.
- PIL.Image: For image extraction and manipulation.
The parsers rely on deepdoc.parser base classes for core parsing logic and extend/customize functionality.
The tokenization and chunking process integrates tightly with rag.nlp utilities, which provide NLP-specific processing needed for RAG pipelines.

Visual Diagram

```mermaid
classDiagram
    class Pdf {
        +__init__()
        +__call__(filename, binary, from_page, to_page, zoomin, callback)
        -__images__(...)
        -_layouts_rec(zoomin)
        -_table_transformer_job(zoomin)
        -_text_merge()
        -_extract_table_figure(...)
        -_concat_downward()
        -_filter_forpages()
        +get_position(box, zoomin)
        -boxes: list
        -outlines: list
    }
    Pdf --|> PdfParser

    class Docx {
        +__init__()
        +get_picture(document, paragraph)
        +concat_img(img1, img2)
        +__call__(filename, binary, from_page, to_page, callback)
        -doc: Document
    }
    Docx --|> DocxParser

    class chunk {
        +chunk(filename, binary, from_page, to_page, lang, callback, **kwargs)
    }
```

Summary

manual.py parses PDF and DOCX files into structured text, tables, and images with layout and positional metadata. It is designed for NLP pipelines that need tokenized, chunked document data, and it covers OCR, layout recognition, and hierarchical question-answer extraction. The file builds on separate modules for parsing, tokenization, and image processing, and exposes a single chunk function as its main entry point.
