manual.py
Overview
manual.py is a document parsing and chunking utility designed primarily for processing PDF and DOCX files. It implements specialized parsers to extract textual and structural content, including tables and images, and converts the content into tokenized chunks suitable for downstream NLP tasks such as retrieval-augmented generation (RAG). The file supports layout and table recognition, OCR for PDFs, and hierarchical question-answer style extraction for DOCX files. It provides a unified chunk function that serves as an entry point to parse supported document types and produce tokenized representations enriched with layout and positional metadata.
Classes and Functions
Class: Pdf(PdfParser)
Purpose
Extends the base PdfParser class from deepdoc.parser to implement manual parsing logic for PDF files, including OCR, layout analysis, table extraction, and text merging.
Methods
__init__(self)
Sets the parser type to ParserType.MANUAL and calls the superclass initializer.
__call__(self, filename, binary=None, from_page=0, to_page=100000, zoomin=3, callback=None)
Parameters:
filename (str): Path to the PDF file.
binary (bytes, optional): Binary content of the PDF, if already loaded.
from_page (int): Starting page number for parsing (inclusive).
to_page (int): Ending page number for parsing (exclusive).
zoomin (int): Zoom factor for image rendering during OCR.
callback (function): Optional progress callback accepting (progress, msg) or msg only.
Returns:
Tuple:
List of tuples (text, layoutno, position) representing extracted text boxes.
List of extracted tables with layout information.
Functionality:
Runs OCR on selected pages.
Performs layout analysis and table recognition.
Merges text boxes and filters by pages.
Cleans extracted text by removing excessive whitespace.
Returns structured text and tables.
Usage Example:
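# The callback may receive (progress, msg) or msg only, so accept both.
def report(progress=None, msg=""):
    print(progress, msg)

pdf_parser = Pdf()
text_boxes, tables = pdf_parser("example.pdf", from_page=0, to_page=10, callback=report)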
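The whitespace cleanup step listed under Functionality can be pictured with a small sketch. The helper names `squeeze_whitespace` and `clean_boxes` and the exact pattern are assumptions made for illustration; the parser itself may use a different expression.

```python
import re

# Hypothetical pattern: a run of two or more spaces, tabs, or
# full-width (U+3000) spaces.
_MULTI_SPACE = re.compile(r"[ \t\u3000]{2,}")


def squeeze_whitespace(text: str) -> str:
    """Trim the ends and collapse each whitespace run into one space."""
    return _MULTI_SPACE.sub(" ", text.strip())


def clean_boxes(boxes):
    """Apply the cleanup to (text, layoutno, position) tuples,
    leaving layoutno and position untouched."""
    return [(squeeze_whitespace(text), layoutno, position)
            for text, layoutno, position in boxes]


if __name__ == "__main__":
    # "pos-0" is a placeholder; the real position format is not shown here.
    print(clean_boxes([("  Install\t\tthe   unit ", "text-0", "pos-0")]))
```

Only the text element changes, so the layout number and position metadata stay aligned with the cleaned text.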
Class: Docx(DocxParser)
Purpose
A specialized DOCX parser extending DocxParser to extract paragraphs, images, and tables from DOCX documents while preserving question-answer style hierarchies and embedded images.
Methods
__init__(self)
Empty initializer; no special setup.
get_picture(self, document, paragraph)
Parameters:
document (docx.Document): The loaded DOCX document.
paragraph (docx.paragraph.Paragraph): The paragraph from which to extract images.
Returns:
PIL.Image.Image object if an image is found, else None.
Description:
Extracts the first embedded image from the paragraph using XML parsing and returns it as a PIL Image.
concat_img(self, img1, img2)
Parameters:
img1 (PIL.Image.Image or None): First image.
img2 (PIL.Image.Image or None): Second image.
Returns:
The vertical concatenation of img1 and img2, or whichever of the two is not None.
Description:
Vertically concatenates two images, handling None cases gracefully.
__call__(self, filename, binary=None, from_page=0, to_page=100000, callback=None)
Parameters:
filename (str): Path to the DOCX file.
binary (bytes, optional): Binary content of the DOCX file.
from_page (int): Starting page number (used heuristically).
to_page (int): Ending page number.
callback (function): Optional progress reporting callback.
Returns:
Tuple:
List of tuples (text_with_question_hierarchy, image) representing text blocks with associated images.
List of tables as HTML representations with placeholders.
Functionality:
Iterates through paragraphs, identifying questions by hierarchy level using docx_question_level.
Builds a stack of questions and accumulates answers and images.
Extracts and converts DOCX tables to simple HTML tables.
Returns hierarchical question-answer pairs and associated images, plus tables.
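The question stack described above can be sketched as follows. This is a simplified illustration, not the parser's own code: it assumes each paragraph already carries a precomputed level (0 for body text, 1 or more for a question at that depth) in place of whatever docx_question_level reports, and it leaves images out. The function name `group_by_question` is hypothetical.

```python
from typing import List, Tuple


def group_by_question(paragraphs: List[Tuple[int, str]]) -> List[str]:
    """Group body text under the chain of questions that encloses it.

    paragraphs: (level, text) pairs in document order. Level 0 marks
    body text (an answer); level >= 1 marks a question at that depth.
    """
    question_stack: List[str] = []
    level_stack: List[int] = []
    answer_lines: List[str] = []
    blocks: List[str] = []

    def flush() -> None:
        # Emit the pending answer together with its enclosing questions.
        # Body text that appears before any question is dropped.
        if answer_lines and question_stack:
            blocks.append("\n".join(question_stack + answer_lines))
        answer_lines.clear()

    for level, text in paragraphs:
        if level == 0:
            answer_lines.append(text)
            continue
        flush()
        # Pop questions at the same or a deeper level before pushing.
        while level_stack and level <= level_stack[-1]:
            level_stack.pop()
            question_stack.pop()
        level_stack.append(level)
        question_stack.append(text)
    flush()
    return blocks


if __name__ == "__main__":
    sample = [
        (1, "1. Installation"),
        (2, "1.1 Requirements"),
        (0, "Python 3.10 or newer."),
        (2, "1.2 Setup"),
        (0, "Run the installer."),
        (1, "2. Usage"),
        (0, "Start the service."),
    ]
    for block in group_by_question(sample):
        print(block, end="\n---\n")
```

Each emitted block starts with the full chain of enclosing questions, so an answer keeps its context even when it is read in isolation.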
Usage Example:
docx_parser = Docx()
qa_list, tables = docx_parser("example.docx")
for text, image in qa_list:
    print(text)
    if image:
        image.show()
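The conversion of DOCX tables to simple HTML can be illustrated with a minimal sketch. It assumes the table has already been read into a list of rows of cell strings, and it ignores merged cells; the function name `rows_to_html` is hypothetical and is not part of the module.

```python
from html import escape
from typing import List


def rows_to_html(rows: List[List[str]]) -> str:
    """Render rows of cell text as a plain HTML table, escaping each cell."""
    lines = ["<table>"]
    for row in rows:
        cells = "".join(f"<td>{escape(cell)}</td>" for cell in row)
        lines.append(f"<tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(rows_to_html([["Part", "Qty"], ["Bolt <M6>", "4"]]))
```

Escaping the cell text keeps characters such as < and & from being read as markup when the table is tokenized later.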
Function: chunk(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", callback=None, **kwargs)
Purpose
Main utility function to parse and tokenize documents of supported types (PDF and DOCX). It handles configuration, parser selection, tokenization, and layout analysis.
Parameters
filename (str): Path to the document file.
binary (bytes, optional): Binary content for in-memory parsing.
from_page (int): Start page number to parse.
to_page (int): End page number to parse.
lang (str): Language of the document (default "Chinese").
callback (function): Optional callback for progress updates.
**kwargs:
parser_config (dict): Configuration for parsing, e.g., chunk size, delimiter, layout recognition mode.
Returns
List of dictionaries representing tokenized document chunks enriched with metadata and layout info.
Detailed Behavior
Determines the parser based on the file extension (.pdf or .docx).
For PDF:
Selects either Pdf (manual) or PlainParser, depending on the configuration.
Extracts sections and tables.
Performs title frequency analysis and bullet categorization to identify section pivots.
Merges and chunks text based on token counts and section boundaries.
Tokenizes tables and text chunks using RAG utilities.
For DOCX:
Uses the Docx class to extract question-answer pairs and tables.
Tokenizes extracted text and images.
Raises NotImplementedError for unsupported types.
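The parser selection described above can be sketched as a small dispatch on the file extension. This is an illustration only: the function name `pick_parser` and the boolean `layout_recognize` flag are hypothetical stand-ins for the real configuration lookup, and the sketch returns parser names as strings instead of instantiating the classes.

```python
import re


def pick_parser(filename: str, layout_recognize: bool = True) -> str:
    """Return the name of the parser that would handle the file.

    layout_recognize is a hypothetical flag standing in for the layout
    recognition setting carried by parser_config.
    """
    if re.search(r"\.pdf$", filename, re.IGNORECASE):
        return "Pdf" if layout_recognize else "PlainParser"
    if re.search(r"\.docx$", filename, re.IGNORECASE):
        return "Docx"
    raise NotImplementedError(f"Unsupported file type: {filename}")


if __name__ == "__main__":
    for name in ("guide.pdf", "guide.DOCX"):
        print(name, "->", pick_parser(name))
```

Matching the extension case-insensitively means names such as guide.PDF take the same path as guide.pdf.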
Usage Example
chunks = chunk("sample.pdf", from_page=0, to_page=5, lang="English")
# Use a loop name other than "chunk" so the function is not shadowed.
for item in chunks:
    print(item["text"])
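The merge step listed under Detailed Behavior, which groups text by token count and section boundary, can be approximated with a greedy sketch. Everything here is an assumption made for illustration: `count_tokens` is a whitespace-based stand-in for the real token counter, `merge_sections` and its `max_tokens` budget are hypothetical, and the actual thresholds and merge rules in the module may differ.

```python
from typing import List, Optional, Tuple


def count_tokens(text: str) -> int:
    # Stand-in token counter: whitespace-separated words.
    return len(text.split())


def merge_sections(sections: List[Tuple[str, int]],
                   max_tokens: int = 8) -> List[str]:
    """Greedily merge consecutive texts into chunks.

    sections: (text, section_id) pairs in reading order. A new chunk
    starts when the section id changes or when adding the next text
    would exceed max_tokens.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    current_id: Optional[int] = None
    for text, section_id in sections:
        n = count_tokens(text)
        over_budget = current_tokens + n > max_tokens
        if current and (section_id != current_id or over_budget):
            chunks.append("\n".join(current))
            current, current_tokens = [], 0
        current.append(text)
        current_tokens += n
        current_id = section_id
    if current:
        chunks.append("\n".join(current))
    return chunks


if __name__ == "__main__":
    sample = [
        ("Safety notes", 1),
        ("Wear gloves at all times", 1),
        ("Unplug the device before cleaning it", 1),
        ("Maintenance", 2),
        ("Replace the filter monthly", 2),
    ]
    for piece in merge_sections(sample):
        print(piece, end="\n---\n")
```

Closing a chunk at every section change keeps a heading and its body together and prevents text from two sections from ending up in the same chunk.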
Important Implementation Details
PDF OCR and Layout Analysis:
The Pdf class performs OCR with zooming, layout recognition, and table extraction, then merges text boxes, providing detailed positional metadata.
DOCX Question-Answer Extraction:
The Docx class uses paragraph-level heuristics to detect question hierarchy and associates the following paragraphs and images as answers, supporting multi-level question nesting.
Tokenization and Chunking:
The chunk function uses the tokenization helpers rag_tokenizer, tokenize_table, and tokenize_chunks to prepare document data for machine learning models, taking document structure and layout into account.
Table Handling:
Tables are extracted as HTML and tokenized separately to preserve tabular structure in the tokenized representation.
Progress Callbacks:
The parsers support callbacks to report progress messages and percentages, useful for UI or logging.
Image Handling:
Embedded images in DOCX paragraphs are extracted and concatenated vertically when multiple images are found in sequence.
Interactions with Other Components
Imports from other modules:
api.db.ParserType: Enum for parser types.
rag.nlp: NLP utilities for tokenization and document structure analysis.
rag.utils: Utility for token counting.
deepdoc.parser: Base classes for PDF and DOCX parsing (PdfParser, PlainParser, DocxParser).
docx.Document: For reading DOCX files.
PIL.Image: For image extraction and manipulation.
The parsers rely on deepdoc.parser base classes for core parsing logic and extend/customize functionality.
The tokenization and chunking process integrates tightly with rag.nlp utilities, which provide NLP-specific processing needed for RAG pipelines.
Visual Diagram
classDiagram
class Pdf {
+__init__()
+__call__(filename, binary, from_page, to_page, zoomin, callback)
-__images__(...)
-_layouts_rec(zoomin)
-_table_transformer_job(zoomin)
-_text_merge()
-_extract_table_figure(...)
-_concat_downward()
-_filter_forpages()
+get_position(box, zoomin)
-boxes: list
-outlines: list
}
Pdf --|> PdfParser
class Docx {
+__init__()
+get_picture(document, paragraph)
+concat_img(img1, img2)
+__call__(filename, binary, from_page, to_page, callback)
-doc: Document
}
Docx --|> DocxParser
class chunk {
+chunk(filename, binary, from_page, to_page, lang, callback, **kwargs)
}
Summary
The manual.py file provides advanced document parsing capabilities tailored for PDF and DOCX formats, focusing on extracting structured text, tables, and images with layout and semantic information. It is designed for integration into NLP pipelines that require tokenized, chunked document data enriched with positional and hierarchical metadata, supporting complex workflows like OCR, layout recognition, and question-answer extraction. The file interfaces with various modules for parsing, tokenization, and image processing and exposes a simple chunk function as the main API for document processing.