book.py

Overview

The book.py file is a document parsing utility within the InfiniFlow ecosystem. It handles multiple document formats (PDF, DOCX, TXT, HTML, DOC) and produces segmented, tokenized chunks of text for downstream NLP tasks. For PDFs it runs OCR, layout analysis, and table analysis; for all formats it applies heuristics to clean, merge, and structure the content. The result is a set of text sections and associated tables extracted from long documents or books, ready for chunking and indexing.

The file primarily exposes:

A specialized Pdf class extending PDF parsing capabilities with OCR, layout analysis, and table extraction.
A versatile chunk function that orchestrates the parsing of various document formats, performs content cleaning, segmentation, and tokenization, producing structured chunks optimized for language model consumption or document retrieval systems.

Detailed Documentation

1. Class: `Pdf`

The Pdf class extends the PdfParser class from deepdoc.parser. It specializes in PDF parsing by:

Performing OCR on specified pages.
Conducting detailed layout and table analysis.
Merging extracted text boxes intelligently.
Returning annotated text chunks alongside extracted tables.

Method: `__call__`

__call__(self, filename, binary=None, from_page=0, to_page=100000, zoomin=3, callback=None)

Parameters:

filename (str): Path or identifier of the PDF file.
binary (bytes or None): Optional binary content of the PDF. If provided, this is used instead of reading from filename.
from_page (int): Start page index (inclusive) for OCR and parsing.
to_page (int): End page index (exclusive) for OCR and parsing.
zoomin (int): Zoom factor for OCR and layout analysis to improve accuracy.
callback (Callable or None): Optional progress callback function. Called with progress float or message strings.
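
The callback contract described above (a progress float, a message string, or both) can be satisfied by a small function whose arguments are all optional. The sketch below is an assumption about the calling convention, not code taken from book.py: the names `prog` and `msg` are hypothetical, and the parser may pass them positionally or by keyword. The built-in `print` rejects unknown keyword arguments, so a dedicated function is the safer choice if the message is passed by keyword.

```python
from typing import List, Optional, Tuple

# Hypothetical collector: records every progress event it receives.
events: List[Tuple[Optional[float], str]] = []


def progress_callback(prog: Optional[float] = None, msg: str = "") -> None:
    """Accept a progress fraction, a message, or both (assumed convention)."""
    events.append((prog, msg))
    if prog is not None:
        print(f"[{prog:6.1%}] {msg}")
    else:
        print(f"[  ...  ] {msg}")


# Simulated calls, mimicking what a parser might emit.
progress_callback(msg="OCR started")
progress_callback(0.5, "Layout analysis finished")
```

Keeping both arguments optional means the same function works whether the caller reports only a message, only a fraction, or both.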

Returns:

List[Tuple[str, str]]: List of tuples, each containing text extracted from a bounding box with appended line tags, and an optional layout number.
List: Extracted tables from the PDF.

Usage Example:

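def report(prog=None, msg=""):  # tolerant progress callback (assumed calling convention)
    print(prog, msg)
pdf_parser = Pdf()
texts, tables = pdf_parser("document.pdf", from_page=0, to_page=10, zoomin=2, callback=report)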
for text, layout_no in texts:
    print(f"Layout {layout_no}: {text[:100]}...")

Implementation Details:

Uses OCR to extract image-based text from PDF pages.
Calls multiple internal methods (_layouts_rec, _table_transformer_job, _text_merge, _extract_table_figure, _naive_vertical_merge, _filter_forpages, _merge_with_same_bullet) to progressively analyze and refine layout and table data.
Incorporates timing and progress reporting via callback.
The method returns processed text "boxes" with layout metadata and a list of extracted tables.
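
The staged structure described above (run a step, report the elapsed time, move on) can be sketched without the real parser. Everything here is illustrative: `run_stages` and the stand-in stage functions are hypothetical and mirror only the order of reporting, not the actual OCR or layout logic.

```python
from timeit import default_timer as timer
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[list], list]]


def run_stages(boxes: list, stages: List[Stage],
               callback: Callable[..., None]) -> list:
    """Run each stage in order and report its elapsed time via callback."""
    for index, (name, stage) in enumerate(stages, start=1):
        start = timer()
        boxes = stage(boxes)
        elapsed = timer() - start
        # Progress is reported as the fraction of stages completed.
        callback(index / len(stages), f"{name} finished ({elapsed:.2f}s)")
    return boxes


# Stand-in stages: drop empty boxes, then strip surrounding whitespace.
stages: List[Stage] = [
    ("Filter empty", lambda bs: [b for b in bs if b.strip()]),
    ("Normalize", lambda bs: [b.strip() for b in bs]),
]

messages = []
result = run_stages(["  Chapter 1 ", "", "Body text"], stages,
                    lambda prog, msg: messages.append((prog, msg)))
print(result)
```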

2. Function: `chunk`

chunk(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", callback=None, **kwargs)

Purpose:

Parse and segment documents of various formats into meaningful, tokenized text chunks with associated tables. Supports .docx, .pdf, .txt, .html, and .doc files.

Parameters:

filename (str): Path or identifier of the document file.
binary (bytes or None): Optional binary content to parse directly.
from_page (int): Start page index for applicable formats (PDF, DOCX).
to_page (int): End page index.
lang (str): Language of the document; affects tokenization and heuristics ("Chinese" by default).
callback (Callable or None): Optional progress callback function.
**kwargs: Additional keyword arguments, expected keys:
- parser_config (dict): Configuration dict with keys:
  - chunk_token_num (int): Max tokens per chunk, default 512.
  - delimiter (str): Delimiters for splitting sentences, default "\n!?。；！？".
  - layout_recognize (str): Layout recognition method, default "DeepDOC".
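
One way to read these options is to overlay whatever the caller supplies on the documented defaults. This is a sketch, not the code in book.py: `resolve_parser_config` is a hypothetical helper, and only the three keys and their default values come from the list above.

```python
from typing import Any, Dict

# Defaults as documented above.
DEFAULT_PARSER_CONFIG: Dict[str, Any] = {
    "chunk_token_num": 512,
    "delimiter": "\n!?。；！？",
    "layout_recognize": "DeepDOC",
}


def resolve_parser_config(**kwargs: Any) -> Dict[str, Any]:
    """Overlay a caller-supplied parser_config on the documented defaults."""
    supplied = kwargs.get("parser_config") or {}
    return {**DEFAULT_PARSER_CONFIG, **supplied}


config = resolve_parser_config(parser_config={"chunk_token_num": 256})
print(config["chunk_token_num"], config["layout_recognize"])
```

Merging key by key lets a caller override a single option without restating the others; whether the real function merges or replaces the whole dictionary is not stated in this document.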

Returns:

List: List of tokenized text chunks and tables, structured for downstream NLP tasks.

Usage Example:

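def report(prog=None, msg=""):  # tolerant progress callback (assumed calling convention)
    print(prog, msg)
chunks = chunk("book.pdf", from_page=1, to_page=50, lang="English", callback=report)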
for c in chunks:  # renamed so the loop does not shadow chunk()
    print(c)

Functional Workflow:

File type detection using regex on filename extension.
Parsing based on format:
- DOCX: Uses DocxParser to extract sections and tables.
- PDF: Uses Pdf class or PlainParser per config.
- TXT: Reads plain text lines.
- HTML: Uses HtmlParser.
- DOC: Uses Apache Tika parser.
Content cleaning:
- Removes tables of contents heuristically.
- Normalizes section titles with make_colon_as_title.
Bullet and hierarchical merging:
- Detects bullet styles and merges sections hierarchically or naively.
Tokenization:
- Tokenizes tables and text chunks separately.
Language handling:
- Adjusts tokenization based on English or non-English settings.
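
The first step of the workflow above, choosing a parser from the filename extension, can be sketched as a regex-based lookup. The function `detect_format` and the exact patterns are assumptions for illustration; only the list of supported formats and the use of NotImplementedError for unsupported types come from this document.

```python
import re

# Each pattern is anchored at the end of the name, so ".doc" never matches ".docx".
_FORMAT_PATTERNS = [
    ("docx", re.compile(r"\.docx$", re.IGNORECASE)),
    ("pdf", re.compile(r"\.pdf$", re.IGNORECASE)),
    ("txt", re.compile(r"\.txt$", re.IGNORECASE)),
    ("html", re.compile(r"\.(htm|html)$", re.IGNORECASE)),
    ("doc", re.compile(r"\.doc$", re.IGNORECASE)),
]


def detect_format(filename: str) -> str:
    """Return a format label for filename, or raise for unsupported types."""
    for name, pattern in _FORMAT_PATTERNS:
        if pattern.search(filename):
            return name
    raise NotImplementedError(f"Unsupported file type: {filename}")


for sample in ("book.PDF", "notes.docx", "legacy.doc"):
    print(sample, "->", detect_format(sample))
```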

Important Implementation Notes:

The function raises NotImplementedError if an unsupported file type is passed.
Uses random sampling (random_choices) and language detection (is_english).
The chunking logic attempts to balance chunk size and semantic coherence.
Uses callback for progress reporting during long parsing operations.
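
The trade-off between chunk size and coherence mentioned above can be illustrated with a greedy merge: sections are appended to the current chunk until a token budget would be exceeded. This is a simplified stand-in, not the merging code used by book.py; `greedy_merge` is hypothetical, and it counts whitespace-separated words where the real pipeline would use its own tokenizer.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Crude token count: whitespace-separated words (an assumption)."""
    return len(text.split())


def greedy_merge(sections: List[str], chunk_token_num: int = 512) -> List[str]:
    """Pack consecutive sections into chunks within a token budget.

    A single section larger than the budget becomes its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for section in sections:
        n = count_tokens(section)
        # Close the current chunk if adding this section would overflow it.
        if current and current_tokens + n > chunk_token_num:
            chunks.append("\n".join(current))
            current, current_tokens = [], 0
        current.append(section)
        current_tokens += n
    if current:
        chunks.append("\n".join(current))
    return chunks


sections = ["one two three", "four five", "six seven eight nine", "ten"]
print(greedy_merge(sections, chunk_token_num=5))
```

Because sections are never split, chunk boundaries always fall between sections, which keeps each section intact at the cost of uneven chunk sizes.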

Implementation Details and Algorithms

OCR and Layout Analysis (PDF): The Pdf class performs OCR on image renderings of PDF pages, then runs layout recognition (_layouts_rec) to identify text blocks, tables, and figures.
Table Extraction: The method _extract_table_figure extracts tables; these are tokenized separately to preserve structure.
Text Merging: _text_merge and _naive_vertical_merge merge text boxes vertically to reconstruct paragraphs.
Bullet Detection and Hierarchical Merge: Uses detected bullet types to cluster and merge sections into hierarchical chunks, preserving document structure.
Content Cleaning: Uses heuristics to remove tables of contents and treat colon-separated lines as titles.
Tokenizer: Uses rag_tokenizer for tokenization; handles fine-grained tokenization for title keywords.
Multi-format Parsing: Integrates several parsers (DocxParser, PlainParser, HtmlParser, Tika) to support rich document types.
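
To make the text-merging idea concrete, here is a toy vertical merge over text boxes. It is not the `_naive_vertical_merge` implementation: the `Box` fields, the `max_gap` threshold, and the rule itself (join a box to the previous one when the vertical gap is small and the previous text does not end a sentence) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    text: str
    top: float      # y-coordinate of the upper edge
    bottom: float   # y-coordinate of the lower edge


SENTENCE_END = (".", "!", "?", "。", "！", "？")


def vertical_merge(boxes: List[Box], max_gap: float = 5.0) -> List[Box]:
    """Join boxes that sit close together and continue the same sentence."""
    merged: List[Box] = []
    for box in sorted(boxes, key=lambda b: b.top):
        prev = merged[-1] if merged else None
        if (prev is not None
                and box.top - prev.bottom <= max_gap
                and not prev.text.rstrip().endswith(SENTENCE_END)):
            # Continue the previous paragraph and extend its lower edge.
            prev.text = f"{prev.text.rstrip()} {box.text.lstrip()}"
            prev.bottom = box.bottom
        else:
            # Copy the box so the caller's input is left unmodified.
            merged.append(Box(box.text, box.top, box.bottom))
    return merged


boxes = [
    Box("The parser reads each", 0, 10),
    Box("page in order.", 12, 22),
    Box("A new paragraph starts here.", 40, 50),
]
for b in vertical_merge(boxes):
    print(b.text)
```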

Interaction with Other System Components

Imports from deepdoc.parser: Utilizes parsers specialized for PDF, DOCX, plain text, and HTML.
Imports from rag.nlp: Uses NLP utilities for bullet detection, language identification, merging strategies, and tokenization.
Apache Tika: Used for DOC file parsing.
Callback Interface: Designed to integrate with UI or logging frameworks providing real-time progress updates.
Output Consumption: The tokenized chunks and tables output by chunk are intended for indexing, retrieval, or feeding into language models in the broader InfiniFlow system.

Mermaid Class Diagram

classDiagram
    class Pdf {
        +__call__(filename, binary=None, from_page=0, to_page=100000, zoomin=3, callback=None)
    }

    PdfParser <|-- Pdf

Summary

book.py is a comprehensive document parsing utility that abstracts away file format differences, implements advanced PDF OCR/layout parsing, and offers an intelligent chunking mechanism. It is a foundational building block in the InfiniFlow project for transforming raw documents into structured, tokenized data for AI-powered document understanding and retrieval.

