book.py
Overview
The book.py file is a core document parsing utility within the InfiniFlow ecosystem, designed to handle multiple document formats (PDF, DOCX, TXT, HTML, DOC) and produce segmented, tokenized chunks of text suitable for downstream NLP tasks. It integrates advanced layout and table analysis for PDFs, and employs heuristics and natural language processing techniques to clean, merge, and structure document content. This enables efficient processing of long documents or books by extracting meaningful text sections and associated tables, optimizing for text chunking and indexing.
The file primarily exposes:
A specialized Pdf class that extends the base PDF parser with OCR, layout analysis, and table extraction.
A versatile chunk function that orchestrates parsing of the supported document formats, then cleans, segments, and tokenizes the content into structured chunks for language-model consumption or document retrieval systems.
Detailed Documentation
1. Class: Pdf
The Pdf class extends the PdfParser class from deepdoc.parser. It specializes in PDF parsing by:
Performing OCR on specified pages.
Conducting detailed layout and table analysis.
Merging extracted text boxes intelligently.
Returning annotated text chunks alongside extracted tables.
Method: __call__
__call__(self, filename, binary=None, from_page=0, to_page=100000, zoomin=3, callback=None)
Parameters:
filename (str): Path or identifier of the PDF file.
binary (bytes or None): Optional binary content of the PDF. If provided, it is used instead of reading from filename.
from_page (int): Start page index (inclusive) for OCR and parsing.
to_page (int): End page index (exclusive) for OCR and parsing.
zoomin (int): Zoom factor for OCR and layout analysis to improve accuracy.
callback (Callable or None): Optional progress callback. Called with a progress float or message strings.
Returns:
List[Tuple[str, str]]: List of tuples, each containing the text extracted from a bounding box with appended line tags, and an optional layout number.
List: Tables extracted from the PDF.
Usage Example:
# Assumes Pdf has already been imported from book.py.
pdf_parser = Pdf()
texts, tables = pdf_parser("document.pdf", from_page=0, to_page=10, zoomin=2, callback=print)
for text, layout_no in texts:
    print(f"Layout {layout_no}: {text[:100]}...")
Implementation Details:
Uses OCR to extract image-based text from PDF pages.
Calls multiple internal methods (_layouts_rec, _table_transformer_job, _text_merge, _extract_table_figure, _naive_vertical_merge, _filter_forpages, _merge_with_same_bullet) to progressively analyze and refine layout and table data.
Incorporates timing and progress reporting via callback.
Returns processed text "boxes" with layout metadata and a list of extracted tables.
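The callback contract described above (a progress float, a message string, or both) can be sketched as follows. This is a minimal illustration rather than the project's own callback: the helper name make_progress_logger and the (prog, msg) parameter names are assumptions chosen for the example, not a signature confirmed by book.py.

```python
from typing import Callable, List, Optional, Tuple


def make_progress_logger() -> Tuple[Callable[..., None], List[str]]:
    """Build a callback that tolerates a float, a string, or both.

    Hypothetical helper: the (prog, msg) parameter names are assumptions,
    not a signature taken from book.py.
    """
    log: List[str] = []

    def callback(prog: Optional[object] = None, msg: str = "") -> None:
        # A lone string in the first position is treated as the message.
        if isinstance(prog, str):
            prog, msg = None, prog
        if prog is not None:
            log.append(f"{float(prog):.0%} {msg}".strip())
        else:
            log.append(msg)

    return callback, log


callback, log = make_progress_logger()
callback(0.1, "OCR started")
callback("Layout analysis finished")
callback(0.67)
print(log)
```

Collecting messages in a list instead of printing them directly makes the same callback usable from a UI thread or a test, which is one reason to prefer it over passing print.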
2. Function: chunk
chunk(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", callback=None, **kwargs)
Purpose:
Parse and segment documents of various formats into meaningful, tokenized text chunks with associated tables. Supports .docx, .pdf, .txt, .html, and .doc files.
Parameters:
filename (str): Path or identifier of the document file.
binary (bytes or None): Optional binary content to parse directly.
from_page (int): Start page index for applicable formats (PDF, DOCX).
to_page (int): End page index.
lang (str): Language of the document; affects tokenization and heuristics ("Chinese" by default).
callback (Callable or None): Optional progress callback function.
**kwargs: Additional keyword arguments. Expected key:
    parser_config (dict): Configuration dict with keys:
        chunk_token_num (int): Max tokens per chunk, default 512.
        delimiter (str): Delimiters for splitting sentences, default "\n!?。;!?".
        layout_recognize (str): Layout recognition method, default "DeepDOC".
Returns:
List: List of tokenized text chunks and tables, structured for downstream NLP tasks.
Usage Example:
chunks = chunk("book.pdf", from_page=1, to_page=50, lang="English", callback=print)
# Use a loop variable other than "chunk" so the function is not shadowed.
for piece in chunks:
    print(piece)
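The parser_config defaults listed in the parameters can be made explicit with a small overlay helper. This is a sketch under assumptions: resolve_parser_config is a hypothetical name introduced here, and book.py may resolve its defaults differently; only the three keys and their default values come from the description above.

```python
from typing import Any, Dict, Optional

# Defaults as listed in the parameter description.
DEFAULT_PARSER_CONFIG: Dict[str, Any] = {
    "chunk_token_num": 512,
    "delimiter": "\n!?。;!?",
    "layout_recognize": "DeepDOC",
}


def resolve_parser_config(user_config: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Overlay caller-supplied keys on the documented defaults.

    Hypothetical helper for illustration only.
    """
    resolved = dict(DEFAULT_PARSER_CONFIG)  # copy so the defaults stay untouched
    resolved.update(user_config or {})
    return resolved


config = resolve_parser_config({"chunk_token_num": 256})
print(config["chunk_token_num"], config["layout_recognize"])
# The resolved dict would then be passed as: chunk("book.pdf", parser_config=config)
```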
Functional Workflow:
File type detection using regex on the filename extension.
Parsing based on format:
DOCX: Uses DocxParser to extract sections and tables.
PDF: Uses the Pdf class or PlainParser, depending on the configuration.
TXT: Reads plain text lines.
HTML: Uses HtmlParser.
DOC: Uses the Apache Tika parser.
Content cleaning:
Removes tables of contents heuristically.
Normalizes section titles with make_colon_as_title.
Bullet and hierarchical merging:
Detects bullet styles and merges sections hierarchically or naively.
Tokenization:
Tokenizes tables and text chunks separately.
Language handling:
Adjusts tokenization based on English or non-English settings.
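The first workflow step, choosing a parser from the filename extension, can be sketched with a small regex table. This is an illustration only: pick_parser and the string labels it returns are hypothetical, and the exact patterns used in book.py are not shown in this document.

```python
import re

# Extension patterns paired with the parser named in the workflow above.
# The labels are plain strings for illustration, not the parser classes.
_FORMAT_PATTERNS = [
    (r"\.docx$", "DocxParser"),
    (r"\.pdf$", "Pdf or PlainParser"),
    (r"\.txt$", "plain text reader"),
    (r"\.html$", "HtmlParser"),
    (r"\.doc$", "Tika"),
]


def pick_parser(filename: str) -> str:
    """Return a label for the parser that would handle `filename`."""
    for pattern, parser_name in _FORMAT_PATTERNS:
        if re.search(pattern, filename, re.IGNORECASE):
            return parser_name
    # Mirrors the documented behaviour for unsupported file types.
    raise NotImplementedError(f"Unsupported file type: {filename}")


print(pick_parser("Book.PDF"))
print(pick_parser("notes.docx"))
```

Anchoring each pattern with `$` keeps ".doc" from matching ".docx" files, so the order of the table does not matter.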
Important Implementation Notes:
The function raises NotImplementedError if an unsupported file type is passed.
Makes use of probabilistic sampling and language detection (is_english, random_choices).
The chunking logic attempts to balance chunk size and semantic coherence.
Uses callback for progress reporting during long parsing operations.
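The sampling-based language check mentioned above can be approximated as shown below. This is a stand-in, not the real is_english or random_choices from rag.nlp, whose internals this document does not describe: the helper name looks_english, the ASCII-letter pattern, the sample size, and the 0.8 threshold are all assumptions.

```python
import random
import re
from typing import List


def looks_english(sections: List[str], sample_size: int = 100,
                  threshold: float = 0.8, seed: int = 0) -> bool:
    """Guess whether the text is English from a random sample of sections.

    Hypothetical approximation; every parameter here is an assumption.
    """
    if not sections:
        return False
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    k = min(sample_size, len(sections))
    sample = rng.sample(sections, k)
    # A section counts as English-looking if it starts with an ASCII letter run.
    hits = sum(1 for s in sample if re.match(r"[A-Za-z][A-Za-z ,.'-]*", s.strip()))
    return hits / k > threshold


print(looks_english(["Chapter one begins here."] * 10))
print(looks_english(["第一章 总则"] * 10))
```

Sampling keeps the check cheap on book-length inputs, at the cost of occasionally misjudging documents that mix languages.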
Implementation Details and Algorithms
OCR and Layout Analysis (PDF): The Pdf class performs OCR on image representations of PDF pages, followed by layout recognition to identify text blocks, tables, and figures.
Table Extraction: The method _extract_table_figure extracts tables; these are tokenized separately to preserve structure.
Text Merging: _text_merge and _naive_vertical_merge merge text boxes vertically to reconstruct paragraphs.
Bullet Detection and Hierarchical Merge: Uses detected bullet types to cluster and merge sections into hierarchical chunks, preserving document structure.
Content Cleaning: Uses heuristics to remove tables of contents and treat colon-separated lines as titles.
Tokenizer: Uses rag_tokenizer for tokenization; handles fine-grained tokenization for title keywords.
Multi-format Parsing: Integrates several parsers (DocxParser, PlainParser, HtmlParser, Tika) to support rich document types.
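The balance between chunk size and coherence can be illustrated with a greedy merge that splits sections at the configured delimiters and packs the pieces under a token budget. This is a simplified sketch, not the merging code in rag.nlp: greedy_merge and count_tokens are hypothetical names, whitespace word counts stand in for rag_tokenizer, and the hierarchical (bullet-aware) merge is not reproduced.

```python
import re
from typing import List


def count_tokens(text: str) -> int:
    """Crude token count: whitespace-separated words (stand-in tokenizer)."""
    return len(text.split())


def greedy_merge(sections: List[str], chunk_token_num: int = 512,
                 delimiter: str = "\n!?。;!?") -> List[str]:
    """Split sections at delimiters, then pack pieces into chunks under a budget."""
    pattern = "([" + re.escape(delimiter) + "])"
    chunks: List[str] = []
    current, current_tokens = "", 0
    for section in sections:
        parts = re.split(pattern, section)
        # re.split with a capture group alternates text and delimiter;
        # glue each delimiter back onto the text that precedes it.
        pieces = [parts[i] + (parts[i + 1] if i + 1 < len(parts) else "")
                  for i in range(0, len(parts), 2)]
        for piece in pieces:
            piece = piece.strip()
            n = count_tokens(piece)
            if n == 0:
                continue
            # Start a new chunk when adding this piece would exceed the budget.
            if current and current_tokens + n > chunk_token_num:
                chunks.append(current.strip())
                current, current_tokens = "", 0
            current += piece + " "
            current_tokens += n
    if current.strip():
        chunks.append(current.strip())
    return chunks


sample = ["One two three. Four five!", "Six seven? Eight nine ten"]
print(greedy_merge(sample, chunk_token_num=5))
```

A single piece longer than the budget is kept whole rather than cut mid-sentence, which favours coherence over a strict size limit.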
Interaction with Other System Components
Imports from deepdoc.parser: Utilizes parsers specialized for PDF, DOCX, plain text, and HTML.
Imports from rag.nlp: Uses NLP utilities for bullet detection, language identification, merging strategies, and tokenization.
Apache Tika: Used for DOC file parsing.
Callback Interface: Designed to integrate with UI or logging frameworks providing real-time progress updates.
Output Consumption: The tokenized chunks and tables output by chunk are intended for indexing, retrieval, or feeding into language models in the broader InfiniFlow system.
Mermaid Class Diagram
classDiagram
class Pdf {
+__call__(filename, binary=None, from_page=0, to_page=100000, zoomin=3, callback=None)
}
PdfParser <|-- Pdf
Summary
book.py is a comprehensive document parsing utility that abstracts away file format differences, implements advanced PDF OCR/layout parsing, and offers an intelligent chunking mechanism. It is a foundational building block in the InfiniFlow project for transforming raw documents into structured, tokenized data for AI-powered document understanding and retrieval.