paper.py
Overview
The paper.py file is a core utility module designed for extracting, parsing, and chunking content from academic papers in PDF format. It leverages Optical Character Recognition (OCR), layout analysis, table detection, and advanced natural language processing (NLP) tokenization techniques to convert raw PDF documents into structured data representations suitable for downstream tasks such as information retrieval, summarization, or semantic search.
The primary functionalities include:
Parsing PDF documents with detailed layout and table recognition.
Extracting key metadata such as title, authors, abstract, and document sections.
Tokenizing and chunking text content intelligently based on document structure and language.
Supporting callback-driven progress reporting for integration in GUI or pipeline workflows.
This file depends on external libraries and modules for OCR and parsing (deepdoc.parser), NLP tokenization (rag.nlp), and database constants (api.db).
Classes and Functions
Class Pdf
Inheritance
Inherits from PdfParser (imported from deepdoc.parser).
Purpose
Extends the base PdfParser class to specialize it for parsing academic papers. It implements a callable interface to process PDF files, perform OCR, layout and table analysis, text merging, and extract structured document elements (title, authors, abstract, sections, tables).
Constructor: __init__(self)
Initializes the parser type to ParserType.PAPER.value indicating this parser is specialized for paper documents.
Calls the base class constructor.
Callable Method: __call__(self, filename, binary=None, from_page=0, to_page=100000, zoomin=3, callback=None)
Description
Processes a PDF file (or binary content) from from_page to to_page with a zoom factor zoomin. Performs multiple stages including OCR, layout recognition, table detection, and text merging. Extracts and returns structured content including title, authors, abstract, sections, and tables.
Parameters
filename (str): Path to the PDF file.
binary (optional): PDF content in binary form; if provided, used instead of filename.
from_page (int): Starting page number to parse (default 0).
to_page (int): Ending page number to parse (default 100000, effectively no limit).
zoomin (int): Zoom factor for image extraction and OCR (default 3).
callback (callable): Optional function to report progress and messages; called with msg, or with a progress float and a message.
Returns
A dictionary with the following keys:
"title" (str): Extracted paper title.
"authors" (str): Concatenated author names.
"abstract" (str): Extracted abstract text.
"sections" (list of tuples): Each tuple contains (text, layoutno), representing body text and its layout classification.
"tables" (list): Extracted table and figure data.
Usage Example
pdf_parser = Pdf()
result = pdf_parser("sample_paper.pdf", from_page=0, to_page=10, callback=lambda prog=None, msg="": print(prog, msg))
print(result["title"])
print(result["authors"])
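The callback contract described in the parameter list (a msg value alone, or a progress float followed by a message) can be served by one small adapter. The sketch below is an illustration only: ProgressLog and its events field are hypothetical names that do not come from paper.py, and the two call styles shown are an assumption based on the parameter description.

```python
from typing import List, Optional, Tuple


class ProgressLog:
    """Hypothetical callback that records (progress, message) pairs."""

    def __init__(self) -> None:
        self.events: List[Tuple[Optional[float], str]] = []

    def __call__(self, prog: Optional[float] = None, msg: str = "") -> None:
        # Works for callback(msg="...") and for callback(0.6, "...").
        self.events.append((prog, msg))


log = ProgressLog()
log(msg="OCR started")            # message-only style
log(0.6, "Layout analysis done")  # progress-plus-message style
```

An instance of such a class could be passed as the callback argument, which keeps the progress history available to a GUI or pipeline after parsing finishes.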
Important Implementation Details
Uses time measurements to track each processing phase.
Employs regex to detect key sections like introduction, abstract, and keywords.
Extracts title and authors heuristically by scanning top boxes and validating against section keywords.
Applies column-based sorting heuristics to handle two-column layouts.
Cleans extracted text by removing excessive whitespace characters.
Extracts tables via the _extract_table_figure method.
Handles partial page ranges differently (if from_page > 0, returns fewer metadata fields).
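The regex-based section detection and whitespace cleanup listed above can be sketched as follows. This is a minimal illustration, not the patterns used in paper.py: the keyword list, the optional leading section number, and the helper names section_kind and clean_text are all assumptions.

```python
import re
from typing import Optional

# Assumed keyword list; the real patterns in paper.py may differ.
SECTION_RE = re.compile(
    r"^\s*(?:\d+[.\s]+)?(abstract\b|introduction\b|keywords?\b|摘要|引言|关键词)",
    re.IGNORECASE,
)


def section_kind(line: str) -> Optional[str]:
    """Return the matched section keyword in lower case, or None."""
    match = SECTION_RE.match(line)
    return match.group(1).lower() if match else None


def clean_text(text: str) -> str:
    """Collapse every run of whitespace into a single space."""
    return re.sub(r"\s+", " ", text).strip()
```

The word boundary is applied only to the English keywords because Chinese text has no spaces between words, so a boundary check after 摘要 would reject valid lines.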
Function chunk(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", callback=None, **kwargs)
Purpose
Processes a PDF paper file and slices its content into semantically meaningful chunks for downstream NLP processing. The abstract is preserved as a whole chunk. Supports different parsing configurations and layout recognition modes.
Parameters
filename (str): Path to the PDF file.
binary (optional): PDF content as binary.
from_page (int): Starting page number (default 0).
to_page (int): Ending page number (default 100000).
lang (str): Language of the document, affects tokenization (default "Chinese").
callback (callable): Optional progress/message callback function.
**kwargs: Additional keyword arguments, supporting:
parser_config (dict): Configuration dict with keys like "chunk_token_num", "delimiter", and "layout_recognize".
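As a rough illustration of how a parser_config passed through **kwargs might be resolved, the sketch below merges caller values over a set of defaults. The default values and the helper name resolve_parser_config are hypothetical; this document does not state the defaults that paper.py uses, nor the accepted type of "layout_recognize".

```python
from typing import Any, Dict

# Hypothetical defaults for illustration; the values used by paper.py are not
# stated in this document.
ASSUMED_DEFAULTS: Dict[str, Any] = {
    "chunk_token_num": 128,
    "delimiter": "\n",
    "layout_recognize": True,
}


def resolve_parser_config(**kwargs: Any) -> Dict[str, Any]:
    """Merge a caller-supplied parser_config over the assumed defaults."""
    user_config = kwargs.get("parser_config") or {}
    return {**ASSUMED_DEFAULTS, **user_config}


config = resolve_parser_config(parser_config={"chunk_token_num": 256})
```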
Returns
A list of dictionaries, each representing a chunk of document text with associated tokenized content and positional metadata.
Usage Example
chunks = chunk("sample_paper.pdf", from_page=0, to_page=5, lang="English")
for c in chunks:
    print(c["title_tks"], c["image"])
Implementation Highlights
Selects between PlainParser and the custom Pdf parser based on config.
Extracts metadata and tokenizes authors, title, and tables.
Uses specialized tokenization functions from rag.nlp for tables and sections.
Applies heuristics to detect and group sections using bullet patterns and title frequencies.
Merges text chunks based on section hierarchy to form semantically coherent chunks.
Abstract chunk is treated specially to avoid slicing it into smaller parts.
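The grouping and abstract handling described above can be sketched in a few lines. This is a simplified stand-in, not the algorithm in paper.py: it assumes each section already carries a numeric level (1 for a heading, larger for body text), whereas the real code derives levels from bullet patterns and title frequencies. The names build_chunks, "filename", and "content" are hypothetical.

```python
from copy import deepcopy
from typing import Dict, List, Tuple


def build_chunks(doc: Dict[str, str], abstract: str,
                 sections: List[Tuple[str, int]]) -> List[Dict[str, str]]:
    """Group body text under its nearest heading; keep the abstract whole.

    `sections` holds (text, level) pairs where level 1 marks a heading and
    larger numbers mark body text. This level scheme is an assumption.
    """
    chunks: List[Dict[str, str]] = []
    if abstract:
        d = deepcopy(doc)  # each chunk gets its own copy of the metadata
        d["content"] = abstract
        chunks.append(d)

    groups: List[List[str]] = []
    for text, level in sections:
        if level == 1 or not groups:
            groups.append([])  # a heading (or the first block) opens a group
        groups[-1].append(text)

    for group in groups:
        d = deepcopy(doc)
        d["content"] = "\n".join(group)
        chunks.append(d)
    return chunks


meta = {"filename": "sample_paper.pdf", "title": "A Sample Paper"}
chunks = build_chunks(
    meta,
    "We study X.",
    [("1 Introduction", 1), ("X matters.", 2), ("2 Method", 1), ("We do Y.", 2)],
)
```

Copying the metadata for every chunk means a later change to one chunk cannot leak into another or into the shared metadata dictionary.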
Important Implementation Details and Algorithms
OCR and Layout Analysis Pipeline: The Pdf class orchestrates OCR image extraction, layout analysis, and table recognition sequentially, timing each step and reporting through callbacks.
Two-Column Layout Handling: Uses median column width heuristics to detect and sort text boxes appropriately when a two-column layout is detected.
Section Detection: Uses regular expressions to identify section titles, abstracts, and keywords by matching common terms in multiple languages (e.g., English and Chinese).
Tokenization and Chunking: Combines coarse and fine-grained tokenization strategies. Uses bullet categories and title frequency analysis to segment sections and merge logically related text blocks.
Deep Copying for Tokenization: Deep copies of document metadata are created for each chunk to maintain isolated tokenization states.
Callback Mechanism: Allows external code to track processing progress and display messages, facilitating integration into interactive or batch systems.
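To make the two-column heuristic more concrete, here is one possible sketch of median-width detection followed by column-aware sorting. It is an assumption-laden illustration: the box fields (x0, x1, top), the half-page threshold, and the function name sort_reading_order are hypothetical, and the standard-library statistics.median is used here in place of the numpy median that the module relies on.

```python
from statistics import median
from typing import Dict, List

Box = Dict[str, float]  # hypothetical box shape: x0, x1, top


def sort_reading_order(boxes: List[Box], page_width: float) -> List[Box]:
    """Order boxes top-to-bottom, reading the left column before the right.

    A page is treated as two-column when the median box width is under half
    the page width. That threshold is an assumption for this sketch.
    """
    if not boxes:
        return []
    median_width = median(b["x1"] - b["x0"] for b in boxes)
    if median_width >= page_width / 2:
        return sorted(boxes, key=lambda b: b["top"])  # single column

    def column(b: Box) -> int:
        centre = (b["x0"] + b["x1"]) / 2
        return 0 if centre < page_width / 2 else 1

    return sorted(boxes, key=lambda b: (column(b), b["top"]))


boxes = [
    {"x0": 310, "x1": 590, "top": 50},   # right column, top
    {"x0": 10, "x1": 290, "top": 400},   # left column, lower
    {"x0": 10, "x1": 290, "top": 60},    # left column, top
]
ordered = sort_reading_order(boxes, page_width=600)
```

Using the median rather than the mean keeps a few full-width boxes, such as a title or a wide figure caption, from masking an otherwise two-column page.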
Interaction with Other System Components
api.db.ParserType: Provides enumeration constants used to specify parser types.
rag.nlp module: Supplies tokenizer functions (rag_tokenizer.tokenize, tokenize_table, add_positions, etc.) crucial for NLP processing of extracted text.
deepdoc.parser module: Supplies base parser classes PdfParser and PlainParser for PDF content reading and basic parsing.
numpy: Used for numerical operations, specifically median calculation for layout heuristics.
Logging and Regex: For debugging and text pattern matching within the document.
The paper.py file acts as a bridge between raw PDF input and structured, tokenized output, making it a central component in the document ingestion pipeline of the InfiniFlow system.
Mermaid Class Diagram
classDiagram
class Pdf {
- model_speciess: ParserType
+ __init__()
+ __call__(filename, binary=None, from_page=0, to_page=100000, zoomin=3, callback=None)
}
class PdfParser {
<<abstract>>
}
Pdf --|> PdfParser
Pdf extends PdfParser.
Pdf exposes a callable interface to parse and extract paper content.
Summary
The paper.py file is a specialized PDF parser and chunking utility tailored for academic papers. It combines OCR, layout detection, table extraction, and NLP tokenization to transform raw PDFs into structured, tokenized document representations. The Pdf class handles detailed parsing, while the chunk function segments the paper into meaningful text blocks suitable for further NLP analysis. Section detection matches keywords in multiple languages (e.g., English and Chinese), and an optional callback reports progress to the calling GUI or pipeline.