paper.py

Overview

The paper.py file is a core utility module for extracting, parsing, and chunking content from academic papers in PDF format. It combines Optical Character Recognition (OCR), layout analysis, table detection, and tokenization to convert raw PDF documents into structured data suitable for downstream tasks such as information retrieval, summarization, or semantic search.

The primary functionalities include:

- Parsing PDF documents with detailed layout and table recognition.
- Extracting key metadata such as title, authors, abstract, and document sections.
- Tokenizing and chunking text content based on document structure and language.
- Supporting callback-driven progress reporting for integration in GUI or pipeline workflows.

This file depends on external libraries and modules for OCR and parsing (deepdoc.parser), NLP tokenization (rag.nlp), and database constants (api.db).

Classes and Functions

Class `Pdf`

Inheritance

Inherits from PdfParser (imported from deepdoc.parser)

Purpose

Extends the base PdfParser class to specialize it for parsing academic papers. It implements a callable interface to process PDF files, perform OCR, layout and table analysis, text merging, and extract structured document elements (title, authors, abstract, sections, tables).

Constructor: `__init__(self)`

Sets the model_speciess attribute to ParserType.PAPER.value, marking this parser as specialized for paper documents, and calls the base class constructor.

Callable Method: `__call__(self, filename, binary=None, from_page=0, to_page=100000, zoomin=3, callback=None)`

Description

Processes a PDF file (or binary content) from from_page to to_page with a zoom factor zoomin. Performs multiple stages including OCR, layout recognition, table detection, and text merging. Extracts and returns structured content including title, authors, abstract, sections, and tables.

Parameters

- filename (str): Path to the PDF file.
- binary (optional): PDF content in binary form; if provided, it is used instead of filename.
- from_page (int): Starting page number to parse (default 0).
- to_page (int): Ending page number to parse (default 100000, effectively no limit).
- zoomin (int): Zoom factor for image extraction and OCR (default 3).
- callback (callable, optional): Function used to report progress; it is called either with a message alone (msg) or with a progress float followed by a message.
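
A callback that tolerates both calling conventions described above might look like the following. This is a minimal sketch, not the module's own code: the names make_progress_logger, report, and prog are hypothetical, and the assumption is that the message-only form passes msg as a keyword argument.

```python
from typing import List, Optional, Tuple


def make_progress_logger(log: List[Tuple[Optional[float], str]]):
    """Build a callback that records (progress, message) pairs in `log`."""

    def report(prog: Optional[float] = None, msg: str = "") -> None:
        # Works for report(msg="...") as well as report(0.4, "...").
        log.append((prog, msg))

    return report


events: List[Tuple[Optional[float], str]] = []
report = make_progress_logger(events)
report(msg="OCR started")             # message-only form (assumed keyword)
report(0.63, "Layout analysis done")  # progress float followed by a message
```

Giving both parameters defaults is what lets one function serve both forms; a callable such as print, which does not accept a msg keyword, would not.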

Returns

A dictionary with the following keys:

- "title" (str): Extracted paper title.
- "authors" (str): Concatenated author names.
- "abstract" (str): Extracted abstract text.
- "sections" (list of tuples): Each tuple contains (text, layoutno) representing body text and its layout classification.
- "tables" (list): Extracted tables and figures data.
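
The shape of this dictionary can be written down as a type, which makes downstream code easier to check. The sketch below is illustrative only: PaperParseResult and section_texts are hypothetical names, and treating layoutno as a string is an assumption.

```python
from typing import List, Tuple, TypedDict


class PaperParseResult(TypedDict):
    """Hypothetical type mirroring the keys listed above."""
    title: str
    authors: str
    abstract: str
    sections: List[Tuple[str, str]]  # (text, layoutno); layoutno as str is assumed
    tables: list


def section_texts(result: PaperParseResult) -> List[str]:
    """Return only the body text of each section, dropping the layout label."""
    return [text for text, _layoutno in result["sections"]]


example: PaperParseResult = {
    "title": "A Sample Paper",
    "authors": "A. Author B. Author",
    "abstract": "We study sample papers.",
    "sections": [("1 Introduction", "title"), ("Body text.", "text")],
    "tables": [],
}
```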

Usage Example

def report(prog=None, msg=""):
    print(prog, msg)

pdf_parser = Pdf()
result = pdf_parser("sample_paper.pdf", from_page=0, to_page=10, callback=report)
print(result["title"])
print(result["authors"])

Important Implementation Details

- Uses time measurements to track each processing phase.
- Employs regex to detect key sections like introduction, abstract, and keywords.
- Extracts title and authors heuristically by scanning top boxes and validating against section keywords.
- Applies column-based sorting heuristics to handle two-column layouts.
- Cleans extracted text by removing excessive whitespace characters.
- Extracts tables via _extract_table_figure method.
- Handles partial page ranges differently (if from_page > 0, returns fewer metadata fields).
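
The regex-based section detection and whitespace cleanup mentioned above can be sketched as follows. The module's actual patterns are not shown in this document, so the pattern, the keyword list, and the function names is_section_start and clean_whitespace are all assumptions chosen for illustration.

```python
import re

# Hypothetical pattern: optional leading number, then a section keyword in
# English or Chinese, not immediately followed by another ASCII letter.
SECTION_START = re.compile(
    r"\s*(?:\d+[.\s]+)?(abstract|introduction|keywords?|摘要|引言|关键词)(?![a-z])",
    re.IGNORECASE,
)


def is_section_start(text: str) -> bool:
    """True if the line looks like the start of a well-known section."""
    return SECTION_START.match(text) is not None


def clean_whitespace(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()
```

The negative lookahead keeps words such as "Abstraction" from being mistaken for the "Abstract" heading while still allowing Chinese headings that run directly into body text.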

Function `chunk(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", callback=None, **kwargs)`

Purpose

Processes a PDF paper file and slices its content into semantically meaningful chunks for downstream NLP processing. The abstract is preserved as a whole chunk. Supports different parsing configurations and layout recognition modes.

Parameters

- filename (str): Path to the PDF file.
- binary (optional): PDF content as binary.
- from_page (int): Starting page number (default 0).
- to_page (int): Ending page number (default 100000).
- lang (str): Language of the document, affects tokenization (default "Chinese").
- callback (callable): Optional progress/message callback function.
- `**kwargs`: Additional keyword arguments, supporting:
  - parser_config (dict): Configuration dict with keys like "chunk_token_num", "delimiter", and "layout_recognize".
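
A parser_config passed through the keyword arguments could be merged with defaults along these lines. Only the three key names come from the description above; the default values, DEFAULT_PARSER_CONFIG, and resolve_parser_config are hypothetical and are not taken from the module.

```python
# Illustrative defaults only; the real module's values are not documented here.
DEFAULT_PARSER_CONFIG = {
    "chunk_token_num": 128,      # assumed token budget per chunk
    "delimiter": "\n。；！？",    # assumed sentence delimiters
    "layout_recognize": True,    # assumed switch for layout analysis
}


def resolve_parser_config(**kwargs) -> dict:
    """Overlay a caller-supplied parser_config on top of the defaults."""
    merged = dict(DEFAULT_PARSER_CONFIG)
    merged.update(kwargs.get("parser_config") or {})
    return merged


config = resolve_parser_config(parser_config={"chunk_token_num": 256})
```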

Returns

A list of dictionaries, each representing a chunk of document text with associated tokenized content and positional metadata.
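
One way to picture these dictionaries is as copies of shared document metadata with chunk-specific fields added. The sketch below illustrates that idea with copy.deepcopy so that nested values are not shared between chunks; build_chunks and the "text" field are hypothetical, while "title_tks" is the key used in the usage example in this document.

```python
import copy
from typing import Dict, List


def build_chunks(doc_meta: Dict, texts: List[str]) -> List[Dict]:
    """Create one dictionary per text block from shared metadata."""
    chunks = []
    for text in texts:
        d = copy.deepcopy(doc_meta)  # nested lists/dicts are copied, not shared
        d["text"] = text             # hypothetical field name
        chunks.append(d)
    return chunks


meta = {"title_tks": "sample paper", "tags": ["paper"]}
chunks = build_chunks(meta, ["first block", "second block"])
chunks[0]["tags"].append("edited")  # only affects the first chunk
```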

Usage Example

chunks = chunk("sample_paper.pdf", from_page=0, to_page=5, lang="English")
for c in chunks:
    print(c["title_tks"], c["image"])

Implementation Highlights

- Selects between PlainParser and the custom Pdf parser based on config.
- Extracts metadata and tokenizes authors, title, and tables.
- Uses specialized tokenization functions from rag.nlp for tables and sections.
- Applies heuristics to detect and group sections using bullet patterns and title frequencies.
- Merges text chunks based on section hierarchy to form semantically coherent chunks.
- Keeps the abstract as a single chunk rather than slicing it into smaller parts.
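
The idea of grouping text by title frequency can be illustrated with a simplified stand-in. The sketch assumes each block already carries a numeric heading level (0 for body text), picks the most frequent heading level, and starts a new chunk at each heading of that level. The function name, the BODY constant, and the input shape are hypothetical, and the real module's heuristics are likely more involved.

```python
from collections import Counter
from typing import List, Tuple

BODY = 0  # hypothetical level assigned to non-title text


def group_by_dominant_title_level(items: List[Tuple[str, int]]) -> List[str]:
    """Split (text, level) pairs into chunks at the most frequent title level."""
    title_levels = [level for _, level in items if level != BODY]
    if not title_levels:
        return [" ".join(text for text, _ in items)] if items else []
    dominant = Counter(title_levels).most_common(1)[0][0]

    chunks: List[str] = []
    current: List[str] = []
    for text, level in items:
        if level == dominant and current:
            chunks.append(" ".join(current))  # close the previous chunk
            current = []
        current.append(text)
    if current:
        chunks.append(" ".join(current))
    return chunks


blocks = [
    ("1 Intro", 1), ("Body a", 0), ("1.1 Sub", 2), ("Body b", 0),
    ("2 Method", 1), ("Body c", 0), ("3 Results", 1), ("Body d", 0),
]
grouped = group_by_dominant_title_level(blocks)
```

Splitting at the dominant level keeps subsections together with their parent section, which is one plausible reading of merging by section hierarchy.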

Important Implementation Details and Algorithms

- OCR and Layout Analysis Pipeline: The Pdf class orchestrates OCR image extraction, layout analysis, and table recognition sequentially, timing each step and reporting through callbacks.
- Two-Column Layout Handling: Uses median column width heuristics to detect and sort text boxes appropriately when a two-column layout is detected.
- Section Detection: Uses regular expressions to identify section titles, abstracts, and keywords by matching common terms in multiple languages (e.g., English and Chinese).
- Tokenization and Chunking: Combines coarse and fine-grained tokenization strategies. Uses bullet categories and title frequency analysis to segment sections and merge logically related text blocks.
- Deep Copying for Tokenization: Deep copies of document metadata are created for each chunk to maintain isolated tokenization states.
- Callback Mechanism: Allows external code to track processing progress and display messages, facilitating integration into interactive or batch systems.
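
The two-column heuristic can be sketched as follows. This is an assumption-laden illustration, not the module's code: the Box type, sort_boxes, the half-page threshold, and the use of statistics.median (standing in for the numpy median mentioned in this document) are all choices made for the example.

```python
from statistics import median
from typing import List, NamedTuple


class Box(NamedTuple):
    """Hypothetical text box with horizontal extent and vertical position."""
    text: str
    x0: float
    x1: float
    top: float


def sort_boxes(boxes: List[Box], page_width: float) -> List[Box]:
    """Order boxes for reading, treating narrow boxes as a two-column layout."""
    if not boxes:
        return []
    typical_width = median(b.x1 - b.x0 for b in boxes)
    if typical_width >= page_width / 2:
        # Single column: plain top-to-bottom order.
        return sorted(boxes, key=lambda b: (b.top, b.x0))

    def column(b: Box) -> int:
        # Assign a column by which half of the page the box centre falls in.
        return 0 if (b.x0 + b.x1) / 2 < page_width / 2 else 1

    # Two columns: read the whole left column before the right one.
    return sorted(boxes, key=lambda b: (column(b), b.top))


page = [
    Box("R1", 320, 550, 100), Box("L2", 50, 280, 300),
    Box("L1", 50, 280, 100), Box("R2", 320, 550, 300),
]
reading_order = [b.text for b in sort_boxes(page, page_width=600)]
```

Using the median rather than the mean keeps a few full-width boxes, such as a title or a wide table, from masking an otherwise two-column page.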

Interaction with Other System Components

- api.db.ParserType: Provides enumeration constants used to specify parser types.
- rag.nlp module: Supplies tokenizer functions (rag_tokenizer.tokenize, tokenize_table, add_positions, etc.) used for NLP processing of extracted text.
- deepdoc.parser module: Supplies base parser classes PdfParser and PlainParser for PDF content reading and basic parsing.
- numpy: Used for numerical operations, specifically median calculation for layout heuristics.
- Logging and regular expressions: Used for debugging output and for text pattern matching within the document.

The paper.py file acts as a bridge between raw PDF input and structured, tokenized output, making it a central component in the document ingestion pipeline of the InfiniFlow system.

Mermaid Class Diagram

classDiagram
    class Pdf {
        - model_speciess: ParserType
        + __init__()
        + __call__(filename, binary=None, from_page=0, to_page=100000, zoomin=3, callback=None)
    }

    class PdfParser {
        <<abstract>>
    }

    Pdf --|> PdfParser

Pdf extends PdfParser.
Pdf exposes a callable interface to parse and extract paper content.

Summary

The paper.py file is a PDF parser and chunking utility tailored to academic papers. It combines OCR, layout detection, table extraction, and tokenization to turn raw PDFs into structured, tokenized document representations. The Pdf class handles parsing, while the chunk function segments the paper into text blocks suitable for further NLP analysis. Progress is reported through an optional callback, and section detection covers more than one language (for example, English and Chinese).
