presentation.py

Overview

The presentation.py module provides a unified interface and implementations for parsing and extracting content from presentation and document files, specifically PowerPoint (.pptx) and PDF (.pdf) formats. The primary goal of this file is to process these documents page-by-page (or slide-by-slide), extracting both textual content and thumbnail images for each page/slide, and preparing these chunks for downstream processing such as tokenization and indexing.

The file includes specialized parser classes for PowerPoint and PDF formats, along with a generic chunk function that intelligently selects the appropriate parser based on the input file type and user configuration. It supports multiple PDF parsing strategies including layout-aware parsing (via DeepDOC), plain text extraction, and OCR-based vision parsing.

Classes and Functions

Class: `Ppt`

Inherits from PptParser.

Purpose

Handles parsing of PowerPoint .pptx files. Extracts text and generates thumbnail images for each slide within a specified page range.

Methods

__call__(fnm, from_page, to_page, callback=None)
- Parameters:
  - fnm (bytes or file path): The PowerPoint file content as bytes or a file path.
  - from_page (int): Starting slide index (0-based).
  - to_page (int): Ending slide index (exclusive).
  - callback (callable, optional): A callback function for progress updates. Signature: callback(progress: float, message: str).
- Returns:
  - List[Tuple[str, PIL.Image.Image]]: A list of tuples where each tuple contains the extracted text of a slide and its corresponding thumbnail image.
- Description:
  - First, calls the superclass method to extract slide texts.
  - Uses aspose.slides to render thumbnails at 10% scale (0.1 x 0.1) for each slide.
  - Validates that the number of images corresponds to the number of extracted texts.
  - Determines if the extracted text is in English using the is_english utility.
  - Reports progress via the callback.
- Usage example:
```
ppt_parser = Ppt()
slides = ppt_parser(binary_pptx_data, from_page=0, to_page=10,
                    callback=lambda p, m: print(f"{p*100}% - {m}"))
for text, image in slides:
    print(text)
    image.show()
```
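
The count validation mentioned in the description can be sketched as follows. This is a minimal illustration rather than the module's own code: `pair_slides` is a hypothetical helper, and plain strings stand in for the thumbnail images.

```python
from typing import Any, List, Sequence, Tuple


def pair_slides(texts: Sequence[str], images: Sequence[Any]) -> List[Tuple[str, Any]]:
    """Pair each slide's text with its thumbnail, refusing misaligned input."""
    if len(texts) != len(images):
        raise ValueError(
            f"expected one image per slide text, got {len(texts)} texts "
            f"and {len(images)} images"
        )
    return list(zip(texts, images))


# Strings stand in for PIL images in this sketch.
pairs = pair_slides(["Title slide", "Agenda"], ["thumb-0", "thumb-1"])
```

Failing fast on a length mismatch keeps a slide's text from being silently attached to the wrong thumbnail.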

Class: `Pdf`

Inherits from PdfParser.

Purpose

Parses PDF documents, extracting OCR-processed text and page images for each page in a given range.

Methods

__init__()
- Initializes the parser by calling the superclass constructor.

__garbage(txt: str) -> bool (private)
- Parameters:
  - txt (str): A text string to evaluate.
- Returns:
  - bool: True if the text is considered garbage (e.g., too short or only numbers/punctuation), otherwise False.
- Description:
  - Filters out text that is unlikely to be useful (numerical noise, very short strings).

__call__(filename, binary=None, from_page=0, to_page=100000, zoomin=3, callback=None)
- Parameters:
  - filename (str or bytes): PDF file path or binary content.
  - binary (bytes, optional): If provided, the PDF content in bytes.
  - from_page (int): Starting page index (0-based).
  - to_page (int): Ending page index (exclusive).
  - zoomin (int): Zoom scale factor for OCR image rendering.
  - callback (callable, optional): Callback for progress reporting.
- Returns:
  - List[Tuple[str, PIL.Image.Image]]: List of tuples with OCR text and page image per page.
- Description:
  - Starts OCR processing of pages.
  - Extracts bounding boxes with recognized text.
  - Filters out garbage text using __garbage.
  - Associates extracted text with rendered page images.
  - Reports progress and timing statistics.
- Implementation details:
  - Uses OCR and layout analysis to parse PDFs visually.
  - Ensures one-to-one correspondence between recognized text boxes and page images.
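
The filtering and per-page grouping steps above might look roughly like the sketch below. Everything here is an assumption made for illustration: the box layout (dictionaries with `page_number` and `text` keys, 0-based pages), the helper names `is_garbage` and `text_per_page`, and the exact noise heuristics, which may differ from what `__garbage` actually checks.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Assumed heuristic: strings made only of digits, punctuation, and spaces are noise.
_NOISE = re.compile(r"[0-9.,%/\-\s]+$")


def is_garbage(txt: str) -> bool:
    """Return True for text that is very short or only digits/punctuation."""
    txt = txt.strip()
    return len(txt) < 3 or _NOISE.match(txt) is not None


def text_per_page(boxes: List[dict], from_page: int, page_count: int) -> List[str]:
    """Join the non-garbage box texts of each page, keeping box order."""
    by_page: Dict[int, List[str]] = defaultdict(list)
    for box in boxes:
        if not is_garbage(box["text"]):
            by_page[box["page_number"]].append(box["text"])
    # One entry per page, so the result lines up with the list of page images.
    return ["\n".join(by_page[from_page + i]) for i in range(page_count)]


boxes = [
    {"page_number": 0, "text": "Quarterly results"},
    {"page_number": 0, "text": "12"},  # bare page number, filtered out
    {"page_number": 1, "text": "Revenue grew 8%"},
]
pages = text_per_page(boxes, from_page=0, page_count=2)
```

Returning exactly one string per page, even when a page has no usable text, is what allows the text list to be zipped with the page images afterwards.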

Class: `PlainPdf`

Inherits from PlainParser.

Purpose

Performs simple plain text extraction from PDF files without OCR or layout analysis.

Methods

__call__(filename, binary=None, from_page=0, to_page=100000, callback=None, **kwargs)
- Parameters:
  - filename (str or bytes): PDF file path or binary content.
  - binary (bytes, optional): PDF content in bytes.
  - from_page (int): Starting page index (0-based).
  - to_page (int): Ending page index (exclusive).
  - callback (callable, optional): Progress callback.
  - **kwargs: Additional ignored parameters.
- Returns:
  - List[Tuple[str, None]]: List of tuples with extracted text and None for image (no images generated).
- Description:
  - Uses PyPDF2 to extract text from each page.
  - No image extraction or OCR performed.
  - Suitable for PDFs with embedded text.
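
A minimal sketch of the page-range logic follows, assuming only that each page object exposes an `extract_text()` method. The function `extract_plain` and the `FakePage` stand-in are hypothetical names introduced so the example runs without a PDF file.

```python
from typing import Iterable, List, Tuple


def extract_plain(pages: Iterable, from_page: int = 0,
                  to_page: int = 100000) -> List[Tuple[str, None]]:
    """Collect (text, None) for pages in the half-open range [from_page, to_page)."""
    results = []
    for index, page in enumerate(pages):
        if index < from_page:
            continue
        if index >= to_page:
            break
        # Guard against a missing text result (for example, an image-only page).
        results.append((page.extract_text() or "", None))
    return results


class FakePage:
    """Stand-in for a PDF page object; only extract_text() is needed."""

    def __init__(self, text):
        self.text = text

    def extract_text(self):
        return self.text


out = extract_plain([FakePage("a"), FakePage("b"), FakePage("c")],
                    from_page=1, to_page=3)
```

With a PyPDF2 version that provides `PdfReader`, the `pages` sequence of a reader could be passed in place of the fake pages.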

Function: `chunk`

Purpose

Main entry point for chunking documents into pages/slides with extracted text and images, preparing data for further processing such as tokenization. Supports .pptx and .pdf files with configurable parsing strategies.

Signature

chunk(filename, binary=None, from_page=0, to_page=100000,
      lang="Chinese", callback=None, parser_config=None, **kwargs)

Parameters

filename (str): File path or name of the document.
binary (bytes, optional): Raw binary content of the document.
from_page (int): Starting page/slide index (0-based).
to_page (int): Ending page/slide index (exclusive).
lang (str): Document language used for tokenization ("English" or "Chinese").
callback (callable, optional): Progress callback function.
parser_config (dict, optional): Configuration dict to select PDF parsing strategy, e.g., {"layout_recognize": "DeepDOC"}.
**kwargs: Additional parameters passed to parsers.

Returns

List[dict]: A list of dictionaries, each representing a chunk (page/slide) with tokenized text, metadata, and optional image.
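
To make the shape of a returned chunk more concrete, here is a rough sketch of how one page might be packaged. Only `title_tks` and `page_num_int` appear elsewhere in this document; the `content_tks` and `image` keys, the `build_chunk` helper, and the whitespace tokenizer (a stand-in for `rag_tokenizer`) are assumptions for illustration.

```python
from typing import Any, Dict, List, Optional


def build_chunk(doc_name: str, page_index: int, text: str,
                image: Optional[Any] = None) -> Dict[str, Any]:
    """Package one page/slide as a dictionary (illustrative layout only)."""

    def tokens(s: str) -> List[str]:
        # Stand-in tokenizer: lowercase whitespace split.
        return s.lower().split()

    chunk_dict: Dict[str, Any] = {
        # Title tokens derived from the file name without its extension.
        "title_tks": tokens(doc_name.rsplit(".", 1)[0]),
        # Whether the real module stores this 0- or 1-based is not shown here.
        "page_num_int": page_index,
        "content_tks": tokens(text),  # hypothetical key
    }
    if image is not None:
        chunk_dict["image"] = image  # hypothetical key
    return chunk_dict


example = build_chunk("Q3 Review.pptx", 0, "Revenue grew")
```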

Description

Detects file type based on filename extension.
For .pptx files:
- Uses the Ppt parser to extract text and images.
- Prepares a dictionary for each slide with metadata and tokenized text.
For .pdf files:
- Supports three parsing modes:
  - "DeepDOC" (default): Uses the Pdf class for OCR and layout-aware parsing.
  - "Plain Text": Uses PlainParser for simple text extraction.
  - Other: Uses a vision-based OCR parser via the VisionParser class and LLM models.
- Each page is tokenized and packaged with metadata.
Raises NotImplementedError for unsupported file types.
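
The selection logic described above can be sketched as a small dispatch function. This is an illustration under assumptions, not the module's code: `select_strategy` is a hypothetical helper that returns a strategy label instead of instantiating a parser, and it assumes the mode is read from the `layout_recognize` key of `parser_config`.

```python
from typing import Optional


def select_strategy(filename: str, parser_config: Optional[dict] = None) -> str:
    """Return a label for the parsing strategy that would handle the file."""
    name = filename.lower()
    if name.endswith(".pptx"):
        return "pptx"
    if name.endswith(".pdf"):
        # "DeepDOC" is the default when no mode is configured.
        mode = (parser_config or {}).get("layout_recognize", "DeepDOC")
        if mode == "DeepDOC":
            return "deepdoc"
        if mode == "Plain Text":
            return "plain_text"
        # Any other value selects the vision-based parser.
        return "vision"
    raise NotImplementedError(f"unsupported file type: {filename}")


default_pdf = select_strategy("report.pdf")
plain_pdf = select_strategy("report.pdf", {"layout_recognize": "Plain Text"})
```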

Usage example

chunks = chunk("example.pptx", from_page=0, to_page=5, lang="English",
               callback=lambda p, m: print(f"{p*100:.0f}% - {m}"))
# Use a loop variable that does not shadow the chunk function.
for ck in chunks:
    print(ck["page_num_int"], ck["title_tks"])

Important Implementation Details

Slide and Page Alignment: Both Ppt and Pdf parsers ensure that the number of extracted text blocks matches the number of generated images for correct alignment.
Text Filtering: The Pdf class includes a private method __garbage that filters out text snippets likely to be noise (e.g., page numbers or very short strings) to improve quality.
Tokenization: The chunk function uses rag_tokenizer and tokenize to convert raw text into token lists, supporting downstream NLP tasks.
Callback Mechanism: All parsers and the chunk function support an optional callback argument to provide progress updates and status messages, useful for UI feedback or logging.
Flexible PDF Parsing: The module supports multiple PDF parsing strategies to handle different document types and qualities, including OCR-based vision parsing integrated with large language models (LLMs).
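
As an illustration of the callback mechanism, the sketch below reports progress after each processed page using the `callback(progress, message)` signature given for the parsers. The `process_pages` function and its message format are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

ProgressCallback = Callable[[float, str], None]


def process_pages(pages: List[str],
                  callback: Optional[ProgressCallback] = None) -> List[str]:
    """Strip each page's text and report fractional progress after each one."""
    results = []
    total = len(pages)
    for i, page in enumerate(pages, start=1):
        results.append(page.strip())
        if callback is not None:
            # Progress is a fraction in (0, 1]; the message is free-form text.
            callback(i / total, f"Processed page {i} of {total}")
    return results


# Collect progress events in a list instead of printing them.
events: List[Tuple[float, str]] = []
cleaned = process_pages([" a ", "b "], callback=lambda p, m: events.append((p, m)))
```

Keeping the callback optional lets the same code path serve both interactive callers that display progress and batch callers that do not.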

Interactions with Other System Components

External Libraries:
- aspose.slides and aspose.pydrawing for PowerPoint slide rendering and thumbnail generation.
- PIL.Image for image manipulation.
- PyPDF2 for plain PDF text extraction.
- deepdoc.parser and rag.nlp for document parsing and NLP tokenization.
- api.db.services.llm_service and api.db to integrate LLM-based OCR for vision parsing.
Parser Hierarchy:
- Ppt extends PptParser from deepdoc.parser.
- Pdf extends PdfParser from deepdoc.parser.
- PlainPdf extends PlainParser from deepdoc.parser.
VisionParser:
- Used for OCR-based parsing of PDFs using LLMs when selected.
Tokenization and NLP:
- The file integrates tightly with the rag.nlp module to tokenize extracted text for indexing or retrieval.

Visual Diagram

classDiagram
    class Ppt {
        +__call__(fnm, from_page, to_page, callback=None) List[Tuple[str, Image]]
        - aspose.slides.Presentation presentation
        - List[Image] imgs
        + is_english(txts) bool
    }

    class Pdf {
        +__init__()
        - __garbage(txt) bool
        +__call__(filename, binary=None, from_page=0, to_page=100000, zoomin=3, callback=None) List[Tuple[str, Image]]
        - boxes
        - page_images
        - total_page
    }

    class PlainPdf {
        +__call__(filename, binary=None, from_page=0, to_page=100000, callback=None, **kwargs) List[Tuple[str, None]]
        - pdf (PyPDF2.PdfReader)
    }

    class chunk {
        +chunk(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", callback=None, parser_config=None, **kwargs) List[dict]
    }

    chunk --> Ppt : uses for .pptx files
    chunk --> Pdf : uses for PDF with DeepDOC layout
    chunk --> PlainPdf : uses for PDF with plain text layout
    chunk --> VisionParser : uses for PDF with Vision OCR layout

    Ppt ..> aspose.slides : uses
    Pdf ..> OCR engine : uses (implicit via PdfParser)

Summary

The presentation.py module handles document ingestion for presentations and PDFs in the InfiniFlow system: it parses .pptx and .pdf files page by page and returns tokenized chunks, each with an optional page image. It supports three PDF parsing strategies (DeepDOC, plain text, and LLM-based vision parsing), relies on the deepdoc.parser and rag.nlp modules for parsing and tokenization, and accepts a callback for progress reporting. The resulting chunks feed the system's downstream indexing, retrieval, and analysis components.
