presentation.py

Overview

The presentation.py module parses presentation and document files, specifically PowerPoint (.pptx) and PDF (.pdf), behind a common interface. It processes each file page by page (or slide by slide), extracts the text and a thumbnail image for every page or slide, and prepares the resulting chunks for downstream steps such as tokenization and indexing.

The file defines one parser class each for PowerPoint, layout-aware PDF, and plain-text PDF parsing, along with a generic chunk function that selects the appropriate parser from the input file type and the user configuration. Three PDF parsing strategies are supported: layout-aware parsing (via DeepDOC), plain text extraction, and OCR-based vision parsing.
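The selection logic described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the helper name select_parser, the config key "layout_recognize", and the values "DeepDOC" and "Plain Text" are assumptions made for the example.

```python
import re


def select_parser(filename, parser_config=None):
    """Return the name of the parser a chunk()-style dispatcher might choose."""
    parser_config = parser_config or {}

    if re.search(r"\.pptx$", filename, re.IGNORECASE):
        return "Ppt"

    if re.search(r"\.pdf$", filename, re.IGNORECASE):
        # Assumed config key and values; the real names may differ.
        layout = parser_config.get("layout_recognize", "DeepDOC")
        if layout == "DeepDOC":
            return "Pdf"          # layout-aware parsing
        if layout == "Plain Text":
            return "PlainPdf"     # plain text extraction
        return "VisionParser"     # OCR-based vision parsing

    raise NotImplementedError("only .pptx and .pdf are handled in this sketch")
```

Returning a name rather than an instance keeps the sketch free of the real parser dependencies; a real dispatcher would presumably construct and call the parser directly.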


Classes and Functions

Class: Ppt

Inherits from PptParser.

Purpose

Handles parsing of PowerPoint .pptx files. Extracts text and generates thumbnail images for each slide within a specified page range.

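Methods

__call__(fnm, from_page, to_page, callback=None): returns a list of (text, image) tuples, one per slide in the requested range.
is_english(txts): returns a bool.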


Class: Pdf

Inherits from PdfParser.

Purpose

Parses PDF documents, extracting OCR-processed text and page images for each page in a given range.

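Methods

__init__(): constructor.
__call__(filename, binary=None, from_page=0, to_page=100000, zoomin=3, callback=None): returns a list of (text, image) tuples, one per page in the requested range.
__garbage(txt): private helper that returns a bool.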


Class: PlainPdf

Inherits from PlainParser.

Purpose

Performs simple plain text extraction from PDF files without OCR or layout analysis.
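As a rough illustration of plain text extraction, the sketch below walks a reader object that exposes .pages and a per-page extract_text() method, which is the interface PyPDF2.PdfReader offers. The function name plain_pages, the half-open page range, and the two stand-in classes are assumptions for the example; the stand-ins exist only so the block runs without PyPDF2 installed.

```python
def plain_pages(reader, from_page=0, to_page=100000):
    """Return [(text, None), ...] for pages in the range [from_page, to_page).

    The second tuple element is always None because plain extraction
    produces no page image.
    """
    results = []
    for page in list(reader.pages)[from_page:to_page]:
        # extract_text() may yield None or "" for image-only pages.
        text = page.extract_text() or ""
        results.append((text.strip(), None))
    return results


class FakePage:
    """Stand-in for a PDF page object (hypothetical, for demonstration)."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


class FakeReader:
    """Stand-in for a PDF reader exposing a .pages sequence (hypothetical)."""

    def __init__(self, texts):
        self.pages = [FakePage(t) for t in texts]
```

With PyPDF2 installed, a PyPDF2.PdfReader instance could be passed in place of FakeReader.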

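Methods

__call__(filename, binary=None, from_page=0, to_page=100000, callback=None, **kwargs): returns a list of (text, None) tuples; no page images are produced.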


Function: chunk

Purpose

Main entry point for chunking documents into pages/slides with extracted text and images, preparing data for further processing such as tokenization. Supports .pptx and .pdf files with configurable parsing strategies.
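To make the shape of the output concrete, here is a hedged sketch of turning parser results into per-page dicts. Only the keys page_num_int and title_tks appear in this document; content_tks, image, the 1-based integer page number, and the naive_tokenize stand-in for the system's tokenizer are all assumptions for illustration.

```python
import re


def naive_tokenize(text):
    # Stand-in tokenizer (assumption): lowercase word tokens joined by spaces.
    return " ".join(re.findall(r"\w+", text.lower()))


def build_chunks(filename, pages, from_page=0):
    """Build one dict per page from a list of (text, image) tuples."""
    # Tokenize the file name without its extension as the shared title.
    title = re.sub(r"\.[a-zA-Z]+$", "", filename)
    title_tks = naive_tokenize(title)

    chunks = []
    for offset, (text, image) in enumerate(pages):
        chunks.append({
            "title_tks": title_tks,
            # 1-based page number (assumed; the real field may be stored differently).
            "page_num_int": from_page + offset + 1,
            "content_tks": naive_tokenize(text),  # hypothetical key
            "image": image,                       # hypothetical key
        })
    return chunks
```

Treating every page as exactly one chunk keeps the mapping between a chunk and its thumbnail trivial, which is presumably why the module works page by page.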

Signature

chunk(filename, binary=None, from_page=0, to_page=100000,
      lang="Chinese", callback=None, parser_config=None, **kwargs)

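Parameters

filename: name of the input file; its extension (.pptx or .pdf) determines which parser is used.
binary (default None): file content, supplied as an alternative to reading from filename.
from_page (default 0), to_page (default 100000): range of pages or slides to process.
lang (default "Chinese"): language of the document.
callback (default None): progress callback; in the usage example it receives a progress value and a message.
parser_config (default None): user configuration, including the choice of PDF parsing strategy.
**kwargs: additional keyword arguments.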

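Returns

A list of dicts (List[dict]), one per page or slide. The usage example reads the page_num_int and title_tks keys from each dict.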

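Description

chunk selects a parser from the file type and the user configuration: Ppt for .pptx files, and for PDFs one of Pdf (DeepDOC layout-aware parsing), PlainPdf (plain text extraction), or VisionParser (OCR-based vision parsing). Each page or slide becomes one chunk carrying its extracted text and image.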

Usage example

chunks = chunk("example.pptx", from_page=0, to_page=5, lang="English",
               callback=lambda p, m: print(f"{p*100}% - {m}"))
# Use a loop variable that does not shadow the chunk() function.
for page_chunk in chunks:
    print(page_chunk["page_num_int"], page_chunk["title_tks"])
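The lambda above assumes the callback is always called with two positional arguments. A slightly more defensive callback, sketched below, also accepts keyword-only calls and records events instead of printing them. The names make_progress_recorder, prog, and msg are hypothetical, and treating the progress value as a fraction between 0 and 1 is an assumption drawn from the p*100 formatting in the example.

```python
def make_progress_recorder():
    """Return (callback, events); each callback call appends (prog, msg) to events."""
    events = []

    def callback(prog=None, msg=""):
        # prog is assumed to be a fraction in [0, 1]; None means "message only".
        if prog is not None:
            prog = max(0.0, min(1.0, float(prog)))
        events.append((prog, msg))

    return callback, events
```

The returned callback can be passed as the callback argument, and the events list inspected afterwards, for example in a test or a progress bar.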

Important Implementation Details


Interactions with Other System Components


Visual Diagram

classDiagram
    class Ppt {
        +__call__(fnm, from_page, to_page, callback=None) List[Tuple[str, Image]]
        - aspose.slides.Presentation presentation
        - List[Image] imgs
        + is_english(txts) bool
    }

    class Pdf {
        +__init__()
        - __garbage(txt) bool
        +__call__(filename, binary=None, from_page=0, to_page=100000, zoomin=3, callback=None) List[Tuple[str, Image]]
        - boxes
        - page_images
        - total_page
    }

    class PlainPdf {
        +__call__(filename, binary=None, from_page=0, to_page=100000, callback=None, **kwargs) List[Tuple[str, None]]
        - PyPDF2.PdfReader pdf
    }

    class chunk {
        +chunk(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", callback=None, parser_config=None, **kwargs) List[dict]
    }

    chunk --> Ppt : uses for .pptx files
    chunk --> Pdf : uses for PDF with DeepDOC layout
    chunk --> PlainPdf : uses for PDF with plain text layout
    chunk --> VisionParser : uses for PDF with Vision OCR layout

    Ppt ..> aspose.slides : uses
    Pdf ..> OCREngine : uses (implicit via PdfParser)

Summary

The presentation.py module is a document-ingestion component of the InfiniFlow system. It parses presentations and PDFs into per-page or per-slide chunks, supports several PDF parsing strategies (DeepDOC layout-aware parsing, plain text extraction, and OCR-based vision parsing), and reports progress through an optional callback. It bridges raw document files and the system's downstream indexing, retrieval, and analysis components.