presentation.py
Overview
The presentation.py module provides a unified interface and implementations for parsing and extracting content from presentation and document files, specifically PowerPoint (.pptx) and PDF (.pdf) formats. The primary goal of this file is to process these documents page-by-page (or slide-by-slide), extracting both textual content and thumbnail images for each page/slide, and preparing these chunks for downstream processing such as tokenization and indexing.
The file includes specialized parser classes for PowerPoint and PDF formats, along with a generic chunk function that intelligently selects the appropriate parser based on the input file type and user configuration. It supports multiple PDF parsing strategies including layout-aware parsing (via DeepDOC), plain text extraction, and OCR-based vision parsing.
Classes and Functions
Class: Ppt
Inherits from PptParser.
Purpose
Handles parsing of PowerPoint .pptx files. Extracts text and generates thumbnail images for each slide within a specified page range.
Methods
__call__(fnm, from_page, to_page, callback=None)
Parameters:
fnm (bytes or file path): The PowerPoint file content as bytes or a file path.
from_page (int): Starting slide index (0-based).
to_page (int): Ending slide index (exclusive).
callback (callable, optional): A callback function for progress updates. Signature: callback(progress: float, message: str).
Returns:
List[Tuple[str, PIL.Image.Image]]: A list of tuples where each tuple contains the extracted text of a slide and its corresponding thumbnail image.
Description:
First, calls the superclass method to extract slide texts.
Uses aspose.slides to render thumbnails at 10% scale (0.1 x 0.1) for each slide.
Validates that the number of images corresponds to the number of extracted texts.
Determines whether the extracted text is in English using the is_english utility.
Reports progress via the callback.
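The alignment check and progress reporting described above can be sketched as follows. This is a minimal illustration rather than the module's code: pair_slides is a hypothetical helper, and plain strings stand in for the thumbnail images so the sketch runs without aspose.slides.

```python
from typing import Callable, List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def pair_slides(
    texts: Sequence[str],
    images: Sequence[T],
    callback: Optional[Callable[[float, str], None]] = None,
) -> List[Tuple[str, T]]:
    """Pair each slide's text with its thumbnail (hypothetical helper)."""
    # One text block per image is required, otherwise the pairs would be misaligned.
    if len(texts) != len(images):
        raise ValueError(
            f"{len(texts)} text blocks but {len(images)} images; cannot align slides"
        )
    pairs = list(zip(texts, images))
    if callback is not None:
        # Same signature as documented: callback(progress: float, message: str).
        callback(1.0, f"Paired {len(pairs)} slides")
    return pairs


updates = []
slides = pair_slides(
    ["Title slide", "Agenda"],
    ["thumb-0", "thumb-1"],  # placeholders for PIL images
    callback=lambda p, m: updates.append((p, m)),
)
```

Raising on a length mismatch, rather than silently truncating with zip, keeps a rendering failure from shifting every later thumbnail onto the wrong slide text.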
Usage example:
ppt_parser = Ppt()
slides = ppt_parser(binary_pptx_data, from_page=0, to_page=10, callback=lambda p, m: print(f"{p*100}% - {m}"))
for text, image in slides:
    print(text)
    image.show()
Class: Pdf
Inherits from PdfParser.
Purpose
Parses PDF documents, extracting OCR-processed text and page images for each page in a given range.
Methods
__init__()
Initializes the parser by calling the superclass constructor.
__garbage(txt: str) -> bool (private)
Parameters:
txt (str): A text string to evaluate.
Returns:
bool: True if the text is considered garbage (e.g., too short or only numbers/punctuation), otherwise False.
Description:
Filters out text that is unlikely useful (numerical noise, very short strings).
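A filter of this kind might look like the sketch below. The function name, the length threshold, and the punctuation set are all assumptions chosen for illustration; the actual rules used by __garbage are not given in this document.

```python
import re

# Assumed threshold for illustration; the real cutoff is not stated in this document.
MIN_USEFUL_LENGTH = 3

# Matches strings made up only of digits, whitespace, and common punctuation.
_NOISE_ONLY = re.compile(r"[\d\s.,;:!?%()\[\]/\\-]+")


def looks_like_garbage(txt: str) -> bool:
    """Return True for text that is too short or holds only numbers/punctuation."""
    stripped = txt.strip()
    if len(stripped) < MIN_USEFUL_LENGTH:
        return True
    return _NOISE_ONLY.fullmatch(stripped) is not None


samples = ["12", "12 / 34", "3.14%", "Quarterly results", "Q3 2024"]
kept = [s for s in samples if not looks_like_garbage(s)]
```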
__call__(filename, binary=None, from_page=0, to_page=100000, zoomin=3, callback=None)
Parameters:
filename (str or bytes): PDF file path or binary content.
binary (bytes, optional): If provided, the PDF content in bytes.
from_page (int): Starting page index (0-based).
to_page (int): Ending page index (exclusive).
zoomin (int): Zoom scale factor for OCR image rendering.
callback (callable, optional): Callback for progress reporting.
Returns:
List[Tuple[str, PIL.Image.Image]]: List of tuples with OCR text and page image per page.
Description:
Starts OCR processing of pages.
Extracts bounding boxes with recognized text.
Filters out garbage text using __garbage.
Associates extracted text with rendered page images.
Reports progress and timing statistics.
Implementation details:
Uses OCR and layout analysis to parse PDFs visually.
Ensures one-to-one correspondence between recognized text boxes and page images.
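One way to picture the pairing of recognized text with page images is the sketch below. The box layout (a dict with "page" and "text" keys) and the helper name text_per_page are assumptions made for illustration; the structures that PdfParser actually produces are not specified in this document.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple, TypeVar

Img = TypeVar("Img")


def text_per_page(
    boxes: Sequence[dict], page_images: Sequence[Img], from_page: int = 0
) -> List[Tuple[str, Img]]:
    """Join box texts page by page and pair each page's text with its image.

    Each box is assumed to carry an absolute 0-based "page" index and a
    "text" string; this layout is illustrative only.
    """
    lines: Dict[int, List[str]] = defaultdict(list)
    for box in boxes:
        # Convert the absolute page index into an offset within the parsed range.
        lines[box["page"] - from_page].append(box["text"])
    # Pages without any surviving text still yield an (empty text, image) pair,
    # so the result has exactly one entry per rendered page image.
    return [("\n".join(lines[i]), img) for i, img in enumerate(page_images)]


boxes = [
    {"page": 2, "text": "Intro"},
    {"page": 2, "text": "Details"},
    {"page": 3, "text": "End"},
]
pages = text_per_page(boxes, ["img-2", "img-3", "img-4"], from_page=2)
```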
Class: PlainPdf
Inherits from PlainParser.
Purpose
Performs simple plain text extraction from PDF files without OCR or layout analysis.
Methods
__call__(filename, binary=None, from_page=0, to_page=100000, callback=None, **kwargs)
Parameters:
filename (str or bytes): PDF file path or binary content.
binary (bytes, optional): PDF content in bytes.
from_page (int): Starting page index (0-based).
to_page (int): Ending page index (exclusive).
callback (callable, optional): Progress callback.
**kwargs: Additional ignored parameters.
Returns:
List[Tuple[str, None]]: List of tuples with extracted text and None for image (no images generated).
Description:
Uses PyPDF2 to extract text from each page.
No image extraction or OCR performed.
Suitable for PDFs with embedded text.
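The page-range semantics (0-based start, exclusive end, large default for to_page) can be illustrated with the sketch below. The helper extract_plain_pages is hypothetical; it accepts any reader object that exposes .pages whose items have an extract_text() method, which is the shape PyPDF2's PdfReader provides. A small stand-in reader is used here so the example runs without a PDF file.

```python
from typing import Any, Callable, List, Optional, Tuple


def extract_plain_pages(
    reader: Any,
    from_page: int = 0,
    to_page: int = 100000,
    callback: Optional[Callable[[float, str], None]] = None,
) -> List[Tuple[str, None]]:
    """Collect (text, None) for pages in the half-open range [from_page, to_page)."""
    # Slicing clamps an oversized to_page to the real page count.
    pages = list(reader.pages)[from_page:to_page]
    results: List[Tuple[str, None]] = []
    for i, page in enumerate(pages, start=1):
        # Guard against an empty or missing result for pages without embedded text.
        results.append((page.extract_text() or "", None))
        if callback is not None:
            callback(i / len(pages), f"Extracted page {from_page + i - 1}")
    return results


class FakePage:
    """Stand-in for a PDF page object (illustration only)."""

    def __init__(self, text: str) -> None:
        self._text = text

    def extract_text(self) -> str:
        return self._text


class FakeReader:
    """Stand-in for a PDF reader exposing .pages (illustration only)."""

    def __init__(self, texts: List[str]) -> None:
        self.pages = [FakePage(t) for t in texts]


reader = FakeReader(["page 0", "page 1", "", "page 3"])
result = extract_plain_pages(reader, from_page=1, to_page=3)
```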
Function: chunk
Purpose
Main entry point for chunking documents into pages/slides with extracted text and images, preparing data for further processing such as tokenization. Supports .pptx and .pdf files with configurable parsing strategies.
Signature
chunk(filename, binary=None, from_page=0, to_page=100000,
lang="Chinese", callback=None, parser_config=None, **kwargs)
Parameters
filename (str): File path or name of the document.
binary (bytes, optional): Raw binary content of the document.
from_page (int): Starting page/slide index (0-based).
to_page (int): Ending page/slide index (exclusive).
lang (str): Language code for tokenization ("English" or "Chinese").
callback (callable, optional): Progress callback function.
parser_config (dict, optional): Configuration dict to select PDF parsing strategy, e.g., {"layout_recognize": "DeepDOC"}.
**kwargs: Additional parameters passed to parsers.
Returns
List[dict]: A list of dictionaries, each representing a chunk (page/slide) with tokenized text, metadata, and optional image.
Description
Detects file type based on filename extension.
For .pptx files:
  Uses the Ppt parser to extract text and images.
  Prepares a dictionary for each slide with metadata and tokenized text.
For .pdf files:
  Supports three parsing modes:
    "DeepDOC" (default): Uses the Pdf class for OCR and layout-aware parsing.
    "Plain Text": Uses PlainParser for simple text extraction.
    Other: Uses a vision-based OCR parser via the VisionParser class and LLM models.
  Each page is tokenized and packaged with metadata.
Raises
NotImplementedError for unsupported file types.
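The selection logic described above can be sketched as a small dispatch function. select_parser is a hypothetical helper that returns only the name of the parser that would be chosen; the case-insensitive extension match is an assumption, while the "layout_recognize" key, its "DeepDOC" default, and the parser names come from the description above.

```python
import re
from typing import Optional


def select_parser(filename: str, parser_config: Optional[dict] = None) -> str:
    """Return the name of the parser that would handle `filename` (sketch)."""
    config = parser_config or {}
    if re.search(r"\.pptx$", filename, re.IGNORECASE):
        return "Ppt"
    if re.search(r"\.pdf$", filename, re.IGNORECASE):
        # "DeepDOC" is the documented default when no strategy is configured.
        mode = config.get("layout_recognize", "DeepDOC")
        if mode == "DeepDOC":
            return "Pdf"
        if mode == "Plain Text":
            return "PlainParser"
        # Any other value selects the vision-based parser.
        return "VisionParser"
    raise NotImplementedError(f"Unsupported file type: {filename}")


choice = select_parser("report.pdf", {"layout_recognize": "Plain Text"})
```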
Usage example
chunks = chunk("example.pptx", from_page=0, to_page=5, lang="English",
               callback=lambda p, m: print(f"{p*100}% - {m}"))
# Use a distinct loop variable so the chunk function is not shadowed.
for c in chunks:
    print(c["page_num_int"], c["title_tks"])
Important Implementation Details
Slide and Page Alignment: Both Ppt and Pdf parsers ensure that the number of extracted text blocks matches the number of generated images for correct alignment.
Text Filtering: The Pdf class includes a private method __garbage that filters out text snippets likely to be noise (e.g., page numbers or very short strings) to improve quality.
Tokenization: The chunk function uses rag_tokenizer and tokenize to convert raw text into token lists, supporting downstream NLP tasks.
Callback Mechanism: All parsers and the chunk function support an optional callback argument to provide progress updates and status messages, useful for UI feedback or logging.
Flexible PDF Parsing: The module supports multiple PDF parsing strategies to handle different document types and qualities, including OCR-based vision parsing integrated with large language models (LLMs).
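To make the tokenization and packaging step more concrete, the sketch below turns (text, image) pairs into chunk dictionaries. It is illustrative only: simple_tokenize is a stand-in for rag_tokenizer, package_pages is a hypothetical helper, and apart from "title_tks" and "page_num_int" (which appear in the usage example) the key names and the 1-based page numbering are assumptions.

```python
import re
from typing import Any, Callable, Dict, List, Sequence, Tuple


def simple_tokenize(text: str) -> str:
    """Stand-in tokenizer: lowercase word tokens joined by single spaces."""
    return " ".join(re.findall(r"\w+", text.lower()))


def package_pages(
    filename: str,
    pages: Sequence[Tuple[str, Any]],
    from_page: int = 0,
    tokenizer: Callable[[str], str] = simple_tokenize,
) -> List[Dict[str, Any]]:
    """Build one dictionary per page/slide from (text, image) pairs."""
    # Title tokens come from the file name without its extension.
    title_tks = tokenizer(re.sub(r"\.[a-zA-Z]+$", "", filename))
    chunks = []
    for offset, (text, image) in enumerate(pages):
        chunks.append({
            "title_tks": title_tks,
            "page_num_int": from_page + offset + 1,  # assumed 1-based numbering
            "content_tks": tokenizer(text),  # hypothetical key name
            "image": image,  # hypothetical key name; None when no image exists
        })
    return chunks


packaged = package_pages(
    "Q3 Review.pptx",
    [("Hello World", None), ("Next Steps!", "thumb-5")],
    from_page=4,
)
```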
Interactions with Other System Components
External Libraries:
aspose.slides and aspose.pydrawing for PowerPoint slide rendering and thumbnail generation.
PIL.Image for image manipulation.
PyPDF2 for plain PDF text extraction.
deepdoc.parser and rag.nlp for document parsing and NLP tokenization.
api.db.services.llm_service and api.db to integrate LLM-based OCR for vision parsing.
Parser Hierarchy:
Ppt extends PptParser from deepdoc.parser.
Pdf extends PdfParser from deepdoc.parser.
PlainPdf extends PlainParser from deepdoc.parser.
VisionParser:
Used for OCR-based parsing of PDFs using LLMs when selected.
Tokenization and NLP:
The file integrates tightly with the rag.nlp module to tokenize extracted text for indexing or retrieval.
Visual Diagram
classDiagram
class Ppt {
+__call__(fnm, from_page, to_page, callback=None) List[Tuple[str, Image]]
- aspose.slides.Presentation presentation
- List[Image] imgs
+ is_english(txts) bool
}
class Pdf {
+__init__()
- __garbage(txt) bool
+__call__(filename, binary=None, from_page=0, to_page=100000, zoomin=3, callback=None) List[Tuple[str, Image]]
- boxes
- page_images
- total_page
}
class PlainPdf {
+__call__(filename, binary=None, from_page=0, to_page=100000, callback=None, **kwargs) List[Tuple[str, None]]
- pdf (PyPDF2.PdfReader)
}
class chunk {
+chunk(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", callback=None, parser_config=None, **kwargs) List[dict]
}
chunk --> Ppt : uses for .pptx files
chunk --> Pdf : uses for PDF with DeepDOC layout
chunk --> PlainPdf : uses for PDF with plain text layout
chunk --> VisionParser : uses for PDF with Vision OCR layout
Ppt ..> AsposeSlides : uses aspose.slides
Pdf ..> OcrEngine : uses (implicit via PdfParser)
Summary
The presentation.py module handles ingestion of presentations and PDFs in the InfiniFlow system: it parses each file page by page (or slide by slide) and packages the extracted text and images as tokenized chunks. It supports three PDF parsing strategies (DeepDOC layout-aware parsing, plain text extraction, and LLM-based vision OCR) and accepts a callback for progress monitoring. In this role it bridges raw document files and the system's downstream indexing, retrieval, and analysis components.