ppt_parser.py

Overview

The ppt_parser.py file parses Microsoft PowerPoint (.pptx) presentations and extracts the textual content of slides within a specified page range. It handles several PowerPoint shape types (text frames, tables, and grouped shapes) and preserves the hierarchical structure of bulleted lists in the extracted text.

The core component is the RAGFlowPptParser class, which reads slides in order and extracts and formats the text of each slide, including text held in tables and grouped shapes. This makes the parser useful for downstream applications such as text summarization, content indexing, or other automated processing of presentation data.

Class: RAGFlowPptParser

A parser class that extracts and processes textual content from PowerPoint presentation slides. It supports extracting bulleted text, tables, grouped shapes, and normal text frames while maintaining text hierarchy and ordering.

Usage Example

from ppt_parser import RAGFlowPptParser

parser = RAGFlowPptParser()
# Extract text from slides 0 through 4 of 'sample.pptx' (to_page is exclusive)
slides_text = parser('sample.pptx', from_page=0, to_page=5)

for i, slide_text in enumerate(slides_text):
    print(f"Slide {i} content:\n{slide_text}\n")

Methods

`__init__(self)`

The constructor initializes the parser instance. It takes no arguments and sets no initialization parameters.

`__get_bulleted_text(self, paragraph) -> str`

Private method

Processes a paragraph object and determines whether it is part of a bulleted or numbered list in the slide. If it is, the method adds indentation and a dot prefix to reflect the paragraph's level in the hierarchy; otherwise it returns the paragraph text unchanged.

Parameters:
- paragraph (pptx.text.paragraph.Paragraph): A paragraph object from a PowerPoint shape.
Returns:
- str: Formatted paragraph text with bullet level indentation if applicable.
Details:
- Uses XML xpath queries to check for bullet characters (buChar), auto-numbers (buAutoNum), or bullet images (buBlip).
- Indents with two spaces per bullet level and prefixes with a dot (.) to indicate bullet hierarchy.
Example:

formatted_text = parser._RAGFlowPptParser__get_bulleted_text(paragraph)
print(formatted_text)
# Output might be: "  .Bullet text here"  (if paragraph.level == 1: two spaces of indent, then a dot)
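
The bullet detection described above can be sketched without python-pptx by inspecting a paragraph's DrawingML directly. This is a minimal illustration using the standard library's ElementTree on a hand-written XML fragment, not the parser's own code: the helper name format_paragraph and the SAMPLE fragment are hypothetical, and placing the dot directly before the text with no space is an assumption.

```python
import xml.etree.ElementTree as ET

# DrawingML namespace used by paragraph XML inside .pptx files.
NS = {"a": "http://schemas.openxmlformats.org/drawingml/2006/main"}
BULLET_TAGS = ("buChar", "buAutoNum", "buBlip")


def format_paragraph(p_xml: str) -> str:
    """Hypothetical helper: indent and dot-prefix a bulleted paragraph."""
    p = ET.fromstring(p_xml)
    text = "".join(t.text or "" for t in p.findall("a:r/a:t", NS))
    ppr = p.find("a:pPr", NS)
    if ppr is None:
        return text
    # A paragraph counts as bulleted if any bullet marker is present.
    is_bulleted = any(ppr.find(f"a:{tag}", NS) is not None for tag in BULLET_TAGS)
    if not is_bulleted:
        return text
    level = int(ppr.get("lvl", "0"))  # the lvl attribute holds the nesting depth
    # Assumption: two spaces per level, then a dot, then the text.
    return "  " * level + "." + text


SAMPLE = (
    '<a:p xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">'
    '<a:pPr lvl="1"><a:buChar char="-"/></a:pPr>'
    "<a:r><a:t>Bullet text here</a:t></a:r>"
    "</a:p>"
)
PLAIN = (
    '<a:p xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">'
    "<a:r><a:t>Plain text</a:t></a:r>"
    "</a:p>"
)
```

Paragraphs without a bullet marker pass through unchanged, so plain text frames are not affected by the indentation logic.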

`__extract(self, shape) -> str`

Private method

Extracts text content from a given PowerPoint shape. The method branches on the kind of shape, handling text frames, tables, and grouped shapes separately.

Parameters:
- shape (pptx.shapes.base.BaseShape): A shape object from a slide.
Returns:
- str: Extracted textual content from the shape. Returns an empty string if no text is found or errors occur.
Implementation Details:
- Checks if the shape has a text frame and extracts paragraphs with bulleted text formatting.
- Safely attempts to read the shape_type attribute to differentiate shape kinds.
- For tables (shape_type == 19), extracts cell content row-wise, skipping the header row and concatenating column headers with cell values.
- For grouped shapes (shape_type == 6), recursively extracts text from each sub-shape, sorting them by position to maintain reading order.
- Handles exceptions gracefully, logging errors without stopping the extraction process.
Example:

text = parser._RAGFlowPptParser__extract(shape)
print(text)
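
The branching and the recursion into grouped shapes can be sketched with stand-in objects. This is a simplified illustration under stated assumptions rather than the method itself: FakeShape and extract are hypothetical names, the stand-in stores paragraphs as plain strings, and table handling is left out to keep the sketch short.

```python
from dataclasses import dataclass, field
from typing import List, Optional

GROUP = 6  # shape_type code for grouped shapes, as named in this document


@dataclass
class FakeShape:
    """Hypothetical stand-in for a python-pptx shape."""
    shape_type: Optional[int] = None
    top: int = 0
    left: int = 0
    paragraphs: List[str] = field(default_factory=list)        # text frame content
    children: List["FakeShape"] = field(default_factory=list)  # group members

    @property
    def has_text_frame(self) -> bool:
        return bool(self.paragraphs)


def extract(shape: FakeShape) -> str:
    # Text frames come first: join the non-empty paragraphs.
    if shape.has_text_frame:
        return "\n".join(p for p in shape.paragraphs if p.strip())
    # Groups: visit members in reading order and recurse into each one.
    if shape.shape_type == GROUP:
        ordered = sorted(shape.children, key=lambda s: (s.top, s.left))
        texts = [extract(child) for child in ordered]
        return "\n".join(t for t in texts if t)
    # Anything else yields no text.
    return ""


group = FakeShape(shape_type=GROUP, children=[
    FakeShape(top=200, paragraphs=["second"]),
    FakeShape(shape_type=GROUP, top=300, children=[FakeShape(paragraphs=["nested"])]),
    FakeShape(top=100, paragraphs=["first"]),
])
```

Because the function calls itself on every group member, text inside a group nested within another group is still collected.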

`__call__(self, fnm, from_page, to_page, callback=None) -> list[str]`

Allows an instance of RAGFlowPptParser to be called like a function. Parses a PowerPoint file or file-like object, extracting textual content from slides in the specified page range.

Parameters:
- fnm (str or bytes-like): File path to a .pptx file or bytes object representing the file content.
- from_page (int): Starting slide index (inclusive) from which to extract text.
- to_page (int): Ending slide index (exclusive) up to which to extract text.
- callback (optional, callable): Currently not used. Reserved for future functionality (e.g., progress updates).
Returns:
- list[str]: A list where each element is the concatenated text content of a single slide.
Implementation Details:
- Loads the presentation using pptx.Presentation.
- Iterates over slides based on the page range.
- Sorts shapes within slides by vertical and horizontal position to preserve reading order.
- Extracts text from each shape using __extract.
- Joins shape texts with newline separators to form slide text.
- Stores total number of slides in self.total_page.
Example:

parser = RAGFlowPptParser()
slide_texts = parser('example.pptx', 0, 3)

for idx, text in enumerate(slide_texts):
    print(f"Slide {idx}:")
    print(text)
    print("-----")
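
The page range semantics (from_page inclusive, to_page exclusive) can be illustrated with a small stand-alone loop. This is a sketch that assumes the parser walks the slides by index; select_slides is a hypothetical helper, and plain strings stand in for slide objects.

```python
from typing import List, Sequence


def select_slides(slides: Sequence[str], from_page: int, to_page: int) -> List[str]:
    """Hypothetical helper: keep slides with from_page <= index < to_page."""
    selected = []
    for i, slide in enumerate(slides):
        if i < from_page:
            continue  # before the requested range
        if i >= to_page:
            break     # past the requested range, stop early
        selected.append(slide)
    return selected


deck = ["s0", "s1", "s2", "s3"]
```

With this convention, select_slides(deck, 1, 3) keeps the slides at indices 1 and 2, and a to_page larger than the deck simply runs to the last slide.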

Important Implementation Details

Shape Type Constants:
- The parser checks for specific shape_type numeric codes:
  - 6 corresponds to group shapes (collections of multiple shapes).
  - 19 corresponds to tables.
- These numeric codes match MSO_SHAPE_TYPE.GROUP and MSO_SHAPE_TYPE.TABLE in python-pptx's pptx.enum.shapes enumeration.
Text Extraction Strategy:
- Bulleted and numbered paragraphs are detected through XML structure inspection using XPath.
- Group shapes are recursively processed to ensure text is fully extracted even from nested elements.
- Table extraction assumes the first row contains headers and concatenates these with cell values for each subsequent row.
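
The header-plus-value pairing for tables can be sketched on a plain list of rows. This is an illustration only: flatten_table is a hypothetical helper, and the ": " and "; " separators are assumptions, not a statement of the parser's exact output format.

```python
from typing import List


def flatten_table(rows: List[List[str]]) -> str:
    """Hypothetical helper: pair each cell with its column header."""
    if len(rows) < 2:
        return ""  # a header row alone carries no data rows
    header, body = rows[0], rows[1:]
    lines = []
    for row in body:
        # Assumed format: "Header: value" pairs joined with "; ".
        pairs = [f"{h}: {v}" for h, v in zip(header, row)]
        lines.append("; ".join(pairs))
    return "\n".join(lines)


table = [["Name", "Qty"], ["Apple", "3"], ["Pear", "5"]]
```

Each data row becomes one line of text, so every value keeps its column context even after the table layout is lost.
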
Error Handling:
- The parser employs try-except blocks around shape processing to avoid interruption due to unexpected shape formats.
- Errors are logged using the standard logging module.
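
A log-and-continue wrapper of the kind described here might look like the following. It is a sketch, not the parser's code: safe_extract and always_fails are hypothetical names, and the log message wording is an assumption.

```python
import logging
from typing import Any, Callable


def safe_extract(extract: Callable[[Any], str], shape: Any) -> str:
    """Hypothetical wrapper: log any failure and fall back to an empty string."""
    try:
        return extract(shape)
    except Exception as exc:
        # One bad shape should not abort extraction for the whole slide.
        logging.error("Error processing shape: %s", exc)
        return ""


def always_fails(shape: Any) -> str:
    """Stand-in extractor that simulates an unsupported shape."""
    raise ValueError("unsupported shape")
```

Returning an empty string keeps the return type uniform, so callers can filter out failed shapes the same way they filter out shapes with no text.
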
Sorting of Shapes:
- Shapes on each slide are sorted primarily by vertical position (top) and secondarily by horizontal position (left), to reflect natural reading order.
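
The two-level sort can be expressed as a tuple key. The sketch below uses a hypothetical Box namedtuple in place of real shapes and assumes every shape has numeric top and left values.

```python
from collections import namedtuple

# Hypothetical stand-in for a shape with a position on the slide.
Box = namedtuple("Box", ["name", "top", "left"])


def reading_order(shapes):
    """Sort top-to-bottom first, then left-to-right among equal tops."""
    return sorted(shapes, key=lambda s: (s.top, s.left))


boxes = [
    Box("footer", 900, 50),
    Box("right column", 200, 500),
    Box("title", 50, 100),
    Box("left column", 200, 80),
]
ordered_names = [b.name for b in reading_order(boxes)]
```

Tuples compare element by element, so left only breaks ties between shapes that share the same top value.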

Interaction with Other Parts of the System

Dependencies:
- Relies on the python-pptx library (pptx.Presentation) for reading and navigating PowerPoint files.
- Uses the logging module for error reporting.
- Can be integrated into larger document processing or information retrieval pipelines, e.g., within the InfiniFlow project.
Input/Output:
- Input is a PowerPoint .pptx file path or byte stream.
- Output is a list of strings representing slide texts, ready for further processing like indexing, NLP, or display.
Potential Integration:
- The parser can be called by a document ingestion service.
- Extracted slides’ text can be fed into summarization, question-answering, or search indexing subsystems.

Diagram: Class Structure of RAGFlowPptParser

classDiagram
    class RAGFlowPptParser {
        +__init__()
        -__get_bulleted_text(paragraph) : str
        -__extract(shape) : str
        +__call__(fnm, from_page, to_page, callback=None) : list~str~
        +total_page: int
    }

__init__: Initializes the parser instance.
__get_bulleted_text: Formats bullet/numbered paragraph text.
__extract: Extracts text from various shape types.
__call__: Main method to extract text from multiple slides.
total_page: Stores total number of slides in the presentation after loading.

Summary

The ppt_parser.py file defines RAGFlowPptParser, a PowerPoint parsing class that extracts ordered, structured text from slides. It handles bulleted lists, tables, grouped shapes, and plain text, sorts shapes into reading order, and logs shape-level errors instead of aborting, which makes it suitable for content processing pipelines that need slide text.