ppt_parser.py


Overview

The ppt_parser.py file parses Microsoft PowerPoint (.pptx) presentations and extracts the textual content of slides within a specified page range. It handles several PowerPoint shape types (text frames, tables, and grouped shapes) and preserves the hierarchical structure of bulleted lists in the extracted text.

The core component is the RAGFlowPptParser class, which reads slides in order and extracts and formats the text of each one, including text nested inside tables and grouped shapes. This makes the parser useful for downstream applications such as text summarization, content indexing, or other automated processing of presentation data.


Class: RAGFlowPptParser

A parser class that extracts and processes textual content from PowerPoint presentation slides. It supports extracting bulleted text, tables, grouped shapes, and normal text frames while maintaining text hierarchy and ordering.
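
The description above mentions ordering but does not say how it is determined. One plausible approach, shown here purely as an assumption, is to sort shapes by position: top to bottom, then left to right, with a small tolerance so that shapes at nearly the same height count as one row. PositionedShape, reading_order, and row_tolerance are hypothetical names for this sketch and are not taken from ppt_parser.py.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PositionedShape:
    """Hypothetical stand-in for a slide shape with a position."""
    name: str
    top: Optional[int]
    left: Optional[int]


def reading_order(shapes: List[PositionedShape], row_tolerance: int = 10) -> List[PositionedShape]:
    """Sort shapes top-to-bottom, then left-to-right (assumed ordering rule)."""
    def key(shape: PositionedShape):
        # Treat a missing coordinate as 0 so the sort never compares None.
        top = shape.top if shape.top is not None else 0
        left = shape.left if shape.left is not None else 0
        # Integer division groups shapes of similar height into one row.
        return (top // row_tolerance, left)

    return sorted(shapes, key=key)


shapes = [
    PositionedShape("body", top=200, left=50),
    PositionedShape("title", top=12, left=100),
    PositionedShape("logo", top=15, left=10),
]
ordered_names = [s.name for s in reading_order(shapes)]
print(ordered_names)
```

Here the logo and the title fall into the same row bucket, so the logo (further left) comes first, followed by the title and then the body.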

Usage Example

from ppt_parser import RAGFlowPptParser

parser = RAGFlowPptParser()
# Extract text from slides 0 to 5 in a presentation file 'sample.pptx'
slides_text = parser('sample.pptx', from_page=0, to_page=5)

for i, slide_text in enumerate(slides_text):
    print(f"Slide {i} content:\n{slide_text}\n")

Methods

__init__(self)

The constructor initializes the parser instance. It currently takes no arguments and sets no initialization parameters.


__get_bulleted_text(self, paragraph) -> str

Private method

Processes a paragraph object and determines if it is part of a bulleted or numbered list in the slide. If it is bulleted, it adds indentation and numbering to reflect the paragraph's level in the hierarchy.

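# Name mangling: from outside the class, a double-underscore method
# is reachable as _RAGFlowPptParser__get_bulleted_text.
# `paragraph` is assumed to be a paragraph object taken from a shape's text frame.
formatted_text = parser._RAGFlowPptParser__get_bulleted_text(paragraph)
print(formatted_text)
# Output might be: " 1. Bullet text here"  (if paragraph.level == 1)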
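
To make the idea of level-based formatting concrete, here is a minimal sketch that runs without a presentation file. FakeParagraph, its is_bulleted flag, and format_paragraph are hypothetical stand-ins, and the "- " marker and two-space indent are assumptions chosen for illustration. The actual prefix produced by the parser may differ.

```python
from dataclasses import dataclass


@dataclass
class FakeParagraph:
    """Hypothetical stand-in for a presentation paragraph object."""
    text: str
    level: int = 0             # nesting depth, 0 = top level
    is_bulleted: bool = False  # assumed flag; the real check is not shown in this document


def format_paragraph(paragraph: FakeParagraph, indent: str = "  ") -> str:
    """Indent bulleted paragraphs by nesting level; return others unchanged."""
    if paragraph.is_bulleted:
        return f"{indent * paragraph.level}- {paragraph.text}"
    return paragraph.text


paragraphs = [
    FakeParagraph("Agenda"),
    FakeParagraph("Goals", level=0, is_bulleted=True),
    FakeParagraph("Reduce latency", level=1, is_bulleted=True),
]
formatted = "\n".join(format_paragraph(p) for p in paragraphs)
print(formatted)
```

Plain paragraphs pass through untouched, so a slide title followed by a nested list keeps its visual hierarchy in the extracted text.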

__extract(self, shape) -> str

Private method

Extracts text content from a given PowerPoint shape. The method handles text frames, tables, and grouped shapes.

# `shape` is assumed to be a shape object taken from a slide.
text = parser._RAGFlowPptParser__extract(shape)
print(text)
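
The dispatch over shape types can be sketched as a small recursive function. This is an illustration under stated assumptions, not the parser's code: FakeShape and extract_text are hypothetical, the kind field replaces whatever type check the real method uses, and the "header: cell" table layout is an assumed output format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeShape:
    """Hypothetical stand-in for a slide shape; not a real library type."""
    kind: str                                            # "text", "table", or "group"
    paragraphs: List[str] = field(default_factory=list)  # used when kind == "text"
    rows: List[List[str]] = field(default_factory=list)  # used when kind == "table"
    children: List["FakeShape"] = field(default_factory=list)  # used when kind == "group"


def extract_text(shape: FakeShape) -> str:
    """Return the text of a shape, recursing into grouped shapes."""
    if shape.kind == "text":
        # Skip paragraphs that contain only whitespace.
        return "\n".join(p for p in shape.paragraphs if p.strip())
    if shape.kind == "table":
        if len(shape.rows) < 2:
            return ""
        header, *body = shape.rows
        # Pair each cell with its column header (assumed layout).
        return "\n".join(
            "; ".join(f"{h}: {c}" for h, c in zip(header, row)) for row in body
        )
    if shape.kind == "group":
        parts = [extract_text(child) for child in shape.children]
        return "\n".join(p for p in parts if p)
    return ""  # unknown shape types contribute no text


group = FakeShape("group", children=[
    FakeShape("text", paragraphs=["Title", " "]),
    FakeShape("table", rows=[["Name", "Qty"], ["Apple", "3"]]),
])
group_text = extract_text(group)
print(group_text)
```

Returning an empty string for unrecognized shapes keeps the caller simple: it can drop empty results without special cases.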

__call__(self, fnm, from_page, to_page, callback=None) -> list[str]

Allows an instance of RAGFlowPptParser to be called like a function. Parses a PowerPoint file or file-like object, extracting textual content from slides in the specified page range.

parser = RAGFlowPptParser()
slide_texts = parser('example.pptx', 0, 3)

for idx, text in enumerate(slide_texts):
    print(f"Slide {idx}:")
    print(text)
    print("-----")
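
This document does not state whether to_page is inclusive. The sketch below assumes a half-open range, meaning slides with index from_page up to but not including to_page, which is a common convention. select_slides is a hypothetical helper operating on plain strings, not part of ppt_parser.py.

```python
from typing import List, Sequence


def select_slides(slides: Sequence[str], from_page: int, to_page: int) -> List[str]:
    """Return slides whose index lies in [from_page, to_page) (assumed half-open)."""
    selected = []
    for index, slide in enumerate(slides):
        if index < from_page:
            continue  # before the requested range
        if index >= to_page:
            break     # past the requested range; stop early
        selected.append(slide)
    return selected


deck = ["intro", "agenda", "results", "summary"]
picked = select_slides(deck, from_page=1, to_page=3)
print(picked)
```

Under this assumption a to_page larger than the slide count simply returns everything from from_page onward, so callers can pass a generous upper bound.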

Important Implementation Details


Interaction with Other Parts of the System


Diagram: Class Structure of RAGFlowPptParser

classDiagram
    class RAGFlowPptParser {
        +__init__()
        -__get_bulleted_text(paragraph) : str
        -__extract(shape) : str
        +__call__(fnm, from_page, to_page, callback=None) : list~str~
        +total_page: int
    }

Summary

The ppt_parser.py file defines RAGFlowPptParser, a class that extracts ordered, structured text from PowerPoint slides. It handles bulleted lists, tables, grouped shapes, and plain text, keeps the extracted text in order, and is resilient to errors, so it can feed content processing pipelines that need slide text.