ppt_parser.py
Overview
The ppt_parser.py file provides functionality to parse Microsoft PowerPoint (.pptx) presentations and extract textual content from slides within a specified page range. It is designed to handle various PowerPoint shape types—such as text frames, tables, and grouped shapes—and to preserve the hierarchical structure of bulleted lists in the extracted text.
The core component is the RAGFlowPptParser class, which reads slides in order and extracts and formats the text of each slide, including text nested in tables and grouped shapes. The output is suited to downstream tasks such as text summarization, content indexing, or other automated processing of presentation data.
Class: RAGFlowPptParser
A parser class that extracts and processes textual content from PowerPoint presentation slides. It supports extracting bulleted text, tables, grouped shapes, and normal text frames while maintaining text hierarchy and ordering.
Usage Example
from ppt_parser import RAGFlowPptParser
parser = RAGFlowPptParser()
# Extract text from slides 0 through 4 of 'sample.pptx' (to_page is exclusive)
slides_text = parser('sample.pptx', from_page=0, to_page=5)
for i, slide_text in enumerate(slides_text):
    print(f"Slide {i} content:\n{slide_text}\n")
Methods
__init__(self)
Constructor initializes the parser instance. Currently, no arguments are required, and no initialization parameters are set.
__get_bulleted_text(self, paragraph) -> str
Private method
Processes a paragraph object and determines whether it is part of a bulleted or numbered list in the slide. If it is, the method adds indentation and a dot prefix to reflect the paragraph's level in the hierarchy; otherwise it returns the paragraph text unchanged.
Parameters:
paragraph (pptx.text.text._Paragraph): A paragraph object from a PowerPoint shape.
Returns:
str: Formatted paragraph text with bullet level indentation if applicable.
Details:
Uses XML XPath queries to check for bullet characters (buChar), auto-numbers (buAutoNum), or bullet images (buBlip).
Indents with two spaces per bullet level and prefixes with a dot (.) to indicate bullet hierarchy.
Example:
formatted_text = parser._RAGFlowPptParser__get_bulleted_text(paragraph)
print(formatted_text)
# Output might be: "  .Bullet text here" (if paragraph.level == 1)
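The bullet check described above can be illustrated without python-pptx. The sketch below is a stand-alone approximation that parses a raw DrawingML paragraph with the standard library's xml.etree.ElementTree. The helper name format_paragraph and the sample XML are hypothetical; the parser itself works on python-pptx paragraph objects rather than raw XML strings.

```python
import xml.etree.ElementTree as ET

# DrawingML main namespace, used by <a:p>, <a:pPr>, <a:buChar>, etc.
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
NS = {"a": A_NS}
BULLET_TAGS = ("buChar", "buAutoNum", "buBlip")


def format_paragraph(p_xml: str) -> str:
    """Hypothetical helper: return the paragraph text, indented and dot-prefixed if bulleted."""
    p = ET.fromstring(p_xml)
    ppr = p.find("a:pPr", NS)
    # A paragraph counts as bulleted if <a:pPr> has any of the bullet children.
    is_bulleted = ppr is not None and any(
        ppr.find(f"a:{tag}", NS) is not None for tag in BULLET_TAGS
    )
    # The lvl attribute holds the indent level; it is treated as 0 when absent.
    level = int(ppr.get("lvl", "0")) if ppr is not None else 0
    # Concatenate the text of all runs (<a:r><a:t>...</a:t></a:r>).
    text = "".join(t.text or "" for t in p.iterfind("a:r/a:t", NS))
    return f"{'  ' * level}.{text}" if is_bulleted else text


SAMPLE = (
    f'<a:p xmlns:a="{A_NS}">'
    '<a:pPr lvl="1"><a:buChar char="-"/></a:pPr>'
    "<a:r><a:t>Bullet text here</a:t></a:r>"
    "</a:p>"
)
print(repr(format_paragraph(SAMPLE)))  # '  .Bullet text here'
```

A paragraph without any of the three bullet elements is returned as plain text, with no indentation or prefix.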
__extract(self, shape) -> str
Private method
Extracts text content from a given PowerPoint shape. This method intelligently handles different shape types including text frames, tables, and grouped shapes.
Parameters:
shape (pptx.shapes.base.BaseShape): A shape object from a slide.
Returns:
str: Extracted textual content from the shape. Returns an empty string if no text is found or errors occur.
Implementation Details:
Checks if the shape has a text frame and extracts paragraphs with bulleted text formatting.
Safely attempts to read the shape_type attribute to differentiate shape kinds.
For tables (shape_type == 19), extracts cell content row-wise, skipping the header row and concatenating column headers with cell values.
For grouped shapes (shape_type == 6), recursively extracts text from each sub-shape, sorting them by position to maintain reading order.
Handles exceptions gracefully, logging errors without stopping the extraction process.
Example:
text = parser._RAGFlowPptParser__extract(shape)
print(text)
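The recursive handling of grouped shapes can be sketched with plain Python objects. This is a simplified illustration rather than the parser's code: FakeShape is a hypothetical stand-in for a python-pptx shape, and table handling is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List

GROUP = 6  # numeric shape_type code the document associates with group shapes


@dataclass
class FakeShape:
    """Hypothetical stand-in for a python-pptx shape (not the real class)."""
    top: int = 0
    left: int = 0
    text: str = ""
    shape_type: int = 0
    shapes: List["FakeShape"] = field(default_factory=list)


def extract(shape: FakeShape) -> str:
    """Return the shape's text, descending into group shapes recursively."""
    if shape.shape_type == GROUP:
        # Visit children top-to-bottom, then left-to-right.
        children = sorted(shape.shapes, key=lambda s: (s.top, s.left))
        parts = [extract(child) for child in children]
        # Drop children that produced no text.
        return "\n".join(p for p in parts if p)
    return shape.text


group = FakeShape(shape_type=GROUP, shapes=[
    FakeShape(top=200, left=0, text="second"),
    FakeShape(top=100, left=0, text="first"),
    FakeShape(top=300, left=0, shape_type=GROUP, shapes=[
        FakeShape(top=10, left=50, text="nested right"),
        FakeShape(top=10, left=5, text="nested left"),
    ]),
])
print(extract(group))
```

Because the function calls itself for every child, groups nested inside groups are handled at any depth.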
__call__(self, fnm, from_page, to_page, callback=None) -> list[str]
Allows an instance of RAGFlowPptParser to be called like a function. Parses a PowerPoint file or file-like object, extracting textual content from slides in the specified page range.
Parameters:
fnm (str or bytes-like): File path to a .pptx file or bytes object representing the file content.
from_page (int): Starting slide index (inclusive) from which to extract text.
to_page (int): Ending slide index (exclusive) up to which to extract text.
callback (optional, callable): Currently not used. Reserved for future functionality (e.g., progress updates).
Returns:
list[str]: A list where each element is the concatenated text content of a single slide.
Implementation Details:
Loads the presentation using pptx.Presentation.
Iterates over slides based on the page range.
Sorts shapes within slides by vertical and horizontal position to preserve reading order.
Extracts text from each shape using __extract.
Joins shape texts with newline separators to form slide text.
Stores total number of slides in self.total_page.
Example:
parser = RAGFlowPptParser()
slide_texts = parser('example.pptx', 0, 3)
for idx, text in enumerate(slide_texts):
    print(f"Slide {idx}:")
    print(text)
    print("-----")
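The half-open page range (from_page inclusive, to_page exclusive) is easy to get wrong by one, so a small sketch may help. It works on a plain list of strings rather than real slides, and select_slides is a hypothetical helper name, not part of the parser.

```python
from typing import List, Sequence


def select_slides(slides: Sequence[str], from_page: int, to_page: int) -> List[str]:
    """Hypothetical helper: keep slides whose index i satisfies from_page <= i < to_page."""
    selected = []
    for i, slide in enumerate(slides):
        if i < from_page:
            continue  # before the requested range
        if i >= to_page:
            break     # past the requested range; stop early
        selected.append(slide)
    return selected


deck = [f"slide {i}" for i in range(6)]
print(select_slides(deck, 1, 3))    # ['slide 1', 'slide 2']
print(select_slides(deck, 4, 100))  # ['slide 4', 'slide 5']
```

In this sketch, a to_page beyond the end of the deck simply yields the remaining slides.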
Important Implementation Details
Shape Type Constants:
The parser checks for specific shape_type numeric codes:
6 corresponds to Group Shapes (collections of multiple shapes).
19 corresponds to Tables.
These numeric codes are the values of MSO_SHAPE_TYPE.GROUP and MSO_SHAPE_TYPE.TABLE in the python-pptx shape type enumeration (pptx.enum.shapes).
Text Extraction Strategy:
Bulleted and numbered paragraphs are detected through XML structure inspection using XPath.
Group shapes are recursively processed to ensure text is fully extracted even from nested elements.
Table extraction assumes the first row contains headers and concatenates these with cell values for each subsequent row.
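The header-plus-value strategy for tables can be sketched on a plain list of rows. The helper name flatten_table, the "; " and ": " separators, and the choice to skip empty cells are assumptions made for illustration; this document does not specify them.

```python
from typing import List


def flatten_table(rows: List[List[str]]) -> str:
    """Hypothetical helper: pair each data cell with its column header."""
    if len(rows) < 2:
        return ""  # no data rows below the header
    header, *body = rows
    lines = []
    for row in body:
        # "; " and ": " are illustrative separators, not confirmed by the document.
        # Empty cells are skipped here, which is also an assumption.
        pairs = [f"{h}: {cell}" for h, cell in zip(header, row) if cell]
        lines.append("; ".join(pairs))
    return "\n".join(lines)


table = [
    ["Name", "Role"],
    ["Ada", "Engineer"],
    ["Grace", ""],
]
print(flatten_table(table))
# Name: Ada; Role: Engineer
# Name: Grace
```

Repeating the header next to every value keeps each output line self-describing, which is useful when lines are later split into independent chunks for indexing.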
Error Handling:
The parser employs try-except blocks around shape processing to avoid interruption due to unexpected shape formats.
Errors are logged using the standard logging module.
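A minimal sketch of this log-and-continue pattern is shown below. The names extract_all and flaky_extract are hypothetical, and the log message format is illustrative; only the general approach (try/except around each shape, reporting through logging) comes from the description above.

```python
import logging
from typing import Callable, Iterable, List

logger = logging.getLogger(__name__)


def extract_all(shapes: Iterable[object], extract: Callable[[object], str]) -> List[str]:
    """Hypothetical helper: extract text per shape, logging and skipping failures."""
    texts = []
    for shape in shapes:
        try:
            text = extract(shape)
        except Exception:
            # logger.exception logs at ERROR level and includes the traceback.
            logger.exception("Failed to extract text from shape %r", shape)
            continue
        if text:
            texts.append(text)
    return texts


def flaky_extract(shape: object) -> str:
    """Toy extractor that fails on one input to exercise the error path."""
    if shape == "bad":
        raise ValueError("unsupported shape")
    return str(shape)


print(extract_all(["title", "bad", "body"], flaky_extract))  # ['title', 'body']
```

One failing shape costs only its own text; the rest of the slide is still extracted.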
Sorting of Shapes:
Shapes on each slide are sorted primarily by vertical position (top) and secondarily by horizontal position (left), to reflect natural reading order.
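The two-level ordering maps directly onto a tuple sort key. The sketch below uses a hypothetical Box type in place of real shapes; only the top and left attribute names come from the description above.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical stand-in for a shape's position and text."""
    top: int
    left: int
    text: str


boxes = [
    Box(top=500, left=100, text="footer"),
    Box(top=100, left=400, text="right column"),
    Box(top=100, left=50, text="left column"),
    Box(top=10, left=50, text="title"),
]

# Tuples compare element by element, so `top` decides first and `left` breaks ties.
ordered = sorted(boxes, key=lambda b: (b.top, b.left))
print([b.text for b in ordered])
# ['title', 'left column', 'right column', 'footer']
```

One consequence of exact comparison is that two shapes whose top values differ by a single unit count as different rows. Bucketing top (for example by integer division) is a possible refinement, though this document does not say whether the parser does so.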
Interaction with Other Parts of the System
Dependencies:
Relies on the python-pptx library (pptx.Presentation) for reading and navigating PowerPoint files.
Uses the logging module for error reporting.
Can be integrated into larger document processing or information retrieval pipelines, e.g., within the InfiniFlow project.
Input/Output:
Input is a PowerPoint .pptx file path or byte stream.
Output is a list of strings representing slide texts, ready for further processing like indexing, NLP, or display.
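Accepting either a path or raw bytes usually comes down to one type check before the file is opened. The sketch below shows one plausible way to do it; to_source is a hypothetical helper, and whether the parser does exactly this is an assumption. It relies on the fact that pptx.Presentation accepts either a path string or a file-like object.

```python
from io import BytesIO
from typing import BinaryIO, Union


def to_source(fnm: Union[str, bytes]) -> Union[str, BinaryIO]:
    """Hypothetical helper: pass paths through, wrap raw bytes in a file-like object."""
    if isinstance(fnm, str):
        return fnm  # treated as a file path
    return BytesIO(fnm)  # in-memory file-like object over the bytes


print(type(to_source("deck.pptx")).__name__)    # str
print(type(to_source(b"PK\x03\x04")).__name__)  # BytesIO
```

The returned value could then be handed to pptx.Presentation, which lets callers parse uploads held in memory without writing a temporary file.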
Potential Integration:
The parser can be called by a document ingestion service.
Extracted slides’ text can be fed into summarization, question-answering, or search indexing subsystems.
Diagram: Class Structure of RAGFlowPptParser
classDiagram
class RAGFlowPptParser {
+__init__()
-__get_bulleted_text(paragraph) : str
-__extract(shape) : str
+__call__(fnm, from_page, to_page, callback=None) : list~str~
+total_page: int
}
__init__: Initializes parser instance.
__get_bulleted_text: Formats bullet/numbered paragraph text.
__extract: Extracts text from various shape types.
__call__: Main method to extract text from multiple slides.
total_page: Stores total number of slides in the presentation after loading.
Summary
The ppt_parser.py file defines a PowerPoint parsing class, RAGFlowPptParser, that extracts ordered, structured text from slide content. It handles bulleted lists, tables, grouped shapes, and plain text, sorts shapes into reading order, and logs per-shape errors instead of aborting, which makes it suitable for content processing pipelines that need slide text.