figure_parser.py

Overview

The figure_parser.py file is a utility module that processes and describes figure images using a vision-language model within the InfiniFlow system. It parses figure data (images, optionally accompanied by positional metadata), invokes the vision-language model to generate descriptive text for each figure, and organizes the enriched figure data for downstream use.

Key functionalities include:

Wrapping raw figure image data into a standardized format.
Extracting and validating figure images, descriptions, and optional positional information.
Concurrently invoking a vision-language model to generate descriptive captions for each figure image.
Aggregating and assembling the processed figure data with descriptions for further processing or display.

Detailed Explanation

Function: `vision_figure_parser_figure_data_wrapper`

def vision_figure_parser_figure_data_wrapper(figures_data_without_positions):
    ...

Purpose:
Transforms a list of figure data tuples without positional information into the standardized format expected by the VisionFigureParser class. Only entries whose second element is a PIL Image.Image object are kept; all other entries are dropped.

Parameters:

figures_data_without_positions (List[Tuple]): A list of tuples, each containing figure data where the first element is a description list, and the second element is expected to be an image.

Returns:

List[Tuple]: A list of tuples where each tuple consists of:
- A tuple (Image.Image, List) representing the image and its descriptions.
- A list with a single position tuple (0, 0, 0, 0, 0) as placeholder positional metadata.

Usage Example:

raw_figures = [
    (["Description 1"], pil_image_obj1),
    (["Description 2"], pil_image_obj2)
]
wrapped_figures = vision_figure_parser_figure_data_wrapper(raw_figures)
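
The wrapping and filtering described above can be sketched roughly as follows. This is a minimal illustration based only on the description in this section, not the module's actual code: `wrap_figures`, `PLACEHOLDER_POSITION`, and the stand-in `Image` class (used so the sketch runs without Pillow) are hypothetical names.

```python
from typing import Any, List, Tuple


class Image:
    """Stand-in for PIL.Image.Image so the sketch runs without Pillow."""


# Placeholder position attached to every figure that has no layout information.
PLACEHOLDER_POSITION = (0, 0, 0, 0, 0)


def wrap_figures(figures_data_without_positions: List[Tuple[Any, Any]]):
    """Hypothetical re-implementation of the wrapper described above."""
    wrapped = []
    for descriptions, image in figures_data_without_positions:
        # Drop entries whose second element is not an image.
        if not isinstance(image, Image):
            continue
        # Reorder to (image, descriptions) and attach the placeholder position.
        wrapped.append(((image, descriptions), [PLACEHOLDER_POSITION]))
    return wrapped


img = Image()
result = wrap_figures([(["Description 1"], img), (["no image"], None)])
```

In this sketch the second input entry is discarded because its second element is not an image, so `result` holds a single wrapped figure.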

Class: `VisionFigureParser`

This is the core class that manages parsing, describing, and assembling figure data.

Constructor: `__init__`

def __init__(self, vision_model, figures_data, *args, **kwargs):
    ...

Purpose:
Initializes the parser with a given vision model and figure data. It extracts figures, descriptions, and positional metadata from the input data and checks that they are consistent with one another.

Parameters:

vision_model (object): An instance of a vision-language model used for generating figure descriptions.
figures_data (List[Tuple]): A list of figure data tuples, possibly including positional metadata.
*args, **kwargs: Additional arguments (not explicitly used).

Behavior:

Calls _extract_figures_info to parse input data.
Validates that the counts of figures, descriptions, and positions (if any) are consistent.

Method: `_extract_figures_info`

def _extract_figures_info(self, figures_data):
    ...

Purpose:
Internal method that parses raw figure data to extract three key attributes:

self.figures: List of PIL.Image.Image objects.
self.descriptions: List of lists of description strings.
self.positions: List of position lists (one per figure), or an empty list if no positions were provided.

Parameters:

figures_data (List[Tuple]): Raw input figure data.

Implementation Details:

Supports two data formats:
1. With positions: ((Image, [descriptions]), [(x1, y1, x2, y2, confidence)])
2. Without positions: (Image, [descriptions])
Validates input structure with assert statements.

Example:

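# Format 1: every entry carries a position list.
figures_data = [
    ((image_obj, ["desc1"]), [(0, 0, 100, 100, 0.9)]),
]
# Format 2: no entry carries positions.
# figures_data = [(image_obj2, ["desc2"])]
# Use one format per list: mixing them leaves fewer positions than figures,
# which fails the constructor's consistency check.
parser._extract_figures_info(figures_data)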
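
One way to tell the two accepted layouts apart is to check whether the first element of an entry is itself a tuple. The sketch below illustrates that idea as a free function; it is an assumption about how such parsing could work, not the module's real implementation. `extract_figures_info` and the stand-in `Image` class are hypothetical, and the final assertions mimic the consistency check attributed to the constructor.

```python
class Image:
    """Stand-in for PIL.Image.Image so the sketch runs without Pillow."""


def extract_figures_info(figures_data):
    """Split raw entries into parallel lists: figures, descriptions, positions."""
    figures, descriptions, positions = [], [], []
    for item in figures_data:
        assert len(item) == 2, f"unexpected figure entry: {item!r}"
        first, second = item
        if isinstance(first, tuple):
            # Format 1: ((Image, [descriptions]), [position, ...])
            image, descs = first
            positions.append(second)
        else:
            # Format 2: (Image, [descriptions])
            image, descs = first, second
        assert isinstance(image, Image) and isinstance(descs, list)
        figures.append(image)
        descriptions.append(descs)
    # Positions are all-or-nothing: either every figure has them or none does.
    assert len(figures) == len(descriptions)
    assert not positions or len(positions) == len(figures)
    return figures, descriptions, positions


img_a, img_b = Image(), Image()
with_pos = extract_figures_info([((img_a, ["desc1"]), [(0, 0, 100, 100, 0.9)])])
without_pos = extract_figures_info([(img_b, ["desc2"])])
```

Passing a list that mixes both layouts to this sketch trips the last assertion, which is why each input list should stick to a single format.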

Method: `_assemble`

def _assemble(self):
    ...

Purpose:
Assembles the parsed figures, descriptions, and positional data into a unified list of tuples for output.

Returns:

List[Tuple]: Each element is either:
- ((Image, [descriptions]), [positions]) if positions exist.
- ((Image, [descriptions]),) if no positions exist.

Implementation Details:

Checks whether positional data is present.
Creates a list combining image, descriptions, and optionally positions.
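
The two output shapes can be illustrated with a small stand-alone function. This is a sketch under the assumptions stated in this section; `assemble` is a hypothetical free-function version of the method, and plain strings stand in for image objects.

```python
def assemble(figures, descriptions, positions):
    """Combine parallel lists into the output tuples described above."""
    has_positions = len(positions) != 0
    assembled = []
    for i, figure in enumerate(figures):
        figure_desc = (figure, descriptions[i])
        if has_positions:
            # ((Image, [descriptions]), [positions])
            assembled.append((figure_desc, positions[i]))
        else:
            # ((Image, [descriptions]),) is a one-element tuple
            assembled.append((figure_desc,))
    return assembled


with_positions = assemble(["img-a"], [["desc a"]], [[(0, 0, 100, 100, 0.9)]])
without_positions = assemble(["img-b"], [["desc b"]], [])
```

Note that the no-position case still wraps the (image, descriptions) pair in an outer tuple, so consumers can index element 0 in both cases.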

Method: `__call__`

def __call__(self, **kwargs):
    ...

Purpose:
Enables the instance to be called as a function to process all figures concurrently, augment descriptions by querying the vision-language model, and assemble final results.

Parameters (optional):

callback (Callable): A function to receive progress updates with signature (progress: float, message: str). Defaults to a no-op lambda.

Process Workflow:

Defines an inner process function decorated with a timeout of 30 seconds (retries 3 times) which:
- Uses picture_vision_llm_chunk to generate descriptive text from the binary image.
- Uses a predefined prompt vision_llm_figure_describe_prompt.
- Calls the callback with progress updates.
Submits concurrent tasks to a shared thread pool executor (shared_executor) to process each figure image.
Upon completion, updates the corresponding descriptions by prepending the newly generated text.
Calls _assemble to package the final data.

Returns:

The assembled list of figure data tuples with updated descriptions.

Usage Example:

parser = VisionFigureParser(vision_model, figures_data)
result = parser(callback=lambda prog, msg: print(f"{prog*100}% - {msg}"))
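
The submit, collect, and prepend workflow can be sketched with the standard library alone. The code below is a simplified illustration rather than the module's code: `fake_describe` is a hypothetical stand-in for the vision-language model call, `describe_all` is a hypothetical name, strings stand in for images, and the progress values passed to the callback are an assumption (the source only states that the callback receives progress updates).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Shared pool, capped at 10 workers as described in this document.
shared_executor = ThreadPoolExecutor(max_workers=10)


def fake_describe(figure):
    """Hypothetical stand-in for the vision-language model call."""
    return f"generated description of {figure}"


def describe_all(figures, descriptions, callback=lambda progress, message: None):
    """Describe every figure concurrently and prepend the text to its descriptions."""
    # Map each future back to the index of the figure it describes.
    futures = {
        shared_executor.submit(fake_describe, fig): idx
        for idx, fig in enumerate(figures)
    }
    done = 0
    for future in as_completed(futures):
        idx = futures[future]
        text = future.result()
        if text:
            # Prepend so the generated text comes before existing descriptions.
            descriptions[idx] = [text] + descriptions[idx]
        done += 1
        callback(done / len(figures), f"described figure {idx}")
    return list(zip(figures, descriptions))


progress_log = []
result = describe_all(
    ["fig-a", "fig-b"],
    [["caption a"], []],
    callback=lambda progress, message: progress_log.append(progress),
)
```

Keeping the future-to-index mapping matters because `as_completed` yields futures in completion order, not submission order.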

Important Implementation Details

Concurrency: Uses a shared ThreadPoolExecutor (shared_executor) capped at 10 workers, so up to 10 figures can be described in parallel.
Timeout Handling: The process function has a timeout decorator that limits each execution to 30 seconds and retries up to 3 times, so a single slow or hung model call does not stall the whole batch indefinitely.
Flexible Input Handling: The parser supports both figures with and without positional metadata, enabling it to be used in multiple contexts.
Integration with Vision-Language Models: Relies on external components:
- picture_vision_llm_chunk from rag.app.picture to generate figure descriptions.
- vision_llm_figure_describe_prompt from rag.prompts to provide prompt templates.
- timeout decorator from api.utils.api_utils for resilient processing.
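
To make the timeout-and-retry behavior concrete, here is a minimal thread-based sketch of such a decorator. It is not the `timeout` decorator from api.utils.api_utils, whose implementation is not shown in this document; `timeout_with_retries` is a hypothetical name, and treating "retries 3 times" as three attempts in total is an assumption.

```python
import concurrent.futures
import functools
import time


def timeout_with_retries(seconds, attempts):
    """Hypothetical decorator: abandon a call after `seconds`, try `attempts` times."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for _ in range(attempts):
                # One throwaway worker per attempt; a timed-out thread is
                # abandoned (it keeps running in the background), not killed.
                executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
                future = executor.submit(func, *args, **kwargs)
                try:
                    return future.result(timeout=seconds)
                except concurrent.futures.TimeoutError:
                    continue
                finally:
                    executor.shutdown(wait=False)
            raise TimeoutError(f"{func.__name__} timed out {attempts} times")

        return wrapper

    return decorator


@timeout_with_retries(seconds=0.05, attempts=3)
def slow():
    time.sleep(0.2)
    return "never returned in time"


@timeout_with_retries(seconds=1, attempts=3)
def fast():
    return "ok"
```

Calling `fast()` returns its value on the first attempt, while `slow()` exhausts all attempts and raises `TimeoutError`. A thread-based approach like this cannot interrupt the underlying work, which is one reason a real implementation may differ.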

Interaction with Other System Components

Vision Model: The vision_model instance injected into the parser is the core AI model that processes images and generates descriptive text. This model is abstracted and supplied externally.
API Utilities: Uses the timeout decorator from api.utils.api_utils to manage execution timeouts and retries.
RAG Module (rag):
- rag.app.picture.vision_llm_chunk (referred to above as picture_vision_llm_chunk) handles the chunked processing of images with the vision model.
- rag.prompts.vision_llm_figure_describe_prompt provides the prompt guiding the model's description generation.

This file acts as a bridge between raw figure data and enriched, textual figure descriptions that can be used by other parts of the InfiniFlow system, likely downstream modules handling document understanding, report generation, or user interface rendering.

Visual Diagram

classDiagram
    class VisionFigureParser {
        -vision_model
        -figures: List[Image.Image]
        -descriptions: List[List[str]]
        -positions: List[Tuple]
        -assembled: List[Tuple]
        +__init__(vision_model, figures_data, *args, **kwargs)
        -_extract_figures_info(figures_data)
        -_assemble() List[Tuple]
        +__call__(**kwargs) List[Tuple]
    }

    VisionFigureParser ..> "vision_model" : uses
    VisionFigureParser ..> Image.Image : manages list of
    VisionFigureParser ..> ThreadPoolExecutor : uses shared_executor

Summary

figure_parser.py provides a parser class that extracts images and metadata, enriches figure descriptions via a vision-language model, and assembles the data for further processing. It combines thread-pool concurrency, per-call timeouts with retries, and support for input with or without positional metadata, making it a key utility in the InfiniFlow system's figure understanding pipeline.
