figure_parser.py


Overview

The figure_parser.py file is a utility module that processes and describes figure images using a vision-language model within the InfiniFlow system. It parses figure data (images, optionally accompanied by positional metadata), invokes the model to generate descriptive text for each figure, and organizes the enriched figure data for downstream use.

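Key functionalities include:

  • Wrapping figure data that lacks positional information into the format expected by the parser (vision_figure_parser_figure_data_wrapper).

  • Extracting figures, descriptions, and positional metadata from raw figure data (VisionFigureParser._extract_figures_info).

  • Generating a description for each figure concurrently with a vision-language model (VisionFigureParser.__call__).

  • Assembling the enriched figure data into a list of tuples for output (VisionFigureParser._assemble).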


Detailed Explanation

Function: vision_figure_parser_figure_data_wrapper

def vision_figure_parser_figure_data_wrapper(figures_data_without_positions):
    ...

Purpose:
Transforms a list of figure data tuples without positional information into the standardized format expected by the VisionFigureParser class. It keeps only those entries whose figure data is a valid PIL Image.Image object and drops the rest.

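Parameters:

  • figures_data_without_positions: a list of figure data tuples without positional information. In the usage example below, each tuple pairs a description list with a PIL Image.Image object.

Returns:

  • A list of figure entries in the format expected by VisionFigureParser, containing only the input entries that hold a valid Image.Image.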

Usage Example:

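from PIL import Image

# Blank placeholder images stand in for real figure crops.
pil_image_obj1 = Image.new("RGB", (64, 64))
pil_image_obj2 = Image.new("RGB", (64, 64))
raw_figures = [
    (["Description 1"], pil_image_obj1),
    (["Description 2"], pil_image_obj2),
]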
wrapped_figures = vision_figure_parser_figure_data_wrapper(raw_figures)
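
The filtering and wrapping described above can be sketched as follows. This is a minimal illustration, not the module's actual code: StandInImage is a hypothetical substitute for PIL's Image.Image so the sketch runs without Pillow, wrap_figures_sketch is an illustrative name, and the output shape with a placeholder position (0, 0, 0, 0, 0) is an assumption modeled on the positioned form shown in the _extract_figures_info example.

```python
from dataclasses import dataclass


@dataclass
class StandInImage:
    """Hypothetical stand-in for PIL's Image.Image (keeps the sketch dependency-free)."""
    name: str


def wrap_figures_sketch(figures_data_without_positions, image_type=StandInImage):
    """Keep only entries whose second element is an image and add a placeholder position."""
    wrapped = []
    for description, figure in figures_data_without_positions:
        if not isinstance(figure, image_type):
            continue  # skip entries that do not carry an image
        # Assumed output shape: ((image, descriptions), [placeholder_position])
        wrapped.append(((figure, description), [(0, 0, 0, 0, 0)]))
    return wrapped


raw_figures = [
    (["Description 1"], StandInImage("fig1")),
    (["Description 2"], None),  # no image, so this entry is dropped
]
wrapped_figures = wrap_figures_sketch(raw_figures)
print(len(wrapped_figures))
```

Passing the image type as a parameter is only a convenience for the sketch; with Pillow installed, Image.Image could be supplied instead.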

Class: VisionFigureParser

This is the core class that manages parsing, describing, and assembling figure data.

Constructor: __init__

def __init__(self, vision_model, figures_data, *args, **kwargs):
    ...

Purpose:
Initializes the parser with a given vision model and figure data. It extracts figures, descriptions, and positional metadata from the input data and performs consistency checks.

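Parameters:

  • vision_model: the vision-language model used to generate figure descriptions.

  • figures_data: the raw figure entries, with or without positional metadata, in the forms accepted by _extract_figures_info.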

Behavior:


Method: _extract_figures_info

def _extract_figures_info(self, figures_data):
    ...

Purpose:
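Internal method that parses raw figure data to extract three key attributes:

  • figures: the list of Image.Image objects.

  • descriptions: the list of description lists (List[List[str]]), one per figure.

  • positions: the list of positional metadata, where provided.

Parameters:

  • figures_data: a list of figure entries. As in the example below, an entry is either ((image, descriptions), positions) or (image, descriptions).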

Implementation Details:

Example:

figures_data = [
    ((image_obj, ["desc1"]), [(0, 0, 100, 100, 0.9)]),
    (image_obj2, ["desc2"])
]
parser._extract_figures_info(figures_data)
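
One way the two input shapes in the example could be told apart is sketched below. It is an illustration under stated assumptions rather than the module's implementation: StandInImage is a hypothetical substitute for PIL's Image.Image, the function names are invented for the sketch, and raising ValueError on malformed entries is an assumed error-handling choice.

```python
class StandInImage:
    """Hypothetical stand-in for PIL's Image.Image (keeps the sketch dependency-free)."""


def has_positions(item):
    """Return True when item looks like ((image, descriptions), [position, ...])."""
    head, tail = item
    return (
        isinstance(head, tuple)
        and len(head) == 2
        and isinstance(tail, list)
        and len(tail) > 0
        and isinstance(tail[0], tuple)
    )


def extract_figures_info_sketch(figures_data, image_type=StandInImage):
    """Split figure entries into parallel lists of figures, descriptions, positions."""
    figures, descriptions, positions = [], [], []
    for item in figures_data:
        if has_positions(item):
            (figure, description), position = item
            positions.append(position)
        else:
            figure, description = item
        if not isinstance(figure, image_type) or not isinstance(description, list):
            raise ValueError(f"unexpected figure entry: {item!r}")
        figures.append(figure)
        descriptions.append(description)
    return figures, descriptions, positions


img1, img2 = StandInImage(), StandInImage()
figures, descriptions, positions = extract_figures_info_sketch([
    ((img1, ["desc1"]), [(0, 0, 100, 100, 0.9)]),
    (img2, ["desc2"]),
])
```

In this sketch positions is only extended for entries that carry positional data, so it can end up shorter than figures when the two shapes are mixed.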

Method: _assemble

def _assemble(self):
    ...

Purpose:
Assembles the parsed figures, descriptions, and positional data into a unified list of tuples for output.

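Returns:

  • A list of tuples (List[Tuple]) that combines each figure with its descriptions and, where available, its positional data.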

Implementation Details:
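
A minimal sketch of the assembly step, under stated assumptions: assemble_sketch is an illustrative name, plain strings stand in for image objects, and the output tuple shapes (with and without positions) as well as the length checks are assumptions rather than confirmed behavior of _assemble.

```python
def assemble_sketch(figures, descriptions, positions):
    """Pair each figure with its descriptions and, when available, its positions."""
    if len(figures) != len(descriptions):
        raise ValueError("figures and descriptions must have the same length")
    if positions and len(positions) != len(figures):
        raise ValueError("positions must be empty or match figures in length")

    assembled = []
    for index, (figure, description) in enumerate(zip(figures, descriptions)):
        if positions:
            # Assumed shape with positions: ((figure, descriptions), positions)
            assembled.append(((figure, description), positions[index]))
        else:
            # Assumed shape without positions: a one-element tuple
            assembled.append(((figure, description),))
    return assembled


with_positions = assemble_sketch(["img_a"], [["a chart"]], [[(0, 0, 100, 100, 0.9)]])
without_positions = assemble_sketch(["img_b"], [["a diagram"]], [])
```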


Method: __call__

def __call__(self, **kwargs):
    ...

Purpose:
Enables the instance to be called as a function to process all figures concurrently, augment descriptions by querying the vision-language model, and assemble final results.

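Parameters (optional):

  • callback: a callable that receives progress updates. In the usage example below it takes a progress value and a message.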

Process Workflow:

  1. Defines an inner process function decorated with a timeout of 30 seconds (retries 3 times) which:

    • Uses picture_vision_llm_chunk to generate descriptive text from the binary image.

    • Uses a predefined prompt vision_llm_figure_describe_prompt.

    • Calls the callback with progress updates.

  2. Submits concurrent tasks to a shared thread pool executor (shared_executor) to process each figure image.

  3. Upon completion, updates the corresponding descriptions by prepending the newly generated text.

  4. Calls _assemble to package the final data.
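
The workflow above can be sketched with the standard library alone. This is a simplified illustration, not the module's code: describe_stub is a hypothetical stand-in for the picture_vision_llm_chunk call, a local ThreadPoolExecutor replaces shared_executor, the timeout and retry decorator is omitted, the callback is invoked from the collecting loop rather than inside the worker, and prepending is shown as list concatenation, which is one possible reading of step 3.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def describe_stub(figure):
    """Hypothetical stand-in for the vision-language model call."""
    return f"generated description of {figure}"


def describe_all_sketch(figures, descriptions, callback=lambda prog, msg: None):
    """Describe figures concurrently and prepend each result to its description list."""
    def process(index, figure):
        # Return the index with the text so results can be matched out of order.
        return index, describe_stub(figure)

    done = 0
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(process, i, fig) for i, fig in enumerate(figures)]
        for future in as_completed(futures):
            index, text = future.result()
            if text:
                descriptions[index] = [text] + descriptions[index]
            done += 1
            callback(done / len(figures), f"described figure {index}")
    return descriptions


progress_log = []
result = describe_all_sketch(
    ["fig0", "fig1"],
    [["caption 0"], ["caption 1"]],
    callback=lambda prog, msg: progress_log.append(prog),
)
```

Returning the figure index from each task matters because as_completed yields futures in completion order, not submission order.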

Returns:

  • The assembled list of tuples produced by _assemble.

Usage Example:

parser = VisionFigureParser(vision_model, figures_data)
# Assumes prog is a fraction between 0 and 1.
result = parser(callback=lambda prog, msg: print(f"{prog:.0%} - {msg}"))

Important Implementation Details


Interaction with Other System Components

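This file bridges raw figure data and enriched textual figure descriptions. It relies on picture_vision_llm_chunk, the vision_llm_figure_describe_prompt prompt, and the shared thread pool executor (shared_executor). Its output can be used by other parts of the InfiniFlow system, likely downstream modules handling document understanding, report generation, or user interface rendering.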


Visual Diagram

classDiagram
    class VisionFigureParser {
        -vision_model
        -figures: List[Image.Image]
        -descriptions: List[List[str]]
        -positions: List[Tuple]
        -assembled: List[Tuple]
        +__init__(vision_model, figures_data, *args, **kwargs)
        -_extract_figures_info(figures_data)
        -_assemble() List[Tuple]
        +__call__(**kwargs) List[Tuple]
    }

    VisionFigureParser ..> vision_model : uses
    VisionFigureParser ..> Image : manages list of
    VisionFigureParser ..> ThreadPoolExecutor : uses shared_executor

Summary

figure_parser.py provides a parser class that extracts images and metadata, enriches figure descriptions via a vision-language model, and assembles the data for further processing. It processes figures concurrently, bounds each model call with a timeout and retries, and accepts figure data with or without positions, making it a key utility in the InfiniFlow system's figure understanding pipeline.