figure_parser.py
Overview
The figure_parser.py file is a utility module that processes and describes figure images using a vision-language model within the InfiniFlow system. It parses figure data (images, optionally accompanied by positional metadata), invokes the model to generate descriptive text for each figure, and organizes the enriched figure data for downstream use.
Key functionalities include:
Wrapping raw figure image data into a standardized format.
Extracting and validating figure images, descriptions, and optional positional information.
Concurrently invoking a vision-language model to generate descriptive captions for each figure image.
Aggregating and assembling the processed figure data with descriptions for further processing or display.
Detailed Explanation
Function: vision_figure_parser_figure_data_wrapper
def vision_figure_parser_figure_data_wrapper(figures_data_without_positions):
...
Purpose:
Transforms a list of figure data tuples without positional information into the standardized format expected by the VisionFigureParser class. It keeps only those entries whose image element is a valid PIL Image.Image object and drops the rest.
Parameters:
figures_data_without_positions (List[Tuple]): A list of tuples, each containing figure data where the first element is a description list, and the second element is expected to be an image.
Returns:
List[Tuple]: A list of tuples where each tuple consists of:
A tuple (Image.Image, List) representing the image and its descriptions.
A list with a single position tuple (0, 0, 0, 0, 0) as placeholder positional metadata.
Usage Example:
raw_figures = [
(["Description 1"], pil_image_obj1),
(["Description 2"], pil_image_obj2)
]
wrapped_figures = vision_figure_parser_figure_data_wrapper(raw_figures)
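The transformation described above can be sketched as follows. This is a minimal illustration and not the module's actual code: the names wrap_figures_without_positions and StubImage are hypothetical, and StubImage stands in for PIL.Image.Image so the sketch runs without Pillow installed.

```python
class StubImage:
    """Hypothetical stand-in for PIL.Image.Image (assumption for this sketch)."""


def wrap_figures_without_positions(figures_data_without_positions, image_type=StubImage):
    """Wrap (descriptions, image) entries into ((image, descriptions), [position])."""
    wrapped = []
    for descriptions, image in figures_data_without_positions:
        # Skip entries whose second element is not a valid image object.
        if not isinstance(image, image_type):
            continue
        # Attach a single placeholder position tuple, as described above.
        wrapped.append(((image, descriptions), [(0, 0, 0, 0, 0)]))
    return wrapped


image = StubImage()
wrapped = wrap_figures_without_positions(
    [(["Description 1"], image), (["Description 2"], None)]
)
```

Here the second entry is dropped because its image element is None, so only one wrapped tuple remains.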
Class: VisionFigureParser
This is the core class that manages parsing, describing, and assembling figure data.
Constructor: __init__
def __init__(self, vision_model, figures_data, *args, **kwargs):
...
Purpose:
Initialize the parser with a given vision model and figure data. It extracts figures, descriptions, and positional metadata from the input data and performs consistency checks.
Parameters:
vision_model (object): An instance of a vision-language model used for generating figure descriptions.
figures_data (List[Tuple]): A list of figure data tuples, possibly including positional metadata.
*args, **kwargs: Additional arguments (not explicitly used).
Behavior:
Calls _extract_figures_info to parse input data.
Validates that the counts of figures, descriptions, and positions (if any) are consistent.
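The consistency validation can be pictured roughly as below. This is a sketch under assumptions: check_counts is a hypothetical helper name, and raising ValueError is an illustrative choice rather than the mechanism the constructor necessarily uses.

```python
def check_counts(figures, descriptions, positions):
    """Raise ValueError when the three parsed lists are not aligned."""
    # Every figure needs exactly one list of descriptions.
    if len(figures) != len(descriptions):
        raise ValueError(
            f"{len(figures)} figures but {len(descriptions)} description lists"
        )
    # Positions are optional, but when present there must be one entry per figure.
    if positions and len(positions) != len(figures):
        raise ValueError(
            f"{len(figures)} figures but {len(positions)} position entries"
        )
```

An empty positions list passes the check, which matches the "if any" wording above.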
Method: _extract_figures_info
def _extract_figures_info(self, figures_data):
...
Purpose:
Internal method that parses raw figure data to extract three key attributes:
self.figures: List of PIL.Image.Image objects.
self.descriptions: List of lists of description strings.
self.positions: List of positional metadata tuples, or empty if none provided.
Parameters:
figures_data (List[Tuple]): Raw input figure data.
Implementation Details:
Supports two data formats:
With positions: ((Image, [descriptions]), [(x1, y1, x2, y2, confidence)])
Without positions: (Image, [descriptions])
Validates input structure with assert statements.
Example:
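# Both entries carry positions, so the figure and position counts stay consistent.
figures_data = [
((image_obj, ["desc1"]), [(0, 0, 100, 100, 0.9)]),
((image_obj2, ["desc2"]), [(0, 0, 50, 50, 0.8)])
]
# Without positions, an entry is simply (image_obj, ["desc1"]).
parser._extract_figures_info(figures_data)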
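One way to distinguish the two input formats is sketched below. It is an illustration under assumptions, not the method's actual body: extract_figures_info is written as a free function that returns the three lists, StubImage is a hypothetical stand-in for PIL.Image.Image, and the detection rule (a nested 2-tuple followed by a list of 5-tuples) is inferred from the formats listed above.

```python
class StubImage:
    """Hypothetical stand-in for PIL.Image.Image (assumption for this sketch)."""


def extract_figures_info(figures_data, image_type=StubImage):
    """Split raw entries into parallel lists of figures, descriptions, positions."""
    figures, descriptions, positions = [], [], []
    for item in figures_data:
        assert len(item) == 2, f"expected a 2-element entry, got {len(item)}"
        # "With positions" format: ((image, [descriptions]), [(5-tuple), ...]).
        has_positions = (
            isinstance(item[0], tuple)
            and len(item[0]) == 2
            and isinstance(item[1], list)
            and len(item[1]) > 0
            and isinstance(item[1][0], tuple)
            and len(item[1][0]) == 5
        )
        if has_positions:
            (image, desc), pos = item
            positions.append(pos)
        else:
            # "Without positions" format: (image, [descriptions]).
            image, desc = item
        assert isinstance(image, image_type), "first element must be an image"
        assert isinstance(desc, list), "descriptions must be a list"
        figures.append(image)
        descriptions.append(desc)
    return figures, descriptions, positions
```

Returning the lists instead of setting attributes keeps the sketch easy to test in isolation.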
Method: _assemble
def _assemble(self):
...
Purpose:
Assembles the parsed figures, descriptions, and positional data into a unified list of tuples for output.
Returns:
List[Tuple]: Each element is either:
((Image, [descriptions]), [positions]) if positions exist.
((Image, [descriptions]),) if no positions exist.
Implementation Details:
Checks whether positional data is present.
Creates a list combining image, descriptions, and optionally positions.
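The two output shapes can be illustrated with a short sketch. The free function assemble and its parameters are hypothetical names chosen for this example. The real method operates on the instance attributes instead.

```python
def assemble(figures, descriptions, positions):
    """Combine parallel lists into the output tuples described above."""
    assembled = []
    for index, (figure, desc) in enumerate(zip(figures, descriptions)):
        figure_desc = (figure, desc)
        if positions:
            # Positional data exists: pair each figure with its position list.
            assembled.append((figure_desc, positions[index]))
        else:
            # No positional data: emit a 1-tuple holding only (figure, desc).
            assembled.append((figure_desc,))
    return assembled
```

Note the trailing comma in the second branch: without positions, each element is a 1-tuple rather than a bare (figure, desc) pair, so callers need to index it accordingly.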
Method: __call__
def __call__(self, **kwargs):
...
Purpose:
Enables the instance to be called as a function to process all figures concurrently, augment descriptions by querying the vision-language model, and assemble final results.
Parameters (optional):
callback (Callable): A function to receive progress updates with signature (progress: float, message: str). Defaults to a no-op lambda.
Process Workflow:
Defines an inner process function decorated with a timeout of 30 seconds (retries 3 times) which:
Uses picture_vision_llm_chunk to generate descriptive text from the binary image.
Uses a predefined prompt vision_llm_figure_describe_prompt.
Calls the callback with progress updates.
Submits concurrent tasks to a shared thread pool executor (shared_executor) to process each figure image.
Upon completion, updates the corresponding descriptions by prepending the newly generated text.
Calls _assemble to package the final data.
Returns:
The assembled list of figure data tuples with updated descriptions.
Usage Example:
parser = VisionFigureParser(vision_model, figures_data)
result = parser(callback=lambda prog, msg: print(f"{prog*100}% - {msg}"))
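The submit-then-collect pattern in the workflow above can be sketched with the standard library alone. Several parts are assumptions made for illustration: describe is a hypothetical callable standing in for the vision-model call, the progress values passed to callback are invented, and prepending the new text as a list element is one plausible reading of "prepending".

```python
from concurrent.futures import ThreadPoolExecutor

# Module-level pool shared by all calls (10 workers, matching the description).
shared_executor = ThreadPoolExecutor(max_workers=10)


def describe_figures(figures, descriptions, describe, callback=lambda prog, msg: None):
    """Describe every figure concurrently and prepend the text to its descriptions."""

    def process(index, figure):
        text = describe(figure)  # stand-in for the vision-language model call
        callback((index + 1) / len(figures), f"described figure {index}")
        return index, text

    # Submit one task per figure, then collect results in submission order.
    futures = [
        shared_executor.submit(process, index, figure)
        for index, figure in enumerate(figures)
    ]
    for future in futures:
        index, text = future.result()
        if text:  # leave descriptions untouched when the model returns nothing
            descriptions[index] = [text] + descriptions[index]
    return descriptions
```

Returning the figure index from each task lets results be matched to the right description list even though tasks may finish in any order.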
Important Implementation Details
Concurrency: Uses ThreadPoolExecutor with a max of 10 workers to process figure descriptions in parallel to improve performance.
Timeout Handling: The process function has a timeout decorator that limits execution to 30 seconds and retries up to 3 times if needed, ensuring robustness.
Flexible Input Handling: The parser supports both figures with and without positional metadata, enabling it to be used in multiple contexts.
Integration with Vision-Language Models: Relies on external components:
picture_vision_llm_chunk from rag.app.picture to generate figure descriptions.
vision_llm_figure_describe_prompt from rag.prompts to provide prompt templates.
timeout decorator from api.utils.api_utils for resilient processing.
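To make the timeout-and-retry behavior concrete, here is a generic sketch of such a decorator. It is not the timeout decorator from api.utils.api_utils, whose implementation is not shown in this document. The name timeout_with_retries and its parameters are hypothetical.

```python
import functools
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeoutError


def timeout_with_retries(seconds, attempts):
    """Run the wrapped function with a per-attempt time limit, retrying on timeout."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for _ in range(attempts):
                # A fresh single-worker pool per attempt, so a stuck call
                # cannot block the next attempt.
                pool = ThreadPoolExecutor(max_workers=1)
                future = pool.submit(func, *args, **kwargs)
                try:
                    return future.result(timeout=seconds)
                except FutureTimeoutError:
                    continue  # this attempt timed out, try again
                finally:
                    # Do not wait for a timed-out call to finish.
                    pool.shutdown(wait=False)
            raise TimeoutError(
                f"{func.__name__} timed out after {attempts} attempts"
            )

        return wrapper

    return decorator
```

One limitation of this approach is that a timed-out call keeps running in its background thread until it returns, because Python threads cannot be forcibly stopped.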
Interaction with Other System Components
Vision Model: The vision_model instance injected into the parser is the core AI model that processes images and generates descriptive text. This model is abstracted and supplied externally.
API Utilities: Uses the timeout decorator from api.utils.api_utils to manage execution timeouts and retries.
RAG Module (rag):
rag.app.picture.vision_llm_chunk (referred to elsewhere in this document as picture_vision_llm_chunk) handles the chunked processing of images with the vision model.
rag.prompts.vision_llm_figure_describe_prompt provides the prompt guiding the model's description generation.
This file acts as a bridge between raw figure data and enriched, textual figure descriptions that can be used by other parts of the InfiniFlow system, likely downstream modules handling document understanding, report generation, or user interface rendering.
Visual Diagram
classDiagram
class VisionFigureParser {
-vision_model
-figures: List[Image.Image]
-descriptions: List[List[str]]
-positions: List[Tuple]
-assembled: List[Tuple]
+__init__(vision_model, figures_data, *args, **kwargs)
-_extract_figures_info(figures_data)
-_assemble() List[Tuple]
+__call__(**kwargs) List[Tuple]
}
VisionFigureParser ..> "vision_model" : uses
VisionFigureParser ..> Image.Image : manages list of
VisionFigureParser ..> ThreadPoolExecutor : uses shared_executor
Summary
figure_parser.py provides a robust, concurrent-capable parser class that extracts images and metadata, enriches figure descriptions via a vision-language model, and assembles the data for further processing. It is designed with concurrency, timeout resilience, and flexible input handling, making it a key utility in the InfiniFlow system's figure understanding pipeline.