picture.py


Overview

The picture.py file is part of the InfiniFlow project and provides utility functions that turn images into text. It applies Optical Character Recognition (OCR) and, where needed, a vision-capable model (a Vision Language Model, VLM, or a computer-vision LLM) to produce tokenized text or markdown descriptions.

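The file contains two main functions:

  • chunk: runs OCR on an image and, when the OCR text is short, appends a CV LLM description; returns tokenized document chunks.

  • vision_llm_chunk: asks a VLM to describe an image and returns the cleaned markdown text.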

These utilities are designed to integrate with the larger InfiniFlow system, handling document chunking and enriching image data for downstream Natural Language Processing (NLP) or retrieval-augmented generation (RAG) tasks.


Detailed Documentation

Imports and Dependencies

The function descriptions and the diagram in this document imply the following dependencies:

  • PIL (Pillow) for opening images and converting them to RGB or JPEG.

  • numpy, since the OCR engine takes the image as an np.ndarray.

  • An in-memory bytes buffer for the JPEG conversion.

  • OCR, the text-recognition engine.

  • LLMBundle and LLMType, for the CV LLM call.

  • rag.nlp.tokenize and rag.utils.clean_markdown_block.


Global Objects


Functions

chunk(filename, binary, tenant_id, lang, callback=None, **kwargs)

Processes an image file to generate tokenized text chunks, using OCR first and optionally augmenting with a CV LLM description.

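Parameters:
  • filename (str): name of the image file; used to derive the title tokens.

  • binary (bytes): raw content of the image file.

  • tenant_id (str): tenant identifier passed to LLMBundle when the CV LLM is needed.

  • lang (str): language of the content; selects the word-based (English) or character-based length check.

  • callback (callable, optional): called with a progress value and a message.

  • **kwargs: additional keyword arguments; their use is not described here.

Returns:
  • List[dict]: a list containing the document dictionary, or an empty list if the CV LLM call fails.
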
Usage Example:
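# Assumes chunk has already been imported from picture.py.
def progress_callback(prog, msg):
    # prog is the fraction of work completed; msg is a status string.
    print(f"Progress {prog:.0%}: {msg}")

# Read the raw image bytes; chunk expects bytes, not a file path.
with open("sample.jpg", "rb") as f:
    binary_data = f.read()

docs = chunk("sample.jpg", binary_data, tenant_id="tenant123", lang="english", callback=progress_callback)
# docs is empty if the CV LLM call failed.
for doc in docs:
    print(doc["title_tks"])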
Implementation Details:
  1. Opens the image from binary data and converts it to RGB.

  2. Prepares a document dictionary with:

    • Filename-derived tokens (title tokens).

    • Raw image object.

    • Document type marked as "image".

  3. Runs OCR on the image, extracting bounding boxes and text snippets.

  4. Concatenates OCR text snippets into a single string.

  5. Reports progress: OCR finished.

  6. Checks if the extracted text is sufficiently long (more than 32 words for English or 32 characters for others).

    • If yes, tokenizes the OCR text and returns the document.

  7. If OCR text is short, attempts to call a CV LLM (via LLMBundle) to generate a description of the image.

    • Converts the image to JPEG bytes and sends it to the LLM.

    • Appends the LLM-generated description to OCR text.

    • Tokenizes the combined text.

    • Returns the document.

  8. On exceptions during LLM call, invokes the callback with error and returns an empty list.
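
The control flow above can be sketched as follows. This is a minimal, self-contained illustration, not the project's code: `run_ocr` and `describe_image` are hypothetical stand-ins for the OCR engine and the CV LLM call, the dictionary keys and progress values are assumed, and tokenization is left out. Only the 32-word / 32-character threshold and the empty-list result on failure come from the steps above.

```python
from typing import Callable, Optional


def is_long_enough(text: str, lang: str, threshold: int = 32) -> bool:
    """Length check from step 6: words for English, characters otherwise."""
    if lang.lower() == "english":
        return len(text.split()) > threshold
    return len(text) > threshold


def chunk_sketch(
    filename: str,
    lang: str,
    run_ocr: Callable[[], list],        # hypothetical: returns [(box, (text, score)), ...]
    describe_image: Callable[[], str],  # hypothetical: stands in for the CV LLM call
    callback: Optional[Callable[[float, str], None]] = None,
) -> list:
    report = callback or (lambda prog, msg: None)
    doc = {"title": filename, "doc_type": "image"}  # assumed key names

    # Steps 3-5: run OCR, join the text snippets, report progress.
    text = "\n".join(snippet for _box, (snippet, _score) in run_ocr())
    report(0.4, "OCR finished")  # progress value is an assumption

    # Step 6: OCR text that is long enough is returned on its own.
    if is_long_enough(text, lang):
        doc["content"] = text
        return [doc]

    # Steps 7-8: fall back to the model; on failure return an empty list.
    try:
        description = describe_image()
    except Exception as exc:
        report(-1, str(exc))  # error signalled through the callback
        return []
    doc["content"] = (text + "\n" + description).strip()
    return [doc]
```

Passing the OCR and model calls in as callables keeps the sketch runnable without any external services; in the real module these would be the OCR instance and the LLMBundle call.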


vision_llm_chunk(binary, vision_model, prompt=None, callback=None)

Generates markdown-formatted textual descriptions for an image using a Vision Language Model (VLM).

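Parameters:
  • binary (PIL.Image.Image): the image to describe; despite the name, this is an image object rather than raw bytes.

  • vision_model (object): a model exposing a describe_with_prompt() method.

  • prompt (str, optional): instruction passed to the model along with the image.

  • callback (callable, optional): called with a progress value and a message; used to report errors.

Returns:
  • str: the cleaned markdown description, or an empty string if an exception occurs.
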
Usage Example:
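from PIL import Image

# Assumes vision_llm_chunk has already been imported from picture.py.
def progress_callback(prog, msg):
    print(f"Progress {prog:.0%}: {msg}")

img = Image.open("sample.jpg")
# my_vlm_model is a placeholder for any object that provides describe_with_prompt().
markdown_text = vision_llm_chunk(img, vision_model=my_vlm_model, prompt="Describe the scene", callback=progress_callback)
print(markdown_text)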
Implementation Details:
  1. Uses an in-memory bytes buffer to save the image as JPEG.

  2. Calls the VLM's .describe_with_prompt() method passing the image bytes and optional prompt.

  3. Cleans the returned markdown block using clean_markdown_block.

  4. Returns the cleaned text.

  5. On exception, reports error via callback and returns an empty string.
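
These steps amount to a thin wrapper around the model call. The sketch below is illustrative only: `strip_markdown_fence` is a hypothetical stand-in for `clean_markdown_block`, whose exact behaviour is not described here, and `image` can be any object with a PIL-style `save(fp, format=...)` method. The error progress value of -1 is also an assumption.

```python
import io
import re

FENCE = "`" * 3  # a markdown code fence, built programmatically


def strip_markdown_fence(text: str) -> str:
    """Hypothetical stand-in for clean_markdown_block: drop a wrapping fence."""
    text = text.strip()
    text = re.sub(r"^" + FENCE + r"(?:markdown)?\s*\n", "", text)
    text = re.sub(r"\n" + FENCE + r"$", "", text)
    return text.strip()


def vision_llm_chunk_sketch(image, vision_model, prompt=None, callback=None) -> str:
    report = callback or (lambda prog, msg: None)
    try:
        # Step 1: serialize the image to JPEG bytes in memory.
        buffer = io.BytesIO()
        image.save(buffer, format="JPEG")
        # Step 2: ask the model for a description.
        raw = vision_model.describe_with_prompt(buffer.getvalue(), prompt)
        # Steps 3-4: clean the markdown and return it.
        return strip_markdown_fence(raw)
    except Exception as exc:
        # Step 5: report the error and return an empty string.
        report(-1, str(exc))
        return ""
```

Wrapping the whole body in one try block means a failed JPEG conversion and a failed model call are reported the same way, which matches the single error path described in step 5.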


Interaction with Other System Components

This file acts as a bridge between raw image data and the NLP/RAG pipeline by converting images into tokenized or markdown textual data.


Important Implementation Notes


Visual Diagram

classDiagram
    class picture.py {
        +chunk(filename:str, binary:bytes, tenant_id:str, lang:str, callback:callable=None, **kwargs) List[dict]
        +vision_llm_chunk(binary:PIL.Image.Image, vision_model:object, prompt:str=None, callback:callable=None) str
    }

    class OCR {
        +__call__(image_array:np.ndarray) List[Tuple[Tuple[int,int,int,int], Tuple[str,float]]]
    }

    class LLMBundle {
        +__init__(tenant_id:str, llm_type:LLMType, lang:str)
        +describe(image_bytes:bytes) str
        +describe_with_prompt(image_bytes:bytes, prompt:str) str
    }

    picture.py ..> OCR : uses
    picture.py ..> LLMBundle : uses
    picture.py ..> rag.nlp.tokenize : uses
    picture.py ..> rag.utils.clean_markdown_block : uses

Summary

The picture.py file converts image data into tokenized text or markdown. chunk combines OCR with an optional CV LLM description, and vision_llm_chunk returns a cleaned markdown description from a VLM. Both feed the InfiniFlow system's document processing and multimodal understanding pipeline.