picture.py

Overview

The picture.py file is part of the InfiniFlow project and provides utility functions that turn images into text. It applies Optical Character Recognition (OCR) and uses Vision Language Models (VLMs) or computer-vision (CV) LLMs to convert images into tokenized text or markdown descriptions.

The file contains two main functions:

chunk: Converts an image file into tokenized text chunks by first performing OCR and then optionally augmenting the text with a CV LLM description if the OCR text is short.
vision_llm_chunk: Wraps a Vision Language Model call to generate a markdown description directly from an image, optionally guided by a prompt.

These utilities are designed to integrate with the larger InfiniFlow system, handling document chunking and enriching image data for downstream Natural Language Processing (NLP) or retrieval-augmented generation (RAG) tasks.

Detailed Documentation

Imports and Dependencies

io: For in-memory byte stream handling.
re: Regular expressions for filename processing.
numpy: To convert PIL images into arrays for OCR.
PIL.Image: For image loading and manipulation.
api.db: Provides LLMType enum to identify language model types.
api.db.services.llm_service.LLMBundle: Interface to instantiate LLM services.
deepdoc.vision.OCR: OCR model to extract text and bounding boxes from images.
rag.nlp.tokenize: Function to tokenize text and bind it into document metadata.
rag.utils.clean_markdown_block: Cleans markdown blocks generated by VLM.
rag.nlp.rag_tokenizer: Tokenizer for document titles and filenames.

Global Objects

ocr: An instance of the OCR class from deepdoc.vision, used for extracting text from images.

Functions

`chunk(filename, binary, tenant_id, lang, callback=None, **kwargs)`

Processes an image file to generate tokenized text chunks, using OCR first and optionally augmenting with a CV LLM description.

Parameters:

filename (str): Name of the image file, used for generating document metadata.
binary (bytes): Raw bytes of the image file.
tenant_id (str): Identifier for the tenant requesting the operation, used to select the appropriate LLM model.
lang (str): Language name (e.g., 'english') used to guide tokenization and LLM selection.
callback (callable, optional): Function accepting (progress: float, message: str) to report operation progress/status.
**kwargs: Additional keyword arguments (not used internally; accepted for extensibility).
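
The callback contract can be illustrated with a small recorder. This is only a sketch: the ProgressRecorder class is hypothetical, the parameter names prog and msg mirror the usage example in this document, and the progress value 0.4 is made up for illustration.

```python
from typing import List, Tuple


class ProgressRecorder:
    """Hypothetical helper: collects (progress, message) pairs passed to a callback."""

    def __init__(self) -> None:
        self.events: List[Tuple[float, str]] = []

    def __call__(self, prog: float, msg: str) -> None:
        # Store every report so the caller can inspect it after the call returns.
        self.events.append((prog, msg))

    def last_message(self) -> str:
        # Return the most recent message, or an empty string if nothing was reported.
        return self.events[-1][1] if self.events else ""


recorder = ProgressRecorder()
recorder(0.4, "OCR finished")  # simulated report; 0.4 is an illustrative value
print(recorder.last_message())
```

An instance like this could be passed as callback=recorder, which makes it possible to inspect reported errors after the call instead of printing them.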

Returns:

List[dict]: A list containing a single document dictionary with keys:
- "docnm_kwd": Document name keyword derived from the filename.
- "title_tks": Tokenized title derived from the filename.
- "image": PIL Image object.
- "doc_type_kwd": Fixed string "image".
- Additional keys added by tokenize() corresponding to tokenized OCR+LLM text.
Returns an empty list if the CV LLM call raises an exception.
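
The shape of the returned dictionary can be sketched as follows. This is an approximation, not the actual implementation: build_image_doc is a hypothetical name, the extension-stripping regex and the storing of the unchanged filename under "docnm_kwd" are assumptions, and a simple lowercase split stands in for rag_tokenizer.

```python
import re
from typing import Any, Dict


def build_image_doc(filename: str, image: Any) -> Dict[str, Any]:
    """Hypothetical sketch of the metadata portion of the document dictionary."""
    # Assumption: the title is the filename without its extension.
    title = re.sub(r"\.[a-zA-Z0-9]+$", "", filename)
    # Stand-in for rag_tokenizer: lowercase and split on non-alphanumeric characters.
    title_tokens = " ".join(t for t in re.split(r"[^0-9a-zA-Z]+", title.lower()) if t)
    return {
        "docnm_kwd": filename,      # assumption: the filename is stored unchanged
        "title_tks": title_tokens,
        "image": image,             # a PIL Image object in the real function
        "doc_type_kwd": "image",
    }


doc = build_image_doc("Quarterly_Report-2024.png", image=None)
print(doc["title_tks"])  # quarterly report 2024
```

The keys that tokenize() adds for the OCR and LLM text are omitted here because their names are not listed in this document.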

Usage Example:

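# Assumes chunk has been imported from picture.py; the import path depends on the project layout.
def progress_callback(prog, msg):
    # prog is a float progress value; msg is a human-readable status message.
    print(f"Progress {prog * 100:.0f}%: {msg}")

# Read the raw image bytes from disk.
with open('sample.jpg', 'rb') as f:
    binary_data = f.read()

docs = chunk("sample.jpg", binary_data, tenant_id="tenant123", lang="english", callback=progress_callback)
# docs is empty if processing failed; otherwise it holds a single document dictionary.
for doc in docs:
    print(doc['title_tks'])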

Implementation Details:

Opens the image from binary data and converts it to RGB.
Prepares a document dictionary with:
- Filename-derived tokens (title tokens).
- Raw image object.
- Document type marked as "image".
Runs OCR on the image, extracting bounding boxes and text snippets.
Concatenates OCR text snippets into a single string.
Reports progress: OCR finished.
Checks if the extracted text is sufficiently long (more than 32 words for English or 32 characters for others).
- If yes, tokenizes the OCR text and returns the document.
If OCR text is short, attempts to call a CV LLM (via LLMBundle) to generate a description of the image.
- Converts the image to JPEG bytes and sends it to the LLM.
- Appends the LLM-generated description to OCR text.
- Tokenizes the combined text.
- Returns the document.
If the LLM call raises an exception, reports the error through the callback and returns an empty list.
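
The OCR concatenation and the length heuristic described in the steps above can be sketched as two small functions. This is a minimal sketch under stated assumptions: the function names are hypothetical, the OCR result shape (a list of box and (text, confidence) pairs) follows the diagram in this document, and joining snippets with a newline is an assumption.

```python
from typing import List, Sequence, Tuple

# Assumed OCR result shape: a list of (bounding_box, (text, confidence)) pairs.
OcrResult = Tuple[Sequence, Tuple[str, float]]

MIN_LENGTH = 32  # threshold stated in the description above


def join_ocr_text(results: List[OcrResult]) -> str:
    """Concatenate the non-empty text snippets of all OCR results, one per line."""
    return "\n".join(text for _box, (text, _score) in results if text)


def ocr_text_is_sufficient(txt: str, lang: str) -> bool:
    """Return True when the OCR text is long enough to skip the CV LLM step."""
    if lang.lower() == "english":
        return len(txt.split()) > MIN_LENGTH  # count words for English
    return len(txt) > MIN_LENGTH              # count characters otherwise


results = [((0, 0, 10, 10), ("Hello", 0.99)), ((0, 12, 10, 22), ("world", 0.95))]
txt = join_ocr_text(results)
print(ocr_text_is_sufficient(txt, "english"))  # False: only two words
```

When the check returns False, the flow described above falls back to the CV LLM and appends its description to the OCR text before tokenizing.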

`vision_llm_chunk(binary, vision_model, prompt=None, callback=None)`

Generates markdown-formatted textual descriptions for an image using a Vision Language Model (VLM).

Parameters:

binary (PIL.Image.Image): PIL Image object representing the image data. Despite its name, this parameter is an image object, not raw bytes.
vision_model (object): Vision Language Model instance with a .describe_with_prompt() method.
prompt (str, optional): Optional prompt text to guide the VLM description.
callback (callable, optional): Function accepting (progress: float, message: str) to report operation progress/status.

Returns:

str: The markdown text output generated by the VLM.
Empty string "" if an error occurs.

Usage Example:

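from PIL import Image

def progress_callback(prog, msg):
    print(f"Progress {prog * 100:.0f}%: {msg}")

# my_vlm_model is a placeholder for any object that exposes .describe_with_prompt().
img = Image.open('sample.jpg')
markdown_text = vision_llm_chunk(img, vision_model=my_vlm_model, prompt="Describe the scene", callback=progress_callback)
# An empty string indicates that an error occurred.
print(markdown_text)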

Implementation Details:

Uses an in-memory bytes buffer to save the image as JPEG.
Calls the VLM's .describe_with_prompt() method passing the image bytes and optional prompt.
Cleans the returned markdown block using clean_markdown_block.
Returns the cleaned text.
On exception, reports error via callback and returns an empty string.
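
The flow above can be approximated with test doubles in place of a real image and model. This is a sketch, not the actual implementation: describe_image, strip_markdown_fence, FakeImage, and FakeModel are hypothetical names, the fence-stripping helper is only a stand-in for clean_markdown_block (whose exact behavior is not shown here), and using -1 as the progress value for errors is an assumption.

```python
import io
import re
from typing import Callable, Optional

FENCE = "`" * 3  # built programmatically to avoid a literal fence in this example


def strip_markdown_fence(text: str) -> str:
    """Stand-in for clean_markdown_block: drops one wrapping markdown fence."""
    text = text.strip()
    text = re.sub(r"^" + FENCE + r"(?:markdown)?[ \t]*\n?", "", text)
    text = re.sub(r"\n?" + FENCE + r"$", "", text)
    return text.strip()


def describe_image(image, vision_model, prompt: Optional[str] = None,
                   callback: Optional[Callable[[float, str], None]] = None) -> str:
    """Hypothetical re-creation of the steps listed above."""
    try:
        buf = io.BytesIO()
        image.save(buf, format="JPEG")  # same call shape as PIL's Image.save
        raw = vision_model.describe_with_prompt(buf.getvalue(), prompt)
        return strip_markdown_fence(raw)
    except Exception as exc:
        if callback:
            callback(-1, str(exc))  # -1 as an error marker is an assumption
        return ""


class FakeImage:
    """Test double exposing the same save(fp, format=...) call as a PIL image."""

    def save(self, fp, format=None):
        fp.write(b"fake-jpeg-bytes")


class FakeModel:
    """Test double; a real model would call a VLM backend."""

    def describe_with_prompt(self, image_bytes, prompt=None):
        return FENCE + "markdown\n# A cat\n" + FENCE


print(describe_image(FakeImage(), FakeModel(), prompt="Describe the scene"))  # "# A cat"
```

Catching the exception inside the function keeps one failing image from aborting a larger batch, at the cost of requiring callers to treat an empty string as a failure signal.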

Interaction with Other System Components

OCR (deepdoc.vision.OCR): Provides the foundational text extraction from images.
LLMBundle (api.db.services.llm_service): Interfaces with LLM backends, specifically CV LLMs for image-to-text generation.
rag.nlp.tokenize & rag_tokenizer: Handle tokenization of text for downstream NLP tasks.
rag.utils.clean_markdown_block: Cleans markdown output from VLMs for consistent formatting.
Database/LLM Types (api.db.LLMType): Used to specify LLM model types for instantiation.

This file acts as a bridge between raw image data and the NLP/RAG pipeline by converting images into tokenized or markdown textual data.

Important Implementation Notes

The chunk function applies a heuristic to decide whether to augment OCR results with a CV LLM description based on text length.
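Errors in the model calls are reported through the callback rather than raised: on failure, chunk returns an empty list and vision_llm_chunk returns an empty string, so callers should check for empty results.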
The file assumes that the LLM service and OCR are properly configured and accessible.
The vision_llm_chunk function expects a VLM object with a .describe_with_prompt() method, indicating a pluggable model interface.
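
The pluggable model interface mentioned above can be expressed as a structural type. This is a sketch under assumptions: the VisionModel protocol and EchoModel class are hypothetical, and the method signature is inferred from the parameter descriptions and diagram in this document.

```python
from typing import Optional, Protocol, runtime_checkable


@runtime_checkable
class VisionModel(Protocol):
    """Hypothetical structural type for objects accepted as vision_model."""

    def describe_with_prompt(self, image_bytes: bytes, prompt: Optional[str] = None) -> str:
        ...


class EchoModel:
    """Toy model used only for illustration; it calls no real backend."""

    def describe_with_prompt(self, image_bytes: bytes, prompt: Optional[str] = None) -> str:
        return f"{len(image_bytes)} bytes; prompt={prompt!r}"


model = EchoModel()
print(isinstance(model, VisionModel))  # True: the required method is present
```

A runtime isinstance check against a Protocol only verifies that the method exists, not that its signature matches, so it works as a quick sanity check rather than full validation.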

Visual Diagram

classDiagram
    class picture.py {
        +chunk(filename:str, binary:bytes, tenant_id:str, lang:str, callback:callable=None, **kwargs) List[dict]
        +vision_llm_chunk(binary:PIL.Image.Image, vision_model:object, prompt:str=None, callback:callable=None) str
    }

    class OCR {
        +__call__(image_array:np.ndarray) List[Tuple[Tuple[int,int,int,int], Tuple[str,float]]]
    }

    class LLMBundle {
        +__init__(tenant_id:str, llm_type:LLMType, lang:str)
        +describe(image_bytes:bytes) str
        +describe_with_prompt(image_bytes:bytes, prompt:str) str
    }

    picture.py ..> OCR : uses
    picture.py ..> LLMBundle : uses
    picture.py ..> rag.nlp.tokenize : uses
    picture.py ..> rag.utils.clean_markdown_block : uses

Summary

The picture.py file provides utilities for transforming image data into tokenized or markdown text, combining OCR with Vision Language Models to improve the textual representation of images. It forms part of the InfiniFlow system's document processing and multimodal understanding pipeline.