cv_model.py

Overview

cv_model.py provides a unified interface, plus multiple implementations, for vision-capable large language models (LLMs). It covers three tasks: describing images, holding chat conversations that include images, and generating detailed textual interpretations of visual content, across a range of AI providers and APIs.

The core is an abstract base class, Base, which defines the key methods and utilities for interacting with vision-capable LLMs. Concrete subclasses add support for individual vendors and platforms (e.g., OpenAI GPT-4 Vision, Azure OpenAI, Google Gemini, NVIDIA, Anthropic, and local models). They hide provider-specific details, so a CV backend can be swapped or a new one added without changing the calling code.

Key features include:

- Encoding images to base64 data URLs compatible with multiple providers.
- Forming chat histories with optional embedded images.
- Handling synchronous and streaming chat completions.
- Supporting customizable prompts and generation configurations.
- Adapting to multiple vendor APIs and authentication mechanisms.

Class and Function Details

Class: `Base` (abstract)

Foundation class defining the interface and common utilities for CV LLM models.

Constructor

Base(**kwargs)

- max_retries (int): maximum retry attempts for API calls (default: environment value, else 5).
- retry_interval (float): base delay between retries (default: environment value, else 2.0). The class diagram lists the stored attribute as base_delay.
- max_rounds (int): maximum conversation rounds (default 5).
- is_tools (bool): flag for tool usage (default False).
- tools (list): tool references (default empty).
- toolcall_sessions (dict): session tracking for tool calls.
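
The defaults above can be sketched as follows. This is a minimal illustration, not the module's actual constructor: the class name `RetrySettings` and the environment variable names `LLM_MAX_RETRIES` and `LLM_BASE_DELAY` are assumptions chosen for the example.

```python
import os


class RetrySettings:
    """Hypothetical stand-in for the keyword handling described for Base."""

    def __init__(self, **kwargs):
        # Explicit keyword arguments win; otherwise fall back to the
        # environment, then to the documented defaults (5 and 2.0).
        # The environment variable names are assumptions of this sketch.
        self.max_retries = kwargs.get(
            "max_retries", int(os.environ.get("LLM_MAX_RETRIES", 5))
        )
        self.retry_interval = kwargs.get(
            "retry_interval", float(os.environ.get("LLM_BASE_DELAY", 2.0))
        )
        self.max_rounds = kwargs.get("max_rounds", 5)
        self.is_tools = kwargs.get("is_tools", False)
        self.tools = kwargs.get("tools", [])
        self.toolcall_sessions = {}


settings = RetrySettings(max_retries=3)
print(settings.max_retries, settings.retry_interval)
```

Passing a value explicitly, as in `RetrySettings(max_retries=3)`, bypasses the environment lookup for that setting only.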

Methods

describe(image) -> (str, int)
- Abstract method to generate a textual description of the input image.
- Parameters:
  - image: image input (bytes, PIL Image, base64 string, etc.)
- Returns:
  - Tuple of description string and token usage count.
- Usage: Override in subclasses.

describe_with_prompt(image, prompt=None) -> (str, int)
- Abstract method to describe an image guided by a custom prompt.
- Parameters:
  - image: image input, as for describe.
  - prompt (optional string): description prompt.
- Returns:
  - Tuple of description string and token usage count.
- Usage: Override in subclasses.

_form_history(system, history, images=[]) -> list
- Prepares a chat message history including an optional system prompt and embedded images.
- Parameters:
  - system (str): system prompt text.
  - history (list of dict): previous messages with roles.
  - images (list[str or bytes]): images to embed.
- Returns:
  - List of messages formatted for model consumption.

_image_prompt(text, images) -> str or list
- Embeds images into a prompt text by converting images to data URLs.
- Parameters:
  - text (str): textual prompt.
  - images (str, bytes, or list): images as base64 or raw bytes.
- Returns:
  - Combined prompt with embedded image URLs or base64.
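
How `_form_history` and `_image_prompt` might fit together can be sketched as below. The sketch assumes OpenAI-style content parts (`image_url` and `text`), assumes string inputs are already data URLs, and attaches the images to the first user turn. None of these choices is confirmed by this document, and the function names are illustrative only.

```python
import base64
import copy


def image_prompt(text, images):
    """Return `text` unchanged when there are no images, else a list of parts."""
    if not images:
        return text
    if isinstance(images, (str, bytes)):
        images = [images]
    parts = []
    for img in images:
        if isinstance(img, bytes):
            # Raw bytes are base64-encoded; PNG is assumed here for brevity.
            img = "data:image/png;base64," + base64.b64encode(img).decode("utf-8")
        parts.append({"type": "image_url", "image_url": {"url": img}})
    parts.append({"type": "text", "text": text})
    return parts


def form_history(system, history, images=None):
    """Prepend the system prompt and attach images to the first user turn."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    pending = list(images or [])
    for turn in history:
        turn = copy.copy(turn)  # avoid mutating the caller's history
        if pending and turn["role"] == "user":
            turn["content"] = image_prompt(turn["content"], pending)
            pending = []
        messages.append(turn)
    return messages


msgs = form_history(
    "You are a helpful assistant.",
    [{"role": "user", "content": "What is in this image?"}],
    images=[b"\x89PNG..."],
)
print(msgs[1]["content"][0]["type"])  # image_url
```

Using `images=None` instead of `images=[]` in the sketch avoids Python's shared mutable default argument.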

chat(system, history, gen_conf, images=[], **kwargs) -> (str, int)
- Conducts a synchronous chat completion with the model.
- Parameters:
  - system (str): system prompt.
  - history (list): conversation history.
  - gen_conf (dict): generation configuration options.
  - images (list): optional images for context.
- Returns:
  - Generated response string and total token usage.

chat_streamly(system, history, gen_conf, images=[], **kwargs) -> generator
- Streams chat completion tokens incrementally.
- Yields partial responses progressively, ending with token count.
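
The "text chunks first, token count last" contract can be illustrated with a plain generator. This sketch makes two assumptions: each yield carries only the new text (the document does not say whether chunks are deltas or the accumulated answer), and tokens are counted by whitespace splitting in place of a real tokenizer.

```python
def stream_chat(chunks):
    """Yield each text chunk as it arrives, then the total token count."""
    total_tokens = 0
    for chunk in chunks:
        # Whitespace splitting stands in for a real tokenizer.
        total_tokens += len(chunk.split())
        yield chunk
    yield total_tokens


def consume(stream):
    """Separate the text chunks from the trailing integer token count."""
    parts, tokens = [], 0
    for item in stream:
        if isinstance(item, int):
            tokens = item
        else:
            parts.append(item)
    return "".join(parts), tokens


print(consume(stream_chat(["A cat ", "on a mat."])))  # ('A cat on a mat.', 5)
```

A caller that updates a UI would render each string as it arrives and treat the final integer as the usage figure.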

image2base64(image) -> str
- Converts an image (bytes, BytesIO, or PIL Image) into a base64 data URL.
- Handles JPEG/PNG detection via magic numbers.
- Usage example:
```
b64_img = Base.image2base64(pil_image)
```
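
The magic-number detection mentioned above can be sketched without any third-party dependency. The function name `to_data_url` is hypothetical, the sketch handles only `bytes` and `BytesIO` (PIL images are left out to keep it self-contained), and the PNG fallback for unrecognised data is an assumption.

```python
import base64
from io import BytesIO


def to_data_url(image):
    """Encode bytes or a BytesIO as a base64 data URL with a sniffed MIME type."""
    if isinstance(image, BytesIO):
        image = image.getvalue()
    if not isinstance(image, (bytes, bytearray)):
        raise TypeError("this sketch only handles bytes and BytesIO")
    # JPEG files start with FF D8; PNG files start with an 8-byte signature.
    if image[:2] == b"\xff\xd8":
        mime = "image/jpeg"
    elif image[:8] == b"\x89PNG\r\n\x1a\n":
        mime = "image/png"
    else:
        mime = "image/png"  # fallback is an assumption of this sketch
    return f"data:{mime};base64," + base64.b64encode(image).decode("utf-8")


print(to_data_url(b"\xff\xd8\xff\xe0")[:23])  # data:image/jpeg;base64,
```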

prompt(b64) -> list
- Returns a default prompt message embedding the image base64 for description.
- Supports English and Chinese prompts based on self.lang.

vision_llm_prompt(b64, prompt=None) -> list
- Returns a vision LLM prompt embedding the image and an optional custom prompt.
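
A language-dependent prompt builder could look like the sketch below. The prompt wording, the OpenAI-style message shape, and the function names are all assumptions. The document states only that the result is a list embedding the image and that the text depends on the configured language.

```python
def default_prompt(b64, lang="Chinese"):
    """Build a one-message prompt that asks for a description of the image."""
    if lang.lower() == "chinese":
        text = "请详细描述这张图片的内容。"
    else:
        text = "Please describe the content of this image in detail."
    return [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": b64}},
                {"type": "text", "text": text},
            ],
        }
    ]


def custom_prompt(b64, prompt=None):
    """Same shape, but with caller-supplied text when one is given."""
    messages = default_prompt(b64, lang="English")
    if prompt:
        messages[0]["content"][1]["text"] = prompt
    return messages


message = custom_prompt("data:image/png;base64,AAAA", "List the objects you see.")
print(message[0]["content"][1]["text"])  # List the objects you see.
```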

Class: `GptV4(Base)`

OpenAI GPT-4 Vision implementation.

Constructor:
- key: OpenAI API key.
- model_name: e.g., "gpt-4-vision-preview" (default).
- lang: default "Chinese".
- base_url: OpenAI API base URL.

Implements describe and describe_with_prompt by calling the OpenAI API.

Example usage:

```
gptv4 = GptV4(key="sk-xxx")
desc, tokens = gptv4.describe(image)  # image: bytes, PIL Image, or base64 string
```

Other Vendor-Specific Implementations

These subclasses inherit from GptV4 or Base and override the constructor or individual methods to support their specific APIs and authentication:

- AzureGptV4: Azure OpenAI service.
- xAICV: xAI API.
- QWenCV: Tongyi-Qianwen model.
- HunyuanCV: Tencent Hunyuan AI.
- Zhipu4V: ZHIPU AI.
- StepFunCV: StepFun API.
- LmStudioCV: LM-Studio local LLMs.
- OpenAI_APICV: OpenAI API compatible models.
- TogetherAICV: TogetherAI platform.
- YiCV: 01.AI.
- SILICONFLOWCV: SiliconFlow API.
- OpenRouterCV: OpenRouter API.
- LocalAICV: Local AI servers.
- XinferenceCV: Xinference API.
- GPUStackCV: GPUStack API.

Each class adapts initialization and API calls as needed.
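
The pattern of overriding only the constructor can be sketched with two hypothetical classes. `OpenAIStyleCV` and `SelfHostedCV` are illustrative names, not classes from the module, and the URL normalisation and placeholder key are assumptions about what a self-hosted, OpenAI-compatible backend might need.

```python
from urllib.parse import urljoin


class OpenAIStyleCV:
    """Hypothetical parent that stores what an OpenAI-compatible client needs."""

    def __init__(self, key, model_name, lang="Chinese",
                 base_url="https://api.openai.com/v1"):
        self.key = key
        self.model_name = model_name
        self.lang = lang
        self.base_url = base_url


class SelfHostedCV(OpenAIStyleCV):
    """Hypothetical subclass for a self-hosted, OpenAI-compatible server."""

    def __init__(self, key, model_name, lang="Chinese", base_url=""):
        if not base_url:
            raise ValueError("a self-hosted backend needs an explicit base_url")
        # Normalise the URL so it ends in /v1, as OpenAI-style servers expect.
        if not base_url.rstrip("/").endswith("/v1"):
            base_url = urljoin(base_url.rstrip("/") + "/", "v1")
        # Local servers often ignore the key, so substitute a placeholder.
        super().__init__(key or "empty", model_name, lang, base_url)


backend = SelfHostedCV("", "llava", base_url="http://localhost:1234")
print(backend.base_url)  # http://localhost:1234/v1
```

Because request logic lives in the parent, a subclass like this inherits describe and chat behaviour unchanged.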

Class: `LocalCV(Base)`

Placeholder for local model integration. Currently non-functional (empty describe method).

Class: `OllamaCV(Base)`

Supports Ollama models with local or remote hosts.

- Uses the Ollama Python client.
- Handles image cleaning for embedded base64.
- Supports chat and streaming chat.
- Converts generation config options to Ollama-compatible parameters.
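
Converting generation settings might look like the sketch below. Ollama does name its output-length limit `num_predict`, and the other keys shown are option names it accepts. Which `gen_conf` keys the module actually maps is an assumption, as is the choice to drop unrecognised keys.

```python
def to_ollama_options(gen_conf):
    """Translate OpenAI-style generation settings into an Ollama options dict."""
    options = {}
    # These names are the same on both sides, so they are copied as-is.
    for key in ("temperature", "top_p", "presence_penalty", "frequency_penalty"):
        if key in gen_conf:
            options[key] = gen_conf[key]
    # Ollama calls the output-length limit num_predict rather than max_tokens.
    if "max_tokens" in gen_conf:
        options["num_predict"] = gen_conf["max_tokens"]
    return options


print(to_ollama_options({"temperature": 0.5, "max_tokens": 256, "stop": ["\n"]}))
# {'temperature': 0.5, 'num_predict': 256}
```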

Class: `GeminiCV(Base)`

Google Gemini model integration.

- Uses the Google Generative AI SDK.
- Supports image description and chat with images.
- Handles history formatting with image data.
- Supports streaming responses.

Class: `NvidiaCV(Base)`

NVIDIA AI vision language model integration.

- Uses REST API calls with requests.
- Supports image description, chat, and streaming.
- Handles Authorization headers and JSON payload formatting.

Class: `AnthropicCV(Base)`

Integration with Anthropic AI models.

- Uses the Anthropic Python client.
- Handles image embedding in Anthropic message format.
- Supports chat with special token handling and streaming.
- Filters generation config to provider constraints.
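
Two of these steps can be sketched as pure functions. The image block follows the base64 `source` shape used by Anthropic's Messages API. The helper names and the allow-list of generation settings are assumptions, not the module's code.

```python
def to_anthropic_image_block(data_url):
    """Convert a base64 data URL into an Anthropic-style image content block."""
    header, _, data = data_url.partition(",")
    # header looks like "data:image/png;base64"
    media_type = header[len("data:"):].split(";")[0] or "image/png"
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }


def filter_gen_conf(gen_conf, allowed=("temperature", "top_p", "max_tokens")):
    """Keep only the settings in `allowed`; the allow-list is an assumption."""
    return {k: v for k, v in gen_conf.items() if k in allowed}


block = to_anthropic_image_block("data:image/jpeg;base64,QUJD")
print(block["source"]["media_type"])  # image/jpeg
print(filter_gen_conf({"temperature": 0.2, "presence_penalty": 1.0}))
# {'temperature': 0.2}
```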

Class: `GoogleCV(AnthropicCV, GeminiCV)`

Hybrid class for Google Cloud's Vertex AI, which can serve both Gemini and Claude models.

- Dynamically chooses Anthropic or Gemini behavior based on the model name.
- Handles authentication with Google service account keys.
- Supports describe, describe_with_prompt, chat, and streaming.
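
Routing by model name with two parent classes can be sketched as follows. The classes are hypothetical stand-ins, and the `"claude"` substring test is an assumed rule, since the document does not say how the model name is inspected.

```python
class ClaudePath:
    """Stand-in for the Anthropic-style parent."""

    def describe(self, image):
        return "described via the Anthropic-style path", 0


class GeminiPath:
    """Stand-in for the Gemini-style parent."""

    def describe(self, image):
        return "described via the Gemini-style path", 0


class HybridCV(ClaudePath, GeminiPath):
    """Hypothetical hybrid that routes each call by model name."""

    def __init__(self, model_name):
        self.model_name = model_name

    def describe(self, image):
        # Call the chosen parent explicitly rather than relying on the MRO,
        # which would always pick ClaudePath first.
        if "claude" in self.model_name.lower():
            return ClaudePath.describe(self, image)
        return GeminiPath.describe(self, image)


print(HybridCV("gemini-pro-vision").describe(b"")[0])
# described via the Gemini-style path
```

Explicit parent calls matter here because `super().describe(image)` would resolve to the first parent regardless of the model.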

Important Implementation Details and Algorithms

- Image Encoding: The method image2base64 converts images into base64 data URLs with MIME sniffing to accurately label JPEG or PNG formats, ensuring compatibility with various providers.
- Message History Formation: _form_history methods format chat history messages for different providers, embedding images inline or as URLs depending on API requirements.
- Streaming Chat: chat_streamly methods in several classes use generator patterns to yield partial response chunks progressively, enabling real-time UI updates.
- Prompt Localization: Built-in Chinese and English prompts adapt descriptions to language preference.
- Provider Abstraction: The file abstracts multiple CV LLM providers behind a consistent interface, easing integration and future extensions.

Interaction With Other System Components

- External Libraries/SDKs: Uses APIs and SDKs from OpenAI, Azure, ZhipuAI, Google Generative AI, Anthropic, NVIDIA, Ollama, and others.
- RAG Module: Imports is_english, vision_llm_describe_prompt and num_tokens_from_string from the rag package for language detection, default prompts, and token counting.
- Environment Variables: Reads retry and delay parameters from environment variables for configurable robustness.
- Image Input: Accepts various image formats (raw bytes, PIL images, base64 strings) for description and chat.
- LLM Clients: Each subclass initializes a client object to interact with the respective external API or local server.
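
How the retry count and base delay might be used around an API call is sketched below. The document does not describe the retry schedule, so the exponential backoff with jitter is an assumption, as are the function name `call_with_retries` and its injectable `sleep` parameter.

```python
import random
import time


def call_with_retries(fn, max_retries=5, base_delay=2.0, sleep=time.sleep):
    """Call `fn` until it succeeds or `max_retries` attempts are used up."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff with jitter; the schedule is an assumption.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))


attempts = {"count": 0}


def flaky():
    """Fail twice with a transient error, then succeed."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


print(call_with_retries(flaky, sleep=lambda seconds: None))  # ok
```

Passing a no-op `sleep` keeps the example fast. Real callers would leave the default `time.sleep` in place.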

Usage Examples

```
from cv_model import GptV4

# Load an image as raw bytes (a PIL Image also works); the path is a placeholder
with open("example.jpg", "rb") as f:
    image = f.read()

# Initialize GPT-4 Vision model
model = GptV4(key="your_openai_api_key")

# Describe an image (PIL Image or bytes)
description, token_count = model.describe(image)

print(f"Description: {description} (Tokens: {token_count})")

# Describe with custom prompt
prompt = "Please describe the objects and colors in this image."
desc_custom, tokens_custom = model.describe_with_prompt(image, prompt=prompt)

# Chat with history and images
system_prompt = "You are a helpful assistant."
history = [{"role": "user", "content": "What is in this image?"}]
gen_conf = {"temperature": 0.5}

response, usage = model.chat(system_prompt, history, gen_conf, images=[image])
print(response)
```

Mermaid Class Diagram

```mermaid
classDiagram
    class Base {
        -max_retries: int
        -base_delay: float
        -max_rounds: int
        -is_tools: bool
        -tools: list
        -toolcall_sessions: dict
        +describe(image)
        +describe_with_prompt(image, prompt)
        +chat(system, history, gen_conf, images)
        +chat_streamly(system, history, gen_conf, images)
        +image2base64(image)
        +_form_history(system, history, images)
        +_image_prompt(text, images)
        +prompt(b64)
        +vision_llm_prompt(b64, prompt)
    }

    class GptV4 {
        -client: OpenAI
        -model_name: str
        -lang: str
        +describe(image)
        +describe_with_prompt(image, prompt)
    }

    Base <|-- GptV4

    class AzureGptV4 {
        -client: AzureOpenAI
        +__init__(key, model_name, lang, **kwargs)
    }
    GptV4 <|-- AzureGptV4

    class OllamaCV {
        -client: Ollama.Client
        -model_name: str
        -lang: str
        -keep_alive: int
        +describe(image)
        +describe_with_prompt(image, prompt)
        +chat(system, history, gen_conf, images)
    }
    Base <|-- OllamaCV

    class GeminiCV {
        -model: GenerativeModel
        -client: Google client
        -model_name: str
        -lang: str
        +describe(image)
        +describe_with_prompt(image, prompt)
        +chat(system, history, gen_conf, images)
    }
    Base <|-- GeminiCV

    class NvidiaCV {
        -key: str
        -base_url: str
        -lang: str
        +describe(image)
        +describe_with_prompt(image, prompt)
        +chat(system, history, gen_conf, images)
    }
    Base <|-- NvidiaCV

    class AnthropicCV {
        -client: Anthropic
        -model_name: str
        -system: str
        -max_tokens: int
        +describe(image)
        +describe_with_prompt(image, prompt)
        +chat(system, history, gen_conf, images)
        +chat_streamly(system, history, gen_conf, images)
    }
    Base <|-- AnthropicCV

    class GoogleCV {
        +describe(image)
        +describe_with_prompt(image, prompt)
        +chat(system, history, gen_conf, images)
        +chat_streamly(system, history, gen_conf, images)
    }
    AnthropicCV <|-- GoogleCV
    GeminiCV <|-- GoogleCV
```

Summary

cv_model.py is a modular, extensible layer for vision-language interactions across multiple AI backend providers. It hides image encoding, chat history management, and API integration details behind a consistent interface for image description and chat-based vision tasks, which makes it practical to switch between cloud and local vision LLMs during development and experimentation.