cv_model.py


Overview

cv_model.py provides a unified interface, plus several provider-specific implementations, for vision-capable large language models (LLMs). It supports describing images, holding chat conversations that include images, and generating detailed textual interpretations of visual content through a range of AI providers and APIs.

The core of the module is the abstract class Base, which defines the methods and utilities for interacting with vision-capable LLMs. Concrete subclasses add support for individual vendors and platforms (e.g., OpenAI GPT-4 Vision, Azure OpenAI, Google Gemini, NVIDIA, Anthropic, local models, and others). These classes hide provider-specific details, which makes it easier to switch or extend CV backends.
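The base-class-plus-subclasses pattern described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the module's code: VisionBackend, EchoBackend, ShoutBackend, BACKENDS, and make_backend are hypothetical names, and the backends return canned strings instead of calling any provider.

```python
from abc import ABC, abstractmethod


class VisionBackend(ABC):
    """Hypothetical stand-in for the module's Base class."""

    def __init__(self, **kwargs):
        # Keep provider-specific options without interpreting them here.
        self.options = kwargs

    @abstractmethod
    def describe(self, image: bytes) -> tuple[str, int]:
        """Return (description, token_count) for an image."""


class EchoBackend(VisionBackend):
    """Fake provider: reports the image size instead of calling an API."""

    def describe(self, image: bytes) -> tuple[str, int]:
        text = f"image of {len(image)} bytes"
        return text, len(text.split())


class ShoutBackend(EchoBackend):
    """Second fake provider that reuses EchoBackend and changes the output."""

    def describe(self, image: bytes) -> tuple[str, int]:
        text, tokens = super().describe(image)
        return text.upper(), tokens


# Name-to-class registry (an assumed convenience, not taken from cv_model.py).
BACKENDS = {"echo": EchoBackend, "shout": ShoutBackend}


def make_backend(name: str, **kwargs) -> VisionBackend:
    """Instantiate a backend by name so callers never reference a concrete class."""
    return BACKENDS[name](**kwargs)
```

Because callers depend only on describe, swapping make_backend("echo") for make_backend("shout") changes the provider without touching the calling code.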

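Key features include image description (describe, describe_with_prompt), chat and streaming chat with images (chat, chat_streamly), Base64 image encoding (image2base64), and interchangeable provider backends.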


Class and Function Details

Class: Base (abstract)

Foundation class defining the interface and common utilities for CV LLM models.
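One of the common utilities is Base64 image encoding (image2base64 in the class diagram). Its implementation is not shown in this document, so the following is only a sketch of the general idea. The function name image_to_base64 is hypothetical, and it handles just raw bytes and binary file-like objects; the real method may accept other types, such as PIL images.

```python
import base64
import io


def image_to_base64(image) -> str:
    """Encode raw bytes or a binary file-like object as a Base64 string.

    Hypothetical helper; not the module's image2base64.
    """
    if isinstance(image, (bytes, bytearray)):
        data = bytes(image)
    elif hasattr(image, "read"):
        # File-like object such as io.BytesIO or an open binary file.
        data = image.read()
    else:
        raise TypeError(f"unsupported image type: {type(image).__name__}")
    return base64.b64encode(data).decode("utf-8")


# Both calls encode the same two bytes.
from_bytes = image_to_base64(b"hi")
from_stream = image_to_base64(io.BytesIO(b"hi"))
```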

Constructor

Base(**kwargs)
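The class diagram lists max_retries and base_delay among the attributes of Base, but this document does not say how they are used. One common reading is retry with exponential backoff; the sketch below assumes that reading. RetryConfig, its run method, and the default values are all hypothetical.

```python
import time


class RetryConfig:
    """Hypothetical holder for retry settings passed as keyword arguments."""

    def __init__(self, **kwargs):
        # Defaults are illustrative, not taken from cv_model.py.
        self.max_retries = int(kwargs.get("max_retries", 3))
        self.base_delay = float(kwargs.get("base_delay", 1.0))

    def run(self, call, sleep=time.sleep):
        """Call `call()` until it succeeds or the retries are exhausted."""
        for attempt in range(self.max_retries + 1):
            try:
                return call()
            except Exception:
                if attempt == self.max_retries:
                    raise
                # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
                sleep(self.base_delay * (2 ** attempt))
```

Passing sleep as a parameter keeps the sketch testable without real waiting.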

Methods


Class: GptV4(Base)

OpenAI GPT-4 Vision implementation.

Example usage (image is a PIL Image or raw bytes; the key is a placeholder):

gptv4 = GptV4(key="sk-xxx")
desc, tokens = gptv4.describe(image)  # returns (description, token count)
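For context, OpenAI's Chat Completions API accepts images as image_url content parts, and a Base64 data URL is one accepted form. How GptV4 assembles its request is not shown in this document, so the following is a sketch of a plausible payload rather than the class's actual code. The function name build_vision_messages and the default MIME type are assumptions.

```python
import base64


def build_vision_messages(image_bytes: bytes, prompt: str,
                          mime: str = "image/jpeg") -> list[dict]:
    """Build a single-turn message list with one image part and one text part."""
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return [
        {
            "role": "user",
            "content": [
                # The image travels inline as a data URL.
                {"type": "image_url",
                 "image_url": {"url": f"data:{mime};base64,{b64}"}},
                {"type": "text", "text": prompt},
            ],
        }
    ]


messages = build_vision_messages(b"hi", "Describe this image.")
```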

Other Vendor-Specific Implementations

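These subclasses inherit from GptV4 or Base and override the constructor or individual methods to support their own APIs and authentication. For example, AzureGptV4 extends GptV4 and replaces the client with an AzureOpenAI instance. Each class adapts initialization and API calls as needed.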


Class: LocalCV(Base)

Placeholder for local model integration. Currently non-functional (empty describe method).


Class: OllamaCV(Base)

Supports Ollama models with local or remote hosts.
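Ollama's chat endpoint takes images as a list of Base64 strings on the message itself, and a keep_alive field controls how long the model stays loaded. The sketch below builds such a request body without sending it. It is an illustration only: build_ollama_chat_request is a hypothetical name, "llava" is just an example model, and the keep_alive default of -1 (keep the model loaded) is an assumption rather than the value OllamaCV uses.

```python
import base64
import json


def build_ollama_chat_request(model_name, prompt, image_bytes, keep_alive=-1):
    """Return the JSON body for a non-streaming Ollama /api/chat call."""
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "model": model_name,
        # Images are attached to the message as Base64 strings.
        "messages": [{"role": "user", "content": prompt, "images": [b64]}],
        "stream": False,
        "keep_alive": keep_alive,
    }


body = build_ollama_chat_request("llava", "Describe this image.", b"hi")
print(json.dumps(body, indent=2))
```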


Class: GeminiCV(Base)

Google Gemini model integration.


Class: NvidiaCV(Base)

NVIDIA AI vision language model integration.


Class: AnthropicCV(Base)

Integration with Anthropic AI models.
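Anthropic's Messages API differs from the OpenAI format in three ways that line up with the system and max_tokens attributes in the class diagram: the system prompt is a top-level field, max_tokens is required, and images are sent as base64 source blocks. The sketch below builds such a request as a plain dictionary. The function name and the model placeholder are hypothetical, and nothing here is taken from AnthropicCV itself.

```python
import base64


def build_anthropic_request(model_name, system, max_tokens, image_bytes, prompt,
                            media_type="image/jpeg"):
    """Return keyword arguments shaped for Anthropic's Messages API."""
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "model": model_name,
        "max_tokens": max_tokens,  # required by the Messages API
        "system": system,          # top-level field, not a chat message
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image",
                     "source": {"type": "base64",
                                "media_type": media_type,
                                "data": b64}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    }


request = build_anthropic_request(
    "claude-model-placeholder",  # placeholder, not a real model name
    "You are a helpful assistant.",
    1024,
    b"hi",
    "Describe this image.",
)
```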


Class: GoogleCV(AnthropicCV, GeminiCV)

Hybrid class supporting Google Cloud's Vertex AI and Claude models.
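Inheriting from two sibling classes means Python's method resolution order (MRO) decides which parent method runs by default. The sketch below shows that order for a class shaped like GoogleCV and one way to dispatch explicitly to either parent. The class names and the use_claude flag are hypothetical; this document does not state how GoogleCV chooses between its parents.

```python
class BaseSketch:
    def describe(self, image):
        return "base"


class AnthropicSketch(BaseSketch):
    def describe(self, image):
        return "anthropic"


class GeminiSketch(BaseSketch):
    def describe(self, image):
        return "gemini"


class GoogleSketch(AnthropicSketch, GeminiSketch):
    """Inherits from both parents and picks one explicitly per call."""

    def __init__(self, use_claude: bool):
        # Hypothetical switch; the real selection logic is not documented here.
        self.use_claude = use_claude

    def describe(self, image):
        parent = AnthropicSketch if self.use_claude else GeminiSketch
        return parent.describe(self, image)


# Without the override, AnthropicSketch.describe would win because it comes first.
mro_names = [cls.__name__ for cls in GoogleSketch.__mro__]
print(mro_names)
```

Explicit dispatch keeps the behavior independent of the order in which the parents are listed.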


Important Implementation Details and Algorithms


Interaction With Other System Components


Usage Examples

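from PIL import Image

from cv_model import GptV4

# Load an image (the path is a placeholder)
image = Image.open("example.jpg")

# Initialize GPT-4 Vision model
model = GptV4(key="your_openai_api_key")

# Describe an image (PIL Image or bytes)
description, token_count = model.describe(image)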

print(f"Description: {description} (Tokens: {token_count})")

# Describe with custom prompt
prompt = "Please describe the objects and colors in this image."
desc_custom, tokens_custom = model.describe_with_prompt(image, prompt=prompt)

# Chat with history and images
system_prompt = "You are a helpful assistant."
history = [{"role": "user", "content": "What is in this image?"}]
gen_conf = {"temperature": 0.5}

response, usage = model.chat(system_prompt, history, gen_conf, images=[image])
print(response)

Mermaid Class Diagram

classDiagram
    class Base {
        -max_retries: int
        -base_delay: float
        -max_rounds: int
        -is_tools: bool
        -tools: list
        -toolcall_sessions: dict
        +describe(image)
        +describe_with_prompt(image, prompt)
        +chat(system, history, gen_conf, images)
        +chat_streamly(system, history, gen_conf, images)
        +image2base64(image)
        +_form_history(system, history, images)
        +_image_prompt(text, images)
        +prompt(b64)
        +vision_llm_prompt(b64, prompt)
    }

    class GptV4 {
        -client: OpenAI
        -model_name: str
        -lang: str
        +describe(image)
        +describe_with_prompt(image, prompt)
    }

    Base <|-- GptV4

    class AzureGptV4 {
        -client: AzureOpenAI
        +__init__(key, model_name, lang, **kwargs)
    }
    GptV4 <|-- AzureGptV4

    class OllamaCV {
        -client: Ollama.Client
        -model_name: str
        -lang: str
        -keep_alive: int
        +describe(image)
        +describe_with_prompt(image, prompt)
        +chat(system, history, gen_conf, images)
    }
    Base <|-- OllamaCV

    class GeminiCV {
        -model: GenerativeModel
        -client: Google client
        -model_name: str
        -lang: str
        +describe(image)
        +describe_with_prompt(image, prompt)
        +chat(system, history, gen_conf, images)
    }
    Base <|-- GeminiCV

    class NvidiaCV {
        -key: str
        -base_url: str
        -lang: str
        +describe(image)
        +describe_with_prompt(image, prompt)
        +chat(system, history, gen_conf, images)
    }
    Base <|-- NvidiaCV

    class AnthropicCV {
        -client: Anthropic
        -model_name: str
        -system: str
        -max_tokens: int
        +describe(image)
        +describe_with_prompt(image, prompt)
        +chat(system, history, gen_conf, images)
        +chat_streamly(system, history, gen_conf, images)
    }
    Base <|-- AnthropicCV

    class GoogleCV {
        +describe(image)
        +describe_with_prompt(image, prompt)
        +chat(system, history, gen_conf, images)
        +chat_streamly(system, history, gen_conf, images)
    }
    AnthropicCV <|-- GoogleCV
    GeminiCV <|-- GoogleCV

    class LocalCV {
        +describe(image)
    }
    Base <|-- LocalCV

Summary

cv_model.py is a modular, extensible module for vision-language interactions across multiple AI providers. It abstracts image encoding, chat history management, and API integration behind a consistent interface for image description and chat-based vision tasks, which makes it easier to develop and experiment with vision LLMs in both cloud and local deployments.