chat_model.py
Overview
The chat_model.py file is a comprehensive module designed to provide a unified interface and multiple implementations for interacting with various Large Language Models (LLMs) and AI chat services. It abstracts away provider-specific APIs and handles complex features such as:
Chat completions and streaming responses.
Tool integration and tool-call workflows.
Error classification and automatic retry with exponential backoff.
Dynamic context window size calculation.
Support for a wide variety of LLM providers, including OpenAI, Azure, HuggingFace, Baichuan, ZhipuAI, Google Cloud, and many others.
Support for specialized local, cloud, and third-party services.
Streaming and non-streaming chat capabilities.
Token counting and response truncation handling.
This module acts as a key integration layer in the system’s Retrieval-Augmented Generation (RAG) pipeline or any application requiring interaction with diverse LLM backends, normalizing their APIs and managing their lifecycle.
Detailed Documentation
Enumerations
LLMErrorCode (inherits from StrEnum)
Defines string constants representing various error codes for LLM interactions, used for error classification and handling retries.
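| Error Code | Description |
|---|---|
| ERROR_RATE_LIMIT | Request rate limit exceeded |
| ERROR_AUTHENTICATION | Authentication failure |
| ERROR_INVALID_REQUEST | Invalid request parameters |
| ERROR_SERVER | Server error or unavailability |
| ERROR_TIMEOUT | Request timed out |
| ERROR_CONNECTION | Connection/network issues |
| ERROR_MODEL | Model-related errors |
| ERROR_MAX_ROUNDS | Exceeded maximum interaction rounds |
| ERROR_CONTENT_FILTER | Content filtered by safety policies |
| ERROR_QUOTA | Quota exceeded |
| ERROR_MAX_RETRIES | Maximum retry attempts exceeded |
| ERROR_GENERIC | Generic or unknown errors |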
ReActMode (inherits from StrEnum)
Defines modes for ReAct-style behavior in chat interactions:
FUNCTION_CALL: Use function call mode.
REACT: Use ReAct mode.
Protocols
ToolCallSession
A protocol interface that defines the contract for a tool call session with a required method:
tool_call(name: str, arguments: dict[str, Any]) -> str:
Executes a tool call by name with provided arguments and returns the result as a string.
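As a minimal sketch of this contract, the protocol could be declared with typing.Protocol as shown below. The CalculatorSession class, its "add" tool, and the run_tool helper are hypothetical and exist only for illustration; the real module may declare the protocol differently.

```python
from typing import Any, Protocol


class ToolCallSession(Protocol):
    """Structural type: any object with a matching tool_call method qualifies."""

    def tool_call(self, name: str, arguments: dict[str, Any]) -> str:
        ...


class CalculatorSession:
    """Hypothetical session exposing a single 'add' tool (illustration only)."""

    def tool_call(self, name: str, arguments: dict[str, Any]) -> str:
        if name == "add":
            return str(arguments["a"] + arguments["b"])
        return f"Unknown tool: {name}"


def run_tool(session: ToolCallSession, name: str, arguments: dict[str, Any]) -> str:
    # Accepts any object that satisfies the protocol; no inheritance is required.
    return session.tool_call(name, arguments)
```

Because the protocol is structural, CalculatorSession does not need to subclass ToolCallSession to be accepted by run_tool.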
Classes
Base (abstract base class)
The foundational class for all chat model implementations. It wraps an LLM API client and provides common functionality such as error handling, retries, chat interaction, and tool integration.
Constructor
Base(key, model_name, base_url, **kwargs)
key (str): API key or credential token.
model_name (str): LLM model identifier.
base_url (str): Base URL for the LLM API endpoint.
kwargs: Optional parameters:
  max_retries (int): Maximum retry attempts (default from env LLM_MAX_RETRIES or 5).
  retry_interval (float): Base delay for retries in seconds (default from env LLM_BASE_DELAY or 2.0).
  max_rounds (int): Max rounds of tool call interaction (default 5).
Key Methods
_get_delay() -> float:
Returns a randomized delay time for retry backoff.
_classify_error(error: Exception) -> LLMErrorCode:
Parses error messages and returns a classified LLMErrorCode.
_clean_conf(gen_conf: dict) -> dict:
Sanitizes generation config dict by removing unsupported keys.
_chat(history: list, gen_conf: dict, **kwargs) -> (str, int):
Performs a synchronous chat request returning (answer, token_count).
_chat_streamly(history: list, gen_conf: dict, **kwargs) -> Generator:
Yields streaming chat responses as (partial_answer, token_count) tuples.
_length_stop(ans: str) -> str:
Appends a language-appropriate truncation notice if the answer is cut off.
_exceptions(e: Exception, attempt: int) -> Optional[str]:
Handles exceptions, logs, classifies, and decides on retry or error return.
_verbose_tool_use(name: str, args: dict, res: str) -> str:
Formats tool call usage info for insertion into chat history.
_append_history(hist: list, tool_call, tool_res: Any) -> list:
Adds tool call and tool response to the chat history.
bind_tools(toolcall_session: ToolCallSession, tools: list):
Binds a tool call session and tool definitions for use during chat.
chat_with_tools(system: str, history: list, gen_conf: dict = {}) -> (str, int):
Chat method that supports tool calls with retries and multiple rounds.
chat(system: str, history: list, gen_conf: dict = {}, **kwargs) -> (str, int):
Standard chat method without tools, with retries.
chat_streamly_with_tools(system: str, history: list, gen_conf: dict = {}) -> Generator[str or int]:
Streaming chat with tools, yielding partial results and token counts.
chat_streamly(system: str, history: list, gen_conf: dict = {}, **kwargs) -> Generator[str or int]:
Streaming chat without tools.
total_token_count(resp) -> int:
Attempts to extract total token usage from API response.
_calculate_dynamic_ctx(history: list) -> int:
Calculates a dynamic context window size based on token counts and buffer.
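To make the last method more concrete, here is a rough sketch of how a dynamic context size could be derived from a chat history. It is not the module's actual algorithm: the per-character token weights, the 20% buffer, the 1024-token rounding step, and the 2048-token floor are all assumptions chosen for illustration, and the function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate; the weights below are illustrative assumptions."""
    ascii_chars = sum(1 for ch in text if ord(ch) < 128)
    non_ascii_chars = len(text) - ascii_chars
    # Assumption: about 4 ASCII characters per token (rounded up),
    # and about 1 token per non-ASCII character (e.g. CJK text).
    return (ascii_chars + 3) // 4 + non_ascii_chars


def calculate_dynamic_ctx(
    history: list[dict],
    buffer_percent: int = 20,  # assumed safety margin on top of the estimate
    step: int = 1024,          # assumed rounding granularity
    floor: int = 2048,         # assumed minimum context size
) -> int:
    """Estimate a context window size for the given chat history."""
    total = sum(estimate_tokens(str(msg.get("content", ""))) for msg in history)
    needed = total * (100 + buffer_percent) // 100
    # Round up to the next multiple of `step`, never going below `floor`.
    rounded = ((needed + step - 1) // step) * step
    return max(floor, rounded)
```

Integer arithmetic is used for the buffer so the result does not depend on floating-point rounding.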
Usage Example
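# Illustrative only: assumes Base is already imported. Base is described as abstract,
# so real code would normally instantiate a provider subclass instead.
base = Base(key="my_api_key", model_name="gpt-4", base_url="https://api.openai.com/v1")
# chat() returns the answer text and a token count.
answer, tokens = base.chat(system="You are a helpful assistant.", history=[{"role": "user", "content": "Hello"}])
print(answer)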
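The streaming methods are documented as yielding either strings or integers. A caller could consume such a stream roughly as sketched below. This assumes that string items are incremental text chunks and that an integer item carries the token count; if the real generator yields cumulative answers instead, only the last string should be kept. The fake_stream generator is a hypothetical stand-in for a call such as base.chat_streamly(...).

```python
from typing import Iterable, Union


def collect_stream(stream: Iterable[Union[str, int]]) -> tuple[str, int]:
    """Gather text chunks and keep the last integer seen as the token count."""
    parts: list[str] = []
    tokens = 0
    for item in stream:
        if isinstance(item, int):
            tokens = item  # assumed: integers report token usage
        else:
            parts.append(item)  # assumed: strings are incremental text chunks
    return "".join(parts), tokens


def fake_stream():
    # Hypothetical stand-in for base.chat_streamly(...): text chunks, then a count.
    yield "Hel"
    yield "lo"
    yield 2
```

In an interactive application, each string chunk would typically be forwarded to the client as it arrives rather than collected first.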
Provider-Specific Subclasses
Each subclass extends Base or LiteLLMBase, configures provider-specific client initialization and parameters, and overrides _chat or the streaming methods where needed.
Key subclasses include (non-exhaustive):
GptTurbo (OpenAI GPT-3.5 Turbo)
AzureChat (Azure OpenAI)
BaiChuanChat (Baichuan AI)
ZhipuChat (Zhipu AI)
LocalAIChat (Local AI deployment)
LocalLLM (Local LLM via Jina client)
VolcEngineChat (VolcEngine platform)
MiniMaxChat (MiniMax AI)
MistralChat (Mistral AI)
OpenRouterChat (OpenRouter API)
StepFunChat (StepFun AI)
LmStudioChat (LM Studio local API)
OpenAI_APIChat (VLLM or OpenAI API compatible)
PPIOChat
LeptonAIChat
PerfXCloudChat
UpstageChat
NovitaAIChat
SILICONFLOWChat
YiChat
GiteeChat
ReplicateChat
HunyuanChat (Tencent Hunyuan)
SparkChat (XunFei Spark)
BaiduYiyanChat
GoogleChat (Google Cloud Vertex and Anthropic)
GPUStackChat
Ai302Chat
TokenPonyChat
MeituanChat
Each subclass handles:
Client initialization with credentials.
API endpoint and model name normalization.
Configuration sanitization.
Overriding _chat or streaming methods as per provider API.
Provider-specific error handling and request construction.
LiteLLMBase (abstract base class)
A specialized base class for lightweight LLM providers accessed via the litellm library. It manages provider-specific auth, API base URLs, and request construction.
Key additions compared to Base:
Uses litellm.completion() for chat completions.
Manages provider-specific fields (e.g., AWS credentials for Bedrock).
Implements _construct_completion_args() to build provider-specific request parameters.
Reimplements tool call integration and streaming support adapted for litellm.
Provides error classification, retry logic, and token counting similar to Base.
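As a rough illustration of what building request parameters might involve, the sketch below assembles a keyword-argument dict of the kind that could be passed to litellm.completion(). It is a free function rather than the real method, its signature differs from _construct_completion_args(), and the 600-second timeout default and the "auto" tool choice are assumptions. The sketch does not call litellm.

```python
from typing import Any, Optional


def construct_completion_args(
    model_name: str,
    prefix: str,
    history: list[dict],
    stream: bool,
    tools: Optional[list[dict]] = None,
    api_key: Optional[str] = None,
    base_url: Optional[str] = None,
    timeout: int = 600,  # assumed default, not taken from the module
    **gen_conf: Any,
) -> dict[str, Any]:
    """Build keyword arguments for a completion call (illustrative sketch)."""
    args: dict[str, Any] = {
        # A provider prefix such as "bedrock/" is prepended to the model name.
        "model": f"{prefix}{model_name}",
        "messages": history,
        "stream": stream,
        "timeout": timeout,
        **gen_conf,
    }
    if api_key:
        args["api_key"] = api_key
    if base_url:
        args["api_base"] = base_url
    if tools:
        args["tools"] = tools
        args["tool_choice"] = "auto"  # assumed: let the model decide when to call tools
    return args
```

Optional fields are added only when present, so providers that need no explicit key or base URL receive a minimal argument set.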
Important Implementation Details
Retry Strategy:
Implements exponential backoff with randomized delays for handling transient errors like rate limits and server errors.
Error Classification:
Uses regex keyword matching on error messages to classify errors into standard categories, driving retry or fail behavior.
Tool Call Integration:
Supports ReAct-style tool call workflows where the LLM can invoke external tools during chat, with multiple rounds of interaction and history appending.
Streaming Support:
Both streaming and non-streaming chat completions are supported, with incremental yielding of tokens or partial answers.
Dynamic Context Calculation:
The _calculate_dynamic_ctx() method estimates required context window size based on token counts, with special handling for non-ASCII characters.
Provider Abstraction:
Unified interface masks differences between various LLM providers, enabling easy switching or multi-provider support.
Token Counting:
Attempts to extract token usage from provider responses or estimates it based on string length and character types.
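The error classification step described above can be sketched as follows. This is an illustration under assumptions, not the module's code: only a subset of the error codes is shown, the enum is declared with a str mixin as a stand-in for StrEnum, the string values and keyword patterns are invented, and the retryable set simply reflects the statement that rate limits and server errors are treated as transient.

```python
import re
from enum import Enum


class LLMErrorCode(str, Enum):
    # Member names follow the class diagram; the string values are assumptions.
    ERROR_RATE_LIMIT = "RATE_LIMIT_EXCEEDED"
    ERROR_AUTHENTICATION = "AUTH_ERROR"
    ERROR_SERVER = "SERVER_ERROR"
    ERROR_TIMEOUT = "TIMEOUT"
    ERROR_GENERIC = "GENERIC_ERROR"


# Hypothetical keyword patterns, checked in order; the first match wins.
_PATTERNS = [
    (LLMErrorCode.ERROR_RATE_LIMIT, r"rate limit|429|too many requests"),
    (LLMErrorCode.ERROR_AUTHENTICATION, r"unauthorized|authentication|api key|401"),
    (LLMErrorCode.ERROR_TIMEOUT, r"timeout|timed out"),
    (LLMErrorCode.ERROR_SERVER, r"server error|50[0-9]|unavailable"),
]

# Assumed retry policy: only transient categories are retried.
_RETRYABLE = {LLMErrorCode.ERROR_RATE_LIMIT, LLMErrorCode.ERROR_SERVER}


def classify_error(error: Exception) -> LLMErrorCode:
    """Map an exception to an error code by keyword-matching its message."""
    message = str(error).lower()
    for code, pattern in _PATTERNS:
        if re.search(pattern, message):
            return code
    return LLMErrorCode.ERROR_GENERIC


def should_retry(code: LLMErrorCode) -> bool:
    return code in _RETRYABLE
```

Keyword matching is simple and provider-agnostic, but it depends on message wording, so patterns like these need to be broad and ordered from most to least specific.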
Interaction with Other System Components
RAG Pipeline:
Likely used in retrieval-augmented generation frameworks (rag.llm and rag.nlp modules are imported).
Tool Call Sessions:
Integrates with tool call sessions implementing the ToolCallSession protocol, allowing external APIs or functions to be called dynamically by the LLM.
Token Utilities:
Uses num_tokens_from_string utility for token estimation, important for context window management and cost control.
Provider SDKs:
Relies on third-party SDKs such as OpenAI Python SDK, AzureOpenAI, ZhipuAI, litellm, and others for provider communication.
Configuration via Environment Variables:
Supports environment-driven configuration for timeouts, retry counts, and delays.
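A minimal sketch of environment-driven retry configuration combined with exponential backoff is shown below. The variable names LLM_MAX_RETRIES and LLM_BASE_DELAY and their defaults come from the constructor description; the doubling schedule, the jitter range, and the helper names are assumptions, and the real _get_delay() may compute its delay differently.

```python
import os
import random
import time
from typing import Callable, Mapping, TypeVar

T = TypeVar("T")


def load_retry_settings(env: Mapping[str, str] = os.environ) -> tuple[int, float]:
    """Read retry settings from the environment, falling back to documented defaults."""
    max_retries = int(env.get("LLM_MAX_RETRIES", 5))
    base_delay = float(env.get("LLM_BASE_DELAY", 2.0))
    return max_retries, base_delay


def backoff_delay(base_delay: float, attempt: int) -> float:
    # Assumption: the delay doubles on each attempt and is scaled by a random
    # factor between 0.5 and 1.5 so that concurrent clients do not retry in lockstep.
    return base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)


def call_with_retries(
    fn: Callable[[], T],
    max_retries: int,
    base_delay: float,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            sleep(backoff_delay(base_delay, attempt))
    raise RuntimeError("unreachable")  # keeps type checkers satisfied
```

The sleep function is injectable so the retry loop can be exercised in tests without actually waiting.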
Mermaid Class Diagram
classDiagram
class Base {
- client
- model_name: str
- max_retries: int
- base_delay: float
- max_rounds: int
- is_tools: bool
- tools: list
- toolcall_session: ToolCallSession
+ chat(system, history, gen_conf, **kwargs) str, int
+ chat_with_tools(system, history, gen_conf) str, int
+ chat_streamly(system, history, gen_conf, **kwargs) generator
+ chat_streamly_with_tools(system, history, gen_conf) generator
+ bind_tools(toolcall_session, tools)
+ total_token_count(resp) int
# _chat(history, gen_conf, **kwargs) str, int
# _chat_streamly(history, gen_conf, **kwargs) generator
# _exceptions(e, attempt) Optional[str]
# _clean_conf(gen_conf) dict
# _classify_error(error) LLMErrorCode
# _get_delay() float
# _append_history(hist, tool_call, tool_res) list
# _verbose_tool_use(name, args, res) str
# _length_stop(ans) str
# _calculate_dynamic_ctx(history) int
}
class LiteLLMBase {
- timeout: int
- provider: str
- prefix: str
- api_key: str
- base_url: str
- max_retries: int
- base_delay: float
- max_rounds: int
- is_tools: bool
- tools: list
- toolcall_session: ToolCallSession
+ chat(system, history, gen_conf, **kwargs) str, int
+ chat_with_tools(system, history, gen_conf) str, int
+ chat_streamly(system, history, gen_conf, **kwargs) generator
+ chat_streamly_with_tools(system, history, gen_conf) generator
+ bind_tools(toolcall_session, tools)
# _chat(history, gen_conf, **kwargs) str, int
# _chat_streamly(history, gen_conf, **kwargs) generator
# _exceptions(e, attempt) Optional[str]
# _clean_conf(gen_conf) dict
# _classify_error(error) LLMErrorCode
# _get_delay() float
# _append_history(hist, tool_call, tool_res) list
# _verbose_tool_use(name, args, res) str
# _length_stop(ans) str
# _construct_completion_args(history, stream, tools, **kwargs) dict
# _calculate_dynamic_ctx(history) int
}
Base <|-- GptTurbo
Base <|-- AzureChat
Base <|-- BaiChuanChat
Base <|-- ZhipuChat
Base <|-- LocalAIChat
Base <|-- LocalLLM
Base <|-- VolcEngineChat
Base <|-- MiniMaxChat
Base <|-- MistralChat
Base <|-- OpenRouterChat
Base <|-- StepFunChat
Base <|-- LmStudioChat
Base <|-- OpenAI_APIChat
Base <|-- PPIOChat
Base <|-- LeptonAIChat
Base <|-- PerfXCloudChat
Base <|-- UpstageChat
Base <|-- NovitaAIChat
Base <|-- SILICONFLOWChat
Base <|-- YiChat
Base <|-- GiteeChat
Base <|-- ReplicateChat
Base <|-- HunyuanChat
Base <|-- SparkChat
Base <|-- BaiduYiyanChat
Base <|-- GoogleChat
Base <|-- GPUStackChat
Base <|-- Ai302Chat
Base <|-- TokenPonyChat
Base <|-- MeituanChat
%% LiteLLMBase subclasses (various LiteLLM implementations, if any) are not listed here
class LLMErrorCode {
<<enumeration>>
+ ERROR_RATE_LIMIT
+ ERROR_AUTHENTICATION
+ ERROR_INVALID_REQUEST
+ ERROR_SERVER
+ ERROR_TIMEOUT
+ ERROR_CONNECTION
+ ERROR_MODEL
+ ERROR_MAX_ROUNDS
+ ERROR_CONTENT_FILTER
+ ERROR_QUOTA
+ ERROR_MAX_RETRIES
+ ERROR_GENERIC
}
Summary
chat_model.py is a critical abstraction layer that encapsulates multiple LLM providers under a common interface, managing the complexity of different APIs, error handling, streaming, and tool interactions. It is designed for flexibility and robustness in production AI chat applications, enabling seamless integration of new LLM providers and advanced features like tool-assisted conversations.
This module is essential for systems that require:
Multi-provider LLM support.
Reliable chat completion with retries and error classification.
Streaming responses for responsive user experiences.
Tool integration for enhanced capabilities.
Token management for cost and context control.