llm_service.py
Overview
The llm_service.py module is a core component of the InfiniFlow system responsible for managing and interfacing with Large Language Models (LLMs) tailored for tenant-specific usage. It provides services to initialize tenant LLM configurations, wrap LLM models with extended functionalities, handle token usage tracking, and facilitate various LLM-powered capabilities such as encoding, similarity scoring, image description, transcription, text-to-speech (TTS), and conversational chat.
This file acts as a bridge between raw LLM models and the tenant-aware application logic, ensuring proper usage accounting, support for model tools, multi-modal inputs, and integration with telemetry and tracing systems like Langfuse.
Classes and Functions
Class: LLMService
Inheritance: Inherits from
CommonServicePurpose: Provides database service operations related to the
LLMmodel.Attributes:
model: Assigned toLLMdatabase model.
Usage:LLMService is a simple service class primarily used for querying and manipulating LLM records from the database. It abstracts common DB operations for LLM entities.
Function: get_init_tenant_llm(user_id)
Purpose: Initialize and retrieve a list of tenant-specific LLM configurations based on global factory settings and built-in embedding models.
Parameters:
user_id(int or str): Unique identifier of the tenant/user.
Returns: List of dictionaries representing unique LLM configurations per tenant. Each dictionary contains:
tenant_id: Tenant identifierllm_factory: Factory name string that provides the LLMllm_name: Name of the LLM modelmodel_type: Type of model (e.g., embedding)api_key: API key string for the factoryapi_base: Base URL for API endpointmax_tokens: Maximum tokens allowed for usage
Implementation Details:
Collects unique factory configurations from several global settings (
CHAT_CFG,EMBEDDING_CFG, etc.).Queries the
LLMServicefor all LLMs associated with each factory.Adds built-in embedding models conditionally based on the LIGHTEN flag.
Ensures uniqueness of tenant-LLM entries by (tenant_id, llm_factory, llm_name) key.
Usage Example:
tenant_llms = get_init_tenant_llm(user_id="tenant123")
for llm_conf in tenant_llms:
print(llm_conf["llm_name"], llm_conf["llm_factory"])
Class: LLMBundle
Inheritance: Inherits from
LLM4TenantPurpose: Encapsulates a tenant-specific LLM instance with extended functionalities such as tool binding, encoding, chatting, streaming, and multi-modal processing.
Initialization:
LLMBundle(tenant_id, llm_type, llm_name=None, lang="Chinese", **kwargs)tenant_id(str/int): Tenant identifierllm_type(str): Type of LLM (e.g., "chat", "embedding")llm_name(str, optional): Specific model namelang(str): Language preference, default is"Chinese"**kwargs: Additional parameters passed to parent class
Methods
bind_tools(toolcall_session, tools)
Purpose: Bind tool call capabilities to the underlying model if supported.
Parameters:
toolcall_session: Session/context for tool invocationtools(list): List of tool definitions or instances
Behavior: Logs a warning if tools are assigned but the model does not support tool calls.
encode(texts: list) -> Tuple[List[float], int]
Purpose: Generate embeddings for a list of input texts.
Parameters:
texts(list[str]): Text strings to encode
Returns: Tuple containing:
embeddings(list of floats): Generated embedding vectorsused_tokens(int): Number of tokens consumed during encoding
Special: Tracks token usage via
TenantLLMService.
encode_queries(query: str) -> Tuple[List[float], int]
Similar to
encodebut optimized for query-style embeddings.Parameters and return values follow the same pattern.
similarity(query: str, texts: list) -> Tuple[List[float], int]
Purpose: Compute similarity scores between a query and a list of texts.
Parameters:
query(str): Query stringtexts(list[str]): List of candidate texts
Returns: Tuple:
sim(list[float]): Similarity scoresused_tokens(int): Tokens consumed
Tracks token usage specifically for reranking tasks.
describe(image, max_tokens=300) -> str
Purpose: Generate a textual description for an input image.
Parameters:
image: Image object or data streammax_tokens(int): Maximum tokens allowed for description (default 300)
Returns: Description string.
describe_with_prompt(image, prompt) -> str
Same as
describebut conditions description on a specific textual prompt.
transcription(audio) -> str
Purpose: Transcribe audio input into text.
Parameters:
audio: Audio data or stream
Returns: Transcribed text string.
tts(text: str) -> Generator[bytes, None, None]
Purpose: Generate streaming audio (TTS) bytes from input text.
Parameters:
text(str): Text to convert to speech
Returns: Generator yielding audio byte chunks.
Tracks token usage during streaming.
_remove_reasoning_content(txt: str) -> str
Purpose: Removes XML-like
<think>reasoning tags from output text.Parameters:
txt(str): Input text possibly containing<think>tags
Returns: Cleaned text without reasoning content.
_clean_param(chat_partial, **kwargs) -> dict
Purpose: Cleans and filters keyword arguments to match the signature of the provided partial function.
Parameters:
chat_partial:functools.partialfunction wrapping a chat method**kwargs: Arbitrary keyword arguments to filter
Returns: Filtered dictionary of keyword arguments compatible with the function signature.
chat(system: str, history: list, gen_conf: dict = {}, **kwargs) -> str
Purpose: Perform a full chat interaction, returning the final text response.
Parameters:
system(str): System prompt or instructionhistory(list): Conversation history as list of messagesgen_conf(dict): Generation configuration options**kwargs: Additional parameters forwarded to the chat model
Returns: Generated chat response string.
chat_streamly(system: str, history: list, gen_conf: dict = {}, **kwargs)
Purpose: Stream chat responses incrementally as they are generated.
Parameters: Same as
chat.Returns: Generator yielding partial chat outputs (strings) progressively.
Tracks token usage upon completion.
Important Implementation Details
Token Usage Tracking:
Nearly all LLM interaction methods update token usage statistics viaTenantLLMService.increase_usage. This is critical for tenant quota enforcement and billing.Langfuse Integration:
Optional telemetry and tracing vialangfuseis wrapped around generation methods to capture input/output and usage details.Tool Call Support:
Methods check for tool call capabilities and bind external tools dynamically if supported by the model, enhancing extensibility.Partial Function Parameter Cleaning:
The_clean_paramstatic method introspects the wrapped chat methods to filter extra keyword arguments, ensuring only supported parameters are passed.Reasoning Content Filtering:
The system optionally removes internal reasoning segments embedded in<think>tags and tool call XML tags from outputs to present clean user-facing text.Streaming Chat:
chat_streamlymethod yields incremental responses, handling token counting and stream termination gracefully.
Interaction with Other System Components
Database Models:
UtilizesLLMdatabase model and services (CommonService) for persistent LLM configuration and metadata.Tenant LLM Management:
ExtendsLLM4Tenantand leveragesTenantLLMServicefor tenant-specific model instantiation and usage tracking.Global Settings:
Reads LLM factory configurations fromapi.settingsfor factory and API key management.Telemetry:
Integrates with Langfuse for monitoring and contextual tracing of LLM usage.Tool Invocation System:
Interfaces with external tools callable by models, supporting chained or augmented LLM functionalities.
Visual Diagram
classDiagram
class LLMService {
+model: LLM
}
class LLMBundle {
-tenant_id
-llm_type
-llm_name
-lang
-mdl
-langfuse
+__init__(tenant_id, llm_type, llm_name=None, lang="Chinese", **kwargs)
+bind_tools(toolcall_session, tools)
+encode(texts: list) -> (list, int)
+encode_queries(query: str) -> (list, int)
+similarity(query: str, texts: list) -> (list, int)
+describe(image, max_tokens=300) -> str
+describe_with_prompt(image, prompt) -> str
+transcription(audio) -> str
+tts(text: str) -> Generator[bytes, None, None]
-_remove_reasoning_content(txt: str) -> str
-_clean_param(chat_partial, **kwargs) -> dict
+chat(system: str, history: list, gen_conf: dict = {}, **kwargs) -> str
+chat_streamly(system: str, history: list, gen_conf: dict = {}, **kwargs)
}
LLMService <|-- LLMBundle
LLMBundle ..> TenantLLMService : uses
LLMBundle ..> LLM4Tenant : inherits
Summary
llm_service.py is a pivotal module that facilitates tenant-aware LLM management within InfiniFlow. It bridges database models, tenant services, and LLM models, wrapping them in a rich API that supports embedding, similarity, multi-modal inputs, transcription, TTS, and interactive chat—with careful tracking of usage and telemetry integration. Its design emphasizes extensibility (tool support), robustness (parameter validation), and observability (Langfuse), making it a cornerstone for scalable and auditable LLM-powered applications in a multi-tenant environment.