llm_service.py

Overview

The llm_service.py module is a core component of the InfiniFlow system responsible for managing and interfacing with Large Language Models (LLMs) tailored for tenant-specific usage. It provides services to initialize tenant LLM configurations, wrap LLM models with extended functionalities, handle token usage tracking, and facilitate various LLM-powered capabilities such as encoding, similarity scoring, image description, transcription, text-to-speech (TTS), and conversational chat.

This file acts as a bridge between raw LLM models and the tenant-aware application logic, ensuring proper usage accounting, support for model tools, multi-modal inputs, and integration with telemetry and tracing systems like Langfuse.

Classes and Functions

Class: `LLMService`

Inheritance: Inherits from CommonService
Purpose: Provides database service operations related to the LLM model.
Attributes:
- model: Assigned to LLM database model.

Usage:
LLMService is a simple service class primarily used for querying and manipulating LLM records from the database. It abstracts common DB operations for LLM entities.

Function: `get_init_tenant_llm(user_id)`

Purpose: Initialize and retrieve a list of tenant-specific LLM configurations based on global factory settings and built-in embedding models.
Parameters:
- user_id (int or str): Unique identifier of the tenant/user.
Returns: List of dictionaries representing unique LLM configurations per tenant. Each dictionary contains:
- tenant_id: Tenant identifier
- llm_factory: Factory name string that provides the LLM
- llm_name: Name of the LLM model
- model_type: Type of model (e.g., embedding)
- api_key: API key string for the factory
- api_base: Base URL for API endpoint
- max_tokens: Maximum tokens allowed for usage

Implementation Details:

Collects unique factory configurations from several global settings (CHAT_CFG, EMBEDDING_CFG, etc.).
Queries the LLMService for all LLMs associated with each factory.
Adds built-in embedding models conditionally based on the LIGHTEN flag.
Ensures uniqueness of tenant-LLM entries by (tenant_id, llm_factory, llm_name) key.

Usage Example:

tenant_llms = get_init_tenant_llm(user_id="tenant123")
for llm_conf in tenant_llms:
    print(llm_conf["llm_name"], llm_conf["llm_factory"])
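
The uniqueness step listed under Implementation Details can be sketched as below. This is a minimal illustration, not the module's actual code: `dedupe_tenant_llms` is a hypothetical helper, and the factory and model names in the sample records are made up.

```python
def dedupe_tenant_llms(records):
    """Keep the first record seen for each (tenant_id, llm_factory, llm_name) key."""
    seen = set()
    unique = []
    for rec in records:
        key = (rec["tenant_id"], rec["llm_factory"], rec["llm_name"])
        if key in seen:
            continue  # a record with the same key was already kept
        seen.add(key)
        unique.append(rec)
    return unique


# Sample input with one exact duplicate (all values are illustrative).
records = [
    {"tenant_id": "tenant123", "llm_factory": "FactoryA", "llm_name": "model-x", "model_type": "chat"},
    {"tenant_id": "tenant123", "llm_factory": "FactoryA", "llm_name": "model-x", "model_type": "chat"},
    {"tenant_id": "tenant123", "llm_factory": "FactoryB", "llm_name": "model-y", "model_type": "embedding"},
]
unique_records = dedupe_tenant_llms(records)
```

Keeping the first occurrence preserves the order in which factories were collected; whether the real function keeps the first or the last duplicate is not stated here.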

Class: `LLMBundle`

Inheritance: Inherits from LLM4Tenant
Purpose: Encapsulates a tenant-specific LLM instance with extended functionalities such as tool binding, encoding, chatting, streaming, and multi-modal processing.
Initialization:
```
LLMBundle(tenant_id, llm_type, llm_name=None, lang="Chinese", **kwargs)
```
- tenant_id (str/int): Tenant identifier
- llm_type (str): Type of LLM (e.g., "chat", "embedding")
- llm_name (str, optional): Specific model name
- lang (str): Language preference, default is "Chinese"
- **kwargs: Additional parameters passed to parent class

Methods

`bind_tools(toolcall_session, tools)`

Purpose: Bind tool call capabilities to the underlying model if supported.
Parameters:
- toolcall_session: Session/context for tool invocation
- tools (list): List of tool definitions or instances
Behavior: Logs a warning if tools are assigned but the model does not support tool calls.

`encode(texts: list) -> Tuple[List[float], int]`

Purpose: Generate embeddings for a list of input texts.
Parameters:
- texts (list[str]): Text strings to encode
Returns: Tuple containing:
- embeddings: Generated embedding vectors, one per input text (each vector is a sequence of floats)
- used_tokens (int): Number of tokens consumed during encoding
Special: Tracks token usage via TenantLLMService.

`encode_queries(query: str) -> Tuple[List[float], int]`

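Purpose: Same as encode, but takes a single query string and produces a query-style embedding.
Parameters:
- query (str): Query string to encode
Returns: Tuple of the query embedding and the number of tokens consumed, as in encode.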

`similarity(query: str, texts: list) -> Tuple[List[float], int]`

Purpose: Compute similarity scores between a query and a list of texts.
Parameters:
- query (str): Query string
- texts (list[str]): List of candidate texts
Returns: Tuple:
- sim (list[float]): Similarity scores
- used_tokens (int): Tokens consumed
Tracks token usage specifically for reranking tasks.

`describe(image, max_tokens=300) -> str`

Purpose: Generate a textual description for an input image.
Parameters:
- image: Image object or data stream
- max_tokens (int): Maximum tokens allowed for description (default 300)
Returns: Description string.

`describe_with_prompt(image, prompt) -> str`

Same as describe but conditions description on a specific textual prompt.

`transcription(audio) -> str`

Purpose: Transcribe audio input into text.
Parameters:
- audio: Audio data or stream
Returns: Transcribed text string.

`tts(text: str) -> Generator[bytes, None, None]`

Purpose: Generate streaming audio (TTS) bytes from input text.
Parameters:
- text (str): Text to convert to speech
Returns: Generator yielding audio byte chunks.
Tracks token usage during streaming.
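
Because tts returns a generator, the caller decides whether to forward chunks as they arrive or to buffer them. The sketch below buffers a stream into a single bytes object. `fake_tts` is a stand-in generator invented for this example (it emits text bytes, not real audio), and `collect_audio` is a hypothetical helper.

```python
import io


def fake_tts(text):
    """Stand-in for a TTS stream: yields one bytes chunk per word."""
    for word in text.split():
        yield word.encode("utf-8")


def collect_audio(chunks):
    """Drain a generator of byte chunks into one bytes object."""
    buffer = io.BytesIO()
    for chunk in chunks:
        buffer.write(chunk)
    return buffer.getvalue()


audio = collect_audio(fake_tts("hello streaming world"))
```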

`_remove_reasoning_content(txt: str) -> str`

Purpose: Removes XML-like <think> reasoning tags from output text.
Parameters:
- txt (str): Input text possibly containing <think> tags
Returns: Cleaned text without reasoning content.
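
One way to strip such tags is a non-greedy regular expression, sketched below. This is an assumption about the approach rather than the module's implementation, which may handle unclosed or nested tags differently; `remove_reasoning_content` here is a standalone illustration.

```python
import re

# Non-greedy match so each <think>...</think> block is removed separately;
# DOTALL lets "." span newlines inside the reasoning block.
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def remove_reasoning_content(txt: str) -> str:
    """Drop every closed <think>...</think> block and trim surrounding whitespace."""
    return THINK_BLOCK.sub("", txt).strip()


cleaned = remove_reasoning_content("<think>step 1\nstep 2</think>The answer is 4.")
```

Text with no closing tag is left unchanged by this pattern, so a truncated reasoning block would still reach the caller.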

`_clean_param(chat_partial, **kwargs) -> dict`

Purpose: Cleans and filters keyword arguments to match the signature of the provided partial function.
Parameters:
- chat_partial: functools.partial function wrapping a chat method
- **kwargs: Arbitrary keyword arguments to filter
Returns: Filtered dictionary of keyword arguments compatible with the function signature.
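
The filtering idea can be sketched with `inspect.signature` applied to the function wrapped by the partial. This is an illustrative version under stated assumptions: `clean_param` and `fake_chat` are hypothetical names, and the real method may apply different rules (for example, around arguments the partial has already bound).

```python
import inspect
from functools import partial


def clean_param(chat_partial, **kwargs):
    """Return only the kwargs that the wrapped function can accept by keyword."""
    params = list(inspect.signature(chat_partial.func).parameters.values())
    # A **kwargs parameter on the target accepts anything, so nothing is dropped.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params):
        return dict(kwargs)
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    allowed = {p.name for p in params if p.kind in keyword_kinds}
    return {k: v for k, v in kwargs.items() if k in allowed}


def fake_chat(system, history, gen_conf, images=None):
    """Stand-in chat function used only for this example."""
    return f"{system}:{len(history)}:{images}"


chat_partial = partial(fake_chat, "You are helpful.", [], {})
filtered = clean_param(chat_partial, images=["a.png"], unknown_flag=True)
result = chat_partial(**filtered)  # unknown_flag was dropped, so the call succeeds
```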

`chat(system: str, history: list, gen_conf: dict = {}, **kwargs) -> str`

Purpose: Perform a full chat interaction, returning the final text response.
Parameters:
- system (str): System prompt or instruction
- history (list): Conversation history as list of messages
- gen_conf (dict): Generation configuration options
- **kwargs: Additional parameters forwarded to the chat model
Returns: Generated chat response string.

`chat_streamly(system: str, history: list, gen_conf: dict = {}, **kwargs)`

Purpose: Stream chat responses incrementally as they are generated.
Parameters: Same as chat.
Returns: Generator yielding partial chat outputs (strings) progressively.
Tracks token usage upon completion.
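
The streaming pattern can be sketched as a generator that yields text as it accumulates and records usage once the stream ends. The sketch assumes the underlying stream emits string deltas followed by a final integer token count, and that each yielded value is the cumulative text so far; neither detail is confirmed by this document. `stream_with_usage` and `record_usage` are hypothetical names.

```python
from typing import Callable, Iterable, Iterator, Union


def stream_with_usage(
    chunks: Iterable[Union[str, int]],
    record_usage: Callable[[int], None],
) -> Iterator[str]:
    """Yield the accumulated answer after each text delta; report tokens at the end."""
    answer = ""
    for chunk in chunks:
        if isinstance(chunk, int):
            # Assumed convention: a trailing integer carries the total token count.
            record_usage(chunk)
            return
        answer += chunk
        yield answer


usage_log = []
fake_stream = ["Hel", "lo", "!", 7]
partials = list(stream_with_usage(fake_stream, usage_log.append))
```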

Important Implementation Details

Token Usage Tracking:
Nearly all LLM interaction methods update token usage statistics via TenantLLMService.increase_usage. This is critical for tenant quota enforcement and billing.
Langfuse Integration:
Optional telemetry and tracing via langfuse is wrapped around generation methods to capture input/output and usage details.
Tool Call Support:
Methods check for tool call capabilities and bind external tools dynamically if supported by the model, enhancing extensibility.
Partial Function Parameter Cleaning:
The _clean_param static method introspects the wrapped chat methods to filter extra keyword arguments, ensuring only supported parameters are passed.
Reasoning Content Filtering:
The system optionally removes internal reasoning segments embedded in <think> tags and tool call XML tags from outputs to present clean user-facing text.
Streaming Chat:
chat_streamly method yields incremental responses, handling token counting and stream termination gracefully.
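
The usage-tracking pattern described above can be illustrated with a small wrapper that calls a model function and then reports the tokens it consumed. `InMemoryUsage` is a hypothetical stand-in for TenantLLMService, whose real `increase_usage` signature may differ, and `fake_embed` is a toy model that counts whitespace-separated words as tokens.

```python
import logging


class InMemoryUsage:
    """Hypothetical stand-in for a usage service; keeps per-tenant token totals."""

    def __init__(self):
        self.totals = {}

    def increase_usage(self, tenant_id, llm_type, used_tokens):
        key = (tenant_id, llm_type)
        self.totals[key] = self.totals.get(key, 0) + used_tokens
        return True


def encode_with_tracking(tenant_id, texts, embed_fn, usage):
    """Run the embedding function, then record the tokens it reported."""
    embeddings, used_tokens = embed_fn(texts)
    if not usage.increase_usage(tenant_id, "embedding", used_tokens):
        logging.error("Failed to update token usage for tenant %s", tenant_id)
    return embeddings, used_tokens


def fake_embed(texts):
    """Toy model: one 1-dimensional vector per text, one token per word."""
    vectors = [[float(len(t))] for t in texts]
    tokens = sum(len(t.split()) for t in texts)
    return vectors, tokens


usage = InMemoryUsage()
embeddings, used = encode_with_tracking("tenant123", ["hello world", "hi"], fake_embed, usage)
```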

Interaction with Other System Components

Database Models:
Utilizes LLM database model and services (CommonService) for persistent LLM configuration and metadata.
Tenant LLM Management:
Extends LLM4Tenant and leverages TenantLLMService for tenant-specific model instantiation and usage tracking.
Global Settings:
Reads LLM factory configurations from api.settings for factory and API key management.
Telemetry:
Integrates with Langfuse for monitoring and contextual tracing of LLM usage.
Tool Invocation System:
Interfaces with external tools callable by models, supporting chained or augmented LLM functionalities.

Visual Diagram

classDiagram
    class LLMService {
        +model: LLM
    }

    class LLMBundle {
        -tenant_id
        -llm_type
        -llm_name
        -lang
        -mdl
        -langfuse
        +__init__(tenant_id, llm_type, llm_name=None, lang="Chinese", **kwargs)
        +bind_tools(toolcall_session, tools)
        +encode(texts: list) -> (list, int)
        +encode_queries(query: str) -> (list, int)
        +similarity(query: str, texts: list) -> (list, int)
        +describe(image, max_tokens=300) -> str
        +describe_with_prompt(image, prompt) -> str
        +transcription(audio) -> str
        +tts(text: str) -> Generator[bytes, None, None]
        -_remove_reasoning_content(txt: str) -> str
        -_clean_param(chat_partial, **kwargs) -> dict
        +chat(system: str, history: list, gen_conf: dict = {}, **kwargs) -> str
        +chat_streamly(system: str, history: list, gen_conf: dict = {}, **kwargs)
    }

    CommonService <|-- LLMService
    LLM4Tenant <|-- LLMBundle
    LLMBundle ..> TenantLLMService : uses

Summary

llm_service.py provides tenant-aware LLM management within InfiniFlow. It connects database models, tenant services, and LLM models behind a single API that supports embedding, similarity scoring, multi-modal inputs, transcription, TTS, and interactive chat, with token usage tracking and telemetry integration throughout. Its design emphasizes extensibility (tool support), robustness (keyword-argument filtering), and observability (Langfuse) for LLM-powered applications in a multi-tenant environment.
