tts_model.py
Overview
tts_model.py is a core module in the InfiniFlow project that provides a unified interface and multiple backend implementations for Text-to-Speech (TTS) synthesis services. It abstracts various TTS providers' APIs into consistent Python classes, enabling the generation of audio streams from input text across different cloud services and models.
The file defines:
Data models for TTS requests and reference audios.
An abstract base class, Base, for TTS providers.
Multiple concrete TTS provider classes implementing different APIs, including Fish Audio, Tongyi-Qianwen, OpenAI, XunFei Spark, Xinference, Ollama, GPUStack, SiliconFlow, and DeepInfra.
Each provider class manages authentication, request formatting, streaming audio response handling, and error management.
This modular design facilitates easy extension for new TTS providers and seamless integration into applications requiring flexible TTS capabilities.
Classes and Functions
Data Models
ServeReferenceAudio(BaseModel)
Purpose: Represents a reference audio sample used for in-context learning during TTS synthesis.
Properties:
audio: bytes - Raw audio data bytes.
text: str - Text transcription corresponding to the audio.
ServeTTSRequest(BaseModel)
Purpose: Defines the structure of a TTS request payload.
Properties:
text: str - The text to be converted to speech.
chunk_length: int - Length of each audio chunk in milliseconds (between 100 and 300, default 200).
format: Literal["wav", "pcm", "mp3"] - Desired audio format. Default: "mp3".
mp3_bitrate: Literal[64, 128, 192] - Bitrate for MP3 audio. Default: 128.
references: list[ServeReferenceAudio] - List of reference audios for in-context learning.
reference_id: Optional[str] - Optional reference ID (e.g., from an external service).
normalize: bool - Whether to normalize text input. Default: True.
latency: Literal["normal", "balanced"] - Latency mode affecting speed and stability. Default: "normal".
Abstract Base Class
Base(ABC)
Purpose: Abstract base class defining minimal interface and utility for TTS provider implementations.
Constructor Parameters:
key - API key or credentials for the provider.
model_name - Model identifier.
base_url - Base URL for API endpoint.
Methods:
tts(audio) - Abstract placeholder for the TTS synthesis method.
normalize_text(text: str) -> str - Cleans input text by removing special markup tokens (**, ##\d+$$, and #) to improve stability.
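A minimal sketch of the cleaning step, assuming the three tokens listed above are removed with a single regular expression. The pattern is reconstructed from that token list and may not match the file's exact expression.

```python
import re

# Pattern built from the three documented tokens: "**", "##<digits>$$", "#".
# The longer "##<digits>$$" alternative is tried before the bare "#".
_MARKUP = re.compile(r"\*\*|##\d+\$\$|#")


def normalize_text(text: str) -> str:
    """Remove the markup tokens and leave all other characters untouched."""
    return _MARKUP.sub("", text)


cleaned = normalize_text("**Bold** claim ##12$$ and a # heading")
```

Because only the tokens themselves are removed, surrounding whitespace is not collapsed, so the cleaned string keeps a double space wherever a token stood between two spaces.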
Concrete TTS Provider Classes
Each provider class implements a .tts() method that accepts text input and returns a generator yielding audio chunks as bytes. Most classes also handle authentication, request formatting, streaming response parsing, and error checking.
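The shared contract can be illustrated with a toy provider. This is a sketch under assumptions: BaseSketch and EchoTTS are hypothetical names, no network call is made, and a whitespace word count stands in for the token count that some providers yield last.

```python
from abc import ABC, abstractmethod
from typing import Iterator, Union


class BaseSketch(ABC):
    """Hypothetical stand-in for Base; stores the constructor arguments."""

    def __init__(self, key: str, model_name: str, base_url: str = "") -> None:
        self.key = key
        self.model_name = model_name
        self.base_url = base_url

    @abstractmethod
    def tts(self, text: str) -> Iterator[Union[bytes, int]]:
        """Yield audio chunks as bytes."""


class EchoTTS(BaseSketch):
    """Fake provider that yields the UTF-8 text in 4-byte chunks."""

    def tts(self, text: str) -> Iterator[Union[bytes, int]]:
        data = text.encode("utf-8")
        for start in range(0, len(data), 4):
            yield data[start:start + 4]
        # Stand-in for the trailing token count: a whitespace word count.
        yield len(text.split())


items = list(EchoTTS("fake-key", "fake-model").tts("Hello world"))
```

Because tts() is a generator, no work happens until the caller starts iterating, which is what lets callers forward audio as it arrives.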
FishAudioTTS(Base)
Factory Name: "Fish Audio"
Initialization:
Accepts a JSON-encoded key containing fish_audio_ak (API key) and fish_audio_refid (reference audio ID).
Sets the default base URL to https://api.fish.audio/v1/tts.
Method tts(text: str):
Sends a POST request with ServeTTSRequest data serialized using ormsgpack.
Streams audio bytes from the response.
Yields chunks of audio bytes followed by the token count of the input text.
Error Handling: Raises RuntimeError on HTTP errors.
Example Usage:
fish_tts = FishAudioTTS(key_json, model_name="fish-model")
for chunk in fish_tts.tts("Hello world"):
    # The last yielded item is the token count (int), not audio bytes.
    if isinstance(chunk, bytes):
        process_audio(chunk)
QwenTTS(Base)
Factory Name: "Tongyi-Qianwen"
Initialization:
Sets the API key via dashscope.api_key.
Requires the dashscope dependency.
Method tts(text: str):
Uses the dashscope SDK's SpeechSynthesizer with a custom Callback class that collects audio frames in a deque.
Streams audio frames as they arrive.
Yields audio frame chunks followed by the token count.
Error Handling: Converts SDK errors to RuntimeError.
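The callback-plus-deque idea can be sketched without the SDK. FrameCollector and its method names are hypothetical and do not mirror the dashscope callback interface. The sketch also buffers every frame before draining, whereas the real flow receives frames while the generator is running.

```python
from collections import deque
from typing import Deque, Iterator, Optional


class FrameCollector:
    """Hypothetical callback object; method names are illustrative only."""

    def __init__(self) -> None:
        self.frames: Deque[Optional[bytes]] = deque()

    def on_frame(self, data: bytes) -> None:
        # Called once per audio frame delivered by the synthesizer.
        self.frames.append(data)

    def on_complete(self) -> None:
        # None marks the end of the stream.
        self.frames.append(None)

    def drain(self) -> Iterator[bytes]:
        """Yield buffered frames in arrival order until the end marker."""
        while self.frames:
            frame = self.frames.popleft()
            if frame is None:
                return
            yield frame


collector = FrameCollector()
for part in (b"\x01\x02", b"\x03"):
    collector.on_frame(part)
collector.on_complete()
frames = list(collector.drain())
```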
OpenAITTS(Base)
Factory Name: "OpenAI"
Initialization:
Default base URL: https://api.openai.com/v1.
Stores the API key and model name.
Method tts(text: str, voice="alloy"):
Sends a POST request to /audio/speech with a JSON payload.
Streams response content chunks.
Error Handling: Raises Exception on a non-200 status.
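As a rough illustration of the request described above, the sketch below only assembles the URL, headers, and JSON body and sends nothing. The helper name build_speech_request is hypothetical. The model, input, and voice fields follow the public OpenAI speech API, and it is an assumption that the class sends exactly this payload.

```python
import json
from typing import Dict, Tuple


def build_speech_request(
    api_key: str,
    model_name: str,
    text: str,
    voice: str = "alloy",
    base_url: str = "https://api.openai.com/v1",
) -> Tuple[str, Dict[str, str], bytes]:
    """Return (url, headers, body) for a POST to the speech endpoint."""
    url = base_url.rstrip("/") + "/audio/speech"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"model": model_name, "input": text, "voice": voice})
    return url, headers, body.encode("utf-8")


url, headers, body = build_speech_request("sk-placeholder", "tts-1", "Hello")
```

An HTTP client would then post the body with streaming enabled and treat any non-200 status as an error, as the section notes.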
SparkTTS(Base)
Factory Name: "XunFei Spark"
Initialization:
Expects a JSON key with spark_app_id, spark_api_secret, and spark_api_key.
Prepares common arguments for API calls.
Method create_url():
Creates a WebSocket URL with an HMAC-SHA256 signature for authentication.
Method tts(text: str):
Uses a WebSocket to send the TTS request and receive audio frames.
Implements WebSocket callbacks for message handling, error, open, and close.
Audio frames are base64-decoded and put into a thread-safe queue.
Yields audio chunks from the queue until the end signal.
Error Handling: Raises exceptions on errors or invalid credentials.
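The URL signing step can be sketched as follows. The layout of the signed string (host, date, request line) and the query parameter names are assumptions modeled on a common HMAC-signed WebSocket scheme, and the endpoint is a placeholder; the file's create_url() may differ in detail.

```python
import base64
import hashlib
import hmac
from urllib.parse import urlencode, urlparse


def create_signed_url(ws_url: str, api_key: str, api_secret: str, date: str) -> str:
    """Append HMAC-SHA256 auth parameters to a WebSocket URL."""
    parsed = urlparse(ws_url)
    # The string to sign covers the host, the date, and the request line.
    to_sign = f"host: {parsed.netloc}\ndate: {date}\nGET {parsed.path} HTTP/1.1"
    digest = hmac.new(api_secret.encode(), to_sign.encode(), hashlib.sha256).digest()
    signature = base64.b64encode(digest).decode()
    authorization = (
        f'api_key="{api_key}", algorithm="hmac-sha256", '
        f'headers="host date request-line", signature="{signature}"'
    )
    params = {
        "authorization": base64.b64encode(authorization.encode()).decode(),
        "date": date,
        "host": parsed.netloc,
    }
    return f"{ws_url}?{urlencode(params)}"


signed = create_signed_url(
    "wss://tts.example.invalid/v2/tts",  # placeholder endpoint
    api_key="demo-key",
    api_secret="demo-secret",
    date="Mon, 01 Jan 2024 00:00:00 GMT",
)
```

The date is passed in explicitly here so the result is reproducible; a real client would use the current time in HTTP date format.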
XinferenceTTS(Base)
Factory Name: "Xinference"
Initialization: Accepts the base URL via kwargs.
Method tts(text: str, voice="中文女", stream=True):
Sends a POST request to /v1/audio/speech.
Streams audio data chunks.
Error Handling: Raises exceptions on API errors.
OllamaTTS(Base)
Initialization:
Optional API key for Bearer auth.
Default base URL: https://api.ollama.ai/v1.
Method tts(text: str, voice="standard-voice"):
Sends a POST request to /audio/tts.
Streams response audio content.
Error Handling: Raises exceptions on errors.
GPUStackTTS(Base)
Factory Name: "GPUStack"
Initialization: Accepts the base URL via kwargs; requires an API key.
Method tts(text: str, voice="Chinese Female", stream=True):
Sends a POST request to /v1/audio/speech.
Streams audio chunks.
Error Handling: Raises exceptions on errors.
SILICONFLOWTTS(Base)
Factory Name: "SILICONFLOW"
Initialization:
Default model: "FunAudioLLM/CosyVoice2-0.5B"
Default base URL: https://api.siliconflow.cn/v1
Method tts(text: str, voice="anna"):
Sends a POST request to /audio/speech with a detailed payload including speed, gain, and sample rate.
Streams the MP3 audio response.
Error Handling: Raises exceptions on errors.
DeepInfraTTS(OpenAITTS)
Factory Name: "DeepInfra"
Inheritance: Extends OpenAITTS.
Initialization:
Default base URL: https://api.deepinfra.com/v1/openai.
Uses an OpenAI-compatible API.
Functionality: Same as OpenAITTS, but targeting the DeepInfra endpoint.
Important Implementation Details
Streaming Audio: All TTS classes use streaming HTTP responses or WebSocket streams to yield audio chunks progressively rather than waiting for the full audio file.
Text Normalization: The normalize_text method is used to clean input text, removing markdown-like tokens that could affect TTS quality.
Authentication: Providers use various auth mechanisms including API keys in headers, HMAC signatures (SparkTTS), and bearer tokens.
Error Handling: HTTP and API errors are captured and raised as Python exceptions with meaningful messages.
Dependencies: Some providers require external SDKs (dashscope for QwenTTS) or third-party libraries (httpx, websocket-client, requests, ormsgpack).
Concurrency: SparkTTS employs threading and queues to manage asynchronous WebSocket audio frames.
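The threading-and-queue pattern mentioned under Concurrency can be sketched with the standard library alone. A plain producer thread stands in for the WebSocket callbacks, and using None as the end-of-stream signal is an assumption made for illustration.

```python
import queue
import threading
from typing import Iterator, List, Optional


def stream_from_worker(frames: List[bytes]) -> Iterator[bytes]:
    """Yield frames produced on a background thread, in order."""
    audio_queue: "queue.Queue[Optional[bytes]]" = queue.Queue()

    def producer() -> None:
        # Stands in for WebSocket message callbacks running off the main thread.
        for frame in frames:
            audio_queue.put(frame)
        audio_queue.put(None)  # end-of-stream signal

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()
    while True:
        frame = audio_queue.get(timeout=5)  # blocks until a frame arrives
        if frame is None:
            break
        yield frame
    worker.join()


received = list(stream_from_worker([b"a", b"bc", b"def"]))
```

The queue decouples the producer from the consumer, so the generator can block on get() while frames arrive at their own pace.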
Integration and Interaction
This file serves as a core abstraction layer for TTS services in the InfiniFlow system.
Other components can instantiate specific TTS classes based on user configuration or environment.
The consistent .tts() interface allows calling code to stream audio data transparently regardless of backend.
The ServeTTSRequest and ServeReferenceAudio models standardize request payloads for providers that support advanced features like reference audios or normalization.
Utility functions like num_tokens_from_string (imported externally) are used to provide token counts alongside audio streaming for analytics or cost estimation.
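One way calling code might select a backend from configuration is a name-to-class registry keyed by the factory names listed in this document. This is a hypothetical sketch: REGISTRY, make_tts, and the *Sketch classes are illustrative, and how the project actually resolves factory names is not shown in this file.

```python
from typing import Dict, Iterator, Type


class ProviderSketch:
    """Hypothetical minimal provider; real classes take more arguments."""
    tag = b""

    def __init__(self, key: str, model_name: str) -> None:
        self.key = key
        self.model_name = model_name

    def tts(self, text: str) -> Iterator[bytes]:
        # Fake audio: a provider tag followed by the encoded text.
        yield self.tag + text.encode("utf-8")


class OpenAISketch(ProviderSketch):
    tag = b"openai:"


class DeepInfraSketch(OpenAISketch):
    tag = b"deepinfra:"


# Keys reuse the factory names listed in this document.
REGISTRY: Dict[str, Type[ProviderSketch]] = {
    "OpenAI": OpenAISketch,
    "DeepInfra": DeepInfraSketch,
}


def make_tts(factory_name: str, key: str, model_name: str) -> ProviderSketch:
    """Look up a provider class by factory name and instantiate it."""
    cls = REGISTRY.get(factory_name)
    if cls is None:
        raise ValueError(f"unknown TTS factory: {factory_name}")
    return cls(key, model_name)


client = make_tts("DeepInfra", "fake-key", "fake-model")
first_chunk = next(client.tts("hi"))
```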
Usage Example
from tts_model import FishAudioTTS

# Initialize TTS client with JSON credentials
key_json = '{"fish_audio_ak": "your_api_key", "fish_audio_refid": "ref123"}'
tts_client = FishAudioTTS(key_json, "model_name")

# Generate audio stream from text
text = "Hello, this is a test."
audio_stream = tts_client.tts(text)

# Process chunks (e.g., save to file)
with open("output.mp3", "wb") as f:
    for chunk in audio_stream:
        # The stream ends with the token count (int); write only audio bytes.
        if isinstance(chunk, bytes):
            f.write(chunk)
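Since some providers yield a token count after the audio chunks, a caller may want to separate the two. The helper below is a hypothetical sketch (collect_audio is not part of the module), and fake_stream stands in for a real provider.tts(text) call.

```python
from typing import Iterable, Iterator, Optional, Tuple, Union


def collect_audio(stream: Iterable[Union[bytes, int]]) -> Tuple[bytes, Optional[int]]:
    """Join audio chunks and return them with the token count, if one was yielded."""
    audio = bytearray()
    token_count: Optional[int] = None
    for item in stream:
        if isinstance(item, (bytes, bytearray)):
            audio.extend(item)
        else:
            token_count = item
    return bytes(audio), token_count


def fake_stream() -> Iterator[Union[bytes, int]]:
    # Stand-in for provider.tts(text): two audio chunks, then a token count.
    yield b"\x00\x01"
    yield b"\x02"
    yield 5


audio, tokens = collect_audio(fake_stream())
```

Buffering the whole stream gives up the latency benefit of streaming, so this form suits batch use such as saving a file rather than live playback.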
Mermaid Class Diagram
classDiagram
class Base {
<<abstract>>
+__init__(key, model_name, base_url, **kwargs)
+tts(audio)
+normalize_text(text) string
}
class ServeReferenceAudio {
+audio: bytes
+text: str
}
class ServeTTSRequest {
+text: str
+chunk_length: int
+format: str
+mp3_bitrate: int
+references: list~ServeReferenceAudio~
+reference_id: str
+normalize: bool
+latency: str
}
class FishAudioTTS {
+tts(text) generator
-headers: dict
-ref_id: str
-base_url: str
}
class QwenTTS {
+tts(text) generator
-model_name: str
}
class OpenAITTS {
+tts(text, voice) generator
-api_key: str
-model_name: str
-base_url: str
-headers: dict
}
class SparkTTS {
+tts(text) generator
-APPID: str
-APISecret: str
-APIKey: str
-model_name: str
-audio_queue: Queue
+create_url() string
}
class XinferenceTTS {
+tts(text, voice, stream) generator
-base_url: str
-model_name: str
-headers: dict
}
class OllamaTTS {
+tts(text, voice) generator
-model_name: str
-base_url: str
-headers: dict
}
class GPUStackTTS {
+tts(text, voice, stream) generator
-base_url: str
-api_key: str
-model_name: str
-headers: dict
}
class SILICONFLOWTTS {
+tts(text, voice) generator
-api_key: str
-model_name: str
-base_url: str
-headers: dict
}
class DeepInfraTTS {
+tts(text, voice) generator
}
Base <|-- FishAudioTTS
Base <|-- QwenTTS
Base <|-- OpenAITTS
Base <|-- SparkTTS
Base <|-- XinferenceTTS
Base <|-- OllamaTTS
Base <|-- GPUStackTTS
Base <|-- SILICONFLOWTTS
OpenAITTS <|-- DeepInfraTTS
Summary
tts_model.py is a modular TTS abstraction layer that provides consistent access to multiple speech synthesis backends. It relies on Python class inheritance and generators to deliver audio incrementally as it is produced. The file underpins InfiniFlow's voice synthesis functionality, and support for a new TTS provider can be added by writing a Base subclass that implements tts().