tts_model.py

Overview

tts_model.py is a core module in the InfiniFlow project that provides a unified interface and multiple backend implementations for Text-to-Speech (TTS) synthesis services. It wraps each TTS provider's API in a Python class with a consistent interface, so that audio streams can be generated from input text across hosted services and self-hosted model servers.

The file defines:

- Data models for TTS requests and reference audio samples.
- An abstract base class Base for TTS providers.
- Multiple concrete TTS provider classes implementing different APIs, including Fish Audio, Tongyi-Qianwen, OpenAI, XunFei Spark, Xinference, Ollama, GPUStack, SiliconFlow, and DeepInfra.

Each provider class manages authentication, request formatting, streaming audio response handling, and error management.

Because every provider sits behind the same interface, a new TTS provider can be added as one more subclass without changing calling code, and applications can switch backends through configuration.

Classes and Functions

Data Models

`ServeReferenceAudio(BaseModel)`

Purpose: Represents a reference audio sample used for in-context learning during TTS synthesis.
Properties:
- audio: bytes — Raw audio data bytes.
- text: str — Text transcription corresponding to the audio.

`ServeTTSRequest(BaseModel)`

Purpose: Defines the structure of a TTS request payload.
Properties:
- text: str — The text to be converted to speech.
- chunk_length: int — Length of each audio chunk in milliseconds (between 100 and 300, default 200).
- format: Literal["wav", "pcm", "mp3"] — Desired audio format. Default: "mp3".
- mp3_bitrate: Literal[64, 128, 192] — Bitrate for MP3 audio. Default: 128.
- references: list[ServeReferenceAudio] — List of reference audio samples for in-context learning.
- reference_id: Optional[str] — Optional reference ID (e.g., from an external service).
- normalize: bool — Whether to normalize text input (default True).
- latency: Literal["normal", "balanced"] — Latency mode affecting speed and stability (default "normal").
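The field list above can be illustrated without any third-party dependency. The sketch below mirrors the documented fields and defaults with standard-library dataclasses and a hand-written range check. It is a stand-in for illustration only: the actual module defines these types as Pydantic BaseModel subclasses, the class names ReferenceAudioSketch and TTSRequestSketch are hypothetical, and the way the 100 to 300 bound is enforced here is an assumption.

```python
from dataclasses import dataclass, field
from typing import Literal, Optional


@dataclass
class ReferenceAudioSketch:
    """Stand-in for ServeReferenceAudio (hypothetical name)."""
    audio: bytes
    text: str


@dataclass
class TTSRequestSketch:
    """Stand-in for ServeTTSRequest (hypothetical name) with the documented fields."""
    text: str
    chunk_length: int = 200
    format: Literal["wav", "pcm", "mp3"] = "mp3"
    mp3_bitrate: Literal[64, 128, 192] = 128
    references: list[ReferenceAudioSketch] = field(default_factory=list)
    reference_id: Optional[str] = None
    normalize: bool = True
    latency: Literal["normal", "balanced"] = "normal"

    def __post_init__(self) -> None:
        # Assumed behaviour: reject values outside the documented 100-300 range.
        if not 100 <= self.chunk_length <= 300:
            raise ValueError("chunk_length must be between 100 and 300")
```

Note that dataclasses do not enforce the Literal annotations at runtime; only the explicit chunk_length check is validated in this sketch.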

Abstract Base Class

`Base(ABC)`

Purpose: Abstract base class defining the minimal interface and a shared text-cleaning utility for TTS provider implementations.
Constructor Parameters:
- key — API key or credentials for the provider.
- model_name — Model identifier.
- base_url — Base URL for API endpoint.
Methods:
- tts(audio) — Placeholder for the synthesis method; concrete providers override it with a tts(text, ...) method.
- normalize_text(text: str) -> str — Cleans input text by removing special markup tokens (**, ##\d+$$, #) to improve stability.
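As a minimal sketch of the cleaning step described above, the function below removes the three listed tokens with a single regular expression. The exact pattern is an assumption reconstructed from the token list in this document, not a copy of the module's code.

```python
import re

# Assumed pattern: "**", markers such as "##12$$", and a bare "#".
# The longer "##\d+$$" alternative is listed before "#" so that a whole
# marker is removed rather than only its leading "#" characters.
_MARKUP = re.compile(r"(\*\*|##\d+\$\$|#)")


def normalize_text(text: str) -> str:
    """Strip the markup tokens from the input text."""
    return _MARKUP.sub("", text)
```

For example, under this pattern `normalize_text("**Hi** ##12$$there#")` returns `"Hi there"`.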

Concrete TTS Provider Classes

Each provider class implements a .tts() method that accepts text input and returns a generator yielding audio chunks as bytes. Most classes also handle authentication, request formatting, streaming response parsing, and error checking.

`FishAudioTTS(Base)`

Factory Name: "Fish Audio"
Initialization:
- Accepts JSON-encoded key containing fish_audio_ak (API key) and fish_audio_refid (reference audio ID).
- Sets default base URL to https://api.fish.audio/v1/tts.
Method tts(text: str)
- Sends a POST request with serialized ServeTTSRequest data using ormsgpack.
- Streams audio bytes from the response.
- Yields chunks of audio bytes followed by the token count of the input text.
Error Handling: Raises RuntimeError on HTTP errors.

Example Usage:

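# key_json is a JSON string containing fish_audio_ak and fish_audio_refid
fish_tts = FishAudioTTS(key_json, model_name="fish-model")
for chunk in fish_tts.tts("Hello world"):
    if isinstance(chunk, bytes):  # the last item yielded is the token count, not audio
        process_audio(chunk)  # process_audio is a placeholder for your own handler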
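Because the generator ends with a token count rather than audio, calling code has to separate the two kinds of item. The helper below is one way to do that; split_stream and fake_tts are hypothetical names introduced for illustration, and fake_tts only imitates the shape of the stream (bytes chunks followed by an integer).

```python
from typing import Iterable, Iterator, Union


def split_stream(stream: Iterable[Union[bytes, int]]) -> tuple[bytes, int]:
    """Join the audio chunks and return them together with the token count."""
    audio = bytearray()
    token_count = 0
    for item in stream:
        if isinstance(item, (bytes, bytearray)):
            audio.extend(item)
        else:
            token_count = item  # the non-bytes item is the token count
    return bytes(audio), token_count


def fake_tts(text: str) -> Iterator[Union[bytes, int]]:
    """Hypothetical stand-in for a provider's tts() generator."""
    yield b"\x00\x01"
    yield b"\x02"
    yield len(text.split())  # placeholder for the real token count
```

Buffering the whole stream gives up progressive playback, so this form suits saving to a file more than live output.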

`QwenTTS(Base)`

Factory Name: "Tongyi-Qianwen"
Initialization:
- Sets API key via dashscope.api_key.
- Requires dashscope dependency.
Method tts(text: str)
- Uses dashscope SDK's SpeechSynthesizer with a custom Callback class collecting audio frames in a deque.
- Streams audio frames as they arrive.
- Yields audio frame chunks followed by token count.
Error Handling: Converts SDK errors to RuntimeError.

`OpenAITTS(Base)`

Factory Name: "OpenAI"
Initialization:
- Default base URL: https://api.openai.com/v1.
- Stores API key and model name.
Method tts(text: str, voice="alloy")
- Sends POST request to /audio/speech with JSON payload.
- Streams response content chunks.
Error Handling: Raises Exception on non-200 status.
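The request described above can be sketched as a small builder that prepares the URL, headers, and JSON body. The function name build_speech_request is hypothetical, and the payload keys (model, input, voice) follow the public OpenAI speech endpoint rather than anything confirmed by this module.

```python
import json


def build_speech_request(base_url: str, api_key: str, model: str,
                         text: str, voice: str = "alloy"):
    """Return (url, headers, body) for a POST to the /audio/speech endpoint."""
    url = base_url.rstrip("/") + "/audio/speech"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    payload = {"model": model, "input": text, "voice": voice}
    return url, headers, json.dumps(payload).encode("utf-8")
```

With the requests library, the result could be sent as `requests.post(url, headers=headers, data=body, stream=True)` and the audio read from `response.iter_content()`, after checking that `response.status_code` is 200.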

`SparkTTS(Base)`

Factory Name: "XunFei Spark"
Initialization:
- Expects JSON key with spark_app_id, spark_api_secret, spark_api_key.
- Prepares common arguments for API calls.
Method create_url()
- Creates a WebSocket URL with HMAC-SHA256 signature for authentication.
Method tts(text: str)
- Uses WebSocket to send TTS request and receive audio frames.
- Implements WebSocket callbacks for message handling, error, open, close.
- Audio frames are base64-decoded and put into a thread-safe queue.
- Yields audio chunks from the queue until end signal.
Error Handling: Raises exceptions on errors or invalid credentials.
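To make the signing step in create_url() concrete, here is a sketch of an HMAC-SHA256 signed WebSocket URL. The host, path, string-to-sign layout, and query parameter names are assumptions based on the scheme commonly documented for XunFei WebSocket APIs; none of them is confirmed by this document, and create_signed_url is a hypothetical name.

```python
import base64
import hashlib
import hmac
from urllib.parse import urlencode


def create_signed_url(api_key: str, api_secret: str, date: str,
                      host: str = "tts-api.xfyun.cn", path: str = "/v2/tts") -> str:
    """Build a wss:// URL whose query string carries an HMAC-SHA256 signature."""
    # The string to sign covers the host, the request date, and the request line.
    signature_origin = f"host: {host}\ndate: {date}\nGET {path} HTTP/1.1"
    digest = hmac.new(api_secret.encode("utf-8"),
                      signature_origin.encode("utf-8"),
                      hashlib.sha256).digest()
    signature = base64.b64encode(digest).decode("utf-8")
    # The authorization value is itself base64-encoded before being sent.
    authorization_origin = (
        f'api_key="{api_key}", algorithm="hmac-sha256", '
        f'headers="host date request-line", signature="{signature}"'
    )
    authorization = base64.b64encode(authorization_origin.encode("utf-8")).decode("utf-8")
    query = urlencode({"authorization": authorization, "date": date, "host": host})
    return f"wss://{host}{path}?{query}"
```

The date argument is expected in RFC 1123 form; `email.utils.formatdate(usegmt=True)` from the standard library produces such a string for the current time.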

`XinferenceTTS(Base)`

Factory Name: "Xinference"
Initialization: Accepts base URL via kwargs.
Method tts(text: str, voice="中文女", stream=True)
- Sends POST request to /v1/audio/speech.
- Streams audio data chunks.
Error Handling: Raises exceptions on API errors.

`OllamaTTS(Base)`

Initialization:
- Optional API key for Bearer auth.
- Default base URL: https://api.ollama.ai/v1.
Method tts(text: str, voice="standard-voice")
- Sends POST request to /audio/tts.
- Streams response audio content.
Error Handling: Raises exceptions on errors.

`GPUStackTTS(Base)`

Factory Name: "GPUStack"
Initialization: Accepts base URL via kwargs, requires API key.
Method tts(text: str, voice="Chinese Female", stream=True)
- Sends POST request to /v1/audio/speech.
- Streams audio chunks.
Error Handling: Raises exceptions on errors.

`SILICONFLOWTTS(Base)`

Factory Name: "SILICONFLOW"
Initialization:
- Default model: "FunAudioLLM/CosyVoice2-0.5B"
- Default base URL: https://api.siliconflow.cn/v1
Method tts(text: str, voice="anna")
- Sends POST request to /audio/speech with detailed payload including speed, gain, and sample rate.
- Streams MP3 audio response.
Error Handling: Raises exceptions on errors.

`DeepInfraTTS(OpenAITTS)`

Factory Name: "DeepInfra"
Inheritance: Extends OpenAITTS.
Initialization:
- Default base URL: https://api.deepinfra.com/v1/openai.
- Uses OpenAI-compatible API.
Functionality: Same as OpenAITTS, but targeting DeepInfra endpoint.
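The relationship between the two classes can be sketched as a subclass that changes only the default endpoint. The class names below are hypothetical stand-ins, and the fallback for an empty base URL is an assumption about how the defaults are applied.

```python
class OpenAIStyleTTS:
    """Hypothetical stand-in for OpenAITTS: stores credentials and endpoint."""

    DEFAULT_BASE_URL = "https://api.openai.com/v1"

    def __init__(self, key, model_name, base_url=None):
        self.api_key = key
        self.model_name = model_name
        # Fall back to the class default when no base URL is supplied.
        self.base_url = base_url or self.DEFAULT_BASE_URL


class DeepInfraStyleTTS(OpenAIStyleTTS):
    """Hypothetical stand-in for DeepInfraTTS: only the default endpoint differs."""

    DEFAULT_BASE_URL = "https://api.deepinfra.com/v1/openai"
```

Everything else, including request building and streaming, would be inherited unchanged from the parent class.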

Important Implementation Details

Streaming Audio: Provider classes yield audio chunks progressively, using streaming HTTP responses, SDK callbacks (QwenTTS), or a WebSocket connection (SparkTTS), rather than waiting for the full audio file.
Text Normalization: The normalize_text method is used to clean input text, removing markdown-like tokens that could affect TTS quality.
Authentication: Providers use various auth mechanisms including API keys in headers, HMAC signatures (SparkTTS), and bearer tokens.
Error Handling: HTTP and API errors are captured and raised as Python exceptions with meaningful messages.
Dependencies: Some providers require external SDKs (dashscope for QwenTTS) or third-party libraries (httpx, websocket-client, requests, ormsgpack).
Concurrency: SparkTTS employs threading and queues to manage asynchronous WebSocket audio frames.
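The thread-and-queue arrangement mentioned for SparkTTS follows a common producer/consumer pattern, sketched below with the standard library only. The worker function stands in for the WebSocket callbacks; stream_from_worker and the None end-of-stream sentinel are assumptions made for this example.

```python
import queue
import threading
from typing import Iterator

_END = None  # sentinel marking the end of the stream


def stream_from_worker(frames: list[bytes]) -> Iterator[bytes]:
    """Yield frames produced on a background thread, in arrival order."""
    audio_queue = queue.Queue()

    def worker() -> None:
        # Stands in for the WebSocket callbacks that receive decoded frames.
        for frame in frames:
            audio_queue.put(frame)
        audio_queue.put(_END)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = audio_queue.get()  # blocks until the worker delivers something
        if item is _END:
            break
        yield item
```

The queue decouples the network thread from the consumer, so the generator can hand out each frame as soon as it arrives.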

Integration and Interaction

This file serves as a core abstraction layer for TTS services in the InfiniFlow system.
Other components can instantiate specific TTS classes based on user configuration or environment.
The consistent .tts() interface allows calling code to stream audio data transparently regardless of backend.
The ServeTTSRequest and ServeReferenceAudio models standardize request payloads for providers that support advanced features such as reference audio samples or text normalization.
Utility functions like num_tokens_from_string (imported externally) are used to provide token counts alongside audio streaming for analytics or cost estimation.
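One way calling code might pick a provider from configuration is a registry keyed by factory name. The sketch below is hypothetical: TTS_REGISTRY, register, create_tts, and OpenAIStub are names invented for this example, and the document does not state how the real system performs this lookup.

```python
from typing import Callable, Dict

TTS_REGISTRY: Dict[str, type] = {}


def register(factory_name: str) -> Callable[[type], type]:
    """Class decorator that records a provider class under its factory name."""
    def decorator(cls: type) -> type:
        TTS_REGISTRY[factory_name] = cls
        return cls
    return decorator


@register("OpenAI")
class OpenAIStub:
    """Hypothetical provider stub that only stores its arguments."""

    def __init__(self, key: str, model_name: str, **kwargs):
        self.key = key
        self.model_name = model_name


def create_tts(factory_name: str, key: str, model_name: str, **kwargs):
    """Instantiate the provider registered under factory_name."""
    try:
        cls = TTS_REGISTRY[factory_name]
    except KeyError:
        raise ValueError(f"Unknown TTS provider: {factory_name}") from None
    return cls(key, model_name, **kwargs)
```

A call such as `create_tts("OpenAI", api_key, "tts-1")` then returns an instance whose .tts() method can be used without the caller knowing which backend is behind it.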

Usage Example

from tts_model import FishAudioTTS

# Initialize the TTS client with JSON-encoded credentials
key_json = '{"fish_audio_ak": "your_api_key", "fish_audio_refid": "ref123"}'
tts_client = FishAudioTTS(key_json, "model_name")

# Generate an audio stream from text
text = "Hello, this is a test."
audio_stream = tts_client.tts(text)

# Write the audio chunks to a file. The final item yielded is the
# token count (an int), so skip anything that is not bytes.
with open("output.mp3", "wb") as f:
    for chunk in audio_stream:
        if isinstance(chunk, bytes):
            f.write(chunk)

Mermaid Class Diagram

classDiagram
    class Base {
        <<abstract>>
        +__init__(key, model_name, base_url, **kwargs)
        +tts(audio)
        +normalize_text(text) string
    }

    class ServeReferenceAudio {
        +audio: bytes
        +text: str
    }

    class ServeTTSRequest {
        +text: str
        +chunk_length: int
        +format: str
        +mp3_bitrate: int
        +references: list~ServeReferenceAudio~
        +reference_id: str
        +normalize: bool
        +latency: str
    }

    class FishAudioTTS {
        +tts(text) generator
        -headers: dict
        -ref_id: str
        -base_url: str
    }

    class QwenTTS {
        +tts(text) generator
        -model_name: str
    }

    class OpenAITTS {
        +tts(text, voice) generator
        -api_key: str
        -model_name: str
        -base_url: str
        -headers: dict
    }

    class SparkTTS {
        +tts(text) generator
        -APPID: str
        -APISecret: str
        -APIKey: str
        -model_name: str
        -audio_queue: Queue
        +create_url() string
    }

    class XinferenceTTS {
        +tts(text, voice, stream) generator
        -base_url: str
        -model_name: str
        -headers: dict
    }

    class OllamaTTS {
        +tts(text, voice) generator
        -model_name: str
        -base_url: str
        -headers: dict
    }

    class GPUStackTTS {
        +tts(text, voice, stream) generator
        -base_url: str
        -api_key: str
        -model_name: str
        -headers: dict
    }

    class SILICONFLOWTTS {
        +tts(text, voice) generator
        -api_key: str
        -model_name: str
        -base_url: str
        -headers: dict
    }

    class DeepInfraTTS {
        +tts(text, voice) generator
    }

    Base <|-- FishAudioTTS
    Base <|-- QwenTTS
    Base <|-- OpenAITTS
    Base <|-- SparkTTS
    Base <|-- XinferenceTTS
    Base <|-- OllamaTTS
    Base <|-- GPUStackTTS
    Base <|-- SILICONFLOWTTS
    OpenAITTS <|-- DeepInfraTTS

Summary

tts_model.py is a modular TTS abstraction layer that gives InfiniFlow consistent access to multiple speech synthesis backends. Each provider is a class with a generator-based .tts() method, so audio can be consumed as it is produced, and supporting an additional provider amounts to adding one more subclass of Base.