tts_model.py


Overview

tts_model.py is a core module in the InfiniFlow project that provides a unified interface and multiple backend implementations for Text-to-Speech (TTS) synthesis. It wraps each provider's API in a Python class with the same calling convention, so that audio streams can be generated from input text regardless of whether the backend is a hosted cloud service or a self-hosted inference server.

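The file defines two data models (ServeReferenceAudio and ServeTTSRequest), an abstract base class (Base), and one concrete class per provider: FishAudioTTS, QwenTTS, OpenAITTS, SparkTTS, XinferenceTTS, OllamaTTS, GPUStackTTS, SILICONFLOWTTS, and DeepInfraTTS.

Because every provider exposes the same tts() method, a new provider can be added as another subclass, and calling code does not depend on which backend is in use.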


Classes and Functions

Data Models

ServeReferenceAudio(BaseModel)

ServeTTSRequest(BaseModel)


Abstract Base Class

Base(ABC)


Concrete TTS Provider Classes

Each provider class implements a .tts() method that accepts text input and returns a generator yielding audio chunks as bytes. Most classes also handle authentication, request formatting, streaming response parsing, and error checking.
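The contract described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: Base is a simplified stand-in whose shape is taken from the class diagram, and EchoTTS is a hypothetical provider that is not part of tts_model.py. It fakes synthesis by encoding the text as bytes so that the example runs without network access.

```python
from abc import ABC
from typing import Iterator


class Base(ABC):
    """Simplified stand-in for the module's Base class (assumed shape)."""

    def __init__(self, key, model_name, base_url=None, **kwargs):
        pass

    def tts(self, audio):
        pass


class EchoTTS(Base):
    """Hypothetical provider used only to show the generator contract."""

    def __init__(self, key, model_name, base_url=None, chunk_size=4, **kwargs):
        self.api_key = key
        self.model_name = model_name
        self.base_url = base_url
        self.chunk_size = chunk_size

    def tts(self, text: str) -> Iterator[bytes]:
        # A real provider would send `text` to its service here and stream
        # the response body; this stand-in just encodes the text as bytes.
        payload = text.encode("utf-8")
        for start in range(0, len(payload), self.chunk_size):
            yield payload[start:start + self.chunk_size]


# Callers consume chunks as they arrive, or join them into one buffer.
chunks = list(EchoTTS("fake-key", "echo-model").tts("Hello world"))
audio = b"".join(chunks)
```

Since tts() is a generator, no work happens until the caller starts iterating, and each chunk can be written to a file or forwarded to a client as soon as it is produced.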


FishAudioTTS(Base)

Example Usage:

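# key_json is a JSON string of credentials (see the Usage Example section for its fields)
fish_tts = FishAudioTTS(key_json, model_name="fish-model")
for chunk in fish_tts.tts("Hello world"):
    process_audio(chunk)  # placeholder for the caller's own chunk handler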

QwenTTS(Base)


OpenAITTS(Base)


SparkTTS(Base)


XinferenceTTS(Base)


OllamaTTS(Base)


GPUStackTTS(Base)


SILICONFLOWTTS(Base)


DeepInfraTTS(OpenAITTS)


Important Implementation Details


Integration and Interaction


Usage Example

from tts_model import FishAudioTTS

# Initialize TTS client with JSON credentials
key_json = '{"fish_audio_ak": "your_api_key", "fish_audio_refid": "ref123"}'
tts_client = FishAudioTTS(key_json, "model_name")

# Generate audio stream from text
text = "Hello, this is a test."
audio_stream = tts_client.tts(text)

# Process chunks (e.g., save to file)
with open("output.mp3", "wb") as f:
    for chunk in audio_stream:
        f.write(chunk)

Mermaid Class Diagram

classDiagram
    class Base {
        <<abstract>>
        +__init__(key, model_name, base_url, **kwargs)
        +tts(audio)
        +normalize_text(text) string
    }

    class ServeReferenceAudio {
        +audio: bytes
        +text: str
    }

    class ServeTTSRequest {
        +text: str
        +chunk_length: int
        +format: str
        +mp3_bitrate: int
        +references: list~ServeReferenceAudio~
        +reference_id: str
        +normalize: bool
        +latency: str
    }

    class FishAudioTTS {
        +tts(text) generator
        -headers: dict
        -ref_id: str
        -base_url: str
    }

    class QwenTTS {
        +tts(text) generator
        -model_name: str
    }

    class OpenAITTS {
        +tts(text, voice) generator
        -api_key: str
        -model_name: str
        -base_url: str
        -headers: dict
    }

    class SparkTTS {
        +tts(text) generator
        -APPID: str
        -APISecret: str
        -APIKey: str
        -model_name: str
        -audio_queue: Queue
        +create_url() string
    }

    class XinferenceTTS {
        +tts(text, voice, stream) generator
        -base_url: str
        -model_name: str
        -headers: dict
    }

    class OllamaTTS {
        +tts(text, voice) generator
        -model_name: str
        -base_url: str
        -headers: dict
    }

    class GPUStackTTS {
        +tts(text, voice, stream) generator
        -base_url: str
        -api_key: str
        -model_name: str
        -headers: dict
    }

    class SILICONFLOWTTS {
        +tts(text, voice) generator
        -api_key: str
        -model_name: str
        -base_url: str
        -headers: dict
    }

    class DeepInfraTTS {
        +tts(text, voice) generator
    }

    Base <|-- FishAudioTTS
    Base <|-- QwenTTS
    Base <|-- OpenAITTS
    Base <|-- SparkTTS
    Base <|-- XinferenceTTS
    Base <|-- OllamaTTS
    Base <|-- GPUStackTTS
    Base <|-- SILICONFLOWTTS
    OpenAITTS <|-- DeepInfraTTS

Summary

tts_model.py is a TTS abstraction layer that gives callers one interface, a tts() generator yielding audio bytes, over multiple speech synthesis backends. Because audio is streamed chunk by chunk, callers can begin writing or forwarding it before synthesis completes. The file underpins InfiniFlow's voice synthesis functionality, and supporting an additional provider means adding one more subclass of Base.