sequence2txt_model.py


Overview

sequence2txt_model.py is a module in the InfiniFlow project that defines a common interface, and several implementations of it, for transcribing audio into text. Its adapter classes wrap speech-to-text (STT) services and APIs such as OpenAI Whisper, Tencent Cloud ASR, Azure OpenAI, and others.

The file defines an abstract base class Base with common utility methods and transcription patterns, then extends it with concrete subclasses for different providers. Each subclass handles provider-specific authentication, API calls, request formatting, and response parsing.

This modular design allows the InfiniFlow system to flexibly switch between or integrate multiple STT backends by instantiating the corresponding class. The transcription results are returned consistently as text along with a token count (likely for usage or cost tracking).
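One way such backend switching could be wired up is a small registry that maps a provider name to its adapter class. The sketch below is illustrative only: Seq2TxtBackend, EchoBackend, BACKENDS, and make_backend are hypothetical names that do not appear in the module, and EchoBackend stands in for a real provider so the example runs offline.

```python
from abc import ABC, abstractmethod


class Seq2TxtBackend(ABC):
    """Hypothetical stand-in for the module's Base class."""

    def __init__(self, key, model_name, **kwargs):
        self.key = key
        self.model_name = model_name

    @abstractmethod
    def transcription(self, audio_path, **kwargs):
        """Return a (text, token_count) tuple."""


class EchoBackend(Seq2TxtBackend):
    """Toy backend: 'transcribes' by echoing the file name."""

    def transcription(self, audio_path, **kwargs):
        text = f"[{self.model_name}] {audio_path}"
        # Assumption: a whitespace word count stands in for the token count.
        return text, len(text.split())


# Hypothetical registry mapping a provider name to its adapter class.
BACKENDS = {"echo": EchoBackend}


def make_backend(provider, key, model_name, **kwargs):
    """Instantiate the adapter registered under the given provider name."""
    try:
        backend_cls = BACKENDS[provider]
    except KeyError:
        raise ValueError(f"unknown STT provider: {provider!r}") from None
    return backend_cls(key, model_name, **kwargs)
```

Callers then depend only on the shared transcription() signature, so swapping providers is a matter of changing the provider name passed to make_backend.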


Classes and Methods

1. Base (Abstract Base Class)

An abstract base class that defines the interface and some utility functions common to all sequence-to-text implementations.

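Methods

The class diagram at the end of this document lists three methods on Base:

  1. __init__(key, model_name, **kwargs)

  2. transcription(audio_path, **kwargs), which returns the transcribed text (string) and a token count (int).

  3. audio2base64(audio), which returns a string (judging by its name, a Base64 encoding of the audio).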
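As a rough illustration of the interface that the class diagram gives for Base, the sketch below mirrors its three method signatures. The bodies are assumptions, not the module's source: in particular, the accepted input types of audio2base64 (raw bytes or an in-memory buffer) are guessed from the method name.

```python
import base64
import io


class Base:
    """Illustrative stand-in for the module's Base class, not the actual source."""

    def __init__(self, key, model_name, **kwargs):
        self.key = key
        self.model_name = model_name

    def transcription(self, audio_path, **kwargs):
        # Subclasses would call their provider here and return (text, token_count).
        raise NotImplementedError

    def audio2base64(self, audio):
        # Assumption: audio arrives as raw bytes or as an io.BytesIO buffer.
        if isinstance(audio, io.BytesIO):
            audio = audio.getvalue()
        if not isinstance(audio, (bytes, bytearray)):
            raise TypeError("audio must be bytes or io.BytesIO")
        return base64.b64encode(audio).decode("utf-8")
```

A helper like this is typically useful for providers whose APIs expect the audio payload inline in a JSON body rather than as a file upload.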


2. GPTSeq2txt (OpenAI Whisper API Client)

Implements transcription via OpenAI's Whisper API using the official OpenAI Python SDK.

Initialization

GPTSeq2txt(key, model_name="whisper-1", base_url="https://api.openai.com/v1")
gpt = GPTSeq2txt("my_openai_key")
text, tokens = gpt.transcription("audio.wav")
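For orientation, a transcription call through the OpenAI Python SDK (v1 interface) generally looks like the sketch below. This is not the module's code: transcribe_file is a hypothetical helper, the client is passed in so the function can be exercised without network access, and the word-count token estimate is an assumption made only to produce a (text, token_count) pair.

```python
def transcribe_file(client, audio_path, model_name="whisper-1"):
    """Transcribe one audio file and return (text, token_count).

    client is expected to be an openai.OpenAI instance, or any object that
    exposes the same audio.transcriptions.create method.
    """
    with open(audio_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(
            model=model_name,
            file=audio_file,
        )
    text = result.text.strip()
    # Assumption: a whitespace word count stands in for the token count.
    return text, len(text.split())
```

With the real SDK, the client would be created as OpenAI(api_key=..., base_url=...), which matches the key and base_url arguments that the GPTSeq2txt constructor accepts.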

3. QWenSeq2txt (Tongyi-Qianwen API Client)

Interfaces with Tongyi-Qianwen's ASR API via the dashscope library.

Initialization

QWenSeq2txt(key, model_name="qwen-audio-asr")
qwen = QWenSeq2txt("my_dashscope_key")
text, tokens = qwen.transcription("audio.wav")

4. AzureSeq2txt (Azure OpenAI Whisper Client)

Adapter for the Azure OpenAI speech-to-text service.

Initialization

AzureSeq2txt(key, model_name, lang="Chinese", base_url=..., **kwargs)

5. XinferenceSeq2txt (Xinference API Client)

Client for the Xinference ASR service, which is called via HTTP POST.

Initialization

XinferenceSeq2txt(key, model_name="whisper-small", base_url=...)
xinference = XinferenceSeq2txt("api_key", base_url="https://xinference.example.com")
text, tokens = xinference.transcription("audio.wav")
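The class diagram lists language, prompt, response_format, and temperature as parameters of this class's transcription method. The sketch below shows how such parameters might be assembled into the form fields of an HTTP POST. It is a guess at the request shape, not the module's code: build_transcription_request is a hypothetical helper, and the /v1/audio/transcriptions path assumes the server exposes an OpenAI-compatible endpoint.

```python
def build_transcription_request(base_url, model_name, language="zh", prompt=None,
                                response_format="json", temperature=0.7):
    """Return (url, form_fields) for a speech-to-text POST request."""
    # Assumed endpoint path (OpenAI-compatible); adjust to the actual server.
    url = base_url.rstrip("/") + "/v1/audio/transcriptions"
    fields = {
        "model": model_name,
        "language": language,
        "response_format": response_format,
        "temperature": temperature,
    }
    # Only send a prompt when the caller supplied one.
    if prompt is not None:
        fields["prompt"] = prompt
    return url, fields
```

The returned pair could then be sent with an HTTP client as a multipart form, with the audio attached as the file part and the API key in an authorization header.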

6. TencentCloudSeq2txt (Tencent Cloud ASR Client)

Adapter for Tencent Cloud's automatic speech recognition.

Initialization

TencentCloudSeq2txt(key, model_name="16k_zh", base_url="https://asr.tencentcloudapi.com")
key_json = '{"tencent_cloud_sid": "sid", "tencent_cloud_sk": "sk"}'
tencent = TencentCloudSeq2txt(key_json)
text, tokens = tencent.transcription("audio.wav")
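Two details of this class lend themselves to a short sketch: the key is a JSON string carrying two credentials, and the class diagram shows transcription taking max_retries=60 and retry_interval=5, which suggests that the result is polled. The code below is an assumption-laden illustration of both ideas, not the module's implementation: parse_tencent_key and poll_until_done are hypothetical helpers, and the "success" and "failed" state names are invented for the example.

```python
import json
import time


def parse_tencent_key(key_json):
    """Split the JSON key string into its two credential fields."""
    creds = json.loads(key_json)
    return creds["tencent_cloud_sid"], creds["tencent_cloud_sk"]


def poll_until_done(fetch_status, max_retries=60, retry_interval=5, sleep=time.sleep):
    """Call fetch_status until the task finishes or the retries run out.

    fetch_status is a hypothetical callable returning (state, text).
    The sleep function is injectable so the loop can be tested without waiting.
    """
    for _ in range(max_retries):
        state, text = fetch_status()
        if state == "success":
            return text
        if state == "failed":
            raise RuntimeError("transcription task failed")
        sleep(retry_interval)
    raise TimeoutError(f"no result after {max_retries} attempts")
```

With the defaults shown in the diagram, a loop of this shape would wait at most 60 x 5 = 300 seconds before giving up.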

7. GPUStackSeq2txt (GPUStack API Client)

Client for the GPUStack ASR service.

Initialization

GPUStackSeq2txt(key, model_name, base_url)

8. GiteeSeq2txt (Gitee AI Client)

Client for the Gitee AI speech-to-text service.

Initialization

GiteeSeq2txt(key, model_name="whisper-1", base_url="https://ai.gitee.com/v1/")

9. DeepInfraSeq2txt (DeepInfra API Client)

Client for the DeepInfra ASR service.

Initialization

DeepInfraSeq2txt(key, model_name, base_url="https://api.deepinfra.com/v1/openai", **kwargs)

Important Implementation Details


Interaction with Other Parts of the System


Usage Summary

Typical usage involves:

  1. Instantiating the desired class with API key and configuration.

  2. Calling the transcription(audio_path) method with the audio file.

  3. Receiving the transcribed text and token count.

Example:

from sequence2txt_model import GPTSeq2txt

gpt_client = GPTSeq2txt(key="your_api_key")
text, token_count = gpt_client.transcription("path/to/audio.wav")
print(f"Transcribed text: {text}")
print(f"Token count: {token_count}")

Mermaid Class Diagram

classDiagram
    class Base {
        +__init__(key, model_name, **kwargs)
        +transcription(audio_path, **kwargs) string, int
        +audio2base64(audio) string
    }

    class GPTSeq2txt {
        +__init__(key, model_name="whisper-1", base_url)
    }
    class QWenSeq2txt {
        +__init__(key, model_name="qwen-audio-asr")
        +transcription(audio_path) string, int
    }
    class AzureSeq2txt {
        +__init__(key, model_name, lang="Chinese", **kwargs)
    }
    class XinferenceSeq2txt {
        +__init__(key, model_name="whisper-small", **kwargs)
        +transcription(audio, language="zh", prompt=None, response_format="json", temperature=0.7) string, int
    }
    class TencentCloudSeq2txt {
        +__init__(key, model_name="16k_zh", base_url)
        +transcription(audio, max_retries=60, retry_interval=5) string, int
    }
    class GPUStackSeq2txt {
        +__init__(key, model_name, base_url)
    }
    class GiteeSeq2txt {
        +__init__(key, model_name="whisper-1", base_url)
    }
    class DeepInfraSeq2txt {
        +__init__(key, model_name, base_url, **kwargs)
    }

    GPTSeq2txt --|> Base
    QWenSeq2txt --|> Base
    AzureSeq2txt --|> Base
    XinferenceSeq2txt --|> Base
    TencentCloudSeq2txt --|> Base
    GPUStackSeq2txt --|> Base
    GiteeSeq2txt --|> Base
    DeepInfraSeq2txt --|> Base

Summary

This module places multiple speech-to-text backends behind a common interface, so the InfiniFlow system can transcribe audio without depending on any single provider. Each adapter encapsulates its provider's authentication, request formatting, response parsing, and error handling, and returns the transcribed text together with a token count.