sequence2txt_model.py
Overview
sequence2txt_model.py is a module within the InfiniFlow project that provides a standardized interface and multiple implementations for converting audio sequences into text transcriptions. It serves as a collection of adapter classes that interface with various speech-to-text (STT) services and APIs, such as OpenAI Whisper, Tencent Cloud ASR, Azure OpenAI, and others.
The file defines an abstract base class Base with common utility methods and transcription patterns, then extends it with concrete subclasses for different providers. Each subclass handles provider-specific authentication, API calls, request formatting, and response parsing.
This modular design allows the InfiniFlow system to flexibly switch between or integrate multiple STT backends by instantiating the corresponding class. The transcription results are returned consistently as text along with a token count (likely for usage or cost tracking).
Classes and Methods
1. Base (Abstract Base Class)
An abstract base class that defines the interface and some utility functions common to all sequence-to-text implementations.
Methods
__init__(self, key, model_name, **kwargs)
Purpose: Constructor placeholder; actual initialization is deferred to subclasses.
Parameters:
key (str): API key or credentials for the service.
model_name (str): Name/ID of the model to use.
kwargs: Additional keyword arguments for subclasses.
Returns: None
transcription(self, audio_path, **kwargs) -> (str, int)
Purpose: Transcribes audio from a given file path using the configured client.
Parameters:
audio_path (str): Path to the audio file to transcribe.
kwargs: Additional parameters for subclasses.
Returns: Tuple of (transcription_text, token_count)
transcription_text (str): The transcribed text.
token_count (int): Number of tokens in the transcribed text.
Usage:
text, tokens = base_instance.transcription("audio.wav")
audio2base64(self, audio) -> str
Purpose: Converts audio data in bytes or BytesIO to a base64-encoded string.
Parameters:
audio (bytes or io.BytesIO): Audio data.
Returns: Base64-encoded string representation of the audio.
Raises:
TypeError if the input format is unsupported.
Usage:
b64_str = base_instance.audio2base64(audio_bytes)
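The conversion described for audio2base64 could look roughly like the sketch below. It is written from the description above, not copied from the module; the standalone function form and the error message are assumptions.

```python
import base64
import io


def audio2base64(audio):
    """Return a base64 string for raw bytes or an in-memory BytesIO buffer."""
    if isinstance(audio, bytes):
        raw = audio
    elif isinstance(audio, io.BytesIO):
        # getvalue() returns the whole buffer regardless of the read position.
        raw = audio.getvalue()
    else:
        # Assumption: any other type (for example a str path) is rejected.
        raise TypeError("audio must be bytes or io.BytesIO")
    return base64.b64encode(raw).decode("utf-8")


print(audio2base64(b"hello"))  # aGVsbG8=
```

Note that a file path passed as a string falls into the TypeError branch in this sketch, so callers would read the file into bytes first.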
2. GPTSeq2txt (OpenAI Whisper API Client)
Implements transcription via OpenAI's Whisper API using the official OpenAI Python SDK.
Initialization
GPTSeq2txt(key, model_name="whisper-1", base_url="https://api.openai.com/v1")
Parameters:
key (str): OpenAI API key.
model_name (str, optional): Whisper model name; defaults to "whisper-1".
base_url (str, optional): API base URL; defaults to OpenAI's.
Usage Example:
gpt = GPTSeq2txt("my_openai_key")
text, tokens = gpt.transcription("audio.wav")
Implementation Notes:
Uses the OpenAI client from the openai package.
Calls client.audio.transcriptions.create() to perform transcription.
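The call pattern can be sketched as follows. This is an illustration under stated assumptions, not the module's code: transcribe, count_tokens and FakeTranscriptions are hypothetical names, the whitespace-based token count merely stands in for the project's tokenizer, and the fake client exists only so the example runs offline.

```python
import os
import tempfile
from types import SimpleNamespace


def count_tokens(text):
    # Hypothetical stand-in for a real tokenizer: counts whitespace-separated words.
    return len(text.split())


def transcribe(client, model_name, audio_path):
    """Send one audio file to an OpenAI-style client and return (text, tokens)."""
    with open(audio_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(
            model=model_name,
            file=audio_file,
        )
    text = result.text.strip()
    return text, count_tokens(text)


class FakeTranscriptions:
    """Offline stand-in that mimics the shape of client.audio.transcriptions."""

    def create(self, model, file):
        file.read()  # a real client would upload these bytes
        return SimpleNamespace(text=" hello world ")


fake_client = SimpleNamespace(
    audio=SimpleNamespace(transcriptions=FakeTranscriptions())
)

# Write a tiny placeholder file so the example has something to open.
with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
    tmp.write(b"\x00\x01")
    demo_path = tmp.name

text, tokens = transcribe(fake_client, "whisper-1", demo_path)
os.remove(demo_path)
print(text, tokens)
```

With the real SDK, an OpenAI(api_key=..., base_url=...) instance would take the place of fake_client.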
3. QWenSeq2txt (Tongyi-Qianwen API Client)
Interfaces with Tongyi-Qianwen's audio ASR API via the dashscope library.
Initialization
QWenSeq2txt(key, model_name="qwen-audio-asr")
Parameters:
key (str): Dashscope API key.
model_name (str): Model identifier.
transcription(self, audio_path) -> (str, int)
Uses dashscope.MultiModalConversation to send audio for transcription.
Handles streaming responses and aggregates text.
Returns an error if an unsupported model is specified.
Usage Example:
qwen = QWenSeq2txt("my_dashscope_key")
text, tokens = qwen.transcription("audio.wav")
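The flow of this transcription method (reject unsupported models, turn the path into a URI, stream and join partial results, wrap failures in an error string) can be sketched without calling dashscope at all. Everything here is an assumption made for illustration: the helper names, the shape of the streamed chunks (plain dicts with a "text" key), the use of pathlib for the URI, and the whitespace-based token count.

```python
from pathlib import Path

UNSUPPORTED_MARKERS = ("paraformer", "sensevoice")


def is_supported(model_name):
    """Reject model names that contain one of the unsupported markers."""
    name = model_name.lower()
    return not any(marker in name for marker in UNSUPPORTED_MARKERS)


def to_file_uri(audio_path):
    """Turn a local path into an absolute file:// URI."""
    return Path(audio_path).resolve().as_uri()


def transcribe(model_name, audio_path, stream_fn):
    """Join streamed partial results; return (text, token_count)."""
    if not is_supported(model_name):
        return f"ERROR: model {model_name} is not supported", 0
    try:
        # stream_fn is a hypothetical callable that yields chunk dicts.
        chunks = stream_fn(to_file_uri(audio_path))
        text = "".join(chunk.get("text", "") for chunk in chunks)
    except Exception as exc:
        return f"ERROR: {exc}", 0
    return text, len(text.split())


def fake_stream(uri):
    # Stand-in for a streaming API response.
    yield {"text": "hello "}
    yield {"text": "world"}


def failing_stream(uri):
    raise RuntimeError("boom")


print(transcribe("qwen-audio-asr", "audio.wav", fake_stream))
print(transcribe("paraformer-v2", "audio.wav", fake_stream))
```

Passing the streaming call in as a parameter keeps the sketch testable; the actual class calls dashscope directly.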
Implementation Details:
Rejects models containing "paraformer" or "sensevoice" as unsupported.
Converts the local audio path to URI format.
Streams partial transcription results and concatenates them.
Exception handling wraps errors with an "ERROR" prefix.
4. AzureSeq2txt (Azure OpenAI Whisper Client)
Adapter for Azure OpenAI speech-to-text service.
Initialization
AzureSeq2txt(key, model_name, lang="Chinese", base_url=..., **kwargs)
Parameters:
key (str): Azure API key.
model_name (str): Model name.
lang (str): Language for transcription (default: "Chinese").
base_url/azure_endpoint (str): Azure service endpoint.
Implementation Notes:
Uses the AzureOpenAI client class.
Initialization expects the endpoint to be supplied as a keyword argument named "base_url".
5. XinferenceSeq2txt (Xinference API Client)
Client for Xinference ASR service via HTTP POST.
Initialization
XinferenceSeq2txt(key, model_name="whisper-small", base_url=...)
Parameters:
key (str): API key.
model_name (str): Model to use.
base_url (str): API endpoint URL.
transcription(self, audio, language="zh", prompt=None, response_format="json", temperature=0.7) -> (str, int)
Supports input as either a file path or raw audio bytes.
Sends multipart POST request with audio file and parameters.
Parses the JSON response for the "text" field.
Returns an error string and zero tokens if the request fails.
Usage Example:
xinference = XinferenceSeq2txt("api_key", base_url="https://xinference.example.com")
text, tokens = xinference.transcription("audio.wav")
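The request preparation and response parsing described for this class might be split into two small helpers, as in the sketch below. The helper names, the form field names ("file", "model"), the placeholder filename for raw bytes, and the whitespace-based token count are assumptions; the service's actual endpoint path is not shown in this document and is left out.

```python
import io
import json
import os


def build_request(audio, model_name, language="zh", prompt=None,
                  response_format="json", temperature=0.7):
    """Return (files, data) for a multipart POST from a path or raw bytes."""
    if isinstance(audio, str):
        # A string is treated as a file path.
        with open(audio, "rb") as handle:
            payload = handle.read()
        filename = os.path.basename(audio)
    else:
        payload = bytes(audio)
        filename = "audio.wav"  # assumed placeholder name for raw bytes
    files = {"file": (filename, io.BytesIO(payload))}
    data = {
        "model": model_name,
        "language": language,
        "prompt": prompt,
        "response_format": response_format,
        "temperature": temperature,
    }
    # Leave out optional fields that were not set.
    data = {k: v for k, v in data.items() if v is not None}
    return files, data


def parse_response(status_code, body):
    """Map an HTTP status and JSON body to (text, token_count)."""
    if status_code != 200:
        return f"ERROR: request failed with status {status_code}", 0
    text = json.loads(body).get("text", "").strip()
    # Whitespace count stands in for a real tokenizer.
    return text, len(text.split())


files, data = build_request(b"abc", "whisper-small")
print(sorted(data))
print(parse_response(200, '{"text": " hi there "}'))
```

The pair returned by build_request has the shape that requests.post accepts through its files and data arguments.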
6. TencentCloudSeq2txt (Tencent Cloud ASR Client)
Adapter for Tencent Cloud's automatic speech recognition.
Initialization
TencentCloudSeq2txt(key, model_name="16k_zh", base_url="https://asr.tencentcloudapi.com")
Parameters:
key (str): JSON string containing Tencent Cloud credentials (tencent_cloud_sid, tencent_cloud_sk).
model_name (str): Model identifier.
base_url (str): API endpoint URL (default provided).
transcription(self, audio, max_retries=60, retry_interval=5) -> (str, int)
Converts audio to base64.
Creates a transcription task via CreateRecTask.
Polls the task status up to max_retries times, waiting retry_interval seconds between attempts.
Returns cleaned text by removing timestamps.
Returns an error if the task fails or the retries are exhausted.
Usage Example:
key_json = '{"tencent_cloud_sid": "sid", "tencent_cloud_sk": "sk"}'
tencent = TencentCloudSeq2txt(key_json)
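# The audio is base64-encoded internally, so pass bytes (or BytesIO), not a path.
with open("audio.wav", "rb") as f:
    text, tokens = tencent.transcription(f.read())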
Implementation Details:
Uses Tencent Cloud SDK for Python.
Handles exceptions from SDK and general exceptions gracefully.
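The polling and cleanup steps can be sketched independently of the Tencent Cloud SDK. The status strings, the timestamp format matched by the regular expression, the fixed wait between attempts, and the get_status callable are all assumptions made for illustration; the real class talks to the SDK directly.

```python
import re
import time

# Assumed timestamp shape, e.g. "[0:0.000,0:2.500]" in front of each segment.
TIMESTAMP = re.compile(r"\[\d+:\d+\.\d+,\d+:\d+\.\d+\]\s*")


def strip_timestamps(result):
    """Remove bracketed timestamps and surrounding whitespace."""
    return TIMESTAMP.sub("", result).strip()


def poll_task(get_status, max_retries=60, retry_interval=5, sleep=time.sleep):
    """Poll until the task succeeds, fails, or the retries run out."""
    for _ in range(max_retries):
        status, result = get_status()
        if status == "success":
            text = strip_timestamps(result)
            # Whitespace count stands in for a real tokenizer.
            return text, len(text.split())
        if status == "failed":
            return f"ERROR: task failed: {result}", 0
        sleep(retry_interval)  # fixed wait between attempts
    return "ERROR: max retries exceeded", 0


# Simulated task: still running on the first poll, finished on the second.
responses = iter([
    ("doing", ""),
    ("success", "[0:0.000,0:1.500]  hello world\n"),
])
print(poll_task(lambda: next(responses), sleep=lambda seconds: None))
```

Injecting sleep makes the loop testable without real delays; with the defaults shown, the loop would give up after roughly five minutes.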
7. GPUStackSeq2txt (GPUStack API Client)
Client for GPUStack ASR service.
Initialization
GPUStackSeq2txt(key, model_name, base_url)
Parameters:
key (str): API key.
model_name (str): Model identifier.
base_url (str): Base URL for the API; appends /v1 if missing.
Notes:
Does not define a transcription method in this snippet; as a subclass of Base it presumably inherits Base.transcription, unless the method is implemented elsewhere.
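The base URL handling mentioned in the parameter list could be sketched as below. The function name and the treatment of a trailing slash are assumptions; only the "append /v1 if missing" behavior comes from the description.

```python
def normalize_base_url(base_url):
    """Ensure the URL ends with a /v1 path segment."""
    trimmed = base_url.rstrip("/")
    if trimmed.split("/")[-1] != "v1":
        trimmed += "/v1"
    return trimmed


print(normalize_base_url("http://localhost:8080"))
print(normalize_base_url("http://localhost:8080/v1"))
```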
8. GiteeSeq2txt (Gitee AI Client)
Client for Gitee AI speech-to-text service.
Initialization
GiteeSeq2txt(key, model_name="whisper-1", base_url="https://ai.gitee.com/v1/")
Parameters:
key (str): API key.
model_name (str): Model name.
base_url (str): API base URL.
Implementation:
Uses OpenAI client class pointing to Gitee AI endpoints.
transcription method inherited from Base.
9. DeepInfraSeq2txt (DeepInfra API Client)
Client for DeepInfra ASR service.
Initialization
DeepInfraSeq2txt(key, model_name, base_url="https://api.deepinfra.com/v1/openai", **kwargs)
Parameters:
key (str): API key.
model_name (str): Model name.
base_url (str): API base URL.
Implementation:
Uses OpenAI client targeting DeepInfra API.
transcription method inherited from Base.
Important Implementation Details
Token Counting: All transcriptions return a tuple containing the transcription text and the token count. Token counting uses the utility function num_tokens_from_string from rag.utils, presumably to measure the size of the transcribed text for cost or quota purposes.
Error Handling: Subclasses often return transcription strings prefixed with "ERROR:" and zero tokens when exceptions or API failures occur, enabling calling code to detect failures gracefully.
Audio Input Flexibility: Some classes accept either file paths or raw bytes for audio input, converting as needed.
Polling Mechanism: TencentCloudSeq2txt implements a polling loop with retries to handle asynchronous transcription tasks.
API Clients: The module uses multiple third-party SDKs and HTTP clients:
openai official SDK for OpenAI and similar APIs.
dashscope for Tongyi-Qianwen.
requests for Xinference HTTP requests.
Tencent Cloud Python SDK.
Azure OpenAI SDK wrapper.
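Because failures are reported inside the returned string and not raised, calling code has to inspect the result itself. One possible caller-side check is sketched here; TranscriptionError and unwrap are hypothetical names, and matching on the bare "ERROR" prefix is an assumption about the exact string the adapters produce.

```python
class TranscriptionError(RuntimeError):
    """Hypothetical exception type used only in this sketch."""


def unwrap(result):
    """Return (text, tokens) or raise if the text carries an error prefix."""
    text, tokens = result
    if text.startswith("ERROR"):
        raise TranscriptionError(text)
    return text, tokens


print(unwrap(("hello world", 2)))
try:
    unwrap(("ERROR: request failed", 0))
except TranscriptionError as exc:
    print("failed:", exc)
```

A prefix check like this cannot distinguish a failure from a genuine transcript that happens to begin with the word ERROR, which is a limitation of signalling errors in-band.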
Interaction with Other Parts of the System
The module depends on rag.utils.num_tokens_from_string for token counting, indicating integration with a broader retrieval-augmented generation (RAG) system or token management utility.
Each class carries a _FACTORY_NAME attribute, which suggests that these classes may be dynamically instantiated based on configuration or factory patterns elsewhere in InfiniFlow.
The classes abstract away provider-specific details, allowing the rest of the application to invoke transcription() uniformly regardless of backend.
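How a _FACTORY_NAME attribute might drive dynamic instantiation is sketched below. This is a guess at the pattern, not the project's code: the stub classes, the factory name strings, and the build_registry and create helpers are all hypothetical.

```python
class Base:
    _FACTORY_NAME = None

    def __init__(self, key, model_name, **kwargs):
        self.key = key
        self.model_name = model_name


class GPTSeq2txtStub(Base):
    _FACTORY_NAME = "OpenAI"  # hypothetical value


class GiteeSeq2txtStub(Base):
    _FACTORY_NAME = "GiteeAI"  # hypothetical value


def build_registry(classes):
    """Map each class's _FACTORY_NAME to the class itself."""
    return {cls._FACTORY_NAME: cls for cls in classes}


def create(registry, factory_name, key, model_name, **kwargs):
    """Instantiate the adapter registered under factory_name."""
    try:
        cls = registry[factory_name]
    except KeyError:
        raise ValueError(f"unknown factory: {factory_name}") from None
    return cls(key, model_name, **kwargs)


registry = build_registry([GPTSeq2txtStub, GiteeSeq2txtStub])
client = create(registry, "OpenAI", "my_key", "whisper-1")
print(type(client).__name__, client.model_name)
```

A lookup table like this lets configuration select a backend by name while the rest of the application only ever calls transcription().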
Usage Summary
Typical usage involves:
Instantiating the desired class with API key and configuration.
Calling the transcription(audio_path) method with the audio file.
Receiving the transcribed text and token count.
Example:
from sequence2txt_model import GPTSeq2txt
gpt_client = GPTSeq2txt(key="your_api_key")
text, token_count = gpt_client.transcription("path/to/audio.wav")
print(f"Transcribed text: {text}")
print(f"Token count: {token_count}")
Mermaid Class Diagram
classDiagram
class Base {
+__init__(key, model_name, **kwargs)
+transcription(audio_path, **kwargs) string, int
+audio2base64(audio) string
}
class GPTSeq2txt {
+__init__(key, model_name="whisper-1", base_url)
}
class QWenSeq2txt {
+__init__(key, model_name="qwen-audio-asr")
+transcription(audio_path) string, int
}
class AzureSeq2txt {
+__init__(key, model_name, lang="Chinese", **kwargs)
}
class XinferenceSeq2txt {
+__init__(key, model_name="whisper-small", **kwargs)
+transcription(audio, language="zh", prompt=None, response_format="json", temperature=0.7) string, int
}
class TencentCloudSeq2txt {
+__init__(key, model_name="16k_zh", base_url)
+transcription(audio, max_retries=60, retry_interval=5) string, int
}
class GPUStackSeq2txt {
+__init__(key, model_name, base_url)
}
class GiteeSeq2txt {
+__init__(key, model_name="whisper-1", base_url)
}
class DeepInfraSeq2txt {
+__init__(key, model_name, base_url, **kwargs)
}
GPTSeq2txt --|> Base
QWenSeq2txt --|> Base
AzureSeq2txt --|> Base
XinferenceSeq2txt --|> Base
TencentCloudSeq2txt --|> Base
GPUStackSeq2txt --|> Base
GiteeSeq2txt --|> Base
DeepInfraSeq2txt --|> Base
Summary
This module abstracts multiple speech-to-text backends behind a common interface, providing robust, extensible, and provider-agnostic transcription capabilities for the InfiniFlow system. It encapsulates API intricacies, error handling, and token counting, enabling seamless integration of diverse STT services.