sequence2txt_model.py

Overview

sequence2txt_model.py is a module within the InfiniFlow project that provides a standardized interface and multiple implementations for converting audio sequences into text transcriptions. It serves as a collection of adapter classes that interface with various speech-to-text (STT) services and APIs, such as OpenAI Whisper, Tencent Cloud ASR, Azure OpenAI, and others.

The file defines an abstract base class Base with common utility methods and transcription patterns, then extends it with concrete subclasses for different providers. Each subclass handles provider-specific authentication, API calls, request formatting, and response parsing.

This modular design allows the InfiniFlow system to flexibly switch between or integrate multiple STT backends by instantiating the corresponding class. The transcription results are returned consistently as text along with a token count (likely for usage or cost tracking).

Classes and Methods

1. `Base` (Abstract Base Class)

An abstract base class that defines the interface and some utility functions common to all sequence-to-text implementations.

Methods

__init__(self, key, model_name, **kwargs)
- Purpose: Constructor placeholder; actual initialization is deferred to subclasses.
- Parameters:
  - key (str): API key or credentials for the service.
  - model_name (str): Name/ID of the model to use.
  - kwargs: Additional keyword arguments for subclasses.
- Returns: None

transcription(self, audio_path, **kwargs) -> (str, int)
- Purpose: Transcribes audio from a given file path using the configured client.
- Parameters:
  - audio_path (str): Path to the audio file to transcribe.
  - kwargs: Additional parameters for subclasses.
- Returns: Tuple of (transcription_text, token_count)
  - transcription_text (str): The transcribed text.
  - token_count (int): Number of tokens in the transcribed text.
- Usage:
```
text, tokens = base_instance.transcription("audio.wav")
```
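
The shared transcription pattern can be sketched as follows. This is a minimal illustration, not the module's actual code: `SketchBase`, `count_tokens`, and the stub client are hypothetical names, and the whitespace-based token count only stands in for `num_tokens_from_string` from rag.utils. The call shape `client.audio.transcriptions.create(model=..., file=...)` follows the OpenAI Python SDK.

```python
from types import SimpleNamespace


def count_tokens(text):
    # Stand-in for rag.utils.num_tokens_from_string (assumption): a
    # whitespace split keeps this sketch free of external dependencies.
    return len(text.split())


class SketchBase:
    """Hypothetical base class showing the (text, token_count) contract."""

    def __init__(self, client, model_name):
        self.client = client
        self.model_name = model_name

    def transcription(self, audio_path, **kwargs):
        # Open the audio file in binary mode and hand it to the client.
        with open(audio_path, "rb") as audio_file:
            result = self.client.audio.transcriptions.create(
                model=self.model_name, file=audio_file
            )
        text = result.text.strip()
        return text, count_tokens(text)


class FakeTranscriptions:
    """Stub that mimics the SDK call without touching the network."""

    def create(self, model, file):
        file.read()  # consume the upload, as a real client would
        return SimpleNamespace(text=" hello world ")


fake_client = SimpleNamespace(
    audio=SimpleNamespace(transcriptions=FakeTranscriptions())
)
```

Swapping `fake_client` for a real SDK client is the only change a provider-specific subclass would need under this assumption.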

audio2base64(self, audio) -> str
- Purpose: Converts audio data in bytes or BytesIO to a base64-encoded string.
- Parameters:
  - audio (bytes or io.BytesIO): Audio data.
- Returns: Base64 encoded string representation of the audio.
- Raises: TypeError if input format is unsupported.
- Usage:
```
b64_str = base_instance.audio2base64(audio_bytes)
```
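
A minimal sketch of the conversion described above, written as a standalone function rather than a method. The exact error message is an assumption; only the accepted types and the TypeError come from the description.

```python
import base64
import io


def audio2base64(audio):
    """Encode raw audio (bytes or io.BytesIO) as a base64 string."""
    if isinstance(audio, bytes):
        raw = audio
    elif isinstance(audio, io.BytesIO):
        # getvalue() returns the full buffer regardless of the read position.
        raw = audio.getvalue()
    else:
        # Message text is illustrative, not taken from the module.
        raise TypeError("audio must be bytes or io.BytesIO")
    return base64.b64encode(raw).decode("utf-8")
```

Note that a file path (str) is rejected here, so callers holding a path would need to read the file into bytes first.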

2. `GPTSeq2txt` (OpenAI Whisper API Client)

Implements transcription via OpenAI's Whisper API using the official OpenAI Python SDK.

Initialization

GPTSeq2txt(key, model_name="whisper-1", base_url="https://api.openai.com/v1")

Parameters:
- key (str): OpenAI API key.
- model_name (str, optional): Whisper model name; defaults to "whisper-1".
- base_url (str, optional): API base URL; defaults to OpenAI's.

Usage Example:

```
gpt = GPTSeq2txt("my_openai_key")
text, tokens = gpt.transcription("audio.wav")
```

Implementation Notes:
- Uses OpenAI client from the openai package.
- Calls client.audio.transcriptions.create() to perform transcription.

3. `QWenSeq2txt` (Tongyi-Qianwen API Client)

Interfaces with Tongyi-Qianwen's audio ASR API via the dashscope library.

Initialization

QWenSeq2txt(key, model_name="qwen-audio-asr")

Parameters:
- key (str): Dashscope API key.
- model_name (str): Model identifier.

transcription(self, audio_path) -> (str, int)
- Uses dashscope.MultiModalConversation to send audio for transcription.
- Handles streaming responses and aggregates the text.
- Returns an error string if an unsupported model is specified.

Usage Example:

```
qwen = QWenSeq2txt("my_dashscope_key")
text, tokens = qwen.transcription("audio.wav")
```

Implementation Details:
- Rejects models containing "paraformer" or "sensevoice" as unsupported.
- Converts local audio path to URI format.
- Streams partial transcription results and concatenates them.
- Exceptions are caught and returned as a string with an "ERROR" prefix.
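
The control flow in these notes can be sketched without the dashscope dependency. Everything here is illustrative: `stream_fn` is a hypothetical stand-in for the streaming dashscope call, the error wording and the zero token count on failure are assumptions, and the whitespace token count only approximates the real counter.

```python
from pathlib import Path

UNSUPPORTED_MARKERS = ("paraformer", "sensevoice")


def check_model(model_name):
    """Return an error string for unsupported models, else None."""
    lowered = model_name.lower()
    for marker in UNSUPPORTED_MARKERS:
        if marker in lowered:
            return f"ERROR: model '{model_name}' is not supported"
    return None


def to_file_uri(audio_path):
    """Turn a local path into a file:// URI."""
    return Path(audio_path).resolve().as_uri()


def transcribe(audio_path, model_name, stream_fn):
    """Guard the model name, stream partial texts, and join them."""
    error = check_model(model_name)
    if error is not None:
        return error, 0
    try:
        # stream_fn yields partial transcription strings (hypothetical).
        chunks = stream_fn(to_file_uri(audio_path), model_name)
        text = "".join(chunks)
        return text, len(text.split())
    except Exception as exc:
        # Failures are reported in-band rather than raised.
        return f"ERROR: {exc}", 0
```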

4. `AzureSeq2txt` (Azure OpenAI Whisper Client)

Adapter for Azure OpenAI speech-to-text service.

Initialization

AzureSeq2txt(key, model_name, lang="Chinese", base_url=..., **kwargs)

Parameters:
- key (str): Azure API key.
- model_name (str): Model name.
- lang (str): Language for transcription (default: "Chinese").
- base_url/azure_endpoint (str): Azure service endpoint.
Implementation Notes:
- Uses AzureOpenAI client class.
- The endpoint is read from the base_url keyword argument.

5. `XinferenceSeq2txt` (Xinference API Client)

Client for Xinference ASR service via HTTP POST.

Initialization

XinferenceSeq2txt(key, model_name="whisper-small", base_url=...)

Parameters:
- key (str): API key.
- model_name (str): Model to use.
- base_url (str): API endpoint URL.

transcription(self, audio, language="zh", prompt=None, response_format="json", temperature=0.7) -> (str, int)
- Accepts input as either a file path or raw audio bytes.
- Sends a multipart POST request with the audio file and parameters.
- Parses the JSON response for the "text" field.
- Returns an error string and zero tokens if the request fails.

Usage Example:

```
xinference = XinferenceSeq2txt("api_key", base_url="https://xinference.example.com")
text, tokens = xinference.transcription("audio.wav")
```
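
How the two input forms might be normalized into a multipart payload is sketched below. The form field names ("file", "model", and so on) and the placeholder file name are assumptions modeled on OpenAI-style transcription endpoints, and the function stops short of sending anything; the returned pair has the shape that `requests.post(url, files=files, data=data)` accepts.

```python
import io
import os


def build_multipart(audio, model_name, language="zh", prompt=None,
                    response_format="json", temperature=0.7):
    """Return (files, data) in the shape accepted by requests.post()."""
    if isinstance(audio, (bytes, bytearray)):
        # Raw bytes: the file name is a placeholder chosen for this sketch.
        filename, payload = "audio.wav", bytes(audio)
    elif isinstance(audio, (str, os.PathLike)):
        path = os.fspath(audio)
        with open(path, "rb") as handle:
            filename, payload = os.path.basename(path), handle.read()
    else:
        raise TypeError("audio must be a file path or raw bytes")

    files = {"file": (filename, io.BytesIO(payload))}
    data = {
        "model": model_name,
        "language": language,
        "response_format": response_format,
        "temperature": temperature,
    }
    # Only send a prompt when the caller supplied one.
    if prompt is not None:
        data["prompt"] = prompt
    return files, data
```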

6. `TencentCloudSeq2txt` (Tencent Cloud ASR Client)

Adapter for Tencent Cloud's automatic speech recognition.

Initialization

TencentCloudSeq2txt(key, model_name="16k_zh", base_url="https://asr.tencentcloudapi.com")

Parameters:
- key (str): JSON string containing Tencent Cloud credentials (tencent_cloud_sid, tencent_cloud_sk).
- model_name (str): Model identifier.
- base_url (str): API endpoint URL (default provided).

transcription(self, audio, max_retries=60, retry_interval=5) -> (str, int)
- Converts audio to base64.
- Creates a transcription task via CreateRecTask.
- Polls the task status up to max_retries times, waiting retry_interval seconds between attempts.
- Returns the text with timestamps removed.
- Returns an error if the task fails or the retries are exhausted.
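
Usage Example:

```
key_json = '{"tencent_cloud_sid": "sid", "tencent_cloud_sk": "sk"}'
tencent = TencentCloudSeq2txt(key_json)
# The audio is base64-encoded internally, so pass bytes rather than a path.
with open("audio.wav", "rb") as f:
    text, tokens = tencent.transcription(f.read())
```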

Implementation Details:
- Uses Tencent Cloud SDK for Python.
- Handles exceptions from SDK and general exceptions gracefully.
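
The polling and timestamp cleanup can be sketched independently of the Tencent Cloud SDK. This sketch uses a fixed wait between attempts, which is what the retry_interval parameter suggests. `fetch_status` is a hypothetical callable wrapping the status query, and the status strings, the timestamp format, the error wording, and the whitespace token count are all assumptions.

```python
import re
import time

# Assumed timestamp shape, e.g. "[0:0.020,0:2.380]  hello".
TIMESTAMP = re.compile(r"\[\d+:\d+\.\d+,\d+:\d+\.\d+\]\s*")


def strip_timestamps(result):
    """Remove bracketed time ranges from a recognition result."""
    return TIMESTAMP.sub("", result).strip()


def poll_task(fetch_status, max_retries=60, retry_interval=5, sleep=time.sleep):
    """Poll fetch_status() until the task succeeds, fails, or times out.

    fetch_status returns (status, result); status is "success", "failed",
    or any other value while the task is still running.
    """
    for _ in range(max_retries):
        status, result = fetch_status()
        if status == "success":
            text = strip_timestamps(result)
            return text, len(text.split())
        if status == "failed":
            return "ERROR: task failed", 0
        # Still running: wait a fixed interval before the next attempt.
        sleep(retry_interval)
    return "ERROR: max retries exceeded", 0
```

Passing `sleep` as a parameter keeps the loop testable without real delays. With the defaults, the loop waits at most 60 x 5 = 300 seconds before giving up.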

7. `GPUStackSeq2txt` (GPUStack API Client)

Client for GPUStack ASR service.

Initialization

GPUStackSeq2txt(key, model_name, base_url)

Parameters:
- key (str): API key.
- model_name (str): Model identifier.
- base_url (str): Base URL for the API; appends /v1 if missing.
Notes:
- This snippet does not define a transcription method for the class; it is likely implemented elsewhere or meant to be added by extension.
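
The base URL handling can be sketched as a small helper. `normalize_base_url` is a hypothetical name, and both the trailing-slash trimming and the ValueError on an empty URL are assumptions; only the "append /v1 if missing" behavior comes from the description.

```python
def normalize_base_url(base_url):
    """Ensure the URL ends with a /v1 path segment."""
    if not base_url:
        # Rejecting empty input is an assumption made for this sketch.
        raise ValueError("base_url must not be empty")
    trimmed = base_url.rstrip("/")
    if trimmed.split("/")[-1] != "v1":
        trimmed = trimmed + "/v1"
    return trimmed
```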

8. `GiteeSeq2txt` (Gitee AI Client)

Client for Gitee AI speech-to-text service.

Initialization

GiteeSeq2txt(key, model_name="whisper-1", base_url="https://ai.gitee.com/v1/")

Parameters:
- key (str): API key.
- model_name (str): Model name.
- base_url (str): API base URL.
Implementation:
- Uses OpenAI client class pointing to Gitee AI endpoints.
- transcription method inherited from Base.

9. `DeepInfraSeq2txt` (DeepInfra API Client)

Client for DeepInfra ASR service.

Initialization

DeepInfraSeq2txt(key, model_name, base_url="https://api.deepinfra.com/v1/openai", **kwargs)

Parameters:
- key (str): API key.
- model_name (str): Model name.
- base_url (str): API base URL.
Implementation:
- Uses OpenAI client targeting DeepInfra API.
- transcription method inherited from Base.

Important Implementation Details

- Token Counting: Every transcription returns a tuple of the transcribed text and its token count. The count comes from the utility function num_tokens_from_string in rag.utils and is computed on the transcribed text, presumably for cost or quota tracking.
- Error Handling: Subclasses often return a string prefixed with "ERROR:" and zero tokens when an exception or API failure occurs, so calling code can detect failures by inspecting the returned text.
- Audio Input Flexibility: Some classes accept either a file path or raw bytes as audio input and convert as needed.
- Polling Mechanism: TencentCloudSeq2txt implements a polling loop with retries to handle asynchronous transcription tasks.
- API Clients: The module uses several third-party SDKs and HTTP clients:
  - openai official SDK for OpenAI and OpenAI-compatible APIs (Gitee AI, DeepInfra).
  - dashscope for Tongyi-Qianwen.
  - requests for Xinference HTTP requests.
  - Tencent Cloud Python SDK.
  - AzureOpenAI client class for Azure OpenAI.
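
Because failures are reported in-band, calling code has to inspect the returned text. A caller-side sketch, assuming the "ERROR" prefix convention described above (`TranscriptionError` and `unwrap` are hypothetical names, not part of the module):

```python
class TranscriptionError(RuntimeError):
    """Raised by this sketch when a backend reports failure in-band."""


def unwrap(result):
    """Turn an in-band error tuple into an exception, else pass it through."""
    text, tokens = result
    # Caveat: a genuine transcript that starts with "ERROR" would be
    # misclassified; in-band error signalling cannot rule this out.
    if text.startswith("ERROR"):
        raise TranscriptionError(text)
    return text, tokens
```

A call site would then read `text, tokens = unwrap(model.transcription(path))`.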

Interaction with Other Parts of the System

- The module depends on rag.utils.num_tokens_from_string for token counting, which ties it to a broader retrieval-augmented generation (RAG) codebase.
- Each class carries a _FACTORY_NAME attribute, which suggests that classes are looked up and instantiated dynamically from configuration elsewhere in InfiniFlow.
- The classes hide provider-specific details, so the rest of the application can call transcription() uniformly regardless of backend.
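
One way such a lookup could work is a registry keyed by _FACTORY_NAME. The sketch below is a guess at the pattern, not InfiniFlow's actual mechanism: the classes are bare stubs, the factory name strings are illustrative values, and `build_registry` is a hypothetical helper.

```python
class Base:
    _FACTORY_NAME = None

    def __init__(self, key, model_name, **kwargs):
        self.key = key
        self.model_name = model_name


class GPTSeq2txt(Base):
    _FACTORY_NAME = "OpenAI"  # illustrative value


class TencentCloudSeq2txt(Base):
    _FACTORY_NAME = "Tencent Cloud"  # illustrative value


def build_registry(base_cls):
    """Map each direct subclass's _FACTORY_NAME to the class itself."""
    registry = {}
    # __subclasses__() lists direct subclasses only; deeper hierarchies
    # would need a recursive walk.
    for cls in base_cls.__subclasses__():
        name = getattr(cls, "_FACTORY_NAME", None)
        if name:
            registry[name] = cls
    return registry


registry = build_registry(Base)
model = registry["Tencent Cloud"]("key", "16k_zh")
```

With a registry like this, configuration only needs to store the provider name and constructor arguments.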

Usage Summary

Typical usage involves:

1. Instantiating the desired class with an API key and configuration.
2. Calling the transcription(audio_path) method with the audio file.
3. Receiving the transcribed text and token count.

Example:

```
from sequence2txt_model import GPTSeq2txt

gpt_client = GPTSeq2txt(key="your_api_key")
text, token_count = gpt_client.transcription("path/to/audio.wav")
print(f"Transcribed text: {text}")
print(f"Token count: {token_count}")
```

Mermaid Class Diagram

```mermaid
classDiagram
    class Base {
        +__init__(key, model_name, **kwargs)
        +transcription(audio_path, **kwargs) string, int
        +audio2base64(audio) string
    }

    class GPTSeq2txt {
        +__init__(key, model_name="whisper-1", base_url)
    }
    class QWenSeq2txt {
        +__init__(key, model_name="qwen-audio-asr")
        +transcription(audio_path) string, int
    }
    class AzureSeq2txt {
        +__init__(key, model_name, lang="Chinese", **kwargs)
    }
    class XinferenceSeq2txt {
        +__init__(key, model_name="whisper-small", **kwargs)
        +transcription(audio, language="zh", prompt=None, response_format="json", temperature=0.7) string, int
    }
    class TencentCloudSeq2txt {
        +__init__(key, model_name="16k_zh", base_url)
        +transcription(audio, max_retries=60, retry_interval=5) string, int
    }
    class GPUStackSeq2txt {
        +__init__(key, model_name, base_url)
    }
    class GiteeSeq2txt {
        +__init__(key, model_name="whisper-1", base_url)
    }
    class DeepInfraSeq2txt {
        +__init__(key, model_name, base_url, **kwargs)
    }

    GPTSeq2txt --|> Base
    QWenSeq2txt --|> Base
    AzureSeq2txt --|> Base
    XinferenceSeq2txt --|> Base
    TencentCloudSeq2txt --|> Base
    GPUStackSeq2txt --|> Base
    GiteeSeq2txt --|> Base
    DeepInfraSeq2txt --|> Base
```

Summary

This module abstracts multiple speech-to-text backends behind a common interface, providing robust, extensible, and provider-agnostic transcription capabilities for the InfiniFlow system. It encapsulates API intricacies, error handling, and token counting, enabling seamless integration of diverse STT services.
