audio.py

Overview

The audio.py file transcribes audio files into text with a Speech-to-Text large language model (LLM) service and tokenizes the result. It defines a single function, chunk, which takes an audio file as binary data, verifies its format, saves it to a temporary file, and calls the transcription service to convert the audio into text. The transcription is then tokenized for downstream natural language processing (NLP) tasks such as semantic search or indexing.

This file acts as a bridge between raw audio input and text-based NLP components, integrating tightly with the LLM service layer for transcription and the RAG (Retrieval-Augmented Generation) NLP tokenization utilities.

Detailed Description

Function: `chunk`

chunk(filename, binary, tenant_id, lang, callback=None, **kwargs) -> list[dict]

Purpose

Processes an audio file by:

- Validating its file extension.
- Temporarily saving the audio binary content to disk.
- Invoking a Speech-to-Text LLM service to transcribe the audio.
- Tokenizing the transcription text for further analysis.
- Returning a list containing a dictionary with metadata and tokenized transcription.

Parameters

- filename (str): The name of the audio file, including its extension (e.g., "interview.wav"). Used to infer the file type and to produce the title tokens.
- binary (bytes): Raw binary content of the audio file.
- tenant_id (str): Identifier of the tenant or client, used to initialize the LLM service in multi-tenant environments.
- lang (str): Language code or name (e.g., "English") of the speech in the audio. It is passed to the Speech-to-Text service and determines whether the transcription is tokenized as English.
- callback (callable, optional): A function that receives progress updates and status messages during processing. Signature: callback(progress: float, msg: str).
- **kwargs: Additional keyword arguments (currently unused, reserved for future extensions).

Returns

list of dict: A list containing a single dictionary with the following keys:
- "docnm_kwd": The original filename.
- "title_tks": Tokenized form of the filename without extension (coarse tokens).
- "title_sm_tks": Fine-grained tokens derived from the filename tokens.
- Plus additional keys added by the tokenize function to represent the transcribed content tokens.

If an error occurs (for example, an unsupported extension or a transcription failure), the function returns an empty list and reports the error through callback with a progress value of -1.

Usage Example

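# Assumption: the import path below is not stated in this document;
# adjust it to wherever audio.py lives in your checkout.
from rag.app.audio import chunk

def progress_callback(progress, msg):
    print(f"Progress: {progress*100:.1f}%, Message: {msg}")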

with open("sample_audio.wav", "rb") as f:
    audio_bytes = f.read()

results = chunk(
    filename="sample_audio.wav",
    binary=audio_bytes,
    tenant_id="tenant_123",
    lang="English",
    callback=progress_callback
)

if results:
    doc = results[0]
    print("Filename tokens:", doc["title_tks"])
    print("Fine tokens:", doc["title_sm_tks"])
    # Access transcription tokens added by tokenize()
else:
    print("Failed to process audio.")

Implementation Details

File Extension Validation:
The function extracts the file extension and validates it against a hardcoded list of supported audio formats, such as .wav, .mp3, .aac, and .flac. Files with an unsupported extension are rejected before transcription is attempted.
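
The check described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the helper name validate_extension, the shortened extension set, the case-insensitive comparison, and the error messages are all assumptions made for the example.

```python
import os

# Illustrative subset only; the module's real hardcoded list is longer
# and may differ from this one.
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".aac", ".flac"}


def validate_extension(filename: str) -> str:
    """Return the extension of `filename`, or raise if it is missing or unsupported."""
    _, ext = os.path.splitext(filename)
    if not ext:
        raise RuntimeError("No extension detected.")
    # Case-insensitive comparison is an assumption of this sketch.
    if ext.lower() not in SUPPORTED_EXTENSIONS:
        raise RuntimeError(f"Extension {ext} is not supported yet.")
    return ext
```

Raising an exception here lets a single error handler further up report the problem through the callback instead of duplicating that logic at each check.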
Temporary File Handling:
The audio binary content is saved to a uniquely named temporary file with the appropriate extension using Python's tempfile.NamedTemporaryFile (with delete=False to keep the file during processing). This file is passed to the LLM transcription service and then cleaned up after processing.
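
A possible shape for this temporary-file step is sketched below. The helper with_temp_audio and its process parameter are hypothetical names introduced for illustration; only tempfile.NamedTemporaryFile with delete=False and the cleanup afterwards come from the description above.

```python
import os
import tempfile


def with_temp_audio(binary: bytes, ext: str, process):
    """Write `binary` to a temp file ending in `ext`, call process(path), then delete the file."""
    tmp_path = ""
    try:
        with tempfile.NamedTemporaryFile(suffix=ext, delete=False) as tmpf:
            tmpf.write(binary)
            tmp_path = tmpf.name
        # The handle is closed here, but the file stays on disk because
        # delete=False, so another component can open it by path.
        return process(tmp_path)
    finally:
        # Runs on success and on failure, so the file never leaks.
        if tmp_path and os.path.exists(tmp_path):
            os.remove(tmp_path)
```

Keeping the suffix on the temporary file matters when the consumer infers the audio format from the file name.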
Transcription via LLM:
The LLMBundle class from api.db.services.llm_service is instantiated with the tenant ID, specifying the SPEECH2TEXT LLM type and the target language. The .transcription() method is called with the path to the temporary audio file, returning the transcribed text.
Tokenization:
The filename is tokenized twice:
- rag_tokenizer.tokenize() produces coarse tokens of the filename without file extension.
- rag_tokenizer.fine_grained_tokenize() further refines these tokens.
Then the transcribed text is tokenized by calling tokenize(doc, ans, eng), where eng is a boolean indicating if the language is English.
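
To make the two-pass filename tokenization concrete, here is a small sketch that runs without the real tokenizer. coarse_tokenize and fine_tokenize are hypothetical stand-ins for rag_tokenizer.tokenize and rag_tokenizer.fine_grained_tokenize and do not reproduce their behavior; the regular expression used to strip the extension and the way eng is derived from lang are likewise assumptions.

```python
import re


def coarse_tokenize(text: str) -> str:
    """Stand-in tokenizer: lowercase, then split on non-alphanumeric characters."""
    return " ".join(t for t in re.split(r"[\W_]+", text.lower()) if t)


def fine_tokenize(tokens: str) -> str:
    """Stand-in refinement: separate runs of letters from runs of digits."""
    return " ".join(re.findall(r"[a-z]+|\d+", tokens))


def build_title_fields(filename: str, lang: str) -> tuple[dict, bool]:
    """Build the filename-derived fields and the English flag."""
    title = re.sub(r"\.[a-zA-Z0-9]+$", "", filename)  # drop the extension
    doc = {"docnm_kwd": filename, "title_tks": coarse_tokenize(title)}
    doc["title_sm_tks"] = fine_tokenize(doc["title_tks"])
    eng = lang.lower() == "english"  # assumed derivation of the flag
    return doc, eng


doc, eng = build_title_fields("Team_Interview2024.wav", "English")
print(doc["title_tks"])     # team interview2024
print(doc["title_sm_tks"])  # team interview 2024
```

The fine-grained pass takes the coarse tokens as input rather than the raw filename, which mirrors the order described above.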
Error Handling and Callbacks:
Errors during processing invoke the callback with a progress of -1 and the error message. The temporary file is always removed in a finally block to avoid resource leaks.
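
The control flow described above follows a common try/except/finally pattern, sketched here in generic form. run_step, work, and cleanup are hypothetical names; in the module itself the work would be the transcription and tokenization, and the cleanup would be the removal of the temporary file.

```python
def run_step(work, cleanup, callback):
    """Run work(); on failure report through callback with progress -1 and return []."""
    try:
        return work()
    except Exception as exc:
        # A negative progress value signals failure to the caller.
        callback(-1, str(exc))
        return []
    finally:
        # Executed whether work() succeeded or raised.
        cleanup()


events = []


def failing_work():
    raise RuntimeError("transcription failed")


result = run_step(failing_work, lambda: events.append("cleaned"), lambda p, m: events.append((p, m)))
print(result)  # []
print(events)  # [(-1, 'transcription failed'), 'cleaned']
```

Returning an empty list rather than re-raising keeps the caller's contract simple: a non-empty result means success, and the callback carries the reason for any failure.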

Interaction with Other Components

api.db.LLMType and api.db.services.llm_service.LLMBundle:
These provide the interface to the LLM service infrastructure. LLMBundle encapsulates the instantiation and usage of specific LLM models, here specifically the speech-to-text model.
rag.nlp.rag_tokenizer and rag.nlp.tokenize:
Utilities for tokenizing text, supporting both coarse and fine-grained tokenization strategies. These are critical for preparing the transcription output for downstream NLP tasks.
Callback Mechanism:
Enables progress reporting to higher-level orchestration layers or user interfaces during lengthy transcription operations.
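
On the receiving side, an orchestration layer might pass a callable object instead of a plain function so that it can inspect the updates afterwards. The ProgressRecorder class below is a hypothetical example, assuming only the callback(progress: float, msg: str) signature and the convention that a progress of -1 marks an error.

```python
from dataclasses import dataclass, field


@dataclass
class ProgressRecorder:
    """Callable that stores every (progress, msg) update it receives."""

    updates: list = field(default_factory=list)

    def __call__(self, progress: float, msg: str) -> None:
        self.updates.append((progress, msg))

    @property
    def failed(self) -> bool:
        # Any negative progress value is treated as an error report.
        return any(p < 0 for p, _ in self.updates)

    @property
    def last_error(self):
        errors = [m for p, m in self.updates if p < 0]
        return errors[-1] if errors else None


recorder = ProgressRecorder()
recorder(0.1, "Transcribing audio")
recorder(-1, "Extension .txt is not supported yet.")
print(recorder.failed)      # True
print(recorder.last_error)  # Extension .txt is not supported yet.
```

An instance can be passed wherever a callback is expected, for example as the callback argument of chunk.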

Mermaid Class Diagram

classDiagram
    class LLMBundle {
        +__init__(tenant_id: str, llm_type: LLMType, lang: str)
        +transcription(audio_path: str) str
    }

    class rag_tokenizer {
        +tokenize(text: str) list
        +fine_grained_tokenize(tokens: list) list
    }

    class audio {
        +chunk(filename: str, binary: bytes, tenant_id: str, lang: str, callback: callable=None, **kwargs) list[dict]
    }

    audio ..> LLMBundle : uses
    audio ..> rag_tokenizer : uses

Summary

audio.py is a utility module that transforms raw audio files into tokenized text representations by calling a Speech-to-Text LLM service. It wraps file handling, format validation, transcription, and tokenization in a single function, chunk, which can be integrated into larger data processing or NLP pipelines within the InfiniFlow system. The callback parameter supports progress reporting, and the tenant_id and lang parameters support multi-tenant and multilingual deployments.