# audio.py
Overview
The audio.py file provides functionality to process and transcribe audio files into text tokens using a Speech-to-Text large language model (LLM). It primarily defines the chunk function, which takes an audio file (as binary data), verifies its format, temporarily saves it, and then uses an LLM-based transcription service to convert the audio into text. The resulting transcription is tokenized to facilitate downstream natural language processing (NLP) tasks such as semantic search or indexing.
This file acts as a bridge between raw audio input and text-based NLP components, integrating tightly with the LLM service layer for transcription and the RAG (Retrieval-Augmented Generation) NLP tokenization utilities.
Detailed Description
Function: chunk
chunk(filename, binary, tenant_id, lang, callback=None, **kwargs) -> list[dict]
Purpose
Processes an audio file by:
Validating its file extension.
Temporarily saving the audio binary content to disk.
Invoking a Speech-to-Text LLM service to transcribe the audio.
Tokenizing the transcription text for further analysis.
Returning a list containing a dictionary with metadata and tokenized transcription.
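The first step, extension validation, can be sketched as follows. This is a minimal illustration rather than the module's actual code: the helper names `audio_extension` and `title_stem` are hypothetical, and the set of extensions is an assumed subset of whatever list the real function hardcodes.

```python
import os
from typing import Optional

# Assumed, illustrative subset; the real module's hardcoded list is longer.
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".aac", ".flac"}


def audio_extension(filename: str) -> Optional[str]:
    """Return the lowercased extension if it is supported, else None."""
    ext = os.path.splitext(filename)[1].lower()
    return ext if ext in SUPPORTED_EXTENSIONS else None


def title_stem(filename: str) -> str:
    """Return the filename without its extension (the text tokenized as the title)."""
    return os.path.splitext(filename)[0]
```

Comparing the lowercased extension means a name such as "interview.WAV" is accepted, while a name with no extension or an unknown one is rejected before any file is written to disk.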
Parameters
filename (str): The name of the audio file, including extension (e.g., "interview.wav"). Used to infer file type and to tokenize the file name.
binary (bytes): Raw binary content of the audio file.
tenant_id (str): Identifier for tenant or client, used to initialize the LLM service for multi-tenant environments.
lang (str): Language code or name (e.g., "English") indicating the language spoken in the audio. Used for language-specific processing.
callback (callable, optional): A function to receive progress updates and status messages during processing. Signature: callback(progress: float, msg: str).
**kwargs: Additional keyword arguments (currently unused, reserved for future extensions).
Returns
list of dict: A list containing a single dictionary with the following keys:
"docnm_kwd": The original filename.
"title_tks": Tokenized form of the filename without extension (coarse tokens).
"title_sm_tks": Fine-grained tokens derived from the filename tokens.
Plus additional keys added by the tokenize function to represent the transcribed content tokens.
If an error occurs (unsupported extension, transcription failure, etc.), returns an empty list and reports the error via callback.
Usage Example
# Assumes chunk has already been imported from the audio module;
# the exact import path depends on the project layout.
def progress_callback(progress, msg):
    print(f"Progress: {progress*100:.1f}%, Message: {msg}")

with open("sample_audio.wav", "rb") as f:
    audio_bytes = f.read()

results = chunk(
    filename="sample_audio.wav",
    binary=audio_bytes,
    tenant_id="tenant_123",
    lang="English",
    callback=progress_callback
)

if results:
    doc = results[0]
    print("Filename tokens:", doc["title_tks"])
    print("Fine tokens:", doc["title_sm_tks"])
    # Access transcription tokens added by tokenize()
else:
    print("Failed to process audio.")
Implementation Details
File Extension Validation:
The function extracts the file extension and validates it against a hardcoded list of supported audio formats (e.g., .wav, .mp3, .aac, .flac, etc.). This prevents unsupported files from being processed.
Temporary File Handling:
The audio binary content is saved to a uniquely named temporary file with the appropriate extension using Python's tempfile.NamedTemporaryFile (with delete=False to keep the file during processing). This file is passed to the LLM transcription service and then cleaned up after processing.
Transcription via LLM:
The LLMBundle class from api.db.services.llm_service is instantiated with the tenant ID, specifying the SPEECH2TEXT LLM type and the target language. The .transcription() method is called with the path to the temporary audio file, returning the transcribed text.
Tokenization:
The filename is tokenized twice:
rag_tokenizer.tokenize() produces coarse tokens of the filename without file extension.
rag_tokenizer.fine_grained_tokenize() further refines these tokens.
Then the transcribed text is tokenized by calling tokenize(doc, ans, eng), where eng is a boolean indicating if the language is English.
Error Handling and Callbacks:
Errors during processing invoke the callback with a progress of -1 and the error message. The temporary file is always removed in a finally block to avoid resource leaks.
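The temporary-file lifecycle and error convention described above can be sketched as follows. This is a hedged illustration under stated assumptions, not the module's code: `transcribe_bytes` is a hypothetical helper, and its `transcribe` argument is a stand-in for the speech-to-text call (in the real module, the LLMBundle transcription method), injected here so the sketch runs without any LLM service.

```python
import os
import tempfile
from typing import Callable, Optional


def transcribe_bytes(
    binary: bytes,
    suffix: str,
    transcribe: Callable[[str], str],
    callback: Optional[Callable[[float, str], None]] = None,
) -> Optional[str]:
    """Write `binary` to a temp file, transcribe it, and always clean up.

    `transcribe` is a hypothetical stand-in for the speech-to-text call;
    it receives the temp file path and returns the transcribed text.
    """
    tmp_path = ""
    try:
        # delete=False keeps the file on disk after the handle is closed,
        # so the transcriber can reopen it by path.
        with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
            tmp.write(binary)
            tmp_path = tmp.name
        return transcribe(tmp_path)
    except Exception as exc:
        # Mirror the documented convention: progress -1 signals failure.
        if callback is not None:
            callback(-1, str(exc))
        return None
    finally:
        # Runs on success and on failure, so the temp file does not leak.
        if tmp_path and os.path.exists(tmp_path):
            os.unlink(tmp_path)
```

Closing the file handle before transcription matters on platforms where a file cannot be opened twice at once. Because delete=False disables automatic removal, the explicit cleanup in the finally block is what actually deletes the file.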
Interaction with Other Components
api.db.LLMType and api.db.services.llm_service.LLMBundle:
These provide the interface to the LLM service infrastructure. LLMBundle encapsulates the instantiation and usage of specific LLM models, here specifically the speech-to-text model.
rag.nlp.rag_tokenizer and rag.nlp.tokenize:
Utilities for tokenizing text, supporting both coarse and fine-grained tokenization strategies. These are critical for preparing the transcription output for downstream NLP tasks.
Callback Mechanism:
Enables progress reporting to higher-level orchestration layers or user interfaces during lengthy transcription operations.
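As one possible way for an orchestration layer to consume these updates, the sketch below records every callback invocation and exposes a failure flag. `ProgressRecorder` is a hypothetical helper, not part of the module. It assumes only the documented callback signature, callback(progress: float, msg: str), and the documented convention that a progress of -1 marks an error.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ProgressRecorder:
    """Callable that records (progress, msg) events for later inspection."""

    events: List[Tuple[float, str]] = field(default_factory=list)

    def __call__(self, progress: float, msg: str) -> None:
        self.events.append((progress, msg))

    @property
    def failed(self) -> bool:
        # Per the documented convention, a progress value of -1 marks an error.
        return any(progress == -1 for progress, _ in self.events)

    @property
    def last_message(self) -> str:
        # Empty string when no event has been recorded yet.
        return self.events[-1][1] if self.events else ""
```

An instance can be passed as the callback argument of chunk. After the call returns, the caller can check failed and last_message to tell an empty result caused by an error apart from other outcomes.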
Mermaid Class Diagram
classDiagram
    class LLMBundle {
        +__init__(tenant_id: str, llm_type: LLMType, lang: str)
        +transcription(audio_path: str) str
    }
    class rag_tokenizer {
        +tokenize(text: str) list
        +fine_grained_tokenize(tokens: list) list
    }
    class audio {
        +chunk(filename: str, binary: bytes, tenant_id: str, lang: str, callback: callable=None, **kwargs) list[dict]
    }
    audio ..> LLMBundle : uses
    audio ..> rag_tokenizer : uses
Summary
audio.py is a utility module designed to transform raw audio files into tokenized text representations by leveraging a Speech-to-Text LLM service. It abstracts the complexity of file handling, format validation, transcription, and tokenization into a single function, chunk, which can be integrated into larger data processing or NLP pipelines within the InfiniFlow system. Its design supports extensibility through callbacks and parameterization, facilitating multi-tenant and multilingual environments.