tokenizer.py
Overview
The tokenizer.py file is a core component of the InfiniFlow system responsible for processing textual data by tokenizing and optionally embedding it. It serves as a preprocessing step that converts raw text or structured chunks into tokenized formats suitable for downstream natural language processing (NLP) tasks, including search and retrieval.
This file primarily defines two classes:
TokenizerParam: Configuration parameters governing the tokenization and embedding behavior.
Tokenizer: Implements the main logic for tokenizing textual inputs and generating embeddings with support for asynchronous execution and batch processing.
The file integrates with various services and utilities to retrieve embedding models, manage tenant or knowledge base contexts, and enforce API timeouts and concurrency limits. It supports multiple search methods including full-text tokenization and vector embeddings, enabling flexible search capabilities within the InfiniFlow ecosystem.
Classes and Methods
TokenizerParam(ProcessParamBase)
Configuration class for the Tokenizer component.
Description
Extends ProcessParamBase.
Defines parameters controlling the tokenization and embedding strategies.
Provides validation for parameter values.
Supplies an input form schema for UI or API consumption.
Properties
| Property | Type | Default Value | Description |
|---|---|---|---|
| search_method | list[str] | | Specifies which tokenization/search methods to apply. Valid values include full_text and embedding. |
| filename_embd_weight | float | | Weight factor used to blend filename embeddings with content embeddings. |
Methods
check() -> None
Validates the values in search_method. Raises an error if any value is invalid.
get_input_form() -> dict[str, dict]
Returns an empty dictionary (placeholder for future input form schema).
Usage Example
param = TokenizerParam()
param.search_method = ["embedding"]
param.filename_embd_weight = 0.2
param.check() # Raises error if invalid values are set
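The validation performed by check() can be illustrated with a standalone sketch. The class below is a hypothetical stand-in, not the actual TokenizerParam: the allowed set is taken from the search methods named in this document, while the default values and the ValueError exception type are assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Search methods named in this document; treated here as the full allowed set.
ALLOWED_SEARCH_METHODS = {"full_text", "embedding"}


@dataclass
class TokenizerParamSketch:
    """Hypothetical stand-in for TokenizerParam, for illustration only."""

    # Both defaults are assumptions; the real defaults are not shown here.
    search_method: list[str] = field(default_factory=lambda: ["full_text", "embedding"])
    filename_embd_weight: float = 0.1

    def check(self) -> None:
        # Collect every entry that is not a known search method.
        invalid = [m for m in self.search_method if m not in ALLOWED_SEARCH_METHODS]
        if invalid:
            raise ValueError(f"Invalid search_method value(s): {invalid}")


param = TokenizerParamSketch(search_method=["embedding"], filename_embd_weight=0.2)
param.check()  # passes silently because "embedding" is allowed
```

Collecting all invalid entries before raising gives a single error message that names every offending value, which tends to be easier to act on than failing at the first one.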
Tokenizer(ProcessBase)
Main processing class responsible for tokenizing text chunks and generating embeddings.
Class Attributes
| Attribute | Type | Description |
|---|---|---|
| component_name | str | Name that identifies the component. |
Methods
async _embedding(self, name: str, chunks: list[dict]) -> tuple[list[dict], int]
Generates embeddings for chunks of text, optionally blending with filename embeddings.
Parameters
| Parameter | Type | Description |
|---|---|---|
| name | str | The name associated with the text (e.g., filename or document title). Used to generate a title embedding vector. |
| chunks | list[dict] | List of text chunks, each chunk being a dictionary containing textual data and optionally questions. |
Returns
tuple:
list[dict]: The input chunks enriched with embedding vectors under keys like "q_<dim>_vec".
int: Total token count consumed during embedding.
Description
Determines the embedding model to use based on the knowledge base or tenant context.
Prepares text inputs by cleaning HTML tags and concatenating questions if available.
Computes a title embedding vector for the name.
Encodes chunk texts in batches asynchronously with concurrency limits and timeouts.
Combines title embeddings and chunk embeddings weighted by filename_embd_weight.
Adds computed embedding vectors into each chunk's dictionary.
Returns the modified chunks and total tokens used.
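The blending and key-naming steps can be sketched in a few lines. This is a minimal illustration, not the component's actual code: the formula weight * title + (1 - weight) * content is an assumed form of the weighted sum, and blend_vectors and attach_vector are hypothetical helper names.

```python
def blend_vectors(title_vec: list[float], chunk_vec: list[float], weight: float) -> list[float]:
    """Weighted sum of a title vector and a chunk vector (assumed formula)."""
    if len(title_vec) != len(chunk_vec):
        raise ValueError("title and chunk vectors must have the same dimension")
    return [weight * t + (1.0 - weight) * c for t, c in zip(title_vec, chunk_vec)]


def attach_vector(chunk: dict, vector: list[float]) -> dict:
    """Store the vector under a key derived from its dimension, e.g. q_3_vec."""
    chunk[f"q_{len(vector)}_vec"] = vector
    return chunk


title_vec = [1.0, 0.0, 0.0]
chunk_vec = [0.0, 1.0, 0.0]
blended = blend_vectors(title_vec, chunk_vec, 0.25)
chunk = attach_vector({"text": "Example text chunk"}, blended)
# chunk now holds the blended vector under the key "q_3_vec"
```

Deriving the key from the vector length means chunks embedded by models with different output dimensions never share a field name.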
Important Details
Uses embed_limiter to throttle concurrent embedding requests.
Employs trio.to_thread.run_sync to run blocking embedding calls in worker threads, so the event loop is not blocked.
Uses regex substitution to remove certain HTML tags from text before embedding.
Input texts are truncated to fit the model's max length minus a margin.
Embedding vectors have dynamic dimension keys based on vector length (e.g., q_768_vec).
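The text preparation described above might look like the following sketch. Several details are assumptions: the set of tags removed (table-related tags here), the choice to embed the joined questions instead of the chunk text when questions exist, and the character-based truncation, which stands in for whatever length measure the real model uses. prepare_text and truncate are hypothetical helper names.

```python
import re

# Assumption: table-related tags are the ones removed; the actual tag set is not listed here.
TAG_PATTERN = re.compile(r"</?(table|td|caption|tr|th)( [^<>]{0,12})?>")


def prepare_text(chunk: dict) -> str:
    """Pick the text to embed for one chunk."""
    questions = chunk.get("questions")
    if questions:
        # Assumed behavior: when questions exist, embed them joined by newlines.
        return "\n".join(questions)
    # Otherwise replace each matched tag with a space.
    return TAG_PATTERN.sub(" ", chunk.get("text", ""))


def truncate(text: str, max_length: int, margin: int = 10) -> str:
    """Character-based stand-in for truncating to max length minus a margin."""
    return text[: max(max_length - margin, 0)]


cleaned = prepare_text({"text": "<table><tr><td>cell</td></tr></table>"})
from_questions = prepare_text({"text": "ignored", "questions": ["What is this?", "Why?"]})
short = truncate("a" * 100, max_length=30)
```

Replacing tags with a space rather than an empty string keeps words from adjacent cells from running together.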
Usage Example
chunks = [{"text": "Example text chunk", "questions": ["What is this?"]}]
chunks_with_embd, token_count = await tokenizer._embedding("document_name", chunks)
async _invoke(self, **kwargs) -> None
Primary entry point to process incoming data: performs tokenization and embedding as configured.
Parameters
| Parameter | Type | Description |
|---|---|---|
| kwargs | dict | Arbitrary keyword arguments containing input data from upstream components. |
Returns
None
Description
Validates and parses input using the TokenizerFromUpstream schema.
Determines which search methods (full_text, embedding) to apply based on parameters.
For full-text tokenization:
Processes chunks if present, tokenizing questions, keywords, and main text.
Otherwise, extracts text payloads from upstream results (markdown, text, html).
Uses rag_tokenizer for tokenization and fine-grained tokenization.
Provides progress callbacks during tokenization.
For embedding:
Warns if the upstream name is empty (which may reduce embedding accuracy).
Calls _embedding to generate embeddings for processed chunks.
Records token consumption.
Sets outputs for downstream use, including enriched chunks and embedding token usage.
Handles and reports input validation errors.
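The full-text branch can be pictured as adding token fields to each chunk. The sketch below uses two simple stand-in functions in place of rag_tokenizer, whose real behavior is not reproduced here, and the output field names (text_tks, text_sm_tks, question_tks, keyword_tks) are hypothetical.

```python
import re


def tokenize(text: str) -> str:
    """Stand-in coarse tokenizer: lowercase words separated by single spaces."""
    return " ".join(re.findall(r"[\w-]+", text.lower()))


def fine_grained_tokenize(tokens: str) -> str:
    """Stand-in fine-grained pass: split hyphenated or underscored tokens further."""
    return " ".join(p for p in re.split(r"[\s_-]+", tokens) if p)


def tokenize_chunk(chunk: dict) -> dict:
    """Add token fields for the main text, questions, and keywords of one chunk."""
    out = dict(chunk)  # copy so the caller's chunk is left untouched
    out["text_tks"] = tokenize(chunk.get("text", ""))
    out["text_sm_tks"] = fine_grained_tokenize(out["text_tks"])
    if chunk.get("questions"):
        out["question_tks"] = tokenize("\n".join(chunk["questions"]))
    if chunk.get("keywords"):
        out["keyword_tks"] = tokenize(" ".join(chunk["keywords"]))
    return out


result = tokenize_chunk({"text": "Full-text search", "questions": ["What is this?"]})
```

Keeping a coarse and a fine-grained token field side by side lets a search backend match both the compound term and its parts.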
Important Details
Uses random callbacks to provide progress updates with messages.
Supports multiple input formats (chunks, markdown_result, text_result, html_result, json_result).
Integrates tightly with rag_tokenizer for multi-level tokenization.
Embedding and tokenization steps can be selectively enabled via search_method.
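A rough sketch of how the input formats and search_method could drive the flow is shown below. The field names come from the list above, but the priority order, the assumption that json_result holds a list of chunk-like dictionaries, and the helper names collect_chunks and plan_steps are all illustrative guesses.

```python
def collect_chunks(upstream: dict) -> list[dict]:
    """Return chunk dicts from whichever upstream field is populated."""
    if upstream.get("chunks"):
        return list(upstream["chunks"])
    if upstream.get("json_result"):
        # Assumed to hold a list of chunk-like dictionaries.
        return list(upstream["json_result"])
    for key in ("markdown_result", "text_result", "html_result"):
        payload = upstream.get(key)
        if payload:
            # Wrap a plain-text payload as a single chunk.
            return [{"text": payload}]
    return []


def plan_steps(search_method: list[str]) -> tuple[bool, bool]:
    """Return (run_full_text, run_embedding) flags for the configured methods."""
    lowered = [m.lower() for m in search_method]
    return "full_text" in lowered, "embedding" in lowered


chunks = collect_chunks({"markdown_result": "# Title\nBody text"})
run_full_text, run_embedding = plan_steps(["embedding"])
```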
Usage Example
await tokenizer._invoke(
chunks=[{"text": "Sample text", "questions": ["What is this?"]}],
name="doc1",
output_format="json"
)
Implementation Details and Algorithms
Tokenization: Utilizes rag_tokenizer to produce token lists from text. Supports:
Normal tokenization (tokenize).
Fine-grained tokenization (fine_grained_tokenize).
Tokenization applied to different text fields: questions, keywords, main text.
Embedding:
Embeddings are created using models obtained from LLMBundle, which abstracts the underlying embedding model per tenant or knowledge base.
Batching of embedding requests is done to improve efficiency, with batch size controlled by EMBEDDING_BATCH_SIZE.
Uses a timeout decorator (timeout(60)) to ensure the embedding operation does not hang indefinitely.
Embeddings for the name (e.g., filename) and content are combined with a weighted sum, where the filename embedding weight is configurable.
Concurrency and Rate Limiting:
Embedding inference is performed asynchronously with Trio concurrency primitives.
embed_limiter enforces rate limiting on embedding calls to avoid overloading backend services.
Error Handling:
Input validation errors are caught and reported to outputs.
Empty names for embeddings trigger warnings to aid debugging.
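The batching, throttling, and timeout pattern described in this section can be sketched with the standard library. Note the substitutions: asyncio is used here in place of Trio, an asyncio.Semaphore stands in for embed_limiter, asyncio.wait_for stands in for the timeout decorator, and fake_encode is a hypothetical replacement for the real embedding model. The batch size value is illustrative.

```python
import asyncio

EMBEDDING_BATCH_SIZE = 2  # illustrative value; the real one comes from global settings


def fake_encode(texts: list[str]) -> tuple[list[list[float]], int]:
    """Synchronous stand-in for an embedding model: returns vectors and a token count."""
    vectors = [[float(len(t)), 1.0] for t in texts]
    tokens = sum(len(t.split()) for t in texts)
    return vectors, tokens


async def embed_all(texts: list[str], limiter: asyncio.Semaphore, timeout_s: float = 60.0):
    """Encode texts batch by batch, accumulating vectors and token usage."""
    vectors: list[list[float]] = []
    total_tokens = 0
    for start in range(0, len(texts), EMBEDDING_BATCH_SIZE):
        batch = texts[start : start + EMBEDDING_BATCH_SIZE]
        async with limiter:  # cap the number of concurrent embedding calls
            # Run the blocking call in a worker thread, bounded by a timeout.
            batch_vecs, used = await asyncio.wait_for(
                asyncio.to_thread(fake_encode, batch), timeout=timeout_s
            )
        vectors.extend(batch_vecs)
        total_tokens += used
    return vectors, total_tokens


async def main():
    limiter = asyncio.Semaphore(2)  # created inside the running event loop
    return await embed_all(["one two", "three", "four five six"], limiter)


vectors, total_tokens = asyncio.run(main())
```

Holding the limiter only around the model call, rather than around the whole loop, lets other tasks take a slot between batches.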
Interaction with Other System Components
Upstream Inputs:
Receives data structured by the TokenizerFromUpstream schema, which likely comes from prior processing stages.
Accepts diverse formats including raw text, markdown, HTML, JSON chunks.
Database Services:
Interacts with KnowledgebaseService and TenantService to fetch embedding model identifiers based on context.
Uses LLMBundle to instantiate the embedding model for the tenant or knowledge base.
Tokenization Utility:
Utilizes rag_tokenizer for tokenization operations.
Settings and Constants:
Reads EMBEDDING_BATCH_SIZE from global settings to control batch size.
Concurrency and Task Execution:
Uses trio for asynchronous execution.
Applies embed_limiter to throttle embedding calls.
Uses the timeout decorator to guard embedding calls.
Downstream Outputs:
Produces tokenized chunks enriched with embeddings.
Outputs token consumption metrics for monitoring.
Provides progress callbacks for UI or logging.
Visual Diagram
The following class diagram summarizes the key classes and their primary methods in this file.
classDiagram
class TokenizerParam {
+search_method: list[str]
+filename_embd_weight: float
+check()
+get_input_form() dict
}
class Tokenizer {
+component_name: str
+_embedding(name: str, chunks: list[dict]) async tuple[list[dict], int]
+_invoke(**kwargs) async None
}
Tokenizer ..> TokenizerParam : uses parameters
Tokenizer --|> ProcessBase
TokenizerParam --|> ProcessParamBase
Summary
The tokenizer.py file implements the tokenizer and embedder component of the InfiniFlow system. It supports both full-text tokenization and vector embeddings, so downstream search can use either or both. The design balances efficiency (batching, concurrency control), extensibility (configurable parameters, multiple input formats), and reliability (error handling, timeouts). As a preprocessing stage, it transforms raw or partially processed text into token and embedding representations for downstream NLP workflows.