tokenizer.py

Overview

The tokenizer.py file is a core component of the InfiniFlow system responsible for processing textual data by tokenizing and optionally embedding it. It serves as a preprocessing step that converts raw text or structured chunks into tokenized formats suitable for downstream natural language processing (NLP) tasks, including search and retrieval.

This file primarily defines two classes:

TokenizerParam: Configuration parameters governing the tokenization and embedding behavior.
Tokenizer: Implements the main logic for tokenizing textual inputs and generating embeddings with support for asynchronous execution and batch processing.

The file integrates with various services and utilities to retrieve embedding models, manage tenant or knowledge base contexts, and enforce API timeouts and concurrency limits. It supports multiple search methods including full-text tokenization and vector embeddings, enabling flexible search capabilities within the InfiniFlow ecosystem.

Classes and Methods

`TokenizerParam(ProcessParamBase)`

Configuration class for the Tokenizer component.

Description

Extends ProcessParamBase.
Defines parameters controlling the tokenization and embedding strategies.
Provides validation for parameter values.
Supplies an input form schema for UI or API consumption.

Properties

Property	Type	Default Value	Description
`search_method`	`list[str]`	`["full_text", "embedding"]`	Specifies which tokenization/search methods to apply. Valid values are `"full_text"` and `"embedding"`.
`filename_embd_weight`	`float`	`0.1`	Weight factor used to blend filename embeddings with content embeddings.

Methods

check() -> None
Validates each entry in search_method against the allowed values ("full_text" and "embedding"). Raises an error if any entry is invalid.
get_input_form() -> dict[str, dict]
Returns an empty dictionary (placeholder for future input form schema).

Usage Example

param = TokenizerParam()
param.search_method = ["embedding"]
param.filename_embd_weight = 0.2
param.check()  # Raises error if invalid values are set
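
The validation behavior described above can be sketched with a small stand-in class. This is an illustration only: `TokenizerParamSketch` and `VALID_SEARCH_METHODS` are hypothetical names, the real class extends ProcessParamBase, and the exception type it raises is not stated in this document, so `ValueError` is an assumption.

```python
# Hypothetical stand-in for TokenizerParam; not the real class.
VALID_SEARCH_METHODS = {"full_text", "embedding"}


class TokenizerParamSketch:
    def __init__(self):
        # Defaults taken from the Properties table above.
        self.search_method = ["full_text", "embedding"]
        self.filename_embd_weight = 0.1

    def check(self) -> None:
        # Reject any search method outside the documented set.
        # Assumption: ValueError stands in for the real error type.
        for method in self.search_method:
            if method not in VALID_SEARCH_METHODS:
                raise ValueError(f"Invalid search method: {method!r}")

    def get_input_form(self) -> dict[str, dict]:
        # Placeholder, mirroring the documented empty schema.
        return {}
```

Calling `check()` on the defaults passes silently, while a value such as `"bm25"` in `search_method` raises.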

`Tokenizer(ProcessBase)`

Main processing class responsible for tokenizing text chunks and generating embeddings.

Class Attributes

Attribute	Type	Description
`component_name`	`str`	Identifies the component as `"Tokenizer"`.

Methods

`async _embedding(self, name: str, chunks: list[dict]) -> tuple[list[dict], int]`

Generates embeddings for chunks of text, optionally blending with filename embeddings.

Parameters

Parameter	Type	Description
`name`	`str`	The name associated with the text (e.g., filename or document title). Used to generate a title embedding vector.
`chunks`	`list[dict]`	List of text chunks, each chunk being a dictionary containing textual data and optionally questions.

Returns

tuple:
- list[dict]: The input chunks enriched with embedding vectors under keys like "q_<dim>_vec".
- int: Total token count consumed during embedding.

Description

Determines the embedding model to use based on the knowledge base or tenant context.
Prepares text inputs by cleaning HTML tags and concatenating questions if available.
Computes a title embedding vector for the name.
Encodes chunk texts in batches asynchronously with concurrency limits and timeouts.
Combines title embeddings and chunk embeddings weighted by filename_embd_weight.
Adds computed embedding vectors into each chunk's dictionary.
Returns the modified chunks and total tokens used.
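
The blending and key-naming steps above can be illustrated in plain Python. This is a minimal sketch under stated assumptions: the document only says the vectors are "weighted by filename_embd_weight", so the convex combination `w * title + (1 - w) * chunk` is an assumption, and `blend_embeddings` and `attach_vectors` are hypothetical helper names.

```python
def blend_embeddings(title_vec, chunk_vecs, title_weight=0.1):
    """Blend one title vector into every chunk vector.

    Assumption: blend = title_weight * title + (1 - title_weight) * chunk.
    """
    blended = []
    for vec in chunk_vecs:
        if len(vec) != len(title_vec):
            raise ValueError("title and chunk vectors must share a dimension")
        blended.append([
            title_weight * t + (1.0 - title_weight) * c
            for t, c in zip(title_vec, vec)
        ])
    return blended


def attach_vectors(chunks, vectors):
    """Store each vector under a key named after its dimension, e.g. q_768_vec."""
    if len(chunks) != len(vectors):
        raise ValueError("expected one vector per chunk")
    for chunk, vec in zip(chunks, vectors):
        chunk[f"q_{len(vec)}_vec"] = list(vec)
    return chunks


# Toy 4-dimensional vectors; real embeddings are much longer.
chunks = [{"text": "first chunk"}, {"text": "second chunk"}]
title_vec = [1.0, 0.0, 0.0, 0.0]
chunk_vecs = [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
enriched = attach_vectors(chunks, blend_embeddings(title_vec, chunk_vecs, 0.25))
```

With a 4-dimensional toy vector the key becomes `q_4_vec`; a 768-dimensional model would yield `q_768_vec`.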

Important Details

Uses embed_limiter to throttle concurrent embedding requests.
Employs trio.to_thread.run_sync to run blocking embedding calls in a worker thread so that they do not stall the event loop.
Uses regex substitution to remove certain HTML tags from text before embedding.
Input texts are truncated to the model's maximum length minus a small margin before they are embedded.
Embedding vectors have dynamic dimension keys based on vector length (e.g., q_768_vec).
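
The text preparation mentioned above (tag removal followed by truncation) might look like the sketch below. Several details are assumptions, because this document does not give them: the tag set in `TAG_PATTERN`, the margin of 10, and the use of whitespace-separated words as a stand-in for model tokens. `prepare_text` is a hypothetical helper name.

```python
import re

# Assumption: the exact tag set is not given in this document; table-related
# tags are used here purely as an illustration.
TAG_PATTERN = re.compile(r"</?(table|tr|td|th|caption)( [^<>]*)?>")


def prepare_text(text: str, max_length: int, margin: int = 10) -> str:
    """Strip selected HTML tags, then cut the text to max_length - margin.

    Assumption: whitespace-separated words stand in for model tokens.
    """
    cleaned = TAG_PATTERN.sub(" ", text)
    words = cleaned.split()
    return " ".join(words[: max(max_length - margin, 0)])


sample = "<table><tr><td>alpha</td><td class='x'>beta</td></tr></table> gamma"
short = prepare_text(sample, max_length=12)   # budget of 2 words
full = prepare_text(sample, max_length=100)   # budget large enough for all
```

Keeping a margin below the model limit leaves headroom for any tokens the model adds on its own.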

Usage Example

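# Assumes `tokenizer` is an initialized Tokenizer instance and that this runs inside an async function.
chunks = [{"text": "Example text chunk", "questions": ["What is this?"]}]
chunks_with_embd, token_count = await tokenizer._embedding("document_name", chunks)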

`async _invoke(self, **kwargs) -> None`

Primary entry point to process incoming data: performs tokenization and embedding as configured.

Parameters

Parameter	Type	Description
`kwargs`	`dict`	Arbitrary keyword arguments containing input data from upstream components.

Returns

None

Description

Validates and parses input using TokenizerFromUpstream schema.
Determines which search methods (full_text, embedding) to apply based on parameters.
For full-text tokenization:
- Processes chunks if present, tokenizing questions, keywords, and main text.
- Otherwise, extracts text payloads from upstream results (markdown, text, html).
- Uses rag_tokenizer for tokenization and fine-grained tokenization.
Provides progress callbacks during tokenization.
For embedding:
- Warns if the upstream name is empty (which may reduce embedding accuracy).
- Calls _embedding to generate embeddings for processed chunks.
- Records token consumption.
Sets outputs for downstream use, including enriched chunks and embedding token usage.
Handles and reports input validation errors.
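
The full-text branch described above can be sketched as follows. The two tokenizer functions are simplified stand-ins for `rag_tokenizer.tokenize` and `rag_tokenizer.fine_grained_tokenize`, and the output field names (`question_tks`, `keyword_tks`, `content_tks`, `content_sm_tks`) are hypothetical, since this document does not list the real ones.

```python
def tokenize(text: str) -> str:
    # Stand-in for rag_tokenizer.tokenize: lowercase, whitespace-split tokens.
    return " ".join(text.lower().split())


def fine_grained_tokenize(tokens: str) -> str:
    # Stand-in for rag_tokenizer.fine_grained_tokenize: also split on hyphens.
    return " ".join(tokens.replace("-", " ").split())


def tokenize_chunks(chunks, search_method):
    """Add token fields to each chunk when "full_text" is enabled."""
    if "full_text" not in search_method:
        return chunks  # embedding-only configuration: nothing to tokenize
    for chunk in chunks:
        if chunk.get("questions"):
            chunk["question_tks"] = tokenize(" ".join(chunk["questions"]))
        if chunk.get("keywords"):
            chunk["keyword_tks"] = tokenize(" ".join(chunk["keywords"]))
        # Main text gets both normal and fine-grained tokens.
        chunk["content_tks"] = tokenize(chunk["text"])
        chunk["content_sm_tks"] = fine_grained_tokenize(chunk["content_tks"])
    return chunks


out = tokenize_chunks(
    [{"text": "Full-Text Search", "questions": ["What is this?"]}],
    ["full_text"],
)
```

Storing both granularities on the chunk lets a search backend match either whole terms or their sub-parts.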

Important Details

Reports progress through callbacks that carry a progress value and a short status message; some of the reported progress values are randomized.
Supports multiple input formats (chunks, markdown_result, text_result, html_result, json_result).
Integrates tightly with rag_tokenizer for multi-level tokenization.
Embedding and tokenization steps can be selectively enabled via search_method.
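
When no chunks are supplied, the text payload has to be taken from one of the upstream result fields. The sketch below shows one way such a lookup could work. It assumes that `output_format` names which `*_result` field is populated, which this document does not confirm, and `extract_payload` is a hypothetical helper operating on a plain dict instead of the TokenizerFromUpstream model.

```python
def extract_payload(upstream: dict) -> str:
    """Return the text payload matching the upstream output format.

    Assumption: `output_format` names which *_result field is populated.
    """
    field_by_format = {
        "markdown": "markdown_result",
        "text": "text_result",
        "html": "html_result",
    }
    field = field_by_format.get(upstream.get("output_format", ""))
    if field is None:
        raise ValueError("no text payload for this output format")
    # A missing or None payload is treated as empty text.
    return upstream.get(field) or ""


payload = extract_payload(
    {"output_format": "markdown", "markdown_result": "# Title"}
)
```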

Usage Example

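# Assumes `tokenizer` is an initialized Tokenizer instance and that this runs inside an async function.
await tokenizer._invoke(
    chunks=[{"text": "Sample text", "questions": ["What is this?"]}],
    name="doc1",
    output_format="json"
)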

Implementation Details and Algorithms

Tokenization: Utilizes rag_tokenizer to produce token lists from text. Supports:
- Normal tokenization (tokenize).
- Fine-grained tokenization (fine_grained_tokenize).
- Tokenization applied to different text fields: questions, keywords, main text.
Embedding:
- Embeddings are created using models obtained from LLMBundle which abstracts the underlying embedding model per tenant or knowledge base.
- Batching of embedding requests is done to improve efficiency, with batch size controlled by EMBEDDING_BATCH_SIZE.
- Uses a timeout decorator (timeout(60)) to ensure the embedding operation does not hang indefinitely.
- Embeddings for the name (e.g., filename) and content are combined with a weighted sum, where the filename embedding weight is configurable.
Concurrency and Rate Limiting:
- Embedding inference is performed asynchronously with Trio concurrency primitives.
- embed_limiter caps the number of concurrent embedding calls to avoid overloading backend services.
Error Handling:
- Input validation errors are caught and reported to outputs.
- Empty names for embeddings trigger warnings to aid debugging.
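
The batching, concurrency limit, and timeout described above can be sketched with the standard library. Note the substitutions: `asyncio.to_thread`, `asyncio.Semaphore`, and `asyncio.wait_for` stand in for `trio.to_thread.run_sync`, `embed_limiter`, and the `timeout(60)` decorator, so this is not the component's actual code. `fake_encode`, `embed_in_batches`, and the batch size of 2 are hypothetical.

```python
import asyncio

EMBEDDING_BATCH_SIZE = 2  # hypothetical value; the real one comes from settings


def fake_encode(texts):
    # Stand-in for a blocking embedding call: returns (vectors, token_count).
    vectors = [[float(len(t)), 1.0] for t in texts]
    return vectors, sum(len(t.split()) for t in texts)


async def embed_in_batches(texts, limiter, timeout_s=60):
    vectors, token_count = [], 0
    for start in range(0, len(texts), EMBEDDING_BATCH_SIZE):
        batch = texts[start:start + EMBEDDING_BATCH_SIZE]
        async with limiter:  # cap concurrent embedding calls
            # Run the blocking call in a worker thread. The timeout stops the
            # wait, but it cannot interrupt the thread itself.
            vecs, used = await asyncio.wait_for(
                asyncio.to_thread(fake_encode, batch), timeout=timeout_s
            )
        vectors.extend(vecs)
        token_count += used
    return vectors, token_count


async def main():
    limiter = asyncio.Semaphore(2)
    return await embed_in_batches(["a b", "c d e", "f"], limiter)


vectors, token_count = asyncio.run(main())
```

Three texts with a batch size of 2 produce two encode calls, and the token counts of both calls are summed into one total.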

Interaction with Other System Components

Upstream Inputs:
- Receives data validated against the TokenizerFromUpstream schema and supplied by upstream components.
- Accepts diverse formats including raw text, markdown, HTML, JSON chunks.
Database Services:
- Interacts with KnowledgebaseService and TenantService to fetch embedding model identifiers based on context.
- Uses LLMBundle to instantiate the embedding model for the tenant or knowledge base.
Tokenization Utility:
- Utilizes rag_tokenizer for tokenization operations.
Settings and Constants:
- Reads EMBEDDING_BATCH_SIZE from global settings to control batch size.
Concurrency and Task Execution:
- Uses trio for asynchronous execution.
- Applies embed_limiter to throttle embedding calls.
- Uses timeout decorator to guard embedding calls.
Downstream Outputs:
- Produces tokenized chunks enriched with embeddings.
- Outputs token consumption metrics for monitoring.
- Provides progress callbacks for UI or logging.

Visual Diagram

The following class diagram summarizes the key classes and their primary methods in this file.

classDiagram
    class TokenizerParam {
        +search_method: list[str]
        +filename_embd_weight: float
        +check()
        +get_input_form() dict
    }

    class Tokenizer {
        +component_name: str
        +_embedding(name: str, chunks: list[dict]) async tuple[list[dict], int]
        +_invoke(**kwargs) async None
    }

    Tokenizer ..> TokenizerParam : uses parameters
    Tokenizer --|> ProcessBase
    TokenizerParam --|> ProcessParamBase

Summary

The tokenizer.py file implements the tokenizer and embedder component of the InfiniFlow system. By supporting both full-text tokenization and vector embeddings, it serves both full-text and vector-based search. The design balances efficiency (batching, concurrency control), extensibility (configurable parameters, multiple input formats), and reliability (error handling, timeouts). As a preprocessing stage, it transforms raw or partially processed text into token and embedding representations for downstream NLP workflows.