tokenizer.py


Overview

The tokenizer.py file is a core component of the InfiniFlow system responsible for processing textual data by tokenizing and optionally embedding it. It serves as a preprocessing step that converts raw text or structured chunks into tokenized formats suitable for downstream natural language processing (NLP) tasks, including search and retrieval.

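This file primarily defines two classes: TokenizerParam, the configuration class for the component, and Tokenizer, the processing class that tokenizes text chunks and generates embeddings.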

The file integrates with other services and utilities to retrieve embedding models, resolve tenant or knowledge base contexts, and enforce API timeouts and concurrency limits. It supports two search methods, full-text tokenization and vector embeddings, which can be enabled separately or together.


Classes and Methods

TokenizerParam(ProcessParamBase)

Configuration class for the Tokenizer component.

Description

Properties

| Property | Type | Default Value | Description |
|---|---|---|---|
| search_method | list[str] | ["full_text", "embedding"] | Specifies which tokenization/search methods to apply. Valid values are "full_text" and "embedding". |
| filename_embd_weight | float | 0.1 | Weight factor used to blend filename embeddings with content embeddings. |

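Methods

check(): validates the configured values and raises an error if any are invalid.

get_input_form() -> dict: returns the component's input form as a dictionary.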

Usage Example

param = TokenizerParam()
param.search_method = ["embedding"]
param.filename_embd_weight = 0.2
param.check()  # Raises error if invalid values are set
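
To illustrate the kind of validation check() is described as performing, here is a minimal sketch. TokenizerParamSketch is a hypothetical stand-in, not the real TokenizerParam, and the [0, 1] range for filename_embd_weight is an assumption; only the two valid search_method values come from this document.

```python
from dataclasses import dataclass, field

# The two values documented as valid for search_method.
VALID_SEARCH_METHODS = {"full_text", "embedding"}


@dataclass
class TokenizerParamSketch:
    """Hypothetical stand-in for TokenizerParam; not the real class."""

    search_method: list[str] = field(
        default_factory=lambda: ["full_text", "embedding"]
    )
    filename_embd_weight: float = 0.1

    def check(self) -> None:
        # Reject any search method outside the documented set.
        unknown = [m for m in self.search_method if m not in VALID_SEARCH_METHODS]
        if unknown:
            raise ValueError(f"Unsupported search_method values: {unknown}")
        # Assumption: a blend weight is only meaningful within [0, 1].
        if not 0.0 <= self.filename_embd_weight <= 1.0:
            raise ValueError("filename_embd_weight must be between 0 and 1")
```

The real class may accept a different weight range or raise a different exception type.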

Tokenizer(ProcessBase)

Main processing class responsible for tokenizing text chunks and generating embeddings.

Class Attributes

| Attribute | Type | Description |
|---|---|---|
| component_name | str | Identifies the component as "Tokenizer". |

Methods


async _embedding(self, name: str, chunks: list[dict]) -> tuple[list[dict], int]

Generates embeddings for chunks of text, optionally blending with filename embeddings.

Parameters

| Parameter | Type | Description |
|---|---|---|
| name | str | The name associated with the text (e.g., filename or document title). Used to generate a title embedding vector. |
| chunks | list[dict] | List of text chunks, each chunk being a dictionary containing textual data and optionally questions. |

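Returns

tuple[list[dict], int]: the chunks with embedding vectors added, and the total number of tokens used.

Description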
  1. Determines the embedding model to use based on the knowledge base or tenant context.

  2. Prepares text inputs by cleaning HTML tags and concatenating questions if available.

  3. Computes a title embedding vector for the name.

  4. Encodes chunk texts in batches asynchronously with concurrency limits and timeouts.

  5. Combines title embeddings and chunk embeddings weighted by filename_embd_weight.

  6. Adds computed embedding vectors into each chunk's dictionary.

  7. Returns the modified chunks and total tokens used.
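
Steps 2 and 5 above can be sketched roughly as follows. This is an illustration under stated assumptions, not the component's actual code: prepare_text and blend are hypothetical helper names, the tag-stripping regex is deliberately simplistic, appending the questions after the cleaned text is one possible reading of "concatenating questions", and the convex combination w * title + (1 - w) * chunk is an assumed form of the weighting.

```python
import re


def prepare_text(chunk: dict) -> str:
    """Build the string to embed for one chunk (hypothetical helper)."""
    # Replace anything that looks like an HTML tag with a space,
    # then collapse the resulting runs of whitespace.
    cleaned = re.sub(r"<[^>]+>", " ", chunk.get("text", ""))
    cleaned = " ".join(cleaned.split())
    # Assumption: questions, when present, are appended after the text.
    questions = chunk.get("questions") or []
    if questions:
        return cleaned + "\n" + "\n".join(questions)
    return cleaned


def blend(title_vec: list[float], chunk_vec: list[float], weight: float) -> list[float]:
    """Assumed blend: weight * title + (1 - weight) * chunk, per dimension."""
    return [weight * t + (1.0 - weight) * c for t, c in zip(title_vec, chunk_vec)]
```

With the default filename_embd_weight of 0.1, this form of the blend would leave 90% of each vector determined by the chunk content.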

Important Details
Usage Example
# Assumes `tokenizer` is an initialized Tokenizer instance and that this runs inside a coroutine.
chunks = [{"text": "Example text chunk", "questions": ["What is this?"]}]
chunks_with_embd, token_count = await tokenizer._embedding("document_name", chunks)
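
The batched, concurrency-limited encoding described in step 4 could be structured along these lines. This is a sketch only: encode_in_batches, encode_batch, and the default batch size, concurrency limit, and timeout are all hypothetical, and encode_batch is assumed to be a coroutine that returns a (vectors, token_count) pair for one batch.

```python
import asyncio


async def encode_in_batches(texts, encode_batch, batch_size=16,
                            max_concurrency=4, timeout_s=60.0):
    """Encode texts batch by batch with a concurrency cap and per-batch timeout."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(batch):
        # At most max_concurrency batches are in flight at once.
        async with semaphore:
            # Raises asyncio.TimeoutError if a batch takes too long.
            return await asyncio.wait_for(encode_batch(batch), timeout=timeout_s)

    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    # gather() returns results in the order the batches were submitted.
    results = await asyncio.gather(*(run_one(b) for b in batches))

    vectors, total_tokens = [], 0
    for batch_vectors, batch_tokens in results:
        vectors.extend(batch_vectors)
        total_tokens += batch_tokens
    return vectors, total_tokens
```

Because gather() preserves submission order, the returned vectors line up with the input texts, which is what allows each vector to be written back into its chunk.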

async _invoke(self, **kwargs) -> None

Primary entry point to process incoming data: performs tokenization and embedding as configured.

Parameters

| Parameter | Type | Description |
|---|---|---|
| kwargs | dict | Arbitrary keyword arguments containing input data from upstream components. |

Returns

None. Results are set as component outputs for downstream use rather than returned.

Description
  1. Validates and parses input using TokenizerFromUpstream schema.

  2. Determines which search methods (full_text, embedding) to apply based on parameters.

  3. For full-text tokenization:

    • Processes chunks if present, tokenizing questions, keywords, and main text.

    • Otherwise, extracts text payloads from upstream results (markdown, text, html).

    • Uses rag_tokenizer for tokenization and fine-grained tokenization.

  4. Provides progress callbacks during tokenization.

  5. For embedding:

    • Warns if the upstream name is empty (which may reduce embedding accuracy).

    • Calls _embedding to generate embeddings for processed chunks.

    • Records token consumption.

  6. Sets outputs for downstream use, including enriched chunks and embedding token usage.

  7. Handles and reports input validation errors.
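
The full-text branch in steps 3 and 4 can be pictured with the sketch below. Everything in it is illustrative: simple_tokenize is a hypothetical whitespace tokenizer standing in for rag_tokenizer, the output field names (question_tokens, keyword_tokens, content_tokens) are invented for this example, and the markdown, text, html lookup order is an assumption.

```python
from typing import Callable, Optional


def simple_tokenize(text: str) -> str:
    """Hypothetical stand-in for rag_tokenizer: lowercase, split on whitespace."""
    return " ".join(text.lower().split())


def pick_payload(upstream: dict) -> str:
    """Return the first non-empty payload among markdown, text, and html."""
    for key in ("markdown", "text", "html"):  # assumed priority order
        if upstream.get(key):
            return upstream[key]
    return ""


def tokenize_chunks(
    chunks: list[dict],
    callback: Optional[Callable[[float, str], None]] = None,
) -> list[dict]:
    """Add token fields to each chunk and report progress after each one."""
    total = len(chunks)
    for i, chunk in enumerate(chunks, start=1):
        if chunk.get("questions"):
            chunk["question_tokens"] = simple_tokenize("\n".join(chunk["questions"]))
        if chunk.get("keywords"):
            chunk["keyword_tokens"] = simple_tokenize(" ".join(chunk["keywords"]))
        chunk["content_tokens"] = simple_tokenize(chunk.get("text", ""))
        if callback:
            # Progress is reported as a fraction in (0, 1].
            callback(i / total, f"Tokenized {i}/{total} chunks")
    return chunks
```

When no chunks are supplied, a caller following this sketch would wrap the result of pick_payload in a single chunk before tokenizing it.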

Important Details
Usage Example
# Assumes `tokenizer` is an initialized Tokenizer instance and that this runs inside a coroutine.
await tokenizer._invoke(
    chunks=[{"text": "Sample text", "questions": ["What is this?"]}],
    name="doc1",
    output_format="json"
)

Implementation Details and Algorithms


Interaction with Other System Components


Visual Diagram

The following class diagram summarizes the key classes and their primary methods in this file.

classDiagram
    class TokenizerParam {
        +search_method: list[str]
        +filename_embd_weight: float
        +check()
        +get_input_form() dict
    }

    class Tokenizer {
        +component_name: str
        +_embedding(name: str, chunks: list[dict]) async tuple[list[dict], int]
        +_invoke(**kwargs) async None
    }

    Tokenizer ..> TokenizerParam : uses parameters
    Tokenizer --|> ProcessBase
    TokenizerParam --|> ProcessParamBase

Summary

The tokenizer.py file implements the tokenizer and embedder component of the InfiniFlow system. It supports both full-text tokenization and vector embeddings, which enables keyword-based as well as semantic search. The design addresses efficiency (batching, concurrency control), extensibility (configurable parameters, multiple input formats), and reliability (error handling, timeouts). The component acts as a preprocessing stage that transforms raw or partially processed text into token and embedding representations for downstream NLP workflows.