schema.py


Overview

The schema.py file defines TokenizerFromUpstream, a data model for the output of an upstream tokenization step in the InfiniFlow system. The class deserializes and validates tokenization results received from an external source or service and exposes them as a typed Python object. It supports four output formats (JSON, Markdown, plain text, and HTML) and checks that the payload matching the selected format is present.

The model is built on the pydantic library, which validates and serializes data based on Python type annotations, so tokenization data is checked against the expected schema before it reaches downstream processing.


Classes

TokenizerFromUpstream

A Pydantic BaseModel subclass that models the response from an upstream tokenizer, encapsulating metadata, output format, and the tokenized content in various possible formats.

Attributes

| Attribute | Type | Description |
| --- | --- | --- |
| created_time | float or None (alias: _created_time) | Timestamp indicating when the tokenization was created, optional. |
| elapsed_time | float or None (alias: _elapsed_time) | Time elapsed during the tokenization process, optional. |
| name | str (default: empty string) | Name identifier for the tokenization output. |
| blob | bytes | Raw binary blob representing the tokenized data. |
| output_format | Literal["json", "markdown", "text", "html"] or None | Format type of the tokenization output. Determines which payload field(s) will be populated. |
| chunks | list[dict[str, Any]] or None | Optional list of token chunks as dictionaries. If present, used directly. |
| json_result | list[dict[str, Any]] or None (alias: json) | Tokenization result in JSON format, if applicable. |
| markdown_result | str or None (alias: markdown) | Tokenization result as a Markdown string, if applicable. |
| text_result | str or None (alias: text) | Tokenization result as plain text, if applicable. |
| html_result | list[str] or None (alias: html) | Tokenization result as a list of HTML strings, if applicable. |
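
The attributes listed above could be declared roughly as follows. This is a minimal sketch that assumes Pydantic v2 and mirrors only the documented types and aliases; the class name TokenizerFromUpstreamSketch is hypothetical, and the defaults for the optional fields are assumed to be None rather than taken from the actual schema.py source.

```python
from typing import Any, Literal, Optional

from pydantic import BaseModel, Field


class TokenizerFromUpstreamSketch(BaseModel):
    """Illustrative stand-in for TokenizerFromUpstream (not the real source)."""

    # Field names cannot start with an underscore in Pydantic, so the
    # upstream keys "_created_time" / "_elapsed_time" are mapped via aliases.
    created_time: Optional[float] = Field(default=None, alias="_created_time")
    elapsed_time: Optional[float] = Field(default=None, alias="_elapsed_time")

    name: str = ""
    blob: bytes  # documented without a default, so treated as required here

    output_format: Optional[Literal["json", "markdown", "text", "html"]] = None
    chunks: Optional[list[dict[str, Any]]] = None

    # One payload field per output format, each exposed under a short alias.
    json_result: Optional[list[dict[str, Any]]] = Field(default=None, alias="json")
    markdown_result: Optional[str] = Field(default=None, alias="markdown")
    text_result: Optional[str] = Field(default=None, alias="text")
    html_result: Optional[list[str]] = Field(default=None, alias="html")


# Construct from alias-keyed data, as an upstream payload would provide it.
example = TokenizerFromUpstreamSketch(
    **{"_created_time": 1.5, "blob": b"x", "output_format": "text", "text": "hi"}
)
```

Input is keyed by alias ("_created_time", "text"), while the parsed values are read through the Python attribute names (created_time, text_result).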

Configuration
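
The summary of this file states that extra fields are forbidden and that aliases are supported. A sketch of a model configuration with that behavior, assuming Pydantic v2, is shown below. The class ConfigSketch is hypothetical, and populate_by_name=True is an assumption about how field names and aliases are both accepted; neither is copied from schema.py.

```python
from typing import Optional

from pydantic import BaseModel, ConfigDict, Field, ValidationError


class ConfigSketch(BaseModel):
    # Assumed settings: reject unknown keys, accept field name or alias.
    model_config = ConfigDict(extra="forbid", populate_by_name=True)

    markdown_result: Optional[str] = Field(default=None, alias="markdown")


# The same field can be populated through its alias or its Python name.
by_alias = ConfigSketch(markdown="# Title")
by_name = ConfigSketch(markdown_result="# Title")
same_value = by_alias.markdown_result == by_name.markdown_result

# An unknown key is rejected instead of being silently dropped.
try:
    ConfigSketch(markdown="# Title", unexpected="x")
    rejected_extra = False
except ValidationError:
    rejected_extra = True
```

Forbidding extra fields means that a typo in an upstream key surfaces as a validation error rather than as a silently missing payload.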

Methods

_check_payloads(self) -> TokenizerFromUpstream
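
A check of this kind might be written as a Pydantic v2 model validator that runs after field validation. The sketch below is an assumption based on the attribute descriptions (chunks are used directly when present; otherwise the payload matching output_format must be non-empty). The class PayloadCheckSketch is hypothetical, the error message is illustrative, and the case where output_format is None is deliberately left unchecked because the source does not describe it.

```python
from typing import Any, Literal, Optional

from pydantic import BaseModel, Field, ValidationError, model_validator


class PayloadCheckSketch(BaseModel):
    output_format: Optional[Literal["json", "markdown", "text", "html"]] = None
    chunks: Optional[list[dict[str, Any]]] = None
    json_result: Optional[list[dict[str, Any]]] = Field(default=None, alias="json")
    markdown_result: Optional[str] = Field(default=None, alias="markdown")
    text_result: Optional[str] = Field(default=None, alias="text")
    html_result: Optional[list[str]] = Field(default=None, alias="html")

    @model_validator(mode="after")
    def _check_payloads(self) -> "PayloadCheckSketch":
        # Assumed rule: non-empty chunks are used directly, so no payload is needed.
        if self.chunks:
            return self
        # Assumed rule: the payload matching output_format must be non-empty.
        payload_by_format = {
            "json": self.json_result,
            "markdown": self.markdown_result,
            "text": self.text_result,
            "html": self.html_result,
        }
        fmt = self.output_format
        if fmt is not None and not payload_by_format[fmt]:
            raise ValueError(f"output_format={fmt} requires a matching payload")
        return self


ok = PayloadCheckSketch(output_format="text", text="hello")
with_chunks = PayloadCheckSketch(output_format="html", chunks=[{"text": "a"}])

try:
    PayloadCheckSketch(output_format="markdown")  # no markdown payload supplied
    error_message = None
except ValidationError as exc:
    error_message = str(exc)
```

Raising ValueError inside the validator is enough: Pydantic wraps it in a ValidationError, so callers only need to handle one exception type.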

Usage Example

from schema import TokenizerFromUpstream

# Example payload from an upstream tokenizer (aliased fields use their aliases)
payload = {
    "_created_time": 1685000000.0,
    "_elapsed_time": 0.123,
    "name": "example_tokenizer",
    "blob": b"\x00\x01\x02",
    "output_format": "markdown",
    "markdown": "# Tokenized Output\n\nThis is a test.",
}

tokenizer_output = TokenizerFromUpstream(**payload)

print(tokenizer_output.markdown_result)
# Output:
# # Tokenized Output
#
# This is a test.
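
Because the model is also described as a way to deserialize upstream data, a round trip can help show how aliases behave in both directions. The following is a sketch under the assumption of Pydantic v2; RoundTripSketch is a hypothetical subset of the documented fields, not the real class.

```python
from typing import Literal, Optional

from pydantic import BaseModel, Field


class RoundTripSketch(BaseModel):
    """Hypothetical subset of the documented fields, for illustration only."""

    created_time: Optional[float] = Field(default=None, alias="_created_time")
    output_format: Optional[Literal["json", "markdown", "text", "html"]] = None
    markdown_result: Optional[str] = Field(default=None, alias="markdown")
    text_result: Optional[str] = Field(default=None, alias="text")


incoming = {
    "_created_time": 1685000000.0,
    "output_format": "markdown",
    "markdown": "# Title",
}

# model_validate parses and validates a plain dict keyed by aliases.
parsed = RoundTripSketch.model_validate(incoming)

# by_alias=True restores the upstream key names; exclude_none drops the
# payload fields that were never set (text_result in this case).
outgoing = parsed.model_dump(by_alias=True, exclude_none=True)
```

With these two options the dumped dictionary has the same keys and values as the incoming one, which is convenient when the object has to be passed on in the upstream wire format.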

Implementation Details


Interaction with Other System Components


Diagram: Class Structure of TokenizerFromUpstream

classDiagram
    class TokenizerFromUpstream {
        +float? created_time
        +float? elapsed_time
        +str name
        +bytes blob
        +Literal["json", "markdown", "text", "html"]? output_format
        +list~dict~? chunks
        +list~dict~? json_result
        +str? markdown_result
        +str? text_result
        +list~str~? html_result
        +_check_payloads() TokenizerFromUpstream
    }
    TokenizerFromUpstream --|> BaseModel

Summary

The schema.py file provides a Pydantic data model for tokenization outputs that arrive from upstream sources in one of several formats. Its validation rules keep the data consistent across output formats, which supports reliable integration within the InfiniFlow system. The design balances flexibility (multiple payload formats and field aliases) with safety (extra fields are forbidden and the required payload must be present).