schema.py

Overview

The schema.py file defines a data model class, TokenizerFromUpstream, that represents the output of an upstream tokenization step in the InfiniFlow system. The class deserializes and validates tokenization results received from an external source or service and exposes them as a consistent Python object. It accepts four output formats (JSON, Markdown, plain text, and HTML) and checks that a payload consistent with the selected format is present.

The file uses the pydantic library, which validates and serializes data based on Python type annotations, so tokenization data is checked against the expected schema before it is processed downstream.

Classes

TokenizerFromUpstream

A Pydantic BaseModel subclass that models the response from an upstream tokenizer, encapsulating metadata, output format, and the tokenized content in various possible formats.

Attributes

Attribute	Type	Description
`created_time`	`float` or None (alias: `_created_time`)	Timestamp indicating when the tokenization was created, optional.
`elapsed_time`	`float` or None (alias: `_elapsed_time`)	Time elapsed during the tokenization process, optional.
`name`	`str` (default: empty string)	Name identifier for the tokenization output.
`blob`	`bytes`	Raw binary blob representing the tokenized data.
`output_format`	`Literal["json", "markdown", "text", "html"]` or None	Format type of the tokenization output. Determines which payload field(s) will be populated.
`chunks`	`list[dict[str, Any]]` or None	Optional list of token chunks as dictionaries. If present, the format-specific payload checks are skipped.
`json_result`	`list[dict[str, Any]]` or None (alias: `json`)	Tokenization result in JSON format, if applicable.
`markdown_result`	`str` or None (alias: `markdown`)	Tokenization result as a Markdown string, if applicable.
`text_result`	`str` or None (alias: `text`)	Tokenization result as plain text, if applicable.
`html_result`	`list[str]` or None (alias: `html`)	Tokenization result as a list of HTML strings, if applicable.

Configuration

populate_by_name=True: Allows fields to be populated by their Python attribute names in addition to their aliases.
extra="forbid": Forbids extra fields not defined in the model to prevent unexpected data.
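The effect of these two settings can be illustrated with a small sketch. ConfigDemo and rejects_unknown_field are hypothetical names used only for this illustration, and the model carries just two of the fields from the table above; it assumes Pydantic v2.

```python
from typing import Optional

from pydantic import BaseModel, ConfigDict, Field, ValidationError


class ConfigDemo(BaseModel):
    """Hypothetical two-field model; not the real TokenizerFromUpstream."""

    model_config = ConfigDict(populate_by_name=True, extra="forbid")

    created_time: Optional[float] = Field(default=None, alias="_created_time")
    markdown_result: Optional[str] = Field(default=None, alias="markdown")


# Populating by alias, i.e. the key names used in the upstream payload.
by_alias = ConfigDemo(**{"_created_time": 1.0, "markdown": "# hi"})

# Populating by attribute name is also accepted because populate_by_name=True.
by_name = ConfigDemo(created_time=1.0, markdown_result="# hi")


def rejects_unknown_field() -> bool:
    """Return True if an undeclared key triggers a ValidationError."""
    try:
        ConfigDemo(**{"markdown": "# hi", "unexpected": 1})
    except ValidationError:
        return True
    return False
```

Both construction styles yield the same field values, while the undeclared key is rejected because of extra="forbid".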

Methods

`_check_payloads(self) -> TokenizerFromUpstream`

Purpose: Data validation hook that runs after the individual fields have been validated (mode="after"). It enforces consistency between the declared output_format and the presence of corresponding payload data.
Behavior:
- If chunks is present, no further validation is performed, and the model is returned as-is.
- If output_format is "markdown" or "text", ensures that the corresponding markdown_result or text_result is present; otherwise, raises a ValueError.
- For any other value (including "json" and None, and by this description also "html"), ensures that json_result is present; otherwise, raises a ValueError.
Returns: The validated TokenizerFromUpstream instance.
Raises: ValueError if the payload field required by output_format is missing. Pydantic surfaces this to the caller as a ValidationError.
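The behavior described above can be sketched as follows. This is a reconstruction from this documentation only, not the actual source: the class name TokenizerFromUpstreamSketch, the helper is_rejected, the error messages, and the use of truthiness checks for "present" are all assumptions. It assumes Pydantic v2.

```python
from typing import Any, Literal, Optional

from pydantic import BaseModel, ConfigDict, Field, ValidationError, model_validator


class TokenizerFromUpstreamSketch(BaseModel):
    """Reconstructed from the documentation; the real class may differ."""

    model_config = ConfigDict(populate_by_name=True, extra="forbid")

    created_time: Optional[float] = Field(default=None, alias="_created_time")
    elapsed_time: Optional[float] = Field(default=None, alias="_elapsed_time")
    name: str = ""
    blob: bytes
    output_format: Optional[Literal["json", "markdown", "text", "html"]] = None
    chunks: Optional[list[dict[str, Any]]] = None
    json_result: Optional[list[dict[str, Any]]] = Field(default=None, alias="json")
    markdown_result: Optional[str] = Field(default=None, alias="markdown")
    text_result: Optional[str] = Field(default=None, alias="text")
    html_result: Optional[list[str]] = Field(default=None, alias="html")

    @model_validator(mode="after")
    def _check_payloads(self) -> "TokenizerFromUpstreamSketch":
        # Assumption: "present" is treated as truthy (non-empty).
        if self.chunks:
            return self  # chunks take priority; no further checks
        if self.output_format == "markdown":
            if not self.markdown_result:
                raise ValueError("output_format='markdown' requires a markdown payload")
        elif self.output_format == "text":
            if not self.text_result:
                raise ValueError("output_format='text' requires a text payload")
        elif not self.json_result:
            # Every other value, as described above, falls back to the JSON payload.
            raise ValueError("a JSON payload is required when chunks is absent")
        return self


def is_rejected(**payload: Any) -> bool:
    """Hypothetical helper: True if the payload fails validation."""
    try:
        TokenizerFromUpstreamSketch(**payload)
    except ValidationError:
        return True
    return False


ok = TokenizerFromUpstreamSketch(blob=b"", output_format="markdown", markdown="# Title")
missing_markdown = is_rejected(blob=b"", output_format="markdown")
chunks_only = is_rejected(blob=b"", chunks=[{"text": "a"}])
```

In this sketch, missing_markdown is True because the declared format has no matching payload, and chunks_only is False because chunks short-circuits the remaining checks.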

Usage Example

from schema import TokenizerFromUpstream

# Example payload from the upstream tokenizer, as a Python dict (blob is bytes, so this is not raw JSON)
payload = {
    "_created_time": 1685000000.0,
    "_elapsed_time": 0.123,
    "name": "example_tokenizer",
    "blob": b"\x00\x01\x02",
    "output_format": "markdown",
    "markdown": "# Tokenized Output\n\nThis is a test.",
}

tokenizer_output = TokenizerFromUpstream(**payload)

print(tokenizer_output.markdown_result)
# Output:
# # Tokenized Output
#
# This is a test.

Implementation Details

Data Validation: The class uses Pydantic's @model_validator decorator with mode="after" to perform complex cross-field validation after the model's initial parsing. This ensures that the payload matches the expected format constraints before the object is used downstream.
Alias Usage: Field aliases (e.g., _created_time for created_time) allow the model to map input data fields that have different key names from the Python attribute names, improving flexibility with upstream data formats.
Payload Handling Logic: The validation logic prioritizes chunks if present. If absent, it requires appropriate payload fields depending on the output_format. This design supports multiple output formats flexibly and ensures downstream consumers can reliably use the data.
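One likely reason the underscore-prefixed keys are mapped through aliases (an inference; the source does not state it) is that Pydantic v2 treats attribute names with a leading underscore as private attributes rather than fields. The sketch below contrasts the two approaches; WithUnderscoreName and WithAlias are hypothetical classes used only for illustration.

```python
from typing import Optional

from pydantic import BaseModel, Field


class WithUnderscoreName(BaseModel):
    # Leading underscore: Pydantic v2 makes this a private attribute,
    # so it is not validated or populated from input data.
    _created_time: Optional[float] = None


class WithAlias(BaseModel):
    # Public attribute name, populated from the "_created_time" input key.
    created_time: Optional[float] = Field(default=None, alias="_created_time")


payload = {"_created_time": 1685000000.0}

# With the default extra="ignore", the unknown key is silently dropped.
ignored = WithUnderscoreName(**payload)

# The alias maps the upstream key onto the public attribute.
mapped = WithAlias(**payload)
```

Here ignored._created_time keeps its default of None, while mapped.created_time holds the timestamp from the payload.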

Interaction with Other System Components

This schema class is designed to interface with upstream tokenization services or modules that produce tokenized data in various formats.
Once instantiated and validated, objects of TokenizerFromUpstream are likely consumed by downstream components responsible for further processing, display rendering, or storage.
The use of blob suggests raw data preservation for low-level operations or debugging.
The strict validation helps maintain data integrity across module boundaries within the InfiniFlow system.

Diagram: Class Structure of `TokenizerFromUpstream`

classDiagram
    class TokenizerFromUpstream {
        +float? created_time
        +float? elapsed_time
        +str name
        +bytes blob
        +Literal["json", "markdown", "text", "html"]? output_format
        +list~dict~? chunks
        +list~dict~? json_result
        +str? markdown_result
        +str? text_result
        +list~str~? html_result
        +_check_payloads() TokenizerFromUpstream
    }
    TokenizerFromUpstream --|> BaseModel

Summary

The schema.py file provides a Pydantic data model for tokenization outputs received from upstream sources in several formats. Its validation rules check that the payload required by the declared output format is present, which supports reliable integration within the InfiniFlow system. The design balances flexibility (multiple payload formats and field aliases) with safety (forbidding extra fields and requiring a payload).