schema.py
Overview
The schema.py file defines a data model class TokenizerFromUpstream that represents the output of an upstream tokenization process in the InfiniFlow system. This class is primarily used to deserialize, validate, and structure tokenization results received from an external source or service into a consistent Python object. It supports multiple output formats such as JSON, Markdown, text, and HTML, and enforces payload validity based on the selected format.
The file builds on the pydantic library, which validates and serializes data using Python type annotations. As a result, tokenization data is checked against the expected schema before it is processed downstream.
Classes
TokenizerFromUpstream
A Pydantic BaseModel subclass that models the response from an upstream tokenizer, encapsulating metadata, output format, and the tokenized content in various possible formats.
Attributes
| Attribute | Type | Description |
|---|---|---|
| created_time | Optional[float] | Timestamp indicating when the tokenization was created, optional. |
| elapsed_time | Optional[float] | Time elapsed during the tokenization process, optional. |
| name | str | Name identifier for the tokenization output. |
| blob | bytes | Raw binary blob representing the tokenized data. |
| output_format | Optional[Literal["json", "markdown", "text", "html"]] | Format type of the tokenization output. Determines which payload field(s) will be populated. |
| chunks | Optional[list[dict]] | Optional list of token chunks as dictionaries. If present, used directly. |
| json_result | Optional[list[dict]] | Tokenization result in JSON format, if applicable. |
| markdown_result | Optional[str] | Tokenization result as a Markdown string, if applicable. |
| text_result | Optional[str] | Tokenization result as plain text, if applicable. |
| html_result | Optional[list[str]] | Tokenization result as a list of HTML strings, if applicable. |
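The attributes above can be read as a set of Pydantic field declarations. The sketch below is a reconstruction under assumptions, not the actual source: the class name TokenizerFromUpstreamSketch is hypothetical, the defaults are assumed, and only the _created_time alias is stated in this document. The _elapsed_time and markdown aliases are inferred from the usage example, and the json, text, and html aliases are guesses by analogy. The cross-field validator is left out here.

```python
from typing import Any, Literal, Optional

from pydantic import BaseModel, ConfigDict, Field


class TokenizerFromUpstreamSketch(BaseModel):
    """Hypothetical reconstruction of the documented fields (validator omitted)."""

    model_config = ConfigDict(populate_by_name=True, extra="forbid")

    # Names starting with an underscore cannot be regular Pydantic fields,
    # so the underscore-prefixed input keys are mapped through aliases.
    created_time: Optional[float] = Field(default=None, alias="_created_time")
    elapsed_time: Optional[float] = Field(default=None, alias="_elapsed_time")

    name: str
    blob: bytes
    output_format: Optional[Literal["json", "markdown", "text", "html"]] = None

    chunks: Optional[list[dict[str, Any]]] = None

    # Assumed aliases: only "markdown" appears in the document's usage example.
    json_result: Optional[list[dict[str, Any]]] = Field(default=None, alias="json")
    markdown_result: Optional[str] = Field(default=None, alias="markdown")
    text_result: Optional[str] = Field(default=None, alias="text")
    html_result: Optional[list[str]] = Field(default=None, alias="html")


sketch = TokenizerFromUpstreamSketch(
    **{
        "_created_time": 1685000000.0,
        "name": "example",
        "blob": b"\x00\x01",
        "output_format": "text",
        "text": "hello",
    }
)
print(sketch.created_time, sketch.text_result)
```

Fields that were not supplied, such as chunks and json_result, stay at None in this sketch.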
Configuration
populate_by_name=True: Allows fields to be populated by their attribute names in addition to their aliases.
extra="forbid": Rejects fields that are not defined in the model, preventing unexpected data.
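The effect of these two options can be seen on a toy model. This is a minimal sketch assuming Pydantic v2; ConfigDemo and the key "unexpected" are hypothetical names used only for illustration, and only the two configuration options mirror the documented class.

```python
from typing import Optional

from pydantic import BaseModel, ConfigDict, Field, ValidationError


class ConfigDemo(BaseModel):
    """Hypothetical toy model; only the config options mirror the document."""

    model_config = ConfigDict(populate_by_name=True, extra="forbid")

    created_time: Optional[float] = Field(default=None, alias="_created_time")


# With populate_by_name=True, the alias and the attribute name both work.
by_alias = ConfigDemo(**{"_created_time": 1.0})
by_name = ConfigDemo(created_time=1.0)

# With extra="forbid", an undeclared key fails validation.
try:
    ConfigDemo(**{"_created_time": 1.0, "unexpected": True})
    rejected = False
except ValidationError:
    rejected = True

print(by_alias.created_time, by_name.created_time, rejected)
```

Without populate_by_name=True, a field that declares an alias would accept only the alias as an input key.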
Methods
_check_payloads(self) -> TokenizerFromUpstream
Purpose: Data validation hook that runs after model initialization (mode="after"). It enforces consistency between the declared output_format and the presence of corresponding payload data.
Behavior:
If chunks is present, no further validation is performed, and the model is returned as-is.
If output_format is "markdown" or "text", ensures that the corresponding markdown_result or text_result is present; otherwise, raises a ValueError.
For other formats (including "json" or None), ensures that json_result is present if chunks is absent; otherwise, raises a ValueError.
Returns: The validated TokenizerFromUpstream instance.
Raises: ValueError if the required payload fields are missing based on the output_format.
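The rules above can be sketched as an after-mode model validator. This is an illustration under assumptions rather than the actual implementation: PayloadCheckSketch is a hypothetical reduced model that carries only the fields the check reads, "present" is interpreted as truthy (so an empty string or empty list counts as missing), and the error messages are invented.

```python
from typing import Any, Literal, Optional

from pydantic import BaseModel, ValidationError, model_validator


class PayloadCheckSketch(BaseModel):
    """Hypothetical reduced model: only the fields the validator reads."""

    output_format: Optional[Literal["json", "markdown", "text", "html"]] = None
    chunks: Optional[list[dict[str, Any]]] = None
    json_result: Optional[list[dict[str, Any]]] = None
    markdown_result: Optional[str] = None
    text_result: Optional[str] = None

    @model_validator(mode="after")
    def _check_payloads(self) -> "PayloadCheckSketch":
        # chunks take priority: when present, skip the format-specific checks.
        if self.chunks:
            return self
        if self.output_format == "markdown":
            if not self.markdown_result:
                raise ValueError("output_format='markdown' requires markdown_result")
        elif self.output_format == "text":
            if not self.text_result:
                raise ValueError("output_format='text' requires text_result")
        elif not self.json_result:
            # Every other format, including "json" and None, needs json_result.
            raise ValueError("json_result is required when chunks is absent")
        return self


ok = PayloadCheckSketch(output_format="markdown", markdown_result="# Title")
with_chunks = PayloadCheckSketch(chunks=[{"text": "a"}])

try:
    PayloadCheckSketch(output_format="text")
    failed = False
except ValidationError:
    # Pydantic wraps the ValueError raised inside the validator.
    failed = True

print(ok.markdown_result, with_chunks.output_format, failed)
```

In Pydantic v2, a ValueError raised inside a validator reaches the caller as a ValidationError, which is itself a subclass of ValueError, so callers can catch either.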
Usage Example
from schema import TokenizerFromUpstream

# Example payload from the upstream tokenizer, keyed by field aliases
payload = {
    "_created_time": 1685000000.0,
    "_elapsed_time": 0.123,
    "name": "example_tokenizer",
    "blob": b"\x00\x01\x02",
    "output_format": "markdown",
    "markdown": "# Tokenized Output\n\nThis is a test.",
}

tokenizer_output = TokenizerFromUpstream(**payload)
print(tokenizer_output.markdown_result)
# Output:
# # Tokenized Output
#
# This is a test.
Implementation Details
Data Validation: The class uses Pydantic's @model_validator decorator with mode="after" to perform complex cross-field validation after the model's initial parsing. This ensures that the payload matches the expected format constraints before the object is used downstream.
Alias Usage: Field aliases (e.g., _created_time for created_time) allow the model to map input data fields that have different key names from the Python attribute names, improving flexibility with upstream data formats.
Payload Handling Logic: The validation logic prioritizes chunks if present. If absent, it requires appropriate payload fields depending on the output_format. This design supports multiple output formats flexibly and ensures downstream consumers can reliably use the data.
Interaction with Other System Components
This schema class is designed to interface with upstream tokenization services or modules that produce tokenized data in various formats.
Once instantiated and validated, objects of TokenizerFromUpstream are likely consumed by downstream components responsible for further processing, display rendering, or storage.
The use of blob suggests raw data preservation for low-level operations or debugging.
The strict validation helps maintain data integrity across module boundaries within the InfiniFlow system.
Diagram: Class Structure of TokenizerFromUpstream
classDiagram
class TokenizerFromUpstream {
+float? created_time
+float? elapsed_time
+str name
+bytes blob
+Literal["json", "markdown", "text", "html"]? output_format
+list~dict~? chunks
+list~dict~? json_result
+str? markdown_result
+str? text_result
+list~str~? html_result
+_check_payloads() TokenizerFromUpstream
}
TokenizerFromUpstream --|> BaseModel
Summary
The schema.py file provides a Pydantic data model for tokenization output received from upstream sources in several formats. Its validation rules require either chunks or the payload that matches the declared output format, which keeps the data consistent for the rest of the InfiniFlow system. The design balances flexibility (multiple payload formats and field aliases) with safety (forbidden extra fields and required payloads).