validation_utils.py
Overview
The validation_utils.py file provides a comprehensive set of utilities for validating and parsing incoming HTTP request data in a Flask-based API environment. It leverages Pydantic models for strong schema validation, includes helper functions to normalize and validate specific data types (like UUIDs and Base64-encoded images), and defines data models representing dataset-related requests.
The file focuses on:
Validating JSON request bodies and query parameters.
Parsing and cleaning validated data.
Custom validation logic for fields such as UUID version 1, Base64 avatars, and embedding model identifiers.
Defining Pydantic models for dataset creation, update, deletion, and listing requests.
Formatting Pydantic validation errors into user-friendly messages.
This module is critical in ensuring data integrity and consistent request handling in the InfiniFlow API.
Detailed Documentation
Functions
validate_and_parse_json_request
validate_and_parse_json_request(
request: Request,
validator: type[BaseModel],
*,
extras: dict[str, Any] | None = None,
exclude_unset: bool = False
) -> tuple[dict[str, Any] | None, str | None]
Purpose:
Validates and parses the JSON payload of a Flask Request object using a multi-stage validation pipeline.
Validation stages:
Checks that the content type is application/json.
Parses JSON syntax.
Ensures payload is a dictionary.
Validates the payload against a Pydantic model (validator).
Parameters:
request: Flask Request object.
validator: A Pydantic model class used to validate the payload.
extras: Optional dictionary of extra fields to merge into the payload before validation; these are removed from the final output.
exclude_unset: If True, excludes fields not explicitly set in the input from the returned dictionary.
Returns:
Tuple containing:
Validated data dictionary (with extras removed), or None if validation failed.
None if successful, or a string error message if validation failed.
Raises:
UnsupportedMediaType if content type is not JSON.
BadRequest for malformed JSON.
ValidationError if the payload violates the schema.
Example:
validated_data, error = validate_and_parse_json_request(request, DatasetSchema)
if error:
return jsonify({"error": error}), 400
process(validated_data)
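The four validation stages can be sketched without Flask or Pydantic. This is a minimal illustration under stated assumptions, not the module's code: validate_json_payload and require_name are hypothetical names, the schema step is a plain callable standing in for the Pydantic model, and the error wording is invented.

```python
import json
from typing import Any, Callable, Optional


def validate_json_payload(
    content_type: Optional[str],
    raw_body: str,
    validate: Callable[[dict[str, Any]], dict[str, Any]],
) -> tuple[Optional[dict[str, Any]], Optional[str]]:
    """Return (data, None) on success or (None, error message) on failure."""
    # Stage 1: the media type must be application/json (parameters such as charset are ignored).
    media_type = (content_type or "").split(";")[0].strip().lower()
    if media_type != "application/json":
        return None, f"Unsupported content type: expected application/json, got {content_type}"

    # Stage 2: the body must be syntactically valid JSON.
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError as exc:
        return None, f"Malformed JSON syntax: {exc.msg}"

    # Stage 3: the top-level JSON value must be an object (a dict in Python).
    if not isinstance(payload, dict):
        return None, f"Invalid request payload: expected object, got {type(payload).__name__}"

    # Stage 4: schema validation; the callable is a stand-in for the Pydantic model.
    try:
        return validate(payload), None
    except ValueError as exc:
        return None, str(exc)


def require_name(payload: dict[str, Any]) -> dict[str, Any]:
    # Hypothetical schema check used only for this illustration.
    if not isinstance(payload.get("name"), str):
        raise ValueError("Field: <name> - Message: <Input should be a valid string>")
    return payload


data, error = validate_json_payload("application/json", '{"name": "kb"}', require_name)
```

Each stage returns early with its own message, so a caller can tell a wrong content type from a syntax error or a schema violation.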
validate_and_parse_request_args
validate_and_parse_request_args(
request: Request,
validator: type[BaseModel],
*,
extras: dict[str, Any] | None = None
) -> tuple[dict[str, Any] | None, str | None]
Purpose:
Validates and parses URL query parameters (request.args) against a Pydantic model.
Parameters:
request: Flask Request object.
validator: Pydantic model class for validation.
extras: Optional extra fields merged into args before validation and removed later.
Returns:
Tuple of:
Validated and parsed arguments dictionary, or None on failure.
None on success, or a string error message on failure.
Raises:
TypeError if validator is not a subclass of BaseModel.
Example:
args, error = validate_and_parse_request_args(request, QueryParamsModel)
if error:
return jsonify({"error": error}), 400
handle_args(args)
format_validation_error_message
format_validation_error_message(e: ValidationError) -> str
Purpose:
Converts a Pydantic ValidationError into a human-readable error message string.
Parameters:
e: Pydantic ValidationError.
Returns:
A multi-line string where each line contains:
The field path (dot-separated).
The error message.
The offending input value (truncated to 128 chars).
Example:
try:
UserModel(name=123)
except ValidationError as e:
print(format_validation_error_message(e))
# Output:
# Field: <name> - Message: <Input should be a valid string> - Value: <123>
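The formatting rule can be sketched over plain dictionaries shaped like the items Pydantic v2 returns from ValidationError.errors() (keys loc, msg, input). The helper name format_errors and the simple slice used for truncation are assumptions made for this illustration; the module may truncate differently.

```python
from typing import Any

MAX_VALUE_LEN = 128  # truncation limit stated above


def format_errors(errors: list[dict[str, Any]]) -> str:
    """Build one line per error: field path, message, and (truncated) input value."""
    lines = []
    for err in errors:
        # "loc" is a tuple of field names and list indexes; join it with dots.
        field = ".".join(str(part) for part in err["loc"])
        # Assumption: truncation is a plain slice to the first 128 characters.
        value = str(err["input"])[:MAX_VALUE_LEN]
        lines.append(f"Field: <{field}> - Message: <{err['msg']}> - Value: <{value}>")
    return "\n".join(lines)


sample = [{"loc": ("name",), "msg": "Input should be a valid string", "input": 123}]
print(format_errors(sample))
```

Nested locations such as ("parser_config", "pages", 0) come out as parser_config.pages.0, which is what the dot-separated field path refers to.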
normalize_str
normalize_str(v: Any) -> Any
Purpose:
Normalizes string input by stripping whitespace and converting to lowercase. Non-string inputs are returned unchanged.
Parameters:
v: Any value.
Returns:
Normalized string if input was a string.
Original value if not.
Example:
normalize_str(" Admin ") # returns "admin"
normalize_str(42) # returns 42
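The described behavior fits in a few lines. The sketch below is an illustration of that behavior rather than the module's source, and the name normalize_str_sketch is chosen to keep the two apart.

```python
from typing import Any


def normalize_str_sketch(v: Any) -> Any:
    """Strip and lowercase strings; pass every other type through untouched."""
    if isinstance(v, str):
        return v.strip().lower()
    return v


# Values as they might arrive from a query string or JSON body.
cleaned = [normalize_str_sketch(v) for v in (" Admin ", "TEAM", 42, None)]
```

Passing non-strings through unchanged leaves type errors for the schema validation step to report, instead of failing inside the normalizer.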
validate_uuid1_hex
validate_uuid1_hex(v: Any) -> str
Purpose:
Validates that input is a UUID version 1 and returns its 32-character hexadecimal string (without hyphens).
Parameters:
v: UUID object or string.
Returns:
UUID version 1 hex string in lowercase.
Raises:
PydanticCustomError if the input is not a valid UUID v1 string/object.
Example:
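validate_uuid1_hex("6ba7b810-9dad-11d1-80b4-00c04fd430c8")
# returns "6ba7b8109dad11d180b400c04fd430c8"
validate_uuid1_hex("550e8400-e29b-41d4-a716-446655440000")
# raises PydanticCustomError: the third group starts with 4, so this is a version 4 UUID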
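The version check can be reproduced with the standard library's uuid module. This is a sketch under assumptions: uuid1_hex is a hypothetical name, and it raises ValueError where the module is documented to raise PydanticCustomError.

```python
import uuid
from typing import Any


def uuid1_hex(v: Any) -> str:
    """Return the 32-character lowercase hex form of a version 1 UUID."""
    try:
        parsed = v if isinstance(v, uuid.UUID) else uuid.UUID(str(v))
    except ValueError as exc:
        # uuid.UUID raises ValueError for badly formed strings.
        raise ValueError("Invalid UUID format") from exc
    if parsed.version != 1:
        raise ValueError(f"Expected UUID version 1, got version {parsed.version}")
    return parsed.hex


print(uuid1_hex("6ba7b810-9dad-11d1-80b4-00c04fd430c8"))
# prints 6ba7b8109dad11d180b400c04fd430c8
```

The version is encoded in the first hex digit of the third group, so a string such as 550e8400-e29b-41d4-a716-446655440000 parses as a UUID but is rejected by the version test.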
Classes
Base
class Base(BaseModel):
model_config = ConfigDict(extra="forbid", strict=True)
Description:
Base Pydantic model that enables strict validation and forbids any fields beyond those defined.
RaptorConfig
class RaptorConfig(Base):
use_raptor: bool = False
prompt: str = "Please summarize the following paragraphs. Be careful with the numbers, do not make things up. Paragraphs as following:\n {cluster_content}\nThe above is the content you need to summarize."
max_token: int = 256
threshold: float = 0.1
max_cluster: int = 64
random_seed: int = 0
Description:
Configuration model for the Raptor summarization feature with fields controlling usage, prompt template, token limits, thresholds, clustering, and random seed.
GraphragConfig
class GraphragConfig(Base):
use_graphrag: bool = False
entity_types: list[str] = ["organization", "person", "geo", "event", "category"]
method: Literal["light", "general"] = "light"
community: bool = False
resolution: bool = False
Description:
Configuration model for Graphrag entity extraction and community detection with options for entity types and method selection.
ParserConfig
class ParserConfig(Base):
auto_keywords: int = 0
auto_questions: int = 0
chunk_token_num: int = 512
delimiter: str = r"\n"
graphrag: GraphragConfig = GraphragConfig()
html4excel: bool = False
layout_recognize: str = "DeepDOC"
raptor: RaptorConfig = RaptorConfig()
tag_kb_ids: list[str] = []
topn_tags: int = 1
filename_embd_weight: float | None = 0.1
task_page_size: int | None = None
pages: list[list[int]] | None = None
Description:
Parser configuration including keyword and question auto-generation, chunk sizes, Graphrag and Raptor sub-configurations, layout recognition, tagging, and pagination.
CreateDatasetReq
class CreateDatasetReq(Base):
name: str # constrained string, max length by DATASET_NAME_LIMIT
avatar: str | None = None # Base64-encoded image
description: str | None = None
embedding_model: str | None = None # with alias `embd_id`
permission: Literal["me", "team"] = "me"
chunk_method: Literal["naive", "book", "email", ...] = "naive" # alias `parser_id`
parser_config: ParserConfig | None = None
Description:
Request model for creating datasets with fields for naming, avatar image, description, embedding model, permissions, chunking method, and parser configurations.
Notable Validators:
validate_avatar_base64: Ensures the avatar string is a valid Base64 image with supported MIME types (image/jpeg, image/png).
normalize_embedding_model: Strips whitespace from embedding model string.
validate_embedding_model: Validates embedding model format as <model_name>@<provider>.
normalize_empty_parser_config: Converts empty parser config dict to None.
validate_parser_config_json_length: Ensures serialized parser config JSON does not exceed 65,535 characters.
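Two of these checks, the avatar format and the embedding model format, can be sketched with the standard library. The sketch assumes the avatar is a data URI of the form data:<mime>;base64,<payload>, which the MIME-prefix wording suggests but the text does not spell out. The function names, the error messages, and the my-model@my-provider value are hypothetical.

```python
import base64
import binascii

SUPPORTED_MIME_TYPES = {"image/jpeg", "image/png"}  # the two types listed above


def check_avatar(value: str) -> str:
    # Assumed layout: "data:<mime>;base64,<payload>".
    if "," not in value:
        raise ValueError("Missing MIME prefix, expected data:<mime>;base64,<data>")
    prefix, payload = value.split(",", 1)
    if not (prefix.startswith("data:") and prefix.endswith(";base64")):
        raise ValueError("Prefix must start with 'data:' and end with ';base64'")
    mime_type = prefix[len("data:"):-len(";base64")]
    if mime_type not in SUPPORTED_MIME_TYPES:
        raise ValueError(f"Unsupported MIME type: {mime_type!r}")
    try:
        # validate=True rejects characters outside the Base64 alphabet.
        base64.b64decode(payload, validate=True)
    except binascii.Error as exc:
        raise ValueError("Avatar payload is not valid Base64") from exc
    return value


def check_embedding_model(value: str) -> str:
    # Normalize first (strip whitespace), then require "<model_name>@<provider>".
    value = value.strip()
    model_name, separator, provider = value.partition("@")
    if not separator or not model_name.strip() or not provider.strip():
        raise ValueError("Embedding model must look like <model_name>@<provider>")
    return value


avatar = "data:image/png;base64," + base64.b64encode(b"fake-image-bytes").decode()
check_avatar(avatar)
check_embedding_model("  my-model@my-provider ")
```

The Base64 check only confirms that the payload decodes; it does not verify that the bytes are a real JPEG or PNG.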
UpdateDatasetReq
class UpdateDatasetReq(CreateDatasetReq):
dataset_id: str
name: str = ""
pagerank: int = 0
Description:
Extends CreateDatasetReq with fields for updating existing datasets, including a validated UUIDv1 dataset_id, optional name, and pagerank score.
Validator:
validate_dataset_id: Ensures dataset_id is a valid UUID version 1 hex string.
DeleteReq
class DeleteReq(Base):
ids: list[str] | None
Description:
Request model for deletion operations accepting a list of dataset IDs.
Validator:
validate_ids: Validates each ID as a UUIDv1 hex string, ensures no duplicates, and preserves original order. Returns the normalized UUID hex list, or None.
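The ID list handling can be sketched with the standard library. It assumes duplicates are rejected rather than silently dropped, and that they are detected after normalization; normalize_ids is a hypothetical name, and ValueError stands in for the Pydantic error the module raises.

```python
import uuid
from collections import Counter
from typing import Optional


def normalize_ids(ids: Optional[list[str]]) -> Optional[list[str]]:
    """Normalize each id to UUID v1 hex, keep input order, reject duplicates."""
    if ids is None:
        return None
    hex_ids = []
    for raw in ids:
        parsed = uuid.UUID(raw)  # raises ValueError on malformed input
        if parsed.version != 1:
            raise ValueError(f"Not a UUID version 1: {raw}")
        hex_ids.append(parsed.hex)
    # Count after normalization so hyphenated and plain-hex forms of one id collide.
    duplicates = [item for item, count in Counter(hex_ids).items() if count > 1]
    if duplicates:
        raise ValueError(f"Duplicate ids: {', '.join(duplicates)}")
    return hex_ids


ids = normalize_ids([
    "6ba7b811-9dad-11d1-80b4-00c04fd430c8",
    "6ba7b810-9dad-11d1-80b4-00c04fd430c8",
])
```

Appending to a list keeps the caller's order, which a set-based deduplication would not guarantee.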
DeleteDatasetReq
class DeleteDatasetReq(DeleteReq):
...
Description:
Subclass of DeleteReq that adds no fields; it names the model used for dataset deletion requests.
BaseListReq
class BaseListReq(BaseModel):
id: str | None = None
name: str | None = None
page: int = 1
page_size: int = 30
orderby: Literal["create_time", "update_time"] = "create_time"
desc: bool = True
Description:
Base request model for listing datasets. It supports filtering by id or name, pagination (page, page_size), and sorting (orderby, desc).
Validator:
validate_id: Validates id as a UUIDv1 hex string if provided.
ListDatasetReq
class ListDatasetReq(BaseListReq):
...
Description:
Subclass of BaseListReq that adds no fields; it names the model used to list datasets.
Important Implementation Details and Algorithms
Multi-stage request validation: Functions first verify content types and JSON syntax, then use Pydantic models for schema validation. This layered approach improves reliability and error diagnostics.
UUID version 1 enforcement: UUID validation strictly requires version 1 UUIDs, preventing accidental acceptance of other UUID versions, enhancing data consistency.
Base64 avatar validation: Checks MIME type prefixes and supported image formats to ensure avatars meet expected encoding and format requirements.
Error formatting: Validation errors are converted into clear, concise strings identifying the field, error message, and invalid input, improving debugging and user feedback.
Extras merging and removal: Optional extra fields can be merged into data before validation to support server-side enrichment, then stripped out before returning the validated dictionary.
Parser config size validation: Ensures serialized parser configurations do not exceed 65,535 characters to avoid payload bloat or storage issues.
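Two of the points above, extras handling and the parser config size limit, can be sketched as plain dictionary operations. The function names are hypothetical. The sketch also assumes that extras override payload keys on conflict and that the length is measured on the json.dumps output; neither detail is stated in the text.

```python
import json
from typing import Any, Optional

PARSER_CONFIG_MAX_JSON_LEN = 65535  # limit stated above


def merge_extras(payload: dict[str, Any], extras: Optional[dict[str, Any]]) -> dict[str, Any]:
    # Build a new dict so the caller's payload is not mutated.
    # Assumption: on a key conflict the server-side extra wins.
    return {**payload, **(extras or {})}


def strip_extras(validated: dict[str, Any], extras: Optional[dict[str, Any]]) -> dict[str, Any]:
    # Drop every key that came from extras before returning data to the caller.
    return {k: v for k, v in validated.items() if k not in (extras or {})}


def check_parser_config_length(parser_config: dict[str, Any]) -> dict[str, Any]:
    # Measure the serialized form, since that is what would be stored.
    length = len(json.dumps(parser_config))
    if length > PARSER_CONFIG_MAX_JSON_LEN:
        raise ValueError(f"Parser config is {length} characters, limit is {PARSER_CONFIG_MAX_JSON_LEN}")
    return parser_config


payload = {"name": "kb"}
extras = {"dataset_id": "6ba7b8109dad11d180b400c04fd430c8"}
merged = merge_extras(payload, extras)   # validated together with the extras
result = strip_extras(merged, extras)    # extras removed again afterwards
```

Merging before validation lets a path parameter such as a dataset ID pass through the same schema checks as the body, while the returned dictionary still contains only client-supplied fields.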
Interaction with Other System Components
Flask Web Framework: This module directly consumes flask.Request objects for data extraction and validation.
Pydantic Models: Uses Pydantic extensively for data validation, serialization, and field constraints.
API Constants: Imports DATASET_NAME_LIMIT from api.constants to enforce name length constraints.
Error Handling: Raises Flask/Werkzeug exceptions (BadRequest, UnsupportedMediaType) and Pydantic errors which should be handled by Flask error handlers or middleware.
Dataset Management: The CreateDatasetReq, UpdateDatasetReq, DeleteDatasetReq, and ListDatasetReq classes are designed for dataset CRUD operations and likely interact with database models or service layers elsewhere.
Visual Diagram: Function Flowchart
flowchart TD
A[validate_and_parse_json_request] -->|calls| B["validator(**payload)"]
A --> C[format_validation_error_message]
D[validate_and_parse_request_args] -->|calls| E["validator(**args)"]
D --> C
C --> G[returns formatted string]
H[validate_uuid1_hex] --> I{check if input is UUID v1}
I -->|valid| J[return hex string]
I -->|invalid| K[raise PydanticCustomError]
L[normalize_str] --> M[returns normalized string or original]
N[DeleteReq.validate_ids] --> O[validate_uuid1_hex for each id]
O --> P{check duplicates}
P -->|duplicates| K
P -->|no duplicates| Q[return list of hex ids]
Summary
validation_utils.py is a foundational module for input validation in the InfiniFlow API backend. It standardizes how request data is validated, normalized, and processed using Pydantic schemas and custom validators. Its strict validation (forbidden extra fields, strict types, UUID version 1 IDs) keeps malformed or unexpected data from entering the system. The defined request models align closely with dataset-related API operations, indicating tight integration with dataset management components.
This detailed validation layer improves API reliability, security, and developer experience by providing precise error messages and consistent data handling workflows.