validation_utils.py

Overview

The validation_utils.py file provides utilities for validating and parsing incoming HTTP request data in a Flask-based API. It uses Pydantic models for schema validation, includes helper functions that normalize and validate specific data types (such as UUIDs and Base64-encoded images), and defines the data models for dataset-related requests.

The file focuses on:

Validating JSON request bodies and query parameters.
Parsing and cleaning validated data.
Custom validation logic for fields such as UUID version 1, Base64 avatars, and embedding model identifiers.
Defining Pydantic models for dataset creation, update, deletion, and listing requests.
Formatting Pydantic validation errors into user-friendly messages.

This module is critical in ensuring data integrity and consistent request handling in the InfiniFlow API.

Detailed Documentation

Functions

`validate_and_parse_json_request`

validate_and_parse_json_request(
    request: Request,
    validator: type[BaseModel],
    *,
    extras: dict[str, Any] | None = None,
    exclude_unset: bool = False
) -> tuple[dict[str, Any] | None, str | None]

Purpose:
Validates and parses the JSON payload of a Flask Request object using a multi-stage validation pipeline.

Validation stages:

Checks that the content type is application/json.
Parses JSON syntax.
Ensures payload is a dictionary.
Validates the payload against a Pydantic model (validator).

Parameters:

request: Flask Request object.
validator: A Pydantic model class used to validate the payload.
extras: Optional dictionary of extra fields to merge into the payload before validation; these are removed from final output.
exclude_unset: If True, excludes fields not explicitly set in the input from the returned dictionary.

Returns:
Tuple containing:

Validated data dictionary (with extras removed) or None if validation failed.
None if successful or a string error message if validation failed.

Raises:

UnsupportedMediaType if content type is not JSON.
BadRequest for malformed JSON.
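ValidationError if the payload violates the schema.

Given the (data, error) return value and the example below, these conditions appear to be reported through the returned error string rather than propagated to the caller, so callers check the error value instead of wrapping the call in try/except.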

Example:

validated_data, error = validate_and_parse_json_request(request, DatasetSchema)
if error:
    return jsonify({"error": error}), 400
process(validated_data)
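
The four validation stages listed above can be sketched without Flask or Pydantic. This is a minimal illustration, not the module's code: `parse_json_body`, its `content_type` and `raw_body` parameters, the `schema_check` callable, the toy `require_name` schema, and all error strings are hypothetical stand-ins. The real function reads from a Flask Request and delegates the schema stage to a Pydantic model.

```python
from __future__ import annotations

import json
from typing import Any, Callable


def parse_json_body(
    content_type: str | None,
    raw_body: str,
    schema_check: Callable[[dict[str, Any]], dict[str, Any]],
) -> tuple[dict[str, Any] | None, str | None]:
    """Hypothetical stand-in mirroring the four stages described above."""
    # Stage 1: the content type must be application/json.
    if content_type is None or content_type.split(";")[0].strip().lower() != "application/json":
        return None, f"Unsupported content type: {content_type}"

    # Stage 2: the body must be syntactically valid JSON.
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError:
        return None, "Malformed JSON syntax"

    # Stage 3: the top-level JSON value must be an object (a dict).
    if not isinstance(payload, dict):
        return None, f"Invalid payload: expected object, got {type(payload).__name__}"

    # Stage 4: schema validation (a Pydantic model in the real module).
    try:
        return schema_check(payload), None
    except ValueError as exc:
        return None, str(exc)


def require_name(payload: dict[str, Any]) -> dict[str, Any]:
    """Toy schema: 'name' must be present and must be a string."""
    if not isinstance(payload.get("name"), str):
        raise ValueError("Field: <name> - Message: <Input should be a valid string>")
    return payload
```

Each stage returns early with its own message, so the caller can tell a wrong content type from a syntax error or a schema violation.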

`validate_and_parse_request_args`

validate_and_parse_request_args(
    request: Request,
    validator: type[BaseModel],
    *,
    extras: dict[str, Any] | None = None
) -> tuple[dict[str, Any] | None, str | None]

Purpose:
Validates and parses URL query parameters (request.args) against a Pydantic model.

Parameters:

request: Flask Request object.
validator: Pydantic model class for validation.
extras: Optional extra fields merged into args before validation and removed later.

Returns:
Tuple of:

Validated and parsed arguments dictionary or None on failure.
None if success or string error message if failure.

Raises:

TypeError if validator is not a subclass of BaseModel.

Example:

args, error = validate_and_parse_request_args(request, QueryParamsModel)
if error:
    return jsonify({"error": error}), 400
handle_args(args)
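
The extras behaviour described in the parameters (merge before validation, remove afterwards) can be illustrated with the standard library alone. In this sketch, `parse_args_with_extras`, `toy_validate`, and the `tenant_id` key are hypothetical; the query string is flattened with `urllib.parse.parse_qsl` rather than Flask's `request.args`, so handling of repeated keys may differ from the real function.

```python
from __future__ import annotations

from typing import Any, Callable
from urllib.parse import parse_qsl


def parse_args_with_extras(
    query_string: str,
    validate: Callable[[dict[str, Any]], dict[str, Any]],
    extras: dict[str, Any] | None = None,
) -> dict[str, Any]:
    """Hypothetical helper: merge extras, validate, then strip extras again."""
    # Flatten the query string into a plain dict (last value wins on repeats).
    args: dict[str, Any] = dict(parse_qsl(query_string))

    # Server-side values join the payload so the schema can see them.
    if extras is not None:
        args.update(extras)

    validated = validate(args)

    # Remove the injected keys so the caller only gets client-supplied fields.
    if extras is not None:
        validated = {k: v for k, v in validated.items() if k not in extras}
    return validated


def toy_validate(args: dict[str, Any]) -> dict[str, Any]:
    """Toy schema: coerce 'page' to int and default it to 1."""
    return {**args, "page": int(args.get("page", 1))}


result = parse_args_with_extras("name=kb&page=2", toy_validate, extras={"tenant_id": "t1"})
print(result)  # {'name': 'kb', 'page': 2}
```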

`format_validation_error_message`

format_validation_error_message(e: ValidationError) -> str

Purpose:
Converts a Pydantic ValidationError into a human-readable error message string.

Parameters:

e: Pydantic ValidationError.

Returns:
A multi-line string where each line contains:

The field path (dot-separated).
The error message.
The offending input value (truncated to 128 chars).

Example:

try:
    UserModel(name=123)
except ValidationError as e:
    print(format_validation_error_message(e))
# Output:
# Field: <name> - Message: <Input should be a valid string> - Value: <123>
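
The line format above can be reproduced from plain dictionaries shaped like the items Pydantic returns from `ValidationError.errors()` (each with `loc`, `msg`, and `input` keys). The helper name `format_error_items` is hypothetical, and the trailing "..." on truncated values is an assumption; the source only states that values are truncated to 128 characters.

```python
from __future__ import annotations

from typing import Any


def format_error_items(errors: list[dict[str, Any]], max_len: int = 128) -> str:
    """Hypothetical formatter over dicts shaped like Pydantic's e.errors()."""
    lines = []
    for err in errors:
        # Join nested locations such as ("parser_config", "raptor") with dots.
        field = ".".join(str(part) for part in err["loc"])
        value = str(err["input"])
        # Truncate long inputs so one bad field cannot flood the message.
        # Assumption: the ellipsis counts toward the 128-character budget.
        if len(value) > max_len:
            value = value[: max_len - 3] + "..."
        lines.append(f"Field: <{field}> - Message: <{err['msg']}> - Value: <{value}>")
    return "\n".join(lines)


sample = [{"loc": ("name",), "msg": "Input should be a valid string", "input": 123}]
print(format_error_items(sample))
# Field: <name> - Message: <Input should be a valid string> - Value: <123>
```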

`normalize_str`

normalize_str(v: Any) -> Any

Purpose:
Normalizes string input by stripping whitespace and converting to lowercase. Non-string inputs are returned unchanged.

Parameters:

v: Any value.

Returns:

Normalized string if input was a string.
Original value if not.

Example:

normalize_str("  Admin ")  # returns "admin"
normalize_str(42)          # returns 42
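
A minimal sketch of the behaviour described above, under the assumption that only `str` instances are transformed. The name `normalize_str_sketch` is hypothetical, chosen to avoid implying this is the module's implementation.

```python
from typing import Any


def normalize_str_sketch(v: Any) -> Any:
    """Strip and lowercase strings; pass every other type through untouched."""
    if isinstance(v, str):
        return v.strip().lower()
    return v


print(normalize_str_sketch("  Admin "))  # admin
print(normalize_str_sketch(None))        # None
```

Applying such a step before a Literal check lets an input like " Team " match the allowed value "team" instead of failing on case or padding.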

`validate_uuid1_hex`

validate_uuid1_hex(v: Any) -> str

Purpose:
Validates that input is a UUID version 1 and returns its 32-character hexadecimal string (without hyphens).

Parameters:

v: UUID object or string.

Returns:

UUID version 1 hex string in lowercase.

Raises:

PydanticCustomError if the input is not a valid UUID v1 string/object.

Example:

validate_uuid1_hex("c9bf9e57-1685-11ee-be56-0242ac120002")
# returns "c9bf9e57168511eebe560242ac120002"
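
The version check can be sketched with the standard `uuid` module. This is an illustration under stated assumptions: `uuid1_hex_sketch` is a hypothetical name, and it raises `ValueError` where the documented function raises `PydanticCustomError`, so that the example runs without Pydantic installed.

```python
from __future__ import annotations

import uuid
from typing import Any


def uuid1_hex_sketch(v: Any) -> str:
    """Return the 32-character hex of a version 1 UUID, else raise ValueError."""
    try:
        # Accept either a UUID instance or anything uuid.UUID can parse.
        parsed = v if isinstance(v, uuid.UUID) else uuid.UUID(str(v))
    except (ValueError, AttributeError, TypeError):
        raise ValueError("Invalid UUID1 format") from None
    # The version is encoded in the UUID itself; only version 1 is accepted.
    if parsed.version != 1:
        raise ValueError("Invalid UUID1 format")
    return parsed.hex


print(uuid1_hex_sketch(uuid.uuid1()))  # 32 lowercase hex characters, no hyphens
```

The version digit is the first character of the third hyphen-separated group, so a UUID whose third group starts with any digit other than 1 is rejected.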

Classes

`Base`

class Base(BaseModel):
    model_config = ConfigDict(extra="forbid", strict=True)

Description:
Base Pydantic model with strict validation and forbids extra fields beyond those defined.

`RaptorConfig`

class RaptorConfig(Base):
    use_raptor: bool = False
    prompt: str = "Please summarize the following paragraphs. Be careful with the numbers, do not make things up. Paragraphs as following:\n      {cluster_content}\nThe above is the content you need to summarize."
    max_token: int = 256
    threshold: float = 0.1
    max_cluster: int = 64
    random_seed: int = 0

Description:
Configuration model for the Raptor summarization feature with fields controlling usage, prompt template, token limits, thresholds, clustering, and random seed.

`GraphragConfig`

class GraphragConfig(Base):
    use_graphrag: bool = False
    entity_types: list[str] = ["organization", "person", "geo", "event", "category"]
    method: Literal["light", "general"] = "light"
    community: bool = False
    resolution: bool = False

Description:
Configuration model for Graphrag entity extraction and community detection with options for entity types and method selection.

`ParserConfig`

class ParserConfig(Base):
    auto_keywords: int = 0
    auto_questions: int = 0
    chunk_token_num: int = 512
    delimiter: str = r"\n"
    graphrag: GraphragConfig = GraphragConfig()
    html4excel: bool = False
    layout_recognize: str = "DeepDOC"
    raptor: RaptorConfig = RaptorConfig()
    tag_kb_ids: list[str] = []
    topn_tags: int = 1
    filename_embd_weight: float | None = 0.1
    task_page_size: int | None = None
    pages: list[list[int]] | None = None

Description:
Parser configuration including keyword and question auto-generation, chunk sizes, Graphrag and Raptor sub-configurations, layout recognition, tagging, and pagination.

`CreateDatasetReq`

class CreateDatasetReq(Base):
    name: str  # constrained string, max length by DATASET_NAME_LIMIT
    avatar: str | None = None  # Base64-encoded image
    description: str | None = None
    embedding_model: str | None = None  # with alias `embd_id`
    permission: Literal["me", "team"] = "me"
    chunk_method: Literal["naive", "book", "email", ...] = "naive"  # alias `parser_id`
    parser_config: ParserConfig | None = None

Description:
Request model for creating datasets with fields for naming, avatar image, description, embedding model, permissions, chunking method, and parser configurations.

Notable Validators:

validate_avatar_base64: Ensures the avatar string is a valid Base64 image with supported MIME types (image/jpeg, image/png).
normalize_embedding_model: Strips whitespace from embedding model string.
validate_embedding_model: Validates embedding model format as <model_name>@<provider>.
normalize_empty_parser_config: Converts empty parser config dict to None.
validate_parser_config_json_length: Ensures serialized parser config JSON does not exceed 65,535 characters.
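
Two of the validators above, the `<model_name>@<provider>` format check and the parser config size limit, can be approximated with the standard library. The function names, error messages, and the sample value `my-model@my-provider` are hypothetical; the real validators are attached to the Pydantic model and may differ in detail (for example, in how the config is serialized before its length is measured).

```python
from __future__ import annotations

import json
from typing import Any

PARSER_CONFIG_MAX_CHARS = 65535  # limit stated in the validator list above


def check_embedding_model(value: str) -> str:
    """Accept only '<model_name>@<provider>' with both parts non-empty."""
    value = value.strip()
    # partition splits on the first "@" and reports whether one was found.
    model_name, sep, provider = value.partition("@")
    if not sep or not model_name.strip() or not provider.strip():
        raise ValueError("Embedding model must be <model_name>@<provider>")
    return value


def check_parser_config(config: dict[str, Any] | None) -> dict[str, Any] | None:
    """Map an empty config to None and reject oversized serialized configs."""
    if not config:
        return None
    # Assumption: length is measured on the default json.dumps output.
    if len(json.dumps(config)) > PARSER_CONFIG_MAX_CHARS:
        raise ValueError("Parser config exceeds 65535 characters")
    return config


print(check_embedding_model("  my-model@my-provider "))  # my-model@my-provider
print(check_parser_config({}))                            # None
```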

`UpdateDatasetReq`

class UpdateDatasetReq(CreateDatasetReq):
    dataset_id: str
    name: str = ""
    pagerank: int = 0

Description:
Extends CreateDatasetReq with fields for updating existing datasets, including a validated UUIDv1 dataset_id, optional name, and pagerank score.

Validator:

validate_dataset_id: Ensures dataset_id is a valid UUID version 1 hex string.

`DeleteReq`

class DeleteReq(Base):
    ids: list[str] | None

Description:
Request model for deletion operations accepting a list of dataset IDs.

Validator:

validate_ids:
Validates each ID as UUIDv1 hex string, ensures no duplicates, and preserves original order. Returns normalized UUID hex list or None.
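
The combination of per-ID validation, duplicate detection, and order preservation can be sketched as follows. `normalize_ids` is a hypothetical name, and `ValueError` stands in for the Pydantic error type used by the real validator.

```python
from __future__ import annotations

import uuid


def normalize_ids(ids: list[str] | None) -> list[str] | None:
    """Hypothetical sketch: validate UUIDv1 ids, reject duplicates, keep order."""
    if ids is None:
        return None
    seen: set[str] = set()
    result: list[str] = []
    duplicates: list[str] = []
    for raw in ids:
        parsed = uuid.UUID(raw)  # raises ValueError on malformed input
        if parsed.version != 1:
            raise ValueError(f"Not a UUID version 1: {raw}")
        hex_id = parsed.hex
        # Compare normalized hex so hyphenated and bare forms count as equal.
        if hex_id in seen:
            duplicates.append(raw)
        else:
            seen.add(hex_id)
            result.append(hex_id)
    if duplicates:
        raise ValueError(f"Duplicate ids: {', '.join(duplicates)}")
    return result
```

A list is used for the result and a set only for membership tests, which keeps the output in the caller's original order while still detecting repeats in constant time per ID.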

`DeleteDatasetReq`

class DeleteDatasetReq(DeleteReq):
    ...

Description:
Subclass of DeleteReq that adds no fields; it gives dataset deletion requests their own named model.

`BaseListReq`

class BaseListReq(BaseModel):
    id: str | None = None
    name: str | None = None
    page: int = 1
    page_size: int = 30
    orderby: Literal["create_time", "update_time"] = "create_time"
    desc: bool = True

Description:
Base request model for listing datasets, supporting pagination (page, page_size), filtering (id, name), and sorting (orderby, desc).

Validator:

validate_id: Validates id as UUIDv1 hex string if provided.

`ListDatasetReq`

class ListDatasetReq(BaseListReq):
    ...

Description:
Subclass of BaseListReq that adds no fields; it is the model used for dataset listing requests.

Important Implementation Details and Algorithms

Multi-stage request validation: Functions first verify content types and JSON syntax, then use Pydantic models for schema validation. This layered approach improves reliability and error diagnostics.
UUID version 1 enforcement: UUID validation accepts only version 1 UUIDs, so identifiers of any other version are rejected instead of being accepted by accident.
Base64 avatar validation: Checks MIME type prefixes and supported image formats to ensure avatars meet expected encoding and format requirements.
Error formatting: Validation errors are converted into clear, concise strings identifying the field, error message, and invalid input, improving debugging and user feedback.
Extras merging and removal: Optional extra fields can be merged into data before validation to support server-side enrichment, then stripped out before returning the validated dictionary.
Parser config size validation: Ensures serialized parser configurations do not exceed 65,535 characters to avoid payload bloat or storage issues.
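
The Base64 avatar check in the list above can be illustrated with a short sketch. It assumes avatars arrive in a data-URI style form, `data:<mime>;base64,<payload>`, which the source does not spell out; `check_avatar` and its error messages are hypothetical, and decoding the payload is an extra step added here for illustration.

```python
from __future__ import annotations

import base64

SUPPORTED_MIME_TYPES = {"image/jpeg", "image/png"}  # formats named in this document


def check_avatar(value: str) -> str:
    """Hypothetical check for 'data:<mime>;base64,<payload>' avatar strings."""
    prefix, sep, payload = value.partition(",")
    if not sep or not prefix.startswith("data:"):
        raise ValueError("Missing MIME prefix, expected data:<mime>;base64,<data>")
    # Everything between "data:" and the first ";" is the MIME type.
    mime_type = prefix[len("data:"):].split(";")[0]
    if mime_type not in SUPPORTED_MIME_TYPES:
        raise ValueError(f"Unsupported MIME type: {mime_type}")
    try:
        # validate=True rejects characters outside the Base64 alphabet.
        base64.b64decode(payload, validate=True)
    except ValueError:  # binascii.Error is a subclass of ValueError
        raise ValueError("Payload is not valid Base64") from None
    return value


encoded = base64.b64encode(b"\x89PNG\r\n").decode("ascii")
print(check_avatar(f"data:image/png;base64,{encoded}") is not None)  # True
```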

Interaction with Other System Components

Flask Web Framework: This module directly consumes flask.Request objects for data extraction and validation.
Pydantic Models: Uses Pydantic extensively for data validation, serialization, and field constraints.
API Constants: Imports DATASET_NAME_LIMIT from api.constants to enforce name length constraints.
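Error Handling: Works with Werkzeug exceptions (BadRequest, UnsupportedMediaType) and Pydantic validation errors. The two request-parsing functions return failures as an error string, which the calling route converts into an HTTP response (a 400 in the examples above).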
Dataset Management: The CreateDatasetReq, UpdateDatasetReq, DeleteDatasetReq, and ListDatasetReq classes are designed for dataset CRUD operations and likely interact with database models or service layers elsewhere.

Visual Diagram: Function Flowchart

flowchart TD
    A[validate_and_parse_json_request] -->|calls| B["validator(**payload)"]
    A --> C[format_validation_error_message]
    D[validate_and_parse_request_args] -->|calls| E["validator(**args)"]
    D --> C
    C --> G[returns formatted string]
    H[validate_uuid1_hex] --> I{check if input is UUID v1}
    I -->|valid| J[return hex string]
    I -->|invalid| K[raise PydanticCustomError]
    L[normalize_str] --> M[returns normalized string or original]
    N[DeleteReq.validate_ids] --> O[validate_uuid1_hex for each id]
    O --> P{check duplicates}
    P -->|duplicates| K
    P -->|no duplicates| Q[return list of hex ids]

Summary

validation_utils.py is a foundational module for input validation in the InfiniFlow API backend. It standardizes how request data is validated, normalized, and processed using Pydantic schemas and custom validators. Its strict validation (forbidden extra fields, strict types, UUID version 1 identifiers) keeps malformed or unexpected data from entering the system. The request models map directly onto dataset create, update, delete, and list operations, indicating tight integration with the dataset management components.

This detailed validation layer improves API reliability, security, and developer experience by providing precise error messages and consistent data handling workflows.