validation_utils.py


Overview

The validation_utils.py file provides a comprehensive set of utilities for validating and parsing incoming HTTP request data in a Flask-based API environment. It leverages Pydantic models for strong schema validation, includes helper functions to normalize and validate specific data types (like UUIDs and Base64-encoded images), and defines data models representing dataset-related requests.

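The file focuses on:

  - validating JSON request bodies and URL query parameters against Pydantic models;
  - formatting Pydantic validation errors into readable messages;
  - normalizing and validating individual values such as strings and UUIDs;
  - defining request models for creating, updating, deleting, and listing datasets.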

This module is critical in ensuring data integrity and consistent request handling in the InfiniFlow API.


Detailed Documentation

Functions


validate_and_parse_json_request

validate_and_parse_json_request(
    request: Request,
    validator: type[BaseModel],
    *,
    extras: dict[str, Any] | None = None,
    exclude_unset: bool = False
) -> tuple[dict[str, Any] | None, str | None]

Purpose:
Validates and parses the JSON payload of a Flask Request object using a multi-stage validation pipeline.

Validation stages:

  1. Checks that the content type is application/json.

  2. Parses JSON syntax.

  3. Ensures payload is a dictionary.

  4. Validates the payload against a Pydantic model (validator).

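Parameters:

  - request: the Flask Request object whose JSON body is validated.
  - validator: the Pydantic model class the payload is validated against.
  - extras: optional dict[str, Any] (keyword-only, defaults to None).
  - exclude_unset: bool (keyword-only, defaults to False).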

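Returns:
Tuple containing:

  - the validated data as a dictionary, or None if validation failed;
  - an error message string, or None if validation succeeded.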

Raises:

Example:

validated_data, error = validate_and_parse_json_request(request, DatasetSchema)
if error:
    return jsonify({"error": error}), 400
process(validated_data)

validate_and_parse_request_args

validate_and_parse_request_args(
    request: Request,
    validator: type[BaseModel],
    *,
    extras: dict[str, Any] | None = None
) -> tuple[dict[str, Any] | None, str | None]

Purpose:
Validates and parses URL query parameters (request.args) against a Pydantic model.

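Parameters:

  - request: the Flask Request object whose query parameters are validated.
  - validator: the Pydantic model class the parameters are validated against.
  - extras: optional dict[str, Any] (keyword-only, defaults to None).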

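Returns:
Tuple of:

  - the validated arguments as a dictionary, or None if validation failed;
  - an error message string, or None if validation succeeded.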

Raises:

Example:

args, error = validate_and_parse_request_args(request, QueryParamsModel)
if error:
    return jsonify({"error": error}), 400
handle_args(args)

format_validation_error_message

format_validation_error_message(e: ValidationError) -> str

Purpose:
Converts a Pydantic ValidationError into a human-readable error message string.

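Parameters:

  - e: the Pydantic ValidationError to format.

Returns:
A multi-line string with one line per validation error, each containing the field name, the error message, and the offending input value.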

Example:

try:
    UserModel(name=123)
except ValidationError as e:
    print(format_validation_error_message(e))
# Output:
# Field: <name> - Message: <Input should be a valid string> - Value: <123>
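
A formatter producing the line shape shown above might look like the following. This is a sketch assuming Pydantic v2, where each entry of e.errors() carries loc, msg, and input keys. The name format_errors is hypothetical, and the real function may differ in details such as how it renders nested field paths or long values.

```python
from pydantic import BaseModel, ValidationError


def format_errors(e: ValidationError) -> str:
    lines = []
    for err in e.errors():
        # loc is a tuple such as ("parser_config", "chunk_token_num");
        # join it with dots to get a readable field path.
        field = ".".join(str(part) for part in err["loc"])
        lines.append(
            f"Field: <{field}> - Message: <{err['msg']}> - Value: <{err['input']}>"
        )
    return "\n".join(lines)


class UserModel(BaseModel):
    # Hypothetical model used only for this illustration.
    name: str


try:
    UserModel(name=123)
except ValidationError as exc:
    message = format_errors(exc)
```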

normalize_str

normalize_str(v: Any) -> Any

Purpose:
Normalizes string input by stripping whitespace and converting to lowercase. Non-string inputs are returned unchanged.

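Parameters:

  - v: the value to normalize; may be of any type.

Returns:

  - the input with surrounding whitespace stripped and converted to lowercase if it is a string; otherwise the input unchanged.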

Example:

normalize_str("  Admin ")  # returns "admin"
normalize_str(42)          # returns 42
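
The behavior described above fits in a few lines. The following is a minimal sketch consistent with the documented contract, not necessarily the module's exact code.

```python
from typing import Any


def normalize_str(v: Any) -> Any:
    # Only strings are normalized; every other type passes through untouched.
    if isinstance(v, str):
        return v.strip().lower()
    return v
```

Returning non-string input unchanged lets a later validation step report the type error, rather than having the normalizer fail on its own.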

validate_uuid1_hex

validate_uuid1_hex(v: Any) -> str

Purpose:
Validates that input is a UUID version 1 and returns its 32-character hexadecimal string (without hyphens).

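Parameters:

  - v: the value to validate; may be of any type.

Returns:

  - the UUID as a 32-character hexadecimal string without hyphens.

Raises:

  - PydanticCustomError if the input is not a valid UUID version 1.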

Example:

validate_uuid1_hex("c9bf9e57-1685-11ee-be56-0242ac120002")
# returns "c9bf9e57168511eebe560242ac120002"
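
The version check can be illustrated with the standard library alone. In the sketch below, uuid1_hex is a hypothetical stand-in that raises ValueError where the documented function raises PydanticCustomError. The error messages are also assumptions.

```python
import uuid
from typing import Any


def uuid1_hex(v: Any) -> str:
    try:
        # Accept an existing UUID object or anything whose string form parses.
        parsed = v if isinstance(v, uuid.UUID) else uuid.UUID(str(v))
    except ValueError as exc:
        raise ValueError("Invalid UUID1 format") from exc
    # The version number is encoded in the first hex digit of the third group.
    if parsed.version != 1:
        raise ValueError("Must be a UUID1 format")
    return parsed.hex  # 32 hex characters, no hyphens


hex_id = uuid1_hex("c9bf9e57-1685-11ee-be56-0242ac120002")
```

Note that the third group of a UUID identifies its version: 11ee marks a version 1 value, whereas a group beginning with 4 (such as 41d4) marks a version 4 value and would be rejected.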

Classes


Base

class Base(BaseModel):
    model_config = ConfigDict(extra="forbid", strict=True)

Description:
Base Pydantic model that enables strict validation and forbids any fields beyond those defined.
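
The effect of the two configuration options can be seen in a small experiment, sketched here under the assumption of Pydantic v2. The Example model and the error_types helper are hypothetical and exist only for this illustration.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class Base(BaseModel):
    model_config = ConfigDict(extra="forbid", strict=True)


class Example(Base):
    # Hypothetical subclass with a single integer field.
    count: int = 0


def error_types(data: dict) -> list[str]:
    """Return the Pydantic error type codes raised for the given input."""
    try:
        Example(**data)
    except ValidationError as exc:
        return [err["type"] for err in exc.errors()]
    return []


ok = error_types({"count": 1})                        # []
coerced = error_types({"count": "1"})                 # ["int_type"]
unknown = error_types({"count": 1, "extra": True})    # ["extra_forbidden"]
```

With strict=True the string "1" is not coerced to an integer, and with extra="forbid" an undeclared key is an error instead of being silently ignored.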


RaptorConfig

class RaptorConfig(Base):
    use_raptor: bool = False
    prompt: str = "Please summarize the following paragraphs. Be careful with the numbers, do not make things up. Paragraphs as following:\n      {cluster_content}\nThe above is the content you need to summarize."
    max_token: int = 256
    threshold: float = 0.1
    max_cluster: int = 64
    random_seed: int = 0

Description:
Configuration model for the Raptor summarization feature with fields controlling usage, prompt template, token limits, thresholds, clustering, and random seed.


GraphragConfig

class GraphragConfig(Base):
    use_graphrag: bool = False
    entity_types: list[str] = ["organization", "person", "geo", "event", "category"]
    method: Literal["light", "general"] = "light"
    community: bool = False
    resolution: bool = False

Description:
Configuration model for Graphrag entity extraction and community detection with options for entity types and method selection.


ParserConfig

class ParserConfig(Base):
    auto_keywords: int = 0
    auto_questions: int = 0
    chunk_token_num: int = 512
    delimiter: str = r"\n"
    graphrag: GraphragConfig = GraphragConfig()
    html4excel: bool = False
    layout_recognize: str = "DeepDOC"
    raptor: RaptorConfig = RaptorConfig()
    tag_kb_ids: list[str] = []
    topn_tags: int = 1
    filename_embd_weight: float | None = 0.1
    task_page_size: int | None = None
    pages: list[list[int]] | None = None

Description:
Parser configuration including keyword and question auto-generation, chunk sizes, Graphrag and Raptor sub-configurations, layout recognition, tagging, and pagination.


CreateDatasetReq

class CreateDatasetReq(Base):
    name: str  # constrained string, max length by DATASET_NAME_LIMIT
    avatar: str | None = None  # Base64-encoded image
    description: str | None = None
    embedding_model: str | None = None  # with alias `embd_id`
    permission: Literal["me", "team"] = "me"
    chunk_method: Literal["naive", "book", "email", ...] = "naive"  # alias `parser_id`
    parser_config: ParserConfig | None = None

Description:
Request model for creating datasets with fields for naming, avatar image, description, embedding model, permissions, chunking method, and parser configurations.
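
Since avatar is described as a Base64-encoded image, a validator for it would need to confirm that the string decodes cleanly. The following is a hypothetical sketch using only the standard library. The function name is_base64_payload and the assumption that the value may carry a data-URI header such as "data:image/png;base64," are not taken from the module.

```python
import base64
import binascii


def is_base64_payload(value: str) -> bool:
    # Assumption: the value may start with a data-URI header; if so, the
    # header must declare base64 encoding and be followed by a comma.
    if value.startswith("data:"):
        header, sep, value = value.partition(",")
        if not sep or not header.endswith(";base64"):
            return False
    try:
        # validate=True rejects characters outside the Base64 alphabet.
        base64.b64decode(value, validate=True)
    except (binascii.Error, ValueError):
        return False
    return bool(value)  # an empty payload is not a usable image


encoded = base64.b64encode(b"\x89PNG\r\n").decode("ascii")
```

This only checks the encoding. Confirming that the decoded bytes are actually an image of an allowed type would require an additional step.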

Notable Validators:


UpdateDatasetReq

class UpdateDatasetReq(CreateDatasetReq):
    dataset_id: str
    name: str = ""
    pagerank: int = 0

Description:
Extends CreateDatasetReq with fields for updating existing datasets, including a validated UUIDv1 dataset_id, optional name, and pagerank score.

Validator:


DeleteReq

class DeleteReq(Base):
    ids: list[str] | None

Description:
Request model for deletion operations accepting a list of dataset IDs.

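Validator:

  - validate_ids: passes each ID through validate_uuid1_hex, raises PydanticCustomError if any ID appears more than once, and otherwise returns the list of hex IDs.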


DeleteDatasetReq

class DeleteDatasetReq(DeleteReq):
    ...

Description:
Empty subclass of DeleteReq used specifically for dataset deletion requests. It adds no fields or validators of its own.


BaseListReq

class BaseListReq(BaseModel):
    id: str | None = None
    name: str | None = None
    page: int = 1
    page_size: int = 30
    orderby: Literal["create_time", "update_time"] = "create_time"
    desc: bool = True

Description:
Base request model for listing datasets. It supports pagination (page, page_size), filtering by id or name, and sorting by creation or update time in ascending or descending order.

Validator:


ListDatasetReq

class ListDatasetReq(BaseListReq):
    ...

Description:
Empty subclass of BaseListReq used to list datasets. It adds no fields or validators of its own.


Important Implementation Details and Algorithms


Interaction with Other System Components


Visual Diagram: Function Flowchart

flowchart TD
    A[validate_and_parse_json_request] -->|calls| B["validator(**payload)"]
    A --> C[format_validation_error_message]
    D[validate_and_parse_request_args] -->|calls| E["validator(**args)"]
    D --> C
    C --> G[returns formatted string]
    H[validate_uuid1_hex] --> I{check if input is UUID v1}
    I -->|valid| J[return hex string]
    I -->|invalid| K[raise PydanticCustomError]
    L[normalize_str] --> M[returns normalized string or original]
    N[DeleteReq.validate_ids] --> O[validate_uuid1_hex for each id]
    O --> P{check duplicates}
    P -->|duplicates| K
    P -->|no duplicates| Q[return list of hex ids]

Summary

validation_utils.py is a foundational module for input validation in the InfiniFlow API backend. It standardizes how request data is validated, normalized, and processed using Pydantic schemas and custom validators. Its strict validation rejects malformed payloads and unexpected fields before they enter the system. The defined request models map closely to dataset-related API operations, which indicates tight integration with dataset management components.

This detailed validation layer improves API reliability, security, and developer experience by providing precise error messages and consistent data handling workflows.