dataset.py

Overview

The dataset.py file implements a set of RESTful API endpoints for managing datasets within the InfiniFlow platform. It provides operations to create, delete, update, and list datasets, along with additional endpoints to manage and retrieve knowledge graph data associated with datasets.

These APIs are secured with token-based authentication and integrate tightly with underlying database services and search infrastructure to maintain dataset information and related documents.

Key functionalities include:

Dataset lifecycle management (CRUD operations)
Dataset permission and ownership controls
Validation and parsing of incoming requests
Interaction with knowledge graph data stored in the document store
Support for paging, filtering, and ordering datasets in listings

API Endpoint Functions

All route handlers are decorated with @token_required, which enforces authentication.
They rely on service classes such as KnowledgebaseService, TenantService, and DocumentService for database access and domain logic.

1. `create(tenant_id)`

Route: POST /datasets
Purpose: Create a new dataset under the specified tenant.

Parameters:

tenant_id (str): ID of the tenant creating the dataset.
JSON body parameters:
- name (str, required): Dataset name.
- avatar (str, optional): Base64-encoded avatar image.
- description (str, optional): Description text.
- embedding_model (str, optional): Embedding model name.
- permission (str, optional): Permission scope ("me" or "team").
- chunk_method (str, optional): Chunking method for document parsing.
- parser_config (object, optional): Configuration for the parser.

Returns:

200 OK with dataset data on success.
Error responses with appropriate messages on validation or database issues.

Implementation Details:

Validates request payload using CreateDatasetReq schema.
Checks for dataset name uniqueness within the tenant.
Fills in the parser configuration and assigns the new dataset a UUID.
Verifies embedding model availability.
Persists the dataset using KnowledgebaseService.
Remaps database fields to API response fields for consistency.

Example Usage:

POST /datasets
Authorization: Bearer <token>
Content-Type: application/json

{
  "name": "My Dataset",
  "embedding_model": "text-embedding-ada-002",
  "chunk_method": "naive"
}
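
The same request can be prepared from Python. The sketch below uses only the standard library and builds the request without sending it. `BASE_URL` and `API_TOKEN` are placeholders assumed for illustration and are not defined by dataset.py; `build_create_request` is a hypothetical helper name.

```python
import json
import urllib.request

# Assumptions for illustration: both values are placeholders, not part of dataset.py.
BASE_URL = "http://localhost:9380/api/v1"
API_TOKEN = "<token>"


def build_create_request(name, **optional_fields):
    """Build (but do not send) a POST /datasets request."""
    payload = {"name": name, **optional_fields}
    return urllib.request.Request(
        url=f"{BASE_URL}/datasets",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_create_request(
    "My Dataset",
    embedding_model="text-embedding-ada-002",
    chunk_method="naive",
)
# To send it against a running server: urllib.request.urlopen(request)
```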

2. `delete(tenant_id)`

Route: DELETE /datasets
Purpose: Delete one or multiple datasets for the tenant.

Parameters:

tenant_id (str): Tenant ID.
JSON body parameters:
- ids (array of str or null, required): IDs of datasets to delete.
  - null deletes all of the tenant's datasets.
  - An empty array deletes nothing.

Returns:

200 OK on full or partial success.
Detailed error messages if some deletions fail or permissions are insufficient.

Implementation Details:

Validates request against DeleteDatasetReq.
Handles three deletion modes: all datasets, specified datasets, or none.
For each dataset:
- Deletes related documents and files via DocumentService, File2DocumentService, and FileService.
- Deletes the dataset record.
Collects and reports any errors encountered during deletions.
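
The three deletion modes can be sketched as a small pure function. This is a simplified illustration and not the handler's actual code: `resolve_ids_to_delete` is a hypothetical name, `owned_ids` stands in for the result of a database lookup, and the error message wording is assumed.

```python
from typing import List, Optional, Tuple


def resolve_ids_to_delete(
    ids: Optional[List[str]], owned_ids: List[str]
) -> Tuple[List[str], List[str]]:
    """Return (ids_to_delete, errors) for the three deletion modes."""
    if ids is None:
        # null: delete every dataset the tenant owns
        return list(owned_ids), []
    if not ids:
        # empty array: delete nothing
        return [], []
    owned = set(owned_ids)
    to_delete, errors = [], []
    for dataset_id in ids:
        if dataset_id in owned:
            to_delete.append(dataset_id)
        else:
            # collect the problem instead of aborting at the first bad ID
            errors.append(f"Dataset '{dataset_id}' is not accessible to this tenant")
    return to_delete, errors
```

Collecting errors in a list rather than raising on the first failure is what allows a response to report partial success.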

3. `update(tenant_id, dataset_id)`

Route: PUT /datasets/<dataset_id>
Purpose: Update dataset properties.

Parameters:

tenant_id (str): Tenant ID.
dataset_id (str): ID of the dataset to update.
JSON body parameters (any subset):
- name (str): New dataset name.
- avatar (str): Updated avatar base64 string.
- description (str): Description update.
- embedding_model (str): New embedding model.
- permission (str): Updated permission ("me" or "team").
- chunk_method (str): Updated chunking method.
- pagerank (int): Page rank value for the dataset.
- parser_config (object): Updated parser config.

Returns:

200 OK with updated dataset data on success.
Error responses for invalid updates or permission issues.

Implementation Details:

Validates and parses request using UpdateDatasetReq.
Merges parser config updates with existing config.
Checks for dataset name conflicts.
Verifies embedding model availability and consistency with existing chunks.
Applies special handling to pagerank updates, depending on the document engine in use.
Uses KnowledgebaseService to update dataset in DB.
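
Merging a partial parser config into the stored one is typically a recursive dictionary merge. The following is a minimal sketch under that assumption; `merge_parser_config` is a hypothetical name, the keys in the usage lines are illustrative only, and the module's own merge utility may behave differently.

```python
import copy


def merge_parser_config(existing: dict, updates: dict) -> dict:
    """Return a new dict with `updates` merged into `existing`, recursing into nested dicts."""
    merged = copy.deepcopy(existing)
    for key, value in updates.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # both sides are dicts: merge key by key instead of overwriting
            merged[key] = merge_parser_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Illustrative keys only; real parser configs depend on the chunk method.
stored = {"chunk_token_num": 128, "raptor": {"use_raptor": False, "max_cluster": 64}}
patch = {"raptor": {"use_raptor": True}}
merged = merge_parser_config(stored, patch)
```

Because the function copies its inputs, the stored config is left untouched if a later validation step rejects the update.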

4. `list_datasets(tenant_id)`

Route: GET /datasets
Purpose: Retrieve a paginated list of datasets accessible by the tenant.

Query Parameters:

id (str, optional): Filter by dataset ID.
name (str, optional): Filter by dataset name.
page (int, optional, default=1): Page number.
page_size (int, optional, default=30): Items per page.
orderby (str, optional, default="create_time"): Sort field.
desc (bool, optional, default=True): Sort in descending order if true.

Returns:

200 OK with an array of dataset objects.

Implementation Details:

Validates query arguments using ListDatasetReq.
Applies filters and pagination.
Checks user permissions on datasets.
Joins tenant info for multi-tenant users.
Maps database fields to API response format.
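
The effect of the query parameters can be illustrated with an in-memory stand-in for the database query. This is a sketch only: `list_page` is a hypothetical function, the rows are plain dictionaries, and the real handler delegates filtering and paging to the service layer.

```python
from typing import Any, Dict, List, Optional


def list_page(
    rows: List[Dict[str, Any]],
    dataset_id: Optional[str] = None,
    name: Optional[str] = None,
    page: int = 1,
    page_size: int = 30,
    orderby: str = "create_time",
    desc: bool = True,
) -> List[Dict[str, Any]]:
    """Filter, sort, and slice rows the way the query parameters describe."""
    selected = [
        row for row in rows
        if (dataset_id is None or row["id"] == dataset_id)
        and (name is None or row["name"] == name)
    ]
    selected.sort(key=lambda row: row[orderby], reverse=desc)
    start = (page - 1) * page_size  # pages are 1-based
    return selected[start:start + page_size]


rows = [
    {"id": "a", "name": "alpha", "create_time": 1},
    {"id": "b", "name": "beta", "create_time": 3},
    {"id": "c", "name": "gamma", "create_time": 2},
]
first_page = list_page(rows, page=1, page_size=2)
```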

5. `knowledge_graph(tenant_id, dataset_id)`

Route: GET /datasets/<dataset_id>/knowledge_graph
Purpose: Retrieve the knowledge graph and mind map data for a dataset.

Parameters:

tenant_id (str): Tenant ID.
dataset_id (str): Dataset ID.

Returns:

200 OK with JSON containing graph and mind_map objects.
Authorization error if tenant lacks access.

Implementation Details:

Checks dataset accessibility.
Queries the document store index for knowledge graph data.
Parses and filters nodes and edges by pagerank and weight.
Limits results to maximum counts for performance.
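
One way to read the filtering and limiting steps is: keep the highest-pagerank nodes, then keep the heaviest edges whose endpoints both survived. The sketch below follows that reading. The function name `trim_graph`, the limits of 256 nodes and 128 edges, and the field names (`id`, `pagerank`, `source`, `target`, `weight`) are assumptions for illustration, not values confirmed by this document.

```python
def trim_graph(graph, max_nodes=256, max_edges=128):
    """Keep the top nodes by pagerank and the top edges by weight between kept nodes."""
    nodes = sorted(
        graph.get("nodes", []),
        key=lambda node: node.get("pagerank", 0),
        reverse=True,
    )[:max_nodes]
    kept_ids = {node["id"] for node in nodes}
    edges = [
        edge for edge in graph.get("edges", [])
        if edge["source"] != edge["target"]  # drop self-loops
        and edge["source"] in kept_ids
        and edge["target"] in kept_ids
    ]
    edges = sorted(edges, key=lambda edge: edge.get("weight", 0), reverse=True)[:max_edges]
    return {"nodes": nodes, "edges": edges}


sample = {
    "nodes": [
        {"id": "A", "pagerank": 0.9},
        {"id": "B", "pagerank": 0.5},
        {"id": "C", "pagerank": 0.1},
    ],
    "edges": [
        {"source": "A", "target": "B", "weight": 1},
        {"source": "A", "target": "C", "weight": 5},
        {"source": "B", "target": "A", "weight": 3},
    ],
}
trimmed = trim_graph(sample, max_nodes=2, max_edges=1)
```

Filtering edges after trimming nodes avoids returning edges that point at nodes the client never receives.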

6. `delete_knowledge_graph(tenant_id, dataset_id)`

Route: DELETE /datasets/<dataset_id>/knowledge_graph
Purpose: Delete the knowledge graph data associated with a dataset.

Parameters:

tenant_id (str): Tenant ID.
dataset_id (str): Dataset ID.

Returns:

200 OK with data=True on success.
Authorization error if permission denied.

Implementation Details:

Checks dataset accessibility.
Deletes documents with knowledge graph keywords from the document store index.

Important Implementation Details

Field Name Transformations:
The API maps certain internal database field names to more user-friendly names in responses, e.g.:
- embd_id → embedding_model
- parser_id → chunk_method
Parser Configuration Handling:
When creating or updating datasets, the parser configuration is either merged with existing config or set based on the chunking method.
Embedding Model Verification:
The system verifies that the embedding model requested is available to the tenant before allowing creation or update.
Database Operations:
All database interactions are wrapped in try-except blocks to catch OperationalError and log exceptions.
Permission Checks:
Before modifying or accessing datasets, the code verifies that the tenant has permission.
Search and Document Store:
The knowledge graph endpoints use the configured document engine (such as Elasticsearch or Infinity) to retrieve and delete graph-related documents.
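
The field name transformation described in this section amounts to renaming dictionary keys before a record is returned, so that the internal names embd_id and parser_id appear as embedding_model and chunk_method in responses. A minimal sketch follows; `FIELD_MAP` and `remap_keys` are hypothetical names, and the module's own remapping utility may differ in signature.

```python
# Internal database field name -> API response field name
FIELD_MAP = {"embd_id": "embedding_model", "parser_id": "chunk_method"}


def remap_keys(record, mapping=FIELD_MAP):
    """Return a copy of `record` with mapped keys renamed and all other keys unchanged."""
    return {mapping.get(key, key): value for key, value in record.items()}


db_row = {"id": "42", "name": "My Dataset", "embd_id": "text-embedding-ada-002", "parser_id": "naive"}
api_row = remap_keys(db_row)
```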

Interaction with Other System Components

Authentication:
Uses @token_required decorator to secure endpoints with bearer tokens.
Database Models and Services:
Uses ORM models from peewee and service classes (KnowledgebaseService, TenantService, DocumentService, etc.) to perform CRUD and query operations on datasets, documents, files, and tenants.
Search Infrastructure:
Accesses settings.docStoreConn and settings.retrievaler to perform knowledge graph document indexing, searching, and deletion.
Validation Utilities:
Request payloads and query parameters are validated using Pydantic-based request schemas (CreateDatasetReq, DeleteDatasetReq, etc.) and helper functions.
Utility Methods:
Uses various utilities for remapping dictionary keys, merging configs, generating UUIDs, and formatting result responses.
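
Since every handler receives a tenant_id argument but no route carries one in its path, a decorator such as @token_required presumably resolves the tenant from the bearer token. The framework-free sketch below shows that pattern only. It is not the project's decorator: `token_required_sketch`, the `TOKENS` table, the `headers` argument, and the response shape are all assumptions made for illustration.

```python
import functools

# Hypothetical token table standing in for the real API token lookup.
TOKENS = {"secret-token": "tenant-1"}


def token_required_sketch(handler):
    """Resolve a tenant from a bearer token and pass it to the handler."""
    @functools.wraps(handler)
    def wrapper(headers, *args, **kwargs):
        scheme, _, token = headers.get("Authorization", "").partition(" ")
        tenant_id = TOKENS.get(token) if scheme == "Bearer" else None
        if tenant_id is None:
            # assumed error shape; the real API formats errors with its own helpers
            return {"code": 401, "message": "Invalid or missing API token"}
        return handler(tenant_id, *args, **kwargs)
    return wrapper


@token_required_sketch
def list_datasets(tenant_id):
    return {"code": 0, "data": f"datasets for {tenant_id}"}
```

Calling `list_datasets({"Authorization": "Bearer secret-token"})` reaches the handler with the resolved tenant, while a missing or unknown token is rejected before the handler runs.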

Visual Diagram - Class and Function Structure

Below is a Mermaid class diagram representing the main functions and their key relationships in this file. Since this file primarily defines route handler functions rather than classes, the diagram shows these as "classes" with their main responsibilities and usage of key services.

classDiagram
    class DatasetAPI {
        +create(tenant_id)
        +delete(tenant_id)
        +update(tenant_id, dataset_id)
        +list_datasets(tenant_id)
        +knowledge_graph(tenant_id, dataset_id)
        +delete_knowledge_graph(tenant_id, dataset_id)
    }

    class KnowledgebaseService {
        +get_or_none()
        +save()
        +get_by_id()
        +delete_by_id()
        +update_by_id()
        +accessible()
        +get_list()
    }

    class TenantService {
        +get_by_id()
        +get_joined_tenants_by_user_id()
    }

    class DocumentService {
        +query()
        +remove_document()
    }

    class File2DocumentService {
        +get_by_document_id()
        +delete_by_document_id()
    }

    class FileService {
        +filter_delete()
    }

    DatasetAPI ..> KnowledgebaseService : uses
    DatasetAPI ..> TenantService : uses
    DatasetAPI ..> DocumentService : uses
    DatasetAPI ..> File2DocumentService : uses
    DatasetAPI ..> FileService : uses

Summary

The dataset.py module is the API layer for managing datasets in the InfiniFlow application. It provides authenticated, validated, and permission-aware endpoints for dataset lifecycle operations and knowledge graph management, and it hides the underlying database operations and search infrastructure behind REST interfaces.
