dataset.py
Overview
The dataset.py file implements a set of RESTful API endpoints for managing datasets within the InfiniFlow platform. It provides operations to create, delete, update, and list datasets, along with additional endpoints to manage and retrieve knowledge graph data associated with datasets.
These APIs are secured with token-based authentication and integrate tightly with underlying database services and search infrastructure to maintain dataset information and related documents.
Key functionalities include:
Dataset lifecycle management (CRUD operations)
Dataset permission and ownership controls
Validation and parsing of incoming requests
Interaction with knowledge graph data stored in the document store
Support for paging, filtering, and ordering datasets in listings
API Endpoint Functions
All route handlers are decorated with @token_required, enforcing authentication.
They use various services such as KnowledgebaseService, TenantService, DocumentService, and others for database and domain logic.
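To illustrate the decorator pattern described here, the following is a minimal, framework-free sketch. It assumes the decorator resolves a bearer token to a tenant ID and passes that ID to the handler as its first argument. The API_TOKENS table, the headers argument, and the error shape are hypothetical stand-ins, not the module's actual implementation.

```python
from functools import wraps

# Hypothetical token table; the real module resolves tokens through its own services.
API_TOKENS = {"secret-token": "tenant-1"}


def token_required(handler):
    """Resolve a bearer token to a tenant ID and pass it to the handler."""

    @wraps(handler)
    def wrapper(headers, *args, **kwargs):
        # Split "Bearer <token>" into its scheme and token parts.
        scheme, _, token = headers.get("Authorization", "").partition(" ")
        tenant_id = API_TOKENS.get(token) if scheme == "Bearer" else None
        if tenant_id is None:
            # Hypothetical error shape, not the module's actual response format.
            return {"code": 401, "message": "Invalid or missing API token"}
        return handler(tenant_id, *args, **kwargs)

    return wrapper


@token_required
def list_datasets(tenant_id):
    # Stand-in handler that only echoes the resolved tenant.
    return {"code": 0, "data": [], "tenant_id": tenant_id}
```

Calling list_datasets({"Authorization": "Bearer secret-token"}) reaches the handler with tenant_id set to "tenant-1", while a missing or non-bearer header returns the error dictionary without invoking the handler.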
1. create(tenant_id)
Route: POST /datasets
Purpose: Create a new dataset under the specified tenant.
Parameters:
tenant_id (str): ID of the tenant creating the dataset.
JSON body parameters:
name (str, required): Dataset name.
avatar (str, optional): Base64-encoded avatar image.
description (str, optional): Description text.
embedding_model (str, optional): Embedding model name.
permission (str, optional): Permission scope ("me" or "team").
chunk_method (str, optional): Chunking method for document parsing.
parser_config (object, optional): Configuration for the parser.
Returns:
200 OK with dataset data on success.
Error responses with appropriate messages on validation or database issues.
Implementation Details:
Validates request payload using the CreateDatasetReq schema.
Checks for dataset name uniqueness within the tenant.
Fills in parser config and assigns a UUID.
Verifies embedding model availability.
Persists the dataset using KnowledgebaseService.
Remaps database fields to API response fields for consistency.
Example Usage:
POST /datasets
Authorization: Bearer <token>
Content-Type: application/json

{
  "name": "My Dataset",
  "embedding_model": "text-embedding-ada-002",
  "chunk_method": "naive"
}
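The same request can be built from Python with the standard library. This is a minimal sketch: BASE_URL is a hypothetical placeholder for your deployment's address and API prefix, and build_create_request is an illustrative helper rather than part of any SDK. The request is only constructed here, not sent.

```python
import json
import urllib.request

# Hypothetical base URL; substitute the address and API prefix of your deployment.
BASE_URL = "http://localhost:8000"


def build_create_request(token, name, **optional_fields):
    """Build (but do not send) a POST /datasets request."""
    body = {"name": name, **optional_fields}
    return urllib.request.Request(
        url=f"{BASE_URL}/datasets",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_create_request(
    "<token>",
    "My Dataset",
    embedding_model="text-embedding-ada-002",
    chunk_method="naive",
)
# To send it against a running server: urllib.request.urlopen(request)
```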
2. delete(tenant_id)
Route: DELETE /datasets
Purpose: Delete one or multiple datasets for the tenant.
Parameters:
tenant_id (str): Tenant ID.
JSON body parameters:
ids (array of str or null, required): IDs of datasets to delete. null deletes all datasets; an empty array deletes none.
Returns:
200 OK on full or partial success.
Detailed error messages if some deletions fail or permissions are insufficient.
Implementation Details:
Validates request against DeleteDatasetReq.
Handles three deletion modes: all datasets, specified datasets, or none.
For each dataset:
Deletes related documents and files via DocumentService, File2DocumentService, and FileService.
Deletes the dataset record.
Collects and reports any errors encountered during deletions.
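The three deletion modes can be sketched as a small pure function. This is an illustration under assumptions: resolve_ids_to_delete and owned_ids are hypothetical names, and how the real handler treats IDs the tenant does not own is not stated here, so the sketch simply reports them as errors.

```python
def resolve_ids_to_delete(ids, owned_ids):
    """Map the request's `ids` value to (dataset IDs to delete, error messages)."""
    if ids is None:
        # null in the request body: delete every dataset the tenant owns.
        return list(owned_ids), []

    targets, errors = [], []
    for dataset_id in ids:  # An empty list falls through and deletes nothing.
        if dataset_id in owned_ids:
            targets.append(dataset_id)
        else:
            # Assumed behavior: collect the problem instead of aborting.
            errors.append(f"Dataset '{dataset_id}' not found or not owned by the tenant")
    return targets, errors
```

For example, with owned_ids = ["a", "b"], passing None selects both datasets, passing [] selects none, and passing ["a", "x"] selects "a" and reports one error for "x".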
3. update(tenant_id, dataset_id)
Route: PUT /datasets/<dataset_id>
Purpose: Update dataset properties.
Parameters:
tenant_id (str): Tenant ID.
dataset_id (str): ID of the dataset to update.
JSON body parameters (any subset):
name (str): New dataset name.
avatar (str): Updated avatar base64 string.
description (str): Description update.
embedding_model (str): New embedding model.
permission (str): Updated permission ("me" or "team").
chunk_method (str): Updated chunking method.
pagerank (int): Page rank integer.
parser_config (object): Updated parser config.
Returns:
200 OK with updated dataset data on success.
Error responses for invalid updates or permission issues.
Implementation Details:
Validates and parses request using UpdateDatasetReq.
Merges parser config updates with existing config.
Checks for dataset name conflicts.
Verifies embedding model availability and consistency with existing chunks.
Handles special logic for pagerank updates conditioned on document engine type.
Uses KnowledgebaseService to update dataset in DB.
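The parser config merge step can be pictured as a recursive dictionary merge. The sketch below assumes nested keys are merged rather than replaced wholesale; the function name and the config keys in the usage note are illustrative only.

```python
import copy


def deep_merge(existing, updates):
    """Return a new dict with `updates` merged into `existing`, recursing into nested dicts."""
    merged = copy.deepcopy(existing)
    for key, value in updates.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge them key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, and new keys overwrite the existing value.
            merged[key] = copy.deepcopy(value)
    return merged
```

Merging {"raptor": {"use_raptor": True}} into {"chunk_token_num": 128, "raptor": {"use_raptor": False, "max_token": 256}} changes only use_raptor and leaves chunk_token_num and max_token untouched. The stored config is not mutated, so a failed validation later in the handler would not leave it half-updated.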
4. list_datasets(tenant_id)
Route: GET /datasets
Purpose: Retrieve a paginated list of datasets accessible by the tenant.
Query Parameters:
id (str, optional): Filter by dataset ID.
name (str, optional): Filter by dataset name.
page (int, optional, default=1): Page number.
page_size (int, optional, default=30): Items per page.
orderby (str, optional, default="create_time"): Sort field.
desc (bool, optional, default=True): Sort order descending if true.
Returns:
200 OK with an array of dataset objects.
Implementation Details:
Validates query arguments using ListDatasetReq.
Applies filters and pagination.
Checks user permissions on datasets.
Joins tenant info for multi-tenant users.
Maps database fields to API response format.
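The filtering, ordering, and paging semantics of the query parameters can be sketched over an in-memory list. This is an assumption-laden illustration: the real handler delegates to database queries, and exact-match filtering is assumed here. The defaults mirror the documented ones.

```python
def list_page(datasets, dataset_id=None, name=None, page=1, page_size=30,
              orderby="create_time", desc=True):
    """Filter, sort, and slice a list of dataset dicts like the query parameters describe."""
    rows = [
        d for d in datasets
        if (dataset_id is None or d["id"] == dataset_id)
        and (name is None or d["name"] == name)  # Exact match is an assumption.
    ]
    rows.sort(key=lambda d: d[orderby], reverse=desc)
    start = (page - 1) * page_size  # Pages are 1-indexed.
    return rows[start:start + page_size]
```

With three datasets and page_size=2, page 1 returns the two most recently created entries and page 2 returns the remaining one.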
5. knowledge_graph(tenant_id, dataset_id)
Route: GET /datasets/<dataset_id>/knowledge_graph
Purpose: Retrieve the knowledge graph and mind map data for a dataset.
Parameters:
tenant_id (str): Tenant ID.
dataset_id (str): Dataset ID.
Returns:
200 OK with JSON containing graph and mind_map objects.
Authorization error if tenant lacks access.
Implementation Details:
Checks dataset accessibility.
Queries the document store index for knowledge graph data.
Parses and filters nodes and edges by pagerank and weight.
Limits results to maximum counts for performance.
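The trimming step can be sketched as follows. This is a hedged illustration: the node and edge field names (id, pagerank, source, target, weight) and the rule of dropping edges whose endpoints were trimmed are assumptions, and the maximum counts are left as required arguments because their actual values are not given here.

```python
def trim_graph(graph, max_nodes, max_edges):
    """Keep the highest-pagerank nodes and the heaviest edges between them."""
    nodes = sorted(
        graph.get("nodes", []),
        key=lambda n: n.get("pagerank", 0),
        reverse=True,
    )[:max_nodes]
    kept_ids = {n["id"] for n in nodes}

    # Assumed rule: drop self-loops and edges that touch a trimmed node.
    edges = [
        e for e in graph.get("edges", [])
        if e["source"] != e["target"]
        and e["source"] in kept_ids
        and e["target"] in kept_ids
    ]
    edges = sorted(edges, key=lambda e: e.get("weight", 0), reverse=True)[:max_edges]
    return {"nodes": nodes, "edges": edges}
```

Capping both lists keeps the response size bounded no matter how large the stored graph grows.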
6. delete_knowledge_graph(tenant_id, dataset_id)
Route: DELETE /datasets/<dataset_id>/knowledge_graph
Purpose: Delete the knowledge graph data associated with a dataset.
Parameters:
tenant_id (str): Tenant ID.
dataset_id (str): Dataset ID.
Returns:
200 OK with data=True on success.
Authorization error if permission denied.
Implementation Details:
Checks dataset accessibility.
Deletes documents with knowledge graph keywords from the document store index.
Important Implementation Details
Field Name Transformations:
The API maps certain internal database field names to more user-friendly ones in responses, e.g.:
embd_id → embedding_model
parser_id → chunk_method
Parser Configuration Handling:
When creating or updating datasets, the parser configuration is either merged with the existing config or set based on the chunking method.
Embedding Model Verification:
The system verifies that the embedding model requested is available to the tenant before allowing creation or update.
Database Operations:
All database interactions are wrapped in try-except blocks to catch OperationalError and log exceptions.
Permission Checks:
Before modifying or accessing datasets, the code verifies that the tenant has permission.
Search and Document Store:
The knowledge graph endpoints interact with a search index (possibly Elasticsearch or the Infinity doc engine) to retrieve and delete graph-related documents.
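The field name transformation can be sketched as a key-remapping helper. It assumes embd_id and parser_id are the stored column names, with embedding_model and chunk_method as the names exposed by the API (matching the request parameters documented above). remap_keys and FIELD_MAP are illustrative names, not the module's actual utilities.

```python
# Assumed mapping from stored column names to API field names.
FIELD_MAP = {"embd_id": "embedding_model", "parser_id": "chunk_method"}


def remap_keys(record, key_map=FIELD_MAP):
    """Return a copy of `record` with mapped keys renamed and all other keys unchanged."""
    return {key_map.get(key, key): value for key, value in record.items()}
```

A stored row such as {"id": "kb1", "embd_id": "text-embedding-ada-002", "parser_id": "naive"} would be returned to the client with embedding_model and chunk_method keys, while id passes through unchanged.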
Interaction with Other System Components
Authentication:
Uses the @token_required decorator to secure endpoints with bearer tokens.
Database Models and Services:
Uses ORM models from peewee and service classes (KnowledgebaseService, TenantService, DocumentService, etc.) to perform CRUD and query operations on datasets, documents, files, and tenants.
Search Infrastructure:
Accesses settings.docStoreConn and settings.retrievaler to perform knowledge graph document indexing, searching, and deletion.
Validation Utilities:
Request payloads and query parameters are validated using Pydantic-based request schemas (CreateDatasetReq, DeleteDatasetReq, etc.) and helper functions.
Utility Methods:
Uses various utilities for remapping dictionary keys, merging configs, generating UUIDs, and formatting result responses.
Visual Diagram - Class and Function Structure
Below is a Mermaid class diagram representing the main functions and their key relationships in this file. Since this file primarily defines route handler functions rather than classes, the diagram shows these as "classes" with their main responsibilities and usage of key services.
classDiagram
class DatasetAPI {
+create(tenant_id)
+delete(tenant_id)
+update(tenant_id, dataset_id)
+list_datasets(tenant_id)
+knowledge_graph(tenant_id, dataset_id)
+delete_knowledge_graph(tenant_id, dataset_id)
}
class KnowledgebaseService {
+get_or_none()
+save()
+get_by_id()
+delete_by_id()
+update_by_id()
+accessible()
+get_list()
}
class TenantService {
+get_by_id()
+get_joined_tenants_by_user_id()
}
class DocumentService {
+query()
+remove_document()
}
class File2DocumentService {
+get_by_document_id()
+delete_by_document_id()
}
class FileService {
+filter_delete()
}
DatasetAPI ..> KnowledgebaseService : uses
DatasetAPI ..> TenantService : uses
DatasetAPI ..> DocumentService : uses
DatasetAPI ..> File2DocumentService : uses
DatasetAPI ..> FileService : uses
Summary
The dataset.py module serves as the API layer for managing datasets in the InfiniFlow application. It provides secure, validated, and permission-aware endpoints for dataset lifecycle operations and knowledge graph management. It abstracts complex database operations and integration with search infrastructure behind clean REST interfaces, making it a crucial component for dataset and knowledge management workflows within the system.