dataset.py
Overview
The dataset.py file is a core module responsible for managing datasets within the InfiniFlow system. It provides the DataSet class, which encapsulates dataset properties and exposes methods to interact with dataset resources such as documents and parsing configurations.
Through this class, users can create, update, upload, list, delete, and asynchronously parse documents within a dataset. The file also defines a nested ParserConfig class to handle dataset parsing configurations.
The file relies on a base class Base from which it inherits common RESTful API interaction methods (get, post, put, rm), and it integrates the Document class to represent individual documents belonging to a dataset.
Classes and Methods
class DataSet(Base)
Represents a dataset entity in the InfiniFlow platform. It manages dataset metadata and provides interfaces for manipulating documents and parsing tasks within the dataset.
Properties
id (str): Unique identifier of the dataset.
name (str): Human-readable name of the dataset.
avatar (str): URL or reference to an avatar image representing the dataset.
tenant_id (optional, any): Identifier for the tenant owning this dataset.
description (str): Description text about the dataset.
embedding_model (str): Name or identifier of the embedding model used.
permission (str): Permission level, defaulting to "me".
document_count (int): Number of documents contained.
chunk_count (int): Number of chunks (parsed segments) in the dataset.
chunk_method (str): Method used for chunking documents, defaults to "naive".
parser_config (optional ParserConfig): Configuration for document parsing.
pagerank (int): Pagerank value for dataset ordering or relevance.
Constructor
def __init__(self, rag, res_dict):
Initializes a DataSet instance.
Parameters:
rag: Context or client object used for API interactions.
res_dict (dict): Dictionary containing dataset resource data.
The constructor initializes dataset fields from res_dict and removes any keys from res_dict that do not correspond to properties of the class before calling the superclass initializer.
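The key-filtering step described above can be sketched as follows. This is a minimal illustration rather than the module's actual source: the stand-in Base class, the SketchDataSet name, and the reduced set of fields are assumptions made so that the example runs on its own.

```python
class Base:
    """Stand-in for the SDK's Base class (assumption): copies dict entries onto the instance."""

    def __init__(self, rag, res_dict):
        self.rag = rag
        for key, value in res_dict.items():
            setattr(self, key, value)


class SketchDataSet(Base):
    """Hypothetical, trimmed-down dataset holding only a few of the documented fields."""

    def __init__(self, rag, res_dict):
        # Defaults for a subset of the documented properties.
        self.id = ""
        self.name = ""
        self.permission = "me"
        self.chunk_method = "naive"
        # Drop every key that does not match an attribute defined above,
        # so unknown server fields never reach the base initializer.
        for key in list(res_dict.keys()):
            if key not in self.__dict__:
                res_dict.pop(key)
        super().__init__(rag, res_dict)


payload = {"id": "ds1", "name": "Reports", "unexpected_field": 123}
dataset = SketchDataSet(rag=None, res_dict=payload)
print(dataset.id, dataset.name, dataset.permission)
print(hasattr(dataset, "unexpected_field"))
```

Note that in this sketch the caller's dictionary is modified in place; pass a copy if the original payload is still needed afterwards.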
Nested Class: ParserConfig
class ParserConfig(Base):
    def __init__(self, rag, res_dict):
        super().__init__(rag, res_dict)
A simple configuration holder class for parser settings associated with a dataset.
Inherits from Base to leverage RESTful capabilities if needed.
Methods:
update
def update(self, update_message: dict) -> DataSet
Updates the dataset metadata on the server.
Parameters:
update_message(dict): Dictionary containing fields to update.
Returns:
The updated DataSet instance.
Raises:
Exception with server error message if update fails.
Usage example:
dataset = DataSet(rag, existing_data)
updated_dataset = dataset.update({"name": "New Dataset Name", "description": "Updated description"})
upload_documents
def upload_documents(self, document_list: list[dict]) -> list[Document]
Uploads multiple documents to the dataset.
Parameters:
document_list: List of dictionaries, each with keys "display_name" (filename) and "blob" (file content as bytes).
Returns:
List of Document instances representing the uploaded documents.
Raises:
Exception with server error message on failure.
Implementation detail:
Uses multipart file upload via the post method.
Usage example:
docs = [
    {"display_name": "file1.pdf", "blob": file1_bytes},
    {"display_name": "file2.txt", "blob": file2_bytes},
]
uploaded_docs = dataset.upload_documents(docs)
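The usage example above assumes the file contents are already in memory. A small helper for building document_list from files on disk might look like the sketch below; build_document_list is a hypothetical name and not part of dataset.py.

```python
from pathlib import Path


def build_document_list(paths):
    """Read each file and return dicts in the shape upload_documents expects."""
    documents = []
    for path in map(Path, paths):
        documents.append(
            {
                "display_name": path.name,  # filename shown in the dataset
                "blob": path.read_bytes(),  # raw file content as bytes
            }
        )
    return documents
```

With a DataSet instance at hand, the result can be passed straight through, for example dataset.upload_documents(build_document_list(["report.pdf", "notes.txt"])).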
list_documents
def list_documents(
    self,
    id: str | None = None,
    name: str | None = None,
    keywords: str | None = None,
    page: int = 1,
    page_size: int = 30,
    orderby: str = "create_time",
    desc: bool = True,
    create_time_from: int = 0,
    create_time_to: int = 0,
) -> list[Document]
Retrieves a paginated and filtered list of documents in the dataset.
Parameters:
id: Filter by document ID.
name: Filter by document name.
keywords: Search keywords in documents.
page: Page number for pagination.
page_size: Number of documents per page.
orderby: Field to order results by (default: "create_time").
desc: Whether to sort descending (default: True).
create_time_from: Filter documents created after this timestamp.
create_time_to: Filter documents created before this timestamp.
Returns:
List of Document instances matching the criteria.
Raises:
Exception with server error message on failure.
Usage example:
docs = dataset.list_documents(keywords="report", page=2, page_size=10)
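Because list_documents returns one page at a time, collecting every document means requesting successive pages. The sketch below shows one way to do that. It assumes that a page shorter than page_size marks the end of the results, and it uses a hypothetical FakeDataSet so that it runs without a server; neither iter_all_documents nor FakeDataSet exists in dataset.py.

```python
def iter_all_documents(dataset, page_size=30, **filters):
    """Yield every document by requesting pages until one comes back short."""
    page = 1
    while True:
        batch = dataset.list_documents(page=page, page_size=page_size, **filters)
        yield from batch
        if len(batch) < page_size:
            break
        page += 1


class FakeDataSet:
    """Hypothetical in-memory stand-in that mimics page/page_size slicing."""

    def __init__(self, names):
        self._names = names

    def list_documents(self, page=1, page_size=30, **_ignored):
        start = (page - 1) * page_size
        return self._names[start:start + page_size]


all_docs = list(iter_all_documents(FakeDataSet(["a", "b", "c", "d", "e"]), page_size=2))
print(all_docs)
```

Extra keyword arguments such as keywords="report" are forwarded unchanged to each list_documents call.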
delete_documents
def delete_documents(self, ids: list[str] | None = None) -> None
Deletes documents from the dataset by their IDs.
Parameters:
ids: List of document IDs to delete. If None, no documents are deleted.
Raises:
Exception with server error message if deletion fails.
Usage example:
dataset.delete_documents(ids=["doc123", "doc456"])
async_parse_documents
def async_parse_documents(self, document_ids: list[str]) -> None
Initiates asynchronous parsing of specified documents into chunks.
Parameters:
document_ids: List of document IDs to parse.
Raises:
Exception with server error message on failure.
Usage example:
dataset.async_parse_documents(["doc123", "doc456"])
async_cancel_parse_documents
def async_cancel_parse_documents(self, document_ids: list[str]) -> None
Cancels ongoing asynchronous parsing tasks for specified documents.
Parameters:
document_ids: List of document IDs whose parsing tasks should be cancelled.
Raises:
Exception with server error message on failure.
Usage example:
dataset.async_cancel_parse_documents(["doc123"])
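Since async_parse_documents only starts the parsing tasks, callers that need to wait for completion have to poll. The sketch below combines the two async methods: it starts parsing, polls a caller-supplied check, and cancels on timeout. The parse_and_wait helper and its is_finished callback are assumptions for illustration; how parsing status is actually determined is not described in this document.

```python
import time


def parse_and_wait(dataset, document_ids, is_finished, timeout=60.0, interval=1.0):
    """Start parsing, then poll is_finished(document_ids) until it returns True.

    If the timeout elapses first, the remaining parsing tasks are cancelled
    and False is returned.
    """
    dataset.async_parse_documents(document_ids)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_finished(document_ids):
            return True
        time.sleep(interval)
    dataset.async_cancel_parse_documents(document_ids)
    return False
```

The callback keeps the helper independent of any particular status field; it could, for instance, re-list the documents and inspect whatever progress information they expose.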
Important Implementation Details
The class relies on an inherited Base class for HTTP REST calls (get, post, put, rm).
Methods handle JSON responses and raise exceptions on error codes.
Document uploads use multipart file upload with tuples specifying filename and binary blob.
The constructor sanitizes the input dictionary to ensure only valid dataset attributes remain before passing it to the base initializer.
The class is tightly coupled to the Document class, creating Document instances when listing or uploading documents.
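The "handle JSON responses and raise exceptions on error codes" pattern noted above can be sketched as a small helper. The response shape used here, a "code" of 0 with a "data" field on success and a "message" field on failure, is an assumption and is not confirmed by this document; unwrap_response is a hypothetical name.

```python
def unwrap_response(payload):
    """Return the data part of a decoded JSON response or raise with the server message.

    Assumed shape: {"code": 0, "data": ...} on success,
    {"code": <nonzero>, "message": "..."} on failure.
    """
    if payload.get("code") == 0:
        return payload.get("data")
    raise Exception(payload.get("message", "unknown server error"))


try:
    unwrap_response({"code": 102, "message": "dataset not found"})
except Exception as error:
    print(f"request failed: {error}")
```

Callers of the DataSet methods can wrap calls in the same kind of try/except block to surface the server's error message.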
Integration and Interactions
With the Base class: Provides RESTful API methods. DataSet is a specialized resource manager extending this base functionality.
With the Document class: Represents documents inside datasets. DataSet returns Document instances when listing or uploading documents.
With the backend API: All dataset and document operations correspond to HTTP endpoints under /datasets/{dataset_id} and related paths.
With the rag client/context: Passed throughout for API context, authentication, and request handling.
This file is a fundamental component enabling dataset lifecycle management and document operations in the InfiniFlow application.
Diagram: Class Structure of dataset.py
classDiagram
class DataSet {
-id: str
-name: str
-avatar: str
-tenant_id
-description: str
-embedding_model: str
-permission: str
-document_count: int
-chunk_count: int
-chunk_method: str
-parser_config: ParserConfig
-pagerank: int
+__init__(rag, res_dict)
+update(update_message: dict) DataSet
+upload_documents(document_list: list[dict]) list~Document~
+list_documents(id: str=None, name: str=None, keywords: str=None, page: int=1, page_size: int=30, orderby: str="create_time", desc: bool=True, create_time_from: int=0, create_time_to: int=0) list~Document~
+delete_documents(ids: list~str~=None) None
+async_parse_documents(document_ids: list~str~) None
+async_cancel_parse_documents(document_ids: list~str~) None
}
class ParserConfig {
+__init__(rag, res_dict)
}
DataSet o-- ParserConfig : "has a"
Base <|-- DataSet : "inherits"
Base <|-- ParserConfig : "inherits"
DataSet --> Document : "uses"
Summary
The dataset.py file defines the DataSet class, a key interface for managing datasets and their documents in the InfiniFlow system. It supports dataset metadata updates, document upload/list/delete, and asynchronous parsing operations by interacting with a backend API through inherited REST methods. The nested ParserConfig class encapsulates parsing configurations. This module is essential for dataset lifecycle management and integrates closely with the Document class and the overall API client context.