dataset.py

Overview

The dataset.py file is a core module responsible for managing datasets within the InfiniFlow system. It provides the DataSet class, which encapsulates dataset properties and exposes methods to interact with dataset resources such as documents and parsing configurations.

Through this class, users can update dataset metadata; upload, list, and delete documents; and start or cancel asynchronous parsing of documents. The file also defines a nested ParserConfig class that holds the dataset's parsing configuration.

DataSet inherits from Base, which supplies the common REST helper methods (get, post, put, rm), and uses the Document class to represent the individual documents that belong to a dataset.

Classes and Methods

`class DataSet(Base)`

Represents a dataset entity in the InfiniFlow platform. It manages dataset metadata and provides interfaces for manipulating documents and parsing tasks within the dataset.

Properties

- id (str): Unique identifier of the dataset.
- name (str): Human-readable name of the dataset.
- avatar (str): URL or reference to an avatar image representing the dataset.
- tenant_id (optional, any): Identifier for the tenant owning this dataset.
- description (str): Description text about the dataset.
- embedding_model (str): Name or identifier of the embedding model used.
- permission (str): Permission level, defaulting to "me".
- document_count (int): Number of documents contained.
- chunk_count (int): Number of chunks (parsed segments) in the dataset.
- chunk_method (str): Method used for chunking documents, defaults to "naive".
- parser_config (optional ParserConfig): Configuration for document parsing.
- pagerank (int): Pagerank value for dataset ordering or relevance.

Constructor

def __init__(self, rag, res_dict):

Initializes a DataSet instance.
Parameters:
- rag: Context or client object used for API interactions.
- res_dict (dict): Dictionary containing dataset resource data.
The constructor initializes dataset fields from res_dict and removes any keys from res_dict that do not correspond to properties of the class before calling the superclass initializer.
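
The key-filtering step can be sketched as follows. This is a minimal illustration rather than the module's actual source: `SketchBase` is a hypothetical stand-in for Base that simply copies dictionary entries onto the instance, and only three of the dataset fields are shown.

```python
class SketchBase:
    """Hypothetical stand-in for Base: copies each dict entry onto the instance."""

    def __init__(self, rag, res_dict):
        self.rag = rag
        for key, value in res_dict.items():
            setattr(self, key, value)


class SketchDataSet(SketchBase):
    def __init__(self, rag, res_dict):
        # Declare the known fields with their defaults first.
        self.id = ""
        self.name = ""
        self.chunk_method = "naive"
        # Drop every key that does not match a declared field.
        # Note that this mutates the caller's dictionary in place.
        for key in list(res_dict.keys()):
            if key not in self.__dict__:
                res_dict.pop(key)
        super().__init__(rag, res_dict)


payload = {"id": "ds1", "name": "Reports", "unexpected": 42}
ds = SketchDataSet(rag=None, res_dict=payload)
```

In this sketch the unknown key is discarded before the base initializer runs, so it never becomes an attribute, while fields missing from the payload keep their defaults.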

Nested Class: `ParserConfig`

class ParserConfig(Base):
    def __init__(self, rag, res_dict):
        super().__init__(rag, res_dict)

A simple configuration holder class for parser settings associated with a dataset.
Inherits from Base to leverage RESTful capabilities if needed.

Methods:

`update`

def update(self, update_message: dict) -> DataSet

Updates the dataset metadata on the server.
Parameters:
- update_message (dict): Dictionary containing fields to update.
Returns:
- The updated DataSet instance.
Raises:
- Exception with server error message if update fails.
Usage example:

dataset = DataSet(rag, existing_data)
updated_dataset = dataset.update({"name": "New Dataset Name", "description": "Updated description"})
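
The request-then-check flow behind update might look like the sketch below. It assumes the server answers with a JSON body carrying a numeric `code` (0 on success), a `message`, and a `data` object; those field names, `FakeResponse`, and the stubbed `put` are assumptions made for illustration, not confirmed details of the module.

```python
class FakeResponse:
    """Hypothetical stand-in for an HTTP response object."""

    def __init__(self, payload):
        self._payload = payload

    def json(self):
        return self._payload


class SketchDataSet:
    def __init__(self, dataset_id, name):
        self.id = dataset_id
        self.name = name

    def put(self, path, body):
        # Stub for the inherited put helper: rejects an empty name, accepts the rest.
        if not body.get("name", self.name):
            return FakeResponse({"code": 102, "message": "name must not be empty"})
        return FakeResponse({"code": 0, "data": body})

    def update(self, update_message):
        res = self.put(f"/datasets/{self.id}", update_message).json()
        if res.get("code") != 0:
            # Surface the server's error message to the caller.
            raise Exception(res["message"])
        # Apply the accepted fields locally and return the same instance.
        for key, value in res.get("data", {}).items():
            setattr(self, key, value)
        return self
```

Returning the instance itself lets callers chain further calls or keep using the variable they already hold.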

`upload_documents`

def upload_documents(self, document_list: list[dict]) -> list[Document]

Uploads multiple documents to the dataset.
Parameters:
- document_list: List of dictionaries, each with keys "display_name" (filename) and "blob" (file content/bytes).
Returns:
- List of Document instances representing the uploaded documents.
Raises:
- Exception with server error message on failure.
Implementation detail:
- Uses multipart file upload via post method.
Usage example:

docs = [
    {"display_name": "file1.pdf", "blob": file1_bytes},
    {"display_name": "file2.txt", "blob": file2_bytes},
]
uploaded_docs = dataset.upload_documents(docs)
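
A sketch of how such a list can be turned into multipart parts, using the (filename, blob) tuple shape mentioned under the implementation detail. The form field name "file" and the helper name `to_multipart_parts` are assumptions; the tuple layout follows the convention accepted by the `files` argument of the requests library.

```python
def to_multipart_parts(document_list, field_name="file"):
    """Convert upload dictionaries into (field, (filename, blob)) tuples."""
    parts = []
    for index, doc in enumerate(document_list):
        # Fail early, before any request is sent, if an entry is incomplete.
        missing = {"display_name", "blob"} - doc.keys()
        if missing:
            raise ValueError(f"document {index} is missing keys: {sorted(missing)}")
        parts.append((field_name, (doc["display_name"], doc["blob"])))
    return parts


docs = [
    {"display_name": "file1.pdf", "blob": b"%PDF-1.4 ..."},
    {"display_name": "file2.txt", "blob": b"hello"},
]
parts = to_multipart_parts(docs)
```

Validating the keys up front gives a clearer error than a KeyError raised halfway through building the request.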

`list_documents`

def list_documents(
    self,
    id: str | None = None,
    name: str | None = None,
    keywords: str | None = None,
    page: int = 1,
    page_size: int = 30,
    orderby: str = "create_time",
    desc: bool = True,
    create_time_from: int = 0,
    create_time_to: int = 0,
) -> list[Document]

Retrieves a paginated and filtered list of documents in the dataset.
Parameters:
- id: Filter by document ID.
- name: Filter by document name.
- keywords: Search keywords in documents.
- page: Page number for pagination.
- page_size: Number of documents per page.
- orderby: Field to order results by (default: "create_time").
- desc: Whether to sort descending (default: True).
- create_time_from: Filter documents created after this timestamp.
- create_time_to: Filter documents created before this timestamp.
Returns:
- List of Document instances matching the criteria.
Raises:
- Exception with server error message on failure.
Usage example:

docs = dataset.list_documents(keywords="report", page=2, page_size=10)
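
Because results are paginated, collecting every document means walking the pages until one comes back short. The helper below is a hypothetical convenience, not part of the module; it relies only on the page and page_size parameters documented above and assumes that a page shorter than page_size marks the end. `FakeDataSet` exists only to exercise the helper.

```python
def iter_all_documents(dataset, page_size=30, **filters):
    """Yield every document by requesting successive pages."""
    page = 1
    while True:
        batch = dataset.list_documents(page=page, page_size=page_size, **filters)
        yield from batch
        if len(batch) < page_size:
            break  # a short (or empty) page means there is nothing left
        page += 1


class FakeDataSet:
    """In-memory stand-in used only to exercise the helper."""

    def __init__(self, docs):
        self._docs = docs

    def list_documents(self, page=1, page_size=30, **_ignored):
        start = (page - 1) * page_size
        return self._docs[start:start + page_size]


all_docs = list(iter_all_documents(FakeDataSet(list(range(7))), page_size=3))
```

Extra keyword arguments such as keywords or orderby are forwarded unchanged, so the same filters apply to every page.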

`delete_documents`

def delete_documents(self, ids: list[str] | None = None) -> None

Deletes documents from the dataset by their IDs.
Parameters:
- ids: List of document IDs to delete. If None, no documents are deleted.
Raises:
- Exception with server error message if deletion fails.
Usage example:

dataset.delete_documents(ids=["doc123", "doc456"])
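
Deletion is often combined with a listing call to find the IDs first. The helper below is a hypothetical sketch: `delete_documents_by_keyword` is not part of the module, it assumes each Document exposes its identifier as an `id` attribute, and it only looks at the first page of matches.

```python
def delete_documents_by_keyword(dataset, keywords):
    """Delete the first page of documents matching keywords; return their IDs."""
    docs = dataset.list_documents(keywords=keywords)
    ids = [doc.id for doc in docs]  # assumes Document has an `id` attribute
    if ids:
        # Only call the server when there is something to delete.
        dataset.delete_documents(ids=ids)
    return ids
```

Skipping the call when nothing matched keeps the method from ever being invoked with an empty list.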

`async_parse_documents`

def async_parse_documents(self, document_ids: list[str]) -> None

Initiates asynchronous parsing of specified documents into chunks.
Parameters:
- document_ids: List of document IDs to parse.
Raises:
- Exception with server error message on failure.
Usage example:

dataset.async_parse_documents(["doc123", "doc456"])

`async_cancel_parse_documents`

def async_cancel_parse_documents(self, document_ids: list[str]) -> None

Cancels ongoing asynchronous parsing tasks for specified documents.
Parameters:
- document_ids: List of document IDs whose parsing tasks should be cancelled.
Raises:
- Exception with server error message on failure.
Usage example:

dataset.async_cancel_parse_documents(["doc123"])
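
Since async_parse_documents returns before parsing finishes, callers typically poll for completion. The sketch below does so through list_documents(id=...). Both `wait_for_parsing` and the `is_finished` predicate are hypothetical; the predicate is left to the caller because this page does not document which Document field reports parsing status.

```python
import time


def wait_for_parsing(dataset, document_ids, is_finished, interval=1.0, timeout=60.0):
    """Poll until is_finished(doc) is true for every document or the timeout passes.

    Returns the set of IDs still pending at timeout (empty on success).
    """
    pending = set(document_ids)
    deadline = time.monotonic() + timeout
    while pending:
        for doc_id in list(pending):
            docs = dataset.list_documents(id=doc_id)
            if docs and is_finished(docs[0]):
                pending.discard(doc_id)
        if not pending or time.monotonic() >= deadline:
            break
        time.sleep(interval)
    return pending
```

A caller could start parsing with dataset.async_parse_documents(ids), call wait_for_parsing with a predicate suited to its Document objects, and pass any IDs that come back to async_cancel_parse_documents.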

Important Implementation Details

- The class relies on an inherited Base class for HTTP REST calls (get, post, put, rm).
- Methods handle JSON responses and raise exceptions on error codes.
- Document uploads use multipart file upload with tuples specifying filename and binary blob.
- The constructor sanitizes the input dictionary to ensure only valid dataset attributes remain before passing to the base initializer.
- The class tightly couples with the Document class, creating Document instances when listing or uploading documents.

Integration and Interactions

- With Base class: Provides RESTful API methods. DataSet is a specialized resource manager extending this base functionality.
- With Document class: Represents documents inside datasets. DataSet returns Document instances when listing or uploading documents.
- With backend API: All dataset and document operations correspond to HTTP endpoints under /datasets/{dataset_id} and related paths.
- With the rag client/context: Passed throughout for API context, authentication, and request handling.

This file is a fundamental component enabling dataset lifecycle management and document operations in the InfiniFlow application.

Diagram: Class Structure of `dataset.py`

classDiagram
    class DataSet {
        -id: str
        -name: str
        -avatar: str
        -tenant_id
        -description: str
        -embedding_model: str
        -permission: str
        -document_count: int
        -chunk_count: int
        -chunk_method: str
        -parser_config: ParserConfig
        -pagerank: int
        +__init__(rag, res_dict)
        +update(update_message: dict) DataSet
        +upload_documents(document_list: list[dict]) list~Document~
        +list_documents(id: str=None, name: str=None, keywords: str=None, page: int=1, page_size: int=30, orderby: str="create_time", desc: bool=True, create_time_from: int=0, create_time_to: int=0) list~Document~
        +delete_documents(ids: list~str~=None) None
        +async_parse_documents(document_ids: list~str~) None
        +async_cancel_parse_documents(document_ids: list~str~) None
    }

    class ParserConfig {
        +__init__(rag, res_dict)
    }

    DataSet o-- ParserConfig : "has a"
    Base <|-- DataSet : "inherits"
    Base <|-- ParserConfig : "inherits"
    DataSet --> Document : "uses"

Summary

The dataset.py file defines the DataSet class, a key interface for managing datasets and their documents in the InfiniFlow system. It supports dataset metadata updates, document upload/list/delete, and asynchronous parsing operations by interacting with a backend API through inherited REST methods. The nested ParserConfig class encapsulates parsing configurations. This module is essential for dataset lifecycle management and integrates closely with the Document class and the overall API client context.
