dataset.py


Overview

The dataset.py file is a core module responsible for managing datasets within the InfiniFlow system. It provides the DataSet class, which encapsulates dataset properties and exposes methods to interact with dataset resources such as documents and parsing configurations.

Through this class, users can update a dataset's metadata and upload, list, delete, and asynchronously parse documents within it; parsing jobs that have been started can also be cancelled. The file also defines a nested ParserConfig class that holds the dataset's parsing configuration.

DataSet inherits common RESTful API interaction methods (get, post, put, rm) from the base class Base, and it uses the Document class to represent the individual documents belonging to a dataset.
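
The delegation described above can be sketched roughly as follows. This is a minimal illustration, not the actual Base implementation: FakeClient, the method signatures, and the endpoint path /datasets/ds1 are assumptions made up for the example. Only the method names get, post, put, and rm come from the description above.

```python
class FakeClient:
    """Hypothetical stand-in for the API client; records calls instead of sending HTTP."""

    def __init__(self):
        self.calls = []

    def get(self, path, params=None):
        self.calls.append(("GET", path, params))

    def post(self, path, json=None):
        self.calls.append(("POST", path, json))

    def put(self, path, json=None):
        self.calls.append(("PUT", path, json))

    def delete(self, path, json=None):
        self.calls.append(("DELETE", path, json))


class Base:
    """Assumed shape of a base class: copy fields, forward REST verbs to the client."""

    def __init__(self, rag, res_dict):
        self.rag = rag
        for key, value in res_dict.items():
            setattr(self, key, value)

    def get(self, path, params=None):
        return self.rag.get(path, params)

    def post(self, path, json=None):
        return self.rag.post(path, json)

    def put(self, path, json=None):
        return self.rag.put(path, json)

    def rm(self, path, json=None):
        return self.rag.delete(path, json)


class DataSet(Base):
    """Inherits the REST helpers; dataset-specific methods would build on them."""


client = FakeClient()
ds = DataSet(client, {"id": "ds1", "name": "demo"})
ds.put(f"/datasets/{ds.id}", {"name": "renamed"})  # hypothetical endpoint path
print(client.calls)
```

Keeping the HTTP verbs in one base class means each resource class only has to decide which path and payload to send.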


Classes and Methods

class DataSet(Base)

Represents a dataset entity in the InfiniFlow platform. It manages dataset metadata and provides interfaces for manipulating documents and parsing tasks within the dataset.

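Properties

As listed in the class diagram at the end of this document, a DataSet carries the following properties: id, name, avatar, description, embedding_model, permission, and chunk_method (str); document_count, chunk_count, and pagerank (int); parser_config (ParserConfig); and tenant_id (type not stated).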

Constructor

def __init__(self, rag, res_dict):

Nested Class: ParserConfig

class ParserConfig(Base):
    def __init__(self, rag, res_dict):
        super().__init__(rag, res_dict)
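
One way such a nested configuration class can be used is to wrap the parser_config dictionary of an API response so that its keys become attributes. The sketch below assumes that behaviour. The simplified Base, the chunk_token_num key, and passing None for rag are illustrative choices, not details confirmed by the source.

```python
class Base:
    """Simplified stand-in: exposes dictionary keys as attributes."""

    def __init__(self, rag, res_dict):
        self.rag = rag
        for key, value in res_dict.items():
            setattr(self, key, value)


class DataSet(Base):
    class ParserConfig(Base):
        def __init__(self, rag, res_dict):
            super().__init__(rag, res_dict)

    def __init__(self, rag, res_dict):
        # Copy so the caller's dictionary is not modified, then split off
        # the nested configuration and wrap it in its own object.
        res_dict = dict(res_dict)
        config = res_dict.pop("parser_config", {})
        super().__init__(rag, res_dict)
        self.parser_config = DataSet.ParserConfig(rag, config)


ds = DataSet(None, {"id": "ds1", "parser_config": {"chunk_token_num": 128}})
print(ds.parser_config.chunk_token_num)
```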

Methods:

update

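def update(self, update_message: dict) -> DataSet

Updates the dataset's metadata with the fields given in update_message and returns a DataSet.

Example usage:

dataset = DataSet(rag, existing_data)
updated_dataset = dataset.update({"name": "New Dataset Name", "description": "Updated description"})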

upload_documents

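def upload_documents(self, document_list: list[dict]) -> list[Document]

Uploads documents to the dataset. Each entry in document_list is a dict with a display_name and a blob holding the file's bytes. The uploaded documents are returned as Document objects.

Example usage:

docs = [
    {"display_name": "file1.pdf", "blob": file1_bytes},
    {"display_name": "file2.txt", "blob": file2_bytes},
]
uploaded_docs = dataset.upload_documents(docs)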
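
The file1_bytes and file2_bytes values in the example above have to come from somewhere. A small helper along these lines could build the list from files on disk. The helper name build_document_list is hypothetical and not part of dataset.py; it only produces the display_name/blob shape shown above.

```python
import tempfile
from pathlib import Path


def build_document_list(paths):
    """Turn file paths into the list-of-dicts shape shown for upload_documents."""
    document_list = []
    for path in map(Path, paths):
        document_list.append({
            "display_name": path.name,  # file name without its directory
            "blob": path.read_bytes(),  # raw file content as bytes
        })
    return document_list


# Demonstrate with a temporary file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "notes.txt"
    sample.write_bytes(b"hello")
    docs = build_document_list([sample])

print(docs)
```

The resulting list could then be passed to dataset.upload_documents(docs).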

list_documents

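def list_documents(
    self,
    id: str | None = None,
    name: str | None = None,
    keywords: str | None = None,
    page: int = 1,
    page_size: int = 30,
    orderby: str = "create_time",
    desc: bool = True,
    create_time_from: int = 0,
    create_time_to: int = 0,
) -> list[Document]

Lists documents in the dataset. Results can be filtered by id, name, or keywords, paginated with page and page_size, ordered with orderby and desc, and bounded by creation time with create_time_from and create_time_to.

Example usage:

docs = dataset.list_documents(keywords="report", page=2, page_size=10)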
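
Because results come back one page at a time, collecting every document means walking the pages. The following sketch shows one way to do that. It assumes that page numbering starts at 1 (as the default suggests) and that a page shorter than page_size marks the end of the results; neither is confirmed by the source. iter_all_documents and FakeDataSet are hypothetical names used only for this illustration.

```python
def iter_all_documents(dataset, page_size=30, **filters):
    """Yield every document by requesting pages until one comes back short."""
    page = 1
    while True:
        batch = dataset.list_documents(page=page, page_size=page_size, **filters)
        yield from batch
        if len(batch) < page_size:  # assumed end-of-results signal
            break
        page += 1


class FakeDataSet:
    """Hypothetical stand-in that pages over an in-memory list."""

    def __init__(self, items):
        self._items = items

    def list_documents(self, page=1, page_size=30, **_ignored):
        start = (page - 1) * page_size
        return self._items[start:start + page_size]


fake = FakeDataSet([f"doc{i}" for i in range(7)])
all_docs = list(iter_all_documents(fake, page_size=3))
print(len(all_docs))
```

When the total is an exact multiple of page_size, this approach makes one extra request that returns an empty page before stopping.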

delete_documents

def delete_documents(self, ids: list[str] | None = None) -> None

Deletes the documents with the given IDs from the dataset.

Example usage:

dataset.delete_documents(ids=["doc123", "doc456"])

async_parse_documents

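def async_parse_documents(self, document_ids: list[str]) -> None

Starts asynchronous parsing of the given documents. The method returns None, so parse results are not available from the call itself.

Example usage:

dataset.async_parse_documents(["doc123", "doc456"])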

async_cancel_parse_documents

def async_cancel_parse_documents(self, document_ids: list[str]) -> None

Cancels asynchronous parsing for the given documents.

Example usage:

dataset.async_cancel_parse_documents(["doc123"])
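
Since both parsing methods return None, a caller that needs to know when parsing has finished has to poll. The sketch below combines the three methods documented here into a start, poll, and cancel-on-timeout loop. How a Document reports completion is not described in this document, so the caller supplies an is_done predicate; the progress attribute used in the demonstration is purely hypothetical, as are parse_and_wait, FakeDoc, and FakeDataSet.

```python
import time


def parse_and_wait(dataset, document_ids, is_done, timeout=60.0, interval=1.0):
    """Start parsing, then poll until is_done(doc) holds for every document.

    Returns True if all documents finished. On timeout, requests
    cancellation for the unfinished documents and returns False.
    """
    dataset.async_parse_documents(document_ids)
    pending = set(document_ids)
    deadline = time.monotonic() + timeout
    while pending and time.monotonic() < deadline:
        for doc_id in list(pending):
            docs = dataset.list_documents(id=doc_id)
            if docs and is_done(docs[0]):
                pending.discard(doc_id)
        if pending:
            time.sleep(interval)
    if pending:
        dataset.async_cancel_parse_documents(sorted(pending))
        return False
    return True


class FakeDoc:
    def __init__(self, doc_id):
        self.id = doc_id
        self.progress = 0.0  # hypothetical completion indicator


class FakeDataSet:
    """Hypothetical stand-in whose documents advance by 50% on every poll."""

    def __init__(self, ids):
        self.docs = {doc_id: FakeDoc(doc_id) for doc_id in ids}
        self.cancelled = []

    def async_parse_documents(self, document_ids):
        pass  # a real client would send a request here

    def list_documents(self, id=None):
        doc = self.docs[id]
        doc.progress = min(1.0, doc.progress + 0.5)
        return [doc]

    def async_cancel_parse_documents(self, document_ids):
        self.cancelled.extend(document_ids)


fake = FakeDataSet(["doc123", "doc456"])
finished = parse_and_wait(
    fake, ["doc123", "doc456"], lambda doc: doc.progress >= 1.0, interval=0.01
)
print(finished)
```

Passing the completion check in as a function keeps the helper independent of whichever status fields the Document class actually exposes.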

Important Implementation Details


Integration and Interactions

This file is a fundamental component enabling dataset lifecycle management and document operations in the InfiniFlow application.


Diagram: Class Structure of dataset.py

classDiagram
    class DataSet {
        -id: str
        -name: str
        -avatar: str
        -tenant_id
        -description: str
        -embedding_model: str
        -permission: str
        -document_count: int
        -chunk_count: int
        -chunk_method: str
        -parser_config: ParserConfig
        -pagerank: int
        +__init__(rag, res_dict)
        +update(update_message: dict) DataSet
        +upload_documents(document_list: list[dict]) list~Document~
        +list_documents(id: str=None, name: str=None, keywords: str=None, page: int=1, page_size: int=30, orderby: str="create_time", desc: bool=True, create_time_from: int=0, create_time_to: int=0) list~Document~
        +delete_documents(ids: list~str~=None) None
        +async_parse_documents(document_ids: list~str~) None
        +async_cancel_parse_documents(document_ids: list~str~) None
    }

    class ParserConfig {
        +__init__(rag, res_dict)
    }

    DataSet o-- ParserConfig : "has a"
    Base <|-- DataSet : "inherits"
    Base <|-- ParserConfig : "inherits"
    DataSet --> Document : "uses"

Summary

The dataset.py file defines the DataSet class, a key interface for managing datasets and their documents in the InfiniFlow system. It supports dataset metadata updates, document upload/list/delete, and asynchronous parsing operations by interacting with a backend API through inherited REST methods. The nested ParserConfig class encapsulates parsing configurations. This module is essential for dataset lifecycle management and integrates closely with the Document class and the overall API client context.