common.py

Overview

common.py is a utility module for interacting with a remote knowledge base and document management service exposed through a RESTful API. It provides functions for managing datasets (knowledge bases) and documents on a configurable host server: creating, listing, updating, and removing datasets, and uploading, listing, inspecting, and parsing documents. Each function wraps a single API call, so callers do not build HTTP requests themselves.

The module is primarily intended for use in applications that need to programmatically manage knowledge bases and their associated documents, enabling integration with the backend service without requiring the user to handle raw HTTP requests.

Constants

HOST_ADDRESS (str):
The base URL of the remote service. Defaults to "http://127.0.0.1:9380" but can be overridden by setting the environment variable HOST_ADDRESS.
DATASET_NAME_LIMIT (int):
A constant defining the maximum allowed length for dataset names (128 characters). This limit is defined but not enforced within this file.
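
Because the limit is not enforced in the file, callers that want to respect it have to check names themselves. The following is a minimal sketch of how the two constants could be defined and used. `validate_dataset_name` is a hypothetical helper that does not exist in common.py, and rejecting empty names is an additional assumption.

```python
import os

# The environment variable takes precedence over the documented default.
HOST_ADDRESS = os.getenv("HOST_ADDRESS", "http://127.0.0.1:9380")
DATASET_NAME_LIMIT = 128


def validate_dataset_name(name: str) -> str:
    """Hypothetical helper (not part of common.py): enforce the name limit."""
    if not name:
        raise ValueError("dataset name must not be empty")
    if len(name) > DATASET_NAME_LIMIT:
        raise ValueError(
            f"dataset name has {len(name)} characters; "
            f"the limit is {DATASET_NAME_LIMIT}"
        )
    return name
```

A caller could run `validate_dataset_name(dataset_name)` before `create_dataset` to fail fast instead of waiting for the server to reject the name.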

Functions

1. `create_dataset(auth: str, dataset_name: str) -> dict`

Creates a new dataset (knowledge base) on the remote server.

Parameters:
- auth (str): Authorization token or header value required for API authentication.
- dataset_name (str): The name of the dataset to be created.
Returns:
A dictionary parsed from the JSON response of the server, typically containing status and dataset details.

Usage Example:

response = create_dataset(auth="Bearer token123", dataset_name="MyDataset")
print(response)

2. `list_dataset(auth: str, page_number: int, page_size: int = 30) -> dict`

Retrieves a paginated list of datasets available on the server.

Parameters:
- auth (str): Authorization token.
- page_number (int): The page number to retrieve.
- page_size (int, optional): Number of datasets per page (default is 30).
Returns:
A dictionary containing the list of datasets and pagination info.

Usage Example:

datasets = list_dataset(auth="Bearer token123", page_number=1, page_size=20)
print(datasets)

3. `rm_dataset(auth: str, dataset_id: str) -> dict`

Removes (deletes) a dataset specified by its identifier.

Parameters:
- auth (str): Authorization token.
- dataset_id (str): Unique identifier of the dataset to be removed.
Returns:
JSON response as a dictionary indicating success or failure.

Usage Example:

result = rm_dataset(auth="Bearer token123", dataset_id="dataset_123")
print(result)

4. `update_dataset(auth: str, json_req: dict) -> dict`

Updates dataset information based on the provided JSON request payload.

Parameters:
- auth (str): Authorization token.
- json_req (dict): A dictionary containing fields and values to update for the dataset.
Returns:
JSON response as a dictionary with update status.

Usage Example:

update_info = {"kb_id": "dataset_123", "name": "UpdatedName"}
response = update_dataset(auth="Bearer token123", json_req=update_info)
print(response)

5. `upload_file(auth: str, dataset_id: str, path: str) -> dict`

Uploads a file associated with a dataset.

Parameters:
- auth (str): Authorization token.
- dataset_id (str): Identifier of the dataset to which the file will be uploaded.
- path (str): Local file path of the file to upload.
Returns:
JSON response from the server indicating success or failure of the upload.
Important Note:
The file is opened in binary mode and sent as a multipart form data request.

Usage Example:

response = upload_file(auth="Bearer token123", dataset_id="dataset_123", path="/path/to/file.pdf")
print(response)
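
The multipart behaviour described in the note can be sketched as follows. This is an illustration under stated assumptions, not the module's actual code: the endpoint path `/v1/document/upload`, the form field `kb_id`, and the part name `file` are guesses, and the `post` parameter is a hypothetical hook added so the sketch can be exercised without a server.

```python
import os
from typing import Any, Callable, Optional

HOST_ADDRESS = os.getenv("HOST_ADDRESS", "http://127.0.0.1:9380")


def upload_file_sketch(
    auth: str,
    dataset_id: str,
    path: str,
    post: Optional[Callable[..., Any]] = None,
) -> dict:
    """Illustrative multipart upload of one local file."""
    if post is None:
        import requests  # lazy import: only needed for real network calls

        post = requests.post
    url = f"{HOST_ADDRESS}/v1/document/upload"  # assumed endpoint path
    headers = {"Authorization": auth}
    form_fields = {"kb_id": dataset_id}  # assumed form field name
    # The context manager closes the file handle even if the request fails.
    with open(path, "rb") as handle:
        files = {"file": (os.path.basename(path), handle)}  # assumed part name
        response = post(url=url, headers=headers, data=form_fields, files=files)
    return response.json()
```

Passing both `data` and `files` makes the requests library encode the body as multipart form data, and opening the file in a `with` block avoids leaking the handle.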

6. `list_document(auth: str, dataset_id: str) -> dict`

Lists all documents associated with a given dataset.

Parameters:
- auth (str): Authorization token.
- dataset_id (str): Identifier of the dataset whose documents are to be listed.
Returns:
Dictionary containing document list and metadata.

Usage Example:

docs = list_document(auth="Bearer token123", dataset_id="dataset_123")
print(docs)

7. `get_docs_info(auth: str, doc_ids: list) -> dict`

Fetches detailed information about a list of document IDs.

Parameters:
- auth (str): Authorization token.
- doc_ids (list of str): List of document identifiers.
Returns:
Dictionary with detailed document metadata.

Usage Example:

info = get_docs_info(auth="Bearer token123", doc_ids=["doc1", "doc2"])
print(info)

8. `parse_docs(auth: str, doc_ids: list) -> dict`

Triggers parsing and processing of specified documents on the server.

Parameters:
- auth (str): Authorization token.
- doc_ids (list of str): List of document IDs to be parsed.
Returns:
Server response indicating parsing status.

Usage Example:

parse_result = parse_docs(auth="Bearer token123", doc_ids=["doc1", "doc2"])
print(parse_result)

9. `parse_file(auth: str, document_id: str)`

Description:
This function is declared but not implemented (pass statement). It appears intended to parse a single file/document by its ID, likely similar in function to parse_docs but for individual files.
Parameters:
- auth (str): Authorization token.
- document_id (str): Identifier of the document to parse.
Returns:
None (not implemented).
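
If parse_file is indeed meant to be the single-document counterpart of parse_docs, one possible implementation is to delegate to the batch call with a one-element list. The sketch below is hypothetical. It takes `parse_docs` as a parameter so that it stays self-contained, and the empty-ID check is an added assumption.

```python
from typing import Callable, List


def parse_file_sketch(
    auth: str,
    document_id: str,
    parse_docs: Callable[[str, List[str]], dict],
) -> dict:
    """Hypothetical implementation: parse one document via the batch call."""
    if not document_id:
        raise ValueError("document_id must not be empty")
    # Wrap the single ID in a list so the batch function can be reused.
    return parse_docs(auth, [document_id])
```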

Implementation Details

All implemented functions (every function except the parse_file stub) communicate with a backend server using HTTP POST requests.
Authorization is consistently handled by passing an "Authorization" header with each request.
The backend API endpoints are derived from the base HOST_ADDRESS with specific paths such as /v1/kb/create for dataset creation.
The module uses the requests library for HTTP communication.
File uploads are handled using multipart form data in the upload_file function.
Pagination is supported in list_dataset via query parameters.
JSON responses from the server are parsed and returned as Python dictionaries.
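
The pattern described above can be sketched as a small shared helper plus two thin wrappers. This is a re-creation for illustration, not the file's actual code. The `/v1/kb/create` path comes from the text above, while the `/v1/kb/list` path, the `name` payload field, and the `page` and `page_size` query parameter names are assumptions. The `post` parameter is a hypothetical hook for running the sketch without a server.

```python
import os
from typing import Any, Callable, Optional

HOST_ADDRESS = os.getenv("HOST_ADDRESS", "http://127.0.0.1:9380")


def _send(
    path: str,
    auth: str,
    post: Optional[Callable[..., Any]] = None,
    **kwargs: Any,
) -> dict:
    """Send one authorized POST request and return the parsed JSON body."""
    if post is None:
        import requests  # lazy import: only needed for real network calls

        post = requests.post
    response = post(
        url=f"{HOST_ADDRESS}{path}",
        headers={"Authorization": auth},
        **kwargs,
    )
    return response.json()


def create_dataset_sketch(auth: str, dataset_name: str, post=None) -> dict:
    # The payload field name "name" is an assumption.
    return _send("/v1/kb/create", auth, post, json={"name": dataset_name})


def list_dataset_sketch(
    auth: str, page_number: int, page_size: int = 30, post=None
) -> dict:
    # Pagination travels as query parameters; path and names are assumed.
    query = {"page": page_number, "page_size": page_size}
    return _send("/v1/kb/list", auth, post, params=query)
```

Keeping the header construction and JSON decoding in one helper means each public function only states its path and payload.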

Interaction with Other Parts of the System

This module acts as a client-side interface to the InfiniFlow knowledge base backend services.
It can be imported and used by higher-level components or services that need to manage datasets and documents.
It depends on environment configuration (HOST_ADDRESS) to target the correct backend endpoint.
It requires an authentication mechanism external to this file to provide valid tokens for API access.
The file does not include error handling, so it expects calling code to manage exceptions or response validations.
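
Since error handling is left to the caller, one option is to route every call through a small wrapper. The sketch below is one possible approach. `ApiCallError`, `call_checked`, and the optional `is_ok` predicate are hypothetical names, and the rule for what counts as a successful response is left to the caller because the response schema is not described here.

```python
from typing import Any, Callable, Optional


class ApiCallError(RuntimeError):
    """Hypothetical error type raised when a wrapped call fails."""


def call_checked(
    func: Callable[..., Any],
    *args: Any,
    is_ok: Optional[Callable[[dict], bool]] = None,
    **kwargs: Any,
) -> dict:
    """Call func and convert failures into ApiCallError."""
    try:
        result = func(*args, **kwargs)
    except Exception as exc:  # e.g. connection errors or non-JSON bodies
        raise ApiCallError(f"{func.__name__} failed: {exc}") from exc
    if not isinstance(result, dict):
        raise ApiCallError(
            f"{func.__name__} returned {type(result).__name__}, expected dict"
        )
    # Optional caller-supplied check of the response content.
    if is_ok is not None and not is_ok(result):
        raise ApiCallError(f"{func.__name__} reported failure: {result!r}")
    return result
```

A call such as `call_checked(create_dataset, auth, "MyDataset")` then either returns a dictionary or raises a single, predictable exception type.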

Diagram: Flowchart of Functions and Their Relationships

flowchart TD
    A[Start] --> B[create_dataset]
    A --> C[list_dataset]
    A --> D[rm_dataset]
    A --> E[update_dataset]
    A --> F[upload_file]
    A --> G[list_document]
    A --> H[get_docs_info]
    A --> I[parse_docs]
    A --> J["parse_file (unimplemented)"]
    
    F --> G
    G --> H
    H --> I

Diagram Explanation:

The flowchart illustrates the module's main functions as independent API calls starting from a common entry point (user/application).
The upload and document-related functions (upload_file, list_document, get_docs_info, parse_docs) form a logical chain representing document lifecycle operations.
Dataset management functions (create_dataset, list_dataset, rm_dataset, update_dataset) are independent but related to dataset administration.
parse_file is shown as unimplemented and isolated.
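
The document lifecycle chain in the diagram (upload_file, list_document, get_docs_info, parse_docs) can be sketched as a single helper. `ingest_file` and the `extract_ids` callback are hypothetical. The callback is needed because the shape of the list_document response is not specified here, and the stand-in API and its response shapes are invented purely so that the sketch runs without a server.

```python
from types import SimpleNamespace
from typing import Any, Callable, List


def ingest_file(
    api: Any,
    auth: str,
    dataset_id: str,
    path: str,
    extract_ids: Callable[[dict], List[str]],
) -> dict:
    """Run upload -> list -> info -> parse and collect each response."""
    uploaded = api.upload_file(auth, dataset_id, path)
    listing = api.list_document(auth, dataset_id)
    doc_ids = extract_ids(listing)  # caller knows the response layout
    info = api.get_docs_info(auth, doc_ids)
    parsed = api.parse_docs(auth, doc_ids)
    return {"upload": uploaded, "list": listing, "info": info, "parse": parsed}


# Stand-in API with invented response shapes; a real caller would pass
# the imported common module instead.
fake_api = SimpleNamespace(
    upload_file=lambda auth, dataset_id, path: {"uploaded": path},
    list_document=lambda auth, dataset_id: {"docs": [{"id": "doc1"}, {"id": "doc2"}]},
    get_docs_info=lambda auth, doc_ids: {"info_for": doc_ids},
    parse_docs=lambda auth, doc_ids: {"parsing": doc_ids},
)

result = ingest_file(
    fake_api,
    "Bearer token123",
    "dataset_123",
    "/path/to/file.pdf",
    extract_ids=lambda listing: [doc["id"] for doc in listing["docs"]],
)
```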

Summary

The common.py file is a concise and focused utility module for managing knowledge bases and documents via a REST API. Its simple function-based interface abstracts HTTP details, enabling easy integration into larger systems that require knowledge base operations or document processing workflows. To maximize robustness, future enhancements could include error handling, input validation, and implementing currently stubbed functions like parse_file.
