document.py

Overview

The document.py file defines the Document class, which represents a document entity within the InfiniFlow system. The class exposes methods to update document metadata, download document content, and list, add, and delete document chunks (smaller segments of the document). It also contains a nested class, ParserConfig, for configuration related to document parsing.

The Document class inherits from a base class Base (from .base) which presumably provides common RESTful HTTP methods (get, post, put, rm) to interact with the backend API. The Document class uses these methods to communicate with dataset and document endpoints, facilitating CRUD operations and chunk management.

Classes

Document (inherits from Base)

Represents a document in the InfiniFlow system, encapsulating metadata, parsing configuration, and chunk management.

Properties

Property	Type	Description
`id`	`str`	Unique identifier of the document.
`name`	`str`	Name/title of the document.
`thumbnail`	`Any` (usually `None`)	Optional thumbnail image/preview for the document.
`dataset_id`	`str` or `None`	Identifier of the dataset this document belongs to.
`chunk_method`	`str`	Method used to chunk the document (default `"naive"`).
`parser_config`	`dict`	Configuration related to document parsing (default covers pages).
`source_type`	`str`	Source type of the document (e.g., `"local"`).
`type`	`str`	Type/category of the document.
`created_by`	`str`	Creator/owner of the document.
`size`	`int`	Size of the document in bytes.
`token_count`	`int`	Number of tokens in the document (for NLP purposes).
`chunk_count`	`int`	Number of chunks the document has been divided into.
`progress`	`float`	Progress of document processing (0.0 to 100.0).
`progress_msg`	`str`	Message describing current progress status.
`process_begin_at`	`datetime` or `None`	Timestamp when processing began.
`process_duration`	`float`	Duration in seconds of the processing.
`run`	`str`	Run identifier/version string.
`status`	`str`	Status code of the document (e.g., `"1"` meaning active).
`meta_fields`	`dict`	Dictionary to store arbitrary metadata fields.

Constructor

def __init__(self, rag, res_dict):

- rag: Context or client object for API interaction (passed to Base).
- res_dict: Dictionary containing initial values for document properties. Only keys matching the class attributes are retained.

Initializes a Document instance, sanitizing the input dictionary to remove unknown keys before calling the base initializer.
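
The key filtering described above can be sketched as follows. This is a minimal illustration, not the class's actual code: `KNOWN_KEYS` and `sanitize` are hypothetical names, and the key set is abbreviated for the example.

```python
# Hypothetical, abbreviated stand-in for the attributes the class declares.
KNOWN_KEYS = {"id", "name", "dataset_id", "chunk_method", "meta_fields"}


def sanitize(res_dict: dict) -> dict:
    """Return a copy of res_dict that keeps only the known keys."""
    return {key: value for key, value in res_dict.items() if key in KNOWN_KEYS}


raw = {"id": "doc-1", "name": "report.pdf", "unexpected": 42}
clean = sanitize(raw)  # "unexpected" is dropped; raw itself is left untouched
```

Returning a filtered copy, rather than deleting keys in place, leaves the caller's dictionary unchanged; whether the real constructor copies or mutates is not stated here.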

Methods

update

def update(self, update_message: dict) -> "Document":

Updates the document metadata on the server.

Parameters:
- update_message (dict): Dictionary containing fields to update. If it contains meta_fields, it must be a dictionary.
Returns:
- self, after its internal state has been updated from the server response.
Raises:
- Exception if meta_fields is not a dictionary or if the server response contains an error.
Usage Example:

doc.update({"name": "New Document Title", "meta_fields": {"author": "John Doe"}})

Implementation details:
- Calls HTTP PUT on /datasets/{dataset_id}/documents/{id} with update_message.
- Parses response JSON; on success, updates instance attributes accordingly.
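
A rough sketch of this flow, under stated assumptions: `apply_update` and `fake_put` are hypothetical names, a plain dict stands in for the Document instance, the transport is a local stub rather than a real HTTP call, and a `code` of 0 is assumed to signal success with the error text in a `message` field.

```python
from typing import Callable


def apply_update(put: Callable[[str, dict], dict], doc: dict, update_message: dict) -> dict:
    """Validate update_message, send it, and merge the returned data into doc."""
    meta = update_message.get("meta_fields")
    if "meta_fields" in update_message and not isinstance(meta, dict):
        raise Exception("meta_fields must be a dictionary")

    body = put(f"/datasets/{doc['dataset_id']}/documents/{doc['id']}", update_message)
    if body.get("code") != 0:  # assumption: 0 means success
        raise Exception(body.get("message"))

    doc.update(body.get("data", {}))  # refresh local state from the server reply
    return doc


def fake_put(path: str, payload: dict) -> dict:
    """Hypothetical stub that echoes the payload back as the updated fields."""
    return {"code": 0, "data": dict(payload)}


doc = {"id": "d1", "dataset_id": "ds1", "name": "old"}
apply_update(fake_put, doc, {"name": "New Document Title"})
```

Validating before the request means a malformed meta_fields value never reaches the server.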

download

def download(self) -> bytes:

Downloads the full content of the document.

Parameters: None.
Returns: Raw bytes content of the document.
Raises: Exception if the server response contains an error.
Usage Example:

content_bytes = doc.download()
with open("document.pdf", "wb") as f:
    f.write(content_bytes)

Implementation details:
- Calls HTTP GET on /datasets/{dataset_id}/documents/{id}.
- Attempts to parse JSON to detect error codes.
- Returns the raw content if no error is detected or if the response body is not valid JSON.
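
The JSON-or-bytes decision can be sketched like this. `extract_content` is a hypothetical helper that operates on an already-fetched response body; the `code` and `message` field names and the convention that 0 means success are assumptions.

```python
import json


def extract_content(raw: bytes) -> bytes:
    """Return raw unless it parses as a JSON object carrying a nonzero error code."""
    try:
        body = json.loads(raw)
    except ValueError:
        # Covers both invalid JSON and undecodable bytes: treat as file content.
        return raw
    if isinstance(body, dict) and body.get("code", 0) != 0:
        raise Exception(body.get("message"))
    return raw


pdf_bytes = b"%PDF-1.7\n\xff\xfe binary payload"
content = extract_content(pdf_bytes)  # not JSON, so returned unchanged
```

One caveat of this scheme: a downloaded file that is itself a JSON object with a nonzero `code` field would be misread as an error.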

list_chunks

def list_chunks(self, page=1, page_size=30, keywords="", id="") -> list[Chunk]:

Retrieves a paginated list of chunks belonging to this document.

Parameters:
- page (int, default 1): Page number for pagination.
- page_size (int, default 30): Number of chunks per page.
- keywords (str, default ""): Keywords to filter chunks.
- id (str, default ""): Specific chunk ID filter.
Returns: List of Chunk instances.
Raises: Exception if server returns an error.
Usage Example:

chunks = doc.list_chunks(page=2, page_size=10, keywords="important")
for chunk in chunks:
    print(chunk.content)

Implementation details:
- Calls HTTP GET on /datasets/{dataset_id}/documents/{id}/chunks with query params.
- Parses response JSON, instantiates Chunk objects for each chunk data.
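
Because results are paginated, collecting every chunk takes a loop over pages. The sketch below is one way to do that; `iter_all_chunks` and `fake_list_chunks` are hypothetical names, and it assumes that pages are 1-indexed and that a page shorter than page_size marks the end.

```python
from typing import Callable, Iterator


def iter_all_chunks(list_page: Callable[..., list], page_size: int = 30) -> Iterator:
    """Yield items from successive pages until a short (or empty) page appears."""
    page = 1
    while True:
        batch = list_page(page=page, page_size=page_size)
        yield from batch
        if len(batch) < page_size:
            break
        page += 1


def fake_list_chunks(page: int = 1, page_size: int = 30) -> list:
    """Hypothetical stub that pages over seven in-memory items."""
    data = [f"chunk-{i}" for i in range(7)]
    start = (page - 1) * page_size
    return data[start:start + page_size]


all_chunks = list(iter_all_chunks(fake_list_chunks, page_size=3))
```

With a real instance, the bound method would be passed instead of the stub, for example `iter_all_chunks(doc.list_chunks, page_size=10)`.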

add_chunk

def add_chunk(self, content: str, important_keywords: list[str] = [], questions: list[str] = []) -> Chunk:

Adds a new chunk to the document.

Parameters:
- content (str): Text content of the chunk.
- important_keywords (list[str], optional): List of keywords flagged as important.
- questions (list[str], optional): List of associated questions for the chunk.
Returns: A new Chunk instance representing the created chunk.
Raises: Exception if server returns an error.
Usage Example:

new_chunk = doc.add_chunk("This is a new chunk of text.", ["keyword1"], ["What is this?"])

Implementation details:
- Calls HTTP POST on /datasets/{dataset_id}/documents/{id}/chunks with JSON payload including content, keywords, and questions.
- Parses response JSON and returns a Chunk object.
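
The request body might be assembled along these lines. `build_chunk_payload` is a hypothetical helper, and the key names `content`, `important_keywords`, and `questions` are assumed to mirror the parameter names. The sketch uses `None` defaults and copies the lists, a common way to avoid sharing one mutable default list between calls.

```python
from typing import Optional


def build_chunk_payload(content: str,
                        important_keywords: Optional[list] = None,
                        questions: Optional[list] = None) -> dict:
    """Assemble the request body; the key names are assumed, not confirmed."""
    return {
        "content": content,
        "important_keywords": list(important_keywords or []),
        "questions": list(questions or []),
    }


payload = build_chunk_payload("This is a new chunk of text.", ["keyword1"])
```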

delete_chunks

def delete_chunks(self, ids: list[str] | None = None) -> None:

Deletes chunks from the document.

Parameters:
- ids (list[str] or None): List of chunk IDs to delete. If None, no chunks are deleted.
Returns: None.
Raises: Exception if server returns an error.
Usage Example:

doc.delete_chunks(["chunk_id_1", "chunk_id_2"])

Implementation details:
- Calls HTTP DELETE (via rm method) on /datasets/{dataset_id}/documents/{id}/chunks with a list of chunk IDs.
- Raises exception on failure.

Document.ParserConfig (inherits from Base)

A nested class intended to handle parser configuration details for the document.

Constructor:

def __init__(self, rag, res_dict):

- rag: Context or client object.
- res_dict: Dictionary of parser configuration values.
Implementation details:
- Currently only calls the base class constructor.
- No additional properties or methods defined explicitly.

Important Implementation Details

- The constructor removes unknown keys from the input dictionary (res_dict) before passing it to the base class, so only declared attributes are set.
- All operations go through backend REST API calls whose paths include the dataset and document IDs.
- The code field of each JSON response determines success or failure; on failure, an exception is raised with the server's message.
- Chunk operations wrap chunk data in Chunk objects.
- The download() method handles both JSON error responses and raw content (such as PDFs or other binaries).
- The update() method requires meta_fields, when present, to be a dictionary and raises an exception otherwise.

Interaction with Other Parts of the System

- Base Class (base.Base): Provides fundamental HTTP methods (get, post, put, rm) and likely common functionality for API interaction.
- Chunk Class (chunk.Chunk): Represents document chunks. Document creates and manipulates Chunk instances when dealing with parts of the document.
- API Backend: The class methods interact with a RESTful backend API exposing dataset and document endpoints for persistence and retrieval.
- RAG (Retrieval-Augmented Generation) Context (rag): Passed to constructors, presumably a context or client managing API calls or document processing workflows.

Usage Summary

The typical workflow with Document involves:

- Instantiating a Document object with metadata.
- Updating metadata using update().
- Downloading full content using download().
- Managing chunks via list_chunks(), add_chunk(), and delete_chunks() for fine-grained content operations.

Visual Diagram

classDiagram
    class Document {
        -id: str
        -name: str
        -thumbnail: Any
        -dataset_id: str
        -chunk_method: str
        -parser_config: dict
        -source_type: str
        -type: str
        -created_by: str
        -size: int
        -token_count: int
        -chunk_count: int
        -progress: float
        -progress_msg: str
        -process_begin_at: datetime
        -process_duration: float
        -run: str
        -status: str
        -meta_fields: dict
        +__init__(rag, res_dict)
        +update(update_message: dict): Document
        +download(): bytes
        +list_chunks(page=1, page_size=30, keywords="", id=""): list~Chunk~
        +add_chunk(content: str, important_keywords: list~str~=[], questions: list~str~=[]): Chunk
        +delete_chunks(ids: list~str~|None=None): None
    }

    class ParserConfig {
        +__init__(rag, res_dict)
    }

    class Base {
        <<abstract>>
        +get(url, params=None)
        +post(url, data=None)
        +put(url, data=None)
        +rm(url, data=None)
    }

    class Chunk {
        <<external>>
    }

    Base <|-- Document
    Base <|-- ParserConfig
    Document o-- ParserConfig
    Document --> Chunk : manages >

Summary

The document.py module provides an object-oriented wrapper around document entities in the InfiniFlow system, covering metadata updates, content download, and chunk-level manipulation through a RESTful API. HTTP functionality lives in Base and chunk data in Chunk, which keeps Document focused on document-level operations.