document.py

Overview

The document.py file defines the Document class, which represents a document entity within the InfiniFlow system. This class provides an interface to manage document metadata, perform updates, download document content, and interact with document chunks (smaller portions or segments of the document). It also contains an inner class ParserConfig for configuration related to document parsing.

The Document class inherits from a base class Base (from .base) which presumably provides common RESTful HTTP methods (get, post, put, rm) to interact with the backend API. The Document class uses these methods to communicate with dataset and document endpoints, facilitating CRUD operations and chunk management.
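The delegation described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `RecordingClient`, `SketchBase`, and `SketchDocument` are hypothetical stand-ins, and the URL layout is invented rather than taken from the real endpoints. It only shows the general pattern of a subclass forwarding an update through an inherited `put` helper.

```python
from typing import Any, Optional


class RecordingClient:
    """Hypothetical stand-in for the `rag` client: records calls, sends nothing."""

    def __init__(self) -> None:
        self.calls: list = []

    def put(self, path: str, data: Optional[dict] = None) -> dict:
        self.calls.append(("PUT", path, data))
        return {"code": 0}


class SketchBase:
    """Assumed shape of a base class: copy fields, forward HTTP verbs."""

    def __init__(self, rag: Any, res_dict: dict) -> None:
        self.rag = rag
        for key, value in res_dict.items():
            setattr(self, key, value)

    def put(self, path: str, data: Optional[dict] = None) -> dict:
        return self.rag.put(path, data)


class SketchDocument(SketchBase):
    def update(self, update_message: dict) -> "SketchDocument":
        # The URL layout below is illustrative, not the documented endpoint.
        self.put(f"/datasets/{self.dataset_id}/documents/{self.id}", update_message)
        # Mirror the change locally so the object matches what was sent.
        for key, value in update_message.items():
            setattr(self, key, value)
        return self


client = RecordingClient()
doc = SketchDocument(client, {"id": "doc-1", "dataset_id": "ds-1", "name": "old.pdf"})
doc.update({"name": "new.pdf"})
print(client.calls)
```

Keeping the HTTP verbs in the base class means each entity class only has to decide which path and payload to send.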


Classes

Document (inherits from Base)

Represents a document in the InfiniFlow system, encapsulating metadata, parsing configuration, and chunk management.

Properties

| Property | Type | Description |
|---|---|---|
| id | str | Unique identifier of the document. |
| name | str | Name/title of the document. |
| thumbnail | Any (usually None) | Optional thumbnail image/preview for the document. |
| dataset_id | str or None | Identifier of the dataset this document belongs to. |
| chunk_method | str | Method used to chunk the document (default "naive"). |
| parser_config | dict | Configuration related to document parsing (default covers pages). |
| source_type | str | Source type of the document (e.g., "local"). |
| type | str | Type/category of the document. |
| created_by | str | Creator/owner of the document. |
| size | int | Size of the document in bytes. |
| token_count | int | Number of tokens in the document (for NLP purposes). |
| chunk_count | int | Number of chunks the document has been divided into. |
| progress | float | Progress of document processing (0.0 to 100.0). |
| progress_msg | str | Message describing current progress status. |
| process_begin_at | datetime or None | Timestamp when processing began. |
| process_duration | float | Duration in seconds of the processing. |
| run | str | Run identifier/version string. |
| status | str | Status code of the document (e.g., "1" meaning active). |
| meta_fields | dict | Dictionary to store arbitrary metadata fields. |

Constructor

def __init__(self, rag, res_dict):

Initializes a Document instance, sanitizing the input dictionary to remove unknown keys before calling the base initializer.
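As a rough illustration of the sanitizing step, the sketch below filters a response dictionary against a whitelist of known field names. The `KNOWN_FIELDS` set and the `sanitize` helper are hypothetical names chosen for this example; the actual constructor may decide which keys are "known" in a different way.

```python
# Hypothetical whitelist; the real class may track known fields differently.
KNOWN_FIELDS = frozenset(
    {"id", "name", "dataset_id", "chunk_method", "parser_config", "meta_fields"}
)


def sanitize(res_dict: dict, known: frozenset = KNOWN_FIELDS) -> dict:
    """Return a copy of res_dict that keeps only recognised keys."""
    return {key: value for key, value in res_dict.items() if key in known}


raw = {"id": "doc-1", "name": "report.pdf", "unexpected_key": 42}
clean = sanitize(raw)
print(clean)
```

Building a new dictionary rather than deleting keys in place leaves the caller's original response untouched.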

Methods


update
def update(self, update_message: dict) -> "Document":

Updates the document metadata on the server.

doc.update({"name": "New Document Title", "meta_fields": {"author": "John Doe"}})

download
def download(self) -> bytes:

Downloads the full content of the document.

content_bytes = doc.download()
with open("document.pdf", "wb") as f:
    f.write(content_bytes)

list_chunks
def list_chunks(self, page=1, page_size=30, keywords="", id="") -> list[Chunk]:

Retrieves a paginated list of chunks belonging to this document.

chunks = doc.list_chunks(page=2, page_size=10, keywords="important")
for chunk in chunks:
    print(chunk.content)
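Because list_chunks returns one page at a time, collecting every chunk requires a loop over pages. The sketch below is one way to do that, assuming pages are numbered from 1 and that a page shorter than page_size marks the end of the results. `iter_all_chunks` and `FakeDocument` are hypothetical helpers; the fake exists only so the example runs without a server.

```python
from typing import Any, Iterator


def iter_all_chunks(doc: Any, page_size: int = 30, keywords: str = "") -> Iterator[Any]:
    """Yield every chunk by requesting pages until one comes back short."""
    page = 1
    while True:
        batch = doc.list_chunks(page=page, page_size=page_size, keywords=keywords)
        yield from batch
        if len(batch) < page_size:
            break
        page += 1


class FakeDocument:
    """Stand-in for Document, used only to make this sketch runnable."""

    def __init__(self, contents: list) -> None:
        self._contents = contents

    def list_chunks(self, page=1, page_size=30, keywords="", id=""):
        # Filter by keyword, then slice out the requested page.
        matching = [c for c in self._contents if keywords in c]
        start = (page - 1) * page_size
        return matching[start:start + page_size]


fake = FakeDocument([f"chunk {i}" for i in range(7)])
all_chunks = list(iter_all_chunks(fake, page_size=3))
print(len(all_chunks))
```

A generator keeps memory use flat for large documents, since only one page of chunks is held at a time.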

add_chunk
def add_chunk(self, content: str, important_keywords: list[str] = [], questions: list[str] = []) -> Chunk:

Adds a new chunk to the document.

new_chunk = doc.add_chunk("This is a new chunk of text.", ["keyword1"], ["What is this?"])

delete_chunks
def delete_chunks(self, ids: list[str] | None = None) -> None:

Deletes chunks from the document.

doc.delete_chunks(["chunk_id_1", "chunk_id_2"])

Document.ParserConfig (inherits from Base)

A nested class intended to handle parser configuration details for the document.

def __init__(self, rag, res_dict):

Important Implementation Details


Interaction with Other Parts of the System


Usage Summary

The typical workflow with Document involves:

  1. Instantiating a Document object with metadata.

  2. Updating metadata using update().

  3. Downloading full content using download().

  4. Managing chunks via list_chunks(), add_chunk(), and delete_chunks() for fine-grained content operations.


Visual Diagram

classDiagram
    class Document {
        -id: str
        -name: str
        -thumbnail: Any
        -dataset_id: str
        -chunk_method: str
        -parser_config: dict
        -source_type: str
        -type: str
        -created_by: str
        -size: int
        -token_count: int
        -chunk_count: int
        -progress: float
        -progress_msg: str
        -process_begin_at: datetime
        -process_duration: float
        -run: str
        -status: str
        -meta_fields: dict
        +__init__(rag, res_dict)
        +update(update_message: dict): Document
        +download(): bytes
        +list_chunks(page=1, page_size=30, keywords="", id=""): list~Chunk~
        +add_chunk(content: str, important_keywords: list~str~=[], questions: list~str~=[]): Chunk
        +delete_chunks(ids: list~str~|None=None): None
    }

    class ParserConfig {
        +__init__(rag, res_dict)
    }

    class Base {
        <<abstract>>
        +get(url, params=None)
        +post(url, data=None)
        +put(url, data=None)
        +rm(url, data=None)
    }

    class Chunk {
        <<external>>
    }

    Base <|-- Document
    Base <|-- ParserConfig
    Document o-- ParserConfig
    Document --> Chunk : manages >

Summary

The document.py module provides an object-oriented wrapper around document entities in the InfiniFlow system. Through the backend's RESTful API it supports metadata updates, content download, and chunk-level operations (listing, adding, and deleting). HTTP handling is delegated to Base and chunk data is represented by Chunk, which keeps Document focused on document-level operations.