document.py
Overview
The document.py file defines the Document class, which represents a document entity within the InfiniFlow system. This class provides an interface to manage document metadata, perform updates, download document content, and interact with document chunks (smaller portions or segments of the document). It also contains an inner class ParserConfig for configuration related to document parsing.
The Document class inherits from a base class Base (from .base) which presumably provides common RESTful HTTP methods (get, post, put, rm) to interact with the backend API. The Document class uses these methods to communicate with dataset and document endpoints, facilitating CRUD operations and chunk management.
Classes
Document (inherits from Base)
Represents a document in the InfiniFlow system, encapsulating metadata, parsing configuration, and chunk management.
Properties
| Property | Type | Description |
|---|---|---|
| `id` | str | Unique identifier of the document. |
| `name` | str | Name/title of the document. |
| `thumbnail` | Any | Optional thumbnail image/preview for the document. |
| `dataset_id` | str | Identifier of the dataset this document belongs to. |
| `chunk_method` | str | Method used to chunk the document; the class assigns a default. |
| `parser_config` | dict | Configuration related to document parsing (default covers pages). |
| `source_type` | str | Source type of the document. |
| `type` | str | Type/category of the document. |
| `created_by` | str | Creator/owner of the document. |
| `size` | int | Size of the document in bytes. |
| `token_count` | int | Number of tokens in the document (for NLP purposes). |
| `chunk_count` | int | Number of chunks the document has been divided into. |
| `progress` | float | Progress of document processing (0.0 to 100.0). |
| `progress_msg` | str | Message describing current progress status. |
| `process_begin_at` | datetime | Timestamp when processing began. |
| `process_duration` | float | Duration in seconds of the processing. |
| `run` | str | Run identifier/version string. |
| `status` | str | Status code of the document. |
| `meta_fields` | dict | Dictionary to store arbitrary metadata fields. |
Constructor
def __init__(self, rag, res_dict):
- `rag`: Context or client object for API interaction (passed to `Base`).
- `res_dict`: Dictionary containing initial values for document properties. Only keys matching the class attributes are retained.
Initializes a Document instance, sanitizing the input dictionary to remove unknown keys before calling the base initializer.
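The key filtering described above can be illustrated with a minimal sketch. The `KNOWN_FIELDS` set and the `sanitize` helper are hypothetical names introduced for illustration; the real class derives the allowed keys from its own attributes.

```python
# Hypothetical subset of attribute names; the real class defines the full list.
KNOWN_FIELDS = {"id", "name", "dataset_id", "chunk_count", "meta_fields"}


def sanitize(res_dict: dict, known_fields: set[str]) -> dict:
    """Return a copy of res_dict containing only keys listed in known_fields."""
    return {key: value for key, value in res_dict.items() if key in known_fields}


raw = {"id": "doc-1", "name": "report.pdf", "unexpected_key": 42}
clean = sanitize(raw, KNOWN_FIELDS)
print(clean)
```

Filtering before the base initializer runs means an unexpected field in a server response cannot silently become an attribute of the instance.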
Methods
update
def update(self, update_message: dict) -> "Document":
Updates the document metadata on the server.
Parameters:
- `update_message` (dict): Dictionary containing fields to update. If it contains `meta_fields`, that value must be a dictionary.
Returns:
- `self`, after updating internal state with the server response.
Raises:
- `Exception` if `meta_fields` is not a dictionary or if the server response contains an error.
Usage Example:
doc.update({"name": "New Document Title", "meta_fields": {"author": "John Doe"}})
Implementation details:
- Calls HTTP PUT on `/datasets/{dataset_id}/documents/{id}` with `update_message`.
- Parses the response JSON; on success, updates the instance attributes accordingly.
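The update flow can be sketched roughly as follows. This is not the SDK's code: `update_document` and `fake_put` are hypothetical, the document is modelled as a plain dict, and the sketch assumes that a `code` of `0` signals success and that updated fields arrive under a `data` key.

```python
from typing import Callable


def update_document(put: Callable[[str, dict], dict], doc: dict, update_message: dict) -> dict:
    """Validate update_message, send it with `put`, and merge the reply into doc."""
    if "meta_fields" in update_message and not isinstance(update_message["meta_fields"], dict):
        raise Exception("meta_fields must be a dictionary")

    path = f"/datasets/{doc['dataset_id']}/documents/{doc['id']}"
    res = put(path, update_message)  # assumed to return already-decoded JSON
    if res.get("code") != 0:  # assumption: code 0 means success
        raise Exception(res.get("message", "update failed"))

    doc.update(res.get("data", {}))  # assumption: updated fields arrive under "data"
    return doc


def fake_put(path: str, payload: dict) -> dict:
    """Stand-in for the backend: echoes the payload back as the updated fields."""
    return {"code": 0, "data": payload}


doc = {"id": "d1", "dataset_id": "ds1", "name": "old"}
update_document(fake_put, doc, {"name": "New Document Title", "meta_fields": {"author": "John Doe"}})
print(doc["name"])
```

Validating `meta_fields` before the request is sent means a malformed update fails locally, without a round trip to the server.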
download
def download(self) -> bytes:
Downloads the full content of the document.
Parameters: None.
Returns: Raw bytes content of the document.
Raises:
- `Exception` if the server response contains an error.
Usage Example:
content_bytes = doc.download()
with open("document.pdf", "wb") as f:
f.write(content_bytes)
Implementation details:
- Calls HTTP GET on `/datasets/{dataset_id}/documents/{id}`.
- Attempts to parse the response as JSON to detect error codes.
- Returns the raw content if no error is detected or if JSON decoding fails.
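The JSON-or-bytes decision can be sketched on its own, independent of any HTTP client. The helper name `extract_content` is hypothetical, and the sketch assumes an error envelope is a JSON object with a non-zero `code` and an optional `message`.

```python
import json


def extract_content(body: bytes) -> bytes:
    """Return body unchanged unless it decodes to a JSON error envelope."""
    try:
        parsed = json.loads(body)
    except ValueError:  # covers both JSONDecodeError and UnicodeDecodeError
        return body  # not JSON: treat as raw document content
    # Assumption: an error envelope is a JSON object with a non-zero "code".
    if isinstance(parsed, dict) and parsed.get("code", 0) != 0:
        raise Exception(parsed.get("message", "download failed"))
    return body


pdf_bytes = b"%PDF-1.4 \xff\xfe binary payload"
print(extract_content(pdf_bytes) == pdf_bytes)
```

One limitation of this approach: a document whose own content is a JSON object with a non-zero `code` field would be mistaken for an error response.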
list_chunks
def list_chunks(self, page=1, page_size=30, keywords="", id="") -> list[Chunk]:
Retrieves a paginated list of chunks belonging to this document.
Parameters:
- `page` (int, default `1`): Page number for pagination.
- `page_size` (int, default `30`): Number of chunks per page.
- `keywords` (str, default `""`): Keywords to filter chunks.
- `id` (str, default `""`): Specific chunk ID filter.
Returns: List of `Chunk` instances.
Raises:
- `Exception` if the server returns an error.
Usage Example:
chunks = doc.list_chunks(page=2, page_size=10, keywords="important")
for chunk in chunks:
print(chunk.content)
Implementation details:
- Calls HTTP GET on `/datasets/{dataset_id}/documents/{id}/chunks` with query params.
- Parses the response JSON and instantiates a `Chunk` object for each chunk record.
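A small simulation can make the pagination and wrapping steps concrete. Everything here is illustrative: the `Chunk` dataclass is a minimal stand-in for the SDK class, `fake_get` replaces the backend, and the response layout (`data["chunks"]`, success `code` of `0`) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Minimal stand-in for the SDK's Chunk class."""
    id: str
    content: str


def list_chunks(get, dataset_id, document_id, page=1, page_size=30, keywords="", id=""):
    """Request one page of chunks and wrap each record in a Chunk."""
    params = {"page": page, "page_size": page_size, "keywords": keywords, "id": id}
    res = get(f"/datasets/{dataset_id}/documents/{document_id}/chunks", params)
    if res.get("code") != 0:  # assumption: code 0 means success
        raise Exception(res.get("message", "listing chunks failed"))
    # Assumption: chunk records are returned under data["chunks"].
    return [Chunk(id=c["id"], content=c["content"]) for c in res["data"]["chunks"]]


def fake_get(path, params):
    """Serve one page from an in-memory list of 25 chunks."""
    all_chunks = [{"id": f"c{i}", "content": f"text {i}"} for i in range(25)]
    start = (params["page"] - 1) * params["page_size"]
    return {"code": 0, "data": {"chunks": all_chunks[start:start + params["page_size"]]}}


chunks = list_chunks(fake_get, "ds1", "d1", page=2, page_size=10)
print([c.id for c in chunks])
```

Because pages are 1-indexed, page 2 with a page size of 10 starts at the eleventh record.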
add_chunk
def add_chunk(self, content: str, important_keywords: list[str] = [], questions: list[str] = []) -> Chunk:
Adds a new chunk to the document.
Parameters:
- `content` (str): Text content of the chunk.
- `important_keywords` (list[str], optional): List of keywords flagged as important.
- `questions` (list[str], optional): List of associated questions for the chunk.
Returns: A new `Chunk` instance representing the created chunk.
Raises:
- `Exception` if the server returns an error.
Usage Example:
new_chunk = doc.add_chunk("This is a new chunk of text.", ["keyword1"], ["What is this?"])
Implementation details:
- Calls HTTP POST on `/datasets/{dataset_id}/documents/{id}/chunks` with a JSON payload including content, keywords, and questions.
- Parses the response JSON and returns a `Chunk` object.
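The request body for chunk creation might be assembled along these lines. The helper `build_chunk_payload` is hypothetical, and the payload keys are assumed to mirror the parameter names. Unlike the documented signature, which uses `[]` defaults, this sketch uses `None` defaults so that calls never share one list object.

```python
from typing import Optional


def build_chunk_payload(
    content: str,
    important_keywords: Optional[list[str]] = None,
    questions: Optional[list[str]] = None,
) -> dict:
    """Build a JSON-serializable body for a chunk-creation request."""
    # None defaults give every call its own fresh lists.
    return {
        "content": content,
        "important_keywords": list(important_keywords or []),
        "questions": list(questions or []),
    }


payload = build_chunk_payload("This is a new chunk of text.", ["keyword1"], ["What is this?"])
print(payload)
```

Copying the lists with `list(...)` also keeps later mutations of the caller's lists from changing a payload that has already been built.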
delete_chunks
def delete_chunks(self, ids: list[str] | None = None) -> None:
Deletes chunks from the document.
Parameters:
- `ids` (list[str] or None): List of chunk IDs to delete. If `None`, no chunks are deleted.
Returns: None.
Raises:
- `Exception` if the server returns an error.
Usage Example:
doc.delete_chunks(["chunk_id_1", "chunk_id_2"])
Implementation details:
- Calls HTTP DELETE (via the `rm` method) on `/datasets/{dataset_id}/documents/{id}/chunks` with a list of chunk IDs.
- Raises an exception on failure.
Document.ParserConfig (inherits from Base)
A nested class intended to handle parser configuration details for the document.
Constructor:
def __init__(self, rag, res_dict):
- `rag`: Context or client object.
- `res_dict`: Dictionary of parser configuration values.
Implementation details:
- Currently only calls the base class constructor.
- No additional properties or methods are defined explicitly.
Important Implementation Details
- The class enforces strict attribute initialization by removing unknown keys from the input dictionary (`res_dict`) before passing it to the base class.
- The class relies heavily on backend REST API calls, using paths that reference the dataset and document IDs.
- It uses the `code` field of the JSON response to determine success or failure, raising exceptions with the server's message on failure.
- Chunk operations wrap chunk data in `Chunk` objects rather than returning raw dictionaries.
- The `download()` method handles both JSON error responses and raw content (such as PDFs or other binaries).
- The `update()` method requires `meta_fields`, when present, to be a dictionary and raises an exception otherwise.
Interaction with Other Parts of the System
- Base Class (`base.Base`): Provides fundamental HTTP methods (`get`, `post`, `put`, `rm`) and likely common functionality for API interaction.
- Chunk Class (`chunk.Chunk`): Represents document chunks. `Document` creates and manipulates `Chunk` instances when dealing with parts of the document.
- API Backend: The class methods interact with a RESTful backend API exposing dataset and document endpoints for persistence and retrieval.
- RAG (Retrieval-Augmented Generation) Context (`rag`): Passed to constructors; presumably a context or client managing API calls or document processing workflows.
Usage Summary
The typical workflow with Document involves:
- Instantiating a `Document` object with metadata.
- Updating metadata using `update()`.
- Downloading full content using `download()`.
- Managing chunks via `list_chunks()`, `add_chunk()`, and `delete_chunks()` for fine-grained content operations.
Visual Diagram
classDiagram
class Document {
-id: str
-name: str
-thumbnail: Any
-dataset_id: str
-chunk_method: str
-parser_config: dict
-source_type: str
-type: str
-created_by: str
-size: int
-token_count: int
-chunk_count: int
-progress: float
-progress_msg: str
-process_begin_at: datetime
-process_duration: float
-run: str
-status: str
-meta_fields: dict
+__init__(rag, res_dict)
+update(update_message: dict): Document
+download(): bytes
+list_chunks(page=1, page_size=30, keywords="", id=""): list~Chunk~
+add_chunk(content: str, important_keywords: list~str~=[], questions: list~str~=[]): Chunk
+delete_chunks(ids: list~str~|None=None): None
}
class ParserConfig {
+__init__(rag, res_dict)
}
class Base {
<<abstract>>
+get(url, params=None)
+post(url, data=None)
+put(url, data=None)
+rm(url, data=None)
}
class Chunk {
<<external>>
}
Base <|-- Document
Base <|-- ParserConfig
Document o-- ParserConfig
Document --> Chunk : manages >
Summary
The document.py module provides an object-oriented wrapper around document entities in the InfiniFlow system, covering metadata updates, content download, and chunk-level manipulation through a RESTful API. HTTP functionality is delegated to Base and chunk data is encapsulated in Chunk, which keeps Document focused on document-level operations.