doc.py
Overview
The doc.py file provides an API for managing documents and their chunks within knowledgebase datasets in the InfiniFlow platform. It supports uploading, updating, downloading, listing, and deleting documents, as well as chunk operations such as parsing, adding, removing, and updating. It also implements a retrieval endpoint for searching chunks across datasets based on query parameters.
The file leverages Flask routes for HTTP API endpoints, performing validation, authorization, and interacting with various backend services and a document storage system. It integrates embedding models and tokenizers to process document content and support advanced search and retrieval features.
Classes
Chunk (Pydantic BaseModel)
Represents a chunk of a document with metadata and validation.
| Property | Type | Description | Default |
|---|---|---|---|
| id | str | Unique identifier of the chunk | |
| content | str | Text content of the chunk | |
| document_id | str | ID of the parent document | |
| docnm_kwd | str | Document name keyword associated with the chunk | |
| important_keywords | list | List of important keywords in the chunk | Empty list |
| questions | list | List of questions related to the chunk | Empty list |
| question_tks | str | Tokenized questions string | |
| image_id | str | Associated image ID (if any) | |
| available | bool | Availability status of the chunk | |
| positions | list[list[int]] | List of positions (each a list of exactly 5 integers) | Empty list |
Validators
validate_positions: Ensures each sublist in positions contains exactly 5 integers.
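The rule this validator enforces can be sketched without Pydantic. The function below is an illustrative, stdlib-only stand-in rather than the code from doc.py; the name check_positions and the error message are assumptions.

```python
from typing import List


def check_positions(positions: List[List[int]]) -> List[List[int]]:
    """Return positions unchanged if every sublist holds exactly 5 integers.

    Hypothetical stand-in for the rule described for validate_positions.
    Raises ValueError on the first sublist that breaks the rule.
    """
    for index, position in enumerate(positions):
        has_five_items = len(position) == 5
        all_integers = all(isinstance(value, int) for value in position)
        if not (has_five_items and all_integers):
            raise ValueError(
                f"positions[{index}] must contain exactly 5 integers, got {position!r}"
            )
    return positions
```

An empty list passes, which matches the empty-list default listed for positions.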
Usage Example
from doc import Chunk

chunk_data = {
    "id": "chunk123",
    "content": "This is a sample chunk of text.",
    "document_id": "doc456",
    "important_keywords": ["sample", "text"],
    "positions": [[10, 20, 30, 40, 50]],
}
chunk = Chunk(**chunk_data)
print(chunk.content)  # Output: This is a sample chunk of text.
API Endpoints
All endpoints are registered on a Flask blueprint (assumed to be named manager) and protected by the @token_required decorator, which handles authentication. The tenant_id is extracted from the authentication context.
Document Management
POST /datasets/<dataset_id>/documents — Upload documents to a dataset
Parameters: Files sent as form-data under the key "file".
Function: Validates files, uploads them to the knowledgebase dataset, and returns metadata for each uploaded document.
Response: List of documents with fields like id, name, chunk_count, token_count, dataset_id, chunk_method, and the processing status run.
Errors: File validation errors, dataset not found, server errors.
PUT /datasets/<dataset_id>/documents/<document_id> — Update a document's metadata and processing configuration
Parameters: JSON body with fields such as name, parser_config, chunk_method, enabled, and meta_fields.
Function: Validates ownership and constraints, updates metadata, chunking method, parser config, or availability status.
Response: Updated document metadata.
Errors: Ownership errors, invalid updates (e.g., changing chunk count), name conflicts, unsupported chunk methods.
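The constraints listed for this endpoint can be pictured as a small pre-update check. This is a sketch under assumptions, not the logic in doc.py: the function name validate_update, its arguments, and the error strings are hypothetical, and the caller supplies the set of supported chunk methods.

```python
from typing import Any, Dict, List, Set


def validate_update(
    current: Dict[str, Any],
    changes: Dict[str, Any],
    existing_names: Set[str],
    supported_methods: Set[str],
) -> List[str]:
    """Collect reasons an update should be rejected; an empty list means OK."""
    errors: List[str] = []

    # The chunk count is derived from parsing, so a client may not change it.
    if "chunk_count" in changes and changes["chunk_count"] != current["chunk_count"]:
        errors.append("chunk_count cannot be changed")

    # A rename must not collide with another document in the same dataset.
    new_name = changes.get("name")
    if new_name is not None and new_name != current["name"] and new_name in existing_names:
        errors.append(f"name {new_name!r} is already used in this dataset")

    # Only chunk methods known to the caller are accepted.
    method = changes.get("chunk_method")
    if method is not None and method not in supported_methods:
        errors.append(f"unsupported chunk_method {method!r}")

    return errors
```

Collecting every problem before responding lets a client fix a request in one round trip; the real endpoint may instead stop at the first error.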
GET /datasets/<dataset_id>/documents/<document_id> — Download a document
Function: Streams the document file to the client.
Response: File stream with original filename.
Errors: Ownership errors, missing document, empty files.
GET /datasets/<dataset_id>/documents — List documents in a dataset
Parameters: Query filters like id, name, pagination (page, page_size), ordering (orderby, desc), and creation time range.
Function: Returns a paginated list of documents with metadata.
Response: Total count and list of documents with renamed keys.
Errors: Ownership or dataset access errors.
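How the filter, ordering, and pagination parameters combine can be illustrated on an in-memory list. The parameter names name, page, page_size, orderby, and desc come from the description above; the function itself, the default values, and the create_time field are assumptions made for this sketch.

```python
from typing import Any, Dict, List, Optional, Tuple


def list_documents(
    documents: List[Dict[str, Any]],
    name: Optional[str] = None,
    page: int = 1,
    page_size: int = 30,  # assumed default, not taken from doc.py
    orderby: str = "create_time",  # assumed field name
    desc: bool = True,
) -> Tuple[int, List[Dict[str, Any]]]:
    """Filter, sort, and paginate; return (total matches, requested page)."""
    # Filter first so the total reflects the filtered set, not the whole dataset.
    matches = [doc for doc in documents if name is None or doc["name"] == name]
    matches.sort(key=lambda doc: doc[orderby], reverse=desc)
    # Pages are 1-based: page 1 starts at offset 0.
    start = (page - 1) * page_size
    return len(matches), matches[start : start + page_size]
```

Returning the total alongside the page mirrors the response shape described above, where the client receives a total count and a list of documents.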
DELETE /datasets/<dataset_id>/documents — Delete documents from a dataset
Parameters: JSON body with ids (list of document IDs to delete). If omitted, deletes all documents in the dataset.
Function: Deletes documents and their associated files and chunks from storage and databases.
Response: Success or partial success message.
Errors: Ownership errors, database or storage errors.
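The "success or partial success" response used by this and several later endpoints suggests a loop that keeps going when one item fails. A minimal sketch of that pattern follows; the function name, the delete_one callback, and the result keys success_count and errors are hypothetical.

```python
from typing import Callable, Dict, List


def delete_documents(
    ids: List[str], delete_one: Callable[[str], None]
) -> Dict[str, object]:
    """Try every ID and collect failures instead of aborting on the first one."""
    errors: List[str] = []
    deleted = 0
    for doc_id in ids:
        try:
            delete_one(doc_id)
            deleted += 1
        except Exception as exc:
            # Record the failure and continue with the remaining IDs.
            errors.append(f"{doc_id}: {exc}")
    return {"success_count": deleted, "errors": errors}
```

A caller can report full success when errors is empty and partial success otherwise.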
Chunk Management
POST /datasets/<dataset_id>/chunks — Start parsing documents into chunks
Parameters: JSON body with document_ids to parse.
Function: Validates ownership, resets document chunk state, clears old chunks, queues parsing tasks.
Response: Success or partial success message.
Errors: Ownership errors, parsing conflicts, missing documents.
DELETE /datasets/<dataset_id>/chunks — Stop parsing documents
Parameters: JSON body with document_ids to stop parsing.
Function: Validates ownership, updates document status to cancel parsing, deletes partial chunk data.
Response: Success or partial success message.
Errors: Ownership errors, invalid document state.
GET /datasets/<dataset_id>/documents/<document_id>/chunks — List chunks of a document
Parameters: Pagination parameters (page, page_size), optional chunk id, and keywords for search.
Function: Retrieves chunks from the document storage, supports filtering by chunk ID or search keywords.
Response: Total count, list of chunks, and document metadata.
Errors: Ownership errors, chunk not found.
POST /datasets/<dataset_id>/documents/<document_id>/chunks — Add a new chunk to a document
Parameters: JSON body containing content (required), and optionally important_keywords and questions.
Function: Creates a chunk with tokenized content and embedding vector, inserts it into storage, updates document chunk count.
Response: Details of the added chunk.
Errors: Ownership errors, invalid input types.
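The steps for adding a chunk (validate input, tokenize, embed, assemble a record) can be sketched as follows. This is not the code in doc.py: the tokenizer and embedding model are passed in as plain callables, and the function name and the record keys content_tokens and vector are assumptions.

```python
from typing import Any, Callable, Dict, List, Optional


def build_chunk(
    content: str,
    tokenize: Callable[[str], List[str]],
    embed: Callable[[str], List[float]],
    important_keywords: Optional[List[str]] = None,
    questions: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Validate the inputs and assemble a chunk record ready for storage."""
    if not content:
        raise ValueError("content is required")
    if important_keywords is None:
        important_keywords = []
    if questions is None:
        questions = []
    # Reject wrong input types early, before any tokenizing or embedding work.
    if not isinstance(important_keywords, list):
        raise TypeError("important_keywords must be a list")
    if not isinstance(questions, list):
        raise TypeError("questions must be a list")
    return {
        "content": content,
        "content_tokens": tokenize(content),
        "important_keywords": important_keywords,
        "questions": questions,
        "vector": embed(content),
    }
```

Passing the tokenizer and embedder as arguments keeps the sketch testable with trivial stand-ins; in the real file these roles are filled by rag_tokenizer and a tenant-specific embedding model.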
DELETE /datasets/<dataset_id>/documents/<document_id>/chunks — Remove chunks from a document
Parameters: JSON body containing chunk_ids to remove. If omitted, removes all chunks of the document.
Function: Deletes chunks from storage, decrements chunk count in document metadata.
Response: Success or partial success message.
Errors: Ownership errors, chunk deletion failures.
PUT /datasets/<dataset_id>/documents/<document_id>/chunks/<chunk_id> — Update a chunk
Parameters: JSON body with optional content, important_keywords, questions, and available status.
Function: Updates chunk content, keywords, and availability; re-computes embeddings; stores updated chunk.
Response: Success message.
Errors: Chunk not found, ownership errors, validation errors.
Retrieval
POST /retrieval — Retrieve chunks based on a query
Parameters: JSON body with dataset_ids, question, optional document_ids, similarity thresholds, pagination, highlighting, and metadata filters.
Function: Performs vector-based retrieval using embedding models, optional reranking, language translation, keyword extraction, and knowledge graph retrieval.
Response: List of matching chunks with similarity scores and metadata.
Errors: Ownership errors, mismatched embedding models among datasets, missing parameters.
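A client-side view of this endpoint may help. The sketch below builds, but does not send, a retrieval request using only the standard library. The body fields question, dataset_ids, and document_ids come from the description above; the base URL, the API key, and the function name are placeholders, and the other optional fields are left out because their exact names are not given here.

```python
import json
import urllib.request
from typing import Any, Dict, List, Optional


def build_retrieval_request(
    base_url: str,
    api_key: str,
    question: str,
    dataset_ids: List[str],
    document_ids: Optional[List[str]] = None,
) -> urllib.request.Request:
    """Build (without sending) a POST /retrieval request with a bearer token."""
    body: Dict[str, Any] = {"question": question, "dataset_ids": dataset_ids}
    if document_ids:
        body["document_ids"] = document_ids
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/retrieval",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending the request is then a matter of passing the returned object to urllib.request.urlopen against a running server.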
Important Implementation Details
Authorization: All endpoints require bearer token authentication validated by @token_required.
Ownership Checks: Uses KnowledgebaseService.accessible or .query to verify tenant ownership of datasets and documents.
Document and Chunk Storage:
Documents and chunks are stored using a combination of relational database models (DocumentService, FileService) and a specialized chunk storage system accessed via settings.docStoreConn.
Chunks are stored with embedding vectors, tokenized content, keywords, and metadata for efficient retrieval.
Chunk Tokenization and Embeddings:
Uses rag_tokenizer for tokenization and fine-grained tokenization.
Embeddings are computed using tenant-specific embedding models (TenantLLMService).
Parsing Workflow:
Parsing tasks are queued asynchronously (queue_tasks) to process documents into chunks.
Document status is updated to reflect parsing progress.
Chunk Similarity and Retrieval:
Retrieval supports vector similarity search with thresholding, weighting, reranking, and optional cross-language support.
Supports metadata filtering and knowledge graph integration.
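Thresholding and weighting can be made concrete with a small hybrid-scoring sketch. It is an illustration under assumptions, not the retriever behind settings.retrievaler: the score here is a weighted sum of cosine similarity and keyword overlap, and the names rank_chunks, similarity_threshold, and vector_weight, along with their default values, are hypothetical.

```python
import math
from typing import Any, Dict, List, Sequence, Set, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_chunks(
    query_vector: Sequence[float],
    query_terms: Set[str],
    chunks: List[Dict[str, Any]],
    similarity_threshold: float = 0.2,
    vector_weight: float = 0.3,
) -> List[Tuple[str, float]]:
    """Score each chunk, drop those below the threshold, sort best first."""
    scored: List[Tuple[str, float]] = []
    for chunk in chunks:
        vector_score = cosine(query_vector, chunk["vector"])
        # Keyword score: fraction of query terms that appear in the chunk.
        overlap = len(query_terms & set(chunk["tokens"]))
        keyword_score = overlap / len(query_terms) if query_terms else 0.0
        score = vector_weight * vector_score + (1 - vector_weight) * keyword_score
        if score >= similarity_threshold:
            scored.append((chunk["id"], score))
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

A reranking step, when enabled, would reorder the surviving chunks with a separate model; that stage is omitted here.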
Interactions with Other Components
Database Services: Uses various services like DocumentService, FileService, KnowledgebaseService, and TaskService to manage persistent data.
Storage Backend: Accesses chunk content and files via STORAGE_IMPL and settings.docStoreConn.
Tokenization and NLP: Utilizes rag_tokenizer and related NLP utilities for chunk processing.
Embedding Models: Employs tenant-specific embedding models from TenantLLMService and LLMBundle.
Search and Retrieval: Integrates with settings.retrievaler and settings.kg_retrievaler for chunk retrieval and knowledge graph queries.
API Utilities: Uses helper functions from api.utils.api_utils for error handling, result formatting, and validation.
Visual Diagram
classDiagram
class Chunk {
+id: str
+content: str
+document_id: str
+docnm_kwd: str
+important_keywords: list
+questions: list
+question_tks: str
+image_id: str
+available: bool
+positions: list~list~int~~
+validate_positions()
}
class DocumentService {
+query()
+update_by_id()
+increment_chunk_num()
+decrement_chunk_num()
+remove_document()
+update_parser_config()
+get_by_id()
+get_meta_by_kbs()
}
class FileService
class KnowledgebaseService
class File2DocumentService
class TaskService
class TenantLLMService
class LLMBundle
Chunk ..> rag_tokenizer : uses for tokenization
Chunk ..> TenantLLMService : uses for embeddings
DocumentService <.. doc.py : manages documents
FileService <.. doc.py : manages files linked to documents
KnowledgebaseService <.. doc.py : manages knowledgebase datasets
File2DocumentService <.. doc.py : links files and documents
TaskService <.. doc.py : queues parsing tasks
TenantLLMService <.. doc.py : provides embedding models
LLMBundle <.. doc.py : wraps language models
doc.py o-- Chunk : defines
doc.py --> DocumentService : CRUD operations
doc.py --> FileService
doc.py --> KnowledgebaseService
doc.py --> File2DocumentService
doc.py --> TaskService
doc.py --> TenantLLMService
doc.py --> LLMBundle
Summary
The doc.py file is a critical component of the InfiniFlow backend API that enables tenants to manage their documents and document chunks within knowledgebase datasets. It facilitates operations ranging from file uploads to complex chunk parsing and retrieval with advanced NLP and embedding techniques. The file heavily interacts with database service layers, storage backends, tokenizers, and embedding models to ensure secure, scalable, and efficient document processing and search capabilities.