doc.py

Overview

The doc.py file provides an API for managing documents and their chunks within knowledgebase datasets in the InfiniFlow platform. It supports uploading, updating, downloading, listing, and deleting documents, as well as chunk operations such as parsing, adding, removing, and updating. It also implements a retrieval endpoint that searches chunks across datasets based on query parameters.

The file exposes its HTTP API through Flask routes. Each route validates the request, checks authorization, and then calls the backend services and the document storage system. Embedding models and tokenizers are used to process chunk content and to support search and retrieval.

Classes

`Chunk` (Pydantic BaseModel)

Represents a chunk of a document with metadata and validation.

Property	Type	Description	Default
`id`	`str`	Unique identifier of the chunk	`""`
`content`	`str`	Text content of the chunk	`""`
`document_id`	`str`	ID of the parent document	`""`
`docnm_kwd`	`str`	Document name keyword associated with the chunk	`""`
`important_keywords`	`list[str]`	List of important keywords in the chunk	Empty list
`questions`	`list[str]`	List of questions related to the chunk	Empty list
`question_tks`	`str`	Tokenized questions string	`""`
`image_id`	`str`	Associated image ID (if any)	`""`
`available`	`bool`	Availability status of the chunk	`True`
`positions`	`list[list[int]]`	List of positions (each a list of exactly 5 integers)	Empty list

Validators

validate_positions: Ensures each sublist in positions contains exactly 5 integers.
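
The check performed by validate_positions can be illustrated without Pydantic. The following is a minimal, dependency-free sketch of the rule described above; the function body and the error message are assumptions, not the code from doc.py.

```python
def validate_positions(positions):
    """Return positions unchanged if every entry holds exactly 5 integers.

    Illustrative stand-in for the Pydantic validator; the real validator
    in doc.py may differ in wording and in how strictly it checks types.
    """
    for entry in positions:
        if len(entry) != 5 or not all(isinstance(v, int) for v in entry):
            raise ValueError("each entry in positions must contain exactly 5 integers")
    return positions


# A well-formed list passes through untouched.
ok = validate_positions([[10, 20, 30, 40, 50]])

# A short entry is rejected.
try:
    validate_positions([[1, 2, 3]])
    rejected = False
except ValueError:
    rejected = True
```

In the actual model, a failure of this kind would presumably surface as a Pydantic validation error when the Chunk is constructed.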

Usage Example

from doc import Chunk

chunk_data = {
    "id": "chunk123",
    "content": "This is a sample chunk of text.",
    "document_id": "doc456",
    "important_keywords": ["sample", "text"],
    "positions": [[10, 20, 30, 40, 50]],
}

chunk = Chunk(**chunk_data)
print(chunk.content)  # Output: This is a sample chunk of text.

API Endpoints

All endpoints are registered on a Flask blueprint (presumably the module-level manager object) and are protected by the @token_required decorator, which authenticates the request. The tenant_id passed to each handler comes from that authentication context.
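
To make the authentication flow concrete, here is a framework-free sketch of a decorator that reads a bearer token and hands the resolved tenant_id to the handler. It is not the project's @token_required: the token table, the AuthError class, and the handler signature are hypothetical, and the real decorator works on the Flask request rather than a plain headers dictionary.

```python
from functools import wraps

# Hypothetical token table; the real service resolves tokens in its database.
API_TOKENS = {"secret-token": "tenant-1"}


class AuthError(Exception):
    """Raised when the Authorization header is missing or invalid."""


def extract_bearer_token(headers):
    """Pull the token out of an 'Authorization: Bearer <token>' header."""
    scheme, _, token = headers.get("Authorization", "").partition(" ")
    if scheme != "Bearer" or not token.strip():
        raise AuthError("expected 'Authorization: Bearer <token>'")
    return token.strip()


def token_required_sketch(handler):
    """Resolve the tenant from the token and pass it as the first argument."""
    @wraps(handler)
    def wrapper(headers, *args, **kwargs):
        tenant_id = API_TOKENS.get(extract_bearer_token(headers))
        if tenant_id is None:
            raise AuthError("unknown API token")
        return handler(tenant_id, *args, **kwargs)
    return wrapper


@token_required_sketch
def list_documents(tenant_id, dataset_id):
    # A real handler would check dataset ownership before returning data.
    return {"tenant_id": tenant_id, "dataset_id": dataset_id}


result = list_documents({"Authorization": "Bearer secret-token"}, "ds1")
```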

Document Management

`POST /datasets/<dataset_id>/documents` — Upload documents to a dataset

Parameters: Files sent as form-data under the key "file".
Function: Validates files, uploads them to the knowledgebase dataset, and returns metadata for each uploaded document.
Response: List of documents with fields like id, name, chunk_count, token_count, dataset_id, chunk_method, and processing status run.
Errors: File validation errors, dataset not found, server errors.

`PUT /datasets/<dataset_id>/documents/<document_id>` — Update a document's metadata and processing configuration

Parameters: JSON body with fields such as name, parser_config, chunk_method, enabled, and meta_fields.
Function: Validates ownership and constraints, updates metadata, chunking method, parser config, or availability status.
Response: Updated document metadata.
Errors: Ownership errors, invalid updates (e.g., changing chunk count), name conflicts, unsupported chunk methods.

`GET /datasets/<dataset_id>/documents/<document_id>` — Download a document

Function: Streams the document file to the client.
Response: File stream with original filename.
Errors: Ownership errors, missing document, empty files.

`GET /datasets/<dataset_id>/documents` — List documents in a dataset

Parameters: Query filters like id, name, pagination (page, page_size), ordering (orderby, desc), and creation time range.
Function: Returns a paginated list of documents with metadata.
Response: Total count and list of documents with renamed keys.
Errors: Ownership or dataset access errors.
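
The pagination and key renaming described for this endpoint could look roughly like the sketch below. The public names (chunk_count, token_count, dataset_id, chunk_method) come from the response description above; the internal names on the left of the mapping, the default page size, and the "docs" key are assumptions made for illustration.

```python
# Hypothetical internal-to-public key mapping; the internal names are assumed.
KEY_MAPPING = {
    "chunk_num": "chunk_count",
    "token_num": "token_count",
    "kb_id": "dataset_id",
    "parser_id": "chunk_method",
}


def rename_keys(record):
    """Rename internal field names to the public API names, keeping the rest."""
    return {KEY_MAPPING.get(key, key): value for key, value in record.items()}


def paginate_documents(records, page=1, page_size=30):
    """Return the total count plus one page of renamed document records."""
    start = (page - 1) * page_size
    window = records[start:start + page_size]
    return {"total": len(records), "docs": [rename_keys(r) for r in window]}


records = [
    {"id": f"doc{i}", "kb_id": "ds1", "chunk_num": i, "token_num": i * 10, "parser_id": "naive"}
    for i in range(5)
]
second_page = paginate_documents(records, page=2, page_size=2)
```

Here second_page reports a total of 5 and contains doc2 and doc3 with the public key names.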

`DELETE /datasets/<dataset_id>/documents` — Delete documents from a dataset

Parameters: JSON body with ids (list of document IDs to delete). If omitted, deletes all documents in the dataset.
Function: Deletes documents and their associated files and chunks from storage and databases.
Response: Success or partial success message.
Errors: Ownership errors, database or storage errors.
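
The "success or partial success" behaviour can be sketched as a loop that keeps going after individual failures and reports them at the end. This is an illustration only: the in-memory dictionary stands in for the database and file storage, and the shape of the returned message is an assumption.

```python
def delete_documents(store, ids=None):
    """Delete the given documents, or every document when ids is None.

    `store` is a plain dict standing in for the real database and storage.
    """
    targets = list(store) if ids is None else list(ids)
    errors = []
    deleted = 0
    for doc_id in targets:
        if doc_id not in store:
            errors.append(f"document {doc_id} not found")
            continue
        del store[doc_id]
        deleted += 1

    if not errors:
        status = "success"
    elif deleted:
        status = "partial_success"
    else:
        status = "error"
    return {"status": status, "deleted": deleted, "errors": errors}


store = {"a": "A.pdf", "b": "B.pdf", "c": "C.pdf"}
partial = delete_documents(store, ["a", "missing"])  # one deleted, one failed
everything = delete_documents(store)                 # ids omitted: delete the rest
```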

Chunk Management

`POST /datasets/<dataset_id>/chunks` — Start parsing documents into chunks

Parameters: JSON body with document_ids to parse.
Function: Validates ownership, resets document chunk state, clears old chunks, queues parsing tasks.
Response: Success or partial success message.
Errors: Ownership errors, parsing conflicts, missing documents.

`DELETE /datasets/<dataset_id>/chunks` — Stop parsing documents

Parameters: JSON body with document_ids to stop parsing.
Function: Validates ownership, updates document status to cancel parsing, deletes partial chunk data.
Response: Success or partial success message.
Errors: Ownership errors, invalid document state.

`GET /datasets/<dataset_id>/documents/<document_id>/chunks` — List chunks of a document

Parameters: Pagination parameters (page, page_size), optional chunk id, and keywords for search.
Function: Retrieves chunks from the document storage, supports filtering by chunk ID or search keywords.
Response: Total count, list of chunks, and document metadata.
Errors: Ownership errors, chunk not found.

`POST /datasets/<dataset_id>/documents/<document_id>/chunks` — Add a new chunk to a document

Parameters: JSON body containing content (required), and optionally important_keywords and questions.
Function: Creates a chunk with tokenized content and embedding vector, inserts it into storage, updates document chunk count.
Response: Details of the added chunk.
Errors: Ownership errors, invalid input types.
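
The input checks implied by "content (required)" and "invalid input types" might be sketched as follows. The error messages and the normalisation of list items to stripped strings are assumptions; tokenization and embedding are deliberately left out.

```python
def validate_new_chunk(payload):
    """Check the JSON body for adding a chunk and return a normalised copy."""
    content = payload.get("content")
    if not isinstance(content, str) or not content.strip():
        raise ValueError("`content` is required")

    cleaned = {"content": content}
    for field in ("important_keywords", "questions"):
        value = payload.get(field, [])
        if not isinstance(value, list):
            raise ValueError(f"`{field}` must be a list")
        # Drop blank entries and coerce the rest to stripped strings.
        cleaned[field] = [str(item).strip() for item in value if str(item).strip()]
    return cleaned


chunk = validate_new_chunk({
    "content": "This is a sample chunk of text.",
    "important_keywords": ["sample", " text "],
})
```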

`DELETE /datasets/<dataset_id>/documents/<document_id>/chunks` — Remove chunks from a document

Parameters: JSON body containing chunk_ids to remove. If omitted, removes all chunks of the document.
Function: Deletes chunks from storage, decrements chunk count in document metadata.
Response: Success or partial success message.
Errors: Ownership errors, chunk deletion failures.

`PUT /datasets/<dataset_id>/documents/<document_id>/chunks/<chunk_id>` — Update a chunk

Parameters: JSON body with optional content, important_keywords, questions, and available status.
Function: Updates chunk content, keywords, and availability; re-computes embeddings; stores updated chunk.
Response: Success message.
Errors: Chunk not found, ownership errors, validation errors.

Retrieval

`POST /retrieval` — Retrieve chunks based on a query

Parameters: JSON body with dataset_ids, question, optional document_ids, similarity thresholds, pagination, highlighting, and metadata filters.
Function: Performs vector-based retrieval using embedding models, with optional reranking, cross-language translation, keyword extraction, and knowledge graph retrieval.
Response: List of matching chunks with similarity scores and metadata.
Errors: Ownership errors, mismatched embedding models among datasets, missing parameters.
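
Two of the checks above lend themselves to a short sketch: rejecting a query that spans datasets with different embedding models, and assembling the request body. The field names dataset_ids, question, and document_ids come from the parameter list; the "embedding_model" key, the treatment of question as required, and the extra option name in the usage lines are assumptions.

```python
def ensure_single_embedding_model(datasets):
    """Return the shared embedding model, or raise if the datasets disagree."""
    models = {d["embedding_model"] for d in datasets}  # hypothetical field name
    if len(models) > 1:
        raise ValueError(f"datasets use different embedding models: {sorted(models)}")
    return models.pop() if models else None


def build_retrieval_request(question, dataset_ids, document_ids=None, **options):
    """Assemble a JSON-serialisable body for POST /retrieval."""
    if not question:
        raise ValueError("`question` is required")
    if not dataset_ids:
        raise ValueError("`dataset_ids` is required")
    body = {"question": question, "dataset_ids": list(dataset_ids)}
    if document_ids:
        body["document_ids"] = list(document_ids)
    body.update(options)  # e.g. thresholds, pagination, highlighting
    return body


model = ensure_single_embedding_model([
    {"id": "ds1", "embedding_model": "model-a"},
    {"id": "ds2", "embedding_model": "model-a"},
])
body = build_retrieval_request("What is a chunk?", ["ds1", "ds2"], page=1)
```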

Important Implementation Details

Authorization: All endpoints require bearer token authentication validated by @token_required.
Ownership Checks: Uses KnowledgebaseService.accessible or .query to verify tenant ownership of datasets and documents.
Document and Chunk Storage:
- Documents and chunks are stored using a combination of relational database models (DocumentService, FileService) and a specialized chunk storage system accessed via settings.docStoreConn.
- Chunks are stored with embedding vectors, tokenized content, keywords, and metadata for efficient retrieval.
Chunk Tokenization and Embeddings:
- Uses rag_tokenizer for both standard and fine-grained tokenization.
- Embeddings are computed using tenant-specific embedding models (TenantLLMService).
Parsing Workflow:
- Parsing tasks are queued asynchronously (queue_tasks) to process documents into chunks.
- Document status is updated to reflect parsing progress.
Chunk Similarity and Retrieval:
- Retrieval supports vector similarity search with thresholding, weighting, reranking, and optional cross-language support.
- Supports metadata filtering and knowledge graph integration.
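
The thresholding and weighting mentioned under Chunk Similarity and Retrieval can be illustrated with a small scoring function. This is a generic weighted blend of cosine similarity and a keyword score, written as an assumption about the general technique; it is not the scoring code used by settings.retrievaler, and the parameter names and default values are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_chunks(query_vec, chunks, keyword_scores,
                similarity_threshold=0.2, vector_weight=0.3):
    """Blend vector and keyword similarity, drop low scores, sort descending."""
    ranked = []
    for chunk in chunks:
        vector_sim = cosine(query_vec, chunk["vector"])
        keyword_sim = keyword_scores.get(chunk["id"], 0.0)
        score = vector_weight * vector_sim + (1 - vector_weight) * keyword_sim
        if score >= similarity_threshold:
            ranked.append({"id": chunk["id"], "similarity": score})
    ranked.sort(key=lambda item: item["similarity"], reverse=True)
    return ranked


chunks = [
    {"id": "a", "vector": [1.0, 0.0]},
    {"id": "b", "vector": [0.0, 1.0]},
    {"id": "c", "vector": [1.0, 1.0]},
]
hits = rank_chunks([1.0, 0.0], chunks, keyword_scores={"a": 1.0, "c": 0.5})
```

With these inputs, chunk "b" scores 0 on both signals and falls below the threshold, so only "a" and "c" are returned, in that order.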

Interactions with Other Components

Database Services: Uses DocumentService, FileService, KnowledgebaseService, and TaskService to manage persistent data.
Storage Backend: Accesses chunk content and files via STORAGE_IMPL and settings.docStoreConn.
Tokenization and NLP: Utilizes rag_tokenizer and related NLP utilities for chunk processing.
Embedding Models: Employs tenant-specific embedding models from TenantLLMService and LLMBundle.
Search and Retrieval: Integrates with settings.retrievaler and settings.kg_retrievaler for chunk retrieval and knowledge graph queries.
API Utilities: Uses helper functions from api.utils.api_utils for error handling, result formatting, and validation.

Visual Diagram

classDiagram
    class Chunk {
        +id: str
        +content: str
        +document_id: str
        +docnm_kwd: str
        +important_keywords: list
        +questions: list
        +question_tks: str
        +image_id: str
        +available: bool
        +positions: list~list~int~~
        +validate_positions()
    }

    class DocumentService {
        +query()
        +update_by_id()
        +increment_chunk_num()
        +decrement_chunk_num()
        +remove_document()
        +update_parser_config()
        +get_by_id()
        +get_meta_by_kbs()
    }
    class FileService
    class KnowledgebaseService
    class File2DocumentService
    class TaskService
    class TenantLLMService
    class LLMBundle

    Chunk ..> rag_tokenizer : uses for tokenization
    Chunk ..> TenantLLMService : uses for embeddings

    DocumentService <.. doc.py : manages documents
    FileService <.. doc.py : manages files linked to documents
    KnowledgebaseService <.. doc.py : manages knowledgebase datasets
    File2DocumentService <.. doc.py : links files and documents
    TaskService <.. doc.py : queues parsing tasks
    TenantLLMService <.. doc.py : provides embedding models
    LLMBundle <.. doc.py : wraps language models

    doc.py o-- Chunk : defines
    doc.py --> DocumentService : CRUD operations
    doc.py --> FileService
    doc.py --> KnowledgebaseService
    doc.py --> File2DocumentService
    doc.py --> TaskService
    doc.py --> TenantLLMService
    doc.py --> LLMBundle

Summary

The doc.py file implements the part of the InfiniFlow backend API through which tenants manage documents and document chunks within knowledgebase datasets. It covers file upload, update, download, listing, and deletion; chunk parsing and editing; and chunk retrieval backed by tokenization and embedding models. To do this it relies on the database service layer, the storage backends, rag_tokenizer, and tenant-specific embedding models, and it checks tenant ownership before acting on a dataset or document.
