doc.py

Overview

The doc.py file provides an API for managing documents and their chunks within knowledgebase datasets in the InfiniFlow platform. It supports uploading, updating, downloading, listing, and deleting documents, as well as chunk-level operations: parsing, adding, removing, and updating. It also implements a retrieval endpoint that searches chunks across datasets based on query parameters.

The file exposes its HTTP API through Flask routes. Each route validates the request, checks authorization, and calls backend services and a document storage system. Embedding models and tokenizers are used to process document content for search and retrieval.


Classes

Chunk (Pydantic BaseModel)

Represents a chunk of a document with metadata and validation.

| Property | Type | Description | Default |
| --- | --- | --- | --- |
| id | str | Unique identifier of the chunk | "" |
| content | str | Text content of the chunk | "" |
| document_id | str | ID of the parent document | "" |
| docnm_kwd | str | Document name keyword associated with the chunk | "" |
| important_keywords | list[str] | List of important keywords in the chunk | Empty list |
| questions | list[str] | List of questions related to the chunk | Empty list |
| question_tks | str | Tokenized questions string | "" |
| image_id | str | Associated image ID (if any) | "" |
| available | bool | Availability status of the chunk | True |
| positions | list[list[int]] | List of positions (each a list of exactly 5 integers) | Empty list |

Validators

validate_positions: checks that every entry in positions is a list of exactly 5 integers.
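The 5-integer constraint on positions can be sketched as a plain function. This is a minimal illustration only: how doc.py attaches the check to the Pydantic model (decorator, error type, message wording) is not shown in this document, so those details are assumptions.

```python
def validate_positions(positions: list[list[int]]) -> list[list[int]]:
    """Return positions unchanged if every entry has exactly 5 integers.

    Assumption: raising ValueError mirrors what a Pydantic validator
    would do; the real error message in doc.py may differ.
    """
    for index, position in enumerate(positions):
        if len(position) != 5:
            raise ValueError(
                f"positions[{index}] must contain exactly 5 integers, "
                f"got {len(position)}"
            )
        if not all(isinstance(value, int) for value in position):
            raise ValueError(f"positions[{index}] must contain only integers")
    return positions


validate_positions([[10, 20, 30, 40, 50]])  # accepted, returned unchanged
```

An empty list passes the check, which matches the default value of positions.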

Usage Example

from doc import Chunk

chunk_data = {
    "id": "chunk123",
    "content": "This is a sample chunk of text.",
    "document_id": "doc456",
    "important_keywords": ["sample", "text"],
    "positions": [[10, 20, 30, 40, 50]],
}

chunk = Chunk(**chunk_data)
print(chunk.content)  # Output: This is a sample chunk of text.

API Endpoints

All endpoints are registered on a Flask blueprint (assumed here to be named manager) and protected by the @token_required decorator, which authenticates each request. The tenant_id is taken from the authentication context.
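To illustrate how a decorator can authenticate a request and hand tenant_id to the handler, here is a framework-free sketch. It is not the platform's @token_required: the name token_required_sketch, the in-memory API_TOKENS table, the bearer scheme, and the response shape are all hypothetical stand-ins.

```python
from functools import wraps

# Hypothetical token table; the real lookup is not shown in this document.
API_TOKENS = {"secret-token": "tenant-001"}


def token_required_sketch(func):
    """Resolve tenant_id from a bearer token and pass it to the handler."""

    @wraps(func)
    def wrapper(headers: dict, *args, **kwargs):
        # Assumption: the token arrives as "Authorization: Bearer <token>".
        scheme, _, token = headers.get("Authorization", "").partition(" ")
        tenant_id = API_TOKENS.get(token) if scheme == "Bearer" else None
        if tenant_id is None:
            # Assumption: the error payload shape is illustrative only.
            return {"code": 401, "message": "invalid or missing token"}
        return func(tenant_id, *args, **kwargs)

    return wrapper


@token_required_sketch
def list_docs(tenant_id, dataset_id):
    # The handler receives tenant_id without parsing headers itself.
    return {"code": 0, "tenant_id": tenant_id, "dataset_id": dataset_id}


result = list_docs({"Authorization": "Bearer secret-token"}, "ds123")
```

The point of the pattern is that each handler only declares tenant_id as a parameter, while token parsing and rejection live in one place.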

Document Management

POST /datasets/<dataset_id>/documents — Upload documents to a dataset


PUT /datasets/<dataset_id>/documents/<document_id> — Update a document's metadata and processing configuration


GET /datasets/<dataset_id>/documents/<document_id> — Download a document


GET /datasets/<dataset_id>/documents — List documents in a dataset


DELETE /datasets/<dataset_id>/documents — Delete documents from a dataset
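A client for the document routes above might build its requests as in the following sketch, which uses only the standard library and does not send anything. BASE_URL, API_KEY, the bearer scheme, and the "ids" field in the delete body are assumptions for illustration; the accepted request fields are not listed in this document.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:9380/api/v1"  # assumption: host, port, and prefix
API_KEY = "YOUR_API_KEY"                   # placeholder, not a real key


def build_request(method, path, payload=None):
    """Build (but do not send) an authenticated JSON request."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Authorization": f"Bearer {API_KEY}"}  # assumption: bearer scheme
    if data is not None:
        headers["Content-Type"] = "application/json"
    return Request(f"{BASE_URL}{path}", data=data, headers=headers, method=method)


# List documents in a dataset.
list_req = build_request("GET", "/datasets/ds123/documents")

# Delete documents; the "ids" key is a hypothetical body field.
delete_req = build_request("DELETE", "/datasets/ds123/documents", {"ids": ["doc456"]})

# Send with urllib.request.urlopen(list_req) once a server is reachable.
```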


Chunk Management

POST /datasets/<dataset_id>/chunks — Start parsing documents into chunks


DELETE /datasets/<dataset_id>/chunks — Stop parsing documents


GET /datasets/<dataset_id>/documents/<document_id>/chunks — List chunks of a document


POST /datasets/<dataset_id>/documents/<document_id>/chunks — Add a new chunk to a document


DELETE /datasets/<dataset_id>/documents/<document_id>/chunks — Remove chunks from a document


PUT /datasets/<dataset_id>/documents/<document_id>/chunks/<chunk_id> — Update a chunk
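The chunk routes share three path shapes: dataset-level, document-level, and single-chunk. The helper below is a small sketch of building those paths from the templates listed above; the function name chunk_path and the choice to percent-encode IDs are assumptions made for the example.

```python
from urllib.parse import quote


def chunk_path(dataset_id, document_id=None, chunk_id=None):
    """Build a chunk-related path from the route templates listed above."""
    path = f"/datasets/{quote(dataset_id, safe='')}"
    if document_id is None:
        # Dataset-level route: POST starts parsing, DELETE stops it.
        return f"{path}/chunks"
    # Document-level route: GET lists, POST adds, DELETE removes chunks.
    path += f"/documents/{quote(document_id, safe='')}/chunks"
    if chunk_id is not None:
        # Single-chunk route: PUT updates one chunk.
        path += f"/{quote(chunk_id, safe='')}"
    return path


print(chunk_path("ds123"))
print(chunk_path("ds123", "doc456"))
print(chunk_path("ds123", "doc456", "chunk789"))
```

Encoding each ID with safe='' keeps a stray slash in an identifier from changing which route is matched.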


Retrieval

POST /retrieval — Retrieve chunks based on a query
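A request body for the retrieval endpoint could be assembled as in this sketch. The field names "question", "dataset_ids", and "top_k" are hypothetical, since this document does not list the accepted parameters; check the handler before relying on them.

```python
import json


def build_retrieval_payload(question, dataset_ids, top_k=None):
    """Assemble a JSON body for POST /retrieval.

    Assumption: "question", "dataset_ids", and "top_k" are illustrative
    field names, not confirmed by this document.
    """
    if not question.strip():
        raise ValueError("question must not be empty")
    if not dataset_ids:
        raise ValueError("at least one dataset ID is required")
    payload = {"question": question, "dataset_ids": list(dataset_ids)}
    if top_k is not None:
        # Only include optional fields the caller actually set.
        payload["top_k"] = top_k
    return json.dumps(payload)


body = build_retrieval_payload("How are chunks parsed?", ["ds123"])
```

Validating on the client side avoids a round trip for requests that could never match any dataset.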


Important Implementation Details


Interactions with Other Components


Visual Diagram

classDiagram
    class Chunk {
        +id: str
        +content: str
        +document_id: str
        +docnm_kwd: str
        +important_keywords: list
        +questions: list
        +question_tks: str
        +image_id: str
        +available: bool
        +positions: list~list~int~~
        +validate_positions()
    }

    class DocumentService {
        +query()
        +update_by_id()
        +increment_chunk_num()
        +decrement_chunk_num()
        +remove_document()
        +update_parser_config()
        +get_by_id()
        +get_meta_by_kbs()
    }
    class FileService
    class KnowledgebaseService
    class File2DocumentService
    class TaskService
    class TenantLLMService
    class LLMBundle

    Chunk ..> rag_tokenizer : uses for tokenization
    Chunk ..> TenantLLMService : uses for embeddings

    DocumentService <.. doc.py : manages documents
    FileService <.. doc.py : manages files linked to documents
    KnowledgebaseService <.. doc.py : manages knowledgebase datasets
    File2DocumentService <.. doc.py : links files and documents
    TaskService <.. doc.py : queues parsing tasks
    TenantLLMService <.. doc.py : provides embedding models
    LLMBundle <.. doc.py : wraps language models

    doc.py o-- Chunk : defines
    doc.py --> DocumentService : CRUD operations
    doc.py --> FileService
    doc.py --> KnowledgebaseService
    doc.py --> File2DocumentService
    doc.py --> TaskService
    doc.py --> TenantLLMService
    doc.py --> LLMBundle

Summary

The doc.py file is the part of the InfiniFlow backend API through which tenants manage documents and document chunks within knowledgebase datasets. It covers file uploads, chunk parsing, chunk editing, and query-based retrieval backed by tokenizers and embedding models. To do this it relies on the database service layers (DocumentService, FileService, KnowledgebaseService, File2DocumentService, TaskService, TenantLLMService), the document storage backend, and token-based authentication on every endpoint.