chunk_app.py

Overview

chunk_app.py is a Flask-based API module responsible for managing "chunks" of document content within the InfiniFlow platform. These chunks represent smaller semantic pieces or segments extracted from documents stored in knowledge bases. The file provides REST endpoints to create, update, retrieve, delete, and search these chunks. It also supports auxiliary functionalities like toggling chunk availability, performing retrieval tests with ranking and filtering, and extracting knowledge graph structures.

The module depends on the following services:

DocumentService: for document metadata and tenant information
KnowledgebaseService: for knowledge base management
SearchService: for searching and retrieving chunk data
LLMBundle: for embedding encoding and reranking using language models
UserTenantService: for user-tenant authorization

The module applies tokenization, embedding, and ranking techniques to efficiently store and retrieve document chunks, enabling downstream question answering, knowledge graph visualization, and semantic search workflows.

Detailed Explanation of Endpoints and Functions

All routes are registered on the manager Flask blueprint, which is assumed to be defined outside this file. Each endpoint requires an authenticated user (flask_login) and validates request parameters with a custom validate_request decorator.
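As a rough illustration of the parameter check, the sketch below shows a framework-free decorator that rejects a payload missing required keys. The real validate_request reads from the Flask request, and its signature and error format are not shown in this document, so `require_fields`, `switch_sketch`, and the returned error dict are hypothetical names chosen for the example.

```python
from functools import wraps


def require_fields(*required):
    """Hypothetical stand-in for validate_request: checks a plain dict payload."""

    def decorator(handler):
        @wraps(handler)
        def wrapper(payload, *args, **kwargs):
            missing = [name for name in required if name not in payload]
            if missing:
                # Assumed error shape; the real module returns its own JSON envelope.
                return {"ok": False, "error": "missing: " + ", ".join(missing)}
            return handler(payload, *args, **kwargs)

        return wrapper

    return decorator


@require_fields("chunk_ids", "available_int", "doc_id")
def switch_sketch(payload):
    # Placeholder body: the real endpoint updates the document store.
    return {"ok": True, "updated": len(payload["chunk_ids"])}
```

Checking required keys before the handler runs keeps each endpoint body free of repeated validation code.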

1. `list_chunk()`

Route: /list (POST)
Purpose: List chunks of a specified document with optional keyword filtering and pagination.
Parameters (JSON):

doc_id (str, required): Document identifier to list chunks from.
page (int, optional, default=1): Page number for pagination.
size (int, optional, default=30): Number of chunks per page.
keywords (str, optional): Keywords/question text to filter chunks.
available_int (int, optional): Filter chunks by availability status.

Returns: JSON containing total count, chunk details, and document metadata.

Key Implementation Details:

Retrieves tenant ID from document ID.
Uses settings.retrievaler.search to perform full-text and semantic search over chunks with optional highlighting.
Assembles chunk info including content, keywords, availability, positions, and associated image IDs.
Validates chunk position data structure.

Example Usage:
Request:

{
  "doc_id": "doc123",
  "page": 1,
  "size": 20,
  "keywords": "machine learning",
  "available_int": 1
}

Response:

{
  "total": 45,
  "chunks": [
    {
      "chunk_id": "chunk789",
      "content_with_weight": "...",
      "doc_id": "doc123",
      "docnm_kwd": "Document Title",
      "important_kwd": ["AI", "ML"],
      "question_kwd": ["what", "how"],
      "image_id": "img001",
      "available_int": 1,
      "positions": [[...]]
    },
    ...
  ],
  "doc": {...}
}
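A client-side helper for assembling the /list request body might look like the sketch below. It uses only the parameter names and defaults listed above. The helper names `build_list_request` and `page_offset` are hypothetical, and the assumption that pages are 1-based follows from the default of page=1.

```python
def build_list_request(doc_id, page=1, size=30, keywords=None, available_int=None):
    """Build a /list request body, omitting optional fields that are unset."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    body = {"doc_id": doc_id, "page": page, "size": size}
    if keywords:
        body["keywords"] = keywords
    if available_int is not None:
        body["available_int"] = int(available_int)
    return body


def page_offset(page, size):
    """Zero-based index of the first chunk on a page (assumes 1-based pages)."""
    return (page - 1) * size
```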

2. `get()`

Route: /get (GET)
Purpose: Retrieve detailed information about a single chunk by ID.
Parameters (Query):

chunk_id (str, required): Chunk identifier.

Returns: JSON chunk data with sensitive/internal vector fields removed.

Key Implementation Details:

Iterates over tenants of current user to find the chunk in their knowledge bases.
Deletes internal vector and token fields matching regex patterns before returning the chunk.
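The field stripping can be sketched as a single regex filter over the chunk's keys. The patterns below (`_vec$`, `_sm_`, `_tks`, `_ltks`) are an assumption about how internal vector and token fields are named; the document does not list the exact expressions, and `strip_internal_fields` is a hypothetical helper.

```python
import re

# Assumed patterns for internal fields: embedding vectors and tokenized text.
INTERNAL_FIELD = re.compile(r"(_vec$|_sm_|_tks|_ltks)")


def strip_internal_fields(chunk):
    """Return a copy of the chunk without fields whose names match INTERNAL_FIELD."""
    return {k: v for k, v in chunk.items() if not INTERNAL_FIELD.search(k)}
```

Returning a filtered copy, rather than deleting keys while iterating, avoids mutating the dictionary during traversal.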

3. `set()`

Route: /set (POST)
Purpose: Update chunk content and metadata.
Parameters (JSON):

doc_id (str, required): Document identifier.
chunk_id (str, required): Chunk identifier to update.
content_with_weight (str, required): Updated content text with weight annotations.
Optional: important_kwd (list), question_kwd (list), tag_kwd, tag_feas, available_int.

Returns: JSON success boolean.

Key Implementation Details:

Tokenizes content using rag_tokenizer.
For QA documents, splits content into question and answer parts, processes prefixes.
Encodes content and document name into vector embeddings using LLMBundle.
Combines embeddings with weighted averaging if not QA parser type.
Updates chunk document store via settings.docStoreConn.update.
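The embedding step can be sketched in plain Python as follows. The weights (0.1 for the document name, 0.9 for the content) and the choice to keep only the content-side vector for QA chunks are assumptions made for illustration; the document states only that a weighted average is used for non-QA parser types. The helper names are hypothetical.

```python
def combine_embeddings(name_vec, content_vec, is_qa, name_weight=0.1):
    """Blend document-name and content embeddings; QA chunks keep the content vector only."""
    if len(name_vec) != len(content_vec):
        raise ValueError("embeddings must share a dimension")
    if is_qa:
        # Assumed behavior for QA chunks: no blending with the document name.
        return list(content_vec)
    content_weight = 1.0 - name_weight
    return [name_weight * n + content_weight * c for n, c in zip(name_vec, content_vec)]


def vector_field_name(vec):
    """Field name of the form q_{dim}_vec, where dim is the vector length."""
    return f"q_{len(vec)}_vec"
```

Weighting the content more heavily than the title keeps a chunk's vector close to its own text while still pulling it slightly toward the document's overall topic.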

4. `switch()`

Route: /switch (POST)
Purpose: Batch update availability status (available_int) of multiple chunks.
Parameters (JSON):

chunk_ids (list of str, required): Chunk IDs to update.
available_int (int, required): New availability status.
doc_id (str, required): Document ID.

Returns: JSON success boolean.

5. `rm()`

Route: /rm (POST)
Purpose: Delete multiple chunks from storage and update document chunk counts.
Parameters (JSON):

chunk_ids (list of str, required): Chunk IDs to delete.
doc_id (str, required): Document ID.

Returns: JSON success boolean.

Key Implementation Details:

Deletes chunk records from document store.
Decrements chunk count on document metadata.
Removes chunk objects from underlying storage implementation (STORAGE_IMPL).

6. `create()`

Route: /create (POST)
Purpose: Create a new chunk from content and associate it with a document.
Parameters (JSON):

doc_id (str, required): Document ID.
content_with_weight (str, required): Content text with weights.
Optional: important_kwd (list), question_kwd (list), tag_feas.

Returns: JSON containing new chunk ID.

Key Implementation Details:

Generates chunk ID deterministically using xxhash on content+doc_id.
Tokenizes content, processes keywords, timestamps creation.
Retrieves document, tenant, knowledge base info.
Applies pagerank if available.
Encodes content and document name with embedding model and stores vector.
Inserts chunk into document store and increments chunk count.
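The deterministic ID step can be sketched as hashing the concatenation of content and document ID. The example uses `xxhash.xxh64` when the third-party xxhash package is installed and otherwise falls back to a 64-bit BLAKE2b digest. The fallback is a swapped-in hash purely so the sketch runs anywhere, and its IDs will not match those of the real module. `make_chunk_id` is a hypothetical name.

```python
import hashlib

try:
    import xxhash  # third-party; the hash the module is described as using

    def _hex64(data: bytes) -> str:
        return xxhash.xxh64(data).hexdigest()

except ImportError:
    # Stand-in when xxhash is not installed: a different 64-bit hash, so IDs
    # will NOT match those produced by the real module.
    def _hex64(data: bytes) -> str:
        return hashlib.blake2b(data, digest_size=8).hexdigest()


def make_chunk_id(content: str, doc_id: str) -> str:
    """Derive a deterministic 16-hex-character ID from content + doc_id."""
    return _hex64((content + doc_id).encode("utf-8"))
```

Because the inputs are concatenated without a separator, two different (content, doc_id) pairs that produce the same joined string map to the same ID, and creating the same content twice in one document yields the same ID both times.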

7. `retrieval_test()`

Route: /retrieval_test (POST)
Purpose: Test retrieval of chunks from knowledge base(s) for a given question with ranking and optional filters.
Parameters (JSON):

kb_id (list or str, required): Knowledge base IDs.
question (str, required): Query question text.
Optional: page, size, doc_ids, use_kg, top_k, cross_languages, search_id, rerank_id, keyword, similarity_threshold, vector_similarity_weight, highlight.

Returns: JSON retrieval results with ranked chunks and labels.

Key Implementation Details:

Validates user ownership of knowledge bases.
Applies metadata filtering based on search config (auto/manual).
Supports cross-language translation of question.
Uses embedding model for encoding.
Optionally applies reranking model.
Optionally extends results using knowledge graph retrieval.
Removes vector data from returned chunks for security.
Labels question with categories for ranking features.
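To make the roles of vector_similarity_weight, similarity_threshold, page, and size concrete, the sketch below blends a term-based score with a vector score, filters, sorts, and paginates. The linear blend, the field names `term_similarity`, `vector_similarity`, and `vector`, and the default values are assumptions for illustration, not the module's confirmed scoring formula.

```python
def rank_chunks(candidates, vector_similarity_weight=0.3, similarity_threshold=0.2, page=1, size=30):
    """Blend term and vector similarity, drop low scores, sort, and paginate."""
    w = vector_similarity_weight
    ranked = []
    for c in candidates:
        score = (1.0 - w) * c["term_similarity"] + w * c["vector_similarity"]
        if score >= similarity_threshold:
            # Copy without the raw vector so it is not returned to the caller.
            item = {k: v for k, v in c.items() if k != "vector"}
            item["similarity"] = score
            ranked.append(item)
    ranked.sort(key=lambda item: item["similarity"], reverse=True)
    start = (page - 1) * size
    return {"total": len(ranked), "chunks": ranked[start:start + size]}
```

A weight of 0 reduces this to pure keyword ranking and a weight of 1 to pure vector ranking, which is why the parameter is useful to tune in a retrieval test.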

8. `knowledge_graph()`

Route: /knowledge_graph (GET)
Purpose: Retrieve knowledge graph and mind map representations extracted from document chunks.
Parameters (Query):

doc_id (str, required): Document ID.

Returns: JSON object with graph and mind_map structures.

Key Implementation Details:

Searches chunks tagged with knowledge graph keywords "graph" or "mind_map".
Parses JSON content from chunk content fields.
Handles duplicate node IDs in mind_map recursively to ensure uniqueness.
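One way to make mind_map node IDs unique is a recursive walk that appends an occurrence counter to repeated IDs. The sketch below assumes each node is a dict with an "id" key and an optional "children" list. That structure, the "(n)" suffix format, and the name `dedupe_node_ids` are assumptions for illustration.

```python
def dedupe_node_ids(node, seen=None):
    """Rename repeated "id" values in place by appending an occurrence counter."""
    if seen is None:
        seen = {}
    node_id = node.get("id")
    if node_id is not None:
        if node_id in seen:
            # Second occurrence becomes "A(1)", third "A(2)", and so on.
            node["id"] = f"{node_id}({seen[node_id]})"
            seen[node_id] += 1
        else:
            seen[node_id] = 1
    for child in node.get("children") or []:
        dedupe_node_ids(child, seen)
    return node
```

The shared `seen` dictionary is what carries counts across branches, so a label repeated in two different subtrees is still renamed.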

Important Implementation Details & Algorithms

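Chunk Identification: Uses xxHash64 to generate deterministic chunk IDs from the content and document ID. The same content in the same document always yields the same ID, which makes IDs reproducible; as with any 64-bit hash, uniqueness is highly likely rather than guaranteed.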
Tokenization: Uses rag_tokenizer for multiple levels of tokenization: basic tokens, fine-grained tokens, and keyword tokens to support semantic search and keyword extraction.
Embedding & Vector Storage:
- Uses LLMBundle to encode document names and chunk content into vector embeddings for semantic retrieval.
- Embeddings from document name and content are combined with weighted averaging for non-QA document types.
- Vector embeddings stored under field names like q_{dim}_vec where dim is the embedding dimension.
Search & Retrieval:
- Integrates with a retrieval engine (settings.retrievaler) supporting full-text, semantic, and hybrid search with highlighting.
- Supports metadata filters dynamically generated via language model prompts or manually specified.
- Retrieval results are reranked optionally using a dedicated reranking model.
Knowledge Graph Extraction: Extracts structured knowledge graph or mind map JSON from specially tagged chunks. Handles duplicate node IDs to maintain graph integrity.
Security & Authorization:
- Enforces user login with login_required.
- Limits actions to tenants and knowledge bases owned by the current user.

Interaction with Other System Components

DocumentService: Fetches document metadata, tenant IDs, embedding model IDs, increments/decrements chunk counts.
KnowledgebaseService: Provides knowledge base IDs for tenants, knowledge base metadata (including pagerank).
SearchService & Retrievaler: Performs chunk search and retrieval with optional metadata filtering and reranking.
LLMBundle: Provides embedding and reranking models dynamically per tenant and knowledge base.
UserTenantService: Retrieves tenant information for the current user to enforce authorization.
Storage Layer (settings.docStoreConn & STORAGE_IMPL): Handles low-level CRUD operations for chunks and related storage objects.
RAG Utilities: Tokenizers, prompt generators, text cleaning utilities are used extensively for chunk processing.

Visual Diagram

classDiagram
    class ChunkApp {
        +list_chunk()
        +get()
        +set()
        +switch()
        +rm()
        +create()
        +retrieval_test()
        +knowledge_graph()
    }

    ChunkApp ..> DocumentService : uses
    ChunkApp ..> KnowledgebaseService : uses
    ChunkApp ..> SearchService : uses
    ChunkApp ..> LLMBundle : uses
    ChunkApp ..> UserTenantService : uses
    ChunkApp ..> rag_tokenizer : uses
    ChunkApp ..> settings.docStoreConn : uses
    ChunkApp ..> STORAGE_IMPL : uses

    class DocumentService {
        +get_tenant_id(doc_id)
        +get_by_id(doc_id)
        +get_embd_id(doc_id)
        +increment_chunk_num(...)
        +decrement_chunk_num(...)
    }

    class KnowledgebaseService {
        +get_kb_ids(tenant_id)
        +get_by_id(kb_id)
    }

    class SearchService {
        +get_detail(search_id)
    }

    class LLMBundle {
        +encode(texts)
    }

    class UserTenantService {
        +query(user_id)
    }

Summary

The chunk_app.py module provides the chunk-level management and retrieval APIs for documents in the InfiniFlow system. It relies on tokenization, embedding generation, and optional reranking models to support semantic search and knowledge extraction. It depends on the document and knowledge base services and restricts operations to the current user's tenants. Its operations range from chunk creation to retrieval testing and knowledge graph retrieval, which higher-level applications such as question answering and knowledge visualization build on.
