chunk_app.py

Overview

chunk_app.py is a Flask-based API module that manages "chunks" of document content within the InfiniFlow platform. A chunk is a small semantic segment extracted from a document stored in a knowledge base. The file provides REST endpoints to create, update, retrieve, delete, and search chunks. It also supports toggling chunk availability, running retrieval tests with ranking and filtering, and extracting knowledge graph structures.

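This module integrates with the services and components shown in the class diagram below: DocumentService, KnowledgebaseService, SearchService, LLMBundle, UserTenantService, rag_tokenizer, settings.docStoreConn, and STORAGE_IMPL.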

The module applies tokenization, embedding, and ranking techniques to efficiently store and retrieve document chunks, enabling downstream question answering, knowledge graph visualization, and semantic search workflows.


Detailed Explanation of Endpoints and Functions

All routes are registered under the manager Flask blueprint (assumed imported elsewhere). Each endpoint requires user authentication via flask_login and validates request parameters using a custom validate_request decorator.
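To make the request validation step concrete, here is a minimal standalone sketch of a decorator that rejects payloads with missing keys. It is not the module's actual validate_request: the handler signature (a plain dict instead of a Flask request), the error envelope, and the code 101 are all assumptions chosen for illustration.

```python
from functools import wraps


def validate_request(*required):
    """Hypothetical analogue: reject payloads that lack any required key."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(payload):
            missing = [key for key in required if key not in payload]
            if missing:
                # The error shape and the code 101 are assumptions.
                return {
                    "code": 101,
                    "message": "required argument(s) missing: " + ", ".join(missing),
                }
            return handler(payload)
        return wrapper
    return decorator


@validate_request("doc_id")
def list_chunk(payload):
    # Stand-in handler; the real endpoint reads the Flask request body.
    return {"code": 0, "data": {"doc_id": payload["doc_id"]}}
```

Calling list_chunk({}) in this sketch returns the error envelope without ever running the handler body, which is the point of validating up front.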


1. list_chunk()

Route: /list (POST)
Purpose: List chunks of a specified document with optional keyword filtering and pagination.
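Parameters (JSON): doc_id (the document whose chunks are listed), page and size (pagination), keywords (optional filter), and available_int (availability filter), as in the example request below.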

Returns: JSON containing total count, chunk details, and document metadata.

Key Implementation Details:

Example Usage:
Request:

{
  "doc_id": "doc123",
  "page": 1,
  "size": 20,
  "keywords": "machine learning",
  "available_int": 1
}

Response:

{
  "total": 45,
  "chunks": [
    {
      "chunk_id": "chunk789",
      "content_with_weight": "...",
      "doc_id": "doc123",
      "docnm_kwd": "Document Title",
      "important_kwd": ["AI", "ML"],
      "question_kwd": ["what", "how"],
      "image_id": "img001",
      "available_int": 1,
      "positions": [[...]]
    },
    ...
  ],
  "doc": {...}
}
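The listing behaviour described above (filter by document, optional keyword, optional availability, then paginate) can be sketched over an in-memory list. This is an illustration only: the real endpoint queries the document store, and the helper name list_chunks, the substring match, and the defaults are assumptions.

```python
def list_chunks(chunks, doc_id, page=1, size=20, keywords="", available_int=None):
    """Filter chunks for one document, then return one page plus the total."""
    matches = [
        c for c in chunks
        if c["doc_id"] == doc_id
        # Naive case-insensitive substring match stands in for full-text search.
        and (not keywords or keywords.lower() in c["content_with_weight"].lower())
        and (available_int is None or c["available_int"] == available_int)
    ]
    start = (page - 1) * size  # pages are 1-indexed, as in the example request
    return {"total": len(matches), "chunks": matches[start:start + size]}
```

Note that total counts every match, not just the returned page, which mirrors the shape of the example response.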

2. get()

Route: /get (GET)
Purpose: Retrieve detailed information about a single chunk by ID.
Parameters (Query):

Returns: JSON chunk data with sensitive/internal vector fields removed.
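A minimal sketch of removing internal vector fields before returning a chunk follows. The naming convention (keys ending in _vec) and the helper name strip_internal_fields are assumptions; the source says only that vector fields are removed, not how they are identified.

```python
import re

# Assumed convention: embedding fields end in "_vec", e.g. "q_1024_vec".
INTERNAL_FIELD = re.compile(r"_vec$")


def strip_internal_fields(chunk):
    """Return a copy of the chunk without keys that look like vector fields."""
    return {k: v for k, v in chunk.items() if not INTERNAL_FIELD.search(k)}
```

Building a new dict rather than deleting keys in place avoids mutating the record fetched from storage.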

Key Implementation Details:


3. set()

Route: /set (POST)
Purpose: Update chunk content and metadata.
Parameters (JSON):

Returns: JSON success boolean.

Key Implementation Details:


4. switch()

Route: /switch (POST)
Purpose: Batch update availability status (available_int) of multiple chunks.
Parameters (JSON):

Returns: JSON success boolean.
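The batch update can be pictured with the following sketch over an in-memory dict keyed by chunk ID. The helper name switch_availability and the all-or-nothing behaviour on unknown IDs are assumptions; the real endpoint updates the document store.

```python
def switch_availability(store, chunk_ids, available_int):
    """Set available_int on every listed chunk; fail if any ID is unknown."""
    if any(cid not in store for cid in chunk_ids):
        return False  # assumed policy: do not apply a partial update
    for cid in chunk_ids:
        store[cid]["available_int"] = int(available_int)
    return True
```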


5. rm()

Route: /rm (POST)
Purpose: Delete multiple chunks from storage and update document chunk counts.
Parameters (JSON):

Returns: JSON success boolean.
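Deleting chunks and keeping the document's chunk count in sync might look like the sketch below. It uses plain dicts in place of the document store and DocumentService; the helper name remove_chunks and the chunk_num field are assumptions for illustration.

```python
def remove_chunks(store, doc, chunk_ids):
    """Delete the listed chunks of one document and decrement its chunk count."""
    deleted = 0
    for cid in chunk_ids:
        chunk = store.get(cid)
        # Only delete chunks that exist and belong to this document.
        if chunk is not None and chunk["doc_id"] == doc["id"]:
            del store[cid]
            deleted += 1
    # Decrement by the number actually deleted, never below zero.
    doc["chunk_num"] = max(0, doc["chunk_num"] - deleted)
    return deleted > 0
```

Decrementing by the number of chunks actually removed, rather than by the number requested, keeps the count correct when some IDs are stale.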

Key Implementation Details:


6. create()

Route: /create (POST)
Purpose: Create a new chunk from content and associate it with a document.
Parameters (JSON):

Returns: JSON containing new chunk ID.
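One common way to produce a chunk ID is to hash the content together with the document ID, so that identical content in the same document maps to the same ID. The sketch below assumes that approach and uses SHA-256 from the standard library; the hash function, the field names, and the helper make_chunk are assumptions, not the module's confirmed method.

```python
import hashlib


def make_chunk(doc, content, important_kwd=None):
    """Build a chunk record with a deterministic ID and bump the doc's count."""
    digest = hashlib.sha256((content + doc["id"]).encode("utf-8")).hexdigest()
    chunk = {
        "chunk_id": digest[:32],  # truncated for readability (assumption)
        "doc_id": doc["id"],
        "content_with_weight": content,
        "important_kwd": list(important_kwd or []),
        "available_int": 1,
    }
    doc["chunk_num"] += 1  # stands in for DocumentService.increment_chunk_num
    return chunk
```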

Key Implementation Details:


7. retrieval_test()

Route: /retrieval_test (POST)
Purpose: Test retrieval of chunks from knowledge base(s) for a given question with ranking and optional filters.
Parameters (JSON):

Returns: JSON retrieval results with ranked chunks and labels.
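As a rough picture of ranking with a filter, the sketch below scores chunks by cosine similarity to a question embedding, drops those under a threshold, and keeps the best few. It is a simplification: the parameter names similarity_threshold and top_k, the vec field, and the use of pure vector similarity (with no keyword weighting or reranking model) are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_chunks(question_vec, chunks, similarity_threshold=0.2, top_k=3):
    """Return (score, chunk_id) pairs above the threshold, best first."""
    scored = [(cosine(question_vec, c["vec"]), c["chunk_id"]) for c in chunks]
    kept = [pair for pair in scored if pair[0] >= similarity_threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:top_k]
```

In a real deployment the question vector would come from an embedding model such as the one wrapped by LLMBundle.encode; here it is passed in directly so the example stays self-contained.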

Key Implementation Details:


8. knowledge_graph()

Route: /knowledge_graph (GET)
Purpose: Retrieve knowledge graph and mind map representations extracted from document chunks.
Parameters (Query):

Returns: JSON object with graph and mind_map structures.

Key Implementation Details:


Important Implementation Details & Algorithms


Interaction with Other System Components


Visual Diagram

classDiagram
    class ChunkApp {
        +list_chunk()
        +get()
        +set()
        +switch()
        +rm()
        +create()
        +retrieval_test()
        +knowledge_graph()
    }

    ChunkApp ..> DocumentService : uses
    ChunkApp ..> KnowledgebaseService : uses
    ChunkApp ..> SearchService : uses
    ChunkApp ..> LLMBundle : uses
    ChunkApp ..> UserTenantService : uses
    ChunkApp ..> rag_tokenizer : uses
    ChunkApp ..> settings.docStoreConn : uses
    ChunkApp ..> STORAGE_IMPL : uses

    class DocumentService {
        +get_tenant_id(doc_id)
        +get_by_id(doc_id)
        +get_embd_id(doc_id)
        +increment_chunk_num(...)
        +decrement_chunk_num(...)
    }

    class KnowledgebaseService {
        +get_kb_ids(tenant_id)
        +get_by_id(kb_id)
    }

    class SearchService {
        +get_detail(search_id)
    }

    class LLMBundle {
        +encode(texts)
    }

    class UserTenantService {
        +query(user_id)
    }

Summary

The chunk_app.py module provides the chunk-level management and retrieval APIs for documents in the InfiniFlow system. It uses tokenization, embedding generation, and reranking models to support semantic search and knowledge extraction. It relies on the document and knowledge base services to scope operations to the correct tenant. Its endpoints cover chunk creation, update, deletion, availability toggling, retrieval testing, and knowledge graph retrieval, which higher-level features such as question answering and knowledge visualization build on.