smoke.py

Overview

smoke.py is a utility script that processes a single document in a tenant's knowledge base on the InfiniFlow platform. It extracts a knowledge graph from the document's content, using language models and graph extraction utilities, and integrates the result into the existing graph. It runs as a standalone asynchronous program that:

Accepts command line arguments specifying the tenant and document IDs.
Retrieves the document and related metadata.
Processes chunks of the document content.
Uses large language models (LLMs) for chat and embedding functionalities.
Updates a knowledge graph representation using the GraphExtractor and update_graph function.
Outputs the resulting graph in a JSON node-link format.

This script is primarily useful for smoke testing the graph extraction and update pipeline against one document in a tenant-specific knowledge base.

Detailed Explanation

Imports and Initialization

Uses the standard-library modules argparse, json, and logging, plus two third-party libraries: trio (for async concurrency) and NetworkX (for graph serialization).
Imports InfiniFlow internal modules:
- settings to initialize global settings, including retrieval components.
- Database service classes (DocumentService, KnowledgebaseService, TenantService) to fetch tenant, document, and knowledgebase data.
- LLMBundle to instantiate large language model bundles for chat and embedding tasks.
- Graph-related modules: update_graph (to update the knowledge graph) and GraphExtractor (to extract graph from document content).
Calls settings.init_settings() to initialize necessary runtime configurations.

Functions and Methods

`callback(prog=None, msg="Processing...")`

Purpose: A simple logging callback function used to report progress or status messages during graph update.
Parameters:
- prog (optional): progress indicator (not utilized here).
- msg (str): message string to log. Defaults to "Processing...".
Returns: None.
Usage: Passed as a callback to update_graph for real-time logging.
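
A minimal sketch of a callback with this shape is shown below. The INFO log level and the basicConfig call are assumptions made for illustration; the description above only says that the message is logged and that prog is ignored.

```python
import logging

# Assumption: INFO level with a plain format; the script's real logging setup is not shown.
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")


def callback(prog=None, msg="Processing..."):
    """Log a status message. `prog` is accepted for compatibility but ignored."""
    logging.info(msg)


callback()                       # logs the default message
callback(0.5, msg="Extracting")  # `prog` is passed, but only `msg` is logged
```

Keeping prog in the signature lets the same function be handed to any caller that reports both a progress value and a message.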

`async def main()`

Purpose: The main asynchronous function orchestrating the retrieval, processing, and updating of the document graph.
Flow & Steps:
1. Parse command line arguments:
  - --tenant_id (required): Tenant identifier.
  - --doc_id (required): Document identifier.
2. Retrieve the document:
  - Calls DocumentService.get_by_id(doc_id).
  - Raises LookupError if the document is not found.
3. Retrieve knowledgebase ID from the document.
4. Retrieve document content chunks:
  - Uses settings.retrievaler.chunk_list() to fetch at most 6 chunks of the document, keeping the content_with_weight field of each.
5. Retrieve tenant information:
  - Fetch tenant by tenant_id using TenantService.get_by_id.
6. Retrieve the knowledgebase object by ID (its embedding ID is needed in the next step).
7. Initialize LLM bundles:
  - Chat LLM bundle with the tenant’s LLM ID.
  - Embedding LLM bundle with the knowledgebase's embedding ID.
8. Update the graph:
  - Calls update_graph() asynchronously, passing in:
    - GraphExtractor class.
    - Tenant ID, knowledgebase ID, document ID.
    - Document chunks.
    - Language code ("English").
    - The two LLM bundles.
    - The callback function for logging.
  - Receives the updated graph and a list of document IDs.
9. Output the graph:
  - Prints the graph as a JSON string in node-link format using NetworkX's node_link_data.
Parameters: None (arguments are parsed internally).
Returns: None.
Usage: Entry point executed via trio.run(main) when the script is run directly.
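
Steps 1 to 3 of the flow can be sketched with the standard library alone. This is an illustration under stated assumptions, not the script's code: FAKE_DOCUMENTS and find_document are hypothetical stand-ins for the document lookup, and the help strings are invented.

```python
import argparse


def build_parser():
    """Both arguments are required, as in the flow described above."""
    parser = argparse.ArgumentParser(description="Graph extraction smoke test")
    parser.add_argument("--tenant_id", required=True, help="Tenant identifier")
    parser.add_argument("--doc_id", required=True, help="Document identifier")
    return parser


# Hypothetical in-memory stand-in for the document table.
FAKE_DOCUMENTS = {"456": {"kb_id": "kb-1"}}


def find_document(doc_id):
    """Return the document record, or raise LookupError as in step 2."""
    doc = FAKE_DOCUMENTS.get(doc_id)
    if doc is None:
        raise LookupError("Document not found.")
    return doc


# An explicit argument list is passed here so the sketch runs without a shell.
args = build_parser().parse_args(["--tenant_id", "123", "--doc_id", "456"])
doc = find_document(args.doc_id)
kb_id = doc["kb_id"]  # step 3: the knowledgebase ID comes from the document
```

Failing early with LookupError means no chunks are fetched and no LLM bundles are built for a document that does not exist.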

Important Implementation Details

Asynchronous execution: main() runs under trio, and the update_graph() call is awaited so that I/O-bound work inside it can proceed without blocking.
Graph extraction and update:
- The core processing relies on GraphExtractor to convert document chunks into a graph representation.
- The update_graph function handles integrating this extracted data into existing knowledge graphs.
Chunk retrieval:
- Limits to 6 chunks to control processing size and resource usage.
LLM Bundles:
- Two separate LLM bundles are used:
  - One for chat (likely used to prompt the model for the entities and relations in each chunk).
  - One for embeddings (to generate vector representations for similarity or linking).
Logging: Uses Python's logging module to track progress or status messages.
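
The effect of the 6-chunk cap can be illustrated as follows. The record layout and the helper name first_chunk_texts are hypothetical; the real limit is presumably applied inside the retrieval layer, whose call signature is not shown in this document.

```python
from itertools import islice

MAX_CHUNKS = 6  # the cap described above


def first_chunk_texts(records, limit=MAX_CHUNKS):
    """Take at most `limit` records and keep only their text field."""
    return [r["content_with_weight"] for r in islice(records, limit)]


# Hypothetical records standing in for what the retrieval layer yields.
records = ({"content_with_weight": f"chunk {i}", "other": i} for i in range(10))
chunks = first_chunk_texts(records)  # only the first six texts survive
```

Because islice stops pulling from the generator after six records, the remaining four are never materialized.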

Interaction With Other Components

Database Services:
- Reads from the database service layer (DocumentService, TenantService, KnowledgebaseService) to fetch the document, tenant, and knowledgebase records.
Settings and Retrieval:
- Depends on settings.retrievaler to retrieve document content chunks.
Graph and NLP Modules:
- Uses graphrag.general.index.update_graph and graphrag.light.graph_extractor.GraphExtractor for knowledge graph construction and updating.
LLM Integration:
- Utilizes LLMBundle to interface with language models for chat and embedding tasks.
NetworkX:
- Uses NetworkX for graph data manipulation and serialization.
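
The way these components fit together can be sketched with stubs. Everything below is a hypothetical stand-in: the dictionaries replace the database services, FakeBundle replaces LLMBundle, fake_update_graph replaces update_graph, and asyncio is swapped in for trio so that the sketch needs only the standard library. The positional argument order follows the listing in this document and has not been checked against the real signature.

```python
import asyncio

# Hypothetical stand-ins for the database services.
DOCS = {"456": {"kb_id": "kb-1"}}
TENANTS = {"123": {"llm_id": "chat-model"}}
KBS = {"kb-1": {"embd_id": "embed-model"}}
CHUNKS = {"456": ["Alice works at Acme.", "Acme is based in Paris."]}


class FakeBundle:
    """Records which model a bundle would wrap; performs no LLM calls."""

    def __init__(self, tenant_id, llm_type, llm_id):
        self.tenant_id, self.llm_type, self.llm_id = tenant_id, llm_type, llm_id


async def fake_update_graph(extractor, tenant_id, kb_id, doc_id, chunks,
                            language, chat_bundle, embed_bundle, callback):
    """Pretend to extract a graph: one node per chunk, no edges."""
    callback(msg=f"processing {len(chunks)} chunks")
    await asyncio.sleep(0)  # yield once, as a real I/O call would
    return {"nodes": list(range(len(chunks))), "edges": []}, [doc_id]


async def run(tenant_id, doc_id, log):
    kb_id = DOCS[doc_id]["kb_id"]            # document -> knowledgebase ID
    chunks = CHUNKS[doc_id][:6]              # at most six chunks
    chat = FakeBundle(tenant_id, "chat", TENANTS[tenant_id]["llm_id"])
    embed = FakeBundle(tenant_id, "embedding", KBS[kb_id]["embd_id"])
    # `object` stands in for the GraphExtractor class.
    return await fake_update_graph(object, tenant_id, kb_id, doc_id, chunks,
                                   "English", chat, embed, log)


messages = []
graph, doc_ids = asyncio.run(
    run("123", "456", lambda prog=None, msg="": messages.append(msg))
)
```

The sketch makes the data dependencies visible: the knowledgebase ID comes from the document, the chat model from the tenant, and the embedding model from the knowledgebase.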

Usage Example

python smoke.py --tenant_id 123 --doc_id 456

This command processes the document with ID 456 belonging to tenant 123, updates its knowledge graph representation, and prints the resulting graph JSON to standard output.
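
To give an idea of what the printed output looks like, the sketch below builds a node-link style dictionary by hand and serializes it. The key names are modeled on NetworkX's node-link layout as generally documented; depending on the NetworkX version, the edge list may appear under "links" or "edges". The two-node graph is invented, and the attributes the real pipeline attaches to nodes and edges are not shown.

```python
import json


def to_node_link(nodes, edges, directed=False):
    """Build a node-link style dict from node ids and (source, target) pairs."""
    return {
        "directed": directed,
        "multigraph": False,
        "graph": {},
        "nodes": [{"id": n} for n in nodes],
        "links": [{"source": s, "target": t} for s, t in edges],
    }


# Hypothetical two-entity graph; real output would carry extracted entities.
data = to_node_link(["Alice", "Acme"], [("Alice", "Acme")])
text = json.dumps(data, ensure_ascii=False, indent=2)
print(text)
```

Because the format is plain JSON, the output can be redirected to a file and loaded by other graph or visualization tools.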

Mermaid Class Diagram

classDiagram
    class smoke {
        +callback(prog=None, msg="Processing...")
        +async main()
    }

    class DocumentService {
        +get_by_id(doc_id)
    }
    class TenantService {
        +get_by_id(tenant_id)
    }
    class KnowledgebaseService {
        +get_by_id(kb_id)
    }
    class LLMBundle {
        +__init__(tenant_id, llm_type, llm_id)
    }
    class GraphExtractor
    class update_graph {
        +__call__(...)
    }

    smoke ..> DocumentService : uses
    smoke ..> TenantService : uses
    smoke ..> KnowledgebaseService : uses
    smoke ..> LLMBundle : uses
    smoke ..> GraphExtractor : uses
    smoke ..> update_graph : calls

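This diagram shows the two functions defined in smoke.py and the service classes and functions that the script depends on. update_graph is a function; it is drawn as a class only because class diagrams have no separate node type for functions.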

Summary

smoke.py is a focused, asynchronous utility script designed to test and update the knowledge graph of a single document within a tenant's knowledge base using InfiniFlow's LLM and graph extraction infrastructure. It combines data retrieval, natural language processing, and graph update workflows, and outputs the final graph structure as JSON for further analysis or visualization.