smoke.py
Overview
smoke.py is a utility script designed to process a specific document within a tenant's knowledge base in the InfiniFlow platform. It extracts and updates a knowledge graph representation of the document’s content by leveraging language models and graph extraction utilities. The script is intended to be run as a standalone asynchronous application that:
Accepts command line arguments specifying the tenant and document IDs.
Retrieves the document and related metadata.
Processes chunks of the document content.
Uses large language models (LLMs) for chat and embedding functionalities.
Updates a knowledge graph representation using the GraphExtractor class and the update_graph function.
Outputs the resulting graph in a JSON node-link format.
This script is primarily useful for smoke testing the graph extraction and update pipeline against a single document in a tenant-specific knowledge base.
Detailed Explanation
Imports and Initialization
Uses the standard-library modules argparse, json, and logging, plus the third-party trio library (for async concurrency).
Imports InfiniFlow internal modules:
settings to initialize global settings, including retrieval components.
Database service classes (DocumentService, KnowledgebaseService, TenantService) to fetch tenant, document, and knowledgebase data.
LLMBundle to instantiate large language model bundles for chat and embedding tasks.
Graph-related modules: update_graph (to update the knowledge graph) and GraphExtractor (to extract a graph from document content).
Calls settings.init_settings() to initialize necessary runtime configurations.
Functions and Methods
callback(prog=None, msg="Processing...")
Purpose: A simple logging callback function used to report progress or status messages during graph update.
Parameters:
prog (optional): progress indicator (not utilized here).
msg (str): message string to log. Defaults to "Processing...".
Returns: None.
Usage: Passed as a callback to update_graph for real-time logging.
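A minimal sketch of such a callback is shown here. The INFO level and the use of the root logger are assumptions; the description above only states that the message is logged and that prog is unused.

```python
import logging


def callback(prog=None, msg="Processing..."):
    """Log a status message.

    `prog` is accepted so the function fits a progress-callback interface,
    but it is deliberately ignored, as described above.
    """
    # Assumption: INFO level on the root logger.
    logging.info(msg)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    callback()                      # logs the default message
    callback(0.5, "Extracting...")  # prog is ignored, only msg is logged
```

Accepting prog without using it keeps the signature compatible with callers that report both a progress value and a message.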
async def main()
Purpose: The main asynchronous function orchestrating the retrieval, processing, and updating of the document graph.
Flow & Steps:
Parse command line arguments:
--tenant_id (required): Tenant identifier.
--doc_id (required): Document identifier.
Retrieve the document:
Calls DocumentService.get_by_id(doc_id).
Raises LookupError if the document is not found.
Retrieve knowledgebase ID from the document.
Retrieve document content chunks:
Uses settings.retrievaler.chunk_list() to get up to 6 chunks of the document's content_with_weight field.
Retrieve tenant information:
Fetch tenant by tenant_id using TenantService.get_by_id.
Initialize LLM bundles:
Chat LLM bundle with tenant’s LLM ID.
Embedding LLM bundle with knowledgebase embedding ID.
Retrieve knowledgebase object by ID.
Update the graph:
Calls update_graph() asynchronously, passing in:
GraphExtractor class.
Tenant ID, knowledgebase ID, document ID.
Document chunks.
Language code ("English").
The two LLM bundles.
The callback function for logging.
Receives the updated graph and a list of document IDs.
Output the graph:
Prints the graph as a JSON string in node-link format using NetworkX's node_link_data.
Parameters: None (arguments are parsed internally).
Returns: None.
Usage: Entry point executed via trio.run(main) when the script is run directly.
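The first two steps of the flow above (argument parsing and the document lookup with its LookupError) can be sketched as follows. This is an illustration under stated assumptions: _FAKE_DOCUMENTS and load_document are hypothetical stand-ins for the database layer, and the error message text is invented. The real script goes through DocumentService.get_by_id, whose return shape is not described here.

```python
import argparse

# Hypothetical in-memory stand-in for the document store; the real script
# calls DocumentService.get_by_id instead.
_FAKE_DOCUMENTS = {"456": {"id": "456", "kb_id": "kb-1"}}


def parse_args(argv=None):
    """Parse the two required identifiers from the command line."""
    parser = argparse.ArgumentParser(
        description="Smoke-test graph extraction for one document."
    )
    parser.add_argument("--tenant_id", required=True, help="Tenant identifier.")
    parser.add_argument("--doc_id", required=True, help="Document identifier.")
    return parser.parse_args(argv)


def load_document(doc_id):
    """Return the document record, or raise LookupError if it is missing."""
    doc = _FAKE_DOCUMENTS.get(doc_id)
    if doc is None:
        raise LookupError(f"Document({doc_id}) not found.")
    return doc


if __name__ == "__main__":
    args = parse_args(["--tenant_id", "123", "--doc_id", "456"])
    doc = load_document(args.doc_id)
    # The knowledgebase ID is read from the document record.
    print(args.tenant_id, doc["kb_id"])
```

Marking both arguments as required lets argparse reject an incomplete invocation before any service is contacted.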
Important Implementation Details
Asynchronous execution: Utilizes trio for async concurrency, allowing non-blocking calls to potentially I/O-bound services.
Graph extraction and update:
The core processing relies on GraphExtractor to convert document chunks into a graph representation.
The update_graph function handles integrating this extracted data into existing knowledge graphs.
Chunk retrieval:
Limits to 6 chunks to control processing size and resource usage.
LLM Bundles:
Two separate LLM bundles are used:
One for chat interactions (likely to interpret or refine content).
One for embeddings (to generate vector representations for similarity or linking).
Logging: Uses Python's logging module to track progress or status messages.
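The six-chunk cap described under chunk retrieval can be illustrated with a small sketch. Everything here except the content_with_weight field name and the limit of 6 is hypothetical: fake_chunk_list stands in for a chunk source, and the real script delegates the limit to settings.retrievaler.chunk_list(), whose parameters are not shown in this document.

```python
from itertools import islice

MAX_CHUNKS = 6  # the cap described above


def fake_chunk_list(doc_id):
    """Hypothetical chunk source: yields one record per stored chunk."""
    for i in range(20):
        yield {"doc_id": doc_id, "content_with_weight": f"chunk {i} of {doc_id}"}


def first_chunks(doc_id, limit=MAX_CHUNKS):
    """Collect the text of at most `limit` chunks, stopping early."""
    records = islice(fake_chunk_list(doc_id), limit)
    return [record["content_with_weight"] for record in records]


if __name__ == "__main__":
    for text in first_chunks("456"):
        print(text)
```

Stopping after a fixed number of chunks keeps a smoke run short, since each chunk passed on to graph extraction typically costs at least one LLM call.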
Interaction With Other Components
Database Services:
Interacts extensively with database service layers (DocumentService, TenantService, KnowledgebaseService) to fetch necessary metadata.
Settings and Retrieval:
Depends on settings.retrievaler to retrieve document content chunks.
Graph and NLP Modules:
Uses graphrag.general.index.update_graph and graphrag.light.graph_extractor.GraphExtractor for knowledge graph construction and updating.
LLM Integration:
Utilizes LLMBundle to interface with language models for chat and embedding tasks.
NetworkX:
Uses NetworkX for graph data manipulation and serialization.
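To make the node-link output concrete, the sketch below builds the same general shape by hand with the standard library only. It is not the NetworkX implementation: to_node_link is a hypothetical helper, the sample entities are invented, and the top-level key names follow the layout node_link_data has conventionally produced (the key that holds edges can differ between NetworkX versions).

```python
import json


def to_node_link(nodes, edges):
    """Build a node-link style dict.

    `nodes` is a list of (id, attrs) pairs and `edges` is a list of
    (source, target, attrs) triples. Hand-rolled for illustration only.
    """
    return {
        "directed": False,
        "multigraph": False,
        "graph": {},
        "nodes": [{"id": node_id, **attrs} for node_id, attrs in nodes],
        "links": [
            {"source": src, "target": dst, **attrs} for src, dst, attrs in edges
        ],
    }


if __name__ == "__main__":
    # Hypothetical entities and relation, not taken from any real document.
    nodes = [
        ("ACME", {"entity_type": "organization"}),
        ("Alice", {"entity_type": "person"}),
    ]
    edges = [("Alice", "ACME", {"description": "works at"})]
    print(json.dumps(to_node_link(nodes, edges), ensure_ascii=False))
```

Because nodes and edges are plain lists of dictionaries, the printed JSON can be loaded back with json.loads and inspected or passed to a visualization tool without NetworkX installed.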
Usage Example
python smoke.py --tenant_id 123 --doc_id 456
This command processes the document with ID 456 belonging to tenant 123, updates its knowledge graph representation, and prints the resulting graph JSON to standard output.
Mermaid Class Diagram
classDiagram
class smoke {
+callback(prog=None, msg="Processing...")
+async main()
}
class DocumentService {
+get_by_id(doc_id)
}
class TenantService {
+get_by_id(tenant_id)
}
class KnowledgebaseService {
+get_by_id(kb_id)
}
class LLMBundle {
+__init__(tenant_id, llm_type, llm_id)
}
class GraphExtractor
class update_graph {
+__call__(...)
}
smoke ..> DocumentService : uses
smoke ..> TenantService : uses
smoke ..> KnowledgebaseService : uses
smoke ..> LLMBundle : uses
smoke ..> GraphExtractor : uses
smoke ..> update_graph : calls
This diagram shows the main functions in smoke.py and their relationships to key service classes and functions it utilizes.
Summary
smoke.py is a focused, asynchronous utility script designed to test and update the knowledge graph of a single document within a tenant's knowledge base using InfiniFlow's LLM and graph extraction infrastructure. It combines data retrieval, natural language processing, and graph update workflows, and outputs the final graph structure as JSON for further analysis or visualization.