smoke.py
Overview
smoke.py is a command-line utility script designed to process a specific document within a tenant's knowledge base in the InfiniFlow system. It extracts graph representations from document chunks, updates these graphs, and performs resolution and community detection analyses using large language model (LLM) services.
The script leverages asynchronous programming (via trio) to efficiently handle I/O-bound tasks such as fetching data and calling LLM APIs. The main output consists of JSON-serialized graph data and community detection results printed to the console, primarily for inspection or debugging purposes.
Detailed Description
Key Functionalities
Argument Parsing: Accepts tenant ID and document ID as mandatory CLI arguments.
Data Retrieval: Fetches document, tenant, and knowledge base information from the database services.
Chunk Extraction: Retrieves up to 6 weighted content chunks from the document.
LLM Setup: Initializes two LLM bundles — one for chat interactions and another for embeddings.
Graph Extraction and Update: Uses a GraphExtractor to generate and update a graph related to the document content.
Resolution and Community Detection: Invokes asynchronous functions to conduct resolution and community analysis on the graph data.
Output: Prints JSON representations of the graph and community structure, plus a community report.
Classes and Functions
callback(prog=None, msg="Processing...")
Purpose: A simple logging callback to report progress messages during processing.
Parameters:
prog (optional): Progress information (not used in this implementation).
msg (str): The message to log.
Returns: None.
Usage Example:
callback(msg="Step 1 complete")
Implementation Detail: Uses Python's standard logging to output info-level messages.
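A minimal sketch of such a callback, based only on the description above (an ignored prog argument and an info-level log of msg). The logging.basicConfig call is an assumption added so the messages are visible when the sketch runs on its own; the real script may configure logging differently.

```python
import logging

# Assumption: basicConfig is used here only so that INFO messages show up
# when this sketch runs standalone.
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")


def callback(prog=None, msg="Processing..."):
    """Log a progress message at INFO level; `prog` is accepted but ignored."""
    logging.info(msg)


callback(msg="Step 1 complete")
callback(0.5, "Halfway there")  # the progress value is not used
```

Accepting prog without using it lets the function be passed wherever a two-argument progress callback is expected.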
async def main()
Purpose: The asynchronous main entry point for the script. Orchestrates the entire flow: argument parsing, data fetching, graph updating, resolution, community detection, and output.
Parameters: None (arguments are parsed internally).
Returns: None.
Usage: Invoked by trio.run(main) when the script is executed.
Workflow Steps:
Parse CLI arguments to get tenant_id and doc_id.
Retrieve the specified document and its knowledge base ID.
Obtain up to 6 content chunks with weights for the document.
Fetch tenant information and instantiate LLM bundles for chat and embedding.
Retrieve knowledge base information for embedding LLM configuration.
Update the graph using update_graph, passing in the GraphExtractor class and relevant data.
Print the graph in JSON node-link format.
Run with_resolution to perform graph resolution tasks.
Run with_community to perform community detection and print the resulting community structure and reports.
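The node-link format mentioned in the steps above keeps nodes and edges in two flat lists. The sketch below builds that layout by hand with the standard library so it runs without networkx. The entity names, the description attribute, and the links key are illustrative assumptions; networkx's node_link_data helper emits a similar layout, and its exact key names depend on the installed version.

```python
import json

# Hypothetical entities and relations standing in for what graph extraction
# might produce; none of these names come from the script.
nodes = [
    {"id": "ACME", "description": "A company"},
    {"id": "Alice", "description": "An engineer"},
]
links = [
    {"source": "Alice", "target": "ACME", "description": "works at"},
]

# Node-link layout: graph-level flags plus flat lists of nodes and links.
graph_data = {
    "directed": False,
    "multigraph": False,
    "graph": {},
    "nodes": nodes,
    "links": links,
}

# ensure_ascii=False keeps non-ASCII entity names readable in the output.
text = json.dumps(graph_data, ensure_ascii=False, indent=2)
print(text)
```

Because the result is plain JSON, it can be piped into other tools or loaded back with json.loads for inspection.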
Error Handling: Raises LookupError if the document is not found.
Example Run:
python smoke.py -t tenant123 -d doc456
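The argument handling behind this command could look roughly like the following argparse sketch. The -t and -d flags come from the example run; the long option names, help strings, and the build_parser helper are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser with the two mandatory options from the example run."""
    parser = argparse.ArgumentParser(prog="smoke.py")
    # -t and -d appear in the example command; the long option names and
    # help strings are assumptions made for this sketch.
    parser.add_argument("-t", "--tenant_id", required=True, help="Tenant ID")
    parser.add_argument("-d", "--doc_id", required=True, help="Document ID")
    return parser


# Parse an explicit list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["-t", "tenant123", "-d", "doc456"])
print(args.tenant_id, args.doc_id)
```

With required=True, omitting either option makes argparse print a usage error and exit, which matches the description of both arguments as mandatory.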
Important Implementation Details and Algorithms
Graph Extraction and Updating:
The script uses update_graph with the GraphExtractor class, which likely analyzes document chunks to build or update a network graph representing relationships or knowledge extracted from the text. This graph is a networkx object serialized later for output.
LLM Bundles:
Two distinct LLM bundles are used:
Chat LLM (LLMType.CHAT): For conversational or context-aware processing.
Embedding LLM (LLMType.EMBEDDING): To generate vector representations of text chunks, facilitating similarity measures or clustering.
Asynchronous Calls:
The main function performs asynchronous calls to update_graph, with_resolution, and with_community. These likely involve network or compute-intensive operations that benefit from concurrency.
Community Detection:
The with_community function seems to perform community detection on the graph, returning both a community structure and reports summarizing these communities.
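The source does not say which algorithm with_community uses. As a generic illustration of what community detection means on a graph, the sketch below runs a small deterministic label propagation over a hypothetical six-node graph. It is a stand-in technique, not the script's method, and every name in it is made up for this example.

```python
from collections import Counter


def label_propagation(adjacency, max_rounds=20):
    """Group nodes by repeatedly adopting the most common neighbor label."""
    labels = {node: node for node in adjacency}  # every node starts alone
    for _ in range(max_rounds):  # the cap guards against endless flipping
        changed = False
        for node in sorted(adjacency):
            counts = Counter(labels[nbr] for nbr in adjacency[node])
            if not counts:
                continue  # an isolated node keeps its own label
            best = max(counts.values())
            # Deterministic tie-break: largest label among the most frequent.
            new_label = max(lbl for lbl, n in counts.items() if n == best)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    communities = {}
    for node, label in labels.items():
        communities.setdefault(label, set()).add(node)
    return sorted(sorted(group) for group in communities.values())


# Hypothetical graph: two triangles joined by the single edge c-d.
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
result = label_propagation(graph)
print(result)  # [['a', 'b', 'c'], ['d', 'e', 'f']]
```

Each triangle ends up as its own community because its nodes share more edges with each other than with the rest of the graph.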
Interaction with Other System Components
API Database Services (api.db.services.*):
DocumentService: Fetches document data by ID.
TenantService: Fetches tenant info, including LLM configuration.
KnowledgebaseService: Fetches knowledge base info related to documents.
LLM Services (LLMBundle):
Initializes and manages large language model clients for chat and embedding tasks.
Graph Processing (graphrag.general.*):
GraphExtractor: A class that likely extracts graph data from text chunks.
update_graph, with_resolution, with_community: Utility functions that update the graph and analyze it using community detection and resolution algorithms.
Settings (api.settings):
Provides configuration including the retrievaler instance used to fetch document chunks.
Logging:
Uses Python's logging module for progress reporting.
Visual Diagram
classDiagram
class smoke_py {
+callback(prog=None, msg="Processing...")
+main()
}
class DocumentService {
+get_by_id(doc_id)
}
class TenantService {
+get_by_id(tenant_id)
}
class KnowledgebaseService {
+get_by_id(kb_id)
}
class LLMBundle {
+__init__(tenant_id, llm_type, llm_id)
}
class GraphExtractor
class update_graph {
+__call__(...)
}
class with_resolution {
+__call__(...)
}
class with_community {
+__call__(...)
}
smoke_py --> DocumentService : calls get_by_id(doc_id)
smoke_py --> TenantService : calls get_by_id(tenant_id)
smoke_py --> KnowledgebaseService : calls get_by_id(kb_id)
smoke_py --> LLMBundle : initializes chat and embedding bundles
smoke_py --> GraphExtractor : passed as parameter to update_graph
smoke_py --> update_graph : updates graph with document data
smoke_py --> with_resolution : performs resolution on graph
smoke_py --> with_community : performs community detection
Summary
smoke.py is a diagnostic utility that builds a graph from the chunks of one document in a tenant's knowledge base and then analyzes that graph. It uses chat and embedding LLM bundles during graph extraction, resolution, and community detection, and prints the resulting graph, community structure, and community report as JSON for inspection or debugging.
Notes
The script expects a properly configured environment with access to the InfiniFlow backend services and API.
It uses asynchronous programming to maximize responsiveness during LLM API calls and database queries.
Output is primarily via standard output with JSON serialization, suitable for piping into other tools or manual inspection.