smoke.py

Overview

smoke.py is a command-line utility that processes a single document in a tenant's knowledge base within the InfiniFlow system. It extracts a graph from the document's chunks, updates that graph, and then runs resolution and community detection on it using large language model (LLM) services.

The script leverages asynchronous programming (via trio) to efficiently handle I/O-bound tasks such as fetching data and calling LLM APIs. The main output consists of JSON-serialized graph data and community detection results printed to the console, primarily for inspection or debugging purposes.

Detailed Description

Key Functionalities

- Argument Parsing: Accepts tenant ID and document ID as mandatory CLI arguments.
- Data Retrieval: Fetches document, tenant, and knowledge base information from the database services.
- Chunk Extraction: Retrieves up to 6 weighted content chunks from the document.
- LLM Setup: Initializes two LLM bundles, one for chat interactions and another for embeddings.
- Graph Extraction and Update: Uses a GraphExtractor to generate and update a graph related to the document content.
- Resolution and Community Detection: Invokes asynchronous functions to conduct resolution and community analysis on the graph data.
- Output: Prints JSON representations of the graph and community structure, plus a community report.
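
The chunk-extraction step can be sketched with a stand-in retriever. `fake_chunk_list` and the `content_with_weight` field name are hypothetical, chosen only for illustration; the cap of 6 is the one detail taken from the description above.

```python
from itertools import islice

MAX_CHUNKS = 6  # cap stated in the description


def fake_chunk_list(doc_id):
    """Hypothetical retriever yielding chunk records for a document."""
    for i in range(10):
        yield {"content_with_weight": f"{doc_id} chunk {i}"}


def first_chunks(doc_id, max_count=MAX_CHUNKS):
    # Keep only the text field of at most `max_count` chunks.
    records = islice(fake_chunk_list(doc_id), max_count)
    return [record["content_with_weight"] for record in records]


chunks = first_chunks("doc456")
print(len(chunks))  # 6
```

Capping the chunk count keeps a smoke run short, since each chunk may trigger LLM calls further down the pipeline.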

Classes and Functions

`callback(prog=None, msg="Processing...")`

Purpose: A simple logging callback to report progress messages during processing.
Parameters:
- prog (optional): Progress information (not used in this implementation).
- msg (str): The message to log.
Returns: None.
Usage Example:
```
callback(msg="Step 1 complete")
```
Implementation Detail: Uses Python's standard logging to output info-level messages.
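
A minimal sketch of such a callback, assuming the standard `logging` module. The `basicConfig` call is an assumption added so the demo prints something; the script may configure logging elsewhere.

```python
import logging

logging.basicConfig(level=logging.INFO)  # assumption: demo-only configuration


def callback(prog=None, msg="Processing..."):
    """Log a progress message; `prog` is accepted but ignored."""
    logging.info(msg)


callback(msg="Step 1 complete")
callback(0.5, "Halfway there")  # the progress value is not used
```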

`async def main()`

Purpose: The asynchronous main entry point for the script. Orchestrates the entire flow: argument parsing, data fetching, graph updating, resolution, community detection, and output.
Parameters: None (arguments are parsed internally).
Returns: None.
Usage: Invoked by trio.run(main) when the script is executed.
Workflow Steps:
1. Parse CLI arguments to get tenant_id and doc_id.
2. Retrieve the specified document and its knowledge base ID.
3. Obtain up to 6 content chunks with weights for the document.
4. Fetch tenant information and instantiate LLM bundles for chat and embedding.
5. Retrieve knowledge base information for embedding LLM configuration.
6. Update the graph using update_graph, passing in the GraphExtractor class and relevant data.
7. Print the graph in JSON node-link format.
8. Run with_resolution to perform graph resolution tasks.
9. Run with_community to perform community detection and print the resulting community structure and reports.
Error Handling: Raises LookupError if the document is not found.
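
The lookup failure can be sketched as follows. `StubDocumentService` is a hypothetical stand-in for `DocumentService`, and the `(found, record)` return shape is an assumption made for illustration, not a confirmed signature.

```python
class StubDocumentService:
    """Hypothetical stand-in for DocumentService (not the real class)."""

    _docs = {"doc456": {"kb_id": "kb789"}}

    @classmethod
    def get_by_id(cls, doc_id):
        # Assumed return shape: (found flag, record or None).
        doc = cls._docs.get(doc_id)
        return doc is not None, doc


def load_doc(doc_id):
    """Return the document record or raise LookupError if it is missing."""
    found, doc = StubDocumentService.get_by_id(doc_id)
    if not found:
        raise LookupError("Document not found.")
    return doc


print(load_doc("doc456")["kb_id"])
```

Failing early here means no LLM calls are made for a document ID that does not exist.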
Example Run:
```
python smoke.py -t tenant123 -d doc456
```
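
The invocation above implies two required options. A minimal parsing sketch with `argparse` follows; the short flags come from the example run, while the long names `--tenant_id` and `--doc_id` and the help strings are assumptions.

```python
import argparse


def build_parser():
    """Build a parser with two mandatory options, mirroring the example run."""
    parser = argparse.ArgumentParser(description="Graph smoke test for one document")
    parser.add_argument("-t", "--tenant_id", required=True, help="Tenant ID")
    parser.add_argument("-d", "--doc_id", required=True, help="Document ID")
    return parser


# Passing an explicit list keeps the sketch independent of sys.argv.
args = build_parser().parse_args(["-t", "tenant123", "-d", "doc456"])
print(args.tenant_id, args.doc_id)
```

With `required=True`, omitting either option makes `argparse` print a usage message and exit before any service is contacted.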

Important Implementation Details and Algorithms

Graph Extraction and Updating:
The script uses update_graph with the GraphExtractor class, which likely analyzes document chunks to build or update a network graph representing relationships or knowledge extracted from the text. This graph is a networkx object serialized later for output.
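
For orientation, node-link JSON stores a graph as a list of nodes plus a list of links. The sketch below builds that layout by hand from a hypothetical edge list using only the standard library. It illustrates the shape of the output and is not networkx's own serializer, which emits additional metadata alongside these two lists.

```python
import json


def to_node_link(edges):
    """Build a node-link style dict from (source, target, attrs) triples."""
    nodes = []
    for source, target, _ in edges:
        for name in (source, target):
            if name not in nodes:
                nodes.append(name)  # preserve first-seen order
    return {
        "nodes": [{"id": name} for name in nodes],
        "links": [{"source": s, "target": t, **attrs} for s, t, attrs in edges],
    }


# Hypothetical entities and relations, for illustration only.
edges = [
    ("Alice", "Acme", {"description": "works at"}),
    ("Acme", "Berlin", {"description": "based in"}),
]
data = to_node_link(edges)
print(json.dumps(data, ensure_ascii=False, indent=2))
```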
LLM Bundles:
Two distinct LLM bundles are used:
- Chat LLM (LLMType.CHAT): For conversational or context-aware processing.
- Embedding LLM (LLMType.EMBEDDING): To generate vector representations of text chunks, facilitating similarity measures or clustering.
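
How the two bundles might be configured can be sketched with stand-in types. `StubBundle`, the enum values, and the `llm_id` and `embd_id` record fields are hypothetical; the assumption that the chat model comes from the tenant record and the embedding model from the knowledge base follows the workflow description but is not confirmed here.

```python
from dataclasses import dataclass
from enum import Enum


class LLMType(Enum):
    CHAT = "chat"            # assumed value
    EMBEDDING = "embedding"  # assumed value


@dataclass
class StubBundle:
    """Hypothetical stand-in for LLMBundle; holds configuration only."""
    tenant_id: str
    llm_type: LLMType
    llm_id: str


def make_bundles(tenant_id, tenant, kb):
    # Chat model id from the tenant record, embedding model id from the knowledge base.
    chat = StubBundle(tenant_id, LLMType.CHAT, tenant["llm_id"])
    embed = StubBundle(tenant_id, LLMType.EMBEDDING, kb["embd_id"])
    return chat, embed


chat, embed = make_bundles(
    "tenant123", {"llm_id": "chat-model"}, {"embd_id": "embed-model"}
)
print(chat.llm_type.name, embed.llm_type.name)
```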
Asynchronous Calls:
The main function performs asynchronous calls to update_graph, with_resolution, and with_community. These likely involve network or compute-intensive operations that benefit from concurrency.
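
The ordering of the three awaited calls can be sketched with stub coroutines. The stubs and their simplified signatures are hypothetical, and `asyncio` is used only so the sketch runs on the standard library; the script itself is driven by `trio.run(main)`.

```python
import asyncio

calls = []  # records the order in which the stubs run


async def update_graph(doc_id, chunks):
    calls.append("update_graph")
    await asyncio.sleep(0)  # placeholder for LLM and storage I/O
    return {"nodes": chunks}


async def with_resolution(doc_id):
    calls.append("with_resolution")
    await asyncio.sleep(0)


async def with_community(doc_id):
    calls.append("with_community")
    await asyncio.sleep(0)
    return {"0": ["a", "b"]}, ["report for community 0"]


async def main():
    # Each step is awaited to completion before the next one starts.
    graph = await update_graph("doc456", ["a", "b"])
    await with_resolution("doc456")
    structure, reports = await with_community("doc456")
    return graph, structure, reports


graph, structure, reports = asyncio.run(main())
print(calls)
```

The steps run one after another because each depends on the graph state left by the previous one; any concurrency would happen inside the individual functions.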
Community Detection:
The with_community function seems to perform community detection on the graph, returning both a community structure and reports summarizing these communities.
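
The description does not name the community algorithm. As a rough illustration of the two return values, the sketch below uses connected components as a deliberately simple stand-in partition and produces one summary line per group. The function name, data shapes, and report text are all hypothetical.

```python
from collections import deque


def toy_communities(nodes, edges):
    """Partition nodes by connected component and summarize each group."""
    neighbors = {n: set() for n in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    structure, seen = {}, set()
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        seen.add(start)
        queue, members = deque([start]), []
        while queue:
            node = queue.popleft()
            members.append(node)
            for nxt in sorted(neighbors[node]):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        structure[str(len(structure))] = sorted(members)

    reports = [f"Community {cid}: {', '.join(m)}" for cid, m in structure.items()]
    return structure, reports


structure, reports = toy_communities(
    ["Alice", "Acme", "Berlin", "Zed"], [("Alice", "Acme"), ("Acme", "Berlin")]
)
print(structure)
```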

Interaction with Other System Components

API Database Services (api.db.services.*):
- DocumentService: Fetches document data by ID.
- TenantService: Fetches tenant info, including LLM configuration.
- KnowledgebaseService: Fetches knowledge base info related to documents.
LLM Services (LLMBundle):
Initializes and manages large language model clients for chat and embedding tasks.
Graph Processing (graphrag.general.*):
- GraphExtractor: A class that likely extracts graph data from text chunks.
- update_graph, with_resolution, with_community: Utility functions that update the graph and analyze it using community detection and resolution algorithms.
Settings (api.settings):
Provides configuration including the retrievaler instance used to fetch document chunks.
Logging:
Uses Python's logging module for progress reporting.

Visual Diagram

classDiagram
    class smoke_py {
        +callback(prog=None, msg="Processing...")
        +main()
    }

    class DocumentService {
        +get_by_id(doc_id)
    }

    class TenantService {
        +get_by_id(tenant_id)
    }

    class KnowledgebaseService {
        +get_by_id(kb_id)
    }

    class LLMBundle {
        +__init__(tenant_id, llm_type, llm_id)
    }

    class GraphExtractor

    class update_graph {
        +__call__(...)
    }

    class with_resolution {
        +__call__(...)
    }

    class with_community {
        +__call__(...)
    }

    smoke_py --> DocumentService : calls get_by_id(doc_id)
    smoke_py --> TenantService : calls get_by_id(tenant_id)
    smoke_py --> KnowledgebaseService : calls get_by_id(kb_id)
    smoke_py --> LLMBundle : initializes chat and embedding bundles
    smoke_py --> GraphExtractor : passed as parameter to update_graph
    smoke_py --> update_graph : updates graph with document data
    smoke_py --> with_resolution : performs resolution on graph
    smoke_py --> with_community : performs community detection

Summary

smoke.py is a diagnostic utility script that ties several system components together to analyze one document from a tenant's knowledge base. It builds a graph from the document's chunks, uses the chat and embedding LLM bundles to update and resolve that graph, detects community structures in it, and prints the results as JSON for inspection or debugging.

Notes

The script expects a properly configured environment with access to the InfiniFlow backend services and API.
It runs under trio and awaits update_graph, with_resolution, and with_community in turn, so time spent waiting on LLM API calls inside those steps is handled asynchronously.
Output is primarily via standard output with JSON serialization, suitable for piping into other tools or manual inspection.
