raptor.py

Overview

raptor.py implements a recursive abstractive summarization and retrieval mechanism designed to organize and condense large collections of text chunks using hierarchical clustering. The main goal of this file is to process input text chunks by embedding them, clustering semantically similar chunks, and recursively summarizing these clusters into higher-level abstractions. This approach facilitates efficient retrieval of relevant information organized in tree-like layered clusters, suitable for downstream applications such as knowledge retrieval or question answering.

Key features include:

Embedding text chunks via a configurable embedding model with caching.
Dimensionality reduction using UMAP to improve clustering quality.
Gaussian Mixture Model (GMM) based clustering with automatic cluster number selection via Bayesian Information Criterion (BIC).
Recursive summarization of clusters using a large language model (LLM) with prompt templating, rate limiting, and caching.
Asynchronous operations leveraging trio concurrency for efficient I/O and parallel summarization.
Timeout decorators to prevent long-running operations.

This file primarily encapsulates the class RecursiveAbstractiveProcessing4TreeOrganizedRetrieval, which performs the entire pipeline when called with input text chunks.

Class: RecursiveAbstractiveProcessing4TreeOrganizedRetrieval

Purpose

This class orchestrates recursive abstractive summarization of a set of text chunks organized in layers via clustering. It clusters chunk embeddings, summarizes clusters into new chunks, and repeats until a hierarchical tree structure of summarized content is formed.

Constructor

def __init__(
    self, max_cluster, llm_model, embd_model, prompt, max_token=512, threshold=0.1
)

Parameters

max_cluster (int): Maximum number of clusters allowed at each recursion level.
llm_model (object): Language model instance used for generating abstractive summaries. Must expose .chat() method and .llm_name property.
embd_model (object): Embedding model instance used for encoding text chunks. Must expose .encode() method and .llm_name property.
prompt (str): Prompt template string with {cluster_content} placeholder for cluster texts to be summarized.
max_token (int, optional): Maximum token limit for LLM-generated summary (default: 512).
threshold (float, optional): Probability threshold for assigning cluster labels from Gaussian Mixture Model soft predictions (default: 0.1).

Internal Methods

`_chat(self, system, history, gen_conf)`

Asynchronously performs an LLM chat request with caching and error handling.

Parameters:
- system (str): System prompt or context.
- history (list of dict): Conversation history messages, each dict with "role" and "content".
- gen_conf (dict): Generation configuration parameters, e.g., max tokens.
Returns:
- str: LLM-generated response text.
Behavior:
Attempts to retrieve cached LLM output. If not cached, calls the LLM's .chat() method, cleans up response by removing unwanted prefix up to </think>, checks for error strings, caches the response, then returns it.
Timeout: 20 minutes.

`_embedding_encode(self, txt)`

Asynchronously obtains or computes the embedding vector for a text string with caching.

Parameters:
- txt (str): Input text to embed.
Returns:
- np.ndarray: Embedding vector for the text.
Behavior:
Checks cache for embedding; if missing, encodes via embedding model .encode(), validates output, caches result, and returns the embedding.
Timeout: 20 seconds.

`_get_optimal_clusters(self, embeddings: np.ndarray, random_state: int)`

Determines the optimal number of clusters using Gaussian Mixture Model and Bayesian Information Criterion (BIC).

Parameters:
- embeddings (np.ndarray): Array of embedding vectors to cluster.
- random_state (int): Seed for reproducibility.
Returns:
- int: Optimal number of clusters minimizing BIC.
Algorithm:
Iterates over cluster counts from 1 to max_cluster (or number of embeddings), fits GMM, computes BIC, and selects cluster count with minimum BIC.

Main Callable Method

`call(self, chunks, random_state, callback=None)`

This is the main entry point of the class. It recursively clusters and summarizes input text chunks.

Parameters:
- chunks (list of tuples): Each element is (text: str, embedding: np.ndarray) or (text: str, any) where the second element may be embedding or other metadata.
- random_state (int): Seed for clustering algorithms to ensure deterministic results.
- callback (callable, optional): Function called with progress messages during clustering and summarization.
Returns:
- list of (text, embedding) tuples: Original chunks plus recursively generated summary chunks appended at the end, representing hierarchical layers.
Process:
1. Filters out empty or invalid chunks.
2. Initializes layering indices.
3. Defines an inner async function summarize(ck_idx) that:
  - Extracts texts of cluster indices.
  - Truncates texts to fit token limits.
  - Generates abstractive summary via _chat.
  - Embeds the summary.
  - Appends the summary chunk to the chunks list.
4. While more than one chunk remains in the current layer:
  - Extract embeddings for current chunk slice.
  - If only 2 chunks, directly summarize pair.
  - Otherwise, perform dimensionality reduction with UMAP.
  - Compute optimal cluster count with _get_optimal_clusters.
  - Fit GMM for clustering; assign each point a cluster label based on thresholded membership probabilities.
  - Summarize each cluster asynchronously.
  - Update layers and labels.
  - Invoke callback with progress messages if provided.
5. Return the full list of chunks including newly generated summaries.

Implementation Details and Algorithms

Caching: Both LLM calls and embedding computations use caching (get_llm_cache, set_llm_cache, get_embed_cache, set_embed_cache) to reduce redundant computation and improve efficiency.
Timeouts: Decorators ensure that embedding and chat functions do not hang indefinitely.
Clustering:
- Uses dimensionality reduction (UMAP) to project embeddings to a lower-dimensional space optimized for cosine similarity.
- Automatically selects cluster number with BIC applied to Gaussian Mixture Models.
- Soft clustering labels assigned by thresholding posterior probabilities.
Recursive Summarization:
Summarizes clusters into higher-level chunks, gradually building a tree-like hierarchy.
Concurrency:
Uses trio for asynchronous concurrency, allowing parallel summarization of clusters.
Text Truncation:
Ensures input texts fit within LLM token limits by truncating based on estimated token count.
Error Handling:
Checks for error strings in LLM output and raises exceptions accordingly.

Interactions with Other Parts of the System

LLM and Embedding Models:
The class requires externally provided LLM and embedding model objects that conform to expected interfaces (.chat() and .encode() methods, .llm_name property).
Caching Utilities:
Relies on external caching functions from graphrag.utils to store and retrieve cached LLM and embedding results.
Rate Limiting:
Uses chat_limiter from graphrag.utils to control concurrency and avoid overwhelming LLM APIs.
Timeout Decorator:
Uses timeout from api.utils.api_utils to safeguard long-running calls.
Text Utilities:
Uses truncate from rag.utils to trim texts for token limits.
UMAP and sklearn GMM:
Depends on third-party libraries umap-learn and scikit-learn for dimensionality reduction and clustering.

This file is a core component of a larger retrieval-augmented generation or knowledge retrieval system, enabling hierarchical organization and summarization of large text corpora.

Usage Example

from models import llm_model, embd_model  # hypothetical model instances

prompt_template = (
    "Summarize the following cluster content concisely:\n\n{cluster_content}"
)

raptor = RecursiveAbstractiveProcessing4TreeOrganizedRetrieval(
    max_cluster=5,
    llm_model=llm_model,
    embd_model=embd_model,
    prompt=prompt_template,
    max_token=512,
    threshold=0.1,
)

chunks = [
    ("Text chunk one.", embd_model.encode(["Text chunk one."])[0]),
    ("Text chunk two.", embd_model.encode(["Text chunk two."])[0]),
    # ... more chunks
]

import trio

async def main():
    summarized_chunks = await raptor(chunks, random_state=42)
    for text, embd in summarized_chunks:
        print(text)

trio.run(main)

Mermaid Class Diagram

classDiagram
    class RecursiveAbstractiveProcessing4TreeOrganizedRetrieval {
        -int _max_cluster
        -object _llm_model
        -object _embd_model
        -float _threshold
        -str _prompt
        -int _max_token
        +__init__(max_cluster, llm_model, embd_model, prompt, max_token=512, threshold=0.1)
        +__call__(chunks, random_state, callback=None) async
        -_chat(system, history, gen_conf) async
        -_embedding_encode(txt) async
        -_get_optimal_clusters(embeddings: np.ndarray, random_state: int)
    }

Summary

raptor.py provides a sophisticated recursive clustering and abstractive summarization pipeline to organize text chunks into a hierarchical tree structure, leveraging embeddings, dimensionality reduction, Gaussian Mixture clustering, and powerful LLM summarization with caching and concurrency. This module is critical for building efficient multi-layered retrieval systems and knowledge abstractions.