raptor.py


Overview

raptor.py implements recursive abstractive summarization for tree-organized retrieval over large collections of text chunks. It embeds the input chunks, clusters semantically similar chunks, and recursively summarizes each cluster into a higher-level chunk. The resulting layered, tree-like structure supports efficient retrieval for downstream applications such as knowledge retrieval or question answering.

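Key features include embedding-based clustering with dimensionality reduction and a Gaussian Mixture Model, LLM summarization of each cluster, caching of LLM and embedding calls, and concurrent execution.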

This file primarily encapsulates the class RecursiveAbstractiveProcessing4TreeOrganizedRetrieval, which performs the entire pipeline when called with input text chunks.


Class: RecursiveAbstractiveProcessing4TreeOrganizedRetrieval

Purpose

This class orchestrates recursive abstractive summarization of a set of text chunks. It clusters the chunk embeddings, summarizes each cluster into a new chunk, and repeats the process on the new chunks until a hierarchical tree of summarized content is formed.

Constructor

def __init__(
    self, max_cluster, llm_model, embd_model, prompt, max_token=512, threshold=0.1
)

Parameters

Internal Methods

_chat(self, system, history, gen_conf)

Asynchronously performs an LLM chat request with caching and error handling.
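A minimal sketch of the cache-then-call pattern described above. It is not the module's code: the CachedChat class, the ChatError exception, the llm_call callable, and the "**ERROR**" marker convention are all assumptions made for illustration, and it uses asyncio rather than trio so that it runs with the standard library alone.

```python
import asyncio
import hashlib
import json


class ChatError(Exception):
    """Raised when the (hypothetical) model call reports a failure."""


class CachedChat:
    def __init__(self, llm_call):
        # llm_call: hypothetical async callable (system, history, gen_conf) -> str
        self._llm_call = llm_call
        self._cache = {}

    @staticmethod
    def _key(system, history, gen_conf):
        # Serialize the full request so identical requests share one cache entry.
        payload = json.dumps([system, history, gen_conf], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    async def chat(self, system, history, gen_conf):
        key = self._key(system, history, gen_conf)
        if key in self._cache:
            # Cache hit: skip the model call entirely.
            return self._cache[key]
        response = await self._llm_call(system, history, gen_conf)
        # Assumed convention: failures come back as text containing "**ERROR**".
        if "**ERROR**" in response:
            raise ChatError(response)
        # Only successful responses are cached, so a failed request can be retried.
        self._cache[key] = response
        return response
```

Keying the cache on the whole request (system prompt, history, and generation settings) means a change to any of the three triggers a fresh model call.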


_embedding_encode(self, txt)

Asynchronously obtains or computes the embedding vector for a text string with caching.


_get_optimal_clusters(self, embeddings: np.ndarray, random_state: int)

Determines the optimal number of clusters using a Gaussian Mixture Model and the Bayesian Information Criterion (BIC).
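One way to do BIC-based selection is sketched below with scikit-learn. The standalone function name, the explicit max_cluster argument, and the candidate range 1..max_cluster are assumptions for illustration; the method in raptor.py may choose its candidates differently.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def get_optimal_clusters(
    embeddings: np.ndarray, max_cluster: int, random_state: int
) -> int:
    # A mixture cannot have more components than there are samples.
    upper = min(max_cluster, len(embeddings))
    if upper < 2:
        return 1
    candidates = list(range(1, upper + 1))
    bics = []
    for n in candidates:
        gm = GaussianMixture(n_components=n, random_state=random_state)
        gm.fit(embeddings)
        # Lower BIC means a better trade-off between fit and model size.
        bics.append(gm.bic(embeddings))
    return candidates[int(np.argmin(bics))]
```

BIC adds a penalty that grows with the number of mixture parameters, so extra components are accepted only when they improve the likelihood enough to pay for themselves.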


Main Callable Method

__call__(self, chunks, random_state, callback=None)

This is the main entry point of the class. It recursively clusters and summarizes input text chunks.
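The layer-by-layer recursion can be sketched with plain Python. Here cluster, summarize, and embed are hypothetical stand-ins for the GMM clustering, the LLM call, and the embedding model. Returning the original chunks together with every generated summary is also an assumption, not confirmed behavior of __call__.

```python
from typing import Callable, List, Sequence, Tuple

Chunk = Tuple[str, List[float]]  # (text, embedding)


def build_layers(
    chunks: Sequence[Chunk],
    cluster: Callable[[Sequence[Chunk]], List[List[int]]],
    summarize: Callable[[List[str]], str],
    embed: Callable[[str], List[float]],
) -> List[Chunk]:
    all_chunks = list(chunks)
    start, end = 0, len(all_chunks)
    # Keep adding layers until the newest layer holds a single chunk.
    while end - start > 1:
        layer = all_chunks[start:end]
        groups = cluster(layer)  # each group is a list of indices into `layer`
        # Stop if clustering makes no progress, to avoid looping forever.
        if not groups or len(groups) >= len(layer):
            break
        for group in groups:
            texts = [layer[i][0] for i in group]
            summary = summarize(texts)
            # Each summary becomes a chunk of the next layer.
            all_chunks.append((summary, embed(summary)))
        start, end = end, len(all_chunks)
    return all_chunks
```

With four input chunks and a clustering that pairs neighbours, this yields two summaries in the second layer and one root summary in the third, for seven chunks in total.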


Implementation Details and Algorithms


Interactions with Other Parts of the System

This file is a core component of a larger retrieval-augmented generation or knowledge retrieval system, enabling hierarchical organization and summarization of large text corpora.


Usage Example

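import trio

from models import llm_model, embd_model  # hypothetical model instances

# Assumes raptor.py is importable as a top-level module; adjust the path as needed.
from raptor import RecursiveAbstractiveProcessing4TreeOrganizedRetrieval

# Template with a {cluster_content} placeholder for the text of one cluster.
prompt_template = (
    "Summarize the following cluster content concisely:\n\n{cluster_content}"
)

raptor = RecursiveAbstractiveProcessing4TreeOrganizedRetrieval(
    max_cluster=5,
    llm_model=llm_model,
    embd_model=embd_model,
    prompt=prompt_template,
    max_token=512,
    threshold=0.1,
)

# Each chunk is a (text, embedding) pair.
chunks = [
    ("Text chunk one.", embd_model.encode(["Text chunk one."])[0]),
    ("Text chunk two.", embd_model.encode(["Text chunk two."])[0]),
    # ... more chunks
]

async def main():
    # __call__ is async, so it must be awaited inside an async function.
    summarized_chunks = await raptor(chunks, random_state=42)
    for text, _embd in summarized_chunks:
        print(text)

trio.run(main)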

Mermaid Class Diagram

classDiagram
    class RecursiveAbstractiveProcessing4TreeOrganizedRetrieval {
        -int _max_cluster
        -object _llm_model
        -object _embd_model
        -float _threshold
        -str _prompt
        -int _max_token
        +__init__(max_cluster, llm_model, embd_model, prompt, max_token=512, threshold=0.1)
        +__call__(chunks, random_state, callback=None) async
        -_chat(system, history, gen_conf) async
        -_embedding_encode(txt) async
        -_get_optimal_clusters(embeddings: np.ndarray, random_state: int)
    }

Summary

raptor.py provides a recursive clustering and abstractive summarization pipeline that organizes text chunks into a hierarchical tree. It combines embeddings, dimensionality reduction, Gaussian Mixture clustering, and LLM summarization, with caching and concurrency. The module is a building block for multi-layered retrieval systems and knowledge abstractions.