graph_extractor.py

Overview

The graph_extractor.py module provides functionality to extract unipartite graphs from textual content using large language model (LLM)-based entity extraction techniques. It leverages prompt engineering and iterative querying of an LLM to identify entities and their relationships within text chunks, and then constructs graph representations (using NetworkX) from the extracted data.

This module is designed as part of a larger system for information retrieval and graph-based knowledge representation, specifically interfacing with the graphrag project and related components such as Extractor and LLM invocation wrappers.

Classes

`GraphExtractionResult`

A simple data container class using Python's dataclass decorator.

@dataclass
class GraphExtractionResult:
    output: nx.Graph
    source_docs: dict[Any, Any]

Purpose:
Encapsulates the result of a graph extraction process, containing:
- output: The extracted graph as a NetworkX graph object.
- source_docs: A dictionary containing source documents or metadata related to the extraction.
Usage Example:

result = GraphExtractionResult(output=my_graph, source_docs=my_docs)
print(result.output.nodes)

`GraphExtractor(Extractor)`

Extends the Extractor abstract/base class to specialize in extracting graph structures from input text using an LLM.

Initialization

def __init__(
    self,
    llm_invoker: CompletionLLM,
    language: str | None = "English",
    entity_types: list[str] | None = None,
    example_number: int = 2,
    max_gleanings: int | None = None,
)

Parameters:
- llm_invoker (CompletionLLM): An instance responsible for invoking the LLM to generate completions.
- language (str | None, default "English"): Language preference for entity extraction.
- entity_types (list[str] | None): List of entity types to focus on during extraction.
- example_number (int, default 2): Number of example prompts to include for few-shot prompting.
- max_gleanings (int | None): Maximum number of iterative "gleanings" or passes to refine extraction results; defaults to a constant ENTITY_EXTRACTION_MAX_GLEANINGS.
Functionality:
- Sets up prompt templates based on PROMPTS constants and fills in dynamic context such as entity types, language, and delimiters.
- Prepares initial prompt messages and calculates token budget for LLM queries to manage cost and token limits.
- Configures iterative extraction logic by setting continuation and loop-check prompts.

Method: `_process_single_content`

async def _process_single_content(self, chunk_key_dp: tuple[str, str], chunk_seq: int, num_chunks: int, out_results)

Purpose:
Asynchronously processes a single chunk of text to extract entities and relations using the LLM and accumulates the results.
Parameters:
- chunk_key_dp (tuple[str, str]): A tuple where the first element is a chunk identifier and the second is the content string.
- chunk_seq (int): The sequence number of the chunk being processed.
- num_chunks (int): Total number of chunks.
- out_results (list): A shared list to append extraction results (nodes, edges, token_count) tuples.
Process Workflow:
1. Formats an initial extraction prompt with the chunk content.
2. Makes an initial LLM call to extract raw entity data.
3. Performs iterative "gleaning" passes (up to _max_gleanings), where it:
  - Sends prior history and prompts to the LLM to refine extractions.
  - Checks with the LLM if further iteration ("loop") is necessary via a specialized prompt.
4. Parses the final combined LLM output to extract records containing entity/relationship tuples.
5. Extracts nodes and edges by calling an inherited method _entities_and_relations.
6. Appends the extracted graph data and token usage to out_results.
7. Logs progress and invokes optional callbacks for monitoring.
Return Value:
None (results appended to out_results list).
Usage Example:

# Assuming async context and an instance `extractor` of GraphExtractor
results = []
await extractor._process_single_content(("chunk1", "Some text data..."), 1, 5, results)
print(results)

Important Implementation Details

Prompt Engineering:
The class relies heavily on prompt templates defined in PROMPTS, which include example extractions, delimiters, and continuation instructions to guide the LLM in producing structured outputs.
Iterative Refinement ("Gleanings"):
To improve extraction quality, the extractor performs multiple passes (gleanings) where the LLM is asked to continue or refine previous output. After each gleaning, the LLM is queried whether another iteration is needed.
Token Budget Management:
The number of tokens sent to the LLM is tracked and controlled to avoid exceeding model limits. The class calculates leftover tokens after accounting for prompt size.
Concurrency:
Uses trio for asynchronous execution and runs the LLM calls in a separate thread to avoid blocking the async event loop.
Callback Hooks:
Supports optional callbacks (self.callback) to provide progress updates or logging hooks externally.
Parsing Output:
Extracted raw strings are split using custom delimiters, then regex is applied to parse tuples representing entities and relationships.

Interactions with Other System Components

graphrag.general.extractor.Extractor:
GraphExtractor extends this base class, inheriting helper methods and interface contracts for extraction.
rag.llm.chat_model.Base (CompletionLLM):
The LLM invocation engine is injected via this interface, abstracting underlying OpenAI or similar API calls.
graphrag.light.graph_prompt.PROMPTS:
Supplies prompt templates for guiding the LLM during extraction.
graphrag.utils:
Utilities such as chat_limiter (probably a concurrency rate limiter), message packing, and string splitting are used.
networkx:
Used to represent extracted graphs.
Asynchronous Execution:
Uses trio for async operations to efficiently manage concurrent processing of multiple text chunks.

Mermaid Class Diagram

classDiagram
    class GraphExtractionResult {
        +output: nx.Graph
        +source_docs: dict[Any, Any]
    }

    class GraphExtractor {
        -_max_gleanings: int
        -_example_number: int
        -_entity_extract_prompt: str
        -_context_base: dict
        -_continue_prompt: str
        -_if_loop_prompt: str
        -_left_token_count: float
        +__init__(llm_invoker, language, entity_types, example_number, max_gleanings)
        +_process_single_content(chunk_key_dp, chunk_seq, num_chunks, out_results)
    }

    GraphExtractor --|> Extractor
    GraphExtractor ..> CompletionLLM : uses
    GraphExtractor ..> "networkx.Graph" : creates

Summary

The graph_extractor.py file is a critical component that converts raw textual data into structured graph representations by leveraging advanced LLM prompting and iterative refinement techniques. Its asynchronous design and modular prompts enable scalable and customizable extraction workflows, making it suitable for knowledge graph construction, entity-relationship extraction, and downstream graph analytics in the broader graphrag ecosystem.