graph_extractor.py

Overview

The graph_extractor.py file provides an implementation for extracting unipartite graphs from textual content using large language models (LLMs). It defines classes and methods that convert raw input text into graph representations, identifying entities as nodes and their relationships as edges. This extraction is performed via iterative prompting and completion cycles with an LLM, allowing the system to "glean" multiple entity relations from a single text chunk.

This file is part of a broader system that leverages LLMs for information extraction tasks, specifically focusing on graph structure extraction from unstructured text. It integrates with other components such as prompt templates, tokenization utilities, and LLM invocation abstractions to achieve its functionality.

Classes and Functions

`GraphExtractionResult`

@dataclass
class GraphExtractionResult:
    output: nx.Graph
    source_docs: dict[Any, Any]

Description:
A simple data container class (using Python's dataclass) that encapsulates the output of the graph extraction process.
Attributes:
- output (nx.Graph): The resulting NetworkX graph constructed from the extracted entities and relations.
- source_docs (dict[Any, Any]): Mapping of source documents or content chunks used in the extraction.

Usage Example:

result = GraphExtractionResult(output=my_graph, source_docs=my_docs)
print(result.output.nodes())

`GraphExtractor`

class GraphExtractor(Extractor):
    ...

Description:
Main class responsible for extracting a unipartite graph from input text by interacting with an LLM. It inherits from a base Extractor class and uses prompt engineering to guide the language model to produce structured graph tuples representing nodes and edges.
Key Attributes:
- _llm: The LLM invoker instance used to send prompts and receive completions.
- _join_descriptions: Controls whether descriptions of entities are joined.
- _prompt_variables: Dictionary of variables used to customize prompts (delimiters, entity types, etc.).
- _max_gleanings: Maximum number of iterative extraction passes to perform per text chunk.
- _loop_args: Contains parameters for controlling the continuation loop behavior via logit biasing.
- _extraction_prompt: The template prompt string used for initial extraction.
- _on_error: Callback function to handle errors during extraction.
- prompt_token_count: Number of tokens in the initial extraction prompt (used for cost/token budgeting).

`init`

def __init__(
    self,
    llm_invoker: CompletionLLM,
    language: str | None = "English",
    entity_types: list[str] | None = None,
    tuple_delimiter_key: str | None = None,
    record_delimiter_key: str | None = None,
    input_text_key: str | None = None,
    entity_types_key: str | None = None,
    completion_delimiter_key: str | None = None,
    join_descriptions=True,
    max_gleanings: int | None = None,
    on_error: ErrorHandlerFn | None = None,
):

Description:
Initializes the GraphExtractor instance with various configuration parameters and dependencies.
Parameters:
- llm_invoker (CompletionLLM): The language model interface used to issue prompts and receive responses.
- language (str | None): Language for extraction; defaults to English.
- entity_types (list[str] | None): List of entity types to extract.
- tuple_delimiter_key (str | None): Key name for tuple delimiter in prompts.
- record_delimiter_key (str | None): Key name for record delimiter in prompts.
- input_text_key (str | None): Key name for input text variable in prompts.
- entity_types_key (str | None): Key name for entity types in prompts.
- completion_delimiter_key (str | None): Key name for completion delimiter in prompts.
- join_descriptions (bool): Whether to join descriptions of entities.
- max_gleanings (int | None): Maximum number of extraction loops to perform.
- on_error (ErrorHandlerFn | None): Callback to handle errors encountered during processing.
Returns:
- Instance of GraphExtractor.
Implementation Details:
- Uses default constants for delimiters if none provided.
- Sets up prompt variables with default delimiters and entity types.
- Calculates token count for the extraction prompt to help manage token budgets.
- Prepares logit_bias for controlling continuation prompts in the loop using token encoding of "YES" and "NO".

Usage Example:

from rag.llm.chat_model import Base as CompletionLLM

llm_instance = CompletionLLM(...)
extractor = GraphExtractor(
    llm_invoker=llm_instance,
    entity_types=["Person", "Organization"],
    max_gleanings=5
)

`_process_single_content`

async def _process_single_content(self, chunk_key_dp: tuple[str, str], chunk_seq: int, num_chunks: int, out_results):

Description:
Internal asynchronous method to process a single chunk of text and extract graph entities and relations by interacting with the LLM. It performs multi-turn prompting to maximize entity extraction ("gleaning") from the content.
Parameters:
- chunk_key_dp (tuple[str, str]): Tuple containing a unique chunk key and the text content string.
- chunk_seq (int): The sequence number/index of this content chunk.
- num_chunks (int): Total number of chunks being processed in the batch.
- out_results (list): A shared list where the function appends the extraction results as tuples (nodes, edges, token_count).
Returns:
- None (results are appended to out_results).
Implementation Details:
- Prepares prompt variables and performs variable replacements in the extraction prompt.
- Sends an initial prompt to the LLM to extract entities and relations.
- Enters a loop to continue prompting the LLM with continuation prompts (CONTINUE_PROMPT and LOOP_PROMPT) to extract additional entities up to _max_gleanings.
- Uses token counting utilities to keep track of tokens used for cost and quota management.
- Splits the combined LLM response into records using configured delimiters.
- Extracts tuples from records and calls _entities_and_relations to parse nodes and edges.
- Optionally invokes a callback to report progress.

Usage Example:

import trio

async def extract_graph():
    extractor = GraphExtractor(llm_invoker=llm_instance)
    out_results = []
    await extractor._process_single_content(("chunk1", "Alice works at Contoso."), 1, 1, out_results)
    nodes, edges, token_count = out_results[0]
    print(f"Extracted {len(nodes)} nodes and {len(edges)} edges using {token_count} tokens.")
trio.run(extract_graph)

Important Implementation Details

Iterative Extraction ("Gleaning") Loop:
The _process_single_content method repeatedly prompts the LLM using specially crafted prompts to maximize the number of extracted entities and relationships. This loop uses a combination of CONTINUE_PROMPT (to request additional output) and LOOP_PROMPT (to ask if more data is available) with logit biasing to encourage clear "YES"/"NO" responses.
Prompt Variable Replacement:
The extraction prompt uses variable placeholders (like delimiters, entity types, and input text) that are dynamically replaced to customize the prompt per chunk.
Token Counting:
The system counts tokens consumed by prompts and LLM responses using num_tokens_from_string to manage costs or limits on LLM usage.
Error Handling:
The constructor accepts an on_error callback that can be used to handle exceptions or errors during extraction, although this is not explicitly used in the shown code snippet.
Concurrency and Rate Limiting:
Uses trio for asynchronous execution and chat_limiter to throttle requests to the LLM, preventing overloading or rate limit violations.
Parsing of Extracted Records:
Extracted text records are split by multiple delimiters and parsed via regex to extract tuples in the form of (entity1, relation, entity2) or similar. These tuples are converted into graph nodes and edges.

Interaction with Other System Components

Extractor Base Class:
GraphExtractor extends a general Extractor class, likely providing shared extraction interfaces and utilities.
LLM Interface (CompletionLLM):
Uses a standardized LLM interface for issuing chat completions. This abstraction allows plugging in different underlying LLM implementations.
Prompt Templates:
Uses prompt templates imported from graphrag.general.graph_prompt such as GRAPH_EXTRACTION_PROMPT, CONTINUE_PROMPT, and LOOP_PROMPT to control LLM behavior.
Tokenization and Encoding:
Uses tiktoken package to encode tokens and calculate token counts, as well as to set logit biases.
Utility Functions:
Employs utility functions like perform_variable_replacements for prompt customization, split_string_by_multi_markers for parsing output, and num_tokens_from_string for token counting.
Graph Representation:
The extracted entities and relations are represented as a NetworkX graph (nx.Graph).

Visual Diagram: Class Diagram for `GraphExtractor` and `GraphExtractionResult`

classDiagram
    class GraphExtractionResult {
        +output: nx.Graph
        +source_docs: dict[Any, Any]
    }

    class GraphExtractor {
        -_llm: CompletionLLM
        -_join_descriptions: bool
        -_tuple_delimiter_key: str
        -_record_delimiter_key: str
        -_entity_types_key: str
        -_input_text_key: str
        -_completion_delimiter_key: str
        -_entity_name_key: str
        -_input_descriptions_key: str
        -_extraction_prompt: str
        -_summarization_prompt: str
        -_loop_args: dict[str, Any]
        -_max_gleanings: int
        -_on_error: ErrorHandlerFn
        +__init__(llm_invoker, language, entity_types, tuple_delimiter_key, record_delimiter_key, input_text_key, entity_types_key, completion_delimiter_key, join_descriptions, max_gleanings, on_error)
        +_process_single_content(chunk_key_dp, chunk_seq, num_chunks, out_results)
    }

    GraphExtractor --|> Extractor

Summary

The graph_extractor.py file implements a specialized extractor that converts text into graph structures using iterative prompting of language models. It manages prompt customization, multi-turn extraction loops, token counting, and error handling. Results are returned as NetworkX graphs annotated with source document metadata. This component fits into a larger system for advanced information extraction leveraging LLMs and prompt engineering.

End of Documentation for graph_extractor.py

graph_extractor.py

Overview

Classes and Functions

GraphExtractionResult

GraphExtractor

__init__

_process_single_content

Important Implementation Details

Interaction with Other System Components

Visual Diagram: Class Diagram for GraphExtractor and GraphExtractionResult

Summary

End of Documentation for graph_extractor.py

`GraphExtractionResult`

`GraphExtractor`

`init`

`_process_single_content`

Visual Diagram: Class Diagram for `GraphExtractor` and `GraphExtractionResult`