graph_extractor.py
Overview
The graph_extractor.py file provides an implementation for extracting unipartite graphs from textual content using large language models (LLMs). It defines classes and methods that convert raw input text into graph representations, identifying entities as nodes and their relationships as edges. This extraction is performed via iterative prompting and completion cycles with an LLM, allowing the system to "glean" multiple entity relations from a single text chunk.
This file is part of a broader system that leverages LLMs for information extraction tasks, specifically focusing on graph structure extraction from unstructured text. It integrates with other components such as prompt templates, tokenization utilities, and LLM invocation abstractions to achieve its functionality.
Classes and Functions
GraphExtractionResult
@dataclass
class GraphExtractionResult:
output: nx.Graph
source_docs: dict[Any, Any]
Description:
A simple data container class (using Python'sdataclass) that encapsulates the output of the graph extraction process.Attributes:
output(nx.Graph): The resulting NetworkX graph constructed from the extracted entities and relations.source_docs(dict[Any, Any]): Mapping of source documents or content chunks used in the extraction.
Usage Example:
result = GraphExtractionResult(output=my_graph, source_docs=my_docs) print(result.output.nodes())
GraphExtractor
class GraphExtractor(Extractor):
...
Description:
Main class responsible for extracting a unipartite graph from input text by interacting with an LLM. It inherits from a baseExtractorclass and uses prompt engineering to guide the language model to produce structured graph tuples representing nodes and edges.Key Attributes:
_llm: The LLM invoker instance used to send prompts and receive completions._join_descriptions: Controls whether descriptions of entities are joined._prompt_variables: Dictionary of variables used to customize prompts (delimiters, entity types, etc.)._max_gleanings: Maximum number of iterative extraction passes to perform per text chunk._loop_args: Contains parameters for controlling the continuation loop behavior via logit biasing._extraction_prompt: The template prompt string used for initial extraction._on_error: Callback function to handle errors during extraction.prompt_token_count: Number of tokens in the initial extraction prompt (used for cost/token budgeting).
__init__
def __init__(
self,
llm_invoker: CompletionLLM,
language: str | None = "English",
entity_types: list[str] | None = None,
tuple_delimiter_key: str | None = None,
record_delimiter_key: str | None = None,
input_text_key: str | None = None,
entity_types_key: str | None = None,
completion_delimiter_key: str | None = None,
join_descriptions=True,
max_gleanings: int | None = None,
on_error: ErrorHandlerFn | None = None,
):
Description:
Initializes theGraphExtractorinstance with various configuration parameters and dependencies.Parameters:
llm_invoker(CompletionLLM): The language model interface used to issue prompts and receive responses.language(str | None): Language for extraction; defaults to English.entity_types(list[str] | None): List of entity types to extract.tuple_delimiter_key(str | None): Key name for tuple delimiter in prompts.record_delimiter_key(str | None): Key name for record delimiter in prompts.input_text_key(str | None): Key name for input text variable in prompts.entity_types_key(str | None): Key name for entity types in prompts.completion_delimiter_key(str | None): Key name for completion delimiter in prompts.join_descriptions(bool): Whether to join descriptions of entities.max_gleanings(int | None): Maximum number of extraction loops to perform.on_error(ErrorHandlerFn | None): Callback to handle errors encountered during processing.
Returns:
Instance of
GraphExtractor.
Implementation Details:
Uses default constants for delimiters if none provided.
Sets up prompt variables with default delimiters and entity types.
Calculates token count for the extraction prompt to help manage token budgets.
Prepares
logit_biasfor controlling continuation prompts in the loop using token encoding of "YES" and "NO".
Usage Example:
from rag.llm.chat_model import Base as CompletionLLM llm_instance = CompletionLLM(...) extractor = GraphExtractor( llm_invoker=llm_instance, entity_types=["Person", "Organization"], max_gleanings=5 )
_process_single_content
async def _process_single_content(self, chunk_key_dp: tuple[str, str], chunk_seq: int, num_chunks: int, out_results):
Description:
Internal asynchronous method to process a single chunk of text and extract graph entities and relations by interacting with the LLM. It performs multi-turn prompting to maximize entity extraction ("gleaning") from the content.Parameters:
chunk_key_dp(tuple[str, str]): Tuple containing a unique chunk key and the text content string.chunk_seq(int): The sequence number/index of this content chunk.num_chunks(int): Total number of chunks being processed in the batch.out_results(list): A shared list where the function appends the extraction results as tuples(nodes, edges, token_count).
Returns:
None(results are appended toout_results).
Implementation Details:
Prepares prompt variables and performs variable replacements in the extraction prompt.
Sends an initial prompt to the LLM to extract entities and relations.
Enters a loop to continue prompting the LLM with continuation prompts (
CONTINUE_PROMPTandLOOP_PROMPT) to extract additional entities up to_max_gleanings.Uses token counting utilities to keep track of tokens used for cost and quota management.
Splits the combined LLM response into records using configured delimiters.
Extracts tuples from records and calls
_entities_and_relationsto parse nodes and edges.Optionally invokes a callback to report progress.
Usage Example:
import trio async def extract_graph(): extractor = GraphExtractor(llm_invoker=llm_instance) out_results = [] await extractor._process_single_content(("chunk1", "Alice works at Contoso."), 1, 1, out_results) nodes, edges, token_count = out_results[0] print(f"Extracted {len(nodes)} nodes and {len(edges)} edges using {token_count} tokens.") trio.run(extract_graph)
Important Implementation Details
Iterative Extraction ("Gleaning") Loop:
The_process_single_contentmethod repeatedly prompts the LLM using specially crafted prompts to maximize the number of extracted entities and relationships. This loop uses a combination ofCONTINUE_PROMPT(to request additional output) andLOOP_PROMPT(to ask if more data is available) with logit biasing to encourage clear "YES"/"NO" responses.Prompt Variable Replacement:
The extraction prompt uses variable placeholders (like delimiters, entity types, and input text) that are dynamically replaced to customize the prompt per chunk.Token Counting:
The system counts tokens consumed by prompts and LLM responses usingnum_tokens_from_stringto manage costs or limits on LLM usage.Error Handling:
The constructor accepts anon_errorcallback that can be used to handle exceptions or errors during extraction, although this is not explicitly used in the shown code snippet.Concurrency and Rate Limiting:
Usestriofor asynchronous execution andchat_limiterto throttle requests to the LLM, preventing overloading or rate limit violations.Parsing of Extracted Records:
Extracted text records are split by multiple delimiters and parsed via regex to extract tuples in the form of(entity1, relation, entity2)or similar. These tuples are converted into graph nodes and edges.
Interaction with Other System Components
ExtractorBase Class:GraphExtractorextends a generalExtractorclass, likely providing shared extraction interfaces and utilities.LLM Interface (
CompletionLLM):
Uses a standardized LLM interface for issuing chat completions. This abstraction allows plugging in different underlying LLM implementations.Prompt Templates:
Uses prompt templates imported fromgraphrag.general.graph_promptsuch asGRAPH_EXTRACTION_PROMPT,CONTINUE_PROMPT, andLOOP_PROMPTto control LLM behavior.Tokenization and Encoding:
Usestiktokenpackage to encode tokens and calculate token counts, as well as to set logit biases.Utility Functions:
Employs utility functions likeperform_variable_replacementsfor prompt customization,split_string_by_multi_markersfor parsing output, andnum_tokens_from_stringfor token counting.Graph Representation:
The extracted entities and relations are represented as a NetworkX graph (nx.Graph).
Visual Diagram: Class Diagram for GraphExtractor and GraphExtractionResult
classDiagram
class GraphExtractionResult {
+output: nx.Graph
+source_docs: dict[Any, Any]
}
class GraphExtractor {
-_llm: CompletionLLM
-_join_descriptions: bool
-_tuple_delimiter_key: str
-_record_delimiter_key: str
-_entity_types_key: str
-_input_text_key: str
-_completion_delimiter_key: str
-_entity_name_key: str
-_input_descriptions_key: str
-_extraction_prompt: str
-_summarization_prompt: str
-_loop_args: dict[str, Any]
-_max_gleanings: int
-_on_error: ErrorHandlerFn
+__init__(llm_invoker, language, entity_types, tuple_delimiter_key, record_delimiter_key, input_text_key, entity_types_key, completion_delimiter_key, join_descriptions, max_gleanings, on_error)
+_process_single_content(chunk_key_dp, chunk_seq, num_chunks, out_results)
}
GraphExtractor --|> Extractor
Summary
The graph_extractor.py file implements a specialized extractor that converts text into graph structures using iterative prompting of language models. It manages prompt customization, multi-turn extraction loops, token counting, and error handling. Results are returned as NetworkX graphs annotated with source document metadata. This component fits into a larger system for advanced information extraction leveraging LLMs and prompt engineering.