entity_resolution.py


Overview

The entity_resolution.py module performs entity resolution on graphs in the InfiniFlow project. Its EntityResolution class extends the Extractor base class to find graph nodes that refer to the same real-world entity and merge them. A language model (LLM) evaluates candidate pairs of entities and decides whether each pair represents the same entity; nodes judged identical are then merged in the graph.

The module supports asynchronous batch processing of candidate pairs with concurrency control, uses heuristics to pre-filter pairs likely to be duplicates, and updates the graph structure and node rankings (PageRank) as a result of the resolution.
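
The pre-filtering step can be pictured with a small sketch. It assumes that candidates are only compared within the same entity type and that a cheap string predicate decides which pairs are worth sending to the LLM. The names candidate_pairs and shared_prefix and the input shape are hypothetical and are not taken from the module.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(node_types, looks_similar):
    """Group node names by entity type and keep pairs that pass a cheap filter.

    node_types: dict mapping node name -> entity type (hypothetical input shape).
    looks_similar: predicate on two names; stands in for the module's heuristic.
    """
    by_type = defaultdict(list)
    for name, entity_type in node_types.items():
        by_type[entity_type].append(name)

    pairs = {}
    for entity_type, names in by_type.items():
        kept = [(a, b) for a, b in combinations(sorted(names), 2) if looks_similar(a, b)]
        if kept:
            pairs[entity_type] = kept
    return pairs


def shared_prefix(a, b):
    # Illustrative filter only: same first three characters, ignoring case.
    return a[:3].lower() == b[:3].lower()


nodes = {
    "Acme Corp": "organization",
    "ACME Corporation": "organization",
    "Globex": "organization",
    "Acme": "product",
}
result = candidate_pairs(nodes, shared_prefix)
print(result)
```

Only the two organization names that share a prefix survive; "Acme" the product is never compared with them because it has a different type.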


Classes and Functions

Class: EntityResolutionResult

@dataclass
class EntityResolutionResult:
    graph: nx.Graph
    change: GraphChange

Class: EntityResolution

class EntityResolution(Extractor):

Constructor: __init__

def __init__(self, llm_invoker: CompletionLLM)

Method: __call__

async def __call__(
    self,
    graph: nx.Graph,
    subgraph_nodes: set[str],
    prompt_variables: dict[str, Any] | None = None,
    callback: Callable | None = None,
) -> EntityResolutionResult
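
Before nodes can be merged, pairwise "same entity" verdicts have to be grouped into sets of names that all refer to one entity. The sketch below shows one way to do that with union-find, assuming the verdicts are treated as transitive. The function cluster_pairs is hypothetical; this document does not show how __call__ groups its results.

```python
def cluster_pairs(pairs):
    """Group names connected by 'same entity' verdicts, using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for name in parent:
        clusters.setdefault(find(name), set()).add(name)
    # Sort for a deterministic result.
    return sorted(sorted(group) for group in clusters.values())


resolved_pairs = [
    ("IBM", "I.B.M."),
    ("I.B.M.", "International Business Machines"),
    ("Paris", "Paris, France"),
]
clusters = cluster_pairs(resolved_pairs)
print(clusters)
```

Each resulting cluster would then be collapsed into a single node, with the other names merged into it.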

Method: _resolve_candidate

async def _resolve_candidate(
    self,
    candidate_resolution_i: tuple[str, list[tuple[str, str]]],
    resolution_result: set[str],
    resolution_result_lock: trio.Lock
)
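
The signature suggests that several candidate batches run concurrently and write into one shared result set guarded by a lock. The sketch below illustrates that pattern under stated assumptions: it uses asyncio from the standard library instead of trio so that it runs without extra dependencies, and fake_llm_same_entity, resolve_batch, and resolve_all are hypothetical stand-ins, not the module's code.

```python
import asyncio


async def fake_llm_same_entity(a, b):
    # Stand-in for an LLM call: names equal up to case count as the same entity.
    await asyncio.sleep(0)
    return a.lower() == b.lower()


async def resolve_batch(pairs, resolved, lock, limiter):
    # The semaphore bounds how many batches are in flight at once.
    async with limiter:
        verdicts = [(pair, await fake_llm_same_entity(*pair)) for pair in pairs]
    # The lock serialises updates to the shared result set.
    async with lock:
        resolved.update(pair for pair, same in verdicts if same)


async def resolve_all(batches, max_concurrency=2):
    resolved = set()
    lock = asyncio.Lock()
    limiter = asyncio.Semaphore(max_concurrency)
    await asyncio.gather(*(resolve_batch(b, resolved, lock, limiter) for b in batches))
    return resolved


batches = [
    [("IBM", "ibm"), ("IBM", "Intel")],
    [("Paris", "PARIS")],
    [("Rome", "Roma")],
]
resolved = asyncio.run(resolve_all(batches))
print(sorted(resolved))
```

In trio the same shape would typically use a nursery for the tasks, a trio.CapacityLimiter or trio.Semaphore for the bound, and trio.Lock for the shared set.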

Method: _process_results

def _process_results(
    self,
    records_length: int,
    results: str,
    record_delimiter: str,
    entity_index_delimiter: str,
    resolution_result_delimiter: str
) -> list
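
The three delimiter parameters suggest that the LLM reply is a delimited string: records separated by one marker, with the pair index and the verdict each wrapped in their own marker. The sketch below parses such a reply. The delimiter values ("##", "<|>", "&&"), the yes/no vocabulary, and the (index, bool) return shape are assumptions made for illustration, not the module's documented format.

```python
import re


def parse_resolution_output(records_length, results, record_delimiter,
                            entity_index_delimiter, resolution_result_delimiter):
    """Return (index, is_same) tuples for well-formed records; skip the rest."""
    index_re = re.compile(
        re.escape(entity_index_delimiter) + r"(\d+)" + re.escape(entity_index_delimiter)
    )
    verdict_re = re.compile(
        re.escape(resolution_result_delimiter) + r"(yes|no)"
        + re.escape(resolution_result_delimiter),
        re.IGNORECASE,
    )
    parsed = []
    for record in results.split(record_delimiter):
        index_match = index_re.search(record)
        verdict_match = verdict_re.search(record)
        if not index_match or not verdict_match:
            continue  # malformed or empty record
        index = int(index_match.group(1))
        if 1 <= index <= records_length:  # ignore indices outside the batch
            parsed.append((index, verdict_match.group(1).lower() == "yes"))
    return parsed


raw = (
    "Question <|>1<|>: the two names match. &&yes&&##"
    "Question <|>2<|>: different companies. &&No&&##"
    "Question <|>7<|>: out of range. &&yes&&##"
    "no markers here"
)
parsed = parse_resolution_output(2, raw, "##", "<|>", "&&")
print(parsed)
```

Passing records_length lets the parser discard indices that do not correspond to any pair in the batch, which guards against the model inventing extra records.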

Method: _has_digit_in_2gram_diff

def _has_digit_in_2gram_diff(self, a: str, b: str) -> bool
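
Going by its name, this method checks whether the character 2-grams that differ between two strings contain a digit, which would keep names such as "iPhone 12" and "iPhone 13" apart. The standalone sketch below is one reading of that idea; treating the "diff" as a symmetric set difference is an assumption.

```python
def bigrams(text):
    # All overlapping two-character substrings of text.
    return {text[i:i + 2] for i in range(len(text) - 1)}


def has_digit_in_2gram_diff(a, b):
    # 2-grams that appear in exactly one of the two strings.
    differing = bigrams(a) ^ bigrams(b)
    return any(ch.isdigit() for gram in differing for ch in gram)


versions_differ = has_digit_in_2gram_diff("iPhone 12", "iPhone 13")
same_digits = has_digit_in_2gram_diff("Route 66", "route 66")
no_digits = has_digit_in_2gram_diff("Microsoft", "Microsft")
print(versions_differ, same_digits, no_digits)
```

A True result means the two names differ in a numeric part and are probably distinct entities, so the pair can be skipped before any LLM call.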

Method: is_similarity

def is_similarity(self, a: str, b: str) -> bool
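
This document gives only the signature of is_similarity, so the rule it applies is not shown here. As a rough illustration of a boolean name-similarity filter, the sketch below uses difflib.SequenceMatcher from the standard library as a stand-in. The function name is_similar and the 0.8 threshold are assumptions.

```python
from difflib import SequenceMatcher


def is_similar(a, b, threshold=0.8):
    # Ratio is 2 * matches / total length, in [0, 1]; case is ignored.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


print(is_similar("Microsoft Corp", "Microsoft Corp."))
print(is_similar("Apple", "Google"))
```

A filter like this trades recall for cost: a higher threshold sends fewer pairs to the LLM but may miss aliases that look nothing alike.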

Important Implementation Details and Algorithms


Interactions with Other System Components


Usage Example

import networkx as nx
import trio

from rag.llm.chat_model import Base as CompletionLLM
from entity_resolution import EntityResolution

# Assume graph is an existing networkx Graph whose nodes carry 'entity_type' attributes
graph = nx.Graph()
# ... populate graph ...

# LLM invoker instance (implementation dependent)
llm = CompletionLLM(...)

# Create the EntityResolution instance
entity_resolver = EntityResolution(llm)

# Nodes to consider for resolution (here, every node in the graph)
subgraph_nodes = set(graph.nodes)

# Optional callback to track progress
def progress_callback(msg):
    print(msg)

async def resolve_entities():
    result = await entity_resolver(graph, subgraph_nodes, callback=progress_callback)
    resolved_graph = result.graph
    changes = result.change
    # Use resolved_graph and changes as needed

# __call__ is a coroutine, so drive it from trio's event loop
trio.run(resolve_entities)

Mermaid Class Diagram

classDiagram
    class EntityResolutionResult {
        +graph: nx.Graph
        +change: GraphChange
    }

    class EntityResolution {
        -_resolution_prompt: str
        -_output_formatter_prompt: str
        -_record_delimiter_key: str
        -_entity_index_delimiter_key: str
        -_resolution_result_delimiter_key: str
        -_llm: CompletionLLM
        +__init__(llm_invoker: CompletionLLM)
        +__call__(graph: nx.Graph, subgraph_nodes: set[str], prompt_variables: dict | None, callback: Callable | None) async EntityResolutionResult
        -_resolve_candidate(candidate_resolution_i: tuple, resolution_result: set, resolution_result_lock: trio.Lock) async
        -_process_results(records_length: int, results: str, record_delimiter: str, entity_index_delimiter: str, resolution_result_delimiter: str) list
        -_has_digit_in_2gram_diff(a: str, b: str) bool
        +is_similarity(a: str, b: str) bool
    }

    Extractor <|-- EntityResolution
    EntityResolution ..> EntityResolutionResult : returns

Summary

The entity_resolution.py module deduplicates and merges entities in graph data by combining LLM inference with heuristic pre-filtering and asynchronous batch processing. It hides the details of prompting the LLM for entity comparison, limits concurrency while batches are in flight, and updates the graph structure and node rankings (PageRank) to reflect the resolved entities. It depends on InfiniFlow's graph processing and LLM infrastructure.