extractor.py


Overview

extractor.py is a core utility in the InfiniFlow system. It uses a Large Language Model (LLM) to extract, merge, and summarize entities and relationships from chunks of text. Its asynchronous pipeline processes multiple chunks concurrently, identifies nodes (entities) and edges (relationships), merges redundant mentions, and uses LLM-powered summarization to condense the resulting descriptions. The file depends heavily on the LLM abstraction layer and on graph data structures, and it connects natural language understanding to graph-based knowledge representations.


Classes and Functions

Class: Extractor

The Extractor class extracts entities and relationships from text chunks. It wraps an LLM invoker and manages concurrency, caching, and summarization.


Initialization

def __init__(
    self,
    llm_invoker: CompletionLLM,
    language: str | None = "English",
    entity_types: list[str] | None = None,
)

Private Method: _chat

@timeout(60 * 20)
def _chat(self, system, history, gen_conf={})
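
The decorator above caps a single LLM call at 60 * 20 seconds, that is, 20 minutes. How the real `timeout` decorator is implemented is not shown in this document. The sketch below is one possible way to build such a decorator with a worker thread; every name in it is an illustrative assumption, not the project's code.

```python
import concurrent.futures
import functools


def timeout(seconds: float):
    """Hypothetical stand-in for a timeout decorator (assumption, not the real one)."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Run the call in a worker thread so the caller can stop waiting.
            pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
            future = pool.submit(func, *args, **kwargs)
            try:
                # Raises concurrent.futures.TimeoutError when the limit is hit.
                return future.result(timeout=seconds)
            finally:
                # Do not block on the worker; it may still be running.
                pool.shutdown(wait=False)

        return wrapper

    return decorator
```

Note that a thread-based timeout only stops the caller from waiting. The underlying call keeps running until it returns on its own.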

Private Method: _entities_and_relations

def _entities_and_relations(self, chunk_key: str, records: list, tuple_delimiter: str)
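
To illustrate what a method with this signature might do, the sketch below splits raw LLM records on a tuple delimiter and sorts them into entities and relationships. The record layout, the field order, and the helper name `parse_records` are assumptions made for this example; they are not taken from extractor.py.

```python
import re


def parse_records(chunk_key: str, records: list[str], tuple_delimiter: str):
    """Sort raw records into entity and relationship dicts.

    Assumed (hypothetical) record layout, with <d> as the tuple delimiter:
      ("entity"<d>name<d>type<d>description)
      ("relationship"<d>source<d>target<d>description<d>weight)
    """
    nodes: dict[str, list[dict]] = {}
    edges: dict[tuple[str, str], list[dict]] = {}
    for record in records:
        # Drop the wrapping parentheses if present, then split into fields.
        match = re.search(r"\((.*)\)", record, re.DOTALL)
        body = match.group(1) if match else record
        fields = [f.strip().strip('"') for f in body.split(tuple_delimiter)]

        if fields[0] == "entity" and len(fields) >= 4:
            name = fields[1].upper()
            nodes.setdefault(name, []).append({
                "entity_name": name,
                "entity_type": fields[2].upper(),
                "description": fields[3],
                "source_id": chunk_key,
            })
        elif fields[0] == "relationship" and len(fields) >= 5:
            # Sort endpoints so (A, B) and (B, A) share one undirected key.
            src, tgt = sorted((fields[1].upper(), fields[2].upper()))
            try:
                weight = float(fields[4])
            except ValueError:
                weight = 1.0  # fall back when the weight is not numeric
            edges.setdefault((src, tgt), []).append({
                "src_id": src,
                "tgt_id": tgt,
                "description": fields[3],
                "weight": weight,
                "source_id": chunk_key,
            })
        # Records that match neither shape are skipped.
    return nodes, edges
```

Keying nodes by name and edges by endpoint pair groups repeated mentions together, which is the input shape a later merge step needs.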

Asynchronous Callable: __call__

async def __call__(self, doc_id: str, chunks: list[str], callback: Callable | None = None)

Private Async Method: _merge_nodes

async def _merge_nodes(self, entity_name: str, entities: list[dict], all_relationships_data)
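
One plausible way to merge several mentions of the same entity is to keep the most frequent type, deduplicate the descriptions, and collect the source chunk ids. The sketch below shows that idea only. The separator `SEP`, the dictionary keys, and the helper name `merge_entity_mentions` are assumptions, and the real method is asynchronous and may also call the summarizer.

```python
from collections import Counter

SEP = "<SEP>"  # hypothetical separator between description fragments


def merge_entity_mentions(entity_name: str, entities: list[dict]) -> dict:
    """Collapse repeated mentions of one entity into a single node dict."""
    # Majority vote on the entity type across all mentions.
    type_counts = Counter(e["entity_type"] for e in entities)
    entity_type = type_counts.most_common(1)[0][0]

    # Deduplicate descriptions and source ids; sort for deterministic output.
    descriptions = sorted({e["description"] for e in entities})
    source_ids = sorted({e["source_id"] for e in entities})

    return {
        "entity_name": entity_name,
        "entity_type": entity_type,
        "description": SEP.join(descriptions),
        "source_id": source_ids,
    }
```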

Private Async Method: _merge_edges

async def _merge_edges(self, src_id: str, tgt_id: str, edges_data: list[dict], all_relationships_data=None)
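
Edge merging can follow the same pattern as node merging. The sketch below sums the weights of repeated mentions, deduplicates descriptions and keywords, and collects source ids. The field names (`weight`, `keywords`, `source_id`), the separator, and the helper name `merge_edge_mentions` are illustrative assumptions rather than the file's actual data model.

```python
SEP = "<SEP>"  # hypothetical separator between description fragments


def merge_edge_mentions(src_id: str, tgt_id: str, edges_data: list[dict]) -> dict:
    """Collapse repeated mentions of one relationship into a single edge dict."""
    # Repeated mentions reinforce the relationship, so weights are added up.
    weight = sum(e.get("weight", 1.0) for e in edges_data)

    # Deduplicate text fields; sort for deterministic output.
    descriptions = sorted({e["description"] for e in edges_data})
    keywords = sorted({k for e in edges_data for k in e.get("keywords", [])})
    source_ids = sorted({e["source_id"] for e in edges_data})

    return {
        "src_id": src_id,
        "tgt_id": tgt_id,
        "weight": weight,
        "description": SEP.join(descriptions),
        "keywords": keywords,
        "source_id": source_ids,
    }
```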

Private Async Method: _merge_graph_nodes

async def _merge_graph_nodes(self, graph: nx.Graph, nodes: list[str], change: GraphChange)

Private Async Method: _handle_entity_relation_summary

async def _handle_entity_relation_summary(self, entity_or_relation_name: str, description: str) -> str
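
A common design for this kind of method is to call the LLM only when the merged description has grown long, and to return short descriptions unchanged. The sketch below assumes a fragment-count threshold and an injected `summarize` coroutine. Both are hypothetical, as are the separator and the helper name `summarize_if_long`.

```python
import asyncio
from typing import Awaitable, Callable

SEP = "<SEP>"  # hypothetical separator between description fragments


async def summarize_if_long(
    name: str,
    description: str,
    summarize: Callable[[str, list[str]], Awaitable[str]],
    max_fragments: int = 3,
) -> str:
    """Return the description as-is when short, else an LLM summary of it."""
    fragments = [f for f in description.split(SEP) if f]
    if len(fragments) <= max_fragments:
        # Short enough: skip the LLM round trip.
        return description
    # Too many fragments: delegate to the (assumed) LLM-backed summarizer.
    return await summarize(name, fragments)
```

Skipping the LLM call for short descriptions avoids paying latency and token cost where summarization would change little.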

Important Implementation Details and Algorithms


Interaction with Other Parts of the System


Visual Diagram: Extractor Class Structure

classDiagram
    class Extractor {
        -_llm: CompletionLLM
        -_language: str | None
        -_entity_types: list[str]
        +__init__(llm_invoker, language="English", entity_types=None)
        +__call__(doc_id: str, chunks: list[str], callback: Callable | None = None)
        -_chat(system, history, gen_conf={})
        -_entities_and_relations(chunk_key: str, records: list, tuple_delimiter: str)
        -_merge_nodes(entity_name: str, entities: list[dict], all_relationships_data)
        -_merge_edges(src_id: str, tgt_id: str, edges_data: list[dict], all_relationships_data=None)
        -_merge_graph_nodes(graph: nx.Graph, nodes: list[str], change: GraphChange)
        -_handle_entity_relation_summary(entity_or_relation_name: str, description: str) -> str
    }

Summary

The extractor.py module is an asynchronous entity and relationship extraction component that uses LLMs for understanding and summarization. It processes text chunks concurrently, merges overlapping entity and relationship data, and produces graph nodes and edges ready for integration into graph-based knowledge systems. Its design combines concurrency, caching, and error resilience, which makes it a critical part of the InfiniFlow pipeline.