search.py

Overview

search.py implements a knowledge graph search utility within the InfiniFlow system, providing semantic retrieval capabilities over entities and relations stored in a knowledge base (KB). It extends a generic search Dealer class to support:

- Query rewriting into keywords and entity types via large language models (LLMs).
- Retrieval of relevant entities and relations based on query text, keywords, and entity types.
- Integration of multi-hop neighborhood information to enrich relation relevance.
- Community report retrieval related to identified entities.
- Caching and error handling in chat interactions with LLMs.

The file's main purpose is to answer user questions by searching knowledge graph content semantically and returning scored entities and relations, together with their descriptions and related community reports.

Classes and Functions

Class: `KGSearch`

KGSearch extends the Dealer class from rag.nlp.search to implement knowledge graph-specific search and retrieval functionalities.

Methods

`_chat(self, llm_bdl, system, history, gen_conf) -> str`

Interact with a language model chat interface with caching support.

Parameters:
- llm_bdl: LLM bundle instance providing .chat() method.
- system: System prompt string.
- history: List of chat messages (dicts) representing conversation history.
- gen_conf: Generation configuration dictionary.
Returns: Generated response string from LLM.
Raises: Exception if response contains an error token "ERROR".
Description:
Checks cache for previous response to the same inputs. If found, returns cached result; otherwise, queries the LLM and caches the output.
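
The check-cache-then-call pattern described above can be sketched as follows. This is a minimal illustration, not the module's code: the in-memory dictionary stands in for get_llm_cache and set_llm_cache, and the names cache_key, cached_chat, and chat_fn are hypothetical.

```python
import hashlib
import json

# Hypothetical in-memory stand-in for get_llm_cache / set_llm_cache.
_CACHE = {}


def cache_key(llm_name, system, history, gen_conf):
    """Build a stable key from every input that influences the response."""
    payload = json.dumps([llm_name, system, history, gen_conf], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_chat(chat_fn, llm_name, system, history, gen_conf):
    """Return a cached response if one exists, otherwise call the LLM once."""
    key = cache_key(llm_name, system, history, gen_conf)
    if key in _CACHE:
        return _CACHE[key]
    response = chat_fn(system, history, gen_conf)
    # Error responses are raised rather than cached, so a retry can succeed.
    if "ERROR" in response:
        raise Exception(response)
    _CACHE[key] = response
    return response
```

Hashing the full input tuple means that any change to the prompt, history, or generation settings produces a new cache entry.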

`query_rewrite(self, llm, question: str, idxnms: list, kb_ids: list) -> (list, list)`

Rewrite a user question into keywords and entity types using an LLM prompt.

Parameters:
- llm: LLM bundle used for chat interaction.
- question: User's natural language question string.
- idxnms: List of index names (typically tenant-specific indices).
- kb_ids: List of knowledge base IDs.
Returns: Tuple (type_keywords, entities_from_query)
- type_keywords: List of answer type keywords extracted by the LLM.
- entities_from_query: List of entities (up to 5) extracted from the query.
Description:
Fetches entity type samples, formats a prompt using PROMPTS["minirag_query2kwd"], sends it to LLM, and parses JSON output with repair fallback on parsing errors.

Usage Example:

type_keywords, entities = kg_search.query_rewrite(llm_bundle, "Who is the CEO of Tesla?", idxnms, kb_ids)
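
The parse-with-fallback step can be sketched using only the standard library. Note the assumptions: the module itself relies on json_repair, which is swapped here for a simple brace-extraction fallback, and the JSON key "answer_type_keywords" and the function name parse_rewrite are hypothetical.

```python
import json


def parse_rewrite(raw, max_entities=5):
    """Parse the LLM reply into (type_keywords, entities_from_query)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Fallback: keep only the outermost {...} span, which drops any
        # chatter the model wrapped around the JSON object.
        start, end = raw.find("{"), raw.rfind("}")
        if start == -1 or end <= start:
            raise ValueError("no JSON object found in LLM output")
        data = json.loads(raw[start:end + 1])
    # Key names are assumptions made for this sketch.
    type_keywords = data.get("answer_type_keywords", [])
    entities = data.get("entities_from_query", [])[:max_entities]
    return type_keywords, entities
```

A reply such as `Sure: {"answer_type_keywords": ["PERSON"], "entities_from_query": ["Tesla"]}` fails the strict parse but succeeds through the fallback.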

`_ent_info_from_(self, es_res: dict, sim_thr: float=0.3) -> dict`

Extract entity information from Elasticsearch (or similar) search results.

Parameters:
- es_res: Raw search result dictionary from the datastore.
- sim_thr: Similarity threshold below which entities are discarded.
Returns: Dictionary mapping entity keywords to info dicts with keys:
- "sim": Similarity score (float)
- "pagerank": PageRank-like importance score (float)
- "n_hop_ents": List of neighboring entities with weights (list)
- "description": Entity description string (JSON)
Description:
Filters out entities below sim_thr, normalizes fields, and parses nested weights.
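
A minimal sketch of the filtering and normalization follows. It assumes the raw result has already been flattened into a list of hit dictionaries; the field names (entity_kwd, _score, rank_flt, n_hop_with_weight, content_with_weight) and the function name ent_info_from_hits are assumptions for illustration.

```python
import json


def ent_info_from_hits(hits, sim_thr=0.3):
    """Map entity name -> info dict, dropping hits below the threshold."""
    res = {}
    for hit in hits:
        sim = float(hit.get("_score", 0))
        if sim < sim_thr:
            continue
        name = hit["entity_kwd"]
        # Normalize: some stores return keyword fields as one-element lists.
        if isinstance(name, list):
            name = name[0]
        res[name] = {
            "sim": sim,
            "pagerank": float(hit.get("rank_flt", 0)),
            # Neighbor paths are assumed to be stored as a JSON string.
            "n_hop_ents": json.loads(hit.get("n_hop_with_weight", "[]")),
            "description": hit.get("content_with_weight", "{}"),
        }
    return res
```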

`_relation_info_from_(self, es_res: dict, sim_thr: float=0.3) -> dict`

Extract relation information from search results.

Parameters:
- es_res: Raw search result dict.
- sim_thr: Similarity threshold.
Returns: Dictionary keyed by tuple (from_entity_kwd, to_entity_kwd) with values:
- "sim": Similarity score.
- "pagerank": Relation weight.
- "description": Relation description string.
Description:
Normalizes entity keywords and filters out low-similarity relations.

`get_relevant_ents_by_keywords(self, keywords: list, filters: dict, idxnms: list, kb_ids: list, emb_mdl, sim_thr=0.3, N=56) -> dict`

Retrieve relevant entities matching input keywords.

Parameters:
- keywords: List of keywords extracted from query.
- filters: Search filters (e.g., KB constraints).
- idxnms: Index names.
- kb_ids: Knowledge base IDs.
- emb_mdl: Embedding model for vector search.
- sim_thr: Similarity threshold.
- N: Maximum number of entities to retrieve.
Returns: Dictionary of entity info, in the format produced by _ent_info_from_.
Description:
Performs vector search over entities with dense embeddings based on keyword concatenation.
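
The idea of embedding the concatenated keywords and ranking entities by similarity can be illustrated in plain Python. This sketch assumes cosine similarity, a ", " separator, and an in-memory dictionary of entity vectors; in the module the search is delegated to the datastore, so rank_entities and embed here are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_entities(keywords, entity_vectors, embed, sim_thr=0.3, n=56):
    """Embed the joined keywords once, then keep the top-n similar entities."""
    query_vec = embed(", ".join(keywords))
    scored = [(name, cosine(query_vec, vec)) for name, vec in entity_vectors.items()]
    kept = [(name, score) for name, score in scored if score >= sim_thr]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:n]
```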

`get_relevant_relations_by_txt(self, txt: str, filters: dict, idxnms: list, kb_ids: list, emb_mdl, sim_thr=0.3, N=56) -> dict`

Retrieve relevant relations based on input text.

Parameters: Same as get_relevant_ents_by_keywords, except that the first argument is the query text (txt) rather than a keyword list.
Returns: Dictionary of relation info, in the format produced by _relation_info_from_.

`get_relevant_ents_by_types(self, types: list, filters: dict, idxnms: list, kb_ids: list, N=56) -> dict`

Retrieve relevant entities filtered by entity types.

Parameters:
- types: List of entity type keywords.
- filters, idxnms, kb_ids, N: Same as in get_relevant_ents_by_keywords.
Returns: Dictionary of entity info.
Description:
Uses filters to search for entities of given types, ordered by rank.

`retrieval(self, question: str, tenant_ids: str | list[str], kb_ids: list[str], emb_mdl, llm, max_token: int=8196, ent_topn: int=6, rel_topn: int=6, comm_topn: int=1, ent_sim_threshold: float=0.3, rel_sim_threshold: float=0.3, **kwargs) -> dict`

Main method performing full knowledge graph retrieval workflow for a question.

Parameters:
- question: User question string.
- tenant_ids: Tenant ID(s) as string or list of strings.
- kb_ids: List of knowledge base IDs.
- emb_mdl: Embedding model bundle.
- llm: LLM bundle.
- max_token: Max token budget for output content.
- ent_topn: Number of top entities to return.
- rel_topn: Number of top relations to return.
- comm_topn: Number of community reports to return.
- ent_sim_threshold: Entity similarity threshold.
- rel_sim_threshold: Relation similarity threshold.
- **kwargs: Additional optional parameters.
Returns: Dictionary containing:
- chunk_id: Unique ID for the retrieval chunk.
- content_with_weight: Aggregated string with entities, relations, and community reports.
- docnm_kwd: Static descriptor string.
- Other metadata fields (mostly empty or default).
Description:
Performs:
1. Query rewriting to extract types and keywords.
2. Retrieval of entities by keywords and types.
3. Retrieval of relations by textual similarity.
4. Incorporation of multi-hop relation paths.
5. Combining and scoring entities and relations.
6. Formatting results with descriptions and token budget management.
7. Community report retrieval related to entities.

Usage Example:

results = kg_search.retrieval(
    question="What are the applications of graphene?",
    tenant_ids="tenant123",
    kb_ids=["kb001"],
    emb_mdl=embedding_bundle,
    llm=llm_bundle
)
print(results["content_with_weight"])
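
Steps 4 and 5 of the workflow can be illustrated with a small scoring sketch. The exact formula is an assumption: here each edge on an entity's n-hop path receives a distance-decayed share of that entity's similarity, and relations are ranked by boosted similarity times pagerank. The shape of n_hop_ents (a list of dicts with a "path" key) and the function names are likewise hypothetical.

```python
def n_hop_bonus(ents):
    """Accumulate a distance-decayed score for every edge on the n-hop paths."""
    bonus = {}
    for info in ents.values():
        for nbr in info["n_hop_ents"]:
            path = nbr["path"]
            for i in range(len(path) - 1):
                edge = (path[i], path[i + 1])
                # Edges farther from the seed entity contribute less.
                bonus[edge] = bonus.get(edge, 0.0) + info["sim"] / (2 + i)
    return bonus


def rank_relations(rels, bonus, topn=6):
    """Boost relations that lie on n-hop paths, then keep the top scorers."""
    scored = []
    for edge, info in rels.items():
        sim = info["sim"] + bonus.get(edge, 0.0)
        scored.append((edge, sim * info["pagerank"]))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]
```

Under these assumptions, a relation that connects entities already matched by the query can outrank a relation with higher raw text similarity but no graph support.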

`_community_retrieval_(self, entities: list, condition: dict, kb_ids: list, idxnms: list, topn: int, max_token: int) -> str`

Retrieve community reports related to entities.

Parameters:
- entities: List of entity keywords.
- condition: Filter conditions dictionary.
- kb_ids: Knowledge base IDs.
- idxnms: Index names.
- topn: Maximum number of reports.
- max_token: Token budget.
Returns: Concatenated string of community reports formatted with headers.
Description:
Queries the datastore for community reports linked to entities, orders by weight, parses JSON content, and formats human-readable report summaries.
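
The ordering, JSON parsing, and formatting steps might look like the sketch below. The row fields (title, weight, content), the JSON keys (report, evidences), and the header layout are assumptions chosen for illustration, and token budgeting is left out for brevity.

```python
import json


def format_reports(rows, topn=1):
    """Order reports by weight and render the top ones under headers."""
    top = sorted(rows, key=lambda row: row["weight"], reverse=True)[:topn]
    sections = []
    for i, row in enumerate(top, start=1):
        # The stored content is assumed to be a JSON-encoded object.
        body = json.loads(row["content"])
        sections.append(
            f"# {i}. {row['title']}\n"
            f"## Content\n{body['report']}\n"
            f"## Evidences\n{body['evidences']}\n"
        )
    return "\n".join(sections)
```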

Implementation Details

- Caching: The _chat method uses get_llm_cache and set_llm_cache to minimize repeated LLM calls for the same prompts.
- JSON Parsing Robustness: query_rewrite uses json_repair and fallback parsing to handle malformed JSON responses from LLM outputs.
- Similarity Filtering: Both entity and relation retrieval methods use a configurable similarity threshold to prune low-confidence results.
- Multi-hop Relations: The retrieval method enriches relation scores by considering multi-hop neighborhood paths derived from entity data.
- Token Budget Management: Results are truncated to stay within a maximum token count (max_token), ensuring manageable output size.
- Data Store Integration: The class interacts heavily with self.dataStore (inherited from Dealer) to perform searches and fetch fields.
- LLM and Embedding Models: The file uses separate LLM bundles for chat and embedding tasks, enabling semantic search and query rewriting.
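
Token budget management, as described in the list above, can be sketched as a greedy cut-off over already-ranked rows. The whitespace-based token counter is a stand-in assumption; the module's actual tokenizer is not described here, and take_within_budget is a hypothetical name.

```python
def count_tokens(text):
    """Crude stand-in tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def take_within_budget(rows, max_token):
    """Keep ranked rows in order until the next one would exceed the budget."""
    kept = []
    remaining = max_token
    for row in rows:
        cost = count_tokens(row)
        if cost > remaining:
            break
        kept.append(row)
        remaining -= cost
    # The leftover budget can be passed on to the next section of the output.
    return kept, remaining
```

Returning the remaining budget lets entities, relations, and community reports share a single max_token allowance in sequence.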

Interactions with Other Components

- rag.nlp.search.Dealer: Base class providing datastore connectivity and vector search utilities.
- api.utils.get_uuid: Generates unique IDs for retrieval chunks.
- graphrag.query_analyze_prompt.PROMPTS: Provides prompt templates for LLM query rewriting.
- graphrag.utils: Utilities for entity sampling and LLM caching.
- rag.utils.doc_store_conn.OrderByExpr: Used for specifying ordering in datastore queries.
- api.db.services: Used in the CLI section for tenant, knowledge base, and LLM bundle management.
- pandas: Used to format tabular entity and relation data into CSV for output content.
- json_repair: Robust JSON parsing for LLM outputs.
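
To show what the CSV formatting step produces, here is a sketch that uses the standard library csv module in place of pandas so that it runs without extra dependencies. The column names and the to_csv helper are hypothetical.

```python
import csv
import io


def to_csv(rows, columns):
    """Render a list of row dicts as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


entities = [{"Entity": "Tesla", "Score": 0.91, "Description": "EV maker"}]
entity_csv = to_csv(entities, ["Entity", "Score", "Description"])
```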

The file acts as a core semantic search module connecting natural language queries to knowledge graph data and serving enriched contextual results.

Command Line Interface

When run as a script, the file parses CLI arguments for tenant ID, knowledge base ID, and question, then performs a retrieval using KGSearch and prints the result. This enables quick testing or integration in pipelines.
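
A command-line entry point of this kind could be wired up with argparse as sketched below. The flag names are assumptions, since the actual options are not listed here, and the retrieval call is shown only as a comment because it depends on project services.

```python
import argparse


def build_parser():
    """Create a parser for the three inputs the script needs."""
    parser = argparse.ArgumentParser(description="Knowledge graph search test run")
    # Flag names below are hypothetical.
    parser.add_argument("-t", "--tenant_id", required=True, help="Tenant ID")
    parser.add_argument("-d", "--kb_id", required=True, help="Knowledge base ID")
    parser.add_argument("-q", "--question", required=True, help="Question to ask")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    # In the real script, the LLM and embedding bundles would be created here
    # and passed to KGSearch.retrieval(...); that wiring is omitted.
    return {"tenant_id": args.tenant_id, "kb_ids": [args.kb_id], "question": args.question}
```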

Visual Diagram

classDiagram
    class KGSearch {
        +_chat(llm_bdl, system, history, gen_conf) str
        +query_rewrite(llm, question, idxnms, kb_ids) (list, list)
        +_ent_info_from_(es_res, sim_thr=0.3) dict
        +_relation_info_from_(es_res, sim_thr=0.3) dict
        +get_relevant_ents_by_keywords(keywords, filters, idxnms, kb_ids, emb_mdl, sim_thr=0.3, N=56) dict
        +get_relevant_relations_by_txt(txt, filters, idxnms, kb_ids, emb_mdl, sim_thr=0.3, N=56) dict
        +get_relevant_ents_by_types(types, filters, idxnms, kb_ids, N=56) dict
        +retrieval(question, tenant_ids, kb_ids, emb_mdl, llm, max_token=8196, ent_topn=6, rel_topn=6, comm_topn=1, ent_sim_threshold=0.3, rel_sim_threshold=0.3, **kwargs) dict
        +_community_retrieval_(entities, condition, kb_ids, idxnms, topn, max_token) str
    }
    KGSearch --|> Dealer

Summary

The search.py file provides a knowledge graph search engine that uses LLMs for query understanding, embedding models for semantic retrieval, and multi-hop neighborhood information to return relevant entities, relations, and community reports. It serves as a backend component for knowledge-driven question answering within the InfiniFlow system.
