search.py


Overview

search.py is a core utility module in the InfiniFlow system that provides semantic and keyword-based search over document stores. It builds search queries, performs vector similarity matching with embedding models, applies hybrid reranking, and enriches search results with metadata such as tags and highlights. The module connects natural language queries to the underlying document storage, supporting retrieval, ranking, and annotation of relevant text chunks for knowledge base applications.


Detailed Documentation

Functions

Function: index_name(uid: str) -> str


Class: Dealer

The Dealer class encapsulates the logic to query, rank, and manipulate document chunks stored in a document store. It integrates with embedding models for vector search, supports hybrid similarity scoring, and adds contextual citations to answers.


Constructor: __init__(self, dataStore: DocStoreConnection)


Nested Dataclass: SearchResult


Method: get_vector(self, txt: str, emb_mdl, topk=10, similarity=0.1) -> MatchDenseExpr


Method: get_filters(self, req: dict) -> dict


Method: search(self, req: dict, idx_names: str | list[str], kb_ids: list[str], emb_mdl=None, highlight=False, rank_feature: dict | None = None) -> SearchResult


Method: trans2floats(txt: str) -> list[float]
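
The signature suggests a helper that parses a serialized vector back into numbers. The following is a minimal sketch, not the module's confirmed implementation; it assumes the input holds float literals separated by tab characters, and that delimiter is an assumption this document does not state.

```python
def trans2floats(txt: str) -> list[float]:
    """Parse a delimited string of numbers into a list of floats.

    Assumption: values are separated by tab characters, for example
    "0.1<TAB>0.25<TAB>-3". The real delimiter may differ.
    """
    return [float(t) for t in txt.split("\t")]


if __name__ == "__main__":
    print(trans2floats("0.1\t0.25\t-3"))
```

A string that contains a non-numeric field raises ValueError, which is usually preferable to silently dropping a vector component.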


Method: insert_citations(self, answer: str, chunks: list[str], chunk_v: list[list[float]], embd_mdl, tkweight=0.1, vtweight=0.9) -> tuple[str, set[str]]


Method: _rank_feature_scores(self, query_rfea: dict, search_res: SearchResult) -> np.ndarray


Method: rerank(self, sres: SearchResult, query: str, tkweight=0.3, vtweight=0.7, cfield="content_ltks", rank_feature: dict | None = None) -> tuple[np.ndarray, list, list]


Method: rerank_by_model(self, rerank_mdl, sres: SearchResult, query: str, tkweight=0.3, vtweight=0.7, cfield="content_ltks", rank_feature: dict | None = None) -> tuple[np.ndarray, list, list]


Method: hybrid_similarity(self, ans_embd, ins_embd, ans, inst) -> tuple
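
To illustrate what a hybrid score of this kind can look like, here is a minimal sketch that blends a token-overlap similarity with a cosine similarity using the tkweight and vtweight defaults that appear in the rerank signatures. The helper names (cosine_similarities, token_similarities, hybrid_scores) and the token-overlap formula are hypothetical and chosen for illustration; the module's actual scoring may differ.

```python
import numpy as np


def cosine_similarities(query_vec, chunk_vecs):
    """Cosine similarity between one query vector and each chunk vector."""
    q = np.asarray(query_vec, dtype=float)
    m = np.asarray(chunk_vecs, dtype=float)
    denom = np.linalg.norm(m, axis=1) * np.linalg.norm(q)
    # Avoid division by zero for all-zero vectors.
    denom = np.where(denom == 0, 1.0, denom)
    return (m @ q) / denom


def token_similarities(query_tokens, chunk_tokens_list):
    """Fraction of distinct query tokens found in each chunk (illustrative)."""
    q = set(query_tokens)
    if not q:
        return np.zeros(len(chunk_tokens_list))
    return np.array([len(q & set(toks)) / len(q) for toks in chunk_tokens_list])


def hybrid_scores(query_vec, chunk_vecs, query_tokens, chunk_tokens_list,
                  tkweight=0.3, vtweight=0.7):
    """Return (combined, token_sim, vector_sim) as NumPy arrays."""
    vtsim = cosine_similarities(query_vec, chunk_vecs)
    tksim = token_similarities(query_tokens, chunk_tokens_list)
    # Assumption: the combined score is a plain weighted sum.
    return tkweight * tksim + vtweight * vtsim, tksim, vtsim


if __name__ == "__main__":
    combined, tksim, vtsim = hybrid_scores(
        [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]],
        ["a", "b"], [["a", "b"], ["c"]],
    )
    print(combined, tksim, vtsim)
```

Returning the two component scores alongside the combined one mirrors the tuple return type in the signature and makes it easy to inspect why a chunk ranked where it did.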


Method: retrieval(self, question: str, embd_mdl, tenant_ids, kb_ids, page, page_size, similarity_threshold=0.2, vector_similarity_weight=0.3, top=1024, doc_ids=None, aggs=True, rerank_mdl=None, highlight=False, rank_feature: dict | None = {PAGERANK_FLD: 10}) -> dict
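
The page, page_size, and similarity_threshold parameters suggest that ranked results are filtered by score and then returned one page at a time. The sketch below shows one way this could work; the function name paginate_by_score is hypothetical, and both the 1-based page numbering and the "greater than or equal to" threshold comparison are assumptions rather than documented behavior.

```python
def paginate_by_score(scores, page, page_size, similarity_threshold=0.2):
    """Return the indices of the chunks that belong on the requested page.

    Assumptions: pages are numbered from 1, and a chunk is kept when its
    score is greater than or equal to similarity_threshold.
    """
    # Rank chunk indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Drop chunks that fall below the threshold.
    kept = [i for i in ranked if scores[i] >= similarity_threshold]
    start = (page - 1) * page_size
    return kept[start:start + page_size]


if __name__ == "__main__":
    scores = [0.9, 0.1, 0.5, 0.3]
    print(paginate_by_score(scores, page=1, page_size=2))
    print(paginate_by_score(scores, page=2, page_size=2))
```

Filtering before slicing keeps page boundaries stable: a page never comes back short because some of its entries were removed after the slice was taken.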


Method: sql_retrieval(self, sql: str, fetch_size=128, format="json") -> Any


Method: chunk_list(self, doc_id: str, tenant_id: str, kb_ids: list[str], max_count=1024, offset=0, fields=None) -> list[dict]


Method: all_tags(self, tenant_id: str, kb_ids: list[str], S=1000) -> list


Method: all_tags_in_portion(self, tenant_id: str, kb_ids: list[str], S=1000) -> dict


Method: tag_content(self, tenant_id: str, kb_ids: list[str], doc: dict, all_tags: dict, topn_tags=3, keywords_topn=30, S=1000) -> bool


Method: tag_query(self, question: str, tenant_ids: str | list[str], kb_ids: list[str], all_tags: dict, topn_tags=3, S=1000) -> dict


Important Implementation Details and Algorithms


Interactions With Other System Components