search.py

Overview

The search.py file is a core utility module in the InfiniFlow system that provides semantic and keyword-based search over document stores. It builds search queries, performs vector similarity matching with embedding models, applies hybrid reranking, and enriches results with metadata such as tags and highlights. The module bridges natural language queries and the underlying document storage, supporting retrieval, ranking, and annotation of relevant text chunks for knowledge base applications.

Detailed Documentation

Functions

`index_name(uid: str) -> str`

Purpose: Constructs a standardized index name string for a given tenant or user ID.
Parameters:
- uid: Unique identifier (e.g., tenant ID).
Returns: A string in the format "ragflow_{uid}".

Example:

index = index_name("tenant123")
print(index)  # Output: ragflow_tenant123
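
Because several methods below accept either one tenant ID or a list of them, it can help to see how the documented name format extends to both cases. The following is a minimal sketch: `index_name` follows the "ragflow_{uid}" format stated above, while `index_names` is a hypothetical helper introduced only for illustration and is not part of the module.

```python
def index_name(uid: str) -> str:
    """Build the index name for one tenant or user ID."""
    return f"ragflow_{uid}"


def index_names(tenant_ids) -> list[str]:
    """Hypothetical helper: accept one ID or a list of IDs."""
    if isinstance(tenant_ids, str):
        tenant_ids = [tenant_ids]
    return [index_name(tid) for tid in tenant_ids]


print(index_names("tenant123"))
print(index_names(["tenant1", "tenant2"]))
```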

Class: `Dealer`

The Dealer class encapsulates the logic to query, rank, and manipulate document chunks stored in a document store. It integrates with embedding models for vector search, supports hybrid similarity scoring, and adds contextual citations to answers.

Constructor: `__init__(self, dataStore: DocStoreConnection)`

Purpose: Initializes a Dealer instance with a connection to the document store.
Parameters:
- dataStore: An instance of DocStoreConnection which abstracts data storage and search operations.

Usage:

data_store = DocStoreConnection(...)
dealer = Dealer(data_store)

Nested Dataclass: `SearchResult`

Purpose: Defines the structure for search results returned by the Dealer.
Fields:
- total (int): Total number of matched documents.
- ids (list of str): List of document chunk IDs.
- query_vector (list of float | None): Embedding vector of the query, if applicable.
- field (dict | None): Dictionary mapping chunk IDs to their fields.
- highlight (dict | None): Highlighted snippets for matched documents.
- aggregation (list | dict | None): Aggregated metadata (e.g., tag counts).
- keywords (list of str | None): Extracted keywords from the query.
- group_docs (list of list | None): Grouped documents, if applicable.

Method: `get_vector(self, txt: str, emb_mdl, topk=10, similarity=0.1) -> MatchDenseExpr`

Purpose: Encodes a query text into an embedding vector and creates a dense match expression for cosine similarity search.
Parameters:
- txt: Raw query text.
- emb_mdl: Embedding model with encode_queries method.
- topk: Number of top matches to retrieve.
- similarity: Similarity threshold (cosine).
Returns: A MatchDenseExpr object representing the vector similarity expression.
Implementation Details:
- The method ensures the encoded vector is one-dimensional.
- Converts vector elements to floats.
- Constructs a vector column name based on vector dimensionality.

Example:

match_expr = dealer.get_vector("What is AI?", embedding_model, topk=5, similarity=0.2)
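
The implementation details listed for get_vector can be sketched without the real classes. In the sketch below, `DenseMatchSketch` is a hypothetical stand-in for MatchDenseExpr, and both its field names and the `q_{dim}_vec` column naming scheme are assumptions made for illustration; only the flattening, float conversion, and dimension-based naming steps come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class DenseMatchSketch:
    """Hypothetical stand-in for MatchDenseExpr; field names are assumptions."""
    column: str
    vector: list
    metric: str
    topk: int
    extra: dict = field(default_factory=dict)


def build_dense_match(raw_vector, topk=10, similarity=0.1) -> DenseMatchSketch:
    # Accept either a flat vector or a single-row batch such as [[...]].
    if raw_vector and isinstance(raw_vector[0], (list, tuple)):
        if len(raw_vector) != 1:
            raise ValueError("expected exactly one query vector")
        raw_vector = raw_vector[0]
    # Convert every element to a plain float.
    vector = [float(v) for v in raw_vector]
    # Assumed naming scheme: the column name encodes the dimensionality.
    column = f"q_{len(vector)}_vec"
    return DenseMatchSketch(column, vector, "cosine", topk, {"similarity": similarity})


expr = build_dense_match([[1, 0.5, 0.25]], topk=5, similarity=0.2)
print(expr.column, expr.vector)
```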

Method: `get_filters(self, req: dict) -> dict`

Purpose: Extracts filter conditions from a search request dictionary.
Parameters:
- req: Request dictionary possibly containing filter keys.
Returns: A dictionary mapping field names to filter values.
Details:
- Supports filters on knowledge base IDs (kb_ids), document IDs (doc_ids), knowledge graph keywords, entity keywords, availability flags, and removal keywords.

Example:

filters = dealer.get_filters({"kb_ids": ["kb1", "kb2"], "entity_kwd": "AI"})
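
As a rough illustration of the filter extraction described above, the sketch below copies recognized keys from the request into a filter dictionary. The mapping of plural request keys to singular field names, and the key names other than kb_ids, doc_ids, and entity_kwd, are assumptions rather than confirmed behavior.

```python
def get_filters_sketch(req: dict) -> dict:
    filters = {}
    # Assumption: plural request keys map to singular store fields.
    for req_key, field_name in (("kb_ids", "kb_id"), ("doc_ids", "doc_id")):
        if req.get(req_key) is not None:
            filters[field_name] = req[req_key]
    # Assumption: the remaining keys pass through under their own names.
    for key in ("knowledge_graph_kwd", "entity_kwd", "available_int", "removed_kwd"):
        if req.get(key) is not None:
            filters[key] = req[key]
    return filters


print(get_filters_sketch({"kb_ids": ["kb1", "kb2"], "entity_kwd": "AI", "question": "ignored"}))
```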

Method: `search(self, req: dict, idx_names: str | list[str], kb_ids: list[str], emb_mdl=None, highlight=False, rank_feature: dict | None = None) -> SearchResult`

Purpose: Performs a complex search over document chunks with optional embedding vector ranking, filtering, highlighting, and aggregation.
Parameters:
- req: Search request dictionary including question, pagination, filters, fields, and similarity parameters.
- idx_names: Index name(s) for searching.
- kb_ids: List of knowledge base IDs to restrict search.
- emb_mdl: Optional embedding model to use vector similarity.
- highlight: Whether to return highlighted matched snippets.
- rank_feature: Optional feature weights for ranking.
Returns: A SearchResult object containing search metadata and results.
Implementation Highlights:
- Parses pagination and field selection.
- Builds filters and ordering expressions.
- If no query text, returns a sorted result.
- If query text is present:
  - Uses full-text matching and optionally vector similarity.
  - Performs fusion of text and vector matches.
  - Adjusts minimum match threshold if no results found.
- Extracts keywords for reranking and highlighting.

Usage Example:

req = {"question": "Explain machine learning", "page": 1, "topk": 10}
result = dealer.search(req, idx_names="ragflow_tenant1", kb_ids=["kb1"], emb_mdl=embedding_model)
print(result.total, result.ids)
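
Two of the implementation highlights above, pagination parsing and the relaxed retry when nothing matches, can be sketched in isolation. The request key `size`, the default page size, and the two threshold values are assumptions chosen for illustration, and `fake_query` is a stub standing in for the document store.

```python
def page_window(req: dict, default_size: int = 30) -> tuple:
    """Translate a 1-based page number into an (offset, limit) pair."""
    page = max(int(req.get("page", 1)), 1)
    size = int(req.get("size", default_size))  # "size" is an assumed key name
    return (page - 1) * size, size


def search_with_relaxation(run_query, question, min_match=0.3, relaxed_min_match=0.1):
    """Run the query once, then retry with a lower threshold if it finds nothing."""
    hits = run_query(question, min_match)
    if not hits:
        hits = run_query(question, relaxed_min_match)
    return hits


def fake_query(question, min_match):
    # Stub store: pretend that only the relaxed threshold matches anything.
    return ["chunk-1"] if min_match <= 0.1 else []


offset, limit = page_window({"page": 3, "size": 10})
hits = search_with_relaxation(fake_query, "Explain machine learning")
print(offset, limit, hits)
```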

Method: `trans2floats(txt: str) -> list[float]`

Purpose: Converts tab-separated string of numbers into a list of floats.
Parameters:
- txt: Tab-separated string like "0.1\t0.2\t0.3".
Returns: List of floats.

Example:

floats = Dealer.trans2floats("0.1\t0.2\t0.3")
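
The conversion itself is small enough to show in full. This is a minimal sketch of the behavior described above, written as a free function rather than the actual method.

```python
def trans2floats(txt: str) -> list:
    """Split a tab-separated string and convert each field to float."""
    return [float(t) for t in txt.split("\t")]


print(trans2floats("0.1\t0.2\t0.3"))
```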

Method: `insert_citations(self, answer: str, chunks: list[str], chunk_v: list[list[float]], embd_mdl, tkweight=0.1, vtweight=0.9) -> tuple[str, set[str]]`

Purpose: Inserts inline citations referencing document chunks into an answer text by measuring semantic similarity between answer pieces and chunk embeddings.
Parameters:
- answer: The answer text string.
- chunks: List of text chunks from documents.
- chunk_v: List of embedding vectors for each chunk.
- embd_mdl: Embedding model to encode answer pieces.
- tkweight: Token similarity weight (default 0.1).
- vtweight: Vector similarity weight (default 0.9).
Returns: Tuple containing the answer with inserted citations and a set of cited chunk IDs.
Implementation Details:
- Splits the answer into sentences and code blocks to avoid inserting citations inside code.
- Encodes answer pieces into vectors.
- Computes hybrid similarity (token + vector) with chunk embeddings.
- Inserts citations where similarity exceeds a threshold.

Example:

cited_answer, cited_ids = dealer.insert_citations(answer_str, chunks, chunk_vectors, embd_model)
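
The steps listed under Implementation Details can be sketched with standard-library code only. Everything below is illustrative: the ` [ID:n]` marker format, the fixed 0.6 threshold, the whitespace token overlap, and the toy encoder are assumptions, and the sketch joins pieces with single spaces rather than preserving the original whitespace.

```python
import math
import re

FENCE = "`" * 3  # Markdown code fence, built indirectly to keep the example readable
CODE_BLOCK = re.compile("(" + re.escape(FENCE) + ".*?" + re.escape(FENCE) + ")", re.DOTALL)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def token_overlap(a: str, b: str) -> float:
    # Crude token similarity: share of the piece's tokens found in the chunk.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta) if ta else 0.0


def split_pieces(answer: str) -> list:
    # Fenced code blocks stay whole; remaining prose is split at sentence ends.
    pieces = []
    for part in CODE_BLOCK.split(answer):
        if part.startswith(FENCE):
            pieces.append(part)
        else:
            pieces.extend(s.strip() for s in re.split(r"(?<=[.!?])\s+", part) if s.strip())
    return pieces


def cite(answer, chunks, chunk_v, encode, tkweight=0.1, vtweight=0.9, threshold=0.6):
    out, cited = [], set()
    for piece in split_pieces(answer):
        if piece.startswith(FENCE) or not chunks:
            out.append(piece)  # never cite inside code
            continue
        vec = encode(piece)
        scores = [tkweight * token_overlap(piece, c) + vtweight * cosine(vec, v)
                  for c, v in zip(chunks, chunk_v)]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            piece += f" [ID:{best}]"
            cited.add(str(best))
        out.append(piece)
    return " ".join(out), cited


def toy_encode(text):
    # Toy encoder used only for this demonstration.
    return [1.0, 0.0] if "cat" in text.lower() else [0.0, 1.0]


answer = "Cats purr. " + FENCE + "print(1)" + FENCE + " Dogs bark."
cited_answer, cited_ids = cite(answer, ["cats purr", "dogs bark"], [[1.0, 0.0], [0.0, 1.0]], toy_encode)
print(cited_answer, cited_ids)
```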

Method: `_rank_feature_scores(self, query_rfea: dict, search_res: SearchResult) -> np.ndarray`

Purpose: Computes ranking feature scores based on tag features and pagerank for search results.
Parameters:
- query_rfea: Dictionary of query feature weights.
- search_res: The SearchResult instance to score.
Returns: Numpy array of combined ranking scores.
Details:
- Combines pagerank field and tag feature similarity scores weighted and normalized.

Usage:

scores = dealer._rank_feature_scores(rank_feature_dict, sres)
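
One way to read "tag feature similarity combined with pagerank" is a cosine-style similarity between query tag weights and each chunk's tag weights, scaled and added to the chunk's pagerank. The sketch below uses plain lists instead of NumPy, and the `tag_weight` factor of 10 is an assumption, not a confirmed constant.

```python
import math


def rank_feature_scores(query_tags: dict, chunk_tags: list, pageranks: list, tag_weight: float = 10.0) -> list:
    q_norm = math.sqrt(sum(w * w for w in query_tags.values()))
    scores = []
    for tags, pagerank in zip(chunk_tags, pageranks):
        # Dot product over the tags the query cares about.
        dot = sum(w * tags.get(t, 0.0) for t, w in query_tags.items())
        t_norm = math.sqrt(sum(v * v for v in tags.values()))
        tag_sim = dot / (q_norm * t_norm) if q_norm and t_norm else 0.0
        scores.append(tag_weight * tag_sim + pagerank)
    return scores


scores = rank_feature_scores({"ai": 1.0}, [{"ai": 2.0}, {"db": 1.0}], [0.0, 0.5])
print(scores)
```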

Method: `rerank(self, sres: SearchResult, query: str, tkweight=0.3, vtweight=0.7, cfield="content_ltks", rank_feature: dict | None = None) -> tuple[np.ndarray, list, list]`

Purpose: Performs hybrid reranking of search results based on token similarity, vector similarity, and optional rank features.
Parameters:
- sres: Search results to rerank.
- query: Original query string.
- tkweight: Weight for token similarity (default 0.3).
- vtweight: Weight for vector similarity (default 0.7).
- cfield: Content field name for tokens (default "content_ltks").
- rank_feature: Optional rank feature dictionary.
Returns: Tuple of overall similarity scores, token similarity list, and vector similarity list.
Implementation:
- Extracts embeddings and tokens from search results.
- Computes hybrid similarity using the internal qryr.hybrid_similarity.
- Combines with rank features.

Example:

sim_scores, tk_sim, vt_sim = dealer.rerank(search_result, "What is AI?")
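
The combination step can be illustrated with precomputed similarities. This is a sketch under the assumption that the final score is the weighted sum of token and vector similarity plus the rank feature score; how the real method derives each similarity is not shown here.

```python
def hybrid_rerank(token_sims, vector_sims, rank_scores=None, tkweight=0.3, vtweight=0.7):
    """Combine per-chunk similarities and return (scores, best-first order)."""
    rank_scores = rank_scores or [0.0] * len(token_sims)
    combined = [tkweight * t + vtweight * v + r
                for t, v, r in zip(token_sims, vector_sims, rank_scores)]
    order = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return combined, order


combined, order = hybrid_rerank([0.5, 1.0], [1.0, 0.0])
print(combined, order)
```

With the default weights, a chunk that matches well on vectors outranks one that matches only on tokens, which is the intent of weighting vector similarity more heavily.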

Method: `rerank_by_model(self, rerank_mdl, sres: SearchResult, query: str, tkweight=0.3, vtweight=0.7, cfield="content_ltks", rank_feature: dict | None = None) -> tuple[np.ndarray, list, list]`

Purpose: Reranks search results using an external reranking model alongside token similarity and rank features.
Parameters:
- rerank_mdl: External model with similarity method.
- Others same as rerank.
Returns: Tuple of combined similarity scores, token similarity, and vector similarity.

Example:

sim_scores, tk_sim, vt_sim = dealer.rerank_by_model(rerank_model, search_result, query)

Method: `hybrid_similarity(self, ans_embd, ins_embd, ans, inst) -> tuple`

Purpose: Proxy method to internal qryr.hybrid_similarity for similarity between answer embedding and instance embeddings.
Parameters:
- ans_embd: List of answer embeddings.
- ins_embd: List of instance embeddings.
- ans: Tokenized answer string.
- inst: Tokenized instance string.
Returns: Similarity scores.

Example:

sim_scores = dealer.hybrid_similarity(ans_embd, ins_embd, ans_tokens, inst_tokens)

Method: `retrieval(self, question: str, embd_mdl, tenant_ids, kb_ids, page, page_size, similarity_threshold=0.2, vector_similarity_weight=0.3, top=1024, doc_ids=None, aggs=True, rerank_mdl=None, highlight=False, rank_feature: dict | None = {PAGERANK_FLD: 10}) -> dict`

Purpose: High-level method to retrieve and rank document chunks in response to a question.
Parameters:
- question: User query string.
- embd_mdl: Embedding model for vector search.
- tenant_ids: Single or list of tenant IDs.
- kb_ids: List of knowledge base IDs.
- page: Current page number (1-based).
- page_size: Number of results per page.
- similarity_threshold: Minimum similarity score to include.
- vector_similarity_weight: Weight for vector similarity in reranking.
- top: Maximum number of results to retrieve.
- doc_ids: Optional list of document IDs to filter.
- aggs: Whether to aggregate by document.
- rerank_mdl: Optional reranking model.
- highlight: Whether to include highlighted snippets.
- rank_feature: Rank feature dictionary.
Returns: Dictionary containing total results, chunk list, and aggregated document stats.
Implementation Details:
- Performs initial search with pagination.
- Applies reranking (model-based or hybrid).
- Filters and paginates final results.
- Constructs chunk metadata and aggregations.

Example:

results = dealer.retrieval("Explain AI", embedding_model, "tenant1", ["kb1"], page=1, page_size=10)
print(results["total"], results["chunks"])
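
The final filtering and pagination step can be sketched as follows. It assumes reranked scores are already available and that the reported total counts only chunks that pass the threshold; both points are assumptions for illustration.

```python
def select_page(chunk_ids, scores, page, page_size, similarity_threshold=0.2):
    # Sort best-first, drop low-scoring chunks, then cut out the requested page.
    ranked = sorted(zip(chunk_ids, scores), key=lambda pair: pair[1], reverse=True)
    kept = [(cid, s) for cid, s in ranked if s >= similarity_threshold]
    start = (page - 1) * page_size
    return {"total": len(kept), "chunks": kept[start:start + page_size]}


page_two = select_page(["a", "b", "c", "d"], [0.9, 0.1, 0.5, 0.3], page=2, page_size=2)
print(page_two)
```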

Method: `sql_retrieval(self, sql: str, fetch_size=128, format="json") -> any`

Purpose: Executes a raw SQL query on the document store.
Parameters:
- sql: SQL query string.
- fetch_size: Number of rows to fetch.
- format: Result format (default "json").
Returns: Query results in specified format.

Example:

table = dealer.sql_retrieval("SELECT * FROM documents LIMIT 10")

Method: `chunk_list(self, doc_id: str, tenant_id: str, kb_ids: list[str], max_count=1024, offset=0, fields=None) -> list[dict]`

Purpose: Retrieves a paginated list of chunks for a specific document.
Parameters:
- doc_id: Document identifier.
- tenant_id: Tenant identifier.
- kb_ids: Knowledge base IDs.
- max_count: Maximum number of chunks to retrieve.
- offset: Offset to start retrieval.
- fields: List of fields to return (default includes docnm_kwd, content_with_weight, img_id).
Returns: List of chunk dictionaries.

Example:

chunks = dealer.chunk_list("doc123", "tenant1", ["kb1"])

Method: `all_tags(self, tenant_id: str, kb_ids: list[str], S=1000) -> list`

Purpose: Retrieves all tags aggregated from the document store for a tenant and knowledge base.
Parameters:
- tenant_id: Tenant identifier.
- kb_ids: Knowledge base IDs.
- S: Smoothing constant (not used here).
Returns: List of tag-frequency tuples.
Details:
- Checks if the index exists before querying.

Example:

tags = dealer.all_tags("tenant1", ["kb1"])

Method: `all_tags_in_portion(self, tenant_id: str, kb_ids: list[str], S=1000) -> dict`

Purpose: Returns tag frequencies normalized as portions (smoothed probabilities).
Parameters: Same as all_tags.
Returns: Dictionary mapping tags to smoothed frequency ratios.

Example:

tag_portions = dealer.all_tags_in_portion("tenant1", ["kb1"])
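
A common way to turn raw counts into smoothed portions is additive smoothing, sketched below. The exact formula, (count + 1) / (total + S), is an assumption about how S is applied, not a confirmed detail of the method.

```python
def tags_in_portion(tag_counts, S=1000):
    """Convert (tag, count) pairs into smoothed frequency ratios."""
    total = sum(count for _, count in tag_counts)
    return {tag: (count + 1) / (total + S) for tag, count in tag_counts}


portions = tags_in_portion([("ai", 99), ("db", 0)], S=100)
print(portions)
```

The smoothing keeps unseen or rare tags from receiving a ratio of exactly zero, which matters when these ratios are later used as divisors.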

Method: `tag_content(self, tenant_id: str, kb_ids: list[str], doc: dict, all_tags: dict, topn_tags=3, keywords_topn=30, S=1000) -> bool`

Purpose: Tags a document's content by extracting relevant tags based on keyword matching and aggregation statistics.
Parameters:
- tenant_id, kb_ids: Tenant and knowledge base IDs.
- doc: Document dictionary to tag (with fields like "title_tks", "content_ltks").
- all_tags: Dictionary of all tags with frequencies.
- topn_tags: Number of tags to assign.
- keywords_topn: Number of keywords to consider.
- S: Smoothing constant.
Returns: True if tagging was successful, False otherwise.
Behavior:
- Generates a matching expression from the document text.
- Queries the store for tag aggregations.
- Computes normalized tag scores and assigns top tags to the document field TAG_FLD.

Example:

success = dealer.tag_content("tenant1", ["kb1"], document, all_tags)
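
The scoring step described under Behavior can be sketched as a ratio between how often a tag appears among matching chunks and how common it is overall. The formula below is an assumption made for illustration; the real method may scale or round scores differently.

```python
def score_tags(matched_counts: dict, all_tags: dict, topn_tags: int = 3, S: int = 1000) -> dict:
    total = sum(matched_counts.values())
    scored = {}
    for tag, count in matched_counts.items():
        local = (count + 1) / (total + S)                 # smoothed share among matches
        prior = max(all_tags.get(tag, 0.0001), 1e-6)      # overall share, floored
        scored[tag] = local / prior
    top = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:topn_tags]
    return dict(top)


top_tags = score_tags({"ai": 9, "db": 9, "misc": 0},
                      {"ai": 0.5, "db": 0.1, "misc": 0.5},
                      topn_tags=2, S=2)
print(top_tags)
```

In this toy input, "db" ranks first because it is as frequent as "ai" among the matches but much rarer overall.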

Method: `tag_query(self, question: str, tenant_ids: str | list[str], kb_ids: list[str], all_tags: dict, topn_tags=3, S=1000) -> dict`

Purpose: Retrieves tag features relevant to a query by analyzing tags aggregated from matching documents.
Parameters:
- question: Query string.
- tenant_ids: Single or multiple tenant IDs.
- kb_ids: Knowledge base IDs.
- all_tags: All tags with frequencies.
- topn_tags: Number of tags to return.
- S: Smoothing constant.
Returns: Dictionary of top tags with adjusted counts.

Example:

query_tags = dealer.tag_query("Explain AI", "tenant1", ["kb1"], all_tags)

Important Implementation Details and Algorithms

- Hybrid Search Strategy: Combines keyword full-text search with dense vector similarity (cosine) to improve relevance.
- Fusion Expression: Uses a weighted sum to combine text-based and vector-based matching scores, with weights typically favoring vector similarity.
- Adaptive Query Relaxation: If the initial search returns no results, the minimum match threshold is lowered and the query is retried.
- Reranking: Supports both internal hybrid similarity reranking and external model-based reranking; each combines token similarity and vector similarity with rank features.
- Citation Insertion: Splits answers into segments, leaving code blocks intact, and uses semantic similarity to insert chunk references inline.
- Tag Aggregation and Scoring: Uses frequency-based smoothing and normalization to assign relevant tags to documents and queries for feature-based ranking.
- Pagination & Efficiency: Search and rerank operations are paginated and capped (for example, via topk and top) to bound the work done per request.
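
The weighted-sum fusion mentioned above can be illustrated on two small candidate sets. This is a sketch only: the 0.05 and 0.95 weights are assumed values that merely reflect "favoring vector similarity", and real fusion happens inside the document store rather than in Python.

```python
def weighted_sum_fusion(text_hits: dict, vector_hits: dict,
                        text_weight: float = 0.05, vector_weight: float = 0.95, topk: int = 10):
    """Merge two {chunk_id: score} maps into one best-first list."""
    fused = {}
    for cid, score in text_hits.items():
        fused[cid] = fused.get(cid, 0.0) + text_weight * score
    for cid, score in vector_hits.items():
        fused[cid] = fused.get(cid, 0.0) + vector_weight * score
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:topk]


fused = weighted_sum_fusion({"a": 1.0, "b": 0.5}, {"b": 0.8, "c": 0.6})
print([cid for cid, _ in fused])
```

A chunk found by both retrievers accumulates both contributions, so it tends to rise above chunks found by only one of them.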

Interactions With Other System Components

- DocStoreConnection: The Dealer relies heavily on this abstraction to perform all underlying database and search engine operations.
- rag_tokenizer and query modules: Used for tokenization, keyword extraction, query preprocessing, and similarity computations.
- Embedding Models
