query.py
Overview
The query.py file is a core utility module within the InfiniFlow system's retrieval-augmented generation (RAG) framework. It provides the FulltextQueryer class, which is responsible for constructing sophisticated full-text search queries, computing various types of text similarities, and preprocessing textual inputs to improve search accuracy. These functionalities are primarily used to query document stores and knowledge bases efficiently by leveraging token weights, synonyms, and language-specific heuristics.
Key capabilities include:
Generating weighted full-text queries from natural language questions.
Handling multilingual text with specific support for Chinese and English tokenization nuances.
Computing hybrid similarity scores combining vector embeddings and token-level similarities.
Producing refined paragraph-level search queries with controlled keyword expansion.
Preprocessing input text to remove noise and special characters affecting search quality.
FulltextQueryer interacts closely with other components such as the rag_tokenizer, term_weight.Dealer, and synonym.Dealer modules for tokenization, weighting, and synonym lookup respectively, and outputs MatchTextExpr objects used for querying the document store.
Class and Method Documentation
Class: FulltextQueryer
Encapsulates methods and utilities for building full-text search queries and calculating text similarities.
Properties:
tw (term_weight.Dealer): Instance managing token weighting.
syn (synonym.Dealer): Instance managing synonym retrieval.
query_fields (list[str]): List of document fields with associated boost factors used in full-text queries.
Methods
__init__()
Constructor initializes token weighting and synonym dealers and sets default query fields with relevance boosts.
Usage:
queryer = FulltextQueryer()
subSpecialChar(line: str) -> str (Static Method)
Escapes special characters in a string that could interfere with full-text query syntax.
Parameters:
line: Input string potentially containing special characters.
Returns: Escaped string with special characters prefixed by backslashes.
Example:
escaped = FulltextQueryer.subSpecialChar("price: $100 (special)")
# Output: "price\: $100 \(special\)"
isChinese(line: str) -> bool (Static Method)
Determines if a text line is predominantly Chinese based on token content and language heuristics.
Parameters:
line: Text string to analyze.
Returns: True if the text is mostly Chinese; otherwise, False.
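To illustrate the kind of heuristic described here, the following is a minimal sketch and not the module's actual code. The function name looks_chinese, the short-input rule, and the 0.7 threshold are all assumptions made for this example.

```python
import re

def looks_chinese(line: str, threshold: float = 0.7) -> bool:
    """Hypothetical heuristic: a line counts as Chinese when most tokens are not plain ASCII words."""
    tokens = [t for t in re.split(r"[ \t]+", line.strip()) if t]
    if not tokens:
        return False
    # Assumption: very short inputs carry little signal, so look for CJK characters directly
    if len(tokens) <= 3:
        return any("\u4e00" <= ch <= "\u9fff" for ch in line)
    # Count tokens that are not made up purely of ASCII letters
    non_english = sum(1 for t in tokens if not re.fullmatch(r"[a-zA-Z]+", t))
    return non_english / len(tokens) >= threshold
```

Under these assumptions, a whitespace-separated Chinese sentence is classified as Chinese, while a five-word English sentence is not.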
rmWWW(txt: str) -> str (Static Method)
Removes common question words, pronouns, and stopwords from input text to clean it for search.
Parameters:
txt: Input query string.
Returns: Cleaned string with noise words removed.
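The cleaning step can be pictured with a small regex-based sketch. The word list, the helper name remove_noise_words, and the fallback to the original text when everything is stripped are assumptions for illustration; the patterns used by rmWWW itself are not listed in this document.

```python
import re

# Hypothetical noise-word list; the real method's patterns may differ
NOISE_WORDS = re.compile(
    r"\b(?:what|who|how|which|where|why|is|are|was|were|do|does|the|a|an|of|to|please)\b",
    re.IGNORECASE,
)

def remove_noise_words(txt: str) -> str:
    """Strip common question words and stopwords, then collapse whitespace."""
    cleaned = NOISE_WORDS.sub(" ", txt)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    # Assumption: if nothing is left, keep the original text rather than an empty query
    return cleaned if cleaned else txt

print(remove_noise_words("What is the capital of France?"))  # capital France?
```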
add_space_between_eng_zh(txt: str) -> str (Static Method)
Inserts spaces between English and Chinese characters to improve tokenization accuracy.
Parameters:
txt: Original text string.
Returns: Modified text with spaces added between English and Chinese character boundaries.
question(txt: str, tbl: str = "qa", min_match: float = 0.6) -> tuple[MatchTextExpr, list[str]]
Generates a MatchTextExpr query expression and keyword list from a natural language question.
Parameters:
txt: The question or query string.
tbl: (Unused in current implementation) Intended query target table.
min_match: Minimum should match threshold for query clauses.
Returns: Tuple containing:
MatchTextExpr: The constructed full-text search expression.
keywords: List of extracted keywords and synonyms used in the query.
Details:
The method:
Normalizes and tokenizes input text.
Applies language-specific processing (Chinese or non-Chinese).
Expands query tokens with synonyms and weighted boosts.
Constructs complex boolean and phrase queries to improve recall and precision.
Usage:
query_expr, keywords = queryer.question("What is InfiniFlow?")
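The synonym expansion and weighted boosts described under Details can be sketched as a standalone function. This is an illustrative sketch only: build_weighted_query, the Lucene-style query-string syntax, and the synonym_boost value are assumptions, and the actual query text produced by question() may look different.

```python
def build_weighted_query(token_weights, synonyms, synonym_boost=0.2):
    """Join weighted token clauses, OR-ing each token with its boosted synonym phrases."""
    clauses = []
    for token, weight in token_weights:
        clause = f"({token})^{weight:.4f}"
        phrases = synonyms.get(token, [])
        if phrases:
            # Quote synonyms so multi-word entries match as phrases
            group = " OR ".join(f'"{p}"' for p in phrases)
            clause = f"({clause} OR ({group})^{synonym_boost:.4f})"
        clauses.append(clause)
    return " ".join(clauses)

# Hypothetical weights and synonyms standing in for term_weight and synonym lookups
query = build_weighted_query(
    [("infiniflow", 0.7), ("rag", 0.3)],
    {"rag": ["retrieval augmented generation"]},
)
print(query)
# (infiniflow)^0.7000 ((rag)^0.3000 OR ("retrieval augmented generation")^0.2000)
```

Giving synonyms a smaller boost than the original token keeps exact matches ranked above synonym-only matches while still widening recall.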
hybrid_similarity(avec, bvecs, atks, btkss, tkweight=0.3, vtweight=0.7) -> tuple[np.ndarray, list[float], np.ndarray]
Computes a hybrid similarity score combining vector embedding similarity and token similarity.
Parameters:
avec: Vector embedding of the query.
bvecs: List/array of vector embeddings of candidate documents.
atks: Tokens from the query text.
btkss: List of token lists from candidate documents.
tkweight: Weight factor for token similarity (default 0.3).
vtweight: Weight factor for vector similarity (default 0.7).
Returns: Tuple containing:
Weighted combined similarity scores (numpy array).
Token similarity scores (list of floats).
Vector similarity scores (numpy array).
Algorithm:
Uses cosine similarity for vector embeddings and a custom token similarity metric, then blends them by weighted sum.
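The weighted blend can be written out in a few lines. The sketch below uses plain Python lists instead of numpy arrays to stay dependency-free, and takes precomputed token similarities as input; the names cosine and blend_scores are hypothetical and do not appear in the module.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def blend_scores(query_vec, doc_vecs, token_sims, tkweight=0.3, vtweight=0.7):
    """Return (combined, token, vector) similarities, one entry per candidate document."""
    vector_sims = [cosine(query_vec, d) for d in doc_vecs]
    combined = [v * vtweight + t * tkweight for v, t in zip(vector_sims, token_sims)]
    return combined, token_sims, vector_sims

combined, tok, vec = blend_scores(
    [1.0, 0.0],                 # query embedding
    [[1.0, 0.0], [0.0, 1.0]],   # two candidate embeddings
    [0.5, 1.0],                 # token similarity per candidate
)
# First candidate: 1.0 * 0.7 + 0.5 * 0.3; second candidate: 0.0 * 0.7 + 1.0 * 0.3
```

With the default weights, a candidate whose embedding matches the query outranks one that only matches on tokens.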
token_similarity(atks, btkss) -> list[float]
Calculates similarity scores between query tokens and multiple candidate token sets.
Parameters:
atks: Tokens from the query text (string or list).
btkss: List of token lists from candidate documents.
Returns: List of similarity scores (float).
Details:
Converts tokens into weighted dictionaries and compares them pairwise using similarity().
similarity(qtwt, dtwt) -> float
Computes a normalized similarity score between two token-weight dictionaries or strings.
Parameters:
qtwt: Query token weights (dict or string).
dtwt: Document token weights (dict or string).
Returns: Similarity score as a float.
Algorithm:
Calculates the sum of query weights for tokens found in the document, normalized by the total query weight.
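Read literally, the algorithm above is a weighted overlap ratio. The following sketch implements only that description for dictionary inputs; the name weighted_overlap is hypothetical, and any smoothing or string handling the real method performs is not reproduced here.

```python
def weighted_overlap(query_weights: dict, doc_weights: dict) -> float:
    """Share of total query weight carried by tokens that also occur in the document."""
    total = sum(query_weights.values())
    if total == 0:
        return 0.0
    matched = sum(w for tok, w in query_weights.items() if tok in doc_weights)
    return matched / total

score = weighted_overlap(
    {"rag": 0.6, "query": 0.3, "python": 0.1},
    {"rag": 1.0, "python": 1.0},
)
# "rag" and "python" are found, so the score is about (0.6 + 0.1) / 1.0
```

Note that the measure is asymmetric: only query weights contribute, so a long document is not penalized for containing extra tokens.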
paragraph(content_tks: str, keywords: list = [], keywords_topn=30) -> MatchTextExpr
Constructs a paragraph-level search query using the content tokens and optionally provided keywords.
Parameters:
content_tks: Tokenized content as a string or list.
keywords: Optional list of keyword strings to boost.
keywords_topn: Number of top weighted tokens to consider (default 30).
Returns: A MatchTextExpr representing the paragraph query.
Details:
The method:
Weights tokens and expands keywords with synonyms.
Applies boosting factors to keywords based on weights.
Sets a minimum should match constraint to control precision.
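The role of keywords_topn can be illustrated by ranking tokens by weight and keeping only the strongest ones as boosted terms. The function top_weighted_terms and the quoted-term boost syntax are assumptions for this sketch; how paragraph() formats its clauses and derives its minimum should match value is not specified in this document.

```python
def top_weighted_terms(token_weights, topn=30):
    """Keep the topn highest-weighted tokens and render each as a boosted quoted term."""
    ranked = sorted(token_weights, key=lambda tw: tw[1], reverse=True)[:topn]
    return [f'"{tok}"^{w:.4f}' for tok, w in ranked]

# Hypothetical token weights standing in for term_weight output
terms = top_weighted_terms([("a", 0.1), ("b", 0.5), ("c", 0.4)], topn=2)
print(terms)  # ['"b"^0.5000', '"c"^0.4000']
```

Capping the number of terms keeps long paragraphs from producing oversized queries dominated by low-weight tokens.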
Important Implementation Details
Language Handling:
The class distinguishes between Chinese and non-Chinese queries to apply appropriate tokenization and query construction strategies.
Token Weighting and Synonym Expansion:
Tokens are weighted by importance, and synonyms are incorporated with scaled boosts to improve recall.
Query Construction:
Queries are built as complex boolean expressions combining individual tokens, phrase matches, and synonym groups with weighted boosts.
Similarity Calculations:
Uses cosine similarity for vector embeddings and a custom weighted token similarity metric, combined in a hybrid similarity score.
Regular Expressions and Text Cleaning:
Uses regex extensively to sanitize input text, remove stopwords, and add spaces between English and Chinese characters for better tokenization.
Interaction with Other System Components
rag_tokenizer:
Used for tokenizing input strings, converting traditional to simplified Chinese characters (tradi2simp), and fine-grained tokenization.
term_weight.Dealer:
Provides token weighting functions that assign importance scores to tokens based on frequency or other heuristics.
synonym.Dealer:
Retrieves synonyms for given tokens to expand queries and improve matching.
rag.utils.doc_store_conn.MatchTextExpr:
Represents the structured query expression built by FulltextQueryer for querying the backend document store.
The module forms a critical link between raw user queries and the document retrieval system, enabling semantically rich and weighted full-text search queries.
Visual Diagram
classDiagram
class FulltextQueryer {
-tw: term_weight.Dealer
-syn: synonym.Dealer
-query_fields: list
+__init__()
+subSpecialChar(line: str) str
+isChinese(line: str) bool
+rmWWW(txt: str) str
+add_space_between_eng_zh(txt: str) str
+question(txt: str, tbl: str = "qa", min_match: float = 0.6) tuple
+hybrid_similarity(avec, bvecs, atks, btkss, tkweight=0.3, vtweight=0.7) tuple
+token_similarity(atks, btkss) list
+similarity(qtwt, dtwt) float
+paragraph(content_tks: str, keywords: list = [], keywords_topn=30) MatchTextExpr
}
Summary
The query.py file provides the FulltextQueryer class, a comprehensive utility for transforming user queries into precise, weighted full-text search expressions, supporting both Chinese and English with synonym expansion and token weighting. It also offers hybrid similarity metrics combining semantic vector and token-based similarities, essential for ranking and retrieval in the InfiniFlow knowledge system.
This module integrates tightly with tokenization, synonym, and document store query components, serving as a vital bridge between natural language input and effective document retrieval.