query.py


Overview

The query.py file is a core utility module in the retrieval-augmented generation (RAG) framework of the InfiniFlow system. It provides the FulltextQueryer class, which builds weighted full-text search queries, computes token-based and hybrid (token plus vector) similarity scores, and preprocesses input text before search. Queries against document stores and knowledge bases are shaped by token weights, synonyms, and language-specific heuristics.

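Key capabilities include escaping query-syntax characters, detecting Chinese input, removing question words and stopwords, building weighted query expressions from questions and paragraphs, and scoring candidates by token and hybrid similarity.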

FulltextQueryer relies on rag_tokenizer for tokenization, term_weight.Dealer for term weighting, and synonym.Dealer for synonym lookup. It outputs MatchTextExpr objects, which are used to query the document store.


Class and Method Documentation

Class: FulltextQueryer

Encapsulates methods and utilities for building full-text search queries and calculating text similarities.

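Properties:

tw: term_weight.Dealer instance used for token weighting.
syn: synonym.Dealer instance used for synonym lookup.
query_fields: list of default query fields with relevance boosts.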


Methods

__init__()

The constructor initializes the token-weighting and synonym dealers and sets the default query fields with their relevance boosts.

Usage:
queryer = FulltextQueryer()

subSpecialChar(line: str) -> str (Static Method)

Escapes special characters in a string that could interfere with full-text query syntax.

Example:
escaped = FulltextQueryer.subSpecialChar("price: $100 (special)")
# Output: "price\: $100 \(special\)"
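
The escaping step can be illustrated with a single regular expression. This is a minimal sketch, not the code in query.py: the function name escape_special_chars and the exact set of escaped characters are assumptions chosen to match the example above.

```python
import re

# Assumed set of characters that carry meaning in query-string syntax.
_SPECIAL = re.compile(r'([:{}/\[\]\-*"()|+~^])')


def escape_special_chars(line: str) -> str:
    """Prefix each special character with a backslash, then trim whitespace."""
    return _SPECIAL.sub(r"\\\1", line).strip()


print(escape_special_chars("price: $100 (special)"))
```

Characters outside the assumed set, such as "$", pass through unchanged.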

isChinese(line: str) -> bool (Static Method)

Determines if a text line is predominantly Chinese based on token content and language heuristics.
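
One plausible form of such a heuristic is to count whitespace-separated tokens that are not purely ASCII letters. The sketch below is illustrative only: the function name looks_chinese, the 0.7 threshold, and the rule that very short inputs default to Chinese are assumptions, not confirmed behavior of isChinese.

```python
import re


def looks_chinese(line: str, threshold: float = 0.7) -> bool:
    """Guess whether a line is predominantly Chinese (hypothetical heuristic)."""
    tokens = [t for t in re.split(r"[ \t]+", line.strip()) if t]
    # Assumption: with three tokens or fewer there is too little evidence,
    # so the line is treated as Chinese by default.
    if len(tokens) <= 3:
        return True
    # Count tokens that are not made up entirely of ASCII letters.
    non_english = sum(1 for t in tokens if not re.fullmatch(r"[a-zA-Z]+", t))
    return non_english / len(tokens) >= threshold
```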


rmWWW(txt: str) -> str (Static Method)

Removes common question words, pronouns, and stopwords from input text to clean it for search.
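
A cleanup of this kind can be sketched with word-boundary patterns. The word lists, the function name strip_question_words, and the fallback to the original text when everything is removed are assumptions for illustration; the patterns in query.py may differ.

```python
import re

# Hypothetical word lists; the module's own lists are not reproduced here.
_QUESTION_WORDS = r"\b(what|who|whom|how|which|where|why|when)('s|'re)?\b"
_STOPWORDS = r"\b(is|are|was|were|do|does|did|the|a|an|of|to|in|on|for|please|i|you|me|my|your)\b"


def strip_question_words(txt: str) -> str:
    """Drop question words and stopwords, then collapse leftover whitespace."""
    cleaned = re.sub(_QUESTION_WORDS, "", txt, flags=re.IGNORECASE)
    cleaned = re.sub(_STOPWORDS, "", cleaned, flags=re.IGNORECASE)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    # Assumption: if nothing survives, keep the original text so the
    # query is never empty.
    return cleaned or txt


print(strip_question_words("What is the capital of France?"))  # capital France?
```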


add_space_between_eng_zh(txt: str) -> str (Static Method)

Inserts spaces between English and Chinese characters to improve tokenization accuracy.


question(txt: str, tbl: str = "qa", min_match: float = 0.6) -> tuple[MatchTextExpr, list[str]]

Generates a MatchTextExpr query expression and keyword list from a natural language question.

Usage:
query_expr, keywords = queryer.question("What is InfiniFlow?")
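
To give a feel for how token weights and synonyms might end up in a query expression, the sketch below assembles a boosted query string from already-weighted terms. Everything here is hypothetical: the helper build_query_string, the "term^weight" boost notation, and the 0.2 down-weighting of synonyms are assumptions, and the real method wraps its result in a MatchTextExpr rather than returning a plain string.

```python
def build_query_string(weighted_terms, synonyms, syn_boost=0.2):
    """Return (query_string, keywords) from (term, weight) pairs.

    weighted_terms: list of (term, weight) tuples, e.g. from a term weighter.
    synonyms: dict mapping a term to a list of alternative phrasings.
    syn_boost: assumed factor by which synonyms are down-weighted.
    """
    clauses, keywords = [], []
    for term, weight in weighted_terms:
        keywords.append(term)
        alts = synonyms.get(term, [])
        if alts:
            keywords.extend(alts)
            alt_clause = " OR ".join(f'"{a}"' for a in alts)
            clauses.append(
                f"({term}^{weight:.4f} OR ({alt_clause})^{weight * syn_boost:.4f})"
            )
        else:
            clauses.append(f"{term}^{weight:.4f}")
    return " ".join(clauses), keywords


query, kws = build_query_string(
    [("infiniflow", 0.7), ("rag", 0.3)],
    {"rag": ["retrieval augmented generation"]},
)
print(query)
```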

hybrid_similarity(avec, bvecs, atks, btkss, tkweight=0.3, vtweight=0.7) -> tuple[np.ndarray, list[float], np.ndarray]

Computes a hybrid similarity score combining vector embedding similarity and token similarity.
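
The weighted combination can be sketched as follows. This is an illustration under assumptions: cosine similarity stands in for the vector similarity, plain Python lists stand in for NumPy arrays, and token similarities are passed in precomputed. The names cosine and hybrid_scores are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def hybrid_scores(query_vec, cand_vecs, token_sims, tkweight=0.3, vtweight=0.7):
    """Blend vector and token similarity per candidate.

    Returns (combined, token_sims, vector_sims), mirroring the three-part
    return value described above.
    """
    vector_sims = [cosine(query_vec, v) for v in cand_vecs]
    combined = [v * vtweight + t * tkweight for v, t in zip(vector_sims, token_sims)]
    return combined, token_sims, vector_sims
```

With the default weights, a candidate that matches the query vector exactly but shares only half the token weight scores roughly 0.7 + 0.15 = 0.85.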


token_similarity(atks, btkss) -> list[float]

Calculates similarity scores between query tokens and multiple candidate token sets.


similarity(qtwt, dtwt) -> float

Computes a normalized similarity score between two token-weight dictionaries or strings.
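
One way to read "normalized similarity" is as the share of the query's total weight that is covered by tokens also present in the candidate. The sketch below assumes that reading; the names weighted_overlap and score_candidates and the small epsilon are hypothetical, and only dictionary inputs are handled (string inputs would first need tokenizing and weighting).

```python
def weighted_overlap(query_weights, doc_weights, eps=1e-9):
    """Fraction of query weight matched by tokens that occur in the document."""
    matched = sum(w for t, w in query_weights.items() if t in doc_weights)
    total = sum(query_weights.values())
    # eps keeps the ratio defined when the query has no weighted tokens.
    return (matched + eps) / (total + eps)


def score_candidates(query_weights, candidates):
    """Score several candidate token-weight dicts against one query."""
    return [weighted_overlap(query_weights, c) for c in candidates]


query = {"rag": 0.6, "engine": 0.4}
docs = [{"rag": 1.0}, {"rag": 0.5, "engine": 0.5}, {"other": 1.0}]
print(score_candidates(query, docs))
```

Under this reading the score is asymmetric: it measures how much of the query is covered, not how much of the document.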


paragraph(content_tks: str, keywords: list = [], keywords_topn=30) -> MatchTextExpr

Constructs a paragraph-level search query using the content tokens and optionally provided keywords.
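
The role of keywords_topn can be illustrated by ranking weighted tokens and keeping only the strongest ones. This is a sketch under assumptions: the helper top_weighted_terms, the quoting of caller-supplied keywords, and the "term^weight" notation are hypothetical, and the real method returns a MatchTextExpr rather than a string.

```python
def top_weighted_terms(weighted_terms, keywords=(), topn=30):
    """Build a query string from given keywords plus the top-N weighted terms."""
    # Caller-supplied keywords are kept as quoted phrases.
    clauses = [f'"{k.strip()}"' for k in keywords]
    # Keep only the topn highest-weighted terms.
    ranked = sorted(weighted_terms, key=lambda tw: tw[1], reverse=True)[:topn]
    clauses.extend(f"{term}^{weight:.4f}" for term, weight in ranked)
    return " ".join(clauses)


print(top_weighted_terms([("a", 0.1), ("b", 0.5), ("c", 0.3)], ["key phrase"], topn=2))
```

The sketch uses an empty tuple as the default for keywords so that no mutable default is shared between calls.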


Important Implementation Details


Interaction with Other System Components

The module sits between raw user queries and the document retrieval system: it turns query text into weighted full-text search expressions that the document store can execute.


Visual Diagram

classDiagram
    class FulltextQueryer {
        -tw: term_weight.Dealer
        -syn: synonym.Dealer
        -query_fields: list
        +__init__()
        +subSpecialChar(line: str) str
        +isChinese(line: str) bool
        +rmWWW(txt: str) str
        +add_space_between_eng_zh(txt: str) str
        +question(txt: str, tbl: str = "qa", min_match: float = 0.6) tuple
        +hybrid_similarity(avec, bvecs, atks, btkss, tkweight=0.3, vtweight=0.7) tuple
        +token_similarity(atks, btkss) list
        +similarity(qtwt, dtwt) float
        +paragraph(content_tks: str, keywords: list = [], keywords_topn=30) MatchTextExpr
    }

Summary

The query.py file provides the FulltextQueryer class, which transforms user queries into weighted full-text search expressions for both Chinese and English, using synonym expansion and token weighting. It also offers hybrid similarity scoring that combines vector and token-based similarities for ranking and retrieval in the InfiniFlow knowledge system.

The module integrates with the tokenization, synonym, and document store query components, connecting natural language input to document retrieval.

