doc_store_conn.py
Overview
doc_store_conn.py defines an abstract interface and related data structures for interacting with document-oriented vector search databases within the InfiniFlow system. It specifies a contract for database connectivity, index management, CRUD (Create, Read, Update, Delete) operations, and search functionalities that support various vector search modalities, including dense vectors, sparse vectors, text, and tensor data.
The file primarily provides:
Abstract base classes representing different types of vector/text matching expressions.
A SparseVector data class for representing sparse vectors used in sparse matching.
An abstract base class, DocStoreConnection, which defines the essential methods any concrete document store connection must implement.
Support for ordering expressions and fusion expressions in search queries.
This modular and extensible design allows different backend database implementations (e.g., Elasticsearch, Milvus, Faiss) to provide their own concrete implementations while adhering to a unified API.
Classes and Data Structures
SparseVector
Description:
Represents a sparse vector with indices and optional values, useful for sparse vector similarity search.
Attributes:
indices: list[int] - The indices of non-zero elements in the sparse vector.
values: list[float] | list[int] | None - Corresponding values at those indices; optional.
Methods:
__post_init__()
Validates that if values is provided, it matches the length of indices.
to_dict_old() → dict
Converts the sparse vector to a dictionary with keys "indices" and optional "values".
to_dict() → dict
Converts the sparse vector to a dictionary mapping stringified indices to their values. Raises ValueError if values is None.
from_dict(d: dict) → SparseVector (static method)
Creates a SparseVector instance from a dictionary containing "indices" and optional "values".
__str__() and __repr__()
String representations for debugging.
Usage Example:
sv = SparseVector(indices=[1, 3, 5], values=[0.5, 0.8, 0.9])
print(sv.to_dict())
# Output: {'1': 0.5, '3': 0.8, '5': 0.9}
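The behaviour listed above can be sketched as a small data class. This is a reconstruction from the description, not the original source: in particular, raising ValueError on a length mismatch in __post_init__ is an assumption, since the text only says the lengths are validated.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SparseVector:
    """Sketch of the data class described above (not the original source)."""

    indices: list[int]
    values: list[float] | list[int] | None = None

    def __post_init__(self):
        # Assumption: a length mismatch is rejected with ValueError.
        if self.values is not None and len(self.indices) != len(self.values):
            raise ValueError("indices and values must have the same length")

    def to_dict_old(self) -> dict:
        # "values" is only emitted when it is present.
        d = {"indices": self.indices}
        if self.values is not None:
            d["values"] = self.values
        return d

    def to_dict(self) -> dict:
        # Maps stringified indices to values; needs values to be set.
        if self.values is None:
            raise ValueError("SparseVector.values is None")
        return {str(i): v for i, v in zip(self.indices, self.values)}

    @staticmethod
    def from_dict(d: dict) -> "SparseVector":
        return SparseVector(d["indices"], d.get("values"))


sv = SparseVector(indices=[1, 3, 5], values=[0.5, 0.8, 0.9])
# Round trip through the "indices"/"values" dictionary form.
restored = SparseVector.from_dict(sv.to_dict_old())
```

Note that from_dict reads the "indices"/"values" layout produced by to_dict_old, not the index-to-value mapping produced by to_dict, so only the former round-trips.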
Abstract Classes for Match Expressions
These classes represent different types of matching expressions used in search queries. They are abstract base classes (ABC) meant to be subclassed or instantiated with the required data for query construction.
MatchTextExpr
Purpose: Represent text-based matching criteria.
Constructor Parameters:
fields: list[str] - Fields to match against.
matching_text: str - Text query string.
topn: int - Number of top results to return.
extra_options: dict - Additional matching options (optional).
MatchDenseExpr
Purpose: Represent dense vector-based matching criteria.
Constructor Parameters:
vector_column_name: str - Name of the vector column in the database.
embedding_data: VEC - Dense vector data (list or numpy array).
embedding_data_type: str - Type of the embedding data (e.g., "float32").
distance_type: str - Distance metric (e.g., "cosine", "l2").
topn: int - Number of top results to return (default 10).
extra_options: dict - Additional options (optional).
MatchSparseExpr
Purpose: Represent sparse vector-based matching criteria.
Constructor Parameters:
vector_column_name: str - Name of the sparse vector column.
sparse_data: SparseVector | dict - Sparse vector data or dictionary.
distance_type: str - Distance metric.
topn: int - Number of top results to return.
opt_params: dict | None - Optional parameters.
MatchTensorExpr
Purpose: Represent tensor-based matching criteria.
Constructor Parameters:
column_name: str - Name of the tensor column.
query_data: VEC - Query tensor data (list or numpy array).
query_data_type: str - Data type of the tensor.
topn: int - Number of top results to return.
extra_option: dict | None - Additional options.
FusionExpr
Purpose: Represent fusion of multiple match results.
Constructor Parameters:
method: str - Fusion method name.
topn: int - Number of top results to return.
fusion_params: dict | None - Additional fusion parameters.
MatchExpr (Type Alias)
A union type representing any of the matching expression classes:
MatchExpr = MatchTextExpr | MatchDenseExpr | MatchSparseExpr | MatchTensorExpr | FusionExpr
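To illustrate how these expression types might be combined, the sketch below builds a hybrid query as a list of match expressions followed by a fusion step. The classes are simplified stand-ins that follow the constructor parameters listed above; the field names, the "weighted_sum" method name, and the fusion_params contents are hypothetical values chosen for illustration.

```python
from __future__ import annotations


class MatchTextExpr:
    def __init__(self, fields: list[str], matching_text: str, topn: int,
                 extra_options: dict | None = None):
        self.fields = fields
        self.matching_text = matching_text
        self.topn = topn
        self.extra_options = extra_options or {}


class MatchDenseExpr:
    def __init__(self, vector_column_name: str, embedding_data: list[float],
                 embedding_data_type: str, distance_type: str, topn: int = 10,
                 extra_options: dict | None = None):
        self.vector_column_name = vector_column_name
        self.embedding_data = embedding_data
        self.embedding_data_type = embedding_data_type
        self.distance_type = distance_type
        self.topn = topn
        self.extra_options = extra_options or {}


class FusionExpr:
    def __init__(self, method: str, topn: int, fusion_params: dict | None = None):
        self.method = method
        self.topn = topn
        self.fusion_params = fusion_params


def describe(expr) -> str:
    """Dispatch on the concrete expression type, as a backend might do."""
    if isinstance(expr, MatchTextExpr):
        return f"text match on {expr.fields} (top {expr.topn})"
    if isinstance(expr, MatchDenseExpr):
        return (f"dense match on {expr.vector_column_name} "
                f"using {expr.distance_type} (top {expr.topn})")
    if isinstance(expr, FusionExpr):
        return f"fusion via {expr.method} (top {expr.topn})"
    raise TypeError(f"unsupported match expression: {type(expr).__name__}")


# Hypothetical hybrid query: keyword match + vector match, then fusion.
match_exprs = [
    MatchTextExpr(["title", "content"], "vector database", topn=20),
    MatchDenseExpr("embedding", [0.1, 0.2, 0.3], "float32", "cosine"),
    FusionExpr("weighted_sum", topn=5, fusion_params={"weights": "0.3,0.7"}),
]
summary = [describe(e) for e in match_exprs]
```

Dispatching on the concrete type is one way a backend could translate each expression into its own query syntax; the actual translation is backend-specific and not covered here.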
OrderByExpr
Purpose: Define sorting order for search results.
Attributes:
fields: list[tuple[str, int]] - List of (field_name, order_flag) tuples, where order_flag is 0 for ascending and 1 for descending.
Methods:
asc(field: str) → OrderByExpr
Adds an ascending order on field.
desc(field: str) → OrderByExpr
Adds a descending order on field.
fields() → list
Returns the current list of order fields.
Usage Example:
order = OrderByExpr().asc("timestamp").desc("relevance")
print(order.fields)
# Output: [('timestamp', 0), ('relevance', 1)]
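The listing above shows both an attribute and a method named fields. A minimal sketch of such a class, assuming the attribute is assigned in the constructor (the original source is not shown here), makes the consequence visible: the instance attribute shadows the method, so order.fields works as in the usage example while order.fields() does not.

```python
class OrderByExpr:
    """Sketch of the fluent ordering helper described above."""

    def __init__(self):
        # Instance attribute; it shadows the fields() method defined below.
        self.fields = []

    def asc(self, field: str) -> "OrderByExpr":
        self.fields.append((field, 0))  # 0 = ascending
        return self                     # returning self enables chaining

    def desc(self, field: str) -> "OrderByExpr":
        self.fields.append((field, 1))  # 1 = descending
        return self

    def fields(self):
        # Unreachable through an instance because of the attribute above.
        return self.fields


order = OrderByExpr().asc("timestamp").desc("relevance")
pairs = order.fields  # attribute access returns the list of tuples

try:
    order.fields()    # calls the list, not the method
    call_error = None
except TypeError as exc:
    call_error = str(exc)
```

Under this assumption, callers should read the attribute directly rather than call the method.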
DocStoreConnection (Abstract Base Class)
Purpose: Defines the interface for all document store database connections. Concrete subclasses must implement these methods to interact with specific backend databases.
Database-level Methods
dbType() -> str
Returns the database type as a string.
health() -> dict
Returns health status information about the database.
Index/Table Management
createIdx(indexName: str, knowledgebaseId: str, vectorSize: int)
Creates an index with the specified name and vector size.
deleteIdx(indexName: str, knowledgebaseId: str)
Deletes the specified index.
indexExist(indexName: str, knowledgebaseId: str) -> bool
Checks if an index exists.
CRUD Operations
search(selectFields: list[str], highlightFields: list[str], condition: dict, matchExprs: list[MatchExpr], orderBy: OrderByExpr, offset: int, limit: int, indexNames: str|list[str], knowledgebaseIds: list[str], aggFields: list[str] = [], rank_feature: dict | None = None)
Performs a search query with filtering, matching expressions, ordering, pagination, and aggregation.
get(chunkId: str, indexName: str, knowledgebaseIds: list[str]) -> dict | None
Retrieves a single document chunk by ID.
insert(rows: list[dict], indexName: str, knowledgebaseId: str = None) -> list[str]
Inserts or updates multiple documents in bulk.
update(condition: dict, newValue: dict, indexName: str, knowledgebaseId: str) -> bool
Updates documents matching the condition.
delete(condition: dict, indexName: str, knowledgebaseId: str) -> int
Deletes documents matching the condition.
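As an illustration of the index and CRUD contract, here is a toy in-memory backend. It is a sketch under several assumptions: the base class is a trimmed stand-in covering only some of the methods above, rows are assumed to carry an "id" key, conditions are treated as exact field matches, and the list returned by insert is treated as per-row error messages (empty on success), which the description above does not specify.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class DocStoreConnection(ABC):
    """Trimmed stand-in for the interface described above."""

    @abstractmethod
    def createIdx(self, indexName: str, knowledgebaseId: str, vectorSize: int): ...

    @abstractmethod
    def indexExist(self, indexName: str, knowledgebaseId: str) -> bool: ...

    @abstractmethod
    def insert(self, rows: list[dict], indexName: str,
               knowledgebaseId: str = None) -> list[str]: ...

    @abstractmethod
    def get(self, chunkId: str, indexName: str,
            knowledgebaseIds: list[str]) -> dict | None: ...

    @abstractmethod
    def update(self, condition: dict, newValue: dict, indexName: str,
               knowledgebaseId: str) -> bool: ...

    @abstractmethod
    def delete(self, condition: dict, indexName: str,
               knowledgebaseId: str) -> int: ...


class InMemoryDocStore(DocStoreConnection):
    """Hypothetical backend that keeps every index in a dict."""

    def __init__(self):
        self._indexes = {}  # index name -> {chunk id -> document}

    def createIdx(self, indexName, knowledgebaseId, vectorSize):
        self._indexes.setdefault(indexName, {})

    def indexExist(self, indexName, knowledgebaseId):
        return indexName in self._indexes

    def insert(self, rows, indexName, knowledgebaseId=None):
        errors = []
        docs = self._indexes.setdefault(indexName, {})
        for row in rows:
            if "id" not in row:
                errors.append(f"row without id: {row}")
                continue
            docs[row["id"]] = dict(row)  # insert, or overwrite an existing id
        return errors

    def get(self, chunkId, indexName, knowledgebaseIds):
        return self._indexes.get(indexName, {}).get(chunkId)

    @staticmethod
    def _matches(doc, condition):
        # Assumption: a condition is a set of exact field/value matches.
        return all(doc.get(k) == v for k, v in condition.items())

    def update(self, condition, newValue, indexName, knowledgebaseId):
        hit = False
        for doc in self._indexes.get(indexName, {}).values():
            if self._matches(doc, condition):
                doc.update(newValue)
                hit = True
        return hit

    def delete(self, condition, indexName, knowledgebaseId):
        docs = self._indexes.get(indexName, {})
        doomed = [cid for cid, doc in docs.items() if self._matches(doc, condition)]
        for cid in doomed:
            del docs[cid]
        return len(doomed)


store = InMemoryDocStore()
store.createIdx("demo_idx", "kb1", vectorSize=3)
errors = store.insert(
    [{"id": "c1", "text": "hello"}, {"id": "c2", "text": "world"}],
    "demo_idx", "kb1",
)
store.update({"id": "c1"}, {"text": "hello again"}, "demo_idx", "kb1")
deleted = store.delete({"text": "world"}, "demo_idx", "kb1")
```

A real backend would translate condition into its own filter syntax; the exact-match semantics here are only a placeholder.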
Search Result Helpers
getTotal(res)
Extracts total number of matching documents from a search response.
getChunkIds(res)
Extracts document chunk IDs from a response.
getFields(res, fields: list[str]) -> dict[str, dict]
Extracts specified fields from search results.
getHighlight(res, keywords: list[str], fieldnm: str)
Extracts highlighted snippets for given keywords in a field.
getAggregation(res, fieldnm: str)
Extracts aggregation results.
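The helpers exist because each backend returns results in its own shape. As a sketch, the functions below implement three of them for an Elasticsearch-style response dictionary. The response layout is an assumption made for this example, and the functions are written as free functions rather than methods to keep it short.

```python
from __future__ import annotations


def getTotal(res: dict) -> int:
    total = res["hits"]["total"]
    # Elasticsearch reports the total as {"value": n, ...} in recent
    # versions and as a bare integer in older ones.
    return total["value"] if isinstance(total, dict) else total


def getChunkIds(res: dict) -> list[str]:
    return [hit["_id"] for hit in res["hits"]["hits"]]


def getFields(res: dict, fields: list[str]) -> dict[str, dict]:
    # Map each chunk id to the requested subset of its stored fields.
    out = {}
    for hit in res["hits"]["hits"]:
        source = hit.get("_source", {})
        out[hit["_id"]] = {f: source[f] for f in fields if f in source}
    return out


# Hypothetical response with two hits.
res = {
    "hits": {
        "total": {"value": 2, "relation": "eq"},
        "hits": [
            {"_id": "c1", "_source": {"title": "Intro", "page": 1}},
            {"_id": "c2", "_source": {"title": "Setup", "page": 4}},
        ],
    }
}
titles = getFields(res, ["title"])
```

Keeping this unpacking behind the interface means callers never touch the backend-specific response layout directly.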
SQL Execution
sql(sql: str, fetch_size: int, format: str)
Executes SQL queries generated by text-to-SQL modules.
Important Implementation Details
The use of abstract base classes (ABC) enforces implementation of all critical methods in subclasses, ensuring consistency across different database backend implementations.
SparseVector supports flexible sparse vector representations and provides serialization/deserialization helpers.
Matching expression classes encapsulate different vector/text search modalities, allowing the search engine to process diverse query types uniformly.
OrderByExpr supports chaining ascending and descending order fields fluently.
The DocStoreConnection interface abstracts database details, allowing the rest of the InfiniFlow system to interact with various document stores interchangeably.
Methods like search support complex queries combining filtering, vector matching, ordering, pagination, and aggregation, reflecting the needs of modern vector search applications.
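The enforcement mentioned in the first point comes from Python's abc module: a subclass that leaves any abstract method unimplemented cannot be instantiated. The sketch below shows this with a two-method stand-in interface; the class names and the returned health payload are hypothetical.

```python
from abc import ABC, abstractmethod


class Connection(ABC):
    """Two-method stand-in for the DocStoreConnection interface."""

    @abstractmethod
    def dbType(self) -> str: ...

    @abstractmethod
    def health(self) -> dict: ...


class Incomplete(Connection):
    # health() is missing, so this class remains abstract.
    def dbType(self) -> str:
        return "incomplete"


class Complete(Connection):
    def dbType(self) -> str:
        return "memory"

    def health(self) -> dict:
        return {"type": self.dbType(), "status": "green"}


try:
    Incomplete()          # rejected at instantiation time
    error = None
except TypeError as exc:
    error = exc

conn = Complete()         # all abstract methods implemented
```

The check happens when the object is created, not when the missing method is first called, so an incomplete backend fails early.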
Interaction with Other System Components
This file acts as the foundational interface layer between the InfiniFlow application logic and the underlying document store databases.
Concrete implementations of DocStoreConnection would interact with specific backend databases (e.g., Elasticsearch, Milvus).
Search expressions (the MatchExpr variants) are probably constructed by higher-level query builders or user inputs and passed to the connection for execution.
The results extracted via helper methods (getTotal, getChunkIds, etc.) are consumed by application services for display or further processing.
The SQL method is intended to support integration with text-to-SQL modules, enabling hybrid search over structured and unstructured data.
Visual Diagram
classDiagram
class SparseVector {
+indices: list[int]
+values: list[float] | list[int] | None
+to_dict_old()
+to_dict()
+from_dict(d: dict)
}
class MatchTextExpr {
+fields: list[str]
+matching_text: str
+topn: int
+extra_options: dict
}
class MatchDenseExpr {
+vector_column_name: str
+embedding_data: VEC
+embedding_data_type: str
+distance_type: str
+topn: int
+extra_options: dict
}
class MatchSparseExpr {
+vector_column_name: str
+sparse_data: SparseVector | dict
+distance_type: str
+topn: int
+opt_params: dict | None
}
class MatchTensorExpr {
+column_name: str
+query_data: VEC
+query_data_type: str
+topn: int
+extra_option: dict | None
}
class FusionExpr {
+method: str
+topn: int
+fusion_params: dict | None
}
class OrderByExpr {
+fields: list[tuple[str,int]]
+asc(field: str)
+desc(field: str)
+fields()
}
class DocStoreConnection {
<<abstract>>
+dbType() str
+health() dict
+createIdx(indexName, knowledgebaseId, vectorSize)
+deleteIdx(indexName, knowledgebaseId)
+indexExist(indexName, knowledgebaseId) bool
+search(...)
+get(chunkId, indexName, knowledgebaseIds) dict|None
+insert(rows, indexName, knowledgebaseId) list[str]
+update(condition, newValue, indexName, knowledgebaseId) bool
+delete(condition, indexName, knowledgebaseId) int
+getTotal(res)
+getChunkIds(res)
+getFields(res, fields)
+getHighlight(res, keywords, fieldnm)
+getAggregation(res, fieldnm)
+sql(sql, fetch_size, format)
}
MatchSparseExpr ..> SparseVector : uses
MatchExpr ..> MatchTextExpr : union member
MatchExpr ..> MatchDenseExpr : union member
MatchExpr ..> MatchSparseExpr : union member
MatchExpr ..> MatchTensorExpr : union member
MatchExpr ..> FusionExpr : union member
Summary
doc_store_conn.py is a critical component in the InfiniFlow architecture, defining an extensible, abstract interface for vector and text search operations on document stores. It supports multiple data types and search modalities while giving database backend implementations a standard API to target. This abstraction keeps backends interchangeable and the rest of the system independent of any one database when building knowledge base search systems.