doc_store_conn.py

Overview

doc_store_conn.py defines an abstract interface and related data structures for interacting with document-oriented vector search databases within the InfiniFlow system. It specifies a contract for database connectivity, index management, CRUD (Create, Read, Update, Delete) operations, and search across several modalities: dense vectors, sparse vectors, full text, and tensors.

The file primarily provides:

Abstract base classes representing different types of vector/text matching expressions.
A SparseVector data class for representing sparse vectors used in sparse matching.
An abstract base class DocStoreConnection which defines the essential methods any concrete document store connection must implement.
Support for ordering expressions and fusion expressions in search queries.

This modular and extensible design allows different backend database implementations (e.g., Elasticsearch, Milvus, Faiss) to provide their own concrete implementations while adhering to a unified API.

Classes and Data Structures

`SparseVector`

Description:
Represents a sparse vector with indices and optional values, useful for sparse vector similarity search.

Attributes:

indices: list[int] — The indices of non-zero elements in the sparse vector.
values: list[float] | list[int] | None — Corresponding values at those indices; optional.

Methods:

__post_init__()
Validates that values, when provided, has the same length as indices.
to_dict_old() → dict
Converts the sparse vector to a dictionary with keys "indices" and optional "values".
to_dict() → dict
Converts the sparse vector to a dictionary mapping stringified indices to their values. Raises ValueError if values is None.
from_dict(d: dict) → SparseVector (static method)
Creates a SparseVector instance from a dictionary containing "indices" and optional "values".
__str__() and __repr__()
String representations for debugging.

Usage Example:

sv = SparseVector(indices=[1, 3, 5], values=[0.5, 0.8, 0.9])
print(sv.to_dict())
# Output: {'1': 0.5, '3': 0.8, '5': 0.9}
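The behaviour described above can be sketched as a small dataclass. This is a minimal stand-in written from the description, not the original source; in particular, raising ValueError on a length mismatch is an assumption, since the text only says the lengths are validated.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SparseVector:
    """Illustrative stand-in for the documented SparseVector."""

    indices: list[int]
    values: list[float] | list[int] | None = None

    def __post_init__(self):
        # Assumption: a mismatch is reported with ValueError.
        if self.values is not None and len(self.indices) != len(self.values):
            raise ValueError("indices and values must have the same length")

    def to_dict_old(self) -> dict:
        # "indices" is always present; "values" only when provided.
        d = {"indices": self.indices}
        if self.values is not None:
            d["values"] = self.values
        return d

    def to_dict(self) -> dict:
        # Maps each stringified index to its value.
        if self.values is None:
            raise ValueError("to_dict() requires values")
        return {str(i): v for i, v in zip(self.indices, self.values)}

    @staticmethod
    def from_dict(d: dict) -> SparseVector:
        return SparseVector(d["indices"], d.get("values"))


sv = SparseVector(indices=[1, 3, 5], values=[0.5, 0.8, 0.9])
round_trip = SparseVector.from_dict(sv.to_dict_old())
```

Note that the two dictionary forms are not interchangeable: from_dict reads the "indices"/"values" layout produced by to_dict_old, not the index-to-value mapping produced by to_dict.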

Abstract Classes for Match Expressions

These classes represent the different kinds of matching expression used in search queries. Each one derives from ABC and acts as a plain container for the parameters listed below. No abstract methods are documented for them, so they are instantiated directly with the required data when a query is built.

`MatchTextExpr`

Purpose: Represent text-based matching criteria.

Constructor Parameters:

fields: list[str] — Fields to match against.
matching_text: str — Text query string.
topn: int — Number of top results to return.
extra_options: dict — Additional matching options (optional).

`MatchDenseExpr`

Purpose: Represent dense vector-based matching criteria.

Constructor Parameters:

vector_column_name: str — Name of the vector column in the database.
embedding_data: VEC — Dense vector data (list or numpy array).
embedding_data_type: str — Type of the embedding data (e.g., "float32").
distance_type: str — Distance metric (e.g., "cosine", "l2").
topn: int — Number of top results to return (default 10).
extra_options: dict — Additional options (optional).

`MatchSparseExpr`

Purpose: Represent sparse vector-based matching criteria.

Constructor Parameters:

vector_column_name: str — Name of the sparse vector column.
sparse_data: SparseVector | dict — Sparse vector data or dictionary.
distance_type: str — Distance metric.
topn: int — Number of top results to return.
opt_params: dict | None — Optional parameters.

`MatchTensorExpr`

Purpose: Represent tensor-based matching criteria.

Constructor Parameters:

column_name: str — Name of the tensor column.
query_data: VEC — Query tensor data (list or numpy array).
query_data_type: str — Data type of the tensor.
topn: int — Number of top results to return.
extra_option: dict | None — Additional options.

`FusionExpr`

Purpose: Represent fusion of multiple match results.

Constructor Parameters:

method: str — Fusion method name.
topn: int — Number of top results to return.
fusion_params: dict | None — Additional fusion parameters.

`MatchExpr` (Type Alias)

A union type representing any of the matching expression classes:

MatchExpr = MatchTextExpr | MatchDenseExpr | MatchSparseExpr | MatchTensorExpr | FusionExpr
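To illustrate how these expressions might be combined, the sketch below builds a hybrid query from a text match, a dense match, and a fusion step. The three classes are simplified stand-ins that only store the documented constructor parameters; the field names, the column name "embedding", the fusion method "weighted_sum", and the weights format are all hypothetical values chosen for the example.

```python
from typing import Optional, Union


class MatchTextExpr:
    def __init__(self, fields: list, matching_text: str, topn: int,
                 extra_options: Optional[dict] = None):
        self.fields = fields
        self.matching_text = matching_text
        self.topn = topn
        self.extra_options = extra_options or {}


class MatchDenseExpr:
    def __init__(self, vector_column_name: str, embedding_data: list,
                 embedding_data_type: str, distance_type: str,
                 topn: int = 10, extra_options: Optional[dict] = None):
        self.vector_column_name = vector_column_name
        self.embedding_data = embedding_data
        self.embedding_data_type = embedding_data_type
        self.distance_type = distance_type
        self.topn = topn
        self.extra_options = extra_options or {}


class FusionExpr:
    def __init__(self, method: str, topn: int,
                 fusion_params: Optional[dict] = None):
        self.method = method
        self.topn = topn
        self.fusion_params = fusion_params


# Reduced version of the union alias, covering only the stand-ins above.
MatchExpr = Union[MatchTextExpr, MatchDenseExpr, FusionExpr]


def describe(expr: MatchExpr) -> str:
    """Dispatch on the expression type, as a backend might when translating."""
    if isinstance(expr, MatchTextExpr):
        return f"text match on {expr.fields}, top {expr.topn}"
    if isinstance(expr, MatchDenseExpr):
        return f"dense match on {expr.vector_column_name} ({expr.distance_type})"
    if isinstance(expr, FusionExpr):
        return f"fusion via {expr.method}, top {expr.topn}"
    raise TypeError(f"unsupported match expression: {type(expr).__name__}")


match_exprs = [
    MatchTextExpr(["title", "content"], "vector database", topn=20),
    MatchDenseExpr("embedding", [0.1, 0.2, 0.3], "float32", "cosine", topn=20),
    FusionExpr("weighted_sum", topn=10, fusion_params={"weights": "0.3,0.7"}),
]

for expr in match_exprs:
    print(describe(expr))
```

A list like match_exprs is what the matchExprs parameter of search is typed to accept; how a given backend interprets the fusion step is up to its implementation.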

`OrderByExpr`

Purpose: Define sorting order for search results.

Attributes:

fields: list[tuple[str, int]] — List of (field_name, order_flag) tuples, where order_flag is 0 for ascending and 1 for descending.

Methods:

asc(field: str) → OrderByExpr
Adds an ascending order on field.
desc(field: str) → OrderByExpr
Adds a descending order on field.
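fields() → list
Returns the current list of order fields. Assuming fields is set as an instance attribute, as the Attributes list above suggests, that attribute shadows this method on every instance, so the list is read as order.fields rather than by calling order.fields().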

Usage Example:

order = OrderByExpr().asc("timestamp").desc("relevance")
print(order.fields)
# Output: [('timestamp', 0), ('relevance', 1)]
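The fluent chaining works because asc and desc each return the same object. The sketch below reproduces that pattern with a stand-in class and adds to_sort_spec, a hypothetical helper (not part of the documented API) showing how a backend might turn the 0/1 flags into readable sort directions.

```python
class OrderByExpr:
    """Stand-in mirroring the documented behaviour; not the original source."""

    def __init__(self):
        # List of (field_name, order_flag) tuples: 0 ascending, 1 descending.
        self.fields = []

    def asc(self, field: str) -> "OrderByExpr":
        self.fields.append((field, 0))
        return self  # returning self is what enables chaining

    def desc(self, field: str) -> "OrderByExpr":
        self.fields.append((field, 1))
        return self


def to_sort_spec(order: OrderByExpr) -> list:
    """Hypothetical helper: translate order flags into direction names."""
    return [(name, "desc" if flag == 1 else "asc") for name, flag in order.fields]


order = OrderByExpr().asc("timestamp").desc("relevance")
sort_spec = to_sort_spec(order)
print(sort_spec)
```

Fields are applied in the order they were added, so the first call sets the primary sort key.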

`DocStoreConnection` (Abstract Base Class)

Purpose: Defines the interface for all document store database connections. Concrete subclasses must implement these methods to interact with specific backend databases.

Database-level Methods

dbType() -> str
Returns the database type as a string.
health() -> dict
Returns health status information about the database.

Index/Table Management

createIdx(indexName: str, knowledgebaseId: str, vectorSize: int)
Creates an index with the specified name and vector size.
deleteIdx(indexName: str, knowledgebaseId: str)
Deletes the specified index.
indexExist(indexName: str, knowledgebaseId: str) -> bool
Checks if an index exists.

CRUD Operations

search(selectFields: list[str], highlightFields: list[str], condition: dict, matchExprs: list[MatchExpr], orderBy: OrderByExpr, offset: int, limit: int, indexNames: str|list[str], knowledgebaseIds: list[str], aggFields: list[str] = [], rank_feature: dict | None = None)
Performs a search query with filtering, matching expressions, ordering, pagination, and aggregation.
get(chunkId: str, indexName: str, knowledgebaseIds: list[str]) -> dict | None
Retrieves a single document chunk by ID.
insert(rows: list[dict], indexName: str, knowledgebaseId: str = None) -> list[str]
Inserts or updates multiple documents in bulk.
update(condition: dict, newValue: dict, indexName: str, knowledgebaseId: str) -> bool
Updates documents matching the condition.
delete(condition: dict, indexName: str, knowledgebaseId: str) -> int
Deletes documents matching the condition.
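As a rough illustration of these signatures, the sketch below implements insert, get, update, and delete over an in-memory dictionary. InMemoryStore is a hypothetical toy, not one of the real backends, and several semantics are assumptions the text above does not state: each row carries an "id" key, condition is a dict of exact field matches, insert returns a list of error messages, and update returns True when at least one document matched.

```python
class InMemoryStore:
    """Toy store following the documented CRUD signatures (assumed semantics)."""

    def __init__(self):
        # Maps (indexName, chunk id) to the stored row.
        self._rows = {}

    def insert(self, rows, indexName, knowledgebaseId=None):
        errors = []
        for row in rows:
            if "id" not in row:
                errors.append("row is missing an id")
                continue
            # Re-inserting an existing id overwrites it (insert-or-update).
            self._rows[(indexName, row["id"])] = {**row, "kb_id": knowledgebaseId}
        return errors

    def get(self, chunkId, indexName, knowledgebaseIds):
        row = self._rows.get((indexName, chunkId))
        if row is None or row["kb_id"] not in knowledgebaseIds:
            return None
        return row

    def _matching_keys(self, condition, indexName, knowledgebaseId):
        return [
            key for key, row in self._rows.items()
            if key[0] == indexName
            and row["kb_id"] == knowledgebaseId
            and all(row.get(k) == v for k, v in condition.items())
        ]

    def update(self, condition, newValue, indexName, knowledgebaseId):
        keys = self._matching_keys(condition, indexName, knowledgebaseId)
        for key in keys:
            self._rows[key].update(newValue)
        return bool(keys)

    def delete(self, condition, indexName, knowledgebaseId):
        keys = self._matching_keys(condition, indexName, knowledgebaseId)
        for key in keys:
            del self._rows[key]
        return len(keys)


store = InMemoryStore()
store.insert([{"id": "c1", "text": "alpha"}, {"id": "c2", "text": "beta"}], "idx", "kb1")
store.update({"id": "c1"}, {"text": "gamma"}, "idx", "kb1")
deleted = store.delete({"text": "beta"}, "idx", "kb1")
```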

Search Result Helpers

getTotal(res)
Extracts total number of matching documents from a search response.
getChunkIds(res)
Extracts document chunk IDs from a response.
getFields(res, fields: list[str]) -> dict[str, dict]
Extracts specified fields from search results.
getHighlight(res, keywords: list[str], fieldnm: str)
Extracts highlighted snippets for given keywords in a field.
getAggregation(res, fieldnm: str)
Extracts aggregation results.
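The shape of res depends entirely on the backend. As one possible reading, the sketch below assumes an Elasticsearch-style response dictionary and shows what helpers comparable to getTotal, getChunkIds, and getFields might do with it. The function names are snake_case stand-ins, and the sample response is invented for the example.

```python
def get_total(res: dict) -> int:
    """Total hit count; Elasticsearch 7+ wraps it in {"value": N, ...}."""
    total = res["hits"]["total"]
    return total["value"] if isinstance(total, dict) else total


def get_chunk_ids(res: dict) -> list:
    """IDs of the returned hits, in result order."""
    return [hit["_id"] for hit in res["hits"]["hits"]]


def get_fields(res: dict, fields: list) -> dict:
    """Map each chunk ID to the requested fields that the hit contains."""
    return {
        hit["_id"]: {f: hit["_source"][f] for f in fields if f in hit["_source"]}
        for hit in res["hits"]["hits"]
    }


# Invented sample response in the assumed Elasticsearch-style layout.
sample_res = {
    "hits": {
        "total": {"value": 2, "relation": "eq"},
        "hits": [
            {"_id": "c1", "_source": {"title": "Intro", "page": 1}},
            {"_id": "c2", "_source": {"title": "Setup"}},
        ],
    }
}

print(get_total(sample_res))
print(get_chunk_ids(sample_res))
print(get_fields(sample_res, ["title", "page"]))
```

Keeping this parsing behind helper methods means callers never touch the raw response, which is what lets backends with different response formats sit behind the same interface.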

SQL Execution

sql(sql: str, fetch_size: int, format: str)
Executes SQL queries generated by text-to-SQL modules.

Important Implementation Details

The use of abstract base classes (ABC) enforces implementation of all critical methods in subclasses, ensuring consistency across different database backend implementations.
SparseVector supports flexible sparse vector representations and provides serialization/deserialization helpers.
Matching expression classes encapsulate different vector/text search modalities, allowing the search engine to process diverse query types uniformly.
OrderByExpr supports chaining ascending and descending order fields fluently.
The DocStoreConnection interface abstracts database details, allowing the rest of the InfiniFlow system to interact with various document stores interchangeably.
Methods like search support complex queries combining filtering, vector matching, ordering, pagination, and aggregation, reflecting the needs of modern vector search applications.
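The enforcement mentioned in the first point comes from Python's abc module: a subclass that leaves any abstract method unimplemented cannot be instantiated. The sketch below demonstrates this with MiniConnection, a hypothetical two-method reduction of the interface used only for illustration.

```python
from abc import ABC, abstractmethod


class MiniConnection(ABC):
    """Hypothetical two-method reduction of the DocStoreConnection contract."""

    @abstractmethod
    def dbType(self) -> str:
        ...

    @abstractmethod
    def health(self) -> dict:
        ...


class IncompleteConnection(MiniConnection):
    # health() is missing, so this class remains abstract.
    def dbType(self) -> str:
        return "toy"


class CompleteConnection(MiniConnection):
    def dbType(self) -> str:
        return "toy"

    def health(self) -> dict:
        return {"status": "green"}


try:
    IncompleteConnection()
    instantiated = True
except TypeError:
    # Python refuses to instantiate a class with unimplemented abstract methods.
    instantiated = False

conn = CompleteConnection()
print(instantiated, conn.dbType(), conn.health())
```

The check happens at instantiation time, not at class definition, so a missing method surfaces the first time the backend is constructed rather than when a query happens to call it.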

Interaction with Other System Components

This file acts as the foundational interface layer between the InfiniFlow application logic and the underlying document store databases.
Concrete implementations of DocStoreConnection would interact with specific backend databases (e.g., Elasticsearch, Milvus).
Search expressions (MatchExpr subclasses) are probably constructed by higher-level query builders or user inputs and passed to the connection for execution.
The results extracted via helper methods (getTotal, getChunkIds, etc.) are consumed by application services for display or further processing.
The sql method is intended to support integration with text-to-SQL modules, enabling hybrid search over structured and unstructured data.

Visual Diagram

classDiagram
    class SparseVector {
        +indices: list[int]
        +values: list[float] | list[int] | None
        +to_dict_old()
        +to_dict()
        +from_dict(d: dict)
    }

    class MatchTextExpr {
        +fields: list[str]
        +matching_text: str
        +topn: int
        +extra_options: dict
    }
    class MatchDenseExpr {
        +vector_column_name: str
        +embedding_data: VEC
        +embedding_data_type: str
        +distance_type: str
        +topn: int
        +extra_options: dict
    }
    class MatchSparseExpr {
        +vector_column_name: str
        +sparse_data: SparseVector | dict
        +distance_type: str
        +topn: int
        +opt_params: dict | None
    }
    class MatchTensorExpr {
        +column_name: str
        +query_data: VEC
        +query_data_type: str
        +topn: int
        +extra_option: dict | None
    }
    class FusionExpr {
        +method: str
        +topn: int
        +fusion_params: dict | None
    }

    class OrderByExpr {
        +fields: list[tuple[str,int]]
        +asc(field: str)
        +desc(field: str)
        +fields()
    }

    class DocStoreConnection {
        <<abstract>>
        +dbType() str
        +health() dict
        +createIdx(indexName, knowledgebaseId, vectorSize)
        +deleteIdx(indexName, knowledgebaseId)
        +indexExist(indexName, knowledgebaseId) bool
        +search(...)
        +get(chunkId, indexName, knowledgebaseIds) dict|None
        +insert(rows, indexName, knowledgebaseId) list[str]
        +update(condition, newValue, indexName, knowledgebaseId) bool
        +delete(condition, indexName, knowledgebaseId) int
        +getTotal(res)
        +getChunkIds(res)
        +getFields(res, fields)
        +getHighlight(res, keywords, fieldnm)
        +getAggregation(res, fieldnm)
        +sql(sql, fetch_size, format)
    }

    MatchSparseExpr ..> SparseVector : uses
    MatchExpr ..> MatchTextExpr : union member
    MatchExpr ..> MatchDenseExpr : union member
    MatchExpr ..> MatchSparseExpr : union member
    MatchExpr ..> MatchTensorExpr : union member
    MatchExpr ..> FusionExpr : union member

Summary

doc_store_conn.py defines the abstract interface that InfiniFlow uses for vector and text search over document stores. It covers the match, fusion, and ordering expressions that describe a query, the SparseVector helper, and the DocStoreConnection contract that each backend implements, so application code can target a single API regardless of the database behind it.
