doc_store_conn.py


Overview

doc_store_conn.py defines an abstract interface and related data structures for interacting with document-oriented vector search databases within the InfiniFlow system. It specifies a contract for database connectivity, index management, CRUD (Create, Read, Update, Delete) operations, and search across several modalities: dense vectors, sparse vectors, text, and tensor data.

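The file primarily provides:

- SparseVector, a data structure for sparse vector data.
- Match expression classes (MatchTextExpr, MatchDenseExpr, MatchSparseExpr, MatchTensorExpr, FusionExpr) and the MatchExpr type alias that groups them.
- OrderByExpr, a builder for result ordering.
- DocStoreConnection, the abstract base class that every backend implements.

This design lets each backend database (e.g., Elasticsearch, Milvus, Faiss) supply its own concrete implementation while adhering to a unified API.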


Classes and Data Structures

SparseVector

Description:
Represents a sparse vector with indices and optional values, useful for sparse vector similarity search.

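Attributes:

- indices (list[int]): positions of the non-zero entries.
- values (list[float] | list[int] | None): the value at each index; optional.

Methods:

- to_dict(): returns a dictionary mapping each index, as a string, to its value.
- to_dict_old(): an older dictionary serialization retained alongside to_dict().
- from_dict(d: dict): constructs a SparseVector from a dictionary.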

Usage Example:

sv = SparseVector(indices=[1, 3, 5], values=[0.5, 0.8, 0.9])
print(sv.to_dict())
# Output: {'1': 0.5, '3': 0.8, '5': 0.9}

Abstract Classes for Match Expressions

These classes represent the different kinds of matching expressions used in search queries. They derive from ABC and act as data containers: each is instantiated (or subclassed) with the data required for query construction.

MatchTextExpr

Purpose: Represent text-based matching criteria.

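Constructor Parameters:

- fields (list[str]): the fields to search.
- matching_text (str): the query text.
- topn (int): the number of results to return.
- extra_options (dict): additional options for the match.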


MatchDenseExpr

Purpose: Represent dense vector-based matching criteria.

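Constructor Parameters:

- vector_column_name (str): the column holding the dense vectors.
- embedding_data (VEC): the query embedding.
- embedding_data_type (str): the element type of the embedding.
- distance_type (str): the distance metric to use.
- topn (int): the number of results to return.
- extra_options (dict): additional options for the match.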


MatchSparseExpr

Purpose: Represent sparse vector-based matching criteria.

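Constructor Parameters:

- vector_column_name (str): the column holding the sparse vectors.
- sparse_data (SparseVector | dict): the query vector.
- distance_type (str): the distance metric to use.
- topn (int): the number of results to return.
- opt_params (dict | None): optional parameters for the match.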


MatchTensorExpr

Purpose: Represent tensor-based matching criteria.

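Constructor Parameters:

- column_name (str): the column holding the tensor data.
- query_data (VEC): the query tensor.
- query_data_type (str): the element type of the query data.
- topn (int): the number of results to return.
- extra_option (dict | None): optional parameters for the match.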


FusionExpr

Purpose: Represent fusion of multiple match results.

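Constructor Parameters:

- method (str): the name of the fusion method.
- topn (int): the number of results to return after fusion.
- fusion_params (dict | None): optional parameters for the fusion method.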


MatchExpr (Type Alias)

A union type representing any of the matching expression classes:

MatchExpr = MatchTextExpr | MatchDenseExpr | MatchSparseExpr | MatchTensorExpr | FusionExpr

OrderByExpr

Purpose: Define sorting order for search results.

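Attributes:

- fields (list[tuple[str, int]]): ordered (field name, direction) pairs. In the usage example, 0 marks ascending and 1 marks descending order.

Methods:

- asc(field: str): appends an ascending sort on field and returns the expression, so calls can be chained.
- desc(field: str): appends a descending sort on field and returns the expression, so calls can be chained.
- fields(): declared as an accessor for the list of sort fields; note that the usage example reads fields as an attribute.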

Usage Example:

order = OrderByExpr().asc("timestamp").desc("relevance")
print(order.fields)
# Output: [('timestamp', 0), ('relevance', 1)]
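The chaining in the example works because each method returns the expression itself. The following is a minimal sketch of such a builder, written as a stand-in rather than a copy of the module's code. The `sort_rows` function is a hypothetical consumer that applies the ordering to in-memory rows; a real backend would translate the pairs into its own sort clause instead.

```python
class OrderByExpr:
    """Stand-in builder: 0 marks ascending and 1 marks descending."""

    def __init__(self):
        self.fields: list[tuple[str, int]] = []

    def asc(self, field: str) -> "OrderByExpr":
        self.fields.append((field, 0))
        return self  # returning self is what allows chaining

    def desc(self, field: str) -> "OrderByExpr":
        self.fields.append((field, 1))
        return self


def sort_rows(rows: list[dict], order: OrderByExpr) -> list[dict]:
    """Hypothetical consumer: sort rows by every (field, direction) pair."""
    # Apply keys from last to first; Python's sort is stable, so earlier
    # keys take precedence over later ones.
    for name, direction in reversed(order.fields):
        rows = sorted(rows, key=lambda r: r[name], reverse=(direction == 1))
    return rows


order = OrderByExpr().asc("timestamp").desc("relevance")

rows = [
    {"timestamp": 2, "relevance": 0.1},
    {"timestamp": 1, "relevance": 0.5},
    {"timestamp": 2, "relevance": 0.7},
    {"timestamp": 1, "relevance": 0.9},
]
ordered = sort_rows(rows, order)
for row in ordered:
    print(row)
```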

DocStoreConnection (Abstract Base Class)

Purpose: Defines the interface for all document store database connections. Concrete subclasses must implement these methods to interact with specific backend databases.

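Database-level Methods

- dbType() -> str: returns the type of the backend database.
- health() -> dict: returns health information about the database.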

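Index/Table Management

- createIdx(indexName, knowledgebaseId, vectorSize): creates an index for the given knowledge base and vector size.
- deleteIdx(indexName, knowledgebaseId): deletes the index.
- indexExist(indexName, knowledgebaseId) -> bool: reports whether the index exists.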

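CRUD Operations

- search(...): runs a search query.
- get(chunkId, indexName, knowledgebaseIds) -> dict | None: fetches a single chunk by ID, or None if it is not found.
- insert(rows, indexName, knowledgebaseId) -> list[str]: inserts rows into the index.
- update(condition, newValue, indexName, knowledgebaseId) -> bool: updates the rows that match condition.
- delete(condition, indexName, knowledgebaseId) -> int: deletes the rows that match condition and returns an integer count.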

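Search Result Helpers

- getTotal(res): returns the total number of hits in a search result.
- getChunkIds(res): returns the chunk IDs in a search result.
- getFields(res, fields): extracts the requested fields from a search result.
- getHighlight(res, keywords, fieldnm): returns highlighted text for the given keywords and field.
- getAggregation(res, fieldnm): returns aggregation data for the given field.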
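The helper methods take a backend-specific result object, so callers never have to parse it themselves. The sketch below shows three of them as free functions over an assumed result shape, loosely modeled on an Elasticsearch-style response. That shape, the sample data, and the return formats are assumptions made for illustration; each real backend defines its own.

```python
# Assumed result shape (hypothetical): total count plus a list of hits.
res = {
    "hits": {
        "total": {"value": 2},
        "hits": [
            {"_id": "c1", "_source": {"title": "Intro", "content": "alpha", "page": 1}},
            {"_id": "c2", "_source": {"title": "Usage", "content": "beta", "page": 2}},
        ],
    }
}


def getTotal(res) -> int:
    """Return the total number of matching chunks."""
    return res["hits"]["total"]["value"]


def getChunkIds(res) -> list[str]:
    """Return the chunk IDs in result order."""
    return [hit["_id"] for hit in res["hits"]["hits"]]


def getFields(res, fields: list[str]) -> dict[str, dict]:
    """Map each chunk ID to a dict holding only the requested fields."""
    return {
        hit["_id"]: {f: hit["_source"][f] for f in fields if f in hit["_source"]}
        for hit in res["hits"]["hits"]
    }


total = getTotal(res)
chunk_ids = getChunkIds(res)
titles = getFields(res, ["title", "missing_field"])
print(total, chunk_ids, titles)
```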

SQL Execution

- sql(sql, fetch_size, format): executes a SQL statement against the backend.


Important Implementation Details


Interaction with Other System Components


Visual Diagram

classDiagram
    class SparseVector {
        +indices: list[int]
        +values: list[float] | list[int] | None
        +to_dict_old()
        +to_dict()
        +from_dict(d: dict)
    }

    class MatchTextExpr {
        +fields: list[str]
        +matching_text: str
        +topn: int
        +extra_options: dict
    }
    class MatchDenseExpr {
        +vector_column_name: str
        +embedding_data: VEC
        +embedding_data_type: str
        +distance_type: str
        +topn: int
        +extra_options: dict
    }
    class MatchSparseExpr {
        +vector_column_name: str
        +sparse_data: SparseVector | dict
        +distance_type: str
        +topn: int
        +opt_params: dict | None
    }
    class MatchTensorExpr {
        +column_name: str
        +query_data: VEC
        +query_data_type: str
        +topn: int
        +extra_option: dict | None
    }
    class FusionExpr {
        +method: str
        +topn: int
        +fusion_params: dict | None
    }

    class OrderByExpr {
        +fields: list[tuple[str,int]]
        +asc(field: str)
        +desc(field: str)
        +fields()
    }

    class DocStoreConnection {
        <<abstract>>
        +dbType() str
        +health() dict
        +createIdx(indexName, knowledgebaseId, vectorSize)
        +deleteIdx(indexName, knowledgebaseId)
        +indexExist(indexName, knowledgebaseId) bool
        +search(...)
        +get(chunkId, indexName, knowledgebaseIds) dict|None
        +insert(rows, indexName, knowledgebaseId) list[str]
        +update(condition, newValue, indexName, knowledgebaseId) bool
        +delete(condition, indexName, knowledgebaseId) int
        +getTotal(res)
        +getChunkIds(res)
        +getFields(res, fields)
        +getHighlight(res, keywords, fieldnm)
        +getAggregation(res, fieldnm)
        +sql(sql, fetch_size, format)
    }

    MatchSparseExpr ..> SparseVector : uses
    MatchExpr ..> MatchTextExpr : union member
    MatchExpr ..> MatchDenseExpr : union member
    MatchExpr ..> MatchSparseExpr : union member
    MatchExpr ..> MatchTensorExpr : union member
    MatchExpr ..> FusionExpr : union member

Summary

doc_store_conn.py defines the abstract interface through which the InfiniFlow architecture performs vector and text search on document stores. It supports multiple data types and search modalities and gives every database backend a standard API to implement. This abstraction keeps backends interchangeable and leaves the rest of the knowledge base search system independent of any particular database.