benchmark.py
Overview
benchmark.py is a utility module within the InfiniFlow project designed to perform benchmarking and evaluation of retrieval-based natural language processing models against standard datasets. It supports indexing and evaluating models on popular retrieval benchmarks such as MS MARCO v1.1, Trivia QA, and MIRACL, leveraging vector embeddings and similarity-based search.
The core functionality involves:
Loading dataset documents and queries.
Tokenizing and embedding documents using a configured embedding model.
Indexing embedded documents into a vector search index.
Executing retrieval queries to generate ranked lists of documents.
Evaluating retrieval quality using metrics like NDCG@10, MAP@5, and MRR@10.
Saving detailed benchmarking results for analysis.
This file interacts with several components of the InfiniFlow system, including:
KnowledgebaseService for fetching knowledgebase configurations.
LLMBundle for embedding model access.
settings.docStoreConn for vector index management and document insertion.
An external retrieval service (settings.retrievaler) for searching indexed documents.
The ranx library for evaluation metrics computation.
Classes and Methods
Class: Benchmark
Main class orchestrating the benchmarking process over different datasets.
__init__(self, kb_id)
Initializes a benchmark instance for a specific knowledgebase.
Parameters:
kb_id (str): Identifier of the knowledgebase to use for embeddings and indexing.
Behavior:
Fetches knowledgebase details and settings.
Creates an embedding model instance (LLMBundle) configured for the knowledgebase.
Initializes internal state variables like tenant_id, index_name, and initialized_index.
_get_retrieval(self, qrels) -> dict
Performs retrieval queries on the indexed documents and prepares a run dictionary for evaluation.
Parameters:
qrels (dict): A dictionary mapping queries to relevant document IDs and their relevance scores.
Returns:
run (defaultdict(dict)): Mapping from queries to retrieved document IDs and similarity scores.
Notes:
Waits 20 seconds before querying to allow index readiness.
Uses settings.retrievaler.retrieval to perform vector similarity search.
Removes queries with no retrieval results from qrels.
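The retrieval loop described above can be sketched as follows. This is a minimal illustration, not the module's code: `build_run` and the `search` callable are hypothetical names standing in for the call to settings.retrievaler.retrieval, whose real signature is not shown here.

```python
import time
from collections import defaultdict


def build_run(qrels, search, wait_seconds=0):
    """Query every key in `qrels` and collect the ranked results.

    `search` is a hypothetical callable: query -> list of (doc_id, score).
    """
    # The module is described as waiting 20 seconds here for index readiness.
    time.sleep(wait_seconds)
    run = defaultdict(dict)
    empty = []
    for query in qrels:
        hits = search(query)
        if not hits:
            empty.append(query)
            continue
        for doc_id, score in hits:
            run[query][doc_id] = score
    # Drop queries without results so qrels and run cover the same queries.
    for query in empty:
        del qrels[query]
    return run
```

Pruning qrels keeps the evaluation from comparing ground truth against queries that have no entry in the run.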
embedding(self, docs) -> (list, int)
Generates vector embeddings for a batch of documents and attaches them to the documents.
Parameters:
docs (list of dict): Each document dict must contain a "content_with_weight" field.
Returns:
Tuple of (docs, vector_size):
docs: The input documents augmented with embedding vectors under keys like q_<vec_len>_vec.
vector_size: Dimensionality of the embedding vectors.
Usage:
Called internally before indexing to embed documents.
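As a rough sketch of the key naming described above, the following attaches each vector under a q_<vec_len>_vec key. The `attach_embeddings` name and the `encode` callable are assumptions for illustration; the real method obtains its vectors from LLMBundle.

```python
def attach_embeddings(docs, encode):
    """Attach one vector per document and return (docs, vector_size).

    `encode` is a hypothetical callable: list[str] -> list[list[float]].
    """
    texts = [doc["content_with_weight"] for doc in docs]
    vectors = encode(texts)
    if len(vectors) != len(docs):
        raise ValueError("encoder returned a different number of vectors")
    vector_size = 0
    for doc, vec in zip(docs, vectors):
        vector_size = len(vec)
        # The key embeds the dimensionality, e.g. q_3_vec for 3-d vectors.
        doc[f"q_{vector_size}_vec"] = list(vec)
    return docs, vector_size
```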
init_index(self, vector_size: int)
Initializes a new vector index with the specified vector size, deleting any existing index with the same name.
Parameters:
vector_size (int): Dimensionality of the vectors to be indexed.
Notes:
Uses settings.docStoreConn to check existence, delete, and create index.
Ensures index initialization happens only once per benchmark run.
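The delete-then-create behaviour with a one-time guard might look like the sketch below. `InMemoryStore` and its method names are hypothetical stand-ins; they are not the actual settings.docStoreConn API.

```python
class InMemoryStore:
    """Hypothetical stand-in for a document store connection."""

    def __init__(self):
        self.indexes = {}

    def index_exists(self, name):
        return name in self.indexes

    def delete_index(self, name):
        self.indexes.pop(name, None)

    def create_index(self, name, vector_size):
        self.indexes[name] = vector_size


class IndexInitializer:
    def __init__(self, store, index_name):
        self.store = store
        self.index_name = index_name
        self.initialized_index = False

    def init_index(self, vector_size):
        # Guard: only the first call in a run touches the store.
        if self.initialized_index:
            return
        if self.store.index_exists(self.index_name):
            self.store.delete_index(self.index_name)
        self.store.create_index(self.index_name, vector_size)
        self.initialized_index = True
```

The guard matters because indexing runs in batches: without it, each batch would wipe the documents inserted by the previous one.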
ms_marco_index(self, file_path: str, index_name: str) -> (dict, dict)
Processes MS MARCO v1.1 dataset files to index documents and build qrels.
Parameters:
file_path (str): Directory containing .parquet dataset files.
index_name (str): Name for the index to create.
Returns:
Tuple (qrels, texts):
qrels: Query-to-document relevance mappings.
texts: Document ID to original text mappings.
Implementation Details:
Reads .parquet files containing query and passage data.
For each passage, tokenizes and prepares document dicts.
Embeds and batches documents into the index in chunks.
Limits total processed documents by global max_docs.
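The batching and document cap can be illustrated with a small sketch. `index_in_batches` and the `insert` callable are hypothetical; here `max_docs` is passed as a parameter rather than read from a global, and the batch size of 32 follows the figure given elsewhere in this document.

```python
def index_in_batches(passages, insert, max_docs, batch_size=32):
    """Feed passages to `insert` in fixed-size batches, stopping at max_docs.

    `insert` is a hypothetical callable that embeds and stores one batch.
    Returns the number of passages processed.
    """
    batch, total = [], 0
    for passage in passages:
        if total >= max_docs:
            break
        batch.append(passage)
        total += 1
        if len(batch) == batch_size:
            insert(batch)
            batch = []
    # Flush the final, possibly partial, batch.
    if batch:
        insert(batch)
    return total
```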
trivia_qa_index(self, file_path: str, index_name: str) -> (dict, dict)
Indexes Trivia QA dataset similarly to MS MARCO but adapted to its schema.
Parameters:
file_path (str): Directory containing .parquet dataset files.
index_name (str): Name for the index.
Returns:
(qrels, texts) as above.
Notes:
Processes question and search results fields.
Uses similar batching, embedding, and indexing as MS MARCO.
miracl_index(self, file_path: str, corpus_path: str, index_name: str) -> (dict, dict)
Indexes MIRACL multilingual dataset.
Parameters:
file_path (str): Path to MIRACL query and relevance files.
corpus_path (str): Path to MIRACL corpus files.
index_name (str): Name for index.
Returns:
(qrels, texts).
Implementation Details:
Loads full corpus from JSON files.
Loads topics and qrels from CSV files.
Tokenizes, embeds, and indexes documents in batches.
Supports multiple languages by directory structure.
save_results(self, qrels: dict, run: dict, texts: dict, dataset: str, file_path: str) -> None
Saves evaluation results to markdown and JSON files for detailed analysis.
Parameters:
qrels (dict): Ground truth relevance judgments.
run (dict): Retrieved ranked documents and scores.
texts (dict): Document ID to text mapping.
dataset (str): Dataset name used in file naming.
file_path (str): Directory to save result files.
Behavior:
Calculates and sorts NDCG@10 per query.
Writes a markdown file with per-query scores and top retrieved texts.
Dumps qrels and run dictionaries into JSON files.
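To make the per-query scoring concrete, here is a plain-Python sketch of NDCG@10 computed per query and then sorted. The module itself relies on ranx for this; the linear-gain formula and the ascending sort order below are assumptions and may differ in detail from what ranx and the module do.

```python
import math


def ndcg_at_k(relevant, retrieved, k=10):
    """relevant: doc_id -> relevance; retrieved: doc_id -> similarity score."""
    ranked = sorted(retrieved, key=retrieved.get, reverse=True)[:k]
    dcg = sum(relevant.get(doc, 0) / math.log2(i + 2)
              for i, doc in enumerate(ranked))
    ideal = sorted(relevant.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def rank_queries(qrels, run, k=10):
    """Return (query, score) pairs sorted from worst to best NDCG@k."""
    scores = [(q, ndcg_at_k(qrels[q], run.get(q, {}), k)) for q in qrels]
    return sorted(scores, key=lambda item: item[1])
```

Listing the weakest queries first puts the cases most worth inspecting at the top of a report.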
__call__(self, dataset: str, file_path: str, miracl_corpus: str = '') -> None
Entry point method to run benchmarking for a specified dataset.
Parameters:
dataset (str): Name of the dataset. Supports "ms_marco_v1.1", "trivia_qa", "miracl".
file_path (str): Path to dataset files.
miracl_corpus (str, optional): Required only for "miracl", path to MIRACL corpus.
Behavior:
For each dataset:
Sets tenant and index names.
Calls corresponding indexing method.
Performs retrieval and evaluation.
Saves results.
For MIRACL, performs per-language benchmarking with directory existence checks.
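The per-language directory check for MIRACL could be sketched as below. The layout assumed here (one subdirectory per language code under both the query path and the corpus path) and the `runnable_languages` helper are illustrative assumptions, not the module's confirmed structure.

```python
import os


def runnable_languages(file_path, corpus_path, languages):
    """Keep only languages whose query and corpus directories both exist.

    Assumes a hypothetical layout: <file_path>/<lang> and <corpus_path>/<lang>.
    """
    runnable = []
    for lang in languages:
        has_queries = os.path.isdir(os.path.join(file_path, lang))
        has_corpus = os.path.isdir(os.path.join(corpus_path, lang))
        if has_queries and has_corpus:
            runnable.append(lang)
    return runnable
```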
Important Implementation Details
Batch Processing: Documents are processed in batches of 32 to optimize embedding and indexing operations.
Index Management: Existing indexes are deleted before creating new ones to maintain a clean state.
Embedding Model: Uses LLMBundle configured from knowledgebase settings for generating document embeddings.
Evaluation Metrics: Uses the ranx library to compute NDCG@10, MAP@5, and MRR@10, standard metrics in information retrieval evaluation.
Dataset Adaptations: Separate methods handle dataset-specific data formats and structures (e.g., parquet for MS MARCO, JSON+CSV for MIRACL).
Global Limit: The max_docs global variable limits the total number of documents indexed to prevent excessive processing.
Tokenization: Uses tokenize from rag.nlp for document preprocessing.
Sleep Delay: A 20-second wait is introduced before retrieval to allow index and search service readiness.
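For readers unfamiliar with the rank-based metrics named above, here is a plain-Python sketch of MRR@10 and average precision at 5 for a single query (MAP@5 is the mean of the latter over all queries). The module computes these with ranx; the normalization used here, dividing by the total number of relevant documents, is one common convention and is an assumption about what ranx does.

```python
def mrr_at_k(relevant, ranked, k=10):
    """Reciprocal rank of the first relevant document within the top k."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if relevant.get(doc, 0) > 0:
            return 1.0 / rank
    return 0.0


def ap_at_k(relevant, ranked, k=5):
    """Average precision over the top k for one query."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked[:k], start=1):
        if relevant.get(doc, 0) > 0:
            hits += 1
            total += hits / rank  # precision at this relevant position
    n_relevant = sum(1 for rel in relevant.values() if rel > 0)
    return total / n_relevant if n_relevant else 0.0
```

Here `relevant` maps document IDs to relevance grades and `ranked` is a list of document IDs ordered by descending score.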
System Interaction
KnowledgebaseService: Retrieves knowledgebase metadata and configurations.
LLMBundle: Provides embedding encoders for document vectors.
settings.docStoreConn: Interface to the document store for index existence checks, creation, deletion, and document insertion.
settings.retrievaler: Performs retrieval queries against the indexed vectors.
ranx (Qrels, Run, evaluate): Evaluates retrieval results against ground truth.
External datasets: Reads from files in various formats (.parquet, .json, .csv) located at user-supplied paths.
Command-Line Interface: Parses CLI arguments to trigger benchmarking runs with specific parameters.
Usage Example
python benchmark.py 1000 my_kb_id ms_marco_v1.1 /path/to/ms_marco_dataset
This command benchmarks the MS MARCO v1.1 dataset, indexing up to 1000 documents, using the knowledgebase with ID my_kb_id. Results will be saved in the dataset directory.
For MIRACL, an additional corpus path is required:
python benchmark.py 500 my_kb_id miracl /path/to/miracl_dataset /path/to/miracl_corpus
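The positional arguments in the two commands above could be parsed along the following lines. This is a sketch inferred from the example invocations only; the argument names and the use of argparse are assumptions, and the real script may define its interface differently.

```python
import argparse


def build_parser():
    """Build a parser matching the example commands (argument names assumed)."""
    parser = argparse.ArgumentParser(description="Retrieval benchmark (sketch)")
    parser.add_argument("max_docs", type=int,
                        help="maximum number of documents to index")
    parser.add_argument("kb_id", help="knowledgebase ID")
    parser.add_argument("dataset",
                        choices=["ms_marco_v1.1", "trivia_qa", "miracl"])
    parser.add_argument("dataset_path", help="path to the dataset files")
    # Optional trailing argument, only meaningful for the miracl dataset.
    parser.add_argument("miracl_corpus", nargs="?", default="",
                        help="path to the MIRACL corpus")
    return parser
```

Making the corpus path an optional trailing positional lets the same parser accept both the four-argument and five-argument forms.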
Mermaid Class Diagram
classDiagram
class Benchmark {
-kb_id: str
-kb: Knowledgebase
-similarity_threshold: float
-vector_similarity_weight: float
-embd_mdl: LLMBundle
-tenant_id: str
-index_name: str
-initialized_index: bool
+__init__(kb_id)
-_get_retrieval(qrels) dict
+embedding(docs) tuple
+init_index(vector_size)
+ms_marco_index(file_path, index_name) tuple
+trivia_qa_index(file_path, index_name) tuple
+miracl_index(file_path, corpus_path, index_name) tuple
+save_results(qrels, run, texts, dataset, file_path)
+__call__(dataset, file_path, miracl_corpus='')
}
Summary
benchmark.py is a benchmarking tool within InfiniFlow that indexes and evaluates retrieval models over MS MARCO v1.1, Trivia QA, and MIRACL using vector embeddings. Dataset-specific methods handle preprocessing and indexing, while evaluation and reporting follow the same path for every dataset. It depends on InfiniFlow’s knowledgebase, embedding models, document store, and retrieval services, which makes it useful for validating retrieval performance against the same components used elsewhere in the system.