test_retrieval_chunks.py

Overview

This file contains automated tests for the retrieval_chunks API endpoint in the InfiniFlow system. The tests cover authorization, parameter handling, pagination, filtering, and concurrent chunk retrieval from datasets or documents.

The tests are written with the pytest framework and rely on parameterized test cases to cover many combinations of inputs and expected outputs. They exercise the chunk retrieval API under different scenarios, including error handling for invalid inputs and concurrent access.

Detailed Explanation

Imported Modules

os: Used to retrieve environment variables for conditional test skipping.
concurrent.futures.ThreadPoolExecutor and as_completed: For executing concurrent retrieval tests.
pytest: Testing framework used for writing and running tests.
retrieval_chunks (from common): The function under test, responsible for retrieving chunks based on parameters.
INVALID_API_TOKEN (from configs): A constant representing an invalid API token used in authorization tests.
RAGFlowHttpApiAuth (from libs.auth): Class representing HTTP API authorization credentials.

Classes and Their Methods

`TestAuthorization`

Tests related to API authorization.

Methods

test_invalid_auth(self, invalid_auth, expected_code, expected_message)
Tests the behavior of the retrieval_chunks function when provided with invalid or missing authorization.
- Parameters:
  - invalid_auth: Either None or an instance of RAGFlowHttpApiAuth initialized with an invalid token.
  - expected_code: Expected error code returned by the API.
  - expected_message: Expected error message string.
- Returns: None. Uses assertions to validate API responses.
- Usage Example:
```
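from common import retrieval_chunks
from configs import INVALID_API_TOKEN
from libs.auth import RAGFlowHttpApiAuth

auth = RAGFlowHttpApiAuth(INVALID_API_TOKEN)
response = retrieval_chunks(auth)
assert response["code"] == 109
assert "Authentication error" in response["message"]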
```

`TestChunksRetrieval`

Comprehensive tests for chunk retrieval functionality, including parameter validation and concurrency.

Common Parameters in Tests

HttpApiAuth: Valid authorization credentials fixture.
add_chunks: Fixture that adds chunks to the system and returns identifiers (dataset_id, document_id, ...).
payload: Dictionary containing query parameters for chunk retrieval.
expected_code: Expected API response code.
expected_page_size: Expected number of chunks returned.
expected_message: Expected error or status message.

Methods

test_basic_scenarios(self, HttpApiAuth, add_chunks, payload, expected_code, expected_page_size, expected_message)
Tests basic parameter combinations related to dataset_ids and document_ids.
test_page(self, HttpApiAuth, add_chunks, payload, expected_code, expected_page_size, expected_message)
Tests pagination parameters page and page_size, including invalid and edge cases. Some tests are skipped due to known issues.
test_page_size(self, HttpApiAuth, add_chunks, payload, expected_code, expected_page_size, expected_message)
Tests different values of page_size parameter, including string inputs and invalid values.
test_vector_similarity_weight(self, HttpApiAuth, add_chunks, payload, expected_code, expected_page_size, expected_message)
Tests the vector_similarity_weight parameter influencing retrieval ranking, including invalid type handling.
test_top_k(self, HttpApiAuth, add_chunks, payload, expected_code, expected_page_size, expected_message)
Tests the top_k parameter, which controls the number of top chunks returned. Tests include handling of negative values, strings, and environment-specific skips.
test_rerank_id(self, HttpApiAuth, add_chunks, payload, expected_code, expected_message)
(Skipped) Tests the rerank_id parameter for reranking models. Includes test for unknown rerank IDs.
test_keyword(self, HttpApiAuth, add_chunks, payload, expected_code, expected_page_size, expected_message)
(Skipped) Tests the keyword boolean/string parameter controlling keyword search behavior.
test_highlight(self, HttpApiAuth, add_chunks, payload, expected_code, expected_highlight, expected_message)
Tests the highlight parameter that controls whether chunks include highlighted search terms.
test_invalid_params(self, HttpApiAuth, add_chunks)
Tests API behavior with unexpected parameters, expecting graceful handling.
test_concurrent_retrieval(self, HttpApiAuth, add_chunks)
Tests concurrent execution of retrieval requests (100 parallel requests) to validate thread-safety and consistency.
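
The concurrent test described last can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the file's actual code: `fake_retrieval_chunks` is a hypothetical stand-in for `retrieval_chunks` so the snippet runs without a server, and the worker count of 5 and the payload values are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_retrieval_chunks(auth, payload):
    """Hypothetical stand-in for retrieval_chunks; always reports success."""
    return {"code": 0, "data": {"chunks": []}}


def run_concurrent_retrievals(auth, payload, count=100, max_workers=5):
    """Submit `count` retrieval calls to a thread pool and collect every response."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [
            executor.submit(fake_retrieval_chunks, auth, payload)
            for _ in range(count)
        ]
        # as_completed yields futures in completion order, not submission order.
        return [future.result() for future in as_completed(futures)]


# Assumed payload shape; "ds-1" is a placeholder dataset ID.
responses = run_concurrent_retrievals(
    "dummy-auth", {"question": "chunk", "dataset_ids": ["ds-1"]}
)
```

A test built this way would typically assert that every collected response carries the success code, which checks consistency across threads rather than raw throughput.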

Important Implementation Details and Algorithms

Parameterized Testing: Most tests use pytest.mark.parametrize to run the same test logic with multiple input/output pairs, improving coverage and reducing code duplication.
Conditional Test Skips: Some test cases are skipped based on known issues (issues/6646, issues/6648) or environment variables (DOC_ENGINE), allowing flexible testing across different deployment contexts.
Concurrent Requests: test_concurrent_retrieval uses ThreadPoolExecutor to simulate multiple clients querying the retrieval API simultaneously, ensuring correct behavior under load.
Error Handling Validation: Tests extensively check for correct error codes and messages when invalid inputs are provided, ensuring robustness.
Dynamic Payload Mutation: Tests dynamically update payload dictionaries with IDs obtained from test fixtures to ensure realistic and valid test inputs.
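
To make the parameterization, per-case skipping, and payload mutation points concrete, here is a small sketch. It is an assumption-laden illustration rather than the suite's real code: `fake_retrieval_chunks`, the error code 100, the default page size, and the four-chunk dataset are all hypothetical, and the dataset ID is hard-coded where the real tests would read it from the `add_chunks` fixture.

```python
import pytest


def fake_retrieval_chunks(auth, payload):
    """Hypothetical stand-in for retrieval_chunks so the sketch runs offline."""
    page_size = payload.get("page_size", 30)  # default value is an assumption
    if not isinstance(page_size, int) or page_size < 0:
        # 100 is an invented error code, used only for this illustration.
        return {"code": 100, "message": "invalid page_size"}
    # Pretend the dataset holds exactly 4 chunks.
    chunks = [{"id": i} for i in range(min(page_size, 4))]
    return {"code": 0, "data": {"chunks": chunks}}


@pytest.mark.parametrize(
    "payload, expected_code, expected_page_size",
    [
        ({"page_size": 2}, 0, 2),
        ({"page_size": 10}, 0, 4),  # capped by the 4 chunks that exist
        # A single case can be skipped, e.g. while a known issue is open.
        pytest.param(
            {"page_size": -1}, 100, 0,
            marks=pytest.mark.skip(reason="known issue (placeholder)"),
        ),
    ],
)
def test_page_size_sketch(payload, expected_code, expected_page_size):
    dataset_id = "ds-1"  # placeholder; the real suite gets this from add_chunks
    # Dynamic payload mutation: inject IDs that are only known at run time.
    payload.update({"question": "chunk", "dataset_ids": [dataset_id]})
    res = fake_retrieval_chunks("dummy-auth", payload)
    assert res["code"] == expected_code
    if expected_code == 0:
        assert len(res["data"]["chunks"]) == expected_page_size
```

Keeping the case table separate from the test body means a new scenario is one added row, and a problematic row can be skipped without touching the others.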

Interaction with Other Parts of the System

retrieval_chunks function (from common): This is the core API function under test, responsible for fetching text chunks based on provided parameters.
Authorization (RAGFlowHttpApiAuth): The tests interact with the authentication subsystem by providing valid or invalid API tokens.
Configurations (configs): Use of constants like INVALID_API_TOKEN to simulate authentication failures.
Test Fixtures (HttpApiAuth, add_chunks): External fixtures provide setup data such as authorized credentials and pre-added chunks in datasets/documents.
Environment Variables (os.getenv("DOC_ENGINE")): Influence test behavior based on the configured document engine backend.
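
The environment-variable dependency can be illustrated with a small helper. This is a hedged sketch: `engine_matches` is a hypothetical function that does not appear in the test file, and the engine names used below are illustrative values only.

```python
import os


def engine_matches(excluded_engines, env=None):
    """Return True when the configured DOC_ENGINE is one of `excluded_engines`.

    An unset variable is represented by None, so include None in
    `excluded_engines` to treat "not configured" as a match.
    """
    env = os.environ if env is None else env
    return env.get("DOC_ENGINE") in excluded_engines


# Passing an explicit mapping keeps the check independent of the real environment.
skip_on_infinity = engine_matches(["infinity"], env={"DOC_ENGINE": "infinity"})
skip_when_unset = engine_matches([None, "elasticsearch"], env={})
```

A boolean like this could be passed as the condition of `pytest.mark.skipif` so that a case runs only on the backends where it is expected to pass.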

Usage Example

A typical test case flow in this file:

Prepare a payload dictionary with parameters such as "question", "dataset_ids", "page", "page_size", etc.
Obtain valid API authorization from a fixture.
Call retrieval_chunks with the authorization and payload.
Assert that the response's code, message, and returned chunk count match expected values.

```
def test_example(HttpApiAuth, add_chunks):
    dataset_id, _, _ = add_chunks
    payload = {"question": "example", "dataset_ids": [dataset_id], "page_size": 3}
    res = retrieval_chunks(HttpApiAuth, payload)
    assert res["code"] == 0
    assert len(res["data"]["chunks"]) == 3
```

Mermaid Diagram: Class Structure

```mermaid
classDiagram
    class TestAuthorization {
        +test_invalid_auth(invalid_auth, expected_code, expected_message)
    }
    class TestChunksRetrieval {
        +test_basic_scenarios(HttpApiAuth, add_chunks, payload, expected_code, expected_page_size, expected_message)
        +test_page(HttpApiAuth, add_chunks, payload, expected_code, expected_page_size, expected_message)
        +test_page_size(HttpApiAuth, add_chunks, payload, expected_code, expected_page_size, expected_message)
        +test_vector_similarity_weight(HttpApiAuth, add_chunks, payload, expected_code, expected_page_size, expected_message)
        +test_top_k(HttpApiAuth, add_chunks, payload, expected_code, expected_page_size, expected_message)
        +test_rerank_id(HttpApiAuth, add_chunks, payload, expected_code, expected_message)
        +test_keyword(HttpApiAuth, add_chunks, payload, expected_code, expected_page_size, expected_message)
        +test_highlight(HttpApiAuth, add_chunks, payload, expected_code, expected_highlight, expected_message)
        +test_invalid_params(HttpApiAuth, add_chunks)
        +test_concurrent_retrieval(HttpApiAuth, add_chunks)
    }
```

Summary

This file is a test suite for the chunk retrieval API in InfiniFlow.
It validates authorization, parameter handling, pagination, filtering, ranking, highlighting, and concurrency.
It uses pytest parameterization and fixtures to keep the tests modular and maintainable.
It skips some cases conditionally to accommodate known issues and differences between document engine backends.
It includes a concurrency test that issues 100 retrieval requests through a thread pool to check thread safety and consistency.