test_list_chunks.py

Overview

test_list_chunks.py is a test suite that verifies the functionality, robustness, and correctness of the list_chunks API endpoint in the InfiniFlow system. The endpoint retrieves document chunks according to query parameters such as page number, page size, and keyword filter, and it requires valid authorization credentials.

The tests are implemented using the pytest framework and cover scenarios including:

- Authorization validation with valid and invalid tokens.
- Pagination behavior and edge cases.
- Page size constraints and limits.
- Keyword-based filtering of chunks.
- Handling of invalid or unexpected parameters.
- Concurrent access to the list_chunks API.
- Default behavior and consistency after batch chunk additions.

By exercising these cases, the test suite ensures that the chunk listing functionality behaves as expected under normal, boundary, and erroneous conditions.

Detailed Explanation

Imports and Dependencies

- os: Used for environment variable checks.
- concurrent.futures.ThreadPoolExecutor, as_completed: Facilitate concurrent execution of API calls.
- pytest: Testing framework for structuring and running tests.
- batch_add_chunks and list_chunks (from common): Utility functions to add chunks and list chunks respectively.
- INVALID_API_TOKEN (from configs): A constant representing an invalid API token used for testing authorization failures.
- RAGFlowWebApiAuth (from libs.auth): Authentication class used to create API auth tokens.

Test Classes

1. `TestAuthorization`

This class validates the authorization mechanism of the list_chunks API.

Method: test_invalid_auth
- Purpose: Tests API response when called without valid authentication credentials.
- Parameters (via pytest.mark.parametrize):
  - invalid_auth: Either None (no auth) or a RAGFlowWebApiAuth instance with an invalid token.
  - expected_code: Expected HTTP-like response code (401 Unauthorized).
  - expected_message: Expected message string indicating unauthorized access.
- Test: Calls list_chunks with invalid auth and expects a 401 error.
- Usage Example:
```
res = list_chunks(None, {"doc_id": "document_id"})
assert res["code"] == 401
assert "<Unauthorized" in res["message"]
```
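To make the shape of the parametrized cases concrete, the following is a minimal, self-contained sketch. `FakeAuth`, `fake_list_chunks`, the token strings, and the full message text are hypothetical stand-ins, not the helpers from common or libs.auth; the real suite passes equivalent (auth, code, message) tuples to pytest.mark.parametrize and calls the live endpoint.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeAuth:
    """Hypothetical stand-in for RAGFlowWebApiAuth; it only carries a token."""
    token: str


VALID_TOKEN = "valid-token"      # assumption: illustrative value only
INVALID_TOKEN = "invalid-token"  # assumption: plays the role of INVALID_API_TOKEN


def fake_list_chunks(auth: Optional[FakeAuth], payload: dict) -> dict:
    """Hypothetical stub that mimics the response envelope described above."""
    if auth is None or auth.token != VALID_TOKEN:
        # Only the "<Unauthorized" prefix comes from the source; the rest is made up.
        return {"code": 401, "message": "<Unauthorized (illustrative text)>"}
    return {"code": 0, "data": {"chunks": []}}


# Each tuple mirrors one parametrized case: (auth, expected_code, expected_message).
CASES = [
    (None, 401, "<Unauthorized"),
    (FakeAuth(INVALID_TOKEN), 401, "<Unauthorized"),
]


def run_invalid_auth_cases() -> list:
    """Run every case and return the response codes that were observed."""
    codes = []
    for auth, expected_code, expected_message in CASES:
        res = fake_list_chunks(auth, {"doc_id": "document_id"})
        assert res["code"] == expected_code
        assert expected_message in res["message"]
        codes.append(res["code"])
    return codes
```

Keeping the cases in one table means a new invalid-credential variant is a one-line addition, which is the same benefit parametrization gives the real test.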

2. `TestChunksList`

This class contains multiple tests focusing on the behavior of the chunk listing API with respect to pagination, page size, keyword filtering, parameter validation, concurrency, and default chunk listing.

Pagination Tests

Method: test_page
- Purpose: Validates correct handling of the page parameter.
- Parameters:
  - params: Dict containing page and size.
  - expected_code: Expected response code (0 for success, 100 for error).
  - expected_page_size: Number of chunks expected in the response.
  - expected_message: Expected error message if the code is not 0.
- Notes: Some edge cases are marked to skip due to current limitations.
- Behavior:
  - Checks response when page is None, negative, zero, a string, or valid integers.
  - Asserts the count of returned chunks matches expectations.
- Example Usage:
```
payload = {"doc_id": doc_id, "page": 2, "size": 2}
res = list_chunks(WebApiAuth, payload)
assert res["code"] == 0
assert len(res["data"]["chunks"]) == 2
```
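The arithmetic behind the page and size expectations can be sketched with an in-memory model. This is an assumption about how 1-based pagination typically slices a result list, not the server's implementation; `paginate`, the default size of 30, and the five sample chunks are hypothetical. The guard against non-positive pages is this sketch's own choice and says nothing about how the real endpoint responds to such values.

```python
from typing import List


def paginate(chunks: List[str], page: int = 1, size: int = 30) -> List[str]:
    """Return the slice for a 1-based page number (hypothetical model)."""
    if page < 1 or size < 0:
        # Sketch-only guard; the real API's handling of these inputs may differ.
        raise ValueError("page must be >= 1 and size must be >= 0")
    start = (page - 1) * size
    return chunks[start:start + size]


# Assumption: the fixture leaves the document with 5 chunks.
chunks = [f"chunk-{i}" for i in range(5)]

second_page = paginate(chunks, page=2, size=2)  # items at offsets 2 and 3
last_page = paginate(chunks, page=3, size=2)    # only offset 4 remains
past_end = paginate(chunks, page=4, size=2)     # nothing left to return
```

Under this model, a full page returns exactly size items, the last page returns the remainder, and any page past the end is empty, which is the kind of expectation expected_page_size encodes.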

Page Size Tests

Method: test_page_size
- Purpose: Tests API response to various size parameter values.
- Parameters & Behavior:
  - Handles size as None, zero, positive integers, strings, and negative values.
  - Validates whether the number of chunks returned matches the expected page size.
  - Some invalid inputs are skipped in tests due to current implementation constraints.
- Example:
```
payload = {"doc_id": doc_id, "size": 1}
res = list_chunks(WebApiAuth, payload)
assert len(res["data"]["chunks"]) == 1
```

Keyword Filtering Tests

Method: test_keywords
- Purpose: Checks filtering of chunks by keywords.
- Parameters:
  - params: Dict with different keyword values (None, empty string, specific keywords).
  - expected_page_size: Expected number of filtered chunks returned.
- Notes: Some tests are conditionally skipped based on environment due to known issues.
- Behavior:
  - Verifies that valid keywords filter the chunk list accordingly.
  - Returns all chunks if keyword is empty or None.
- Example:
```
payload = {"doc_id": doc_id, "keywords": "content"}
res = list_chunks(WebApiAuth, payload)
assert len(res["data"]["chunks"]) == 1
```
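The filtering rule described above (a keyword narrows the list, while an empty or None keyword returns everything) can be illustrated with a small in-memory sketch. `filter_by_keywords` and the sample texts are hypothetical, and case-insensitive substring matching is an assumption made for illustration; the real endpoint delegates matching to its document engine, which may tokenize and rank differently.

```python
from typing import List, Optional


def filter_by_keywords(chunks: List[str], keywords: Optional[str]) -> List[str]:
    """Hypothetical filter: case-insensitive substring match on chunk text."""
    if not keywords:  # None and "" both mean "no filter"
        return list(chunks)
    needle = keywords.lower()
    return [text for text in chunks if needle in text.lower()]


# Assumption: illustrative chunk texts, not the fixture's actual data.
sample = ["parsed document content", "chunk test 0", "chunk test 1"]

all_chunks = filter_by_keywords(sample, None)
content_only = filter_by_keywords(sample, "content")
no_match = filter_by_keywords(sample, "missing")
```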

Invalid Parameters Test

Method: test_invalid_params
- Purpose: Ensures that unexpected parameters do not break functionality.
- Behavior: Sends a payload with an unknown parameter "a": "b" and asserts normal operation with default chunk count.
- Example:
```
payload = {"doc_id": doc_id, "a": "b"}
res = list_chunks(WebApiAuth, payload)
assert res["code"] == 0
```

Concurrent Requests Test

Method: test_concurrent_list
- Purpose: Checks the API's robustness under concurrent access.
- Behavior:
  - Submits 100 requests to a thread pool with at most 5 workers, so up to 5 requests are in flight at once.
  - Asserts all responses return the expected number of chunks.
- Example:
```
from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(list_chunks, WebApiAuth, {"doc_id": doc_id}) for _ in range(100)]
for future in as_completed(futures):
    assert len(future.result()["data"]["chunks"]) == 5
```
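The same pattern can be run without a server by swapping in a stub. In the sketch below, `fake_list_chunks` and `run_concurrent_list` are hypothetical names, and the stub always returns five chunks to match the expectation above; only ThreadPoolExecutor and as_completed are the real standard-library APIs.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_list_chunks(auth, payload):
    """Hypothetical stub standing in for the real list_chunks HTTP helper."""
    return {"code": 0, "data": {"chunks": [f"chunk-{i}" for i in range(5)]}}


def run_concurrent_list(count: int = 100, max_workers: int = 5) -> list:
    """Submit `count` calls to a bounded pool and collect every response."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [
            executor.submit(fake_list_chunks, None, {"doc_id": "document_id"})
            for _ in range(count)
        ]
        # as_completed yields futures in completion order, not submission order.
        return [future.result() for future in as_completed(futures)]


responses = run_concurrent_list()
```

Calling future.result() re-raises any exception thrown inside a worker, so a failing request surfaces in the test thread rather than being silently dropped.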

Default Behavior Test

Method: test_default
- Purpose: Validates default chunk listing behavior, both initially and after a batch of chunks is added.
- Behavior:
  - Lists chunks for a document.
  - Adds 31 chunks in batch.
  - Waits 3 seconds (to allow eventual consistency).
  - Lists chunks again and verifies the result. The example below expects 30 chunks in the response even though 31 were added, which is consistent with a default page size of 30.
- Example:
```
import time

res = list_chunks(WebApiAuth, {"doc_id": doc_id})
batch_add_chunks(WebApiAuth, doc_id, 31)
time.sleep(3)  # give the newly added chunks time to become visible
res = list_chunks(WebApiAuth, {"doc_id": doc_id})
assert len(res["data"]["chunks"]) == 30
```
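The test described here uses a fixed 3-second sleep. As a point of comparison only, and not something the source does, a common alternative is to poll until the expected state appears or a timeout expires. The sketch below shows that swapped-in technique; `wait_until` and `FakeIndex` are hypothetical names, and the fake index simply hides its writes for a few reads to imitate propagation lag.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float = 3.0, interval: float = 0.1) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


class FakeIndex:
    """Hypothetical store whose writes stay invisible for the first few reads."""

    def __init__(self, lag_reads: int) -> None:
        self._lag_reads = lag_reads
        self._total = 0

    def add(self, n: int) -> None:
        self._total += n

    def visible_count(self) -> int:
        if self._lag_reads > 0:
            self._lag_reads -= 1
            return 0
        return self._total


index = FakeIndex(lag_reads=3)
index.add(31)
became_visible = wait_until(lambda: index.visible_count() == 31, timeout=2.0, interval=0.01)
```

Polling returns as soon as the data is visible and fails with a clear timeout otherwise, whereas a fixed sleep always pays the full delay; the trade-off is a little extra helper code.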

Important Implementation Details

- Use of Parameterization: The tests use pytest.mark.parametrize extensively to cover multiple input variations efficiently.
- Skipping Tests: Some tests with invalid inputs or edge cases are marked to skip, indicating current known limitations or external dependencies.
- Concurrency Testing: Uses ThreadPoolExecutor to issue concurrent requests (100 requests across 5 worker threads) and checks that the API returns consistent results under parallel access.
- Sleep for Eventual Consistency: The default behavior test incorporates a delay to accommodate asynchronous chunk addition propagation.
- Error Handling: Tests verify that the API returns appropriate error codes and messages for invalid inputs.

Interaction With Other System Components

- list_chunks Function: Core API call under test, responsible for retrieving document chunks.
- batch_add_chunks Function: Utility to add multiple chunks to a document, used to set up test data.
- Authentication (RAGFlowWebApiAuth): Used to simulate authorized and unauthorized API requests.
- Configuration (INVALID_API_TOKEN): Used to test invalid authorization flows.
- Environment Variables: Some tests conditionally skip parts based on the environment (e.g., DOC_ENGINE).

This file acts as a validation layer ensuring that the document chunk listing API behaves correctly within the larger InfiniFlow system, especially relating to document chunk management and authentication.

Visual Diagram

```mermaid
classDiagram
    class TestAuthorization {
        +test_invalid_auth(invalid_auth, expected_code, expected_message)
    }
    class TestChunksList {
        +test_page(WebApiAuth, add_chunks, params, expected_code, expected_page_size, expected_message)
        +test_page_size(WebApiAuth, add_chunks, params, expected_code, expected_page_size, expected_message)
        +test_keywords(WebApiAuth, add_chunks, params, expected_page_size)
        +test_invalid_params(WebApiAuth, add_chunks)
        +test_concurrent_list(WebApiAuth, add_chunks)
        +test_default(WebApiAuth, add_document)
    }
    TestAuthorization ..> list_chunks : calls
    TestChunksList ..> list_chunks : calls
    TestChunksList ..> batch_add_chunks : calls
    TestAuthorization ..> RAGFlowWebApiAuth : uses
    TestChunksList ..> RAGFlowWebApiAuth : uses
```

Summary

The test module test_list_chunks.py tests the chunk listing API in InfiniFlow, covering authorization, pagination, filtering, concurrency, and error cases. It checks that the API returns correct data slices, handles invalid inputs gracefully, and behaves consistently under concurrent access. Parameterized tests and a concurrent-request test together strengthen confidence in the system's chunk retrieval mechanisms.