test_parse_documents.py
Overview
test_parse_documents.py is an automated test suite designed to verify the correctness, robustness, and concurrency handling of the document parsing API in the InfiniFlow system. It leverages the pytest framework to structure and execute test cases that cover authorization validation, input payload validation, dataset and document ID handling, and performance under load and concurrent conditions.
The file tests the endpoint/functionality responsible for parsing documents within datasets, ensuring that parsing operations complete successfully and that document metadata reflects the expected processing states.
Detailed Description of Classes and Functions
Imported Modules and Utilities
concurrent.futures.ThreadPoolExecutor: Used for sending parsing requests in parallel.
pytest: Testing framework used for structuring tests.
common: Provides constants and API utility functions like bulk_upload_documents, list_documnets, and parse_documnets.
libs.auth.RAGFlowHttpApiAuth: Authentication class to create authorized API access.
libs.utils.wait_for: Decorator to wait for a condition to be true within a timeout, retrying at intervals.
Functions
condition
@wait_for(30, 1, "Document parsing timeout")
def condition(_auth, _dataset_id, _document_ids=None) -> bool:
Purpose: Checks if documents in a dataset have finished parsing (run == "DONE").
Parameters:
_auth: Authentication object for API calls.
_dataset_id (str): The dataset identifier.
_document_ids (Optional[List[str]]): Specific document IDs to check. If None, checks all documents in the dataset.
Returns:
True if all targeted documents have run == "DONE", otherwise False.
Implementation Notes:
Uses the list_documnets API to get document statuses.
Decorated with wait_for to retry every 1 second, up to 30 seconds.
Usage Example:
if condition(auth, "dataset123", ["doc1", "doc2"]):
    print("Documents parsed successfully")
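The completion check described above can be sketched without a running server. This is only an illustration: FAKE_RESPONSE and all_done are hypothetical names, and the response shape (a "data" object holding a "docs" list) is an assumption, not something the source confirms about list_documnets.

```python
from typing import Dict, List, Optional

# Hypothetical in-memory stand-in for a list_documnets response.
# The real helper calls the HTTP API; this response shape is assumed.
FAKE_RESPONSE: Dict = {
    "code": 0,
    "data": {
        "docs": [
            {"id": "doc1", "run": "DONE"},
            {"id": "doc2", "run": "RUNNING"},
        ]
    },
}


def all_done(response: Dict, document_ids: Optional[List[str]] = None) -> bool:
    """Return True only if every targeted document reports run == "DONE"."""
    docs = response["data"]["docs"]
    if document_ids is not None:
        # Restrict the check to the requested documents.
        wanted = set(document_ids)
        docs = [doc for doc in docs if doc["id"] in wanted]
    return all(doc["run"] == "DONE" for doc in docs)


print(all_done(FAKE_RESPONSE, ["doc1"]))  # True: only doc1 is checked
print(all_done(FAKE_RESPONSE))            # False: doc2 is still running
```

Passing no document IDs widens the check to the whole dataset, which mirrors the _document_ids=None default in the signature above.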
validate_document_details
def validate_document_details(auth, dataset_id, document_ids) -> None:
Purpose: Validates key metadata fields of parsed documents.
Parameters:
auth: Authentication object.
dataset_id (str): Dataset identifier.
document_ids (List[str]): List of document IDs to validate.
Returns: None, but asserts that:
Document parsing status is "DONE".
process_begin_at timestamp is present.
process_duration is greater than zero.
progress is positive.
progress_msg contains "Task done".
Usage Example:
validate_document_details(auth, "dataset123", ["doc1", "doc2"])
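As a rough sketch of the five assertions listed above, the check can be applied to a single document record. The function name check_parsed_document and the sample field values are hypothetical; only the field names and conditions come from the description.

```python
from typing import Dict


def check_parsed_document(doc: Dict) -> None:
    """Assert the five metadata conditions on one document record."""
    assert doc["run"] == "DONE"
    assert doc["process_begin_at"]  # timestamp is present (non-empty)
    assert doc["process_duration"] > 0
    assert doc["progress"] > 0
    assert "Task done" in doc["progress_msg"]


# Hypothetical record; the values are made up for illustration.
sample = {
    "id": "doc1",
    "run": "DONE",
    "process_begin_at": "Mon, 01 Jan 2024 00:00:00 GMT",
    "process_duration": 1.5,
    "progress": 1.0,
    "progress_msg": "Task done (1.50s)",
}

check_parsed_document(sample)  # passes silently when all conditions hold
```

Because the checks are plain assert statements, the first failing condition raises AssertionError, which pytest reports as a test failure.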
Classes
TestAuthorization
Scope: Tests authorization and API key validation for the parsing endpoint.
Tests:
test_invalid_auth: Parameterized test verifying behavior when no auth or invalid API token is provided.
Parameters for test cases:
auth: Authentication object or None.
expected_code: Expected error code returned.
expected_message: Expected error message string.
Example Test Case:
def test_invalid_auth(self, auth, expected_code, expected_message):
    res = parse_documnets(auth, "dataset_id")
    assert res["code"] == expected_code
    assert res["message"] == expected_message
TestDocumentsParse
Scope: Tests various scenarios of document parsing payloads and dataset/document ID handling.
Tests:
test_basic_scenarios: Parameterized tests covering:
Null payload.
Empty document IDs.
Invalid document IDs.
Non-JSON payloads.
Valid single and multiple document IDs.
test_invalid_dataset_id: Tests invalid or empty dataset IDs.
test_parse_partial_invalid_document_id: Tests mixed valid and invalid document IDs.
test_repeated_parse: Tests repeated parsing requests on the same documents.
test_duplicate_parse: Tests parsing requests with duplicate document IDs.
Test Mechanics:
Uses fixtures get_http_api_auth and add_documents_func to prepare authentication and datasets with documents.
Calls parse_documnets with various payloads and asserts expected responses.
Uses condition and validate_document_details to ensure parsing completion and correctness.
Example Usage:
def test_basic_scenarios(self, get_http_api_auth, add_documents_func, payload, expected_code, expected_message):
    dataset_id, document_ids = add_documents_func
    if callable(payload):
        payload = payload(document_ids)
    res = parse_documnets(get_http_api_auth, dataset_id, payload)
    assert res["code"] == expected_code
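The callable(payload) branch above exists because real document IDs are only known once the fixture has run, so some cases have to be expressed as functions of those IDs. The sketch below illustrates the pattern with a plain loop instead of pytest.mark.parametrize so it runs standalone. CASES, resolve_payload, and fake_parse are hypothetical, and the codes 0 and 102 are placeholders rather than the API's documented values.

```python
from typing import Callable, Dict, List, Optional, Tuple, Union

Payload = Union[None, Dict, Callable[[List[str]], Dict]]

# Hypothetical case table: (payload, expected_code).
# The codes are placeholders, not the real API's values.
CASES: List[Tuple[Payload, int]] = [
    (None, 102),
    ({"document_ids": []}, 102),
    (lambda ids: {"document_ids": ids[:1]}, 0),  # single valid ID
    (lambda ids: {"document_ids": ids}, 0),      # all valid IDs
]


def resolve_payload(payload: Payload, document_ids: List[str]) -> Optional[Dict]:
    """Build the final payload once the real document IDs are known."""
    return payload(document_ids) if callable(payload) else payload


def fake_parse(payload: Optional[Dict]) -> Dict:
    """Stub standing in for parse_documnets: rejects missing or empty ID lists."""
    if not payload or not payload.get("document_ids"):
        return {"code": 102}
    return {"code": 0}


document_ids = ["doc1", "doc2", "doc3"]
for payload, expected_code in CASES:
    res = fake_parse(resolve_payload(payload, document_ids))
    assert res["code"] == expected_code
```

Static payloads pass through unchanged, while lambda payloads are evaluated late, which keeps the case table declarative even though the IDs are generated at test time.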
Standalone Tests
test_parse_100_files
Purpose: Tests bulk parsing of 100 documents to validate scalability and performance.
Parameters:
get_http_api_auth, add_dataset_func, tmp_path: pytest fixtures for auth, dataset creation, and temporary file path.
Workflow:
Uploads 100 documents using bulk_upload_documents.
Calls parse_documnets on all documents.
Waits for parsing completion and validates document details.
Timeout: Waits up to 100 seconds for completion.
test_concurrent_parse
Purpose: Tests concurrent parsing of 100 documents using 5 parallel threads.
Parameters:
Same as test_parse_100_files.
Workflow:
Uploads documents.
Uses ThreadPoolExecutor to send parallel parse requests, each for one document.
Asserts all parse requests succeed.
Waits for completion and validates document details.
Timeout: Waits up to 120 seconds for completion.
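The fan-out step of this workflow might look roughly like the following. It is a sketch under stated assumptions: fake_parse is a stub standing in for parse_documnets, the dataset and document IDs are invented, and the real test additionally waits for parsing to finish before validating details.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def fake_parse(dataset_id: str, payload: Dict) -> Dict:
    """Stub standing in for parse_documnets; always reports success."""
    return {"code": 0, "document_ids": payload["document_ids"]}


dataset_id = "dataset123"  # hypothetical ID
document_ids: List[str] = [f"doc{i}" for i in range(100)]

# Five worker threads, one parse request per document.
with ThreadPoolExecutor(max_workers=5) as executor:
    futures = [
        executor.submit(fake_parse, dataset_id, {"document_ids": [doc_id]})
        for doc_id in document_ids
    ]
    # result() blocks until the request finishes and re-raises any exception.
    responses = [future.result() for future in futures]

assert all(res["code"] == 0 for res in responses)
print(len(responses))  # 100
```

Collecting results through future.result() means an exception in any worker surfaces in the test thread instead of being lost.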
Important Implementation Details and Algorithms
Polling with Timeout: The condition function uses the wait_for decorator, which implements a polling mechanism. It retries the condition every 1 second up to a maximum timeout (30s, 100s, or 120s depending on the test). This ensures asynchronous parsing completes before assertions.
Concurrent Requests: The concurrent parsing test uses Python's ThreadPoolExecutor to simulate multiple users or processes parsing documents in parallel, exercising the backend's handling of concurrent requests.
Parameterized Testing: Many tests use pytest.mark.parametrize to cover a wide range of input scenarios without code duplication, improving test coverage and maintainability.
Fixtures Usage: Tests rely on external pytest fixtures like get_http_api_auth, add_documents_func, and add_dataset_func to abstract setup tasks such as authentication and dataset/document creation.
Error Handling Validation: Tests explicitly check for expected error codes and messages for invalid inputs, ensuring the API returns meaningful feedback to clients.
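To make the polling behaviour concrete, here is a minimal sketch of a decorator with the same call shape as wait_for(timeout, interval, message). It is a hypothetical reimplementation, not the code in libs.utils; in particular, raising AssertionError on timeout is an assumption.

```python
import functools
import time
from typing import Callable


def wait_for(timeout: float, interval: float, error_msg: str) -> Callable:
    """Retry the wrapped predicate until it returns True or the timeout elapses."""
    def decorator(func: Callable[..., bool]) -> Callable[..., bool]:
        @functools.wraps(func)
        def wrapper(*args, **kwargs) -> bool:
            deadline = time.monotonic() + timeout
            while True:
                if func(*args, **kwargs) is True:
                    return True
                if time.monotonic() >= deadline:
                    # Assumed failure mode: fail the calling test.
                    raise AssertionError(error_msg)
                time.sleep(interval)
        return wrapper
    return decorator


calls = {"count": 0}


@wait_for(1, 0.01, "Document parsing timeout")
def parsed_after_three_polls() -> bool:
    # Hypothetical predicate that succeeds on the third call.
    calls["count"] += 1
    return calls["count"] >= 3


parsed_after_three_polls()
print(calls["count"])  # 3
```

Using time.monotonic() for the deadline keeps the loop unaffected by system clock adjustments during a long test run.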
Interaction with Other System Components
API Functions (from common):
parse_documnets(auth, dataset_id, payload): The primary function under test, which triggers document parsing.
list_documnets(auth, dataset_id, params): Used to retrieve current document statuses for validation.
bulk_upload_documents(auth, dataset_id, count, tmp_path): Used to upload test documents before parsing.
Authentication (libs.auth):
RAGFlowHttpApiAuth: Used to create authentication tokens for API requests.
Utilities (libs.utils):
wait_for: Utility for polling with timeout, crucial for asynchronous operation validation.
Pytest Framework:
Provides the testing infrastructure, fixtures, and parameterization.
The file fits into the larger InfiniFlow system as a quality assurance component that ensures the document parsing API behaves correctly under various conditions, including error scenarios, large datasets, and concurrent usage.
Visual Diagram
classDiagram
class TestAuthorization {
+test_invalid_auth(auth, expected_code, expected_message)
}
class TestDocumentsParse {
+test_basic_scenarios(get_http_api_auth, add_documents_func, payload, expected_code, expected_message)
+test_invalid_dataset_id(get_http_api_auth, add_documents_func, dataset_id, expected_code, expected_message)
+test_parse_partial_invalid_document_id(get_http_api_auth, add_documents_func, payload)
+test_repeated_parse(get_http_api_auth, add_documents_func)
+test_duplicate_parse(get_http_api_auth, add_documents_func)
}
class Functions {
+condition(_auth, _dataset_id, _document_ids=None) bool
+validate_document_details(auth, dataset_id, document_ids) void
}
class StandaloneTests {
+test_parse_100_files(get_http_api_auth, add_dataset_func, tmp_path)
+test_concurrent_parse(get_http_api_auth, add_dataset_func, tmp_path)
}
TestAuthorization --> Functions : uses
TestDocumentsParse --> Functions : uses
StandaloneTests --> Functions : uses
StandaloneTests --> ThreadPoolExecutor : uses
Summary
test_parse_documents.py is a comprehensive, well-structured test suite validating the document parsing API behavior of the InfiniFlow platform. It uses parameterized tests, fixtures, and concurrency utilities to verify correctness, error handling, and scalability. Its polling mechanism via wait_for ensures asynchronous parsing completes before assertions, making tests reliable and robust.
This file is essential for maintaining API integrity during development, preventing regressions, and ensuring the backend can handle complex real-world usage scenarios involving bulk and concurrent document parsing.