test_parse_documents.py
Overview
test_parse_documents.py is a comprehensive test suite designed to validate the document parsing functionality of the InfiniFlow system. It uses the pytest framework to test various scenarios related to parsing documents within datasets, focusing on authorization, input validation, concurrency, and bulk operations. The file ensures that the document parsing API behaves correctly under normal conditions, edge cases, and error states.
The tests cover:
Authorization checks with valid and invalid API tokens.
Validation of payloads sent to the parse_documents API.
Handling of invalid dataset IDs and document IDs.
Behavior when repeatedly parsing documents or submitting duplicate document IDs.
Performance and correctness of parsing large numbers of documents.
Concurrent parsing requests to simulate real-world parallel processing.
Utility functions are used to wait for asynchronous parsing completion, verify document status, and validate parsing result details.
Detailed Explanations
Imported Modules and Utilities
ThreadPoolExecutor, as_completed from concurrent.futures: for concurrent execution of parsing requests.
pytest: test framework used for structured test cases and parametrization.
bulk_upload_documents, list_documents, parse_documents from common: core API interaction functions used in tests.
INVALID_API_TOKEN from configs: predefined invalid token for negative auth tests.
RAGFlowHttpApiAuth from libs.auth: authentication class for API requests.
wait_for from utils: decorator utility to poll/wait for a condition with timeout.
Functions
condition(_auth, _dataset_id, _document_ids=None)
Wait condition function decorated with @wait_for(30, 1, "Document parsing timeout"). It polls the document list in a dataset and checks if all targeted documents have finished parsing ("run" == "DONE").
Parameters:
_auth: Authentication object for API calls.
_dataset_id: Dataset identifier string.
_document_ids (optional): List of document IDs to check. If None, checks all documents.
Returns: True if all target documents are done parsing; False otherwise.
Usage: Used to wait until parsing completes before assertions.
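The completion check described above can be sketched as a plain predicate over document records. This is a minimal illustration, not the module's actual code: the helper name all_parsed and the record shape (dicts with "id" and "run" keys) are assumptions made for the example, and the real condition function obtains its records through list_documents.

```python
def all_parsed(docs, document_ids=None):
    """Return True when every targeted document reports run == "DONE".

    docs: list of dicts with "id" and "run" keys (assumed shape).
    document_ids: optional list of IDs to check; None means check all.
    """
    targets = None if document_ids is None else set(document_ids)
    for doc in docs:
        # Skip documents outside the requested subset.
        if targets is not None and doc["id"] not in targets:
            continue
        if doc.get("run") != "DONE":
            return False
    return True


# Hypothetical records, shaped like entries a document listing might return.
docs = [
    {"id": "a", "run": "DONE"},
    {"id": "b", "run": "RUNNING"},
]
print(all_parsed(docs))         # considers both documents
print(all_parsed(docs, ["a"]))  # considers only document "a"
```

Keeping the predicate free of network calls makes it easy to reason about separately from the polling wrapper that retries it.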
validate_document_details(auth, dataset_id, document_ids)
Verifies detailed parsing results for each document by asserting:
The document's "run" status is "DONE".
The "process_begin_at" timestamp is present.
The "process_duration" is positive.
The "progress" is greater than 0.
The "progress_msg" contains "Task done".
Parameters:
auth: Authentication object.
dataset_id: Dataset identifier.
document_ids: List of document IDs to validate.
Returns: None. Raises assertion errors on validation failure.
Usage: Ensures parsing produced expected metadata on documents.
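The five checks listed above could look roughly like the following when applied to a single document record. This is a sketch under assumptions: the helper name check_parsed_document and the sample values are hypothetical, and only the field names come from the description above.

```python
def check_parsed_document(doc):
    """Assert that one document record carries the expected parsing metadata."""
    assert doc["run"] == "DONE", "parsing has not finished"
    assert doc.get("process_begin_at"), "missing process_begin_at timestamp"
    assert doc["process_duration"] > 0, "process_duration should be positive"
    assert doc["progress"] > 0, "progress should be greater than 0"
    assert "Task done" in doc["progress_msg"], "progress_msg lacks 'Task done'"


# Hypothetical record; the values are placeholders, not real API output.
sample_doc = {
    "run": "DONE",
    "process_begin_at": "2024-01-01 00:00:00",
    "process_duration": 1.5,
    "progress": 1.0,
    "progress_msg": "Task done (1.50s)",
}
check_parsed_document(sample_doc)  # passes silently when all checks hold
```

Attaching a message to each assertion makes a failing run point directly at the field that was not populated.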
Test Classes and Methods
class TestAuthorization
Tests related to API authorization.
test_invalid_auth(self, invalid_auth, expected_code, expected_message)
Tests responses when invalid or missing authorization is provided.
Parameters (pytest parametrized):
invalid_auth: Either None or an invalid RAGFlowHttpApiAuth instance.
expected_code: Expected error code (0 means no error).
expected_message: Expected error message string.
Behavior: Calls parse_documents with invalid auth and asserts the response code and message.
class TestDocumentsParse
Tests core document parsing functionality with various payloads and dataset conditions.
test_basic_scenarios(self, HttpApiAuth, add_documents_func, payload, expected_code, expected_message)
Tests parsing with multiple payload scenarios:
None payload (skipped).
Empty document_ids list.
Invalid document IDs.
Non-JSON payload.
Valid single or multiple document IDs.
Validates proper error codes/messages and successful parsing behavior.
test_invalid_dataset_id(self, HttpApiAuth, add_documents_func, dataset_id, expected_code, expected_message)
Tests parsing with invalid or empty dataset IDs and expects errors.
test_parse_partial_invalid_document_id(self, HttpApiAuth, add_documents_func, payload)
Tests behavior when a mix of valid and invalid document IDs is submitted. Expects an error about the missing documents, while the valid documents are still parsed.
test_repeated_parse(self, HttpApiAuth, add_documents_func)
Tests idempotency by parsing the same documents multiple times.
test_duplicate_parse(self, HttpApiAuth, add_documents_func)
Tests submitting duplicate document IDs in the same request. Expects success with warnings about duplicates and a correct success count.
Standalone Test Functions
test_parse_100_files(HttpApiAuth, add_dataset_func, tmp_path)
Tests bulk parsing of 100 documents uploaded to a newly created dataset. Waits up to 200 seconds for completion. Validates all documents parsed correctly.
test_concurrent_parse(HttpApiAuth, add_dataset_func, tmp_path)
Tests concurrent parsing of 100 documents using a thread pool with 5 workers. Each document is parsed independently in parallel. Validates all responses and final document states.
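The concurrent pattern used by test_concurrent_parse (one request per document, five worker threads) can be sketched as follows. To keep the example runnable without a server, fake_parse_documents is a hypothetical stand-in for the real parse_documents helper, and the auth and dataset values are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_parse_documents(auth, dataset_id, payload):
    """Hypothetical stand-in for parse_documents: always reports success."""
    return {"code": 0, "document_ids": payload["document_ids"]}


def parse_concurrently(auth, dataset_id, document_ids, max_workers=5):
    """Submit one parse request per document and collect every response."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [
            executor.submit(
                fake_parse_documents, auth, dataset_id, {"document_ids": [doc_id]}
            )
            for doc_id in document_ids
        ]
        # as_completed yields futures in completion order, not submission order.
        return [future.result() for future in as_completed(futures)]


document_ids = [f"doc_{i}" for i in range(100)]
responses = parse_concurrently("auth-placeholder", "dataset-placeholder", document_ids)
assert len(responses) == 100
assert all(response["code"] == 0 for response in responses)
```

Because responses arrive in completion order, assertions should not depend on their position in the list; checking each response's code, as here, avoids that pitfall.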
Important Implementation Details
Waiting for Asynchronous Parsing:
Parsing is asynchronous; tests use the @wait_for decorator to poll document status until parsing completes or the timeout expires.
Concurrency Testing:
Uses Python's ThreadPoolExecutor to issue multiple simultaneous parsing requests and check that the API handles them correctly.
Parametrization for Coverage:
Tests use pytest.mark.parametrize extensively to cover multiple input variations, error scenarios, and edge cases without duplicating code.
Validation of Parsing Metadata:
Besides checking success codes, tests validate that parsing timestamps, durations, progress indicators, and messages are correctly set on documents.
Error Handling:
Tests confirm that invalid inputs (e.g., bad auth tokens, invalid dataset/document IDs) produce appropriate error codes and messages consistent with API design.
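The polling behavior described under Waiting for Asynchronous Parsing can be illustrated with a small decorator. This is a sketch of the general technique, not the project's wait_for implementation: the parameter names are assumptions, and this version raises TimeoutError, whereas the real utility may report the timeout differently (for example, by failing the test).

```python
import functools
import time


def wait_for(timeout=10, interval=1, error_msg="Timeout"):
    """Retry the wrapped function until it returns True or the timeout expires."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            deadline = time.monotonic() + timeout
            while True:
                if func(*args, **kwargs) is True:
                    return True
                if time.monotonic() >= deadline:
                    raise TimeoutError(error_msg)
                time.sleep(interval)

        return wrapper

    return decorator


# Hypothetical condition that succeeds on its third call.
attempts = {"count": 0}


@wait_for(timeout=2, interval=0.01, error_msg="Document parsing timeout")
def parsing_finished():
    attempts["count"] += 1
    return attempts["count"] >= 3


parsing_finished()  # polls until the condition holds
```

Using time.monotonic for the deadline keeps the timeout unaffected by system clock adjustments during a long test run.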
Interactions with Other System Components
API Layer:
Interacts primarily with the document parsing REST API via helper functions parse_documents, list_documents, and bulk_upload_documents.
Authentication:
Uses RAGFlowHttpApiAuth for authenticated requests and tests invalid tokens from configs.
Dataset and Document Management:
Works with datasets and documents created/uploaded during tests (add_dataset_func, add_documents_func fixtures).
Utilities:
Relies on wait_for utility for asynchronous wait/poll patterns.
Common Module (common.py):
Contains core API call wrappers used extensively for document listing, upload, and parsing requests.
Usage Examples
Example: Wait for parsing to complete and validate
# Wait until documents finish parsing
condition(auth, dataset_id, document_ids)
# Validate detailed parsing results
validate_document_details(auth, dataset_id, document_ids)
Example: Parsing documents with error handling
res = parse_documents(auth, dataset_id, {"document_ids": ["invalid_id"]})
if res["code"] != 0:
    print(f"Error: {res['message']}")
Mermaid Class Diagram
classDiagram
class TestAuthorization {
+test_invalid_auth(invalid_auth, expected_code, expected_message)
}
class TestDocumentsParse {
+test_basic_scenarios(HttpApiAuth, add_documents_func, payload, expected_code, expected_message)
+test_invalid_dataset_id(HttpApiAuth, add_documents_func, dataset_id, expected_code, expected_message)
+test_parse_partial_invalid_document_id(HttpApiAuth, add_documents_func, payload)
+test_repeated_parse(HttpApiAuth, add_documents_func)
+test_duplicate_parse(HttpApiAuth, add_documents_func)
}
class Functions {
+condition(_auth, _dataset_id, _document_ids=None)
+validate_document_details(auth, dataset_id, document_ids)
+test_parse_100_files(HttpApiAuth, add_dataset_func, tmp_path)
+test_concurrent_parse(HttpApiAuth, add_dataset_func, tmp_path)
}
TestAuthorization --> Functions : uses
TestDocumentsParse --> Functions : uses
Summary
test_parse_documents.py is a robust, well-structured test module aimed at ensuring the integrity and reliability of the document parsing feature in InfiniFlow. It validates authorization, input correctness, concurrency, and bulk processing, using a mix of synchronous and asynchronous testing techniques. The tests provide confidence that the parsing API behaves as expected across a wide range of scenarios and edge cases.