test_paser_documents.py
Overview
This file contains automated tests for the document parsing functionality within the InfiniFlow system. It primarily verifies the behavior of the document parsing API, including authorization checks, parsing execution, stopping parsing operations, and concurrency handling.
The tests ensure that documents can be parsed correctly, that invalid or unauthorized requests are handled properly, and that the system behaves as expected under concurrent parsing scenarios. The file uses the pytest framework for structuring tests and assertions.
Detailed Explanations
Imported Modules and Utilities
ThreadPoolExecutor, as_completed (from concurrent.futures): Used to run multiple parsing requests concurrently.
pytest: The testing framework used.
common: Module providing utility functions:
    bulk_upload_documents: Uploads multiple documents for testing.
    list_documents: Retrieves metadata of documents.
    parse_documents: Triggers document parsing API calls.
configs.INVALID_API_TOKEN: A constant representing an invalid API token for negative tests.
libs.auth.RAGFlowWebApiAuth: Authentication class to interact with the API.
utils.wait_for: A decorator utility to repeatedly check a condition with timeout.
Functions
condition(_auth, _kb_id, _document_ids=None)
Purpose: Waits for document parsing to complete by checking if all targeted documents have a "run" status equal to "3".
Parameters:
_auth: Authentication object.
_kb_id: Knowledge base (KB) ID containing the documents.
_document_ids: Optional list of document IDs to check. If None, checks all documents in the KB.
Returns:
True if all targeted documents have run == "3" (indicating parsing completion).
False otherwise.
Usage: Used as a condition function for @wait_for decorator to poll until parsing finishes.
validate_document_parse_done(auth, _kb_id, _document_ids)
Purpose: Validates that specified documents have successfully completed parsing.
Parameters:
auth: Authentication object.
_kb_id: KB ID.
_document_ids: List of document IDs to validate.
Assertions:
Document run status is "3".
process_begin_at timestamp is present.
process_duration is positive.
progress is positive.
"Task done" present in progress_msg.
Usage: Called after parsing to confirm success.
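As a rough sketch, the assertions listed above could be applied to plain dictionaries like this. The function name validate_parse_done, the record layout, and all sample values are made up for illustration; the real helper fetches its records through list_documents.

```python
def validate_parse_done(docs, document_ids):
    """Apply the completion checks to the targeted records (illustrative only)."""
    wanted = set(document_ids)
    for doc in docs:
        if doc["id"] not in wanted:
            continue
        assert doc["run"] == "3"
        assert len(doc["process_begin_at"]) > 0
        assert doc["process_duration"] > 0
        assert doc["progress"] > 0
        assert "Task done" in doc["progress_msg"]


# Made-up sample records: doc-1 is finished, doc-2 is still running.
sample_docs = [
    {
        "id": "doc-1",
        "run": "3",
        "process_begin_at": "2024-01-01 00:00:00",
        "process_duration": 1.5,
        "progress": 1.0,
        "progress_msg": "Task done (1.50s)",
    },
    {
        "id": "doc-2",
        "run": "1",
        "process_begin_at": "2024-01-01 00:00:05",
        "process_duration": 0.0,
        "progress": 0.3,
        "progress_msg": "Parsing",
    },
]

validate_parse_done(sample_docs, ["doc-1"])  # passes silently
```

Filtering by ID first means documents outside the requested list never cause a failure, which matches the "specified documents" wording above.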
validate_document_parse_cancel(auth, _kb_id, _document_ids)
Purpose: Validates that specified documents have parsing stopped/canceled.
Parameters: Same as validate_document_parse_done.
Assertions:
Document run status is "2" (canceled state).
process_begin_at timestamp is present.
progress is 0.0.
Usage: Used to verify that stop parsing requests properly cancel document parsing.
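One way to picture the cancel checks is a small function that reports which targeted documents do not look canceled. This is a sketch under assumptions: find_cancel_violations is a hypothetical name, it returns offending IDs instead of asserting as the real helper does, and the sample records are invented.

```python
def find_cancel_violations(docs, document_ids):
    """Return IDs of targeted documents that fail the cancel checks."""
    wanted = set(document_ids)
    violations = []
    for doc in docs:
        if doc["id"] not in wanted:
            continue
        canceled = (
            doc["run"] == "2"
            and len(doc["process_begin_at"]) > 0
            and doc["progress"] == 0.0
        )
        if not canceled:
            violations.append(doc["id"])
    return violations


# Made-up records: doc-a was stopped, doc-b finished before the stop arrived.
records = [
    {"id": "doc-a", "run": "2", "process_begin_at": "2024-01-01 00:00:00", "progress": 0.0},
    {"id": "doc-b", "run": "3", "process_begin_at": "2024-01-01 00:00:00", "progress": 1.0},
]

violations = find_cancel_violations(records, ["doc-a", "doc-b"])
```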
Classes and Tests
TestAuthorization
Purpose: Tests authorization handling of the parse_documents API.
Tests:
test_invalid_auth: Checks that requests with no auth object or with an invalid API token are rejected with code 401 (Unauthorized) and the corresponding error message.
Parameters in tests: Invalid auth instances and expected error codes/messages.
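The table of cases behind such a test might look like the sketch below, written as a plain loop so it runs without pytest. Everything here is a stand-in: FakeAuth, fake_parse_documents, the token values, and the exact message text are assumptions, and the real suite presumably feeds equivalent cases to test_invalid_auth through pytest parametrization.

```python
INVALID_API_TOKEN = "invalid_key_123"  # placeholder value, not the real constant


class FakeAuth:
    """Hypothetical auth object carrying only a token."""

    def __init__(self, token):
        self.token = token


def fake_parse_documents(auth, payload=None):
    """Stand-in for parse_documents: rejects missing or unknown tokens."""
    if auth is None or auth.token != "valid-token":
        # Stub message; only the "<Unauthorized" prefix mirrors the documented check.
        return {"code": 401, "message": "<Unauthorized (stub response)>"}
    return {"code": 0, "message": "success"}


# (invalid_auth, expected_code, expected_message_fragment)
CASES = [
    (None, 401, "<Unauthorized"),
    (FakeAuth(INVALID_API_TOKEN), 401, "<Unauthorized"),
]


def run_auth_cases():
    """Check every case and return how many were exercised."""
    for invalid_auth, expected_code, expected_fragment in CASES:
        res = fake_parse_documents(invalid_auth)
        assert res["code"] == expected_code
        assert expected_fragment in res["message"]
    return len(CASES)


checked = run_auth_cases()
```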
TestDocumentsParse
Purpose: Tests various scenarios related to parsing documents.
Tests:
test_basic_scenarios: Parameterized tests for different payload inputs, including empty doc IDs, invalid IDs, malformed input, etc., verifying the API response codes and messages.
test_parse_partial_invalid_document_id: Tests handling when some document IDs are invalid.
test_repeated_parse: Tests repeated parsing requests on the same documents.
test_duplicate_parse: Tests parsing requests with duplicate document IDs.
Parameters:
payload: Different payload inputs for parsing requests.
expected_code: Expected response code from API.
expected_message: Expected response message.
Usage: Validates robustness and correctness of the parsing API under various input conditions.
test_parse_100_files
Purpose: Tests parsing of 100 documents in a batch.
Flow:
Uploads 100 documents via bulk_upload_documents.
Triggers parsing for all documents.
Waits for completion using a decorated condition with a 100-second timeout.
Validates that parsing succeeded for all documents.
Usage: Validates scalability and performance of batch document parsing.
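The upload, parse, wait, validate flow above can be approximated end to end with an in-memory fake. This is only a sketch: FakeBackend and wait_until_parsed are hypothetical names, the fake finishes parsing instantly (the real service is asynchronous), and the "0" status is a placeholder rather than a documented value.

```python
import time


class FakeBackend:
    """Hypothetical in-memory stand-in for the upload, parse, and list calls."""

    def __init__(self):
        self.docs = {}

    def bulk_upload(self, count):
        ids = [f"doc-{i}" for i in range(count)]
        for doc_id in ids:
            # "0" is a placeholder for "not started"; it is not taken from the source.
            self.docs[doc_id] = {"id": doc_id, "run": "0", "progress": 0.0}
        return ids

    def parse(self, payload):
        # The fake completes parsing immediately, unlike the real service.
        for doc_id in payload["doc_ids"]:
            self.docs[doc_id].update(run="3", progress=1.0)
        return {"code": 0}

    def list(self):
        return list(self.docs.values())


def wait_until_parsed(backend, timeout=100.0, interval=0.01):
    """Poll until every document reports run == "3" or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if all(doc["run"] == "3" for doc in backend.list()):
            return True
        time.sleep(interval)
    return False


backend = FakeBackend()
document_ids = backend.bulk_upload(100)
response = backend.parse({"doc_ids": document_ids, "run": "1"})
finished = wait_until_parsed(backend)
```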
test_concurrent_parse
Purpose: Tests concurrent parsing of multiple documents using multiple threads.
Flow:
Uploads 100 documents.
Starts 5 worker threads to parse individual documents concurrently using ThreadPoolExecutor.
Waits for all parsing futures to complete and checks all succeeded.
Validates parsing completion.
Usage: Ensures thread safety and correctness when parsing is done concurrently.
TestDocumentsParseStop
Purpose: Tests stopping/canceling document parsing.
Tests:
test_basic_scenarios: Similar to TestDocumentsParse.test_basic_scenarios but for stopping (run="2").
test_stop_parse_partial_invalid_document_id: Tests stopping with invalid document IDs.
Note: Some tests are marked to skip, indicating incomplete or disabled tests.
Behavior:
Starts parsing (run="1"), then sends stop requests (run="2") for some or all documents.
Validates that stopped documents have run=="2" and others completed.
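The start-then-stop behavior can be modeled on a simple state map. The sketch below is an assumption-laden toy: apply_run_request and finish_running are hypothetical helpers, the backend is a dictionary, and completion is simulated explicitly instead of happening asynchronously.

```python
def apply_run_request(states, payload):
    """Apply a start ("1") or stop ("2") request to an in-memory state map."""
    for doc_id in payload["doc_ids"]:
        states[doc_id] = payload["run"]
    return {"code": 0}


def finish_running(states):
    """Simulate the backend completing every document still marked "1"."""
    for doc_id, run in states.items():
        if run == "1":
            states[doc_id] = "3"


document_ids = ["doc-1", "doc-2", "doc-3"]
states = {}

# Start parsing everything, then stop only the first two documents.
apply_run_request(states, {"doc_ids": document_ids, "run": "1"})
apply_run_request(states, {"doc_ids": document_ids[:2], "run": "2"})
finish_running(states)

stopped = [doc_id for doc_id, run in states.items() if run == "2"]
completed = [doc_id for doc_id, run in states.items() if run == "3"]
```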
Important Implementation Details
The file uses the wait_for decorator to poll the state of documents repeatedly until they reach the desired parsing status or timeout occurs. This is essential because document parsing is asynchronous.
The run field on documents indicates the parsing status:
"1" = parsing started/requested
"2" = parsing stopped/canceled
"3" = parsing completed successfully
The tests cover both positive and negative cases, including authorization failures, malformed input, partial failures, and concurrency.
The use of ThreadPoolExecutor in test_concurrent_parse tests concurrency by submitting single-document parse requests in parallel.
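The three run values listed in this section ("1", "2", "3") could be given names in test code to avoid bare string literals. The RunStatus enum below is a hypothetical convenience and not part of the tested file, which compares the raw strings directly.

```python
from enum import Enum


class RunStatus(str, Enum):
    """Names for the run values described above (hypothetical helper)."""

    RUNNING = "1"   # parsing started/requested
    CANCELED = "2"  # parsing stopped/canceled
    DONE = "3"      # parsing completed successfully


def is_finished(doc):
    """True when a document record reports successful completion."""
    return RunStatus(doc["run"]) is RunStatus.DONE


finished = is_finished({"id": "doc-1", "run": "3"})
still_running = not is_finished({"id": "doc-2", "run": "1"})
```

Since the enum mixes in str, members still compare equal to the raw strings returned by the API, so RunStatus.CANCELED == "2" holds.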
Interaction with Other Parts of the System
API Endpoints (via parse_documents): This file tests the document parsing API endpoint that accepts document IDs and a run flag to start or stop parsing.
Document Storage (via list_documents): Retrieves document metadata and status to verify parsing progress.
Authentication (RAGFlowWebApiAuth): Authenticates API requests.
Document Upload (bulk_upload_documents): Used to prepare test data by uploading documents to the knowledge base.
Configuration (INVALID_API_TOKEN): Used to simulate authentication failure scenarios.
Utilities (wait_for): Provides retry mechanism to wait for asynchronous processing.
This file serves as a critical integration test suite validating the document parsing subsystem's API correctness, error handling, and concurrency behavior.
Usage Examples
Example of triggering a document parse and validating completion in a test:
kb_id, document_ids = add_documents_func # fixture that adds documents
res = parse_documents(WebApiAuth, {"doc_ids": document_ids, "run": "1"})
assert res["code"] == 0
# Wait until parsing is done; condition is wrapped by @wait_for, so this call polls until it returns True or times out
condition(WebApiAuth, kb_id, document_ids)
# Validate parsing results
validate_document_parse_done(WebApiAuth, kb_id, document_ids)
Example of testing invalid authorization:
res = parse_documents(None)
assert res["code"] == 401
assert "<Unauthorized" in res["message"]
Mermaid Diagram: Class and Function Structure
classDiagram
class TestAuthorization {
+test_invalid_auth(invalid_auth, expected_code, expected_message)
}
class TestDocumentsParse {
+test_basic_scenarios(WebApiAuth, add_documents_func, payload, expected_code, expected_message)
+test_parse_partial_invalid_document_id(WebApiAuth, add_documents_func, payload)
+test_repeated_parse(WebApiAuth, add_documents_func)
+test_duplicate_parse(WebApiAuth, add_documents_func)
}
class TestDocumentsParseStop {
+test_basic_scenarios(WebApiAuth, add_documents_func, payload, expected_code, expected_message)
+test_stop_parse_partial_invalid_document_id(WebApiAuth, add_documents_func, payload)
}
class Functions {
+condition(_auth, _kb_id, _document_ids)
+validate_document_parse_done(auth, _kb_id, _document_ids)
+validate_document_parse_cancel(auth, _kb_id, _document_ids)
+test_parse_100_files(WebApiAuth, add_dataset_func, tmp_path)
+test_concurrent_parse(WebApiAuth, add_dataset_func, tmp_path)
}
TestAuthorization ..> Functions : uses
TestDocumentsParse ..> Functions : uses
TestDocumentsParseStop ..> Functions : uses
Summary
test_paser_documents.py is a comprehensive pytest-based test suite that validates document parsing operations in the InfiniFlow platform. It covers authorization, normal and edge case parsing scenarios, stopping parsing, concurrency, and batch processing. The file uses helper functions to assert parsing state and progress, and leverages concurrency utilities to simulate parallel requests. It acts as a crucial quality assurance layer ensuring the document parsing API meets functional and security requirements.