test_parse_documents.py
Overview
The test_parse_documents.py file contains automated tests designed to verify the functionality and robustness of the document parsing features within the InfiniFlow platform. It primarily tests the asynchronous parsing of documents in datasets, ensuring that documents move to a "DONE" state upon successful parsing and that relevant metadata is correctly populated.
This file uses the pytest framework for structuring tests and includes utility functions and decorators to handle asynchronous waiting and validation. It also performs concurrency tests to ensure the system behaves correctly under parallel parse requests.
Detailed Explanation
Imports and Dependencies
concurrent.futures.ThreadPoolExecutor, as_completed: Used for running concurrent parsing operations in threads.
pytest: Testing framework used for defining test cases and parameterization.
common.bulk_upload_documents: Utility to bulk upload test documents to a dataset.
ragflow_sdk.DataSet: SDK class representing a dataset, used to list documents and trigger parsing.
utils.wait_for: Decorator utility that repeatedly checks a condition until it is met or a timeout occurs.
Utility Functions
condition(_dataset: DataSet, _document_ids: list[str] = None) -> bool
A polling function decorated by @wait_for to repeatedly check whether the parsing of documents has completed.
Parameters:
_dataset: Instance of DataSet used to access documents.
_document_ids (optional): List of document IDs to check. If None, checks all documents in the dataset.
Returns: True if all checked documents have their run attribute equal to "DONE", otherwise False.
Usage:
This function is typically used as a condition in asynchronous wait loops to ensure that parsing completes before assertions are made.
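The polling pattern described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the wait_for implementation and its parameters (timeout, interval, error_msg) are assumptions about how such a decorator might look, and FakeDataSet and FakeDocument are hypothetical stand-ins for the SDK objects so that the example runs without a server.

```python
import time
from functools import wraps


def wait_for(timeout=10, interval=1, error_msg="Timeout"):
    """Hypothetical stand-in for utils.wait_for: retry until truthy or timed out."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            deadline = time.monotonic() + timeout
            while True:
                if func(*args, **kwargs):
                    return True
                if time.monotonic() >= deadline:
                    raise TimeoutError(error_msg)
                time.sleep(interval)
        return wrapper
    return decorator


class FakeDocument:
    """Hypothetical document exposing only the attributes the condition reads."""
    def __init__(self, doc_id, run):
        self.id = doc_id
        self.run = run


class FakeDataSet:
    """Hypothetical dataset that reports DONE once it has been listed enough times."""
    def __init__(self, doc_ids, polls_needed):
        self._doc_ids = doc_ids
        self._polls_needed = polls_needed
        self.polls = 0

    def list_documents(self, page_size=1000):
        self.polls += 1
        state = "DONE" if self.polls >= self._polls_needed else "RUNNING"
        return [FakeDocument(doc_id, state) for doc_id in self._doc_ids]


@wait_for(timeout=5, interval=0.01, error_msg="Document parsing timeout")
def condition(_dataset, _document_ids=None):
    documents = _dataset.list_documents(page_size=1000)
    if _document_ids is None:
        # No filter given: every document in the dataset must be DONE.
        return all(doc.run == "DONE" for doc in documents)
    wanted = set(_document_ids)
    return all(doc.run == "DONE" for doc in documents if doc.id in wanted)


dataset = FakeDataSet(["a", "b", "c"], polls_needed=3)
result = condition(dataset, ["a", "b"])
```

In this sketch the decorated function returns only once the underlying check is truthy, so a caller can place assertions directly after the call. If the deadline passes first, the wrapper raises instead of returning False.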
validate_document_details(dataset: DataSet, document_ids: list[str]) -> None
Validates that documents have been parsed successfully by asserting on multiple document attributes.
Parameters:
dataset: Instance of DataSet to list documents.
document_ids: List of document IDs whose details are to be validated.
Behavior:
Asserts that run == "DONE".
Checks process_begin_at has a non-empty value.
Ensures process_duration is positive.
Confirms progress is greater than zero.
Checks that "Task done" appears in the progress_msg.
Usage:
This function is called after parsing operations to verify correctness of parsing metadata.
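A minimal sketch of such a validator is shown below. It assumes documents expose the attributes listed above. The dataset is a hypothetical SimpleNamespace stand-in, and all field values (timestamps, messages, the "UNSTART" state) are made up for illustration.

```python
from types import SimpleNamespace


def validate_document_details(dataset, document_ids):
    """Assert the parsing metadata described above for the selected documents."""
    wanted = set(document_ids)
    for doc in dataset.list_documents(page_size=1000):
        if doc.id not in wanted:
            continue
        assert doc.run == "DONE"
        assert len(doc.process_begin_at) > 0
        assert doc.process_duration > 0
        assert doc.progress > 0
        assert "Task done" in doc.progress_msg


# Hypothetical documents: one fully parsed, one never started.
parsed = SimpleNamespace(
    id="doc-1",
    run="DONE",
    process_begin_at="Mon, 01 Jan 2024 00:00:00 GMT",
    process_duration=1.5,
    progress=1.0,
    progress_msg="12:00:00 Task done (1.5s)",
)
pending = SimpleNamespace(
    id="doc-2",
    run="UNSTART",
    process_begin_at="",
    process_duration=0.0,
    progress=0.0,
    progress_msg="",
)
dataset = SimpleNamespace(list_documents=lambda page_size=1000: [parsed, pending])

validate_document_details(dataset, ["doc-1"])  # passes silently

try:
    validate_document_details(dataset, ["doc-2"])
    pending_rejected = False
except AssertionError:
    pending_rejected = True
```

Filtering by ID before asserting means documents outside document_ids are ignored, which matters when only part of a dataset was submitted for parsing.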
Test Class: TestDocumentsParse
This class groups together tests related to document parsing scenarios.
Method: test_basic_scenarios(self, add_documents_func, payload, expected_message)
Description: Parametrized test covering multiple input payload scenarios for the async_parse_documents method.
Parameters:
add_documents_func: Fixture that returns a tuple (dataset, documents) with test documents added.
payload: Various payloads representing document IDs to parse; can be None, invalid IDs, empty lists, or valid IDs.
expected_message: Expected exception message substring, or an empty string if no exception is expected.
Behavior:
If payload is callable, it is executed with the list of document IDs.
If expected_message is set, asserts the async_parse_documents call raises an exception containing the message.
If no exception is expected, triggers parsing and waits for completion, then validates document details.
Marks: Some test cases are marked as p1 (priority 1) or p3 (priority 3), or skipped.
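The callable-payload convention exists because pytest evaluates parametrize arguments at collection time, before the fixture has created any documents, so real document IDs can only be supplied through a function that runs inside the test. The sketch below shows that resolution step in isolation. The helper resolve_payload and the sample cases are hypothetical and are not taken from the test file.

```python
def resolve_payload(payload, document_ids):
    """Return concrete IDs: call the payload if it is callable, else use it as-is."""
    return payload(document_ids) if callable(payload) else payload


# IDs that would only become known once the fixture has uploaded documents.
document_ids = ["doc-1", "doc-2", "doc-3"]

# Each case pairs a parametrized payload with the value it should resolve to.
cases = [
    (None, None),                            # no IDs supplied
    ([], []),                                # empty list
    (["invalid_id"], ["invalid_id"]),        # static payload, no fixture data needed
    (lambda ids: ids[:1], ["doc-1"]),        # first real document only
    (lambda ids: ids, ["doc-1", "doc-2", "doc-3"]),  # all real documents
]

resolved = [resolve_payload(payload, document_ids) for payload, _ in cases]
```

Static payloads pass through unchanged, so the same parametrized test can mix literal edge cases with payloads that depend on fixture data.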
Method: test_parse_partial_invalid_document_id(self, add_documents_func, payload)
Description: Tests the behavior when a mix of valid and invalid document IDs is passed for parsing.
Parameters:
add_documents_func: Fixture providing dataset and documents.
payload: Callable generating a payload list including "invalid_id" among valid document IDs.
Behavior:
Asserts that an exception is raised indicating the invalid document ID.
Confirms that valid documents are still parsed successfully.
Marks: Marked as priority 3.
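The expected behavior (an error for the unknown ID, with the valid documents still parsed) can be illustrated with an in-memory stand-in. FakeDataSet, its error message, and the three payload positions below are assumptions made for this sketch. They are not the SDK's actual behavior or wording.

```python
class FakeDataSet:
    """Hypothetical dataset: parses known IDs, then reports any unknown ones."""
    def __init__(self, doc_ids):
        self.state = {doc_id: "UNSTART" for doc_id in doc_ids}

    def async_parse_documents(self, document_ids):
        unknown = [doc_id for doc_id in document_ids if doc_id not in self.state]
        for doc_id in document_ids:
            if doc_id in self.state:
                self.state[doc_id] = "DONE"  # parsing is instantaneous in this fake
        if unknown:
            raise Exception(f"Documents not found: {unknown}")


# Payload builders that place the invalid ID at the start, middle, and end.
payloads = [
    lambda ids: ["invalid_id"] + ids,
    lambda ids: ids[:1] + ["invalid_id"] + ids[1:],
    lambda ids: ids + ["invalid_id"],
]

outcomes = []
for make_payload in payloads:
    dataset = FakeDataSet(["doc-1", "doc-2", "doc-3"])
    try:
        dataset.async_parse_documents(make_payload(list(dataset.state)))
        raised = False
    except Exception as exc:
        raised = "invalid_id" in str(exc)
    all_done = all(state == "DONE" for state in dataset.state.values())
    outcomes.append((raised, all_done))
```

Each outcome records both halves of the check: whether the error named the invalid ID, and whether every valid document still reached "DONE".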
Method: test_repeated_parse(self, add_documents_func)
Description: Tests that repeatedly parsing the same documents does not cause errors.
Behavior:
Parses documents once and waits.
Parses the same documents again.
No explicit validation, but no exceptions should be thrown.
Marks: Priority 3.
Method: test_duplicate_parse(self, add_documents_func)
Description: Tests parsing with duplicated document IDs in the payload.
Behavior:
Passes a list with document IDs repeated twice.
Waits for completion and validates all documents.
Marks: Priority 3.
Standalone Tests
test_parse_100_files(add_dataset_func, tmp_path)
Description: Tests parsing of a large batch (100) of documents.
Parameters:
add_dataset_func: Fixture providing a fresh dataset.
tmp_path: Temporary directory path to store test files.
Behavior:
Uses bulk_upload_documents to upload 100 documents.
Calls async_parse_documents on all document IDs.
Waits for parsing to complete with a 200-second timeout.
Validates all documents.
Marks: Priority 3.
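Preparing a batch like this usually starts with generating the files in a temporary directory. The sketch below covers only that file-generation step. The helper create_text_files and the file naming scheme are hypothetical, and the internals of bulk_upload_documents (including the upload call itself) are not shown here.

```python
import tempfile
from pathlib import Path


def create_text_files(directory, count):
    """Write `count` small, uniquely named text files and return their paths."""
    paths = []
    for i in range(count):
        path = Path(directory) / f"ragflow_test_upload_{i}.txt"
        path.write_text(f"This is test file number {i}.\n", encoding="utf-8")
        paths.append(path)
    return paths


# tempfile.TemporaryDirectory plays the role of pytest's tmp_path fixture here.
with tempfile.TemporaryDirectory() as tmp_dir:
    paths = create_text_files(tmp_dir, 100)
    created = len(paths)
    all_exist = all(path.is_file() for path in paths)
    unique_names = len({path.name for path in paths})
```

Using a per-test temporary directory keeps the generated files isolated, and they are removed automatically when the test finishes.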
test_concurrent_parse(add_dataset_func, tmp_path)
Description: Tests concurrent parsing of 100 documents using threads.
Parameters:
add_dataset_func: Fixture providing dataset.
tmp_path: Temporary directory for test files.
Behavior:
Uploads 100 documents.
Uses ThreadPoolExecutor with 5 workers to parse documents concurrently, each parsing a single document.
Waits for all futures to complete.
Validates all documents after parsing.
Marks: Priority 3.
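The concurrency pattern described above, one parse request per document submitted to a five-worker pool, can be sketched as follows. FakeDataSet is a hypothetical thread-safe stand-in for the SDK class. The real method sends a request to the server rather than updating a local dictionary.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed


class FakeDataSet:
    """Hypothetical dataset whose parse call marks documents DONE under a lock."""
    def __init__(self, doc_ids):
        self._lock = threading.Lock()
        self.state = {doc_id: "UNSTART" for doc_id in doc_ids}

    def async_parse_documents(self, document_ids):
        with self._lock:
            for doc_id in document_ids:
                self.state[doc_id] = "DONE"


document_ids = [f"doc-{i}" for i in range(100)]
dataset = FakeDataSet(document_ids)

with ThreadPoolExecutor(max_workers=5) as executor:
    # One future per document, each parsing a single-element ID list.
    futures = [
        executor.submit(dataset.async_parse_documents, document_ids=[doc_id])
        for doc_id in document_ids
    ]
    # as_completed yields futures in the order they finish, not submission order.
    responses = list(as_completed(futures))

# A future that raised keeps its exception; collect any such failures.
failed = [future for future in responses if future.exception() is not None]
all_done = all(state == "DONE" for state in dataset.state.values())
```

Checking future.exception() matters because an exception raised inside a worker thread is stored on the future and would otherwise go unnoticed unless result() is called.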
Important Implementation Details
The @wait_for decorator is crucial for handling asynchronous operations, retrying the condition function until parsing completes or the timeout occurs.
Parsing is triggered asynchronously via dataset.async_parse_documents(document_ids=...).
Validation focuses on the document's parsing state (run), timestamps, duration, progress, and progress messages.
Concurrency testing ensures thread safety and correct handling of simultaneous parse requests.
Parameterized tests with pytest allow testing various edge cases, including invalid inputs and duplicates.
Interaction with Other Components
ragflow_sdk.DataSet: The central API object representing datasets; it provides methods to list documents and trigger parsing.
common.bulk_upload_documents: Utility function to upload multiple documents into a dataset, used to prepare test data.
utils.wait_for: Provides a mechanism to wait for asynchronous operations to complete based on a condition.
Pytest Fixtures (add_documents_func, add_dataset_func): Provide test setup, including dataset creation and document uploads.
This test file ensures the correctness of the document parsing workflow in the broader InfiniFlow system.
Usage Examples
Within the tests, the typical usage flow is:
dataset, documents = add_documents_func # Setup fixture
document_ids = [doc.id for doc in documents]
# Trigger async parsing
dataset.async_parse_documents(document_ids=document_ids)
# Wait for parsing to complete
condition(dataset, document_ids)
# Validate parsing results
validate_document_details(dataset, document_ids)
Mermaid Class Diagram
classDiagram
class TestDocumentsParse {
+test_basic_scenarios(payload, expected_message)
+test_parse_partial_invalid_document_id(payload)
+test_repeated_parse()
+test_duplicate_parse()
}
class DataSet {
+list_documents(page_size)
+async_parse_documents(document_ids)
}
class UtilityFunctions {
+condition(_dataset, _document_ids)
+validate_document_details(dataset, document_ids)
}
TestDocumentsParse --> DataSet : uses
TestDocumentsParse --> UtilityFunctions : uses
Summary
The test_parse_documents.py file is a comprehensive test suite focused on validating document parsing operations within datasets in the InfiniFlow platform. It verifies normal and edge cases, error handling, concurrency, and large batch processing. The file leverages pytest parameterization and asynchronous waiting to ensure robust testing of the parsing workflow.