test_parse_documents.py

Overview

The test_parse_documents.py file contains automated tests for the document parsing features of the InfiniFlow platform. It focuses on asynchronous parsing of documents in a dataset: after a successful parse, each document should reach the "DONE" state and its parsing metadata should be populated.

The tests are structured with pytest. Two helpers support them: a polling function wrapped by the @wait_for decorator, which waits for asynchronous parsing to finish, and a validation function that checks the parsed documents. The file also includes a 100-file batch test and a concurrency test that issues parse requests in parallel.


Detailed Explanation

Imports and Dependencies


Utility Functions

condition(_dataset: DataSet, _document_ids: list[str] = None) -> bool

A polling function decorated with @wait_for, which calls it repeatedly until it reports that the given documents have finished parsing.
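The polling pattern described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual helper: the `wait_for` signature (`timeout`, `interval`, `error_message`), the `run` status attribute, and the `FakeDataSet` stand-in are all assumptions made so the example runs on its own.

```python
import functools
import time
from types import SimpleNamespace


def wait_for(timeout=10.0, interval=0.5, error_message="Timeout"):
    """Hypothetical polling decorator: re-run the predicate until it returns True."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            deadline = time.monotonic() + timeout
            while True:
                if func(*args, **kwargs):
                    return True
                if time.monotonic() >= deadline:
                    raise TimeoutError(error_message)
                time.sleep(interval)
        return wrapper
    return decorator


class FakeDataSet:
    """Stand-in dataset (assumption): documents turn DONE after a few polls."""
    def __init__(self, document_ids, polls_until_done=2):
        self._ids = list(document_ids)
        self._polls_left = polls_until_done

    def list_documents(self, page_size=30):
        status = "DONE" if self._polls_left <= 0 else "RUNNING"
        self._polls_left -= 1
        return [SimpleNamespace(id=i, run=status) for i in self._ids][:page_size]


@wait_for(timeout=5, interval=0.01, error_message="Document parsing timeout")
def condition(_dataset, _document_ids=None):
    # Return False while any relevant document is still being parsed.
    for document in _dataset.list_documents(page_size=1000):
        if _document_ids is not None and document.id not in _document_ids:
            continue
        if document.run != "DONE":
            return False
    return True


dataset = FakeDataSet(["doc-1", "doc-2"])
finished = condition(dataset, ["doc-1", "doc-2"])
```

Raising `TimeoutError` on expiry, rather than returning False, makes a stalled parse fail the test with an explicit message instead of a later, less informative assertion.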


validate_document_details(dataset: DataSet, document_ids: list[str]) -> None

Validates that documents have been parsed successfully by asserting on multiple document attributes.
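As a rough illustration of this kind of validation, the sketch below asserts on a few attributes of each parsed document. The attribute names (`run`, `progress`, `process_duration`) and the stand-in dataset are assumptions chosen for the example; the real helper may check different or additional fields.

```python
from types import SimpleNamespace


def validate_document_details(dataset, document_ids):
    """Assert that every requested document looks fully parsed.

    The checked attribute names are assumptions for illustration only.
    """
    wanted = set(document_ids)
    seen = set()
    for document in dataset.list_documents(page_size=1000):
        if document.id not in wanted:
            continue
        seen.add(document.id)
        assert document.run == "DONE", f"{document.id} is {document.run}"
        assert document.progress > 0
        assert document.process_duration > 0
    missing = wanted - seen
    assert not missing, f"documents not found: {sorted(missing)}"


# Stand-in dataset (assumption) returning two already-parsed documents.
dataset = SimpleNamespace(
    list_documents=lambda page_size=30: [
        SimpleNamespace(id="doc-1", run="DONE", progress=1.0, process_duration=0.8),
        SimpleNamespace(id="doc-2", run="DONE", progress=1.0, process_duration=1.2),
    ]
)
validate_document_details(dataset, ["doc-1", "doc-2"])
```

Checking for missing IDs as well as per-document fields guards against a validation that passes vacuously because no document matched.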


Test Class: TestDocumentsParse

This class groups together tests related to document parsing scenarios.

Method: test_basic_scenarios(self, add_documents_func, payload, expected_message)


Method: test_parse_partial_invalid_document_id(self, add_documents_func, payload)


Method: test_repeated_parse(self, add_documents_func)


Method: test_duplicate_parse(self, add_documents_func)


Standalone Tests

test_parse_100_files(add_dataset_func, tmp_path)
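A batch test of this size needs many input files. One way to generate them under a temporary directory is sketched below; `create_txt_files` is a hypothetical helper, and `tempfile.TemporaryDirectory` stands in for pytest's `tmp_path` fixture so the example runs outside pytest.

```python
import tempfile
from pathlib import Path


def create_txt_files(directory, count):
    """Hypothetical helper: write `count` small text files and return their paths."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(count):
        path = directory / f"file_{i}.txt"
        path.write_text(f"sample content {i}\n", encoding="utf-8")
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    files = create_txt_files(tmp, 100)
    file_count = len(files)
    all_exist = all(p.exists() for p in files)
```

In a pytest test, the `tmp_path` fixture would be passed in place of the temporary directory, and the resulting files would then be uploaded to the dataset before parsing is triggered.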


test_concurrent_parse(add_dataset_func, tmp_path)
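Issuing parse requests in parallel can be sketched with a thread pool, as below. `FakeDataSet` is a stand-in that only records the IDs it receives, and the worker count and the one-document-per-request split are assumptions; only the `async_parse_documents(document_ids=...)` call shape comes from the usage shown in this document.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed


class FakeDataSet:
    """Stand-in dataset (assumption) that records which IDs were submitted."""
    def __init__(self):
        self._lock = threading.Lock()
        self.parsed = []

    def async_parse_documents(self, document_ids):
        with self._lock:
            self.parsed.extend(document_ids)


def parse_concurrently(dataset, document_ids, max_workers=5):
    """Submit one parse request per document and return how many were sent."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [
            executor.submit(dataset.async_parse_documents, document_ids=[doc_id])
            for doc_id in document_ids
        ]
        # result() re-raises any exception raised inside a worker thread.
        for future in as_completed(futures):
            future.result()
    return len(futures)


dataset = FakeDataSet()
ids = [f"doc-{i}" for i in range(20)]
submitted = parse_concurrently(dataset, ids)
```

After all requests return, a test would typically wait for completion and validate the documents, as in the usage flow at the end of this document.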


Important Implementation Details


Interaction with Other Components


Usage Examples

Within the tests, the typical usage flow is:

dataset, documents = add_documents_func  # Setup fixture
document_ids = [doc.id for doc in documents]

# Trigger async parsing
dataset.async_parse_documents(document_ids=document_ids)

# Block until every document reports DONE (condition polls via @wait_for)
condition(dataset, document_ids)

# Validate parsing results
validate_document_details(dataset, document_ids)

Mermaid Class Diagram

classDiagram
    class TestDocumentsParse {
        +test_basic_scenarios(add_documents_func, payload, expected_message)
        +test_parse_partial_invalid_document_id(add_documents_func, payload)
        +test_repeated_parse(add_documents_func)
        +test_duplicate_parse(add_documents_func)
    }
    class DataSet {
        +list_documents(page_size)
        +async_parse_documents(document_ids)
    }
    class UtilityFunctions {
        +condition(_dataset, _document_ids)
        +validate_document_details(dataset, document_ids)
    }
    TestDocumentsParse --> DataSet : uses
    TestDocumentsParse --> UtilityFunctions : uses

Summary

The test_parse_documents.py file is a test suite for document parsing operations on datasets in the InfiniFlow platform. It covers basic scenarios, error handling for partially invalid document IDs, repeated and duplicate parse requests, concurrent requests, and a 100-file batch. It relies on pytest parametrization and @wait_for-based polling to exercise the asynchronous parsing workflow.