test_chunk.py
Overview
The test_chunk.py file contains an automated integration test that verifies the end-to-end workflow of creating a dataset, uploading a text document, processing (parsing) that document, monitoring the parsing progress, and finally cleaning up by deleting the created datasets. This test ensures that the document ingestion and parsing pipeline of the InfiniFlow system functions correctly when handling plain text files.
The test utilizes a set of common utility functions (imported from common) to interact with the backend services managing datasets and documents. It is designed to be run with an authenticated context (get_auth), which is likely a test fixture providing necessary authorization.
Detailed Explanation
Imports
create_dataset, list_dataset, rm_dataset, upload_file, list_document, get_docs_info, parse_docs
Functions imported from the common module, used to manipulate datasets and documents.
sleep from time
Used for polling with a delay.
default_timer from timeit, imported as timer
Used to measure elapsed time during document parsing.
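As a minimal sketch of how the two standard-library imports are typically combined, the snippet below times a short wait. The helper name timed_wait is hypothetical and does not appear in the test file.

```python
from time import sleep
from timeit import default_timer as timer


def timed_wait(delay_seconds):
    """Sleep for delay_seconds and return the measured elapsed time."""
    start = timer()          # high-resolution start timestamp
    sleep(delay_seconds)     # the same kind of pause the test uses between polls
    return timer() - start   # elapsed seconds, including the time spent sleeping


if __name__ == "__main__":
    print(f"waited {timed_wait(0.05):.3f}s")
```

The same start/stop pattern is what allows the test to report the total parsing time once polling ends.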
Function: test_parse_txt_document(get_auth)
Purpose
This function performs a full test cycle on the document parsing feature using a text file as input:
Create a new dataset.
Retrieve a list of existing datasets.
Upload a .txt file to the last dataset.
List all documents in the dataset.
Retrieve information about those documents.
Trigger parsing of the documents.
Poll for parsing completion.
Delete all datasets created or found during the test.
Parameters
get_auth
An authentication context (likely a fixture or function) that provides valid authorization tokens or credentials needed by the API calls.
Return Value
None (the function uses assertions to validate success and prints progress/status messages).
Usage Example
# Assuming a test framework like pytest and a fixture get_auth is available:
def test_document_parsing_flow(get_auth):
    test_parse_txt_document(get_auth)
Step-by-Step Workflow
Create Dataset
Calls create_dataset with the name "test_parse_txt_document".
Asserts that the response code is 0 (success).
List Dataset(s)
Retrieves datasets page by page (pagination assumed at 150 items per page) until all datasets are collected.
Collects all dataset IDs in dataset_list, and keeps the last dataset ID in dataset_id.
Upload File
Uploads ragflow_test.txt from the ../test_sdk_api/test_data/ directory to the last dataset (dataset_id).
Asserts upload success.
List Documents
Lists all documents in the targeted dataset and collects their IDs into doc_id_list.
Get Document Info
Retrieves detailed information for all documents.
Parse Documents
Starts the parsing process for all documents by calling parse_docs.
Poll for Parsing Completion
Enters a loop, repeatedly calling get_docs_info every second to check each document's progress field.
The loop exits once all documents show a progress of 1 (indicating parsing complete).
The total time taken for parsing is printed.
Delete All Datasets
Iterates over all datasets collected in dataset_list and deletes them via rm_dataset.
Asserts deletion success and prints the count of deleted datasets.
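The ordering of these steps can be sketched against an in-memory stand-in for the backend. Everything below is an assumption made for illustration: FakeBackend, its method signatures, and the response shapes are hypothetical, authentication is omitted, and pagination is skipped for brevity. The real helpers in common may take different arguments and return different structures.

```python
from itertools import count


class FakeBackend:
    """Hypothetical in-memory stand-in for the dataset/document service."""

    def __init__(self):
        self._ids = count(1)
        self.datasets = {}    # dataset_id -> list of document ids
        self.progress = {}    # doc_id -> parsing progress in [0, 1]
        self.parsing = set()  # doc ids whose parsing has been started

    def create_dataset(self, name):
        dataset_id = f"ds{next(self._ids)}"
        self.datasets[dataset_id] = []
        return {"code": 0, "data": {"id": dataset_id, "name": name}}

    def list_dataset(self):
        return {"code": 0, "data": [{"id": d} for d in self.datasets]}

    def upload_file(self, dataset_id, path):
        doc_id = f"doc{next(self._ids)}"
        self.datasets[dataset_id].append(doc_id)
        self.progress[doc_id] = 0.0
        return {"code": 0, "data": {"id": doc_id, "path": path}}

    def list_document(self, dataset_id):
        return {"code": 0, "data": [{"id": d} for d in self.datasets[dataset_id]]}

    def parse_docs(self, doc_ids):
        self.parsing.update(doc_ids)
        return {"code": 0}

    def get_docs_info(self, doc_ids):
        # Each call advances started documents by half, so polling ends quickly.
        for d in doc_ids:
            if d in self.parsing:
                self.progress[d] = min(1.0, self.progress[d] + 0.5)
        return {"code": 0,
                "data": [{"id": d, "progress": self.progress[d]} for d in doc_ids]}

    def rm_dataset(self, dataset_id):
        for d in self.datasets.pop(dataset_id):
            self.progress.pop(d, None)
            self.parsing.discard(d)
        return {"code": 0}


def run_workflow(api):
    """Run the eight steps in order; return the number of datasets deleted."""
    assert api.create_dataset("test_parse_txt_document")["code"] == 0

    dataset_list = [item["id"] for item in api.list_dataset()["data"]]
    dataset_id = dataset_list[-1]  # the last dataset listed receives the upload

    res = api.upload_file(dataset_id, "../test_sdk_api/test_data/ragflow_test.txt")
    assert res["code"] == 0

    doc_id_list = [doc["id"] for doc in api.list_document(dataset_id)["data"]]
    api.get_docs_info(doc_id_list)
    api.parse_docs(doc_id_list)

    # Poll until every document reports progress == 1 (no sleep: the fake is instant).
    while True:
        infos = api.get_docs_info(doc_id_list)["data"]
        if all(info["progress"] == 1 for info in infos):
            break

    for ds in dataset_list:
        assert api.rm_dataset(ds)["code"] == 0
    return len(dataset_list)
```

Calling run_workflow(FakeBackend()) exercises the whole sequence without a running server, which can be handy for checking the control flow in isolation.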
Important Implementation Details
Polling Mechanism:
The parsing progress is monitored by polling every second until all documents report progress == 1. This is a simple way to wait for an asynchronous operation to complete; no upper bound on the wait is described.
Pagination Handling:
The dataset listing handles pagination by checking whether the number of datasets collected so far is less than page_number * 150 (assuming 150 items per page); if it is, the last page has been reached and the loop exits.
Assertions for Validation:
After each critical API call, assertions check the response code to ensure the operation succeeded, allowing the test to fail fast on errors.
Cleanup:
The test cleans up by deleting every dataset it enumerated, which includes datasets found during listing as well as the one it created, so that no test data is left behind.
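The pagination check described above can be sketched in isolation. The names collect_dataset_ids and make_fake_lister are hypothetical, and list_page stands in for whatever call returns one page of datasets; the page size of 150 is the assumption already stated in this document.

```python
def collect_dataset_ids(list_page, page_size=150):
    """Collect IDs from every page; list_page(page_number) returns a list of dicts."""
    dataset_list = []
    page_number = 1
    while True:
        for item in list_page(page_number):
            dataset_list.append(item["id"])
        # A short (or empty) page leaves the running total below
        # page_number * page_size, which signals that nothing is left to fetch.
        if len(dataset_list) < page_number * page_size:
            break
        page_number += 1
    return dataset_list


def make_fake_lister(total, page_size=150):
    """Hypothetical stand-in for the listing call, backed by `total` fake records."""
    records = [{"id": f"ds{i}"} for i in range(total)]

    def list_page(page_number):
        start = (page_number - 1) * page_size
        return records[start:start + page_size]

    return list_page
```

One consequence of this design is that when the total is an exact multiple of the page size, one extra request is made and returns an empty page before the loop exits. The check also relies on every full page containing exactly page_size items.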
Interactions with Other Parts of the System
common module:
This file depends heavily on the common module for API interaction functions related to dataset and document management (create_dataset, list_dataset, upload_file, etc.). The test validates that these underlying APIs and their implementations work correctly.
Backend Dataset & Document Services:
The file tests the integration with backend services that manage datasets and document parsing workflows.
Authentication System:
Requires a valid authentication context (get_auth) to authorize API calls.
File System (Test Data):
Reads a local test file ragflow_test.txt from a relative path, so the file must exist in the expected location for the test to succeed.
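Because the path is relative, whether the file is found can depend on the directory the test is launched from. One possible way to make that dependency explicit is sketched below; resolve_test_file is a hypothetical helper, not something the test file is known to contain.

```python
from pathlib import Path


def resolve_test_file(base_dir, filename="ragflow_test.txt"):
    """Return the absolute path of the sample file, or raise if it is missing."""
    # Same relative layout as described above: ../test_sdk_api/test_data/<filename>
    path = (Path(base_dir) / ".." / "test_sdk_api" / "test_data" / filename).resolve()
    if not path.is_file():
        raise FileNotFoundError(f"sample file not found: {path}")
    return path
```

Passing Path(__file__).parent as base_dir would anchor the lookup to the test module's own directory instead of the current working directory, so a missing file is reported before any upload is attempted.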
Diagram: Class/Function Structure
flowchart TD
    A["test_parse_txt_document(get_auth)"]
    A --> B[create_dataset]
    A --> C[list_dataset]
    A --> D[upload_file]
    A --> E[list_document]
    A --> F[get_docs_info]
    A --> G[parse_docs]
    A --> H[rm_dataset]
    F --> I[polling loop: check parsing progress]
The main function test_parse_txt_document orchestrates calls to utility functions imported from common. It controls the entire workflow from dataset creation to cleanup.
The polling loop repeatedly calls get_docs_info to check parsing status.
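A polling loop of this kind can be sketched as follows. This is a variant rather than the loop in the test file: it adds a timeout, which is not described for the original, and the names wait_until_parsed, fetch_docs_info, interval, and timeout are all hypothetical. fetch_docs_info stands in for a call that returns one dict per document with a progress field.

```python
from time import sleep
from timeit import default_timer as timer


def wait_until_parsed(fetch_docs_info, interval=1.0, timeout=60.0):
    """Poll until every document reports progress == 1; return elapsed seconds."""
    start = timer()
    while True:
        docs = fetch_docs_info()
        if all(doc["progress"] == 1 for doc in docs):
            return timer() - start
        if timer() - start > timeout:
            raise TimeoutError(f"parsing not finished after {timeout:.0f}s")
        sleep(interval)  # wait before asking again


def make_fake_fetcher(steps):
    """Hypothetical stand-in: reports progress 0 for `steps` calls, then 1."""
    calls = {"n": 0}

    def fetch():
        calls["n"] += 1
        progress = 1 if calls["n"] > steps else 0
        return [{"id": "doc1", "progress": progress}]

    return fetch
```

For example, wait_until_parsed(make_fake_fetcher(2), interval=0.01) returns after three fetches. Bounding the wait turns a stuck parsing job into a test failure instead of a hang.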
Summary
test_chunk.py is a focused integration test script for the InfiniFlow project, validating the document ingestion and parsing pipeline with real API calls and a sample text file. It ensures the system's ability to create datasets, upload documents, parse them asynchronously, and clean up resources, providing a robust check for core document processing features.