test_chunk.py

Overview

The test_chunk.py file contains an automated integration test that verifies an end-to-end workflow: creating a dataset, uploading a text document, parsing that document, monitoring the parsing progress, and finally cleaning up by deleting every dataset the test listed. The test checks that the document ingestion and parsing pipeline of the InfiniFlow system works correctly for plain text files.

The test utilizes a set of common utility functions (imported from common) to interact with the backend services managing datasets and documents. It is designed to be run with an authenticated context (get_auth), which is likely a test fixture providing necessary authorization.

Detailed Explanation

Imports

create_dataset, list_dataset, rm_dataset, upload_file, list_document, get_docs_info, parse_docs
Functions imported from the common module used to manipulate datasets and documents.
sleep from time
Used for polling with delay.
default_timer from timeit as timer
Used to measure elapsed time during document parsing.

Function: `test_parse_txt_document(get_auth)`

Purpose

This function performs a full test cycle on the document parsing feature using a text file as input:

Create a new dataset.
Retrieve a list of existing datasets.
Upload a .txt file to the last dataset.
List all documents in the dataset.
Retrieve information about those documents.
Trigger parsing of the documents.
Poll for parsing completion.
Delete all datasets created or found during the test.

Parameters

get_auth
An authentication context (likely a fixture or function) that provides valid authorization tokens or credentials needed by the API calls.

Return Value

None (the function uses assertions to validate success and prints progress/status messages).

Usage Example

# Assuming pytest and an available get_auth fixture. pytest would normally collect
# test_parse_txt_document on its own; an explicit wrapper is only one way to invoke it.
def test_document_parsing_flow(get_auth):
    test_parse_txt_document(get_auth)

Step-by-Step Workflow

Create Dataset
Calls create_dataset with the name "test_parse_txt_document".
Asserts that the response code is 0 (success).
List Dataset(s)
Retrieves datasets page by page (pagination assumed at 150 items per page) until all datasets are collected.
Collects all dataset IDs in dataset_list, and keeps the last dataset ID in dataset_id.
Upload File
Uploads ragflow_test.txt from the ../test_sdk_api/test_data/ directory to the last dataset (dataset_id).
Asserts upload success.
List Documents
Lists all documents in the targeted dataset and collects their IDs into doc_id_list.
Get Document Info
Retrieves detailed information for all documents.
Parse Documents
Starts the parsing process for all documents by calling parse_docs.
Poll for Parsing Completion
Enters a loop, repeatedly calling get_docs_info every second to check each document's progress field.
The loop exits once all documents show a progress of 1 (indicating parsing complete).
The total time taken for parsing is printed.
Delete All Datasets
Iterates over all datasets collected in dataset_list and deletes them via rm_dataset.
Asserts deletion success and prints the count of deleted datasets.
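
The polling step in the workflow above can be sketched as follows. This is a minimal illustration, not the code in test_chunk.py: fetch_progress is a hypothetical callable standing in for a get_docs_info call plus extraction of each document's progress field, and the timeout parameter is an addition that the description does not mention.

```python
from time import sleep
from timeit import default_timer as timer


def wait_until_parsed(fetch_progress, interval=1.0, timeout=60.0):
    """Poll until every document reports progress == 1.

    fetch_progress is a hypothetical stand-in for get_docs_info: it is
    assumed to return a list with one progress value per document.
    Returns the elapsed time in seconds.
    """
    start = timer()
    while True:
        progresses = fetch_progress()
        # An empty list is treated as "not done" so the loop cannot
        # succeed vacuously when no documents are reported.
        if progresses and all(p == 1 for p in progresses):
            return timer() - start
        # The timeout is an assumption of this sketch, added so that a
        # stalled parse fails the test instead of hanging it.
        if timer() - start > timeout:
            raise TimeoutError(
                f"parsing not finished after {timeout} s: {progresses}"
            )
        sleep(interval)


# Fake backend: each call returns the next progress snapshot.
snapshots = iter([[0.0], [0.5], [1]])
elapsed = wait_until_parsed(lambda: next(snapshots), interval=0.01, timeout=5)
print(f"parsed in {elapsed:.2f} s")
```

Passing the fetch step in as a callable keeps the sketch independent of the real signature of get_docs_info, which is not shown in this document.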

Important Implementation Details

Polling Mechanism:
The test polls once per second until every document reports progress == 1. This keeps the wait logic simple for an asynchronous operation. The description here mentions no timeout, so a parse that never completes would presumably leave the test waiting.
Pagination Handling:
After each page is fetched, the loop compares the number of datasets collected so far with page_number * 150 (150 items per page is assumed). A smaller count means the most recent page was not full, so the loop stops.
Assertions for Validation:
After each critical API call, assertions check the response code to ensure the operation succeeded, allowing the test to fail fast on errors.
Cleanup:
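The test cleans up after itself by deleting every dataset it enumerated, so no test data is left behind. Because the list includes datasets found during listing as well as the one the test created, any pre-existing datasets returned by list_dataset are deleted too.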
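
The pagination check described above can be illustrated with a small self-contained sketch. The names fetch_page and make_fake_backend are hypothetical: fetch_page stands in for list_dataset, whose real signature and response shape are not shown here, and is assumed to return a list of dicts with an "id" key for a 1-based page number.

```python
PAGE_SIZE = 150  # page size stated in the description above


def collect_dataset_ids(fetch_page, page_size=PAGE_SIZE):
    """Collect dataset IDs from every page.

    fetch_page(page_number) is a hypothetical stand-in for list_dataset.
    """
    dataset_ids = []
    page_number = 1
    while True:
        page = fetch_page(page_number)
        dataset_ids.extend(item["id"] for item in page)
        # A short or empty page leaves the running total below
        # page_number * page_size, which means nothing is left to fetch.
        if len(dataset_ids) < page_number * page_size:
            break
        page_number += 1
    return dataset_ids


def make_fake_backend(total, page_size=PAGE_SIZE):
    """Build an in-memory fetch_page that serves `total` fake datasets."""
    ids = [f"ds-{i}" for i in range(total)]

    def fetch_page(page_number):
        start = (page_number - 1) * page_size
        return [{"id": x} for x in ids[start:start + page_size]]

    return fetch_page


dataset_list = collect_dataset_ids(make_fake_backend(320))
dataset_id = dataset_list[-1]  # the "last dataset" used for the upload
print(len(dataset_list), dataset_id)
```

Note that when the total is an exact multiple of the page size, this check costs one extra request that returns an empty page before the loop stops.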

Interactions with Other Parts of the System

common module:
This file depends heavily on the common module for API interaction functions related to dataset and document management (create_dataset, list_dataset, upload_file, etc.). The test validates that these underlying APIs and their implementations work correctly.
Backend Dataset & Document Services:
The file tests the integration with backend services that manage datasets and document parsing workflows.
Authentication System:
Requires a valid authentication context (get_auth) to authorize API calls.
File System (Test Data):
Reads a local test file ragflow_test.txt from a relative path, so the file must exist in the expected location for the test to succeed.
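
Because the test data path is relative, a missing file can be hard to tell apart from an upload failure. One way to make that failure explicit is to resolve and check the path before uploading. The helper below is a hypothetical sketch and not part of test_chunk.py; base_dir is an assumption, since this document does not say which directory the relative path is resolved against.

```python
from pathlib import Path

# Relative location of the sample file, as given in this document.
RELATIVE_TEST_FILE = "../test_sdk_api/test_data/ragflow_test.txt"


def resolve_test_file(base_dir, relative=RELATIVE_TEST_FILE):
    """Return the absolute path of the test file, or raise if it is missing.

    base_dir is an assumed anchor directory (for example, the directory
    that contains the test module).
    """
    path = (Path(base_dir) / relative).resolve()
    if not path.is_file():
        raise FileNotFoundError(f"test data file not found: {path}")
    return path
```

Inside a test module, a call such as resolve_test_file(Path(__file__).parent) would anchor the lookup to the module's own directory instead of the current working directory.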

Diagram: Class/Function Structure

flowchart TD
    A["test_parse_txt_document(get_auth)"]
    A --> B[create_dataset]
    A --> C[list_dataset]
    A --> D[upload_file]
    A --> E[list_document]
    A --> F[get_docs_info]
    A --> G[parse_docs]
    A --> H[rm_dataset]
    F --> I[polling loop: check parsing progress]

The main function test_parse_txt_document orchestrates calls to utility functions imported from common.
It controls the entire workflow from dataset creation to cleanup.
The polling loop repeatedly calls get_docs_info to check parsing status.

Summary

test_chunk.py is a focused integration test script for the InfiniFlow project, validating the document ingestion and parsing pipeline with real API calls and a sample text file. It ensures the system's ability to create datasets, upload documents, parse them asynchronously, and clean up resources, providing a robust check for core document processing features.
