test_chunk.py


Overview

test_chunk.py contains an automated integration test that exercises the end-to-end document workflow: create a dataset, upload a text document, trigger parsing, poll until parsing completes, and finally delete every dataset returned by the dataset listing. The test checks that the document ingestion and parsing pipeline of the InfiniFlow system works for plain text files.

The test calls helper functions imported from common to talk to the backend services that manage datasets and documents. It expects an authenticated context, get_auth, which is likely a test fixture that supplies the required authorization.


Detailed Explanation

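Imports

The helper functions create_dataset, list_dataset, upload_file, list_document, get_docs_info, parse_docs, and rm_dataset come from the common module.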


Function: test_parse_txt_document(get_auth)

Purpose

This function performs a full test cycle on the document parsing feature using a text file as input:

  1. Create a new dataset.

  2. Retrieve a list of existing datasets.

  3. Upload a .txt file to the last dataset.

  4. List all documents in the dataset.

  5. Retrieve information about those documents.

  6. Trigger parsing of the documents.

  7. Poll for parsing completion.

  8. Delete all datasets created or found during the test.

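Parameters

get_auth: the authenticated context the test runs with, likely supplied by a test fixture.

Return Value

None. The test reports failure through its assertions.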

Usage Example

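# Assuming pytest, a get_auth fixture, and test_chunk.py on the import path.
# pytest already collects test_parse_txt_document itself; the wrapper below is optional.
from test_chunk import test_parse_txt_document

def test_document_parsing_flow(get_auth):
    test_parse_txt_document(get_auth)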

Step-by-Step Workflow

  1. Create Dataset
    Calls create_dataset with the name "test_parse_txt_document".
    Asserts that the response code is 0 (success).

  2. List Dataset(s)
    Retrieves datasets page by page (pagination assumed at 150 items per page) until all datasets are collected.
    Collects all dataset IDs in dataset_list, and keeps the last dataset ID in dataset_id.

  3. Upload File
    Uploads ragflow_test.txt from the ../test_sdk_api/test_data/ directory to the last dataset (dataset_id).
    Asserts upload success.

  4. List Documents
    Lists all documents in the targeted dataset and collects their IDs into doc_id_list.

  5. Get Document Info
    Retrieves detailed information for all documents.

  6. Parse Documents
    Starts the parsing process for all documents by calling parse_docs.

  7. Poll for Parsing Completion
    Enters a loop, repeatedly calling get_docs_info every second to check each document's progress field.
    The loop exits once all documents show a progress of 1 (indicating parsing complete).
    The total time taken for parsing is printed.

  8. Delete All Datasets
    Iterates over every ID in dataset_list, which includes datasets that already existed before the test ran, and deletes each one via rm_dataset.
    Asserts that each deletion succeeds and prints the number of deleted datasets.
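
The page-by-page collection in step 2 can be sketched as below. This is a minimal illustration, not the test's actual code: fetch_page is a hypothetical stand-in for a list_dataset call, the "id" key and the short-page stop condition are assumptions, and only the 150-item page size comes from the description above.

```python
from typing import Callable, Dict, List

PAGE_SIZE = 150  # page size the test assumes, per step 2


def collect_dataset_ids(
    fetch_page: Callable[[int], List[Dict[str, str]]],
    page_size: int = PAGE_SIZE,
) -> List[str]:
    """Collect dataset IDs page by page until a short page signals the end."""
    ids: List[str] = []
    page_number = 1
    while True:
        items = fetch_page(page_number)
        ids.extend(item["id"] for item in items)
        if len(items) < page_size:
            # A page with fewer items than page_size is treated as the last one.
            break
        page_number += 1
    return ids


# Hypothetical stand-in for the backend: 320 fake datasets served 150 at a time.
_FAKE_DATASETS = [{"id": f"ds-{i}"} for i in range(320)]


def fake_fetch_page(page_number: int) -> List[Dict[str, str]]:
    start = (page_number - 1) * PAGE_SIZE
    return _FAKE_DATASETS[start:start + PAGE_SIZE]


dataset_list = collect_dataset_ids(fake_fetch_page)
dataset_id = dataset_list[-1]  # the test keeps the last ID as its upload target
```

With this stop condition, a total that is an exact multiple of the page size costs one extra, empty request before the loop ends.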


Important Implementation Details
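
The polling loop from step 7 can be sketched as follows. This is an illustration under stated assumptions rather than the test's own code: get_progress is a hypothetical stand-in for a get_docs_info call, the list-of-dicts response shape is assumed, and the timeout is an addition for safety, since the description does not say whether the original loop has one.

```python
import time
from typing import Callable, Dict, List


def wait_until_parsed(
    get_progress: Callable[[List[str]], List[Dict]],
    doc_ids: List[str],
    interval: float = 1.0,
    timeout: float = 300.0,
) -> float:
    """Poll until every document reports progress == 1; return elapsed seconds."""
    start = time.monotonic()
    while True:
        infos = get_progress(doc_ids)
        if all(info["progress"] == 1 for info in infos):
            return time.monotonic() - start
        if time.monotonic() - start >= timeout:
            # Assumed safeguard: give up instead of polling forever.
            raise TimeoutError(f"parsing did not finish within {timeout} seconds")
        time.sleep(interval)


class FakeBackend:
    """Hypothetical backend that reports completion on the third poll."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, doc_ids: List[str]) -> List[Dict]:
        self.calls += 1
        progress = 1 if self.calls >= 3 else 0.5
        return [{"id": doc_id, "progress": progress} for doc_id in doc_ids]


backend = FakeBackend()
elapsed = wait_until_parsed(backend, ["doc-1", "doc-2"], interval=0.0)
print(f"time cost {elapsed:.1f}s")
```

Using time.monotonic rather than wall-clock time keeps the measured duration unaffected by system clock adjustments.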


Interactions with Other Parts of the System


Diagram: Class/Function Structure

flowchart TD
    A["test_parse_txt_document(get_auth)"]
    A --> B[create_dataset]
    A --> C[list_dataset]
    A --> D[upload_file]
    A --> E[list_document]
    A --> F[get_docs_info]
    A --> G[parse_docs]
    A --> H[rm_dataset]
    F --> I[polling loop: check parsing progress]

Summary

test_chunk.py is a focused integration test for the InfiniFlow project that validates the document ingestion and parsing pipeline with real API calls and a sample text file. It checks that the system can create datasets, upload documents, parse them asynchronously, and clean up the datasets afterwards.

