document_app.py

Overview

document_app.py is a Flask-based web API module that manages document operations within the InfiniFlow platform. It acts as a controller layer, exposing RESTful endpoints for uploading, parsing, listing, modifying, and deleting documents associated with knowledgebases. It calls into the document, file, task, user, and knowledgebase services to coordinate document ingestion, parsing, metadata management, task queuing, and access control.

Most routes defined in this file require user authentication (via flask_login) and validate their inputs; the thumbnails, get, and get_image routes are currently served without a login check. The endpoints support file uploads and web crawling, multiple document parsers, and document lifecycle operations such as running parsing tasks, changing document status, renaming, and retrieving thumbnails or file content.

Key Features:

Upload files to knowledgebases with validation and duplication checks.
Crawl web URLs and convert HTML content into documents.
Create new virtual documents.
List and filter documents with pagination and multi-criteria filtering.
Manage document lifecycle: delete, rename, run parsing tasks, change parser type, and update metadata.
Serve document files and thumbnails.
Handle task cancellation and queueing for document processing.

Detailed Description of Endpoints and Functions

1. `upload()`

Route: /upload
Method: POST
Authentication: Required
Request Form Parameters:
- kb_id (Knowledgebase ID, required)
- file (one or multiple files to upload)
Description:
Uploads one or more documents to a specified knowledgebase after validating file presence, name length, and knowledgebase existence. Delegates upload and parsing to FileService.
Returns:
JSON result containing the list of uploaded files or error messages.

Example Usage:

curl -F "kb_id=abc123" -F "file=@<file1>" -F "file=@<file2>" https://<host>/upload
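
The validation steps described above (file presence, name length, knowledgebase ID) can be sketched without Flask as a plain function. This is a minimal illustration, not the module's actual code: the function name validate_upload, the error strings, and the 255-byte limit are assumptions made for the example.

```python
from typing import List, Optional

# Assumed limit for illustration; the real constant lives in the application's configuration.
FILE_NAME_LEN_LIMIT = 255


def validate_upload(kb_id: Optional[str], filenames: List[str]) -> Optional[str]:
    """Return an error message for an invalid upload request, or None if it passes."""
    if not kb_id:
        return 'Lack of "KB ID"'
    if not filenames:
        return "No file part!"
    for name in filenames:
        if name == "":
            return "No file selected!"
        # Measure the encoded length so multi-byte names are counted in bytes.
        if len(name.encode("utf-8")) > FILE_NAME_LEN_LIMIT:
            return f"File name must be {FILE_NAME_LEN_LIMIT} bytes or less."
    return None
```

A caller would run this check first and only hand the files to the upload service when it returns None.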

2. `web_crawl()`

Route: /web_crawl
Method: POST
Authentication: Required
Request Form Parameters:
- kb_id (Knowledgebase ID, required)
- name (Desired filename, required)
- url (Valid URL to crawl, required)
Description:
Crawls a web URL, converts the HTML content to a PDF blob, saves it to storage, creates a corresponding document entry with the appropriate parser type, and associates it with the knowledgebase.
Returns:
JSON success or error response.
Important Details:
- Uses the html2pdf utility to convert the web page to PDF.
- Ensures filename uniqueness via duplicate_name.
- Sets parser ID based on file type or extension.
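
Choosing a parser from the file type or extension could look roughly like the sketch below. The string labels stand in for the module's FileType and ParserType enums, and the extension patterns are assumptions chosen for illustration; the real rules may differ.

```python
import re


def pick_parser(filename: str, file_type: str, default_parser: str) -> str:
    """Choose a parser ID from the file type first, then from the file extension."""
    parser_id = default_parser
    # Hypothetical type labels standing in for FileType members.
    if file_type == "visual":
        parser_id = "picture"
    if file_type == "aural":
        parser_id = "audio"
    # Extension rules are applied last, so they win over the type-based choice.
    if re.search(r"\.(ppt|pptx|pages)$", filename, re.IGNORECASE):
        parser_id = "presentation"
    if re.search(r"\.eml$", filename, re.IGNORECASE):
        parser_id = "email"
    return parser_id
```

Falling back to default_parser keeps the knowledgebase's configured parser for any file that matches none of the special cases.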

3. `create()`

Route: /create
Method: POST
Authentication: Required
Request JSON:
- name (Document name, required)
- kb_id (Knowledgebase ID, required)
Description:
Creates a new virtual document (empty content, no file upload) under a knowledgebase. Validates the name, checks for duplicates, and registers the document through DocumentService and FileService.
Returns:
JSON of the created document's details or errors.

4. `list_docs()`

Route: /list
Method: POST
Authentication: Required
Request Args & JSON:
- kb_id (Knowledgebase ID, required)
- Optional filters: keywords, page, page_size, orderby, desc, create_time_from, create_time_to
- Filters in JSON body: run_status, types, suffix
Description:
Retrieves a paginated, filtered list of documents belonging to a knowledgebase, subject to access control. Rewrites stored thumbnail references into URLs that can be displayed.
Returns:
JSON with total count and list of document metadata.
Notes:
- Validates run_status and file types against allowed enums.
- Filters by creation time range if provided.
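
The filtering and pagination behaviour can be illustrated on an in-memory list. This is a sketch under stated assumptions: the real endpoint queries the database through DocumentService, and the set of run-status codes, the field names run and create_time, and the function name list_documents are all hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple

# Assumed set of run-status codes; the real values come from an enum in the codebase.
VALID_RUN_STATUS = {"0", "1", "2", "3", "4"}


def list_documents(
    docs: List[Dict[str, Any]],
    page: int = 1,
    page_size: int = 10,
    run_status: Optional[List[str]] = None,
    create_time_from: int = 0,
    create_time_to: int = 0,
) -> Tuple[int, List[Dict[str, Any]]]:
    """Filter documents, then return (total matches, one page of results)."""
    if run_status:
        invalid = set(run_status) - VALID_RUN_STATUS
        if invalid:
            raise ValueError(f"Invalid filter run status conditions: {sorted(invalid)}")
        docs = [d for d in docs if d["run"] in run_status]
    # A bound of 0 means "no bound", so each side of the range is optional.
    if create_time_from:
        docs = [d for d in docs if d["create_time"] >= create_time_from]
    if create_time_to:
        docs = [d for d in docs if d["create_time"] <= create_time_to]
    total = len(docs)
    start = (page - 1) * page_size
    return total, docs[start:start + page_size]
```

Returning the total alongside the page lets a client render page controls without a second request.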

5. `get_filter()`

Route: /filter
Method: POST
Authentication: Required
Request JSON:
- kb_id (required)
- keywords, run_status, types, suffix (optional filters)
Description:
Fetches aggregated filter information for documents in a knowledgebase to support UI filtering.
Returns:
JSON with total documents count and filter metadata.

6. `docinfos()`

Route: /infos
Method: POST
Authentication: Required
Request JSON:
- doc_ids (list of document IDs, required)
Description:
Returns detailed information for a list of documents after checking user access.
Returns:
JSON list of document dictionaries.

7. `thumbnails()`

Route: /thumbnails
Method: GET
Authentication: Not required (the login check is commented out in the source)
Query Params:
- doc_ids (multiple document IDs)
Description:
Retrieves thumbnail images for multiple documents. If thumbnails are stored externally, provides URL paths.
Returns:
JSON mapping document IDs to thumbnails (base64 or URL).

8. `change_status()`

Route: /change_status
Method: POST
Authentication: Required
Request JSON:
- doc_ids (list of document IDs, required)
- status (string "0" or "1", required)
Description:
Changes availability status of documents. Updates database and search index accordingly.
Returns:
JSON mapping doc_id to status or error messages.
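
The per-document result map can be sketched as follows. The dictionary store is a hypothetical in-memory stand-in for the database and search index, and the error strings are illustrative rather than taken from the module.

```python
from typing import Dict, List


def change_status(
    doc_ids: List[str], status: str, store: Dict[str, str]
) -> Dict[str, Dict[str, str]]:
    """Set the status of each document and report the outcome per document ID."""
    status = str(status)
    if status not in ("0", "1"):
        raise ValueError('"Status" must be either 0 or 1!')
    result: Dict[str, Dict[str, str]] = {}
    for doc_id in doc_ids:
        if doc_id not in store:
            # Record the failure and keep going so one bad ID does not abort the batch.
            result[doc_id] = {"error": "No document found!"}
            continue
        store[doc_id] = status
        result[doc_id] = {"status": status}
    return result
```

Collecting errors per ID, instead of failing on the first one, matches the described response shape of doc_id mapped to a status or an error message.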

9. `rm()`

Route: /rm
Method: POST
Authentication: Required
Request JSON:
- doc_id (single or list of document IDs)
Description:
Deletes documents with access and ownership checks. Removes associated files, tasks, and storage blobs. Cleans up knowledgebase field maps for table parser documents.
Returns:
JSON success or aggregated error messages.

10. `run()`

Route: /run
Method: POST
Authentication: Required
Request JSON:
- doc_ids (list, required)
- run (status to set, e.g., running, cancel)
- Optional delete (boolean)
Description:
Manages document processing tasks: starts, cancels, or reruns parsing tasks. Clears chunks and updates task stores as needed.
Returns:
JSON success or error.
Important Implementation:
- Uses queue_tasks to enqueue parsing jobs.
- Cancels running tasks if requested.
- Deletes task and index data if rerun with delete flag.

11. `rename()`

Route: /rename
Method: POST
Authentication: Required
Request JSON:
- doc_id (required)
- name (new name with same extension, required)
Description:
Renames a document ensuring the file extension remains the same and the name is unique within the knowledgebase.
Returns:
JSON success or error.
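
The two rename rules (same extension, unique name) are simple to express with pathlib. The function name check_rename and the error strings are assumptions for this sketch; the case-insensitive extension comparison is also a choice made here, not something the source states.

```python
from pathlib import Path
from typing import Iterable, Optional


def check_rename(old_name: str, new_name: str, existing_names: Iterable[str]) -> Optional[str]:
    """Return an error message if the rename is not allowed, otherwise None."""
    # Compare extensions case-insensitively (an assumption of this sketch).
    if Path(new_name).suffix.lower() != Path(old_name).suffix.lower():
        return "The extension of file can't be changed"
    if new_name in set(existing_names):
        return "Duplicated document name in the same knowledgebase."
    return None
```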

12. `get(doc_id)`

Route: /get/<doc_id>
Method: GET
Authentication: Not required (the login check is commented out in the source)
Description:
Serves the actual document file content from object storage with appropriate content-type headers based on file extension and type.
Returns:
Flask Response containing file bytes.
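
One way to derive the content type from the file extension and type is a lookup table with a type-dependent fallback. The table contents, the "visual" label, and the fallback pattern below are assumptions for illustration; the module's own mapping may cover different extensions.

```python
import re
from typing import Optional

# Illustrative mapping; the real module may cover more extensions.
CONTENT_TYPES = {
    "pdf": "application/pdf",
    "txt": "text/plain",
    "html": "text/html",
    "json": "application/json",
}


def content_type_for(name: str, file_type: str = "") -> Optional[str]:
    """Guess a Content-Type header value from a file name and a coarse file type."""
    match = re.search(r"\.([^.]+)$", name.lower())
    if not match:
        return None  # No extension: leave the header unset.
    ext = match.group(1)
    if ext in CONTENT_TYPES:
        return CONTENT_TYPES[ext]
    # Unknown extensions fall back to image/<ext> for visual files,
    # and application/<ext> for everything else.
    prefix = "image" if file_type == "visual" else "application"
    return f"{prefix}/{ext}"
```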

13. `change_parser()`

Route: /change_parser
Method: POST
Authentication: Required
Request JSON:
- doc_id (required)
- parser_id (new parser ID, required)
- Optional parser_config
Description:
Changes the parser type for a document. Resets progress and run status. Updates parser config if provided. Removes existing indexed data if tokens exist.
Returns:
JSON success or error.
Validation:
- Restricts unsupported parser changes for visual and presentation files.
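
A check of this kind might look like the sketch below. The type label "visual", the parser IDs "picture" and "presentation", the extension pattern, and the error string are all assumptions used to illustrate the rule, not values confirmed by the source.

```python
import re
from typing import Optional


def check_parser_change(name: str, file_type: str, new_parser_id: str) -> Optional[str]:
    """Return an error message if the parser change is not supported, otherwise None."""
    # Visual files are assumed to work only with the picture parser.
    if file_type == "visual" and new_parser_id != "picture":
        return "Not supported yet!"
    # Presentation files are assumed to work only with the presentation parser.
    is_presentation = re.search(r"\.(ppt|pptx|pages)$", name, re.IGNORECASE)
    if is_presentation and new_parser_id != "presentation":
        return "Not supported yet!"
    return None
```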

14. `get_image(image_id)`

Route: /image/<image_id>
Method: GET
Authentication: Not required (the login check is commented out in the source)
Description:
Retrieves an image from storage given an image ID of the form <bucket>-<name>.
Returns:
Flask Response with image bytes and JPEG content type.
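
Splitting such an ID into its bucket and object name can be sketched as below. The function name split_image_id is hypothetical, and rejecting IDs that do not contain exactly one hyphen is an assumption of this sketch.

```python
from typing import Tuple


def split_image_id(image_id: str) -> Tuple[str, str]:
    """Split an image ID of the form <bucket>-<name> into (bucket, name)."""
    parts = image_id.split("-")
    # Assumed rule: exactly one hyphen, so neither part may contain one.
    if len(parts) != 2:
        raise ValueError("Image not found.")
    bucket, name = parts
    return bucket, name
```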

15. `upload_and_parse()`

Route: /upload_and_parse
Method: POST
Authentication: Required
Request Form Parameters:
- conversation_id (required)
- file (one or more files)
Description:
Uploads files and initiates parsing for a given conversation context.
Returns:
JSON list of document IDs created.

16. `parse()`

Route: /parse
Method: POST
Authentication: Required
Description:
Parses content from either a given URL or uploaded files. For URLs, Selenium WebDriver (headless Chrome) fetches the page. If the page load yields more than one response, the page is treated as HTML and its text sections are parsed; otherwise the URL is treated as a file download, and the file is fetched and parsed.
Returns:
JSON with parsed text or error messages.
Important Details:
- Uses deepdoc.parser.html_parser.RAGFlowHtmlParser for HTML content parsing.
- Downloads files to a temporary directory.
- Uses FileService.parse_docs for file content parsing.

17. `set_meta()`

Route: /set_meta
Method: POST
Authentication: Required
Request JSON:
- doc_id (required)
- meta (JSON string representing a dictionary)
Description:
Sets or updates metadata fields on a document. Validates that metadata is a JSON dictionary with string, int, or float values only.
Returns:
JSON success or error.
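
The metadata check can be sketched with the standard json module. The function name parse_meta and the error strings are assumptions. Rejecting booleans is also a choice made in this sketch: in Python, bool is a subclass of int, so it would otherwise pass an int check.

```python
import json
from typing import Any, Dict


def parse_meta(raw: str) -> Dict[str, Any]:
    """Parse a JSON string and require a flat dictionary of str, int, or float values."""
    try:
        meta = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Json syntax error: {exc}") from exc
    if not isinstance(meta, dict):
        raise ValueError("Meta data should be in Json map format")
    for key, value in meta.items():
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (str, int, float)):
            raise ValueError(f"The type is not supported: {key}")
    return meta
```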

Important Implementation Details and Algorithms

Access Control:
Document access and authorization checks use DocumentService.accessible and DocumentService.accessible4deletion to ensure users can only manipulate documents they own or have rights to.
File Storage:
Files and thumbnails are stored using an abstracted storage implementation (STORAGE_IMPL). This abstraction supports operations like put, get, rm, and existence checks, enabling flexibility in backend storage.
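
To make the abstraction concrete, the sketch below defines a storage interface and an in-memory implementation. The method names follow the operations listed above, but the exact signatures, the name obj_exist for the existence check, and the InMemoryStorage class are assumptions for illustration.

```python
from typing import Dict, Protocol, Tuple


class Storage(Protocol):
    """Operations a storage backend is assumed to provide."""

    def put(self, bucket: str, name: str, blob: bytes) -> None: ...
    def get(self, bucket: str, name: str) -> bytes: ...
    def rm(self, bucket: str, name: str) -> None: ...
    def obj_exist(self, bucket: str, name: str) -> bool: ...


class InMemoryStorage:
    """Toy backend that keeps blobs in a dictionary keyed by (bucket, name)."""

    def __init__(self) -> None:
        self._objects: Dict[Tuple[str, str], bytes] = {}

    def put(self, bucket: str, name: str, blob: bytes) -> None:
        self._objects[(bucket, name)] = blob

    def get(self, bucket: str, name: str) -> bytes:
        return self._objects[(bucket, name)]  # Raises KeyError if missing.

    def rm(self, bucket: str, name: str) -> None:
        self._objects.pop((bucket, name), None)  # Removing a missing object is a no-op.

    def obj_exist(self, bucket: str, name: str) -> bool:
        return (bucket, name) in self._objects
```

Because callers depend only on the four operations, a backend such as this one can replace a real object store in tests.
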
Document Parsing and Task Management:
Document parsing is orchestrated through task queues leveraging TaskService and specialized parsers indicated by ParserType. The system supports different parsers for visual, audio, presentation, email, and table files.
Web Crawling and HTML to PDF:
The web_crawl endpoint converts web content to PDF for ingestion. The parse endpoint uses Selenium WebDriver to load pages in a headless browser, so dynamically rendered content is available before parsing.
Thumbnail Handling:
Thumbnails can be base64 encoded or stored as images accessible via URLs. The API normalizes thumbnail URLs to serve via /v1/document/image/{id} path.
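
A minimal sketch of that normalization follows. The prefix used to detect inline base64 thumbnails and the <kb_id>-<thumbnail> form of the image ID are assumptions; only the /v1/document/image/ path comes from the description above.

```python
# Assumed marker for thumbnails that are already inline base64 data.
IMG_BASE64_PREFIX = "data:image/"


def thumbnail_url(kb_id: str, thumbnail: str) -> str:
    """Leave inline or empty thumbnails alone; turn stored names into image URLs."""
    if not thumbnail or thumbnail.startswith(IMG_BASE64_PREFIX):
        return thumbnail
    # Assumed ID layout: the knowledgebase ID acts as the bucket.
    return f"/v1/document/image/{kb_id}-{thumbnail}"
```
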
Error Handling:
The API uses standardized JSON error responses with codes defined in settings.RetCode, and wraps internal errors with server_error_response.
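
The response envelope might be shaped like the sketch below. The field names code, message, and data, the numeric values of the codes, and the helper names json_result and server_error are assumptions; the real definitions live in settings.RetCode and the API utilities.

```python
from enum import IntEnum
from typing import Any, Dict


class RetCode(IntEnum):
    """Illustrative return codes; the real values are defined in settings.RetCode."""

    SUCCESS = 0
    EXCEPTION_ERROR = 100
    ARGUMENT_ERROR = 101
    DATA_ERROR = 102


def json_result(
    code: RetCode = RetCode.SUCCESS, message: str = "success", data: Any = None
) -> Dict[str, Any]:
    """Build the envelope that every endpoint is assumed to return."""
    return {"code": int(code), "message": message, "data": data}


def server_error(exc: Exception) -> Dict[str, Any]:
    """Wrap an unexpected exception in the same envelope."""
    return json_result(code=RetCode.EXCEPTION_ERROR, message=repr(exc))
```
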
Filename Duplication:
A utility duplicate_name ensures new uploads or created documents do not clash with existing names, appending underscores as needed.
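
The underscore strategy described above can be sketched as follows. This is an illustration under assumptions, not the utility's actual code: the signature (an existence-check callable plus the candidate name) and the placement of the underscore before the extension are choices made for the example.

```python
from pathlib import Path
from typing import Callable


def duplicate_name(exists: Callable[[str], bool], name: str) -> str:
    """Append underscores to the file stem until the name no longer clashes."""
    while exists(name):
        path = Path(name)
        # Insert the underscore before the extension so the file type is preserved.
        name = f"{path.stem}_{path.suffix}"
    return name
```

With the names report.pdf and report_.pdf already taken, a request for report.pdf would come back as report__.pdf, while an unused name is returned unchanged.
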

Interaction with Other Parts of the System

Database Models:
Uses File and Task ORM models for persistent storage.
Services:
Coordinates with multiple services:
- DocumentService: Core document CRUD and queries.
- FileService: File metadata and storage operations.
- TaskService: Managing parsing and processing tasks.
- KnowledgebaseService: Knowledgebase retrieval and management.
- UserTenantService: User-tenant relationships and authorization.
- File2DocumentService: Linking files to documents.
Utilities:
Uses utilities for UUID generation, filename handling, content type mapping, URL validation, HTML parsing, and server responses.
Storage Backend:
Abstracted storage layer for files and thumbnails, enabling flexible backend implementations.
Authentication:
Relies on flask_login for user session management.

Visual Diagram

classDiagram
    class DocumentApp {
        +upload()
        +web_crawl()
        +create()
        +list_docs()
        +get_filter()
        +docinfos()
        +thumbnails()
        +change_status()
        +rm()
        +run()
        +rename()
        +get(doc_id)
        +change_parser()
        +get_image(image_id)
        +upload_and_parse()
        +parse()
        +set_meta()
    }

    class DocumentService {
        +get_by_id()
        +query()
        +insert()
        +update_by_id()
        +remove_document()
        +get_by_kb_id()
        +accessible()
        +accessible4deletion()
        +get_filter_by_kb_id()
        +get_by_ids()
        +get_thumbnails()
        +clear_chunk_num_when_rerun()
        +increment_chunk_num()
        +update_parser_config()
        +count_by_kb_id()
        +get_tenant_id()
    }

    class FileService {
        +upload_document()
        +get_root_folder()
        +init_knowledgebase_docs()
        +get_kb_folder()
        +new_a_file_from_kb()
        +add_file_from_kb()
        +filter_delete()
        +get_by_id()
        +update_by_id()
        +parse_docs()
    }

    class TaskService {
        +cancel_all_task_of()
        +filter_delete()
        +queue_tasks()
    }

    class KnowledgebaseService {
        +get_by_id()
        +query()
        +delete_field_map()
    }

    class File2DocumentService {
        +get_storage_address()
        +get_by_document_id()
        +delete_by_document_id()
    }

    class UserTenantService {
        +query()
    }

    DocumentApp --> DocumentService : uses
    DocumentApp --> FileService : uses
    DocumentApp --> TaskService : uses
    DocumentApp --> KnowledgebaseService : uses
    DocumentApp --> File2DocumentService : uses
    DocumentApp --> UserTenantService : uses

Summary

document_app.py handles document lifecycle management in the InfiniFlow platform. Its APIs cover document upload, creation, listing, parsing, metadata updates, and deletion, with authentication on most routes and validation of request parameters. The module coordinates several backend services and utilities to manage knowledgebase documents, including web crawling and browser-based parsing of dynamic pages, while keeping storage and parser details behind service abstractions.
