document_app.py
Overview
document_app.py is a Flask-based web API module designed to manage document-related operations within the InfiniFlow platform. It serves as a controller layer, exposing RESTful endpoints that handle uploading, parsing, listing, modifying, and deleting documents associated with knowledgebases. This file heavily interacts with multiple backend services, including document, file, task, user, and knowledgebase services, to coordinate complex workflows like document ingestion, parsing, metadata management, task queuing, and access control.
Most API routes defined in this file require user authentication (using flask_login) and validate their inputs; the thumbnails, get, and get_image routes are the exception, since their login requirement is commented out. The endpoints support file uploads and web crawling, multiple document parsers, and document lifecycle management such as running parsing tasks, changing document status, renaming, and retrieving thumbnails or file content.
Key Features:
Upload files to knowledgebases with validation and duplication checks.
Crawl web URLs and convert HTML content into documents.
Create new virtual documents.
List and filter documents with pagination and multi-criteria filtering.
Manage document lifecycle: delete, rename, run parsing tasks, change parser type, and update metadata.
Serve document files and thumbnails.
Handle task cancellation and queueing for document processing.
Detailed Description of Endpoints and Functions
1. upload()
Route: /upload
Method: POST
Authentication: Required
Request Form Parameters:
kb_id (Knowledgebase ID, required)
file (one or multiple files to upload)
Description:
Uploads one or more documents to a specified knowledgebase after validating file presence, name length, and knowledgebase existence. Delegates upload and parsing to FileService.
Returns:
JSON result containing the list of uploaded files or error messages.
Example Usage:
curl -F "kb_id=abc123" -F "file=@<file1>" -F "file=@<file2>" https://<host>/upload
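The validation steps named in the description (file presence, name length, knowledgebase existence) can be sketched as a pure function. This is a minimal illustration, not the module's code: validate_upload, the 255-byte MAX_NAME_BYTES limit, the known_kb_ids set, and the error strings are all assumptions made for the example.

```python
from typing import Iterable, List, Set

# Hypothetical limit: the real maximum file name length is not stated in this document.
MAX_NAME_BYTES = 255


def validate_upload(kb_id: str, filenames: Iterable[str], known_kb_ids: Set[str]) -> List[str]:
    """Return a list of error messages; an empty list means the request passes."""
    errors: List[str] = []
    names = list(filenames)

    # Knowledgebase must be named and must exist.
    if not kb_id:
        errors.append("kb_id is required")
    elif kb_id not in known_kb_ids:
        errors.append(f"knowledgebase {kb_id!r} not found")

    # At least one file must be present, and every file needs a usable name.
    if not names:
        errors.append("no file part")
    for name in names:
        if not name:
            errors.append("no file selected")
        elif len(name.encode("utf-8")) > MAX_NAME_BYTES:
            errors.append(f"file name exceeds {MAX_NAME_BYTES} bytes")
    return errors
```

Collecting every error before returning, rather than stopping at the first, is a design choice of this sketch; the endpoint itself may return on the first failure.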
2. web_crawl()
Route: /web_crawl
Method: POST
Authentication: Required
Request Form Parameters:
kb_id (Knowledgebase ID, required)
name (Desired filename, required)
url (Valid URL to crawl, required)
Description:
Crawls a web URL, converts the HTML content to a PDF blob, saves it to storage, creates a corresponding document entry with the appropriate parser type, and associates it with the knowledgebase.
Returns:
JSON success or error response.
Important Details:
Uses the html2pdf utility to convert the web page to PDF.
Ensures filename uniqueness via duplicate_name.
Sets the parser ID based on file type or extension.
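The url parameter must be a valid URL, but this document does not show how the module checks it. One plausible check, using only the standard library, is sketched below; is_valid_url is a hypothetical helper, and restricting the scheme to http and https is an assumption.

```python
from urllib.parse import urlparse


def is_valid_url(url: str) -> bool:
    """Accept only absolute http(s) URLs that include a host component."""
    try:
        parsed = urlparse(url)
    except ValueError:
        # urlparse raises ValueError on some malformed inputs (e.g. bad IPv6 literals).
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

A request whose url fails such a check would be rejected before any crawling or PDF conversion starts.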
3. create()
Route: /create
Method: POST
Authentication: Required
Request JSON:
name (Document name, required)
kb_id (Knowledgebase ID, required)
Description:
Creates a new virtual document (empty content, no file upload) under a knowledgebase. Validates the name, checks for duplicates, and inserts records through DocumentService and FileService.
Returns:
JSON of the created document's details or errors.
4. list_docs()
Route: /list
Method: POST
Authentication: Required
Request Args & JSON:
kb_id (Knowledgebase ID, required)
Optional filters: keywords, page, page_size, orderby, desc, create_time_from, create_time_to
Filters in JSON body: run_status, types, suffix
Description:
Retrieves paginated and filtered documents belonging to a knowledgebase, with access control. Normalizes thumbnail URLs for display.
Returns:
JSON with total count and list of document metadata.
Notes:
Validates run_status and file types against allowed enums.
Filters by creation time range if provided.
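The combination of filtering, ordering, and pagination described above can be illustrated in memory. In the module this work is presumably done by a database query in DocumentService; the function below is only a sketch, and the document field names (name, run, suffix, create_time) and the default values are assumptions.

```python
from typing import Any, Dict, List, Optional, Tuple


def filter_and_paginate(
    docs: List[Dict[str, Any]],
    page: int = 1,
    page_size: int = 10,
    keywords: str = "",
    run_status: Optional[List[str]] = None,
    suffix: Optional[List[str]] = None,
    create_time_from: int = 0,
    create_time_to: int = 0,
    orderby: str = "create_time",
    desc: bool = True,
) -> Tuple[int, List[Dict[str, Any]]]:
    """Return (total matching documents, documents on the requested page)."""

    def keep(doc: Dict[str, Any]) -> bool:
        # Each filter applies only when the caller supplied a value for it.
        if keywords and keywords.lower() not in doc["name"].lower():
            return False
        if run_status and doc["run"] not in run_status:
            return False
        if suffix and doc["suffix"] not in suffix:
            return False
        if create_time_from and doc["create_time"] < create_time_from:
            return False
        if create_time_to and doc["create_time"] > create_time_to:
            return False
        return True

    matched = sorted((d for d in docs if keep(d)), key=lambda d: d[orderby], reverse=desc)
    start = (page - 1) * page_size
    return len(matched), matched[start:start + page_size]
```

The total is computed before slicing, so a client can render page controls from a single response.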
5. get_filter()
Route: /filter
Method: POST
Authentication: Required
Request JSON:
kb_id (required)
keywords, run_status, types, suffix (optional filters)
Description:
Fetches aggregated filter information for documents in a knowledgebase to support UI filtering.
Returns:
JSON with total documents count and filter metadata.
6. docinfos()
Route: /infos
Method: POST
Authentication: Required
Request JSON:
doc_ids (list of document IDs, required)
Description:
Returns detailed information for a list of documents after checking user access.
Returns:
JSON list of document dictionaries.
7. thumbnails()
Route: /thumbnails
Method: GET
Authentication: Not required (commented out)
Query Params:
doc_ids (multiple document IDs)
Description:
Retrieves thumbnail images for multiple documents. If thumbnails are stored externally, provides URL paths.
Returns:
JSON mapping document IDs to thumbnails (base64 or URL).
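The two thumbnail forms (inline base64 or a URL path) suggest a small normalization step. The sketch below assumes that inline thumbnails begin with a data:image/ prefix and that stored thumbnails are addressed as /v1/document/image/<kb_id>-<thumbnail>; the prefix value, the use of the knowledgebase ID as the first part of the image ID, and the name normalize_thumbnail are all assumptions.

```python
# Assumed marker for thumbnails that are already inline base64 data.
IMG_BASE64_PREFIX = "data:image/"


def normalize_thumbnail(kb_id: str, thumbnail: str) -> str:
    """Leave empty or inline thumbnails untouched; turn stored ones into a URL path."""
    if not thumbnail or thumbnail.startswith(IMG_BASE64_PREFIX):
        return thumbnail
    return f"/v1/document/image/{kb_id}-{thumbnail}"
```

With this shape, a client can use the returned value directly as an image source in both cases.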
8. change_status()
Route: /change_status
Method: POST
Authentication: Required
Request JSON:
doc_ids (list of document IDs, required)
status (string "0" or "1", required)
Description:
Changes availability status of documents. Updates database and search index accordingly.
Returns:
JSON mapping doc_id to status or error messages.
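The per-document result mapping can be sketched as follows. The update callable stands in for the database and search index updates, which are not shown in this document; apply_status, the result layout, and the error text are hypothetical.

```python
from typing import Callable, Dict, List


def apply_status(
    doc_ids: List[str],
    status: str,
    update: Callable[[str, str], None],
) -> Dict[str, Dict[str, str]]:
    """Apply a status to each document and record success or failure per ID."""
    if status not in ("0", "1"):
        raise ValueError('status must be "0" or "1"')

    result: Dict[str, Dict[str, str]] = {}
    for doc_id in doc_ids:
        try:
            update(doc_id, status)  # stand-in for the DB and index update
            result[doc_id] = {"status": status}
        except Exception as exc:
            # One failing document does not abort the rest of the batch.
            result[doc_id] = {"error": str(exc)}
    return result
```

Reporting errors per document lets a caller retry only the IDs that failed.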
9. rm()
Route: /rm
Method: POST
Authentication: Required
Request JSON:
doc_id (single or list of document IDs)
Description:
Deletes documents with access and ownership checks. Removes associated files, tasks, and storage blobs. Cleans up knowledgebase field maps for table parser documents.
Returns:
JSON success or aggregated error messages.
10. run()
Route: /run
Method: POST
Authentication: Required
Request JSON:
doc_ids (list, required)
run (status to set, e.g., running, cancel)
delete (boolean, optional)
Description:
Manages document processing tasks: starts, cancels, or reruns parsing tasks. Clears chunks and updates task stores as needed.
Returns:
JSON success or error.
Important Implementation:
Uses queue_tasks to enqueue parsing jobs.
Cancels running tasks if requested.
Deletes task and index data if rerun with delete flag.
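The three behaviours listed above can be read as a small decision procedure per document. The sketch below only returns the planned steps instead of executing them; RunAction, its string values, the step names, and plan_run are hypothetical and do not reflect the module's actual status codes or call order.

```python
from enum import Enum
from typing import List, Tuple


class RunAction(Enum):
    # Hypothetical values; the real status codes are not listed in this document.
    RUNNING = "running"
    CANCEL = "cancel"


def plan_run(doc_ids: List[str], action: RunAction, delete: bool = False) -> List[Tuple[str, str]]:
    """Return the (step, doc_id) pairs a run request would trigger."""
    steps: List[Tuple[str, str]] = []
    for doc_id in doc_ids:
        if action is RunAction.CANCEL:
            steps.append(("cancel_tasks", doc_id))
            continue
        if delete:
            # A rerun with the delete flag first clears old task and index data.
            steps.append(("delete_tasks_and_index", doc_id))
        steps.append(("queue_tasks", doc_id))
    return steps
```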
11. rename()
Route: /rename
Method: POST
Authentication: Required
Request JSON:
doc_id (required)
name (new name with same extension, required)
Description:
Renames a document ensuring the file extension remains the same and the name is unique within the knowledgebase.
Returns:
JSON success or error.
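The two rename rules (same extension, unique name within the knowledgebase) can be checked with pathlib. check_rename and its messages are hypothetical, and comparing extensions case-insensitively is an assumption of this sketch.

```python
from pathlib import Path
from typing import Optional, Set


def check_rename(old_name: str, new_name: str, existing_names: Set[str]) -> Optional[str]:
    """Return an error message if the rename is not allowed, otherwise None."""
    # Rule 1: the extension must not change (compared case-insensitively here).
    if Path(old_name).suffix.lower() != Path(new_name).suffix.lower():
        return "The file extension cannot be changed."
    # Rule 2: the new name must not collide with another document in the knowledgebase.
    if new_name in existing_names:
        return "A document with this name already exists in the knowledgebase."
    return None
```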
12. get(doc_id)
Route: /get/<doc_id>
Method: GET
Authentication: Not required (commented out)
Description:
Serves the actual document file content from object storage with appropriate content-type headers based on file extension and type.
Returns:
Flask Response containing file bytes.
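Choosing a content type from the file extension can be sketched with the standard mimetypes module. The module itself may use its own extension-to-type mapping, so this is an illustration only; guess_content_type and the application/octet-stream fallback are assumptions.

```python
import mimetypes


def guess_content_type(filename: str, default: str = "application/octet-stream") -> str:
    """Map a file name to a MIME type, falling back to a generic binary type."""
    content_type, _encoding = mimetypes.guess_type(filename)
    return content_type or default
```

The returned value would be set as the Content-Type header on the response that carries the file bytes.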
13. change_parser()
Route: /change_parser
Method: POST
Authentication: Required
Request JSON:
doc_id (required)
parser_id (new parser ID, required)
parser_config (optional)
Description:
Changes the parser type for a document. Resets progress and run status. Updates parser config if provided. Removes existing indexed data if tokens exist.
Returns:
JSON success or error.
Validation:
Restricts unsupported parser changes for visual and presentation files.
14. get_image(image_id)
Route: /image/<image_id>
Method: GET
Authentication: Not required (commented out)
Description:
Retrieves an image from storage given an image ID formatted as bucket-name.
Returns:
Flask Response with image bytes and JPEG content type.
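Splitting an image ID of the form bucket-name into its two parts might look like the sketch below. It assumes the bucket part contains no hyphen, so the split happens at the first hyphen; split_image_id and the error message are hypothetical.

```python
from typing import Tuple


def split_image_id(image_id: str) -> Tuple[str, str]:
    """Split '<bucket>-<name>' at the first hyphen into (bucket, name)."""
    bucket, sep, name = image_id.partition("-")
    if not sep or not bucket or not name:
        raise ValueError("image_id must look like '<bucket>-<name>'")
    return bucket, name
```

The resulting pair would then be passed to the storage layer's get operation to fetch the image bytes.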
15. upload_and_parse()
Route: /upload_and_parse
Method: POST
Authentication: Required
Request Form Parameters:
conversation_id (required)
file (one or more files)
Description:
Uploads files and initiates parsing for a given conversation context.
Returns:
JSON list of document IDs created.
16. parse()
Route: /parse
Method: POST
Authentication: Required
Description:
Parses content from either a given URL or uploaded files. For URLs, Selenium WebDriver (headless Chrome) is used to fetch and analyze the page content. If multiple responses are detected, parses HTML text sections. Otherwise, downloads and parses a file from the URL.
Returns:
JSON with parsed text or error messages.
Important Details:
Uses deepdoc.parser.html_parser.RAGFlowHtmlParser for HTML content parsing.
Downloads files to a temporary directory.
Uses FileService.parse_docs for file content parsing.
17. set_meta()
Route: /set_meta
Method: POST
Authentication: Required
Request JSON:
doc_id (required)
meta (JSON string representing a dictionary)
Description:
Sets or updates metadata fields on a document. Validates that metadata is a JSON dictionary with string, int, or float values only.
Returns:
JSON success or error.
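The metadata rule (a JSON dictionary whose values are strings, ints, or floats) can be sketched as below. parse_meta and its error messages are hypothetical; rejecting booleans is an assumption of this sketch, made explicit because bool is a subclass of int in Python.

```python
import json
from typing import Any, Dict


def parse_meta(meta_json: str) -> Dict[str, Any]:
    """Parse the meta string and enforce the dictionary and value-type rules."""
    try:
        meta = json.loads(meta_json)
    except json.JSONDecodeError as exc:
        raise ValueError(f"meta is not valid JSON: {exc}") from exc

    if not isinstance(meta, dict):
        raise ValueError("meta must be a JSON object")

    for key, value in meta.items():
        # bool passes isinstance(value, int), so it is excluded explicitly (assumption).
        if isinstance(value, bool) or not isinstance(value, (str, int, float)):
            raise ValueError(f"unsupported value type for {key!r}: {type(value).__name__}")
    return meta
```

For example, '{"author": "Ann", "year": 2024}' passes, while a nested list or object as a value is rejected.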
Important Implementation Details and Algorithms
Access Control:
Document access and authorization checks use DocumentService.accessible and DocumentService.accessible4deletion to ensure users can only manipulate documents they own or have rights to.
File Storage:
Files and thumbnails are stored using an abstracted storage implementation (STORAGE_IMPL). This abstraction supports operations like put, get, rm, and existence checks, enabling flexibility in backend storage.
Document Parsing and Task Management:
Document parsing is orchestrated through task queues leveraging TaskService and specialized parsers indicated by ParserType. The system supports different parsers for visual, audio, presentation, email, and table files.
Web Crawling and HTML to PDF:
The web_crawl endpoint converts web content to PDF for ingestion. The parse endpoint uses Selenium WebDriver to fetch dynamic content and parse it.
Thumbnail Handling:
Thumbnails can be base64 encoded or stored as images accessible via URLs. The API normalizes thumbnail URLs to serve via the /v1/document/image/{id} path.
Error Handling:
The API uses standardized JSON error responses with codes defined in settings.RetCode, and wraps internal errors with server_error_response.
Filename Duplication:
A utility, duplicate_name, ensures new uploads or created documents do not clash with existing names, appending underscores as needed.
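The underscore-appending behaviour described for filename duplication can be illustrated as follows. This is not the duplicate_name utility itself, whose signature is not shown in this document; dedupe_name is a hypothetical stand-in, and placing the underscore before the extension is an assumption.

```python
from pathlib import Path
from typing import Set


def dedupe_name(name: str, existing: Set[str]) -> str:
    """Append underscores to the stem until the name no longer collides."""
    candidate = name
    while candidate in existing:
        path = Path(candidate)
        # Keep the extension intact and grow the stem by one underscore.
        candidate = f"{path.stem}_{path.suffix}"
    return candidate
```

Keeping the extension unchanged matters here because the parser type is selected from the file type or extension.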
Interaction with Other Parts of the System
Database Models:
Uses File and Task ORM models for persistent storage.
Services:
Coordinates with multiple services:
DocumentService: Core document CRUD and queries.
FileService: File metadata and storage operations.
TaskService: Managing parsing and processing tasks.
KnowledgebaseService: Knowledgebase retrieval and management.
UserTenantService: User-tenant relationships and authorization.
File2DocumentService: Linking files to documents.
Utilities:
Uses utilities for UUID generation, filename handling, content type mapping, URL validation, HTML parsing, and server responses.
Storage Backend:
Abstracted storage layer for files and thumbnails, enabling flexible backend implementations.
Authentication:
Relies on flask_login for user session management.
Visual Diagram
classDiagram
class DocumentApp {
+upload()
+web_crawl()
+create()
+list_docs()
+get_filter()
+docinfos()
+thumbnails()
+change_status()
+rm()
+run()
+rename()
+get(doc_id)
+change_parser()
+get_image(image_id)
+upload_and_parse()
+parse()
+set_meta()
}
class DocumentService {
+get_by_id()
+query()
+insert()
+update_by_id()
+remove_document()
+get_by_kb_id()
+accessible()
+accessible4deletion()
+get_filter_by_kb_id()
+get_by_ids()
+get_thumbnails()
+clear_chunk_num_when_rerun()
+increment_chunk_num()
+update_parser_config()
+count_by_kb_id()
+get_tenant_id()
}
class FileService {
+upload_document()
+get_root_folder()
+init_knowledgebase_docs()
+get_kb_folder()
+new_a_file_from_kb()
+add_file_from_kb()
+filter_delete()
+get_by_id()
+update_by_id()
+parse_docs()
}
class TaskService {
+cancel_all_task_of()
+filter_delete()
+queue_tasks()
}
class KnowledgebaseService {
+get_by_id()
+query()
+delete_field_map()
}
class File2DocumentService {
+get_storage_address()
+get_by_document_id()
+delete_by_document_id()
}
class UserTenantService {
+query()
}
DocumentApp --> DocumentService : uses
DocumentApp --> FileService : uses
DocumentApp --> TaskService : uses
DocumentApp --> KnowledgebaseService : uses
DocumentApp --> File2DocumentService : uses
DocumentApp --> UserTenantService : uses
Summary
document_app.py handles document lifecycle management in the InfiniFlow platform. It exposes APIs for document upload, creation, listing, parsing, metadata setting, and deletion, with login required on most routes plus access checks and input validation. The module coordinates multiple backend services and utilities to manage knowledgebase documents, including workflows such as web crawling and parsing of dynamic pages. Its design abstracts storage and parsing details while giving callers control over documents and their processing state.