file_service.py
Overview
file_service.py is a core service module in the InfiniFlow system responsible for managing file-related operations within a multi-tenant knowledge management platform. It provides comprehensive functionality to handle files and folders stored in tenants' knowledge bases, supporting hierarchical folder structures, file metadata management, document uploads, parsing, and integration with storage backends.
The FileService class encapsulates operations such as retrieving files by folder, creating folders recursively, managing knowledge base folders, uploading and parsing documents, moving files, and interacting with linked documents. It leverages Peewee ORM for database interactions, integrates with external parsers, and uses a pluggable storage implementation for file blobs.
Classes and Methods
Class: FileService(CommonService)
Service class managing file entities and associated operations.
Attributes:
model: The Peewee model class representing the File entity in the database.
Methods
get_by_pf_id(tenant_id, pf_id, page_number, items_per_page, orderby, desc, keywords)
Retrieve a paginated list of files under a specific parent folder (identified by pf_id) for a tenant, optionally filtered by search keywords.
Parameters:
tenant_id (str): Tenant identifier.
pf_id (str): Parent folder ID to query files from.
page_number (int): Pagination page number.
items_per_page (int): Number of items per page.
orderby (str): Field name to order by.
desc (bool): Whether to order descending.
keywords (str | None): Search keywords for file names.
Returns:
Tuple (list of dict, int): List of file dictionaries with metadata, and the total file count.
Usage Example:
files, total = FileService.get_by_pf_id('tenant123', 'folder456', 1, 20, 'name', False, 'report')
Details:
Supports filtering files by keywords (case-insensitive).
Folders include additional info such as folder size and whether they contain child folders.
Files include knowledge base associations.
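The filtering, ordering, and pagination described above can be illustrated without a database. This is a minimal sketch over an in-memory list: `paginate_files` and the row layout are hypothetical, and the choices to count the total before slicing and to number pages from 1 are assumptions rather than confirmed behavior of get_by_pf_id.

```python
from typing import Optional


def paginate_files(rows, page_number, items_per_page, orderby, desc, keywords: Optional[str]):
    """Filter, sort, and paginate rows; return (page, total)."""
    if keywords:
        needle = keywords.lower()
        # Case-insensitive substring match on the file name.
        rows = [r for r in rows if needle in r["name"].lower()]
    total = len(rows)  # assumption: total is counted after filtering, before slicing
    rows = sorted(rows, key=lambda r: r[orderby], reverse=desc)
    start = (page_number - 1) * items_per_page  # assumption: pages are 1-based
    return rows[start:start + items_per_page], total


rows = [
    {"name": "Report-Q1.pdf", "size": 10},
    {"name": "notes.txt", "size": 3},
    {"name": "annual_report.docx", "size": 7},
]
page, total = paginate_files(rows, 1, 1, "name", False, "report")
```

Here two rows match the keyword, and the page holds only the first of them in name order.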
get_kb_id_by_file_id(file_id)
Fetches knowledge base IDs and names linked to a given file ID.
Parameters:
file_id(str): File identifier.
Returns:
List of dicts with keys "kb_id" and "kb_name".
Details:
Joins File, File2Document, Document, and Knowledgebase tables.
Returns empty list if no knowledge base associations exist.
get_by_pf_id_name(id, name)
Retrieve a file record by parent folder ID and file name.
Parameters:
id (str): Parent folder ID.
name (str): File name.
Returns:
File model instance or None if not found.
get_id_list_by_id(id, name, count, res)
Recursively traverses folder names to return a list of file IDs.
Parameters:
id (str): Starting folder ID.
name (list of str): List of folder names to traverse.
count (int): Current depth index.
res (list): Accumulates file IDs.
Returns:
List of file IDs traversed.
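The traversal can be pictured as following one path segment per recursive call. A minimal sketch, assuming an in-memory lookup keyed by (parent ID, name); the `FOLDERS` table and the `get_id_list` helper are hypothetical stand-ins for the database query, and stopping at the first missing segment is an assumption.

```python
# Hypothetical folder table: (parent_id, name) -> id
FOLDERS = {
    ("root", "projects"): "f1",
    ("f1", "2024"): "f2",
    ("f2", "reports"): "f3",
}


def get_id_list(folder_id, names, count, res):
    """Follow names[count:] downward from folder_id, collecting each ID visited."""
    if count < len(names):
        child_id = FOLDERS.get((folder_id, names[count]))
        if child_id is None:
            return res  # path ends here; return what was resolved so far
        res.append(child_id)
        return get_id_list(child_id, names, count + 1, res)
    return res


ids = get_id_list("root", ["projects", "2024", "missing"], 0, ["root"])
```

The caller seeds `res` (here with the starting ID), and the result holds only the segments that could be resolved.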
get_all_innermost_file_ids(folder_id, result_ids)
Retrieve IDs of all files in the deepest nested folders starting from folder_id.
Parameters:
folder_id (str): Folder ID to start traversal.
result_ids (list): List to accumulate file IDs.
Returns:
List of file IDs in innermost folders.
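One way to read "innermost" is "entries that have no children". The sketch below takes that reading, which is an assumption; the `children` mapping and the `innermost_ids` helper are hypothetical and stand in for the parent_id query.

```python
def innermost_ids(children, node_id, result_ids):
    """Collect IDs of entries below node_id that have no children of their own."""
    kids = children.get(node_id, [])
    if kids:
        for kid in kids:
            innermost_ids(children, kid, result_ids)
    else:
        result_ids.append(node_id)  # a leaf: nothing lists it as a parent
    return result_ids


# Hypothetical hierarchy: parent ID -> child IDs
children = {"root": ["a", "b"], "a": ["a1", "a2"]}
leaves = innermost_ids(children, "root", [])
```

Under this reading, an empty starting folder yields its own ID.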
create_folder(file, parent_id, name, count)
Recursively creates a folder hierarchy as per the given list of folder names.
Parameters:
file (File): Current file/folder object.
parent_id (str): Parent folder ID.
name (list of str): List of folder names to create.
count (int): Current index in folder names list.
Returns:
Created File object representing the last folder created.
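The recursion creates one level per call and passes the new folder down as the next parent. A minimal sketch, assuming a dict as the file table; `create_folders` and the row fields are hypothetical, and this version creates every name in the list, which may differ from how the real method treats the final element.

```python
import uuid


def create_folders(table, current, parent_id, names, count):
    """Create names[count:] as a nested chain under parent_id; return the last folder."""
    if count >= len(names):
        return current  # nothing left to create
    folder = {
        "id": uuid.uuid4().hex,
        "parent_id": parent_id,
        "name": names[count],
        "type": "folder",
    }
    table[folder["id"]] = folder
    # The folder just created becomes the parent of the next level.
    return create_folders(table, folder, folder["id"], names, count + 1)


table = {}
last = create_folders(table, None, "root", ["a", "b", "c"], 0)
```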
is_parent_folder_exist(parent_id)
Checks whether a folder with the given ID exists.
Parameters:
parent_id(str): Folder ID.
Returns:
bool: True if folder exists, else False.
Additional Behavior:
If folder does not exist, deletes folder reference by parent ID to maintain data integrity.
get_root_folder(tenant_id)
Retrieves or creates the root folder for a given tenant.
Parameters:
tenant_id(str): Tenant identifier.
Returns:
Dictionary representing the root folder.
Details:
Root folder is identified by parent_id == id.
Creates a root folder if none exists.
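The self-parent convention makes the root easy to find and to create on demand. A minimal get-or-create sketch, assuming a dict as the file table; the row fields and the "/" name are illustrative assumptions.

```python
import uuid


def get_root_folder(table, tenant_id):
    """Return the tenant's root folder, creating it on first use."""
    for row in table.values():
        if row["tenant_id"] == tenant_id and row["parent_id"] == row["id"]:
            return row  # a root is its own parent
    file_id = uuid.uuid4().hex
    root = {
        "id": file_id,
        "parent_id": file_id,  # self-reference marks the root
        "tenant_id": tenant_id,
        "name": "/",  # assumed display name
        "type": "folder",
    }
    table[file_id] = root
    return root


table = {}
first = get_root_folder(table, "tenant123")
second = get_root_folder(table, "tenant123")
```

The second call finds the row created by the first, so each tenant ends up with one root.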
get_kb_folder(tenant_id)
Fetches the knowledge base folder under the tenant's root folder, creating it if necessary.
Parameters:
tenant_id(str): Tenant identifier.
Returns:
Dictionary representing the knowledge base folder.
new_a_file_from_kb(tenant_id, name, parent_id, ty=FileType.FOLDER.value, size=0, location="")
Creates a new file record linked to a knowledge base folder.
Parameters:
tenant_id (str): Tenant identifier.
name (str): File or folder name.
parent_id (str): Parent folder ID.
ty (int): File type (default is folder).
size (int): File size.
location (str): Storage location string.
Returns:
Dictionary representing the created file.
init_knowledgebase_docs(root_id, tenant_id)
Initializes knowledge base documents under the root folder for a tenant.
Parameters:
root_id (str): Root folder ID.
tenant_id (str): Tenant identifier.
Details:
Creates knowledge base folder if missing.
For each knowledge base, creates a folder and adds associated documents.
get_parent_folder(file_id)
Retrieves the parent folder of a given file.
Parameters:
file_id(str): File ID.
Returns:
Parent File object.
Raises:
RuntimeError if file or parent does not exist.
get_all_parent_folders(start_id)
Retrieves all parent folders in the hierarchy for a given file ID.
Parameters:
start_id(str): Starting file ID.
Returns:
List of parent File objects up to the root.
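Walking upward relies on the same parent_id == id convention that marks the root. A minimal sketch over a dict table; `all_parent_folders` is a hypothetical helper, and including the starting entry in the result is an assumption.

```python
def all_parent_folders(table, start_id):
    """Collect entries from start_id up to the root (where parent_id == id)."""
    chain = []
    current = table.get(start_id)
    while current is not None:
        chain.append(current)
        if current["parent_id"] == current["id"]:
            break  # reached the root
        current = table.get(current["parent_id"])
    return chain


table = {
    "r": {"id": "r", "parent_id": "r"},
    "a": {"id": "a", "parent_id": "r"},
    "b": {"id": "b", "parent_id": "a"},
}
chain = all_parent_folders(table, "b")
```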
insert(file)
Inserts a new file record into the database.
Parameters:
file(dict): File metadata.
Returns:
Created File instance.
Raises:
RuntimeError on database failure.
delete(file)
Deletes a file record by its object.
Parameters:
file(File): File object.
Returns:
Result of delete operation.
delete_by_pf_id(folder_id)
Deletes files where parent_id equals folder_id.
Parameters:
folder_id(str): Folder ID.
Returns:
Number of deleted records.
delete_folder_by_pf_id(user_id, folder_id)
Recursively deletes folder and all subfolders/files for a user.
Parameters:
user_id (str): Tenant/user ID.
folder_id (str): Folder ID to delete.
Returns:
Tuple with delete result count.
Raises:
RuntimeError on database errors.
get_file_count(tenant_id)
Returns the total number of files for a tenant.
Parameters:
tenant_id(str): Tenant ID.
Returns:
Integer count of files.
get_folder_size(folder_id)
Calculates the cumulative size of a folder (including nested files/folders).
Parameters:
folder_id(str): Folder ID.
Returns:
Integer size in bytes.
Algorithm:
Uses depth-first search recursion to sum sizes.
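The depth-first sum can be sketched as follows, assuming a dict as the file table with `type` and `size` fields; the `folder_size` helper and the row layout are hypothetical.

```python
def folder_size(table, folder_id):
    """Sum file sizes under folder_id with a depth-first walk."""
    total = 0
    for row in table.values():
        if row["parent_id"] != folder_id or row["id"] == folder_id:
            continue  # not a child, or the root pointing at itself
        if row["type"] == "folder":
            total += folder_size(table, row["id"])  # descend before moving on
        else:
            total += row["size"]
    return total


table = {
    "r": {"id": "r", "parent_id": "r", "type": "folder", "size": 0},
    "d": {"id": "d", "parent_id": "r", "type": "folder", "size": 0},
    "f1": {"id": "f1", "parent_id": "r", "type": "file", "size": 100},
    "f2": {"id": "f2", "parent_id": "d", "type": "file", "size": 40},
}
size = folder_size(table, "r")
```

The guard against a row that is its own parent matters for the root folder, which would otherwise recurse forever.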
add_file_from_kb(doc, kb_folder_id, tenant_id)
Adds a file entry linked to a knowledge base document.
Parameters:
doc (dict): Document metadata.
kb_folder_id (str): Knowledge base folder ID.
tenant_id (str): Tenant ID.
Details:
Avoids duplicates by checking existing links.
Creates File and File2Document entries.
move_file(file_ids, folder_id)
Moves multiple files to a new parent folder.
Parameters:
file_ids (list of str): File IDs to move.
folder_id (str): Destination folder ID.
Raises:
RuntimeError on database errors.
upload_document(kb, file_objs, user_id)
Uploads multiple document files to a knowledge base, processes and stores them.
Parameters:
kb (Knowledgebase): Knowledge base object.
file_objs (list): List of uploaded file objects (e.g., Flask FileStorage).
user_id (str): ID of uploading user.
Returns:
Tuple (list of error messages, list of (document, blob) tuples)
Details:
Checks document health and duplicates.
Determines file type and parser.
Reads file blob; handles PDF repair.
Stores file blob and thumbnail images.
Inserts document and file metadata.
Uses STORAGE_IMPL for blob storage.
parse_docs(file_objs, user_id)
Parses multiple document files concurrently.
Parameters:
file_objs (list): List of files to parse.
user_id (str): User ID.
Returns:
Concatenated string result of parsed content.
Implementation:
Uses ThreadPoolExecutor with a maximum of 12 workers.
Calls FileService.parse for each file asynchronously.
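The fan-out pattern can be sketched with the standard library alone. `parse_one` is a hypothetical stand-in for the per-file parser, and joining the results with a newline in submission order is an assumption about how the pieces are concatenated.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_one(filename, blob):
    """Hypothetical stand-in for a per-file parser."""
    return f"{filename}: {blob.decode('utf-8')}"


def parse_all(files, max_workers=12):
    """Parse files concurrently and join the results in submission order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(parse_one, name, blob) for name, blob in files]
        # Reading futures in list order keeps output order stable
        # even when the tasks finish out of order.
        return "\n".join(f.result() for f in futures)


text = parse_all([("a.txt", b"alpha"), ("b.txt", b"beta")])
```

Calling `result()` re-raises any exception from the worker, so a single failing file would abort the whole call in this sketch.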
parse(filename, blob, img_base64=True, tenant_id=None)
Parses a single document blob into text or base64 image.
Parameters:
filename (str): File name.
blob (bytes): File content.
img_base64 (bool): If True and the file is a visual file, return a base64 image.
tenant_id (str | None): Tenant ID fallback.
Returns:
Parsed string content or base64 encoded image string.
Details:
Uses different parsers based on file type (presentation, picture, audio, email).
Falls back to naive parser if unknown.
For visual files with img_base64=True, converts to base64 using GptV4.
get_parser(doc_type, filename, default)
Determines parser type based on document type and file extension.
Parameters:
doc_type (int): Document file type.
filename (str): File name.
default (int | str): Default parser type.
Returns:
Parser type constant.
Details:
Maps visual types to picture parser.
Maps audio types to audio parser.
Uses regex to detect presentation and email file extensions.
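The selection rules above amount to a short decision chain. In this sketch the type constants, the parser names, and the extension lists in the regular expressions are all assumptions chosen for illustration; the real constants live in the project's own enums.

```python
import re

# Hypothetical constants standing in for the project's file and parser types.
VISUAL, AURAL, OTHER = "visual", "aural", "other"


def pick_parser(doc_type, filename, default):
    """Choose a parser name from the document type, then the file extension."""
    if doc_type == VISUAL:
        return "picture"
    if doc_type == AURAL:
        return "audio"
    # Assumed extension lists; matched at the end of the name, ignoring case.
    if re.search(r"\.(ppt|pptx)$", filename, re.IGNORECASE):
        return "presentation"
    if re.search(r"\.eml$", filename, re.IGNORECASE):
        return "email"
    return default


choice = pick_parser(OTHER, "deck.PPTX", "naive")
```

Checking the document type first means a visual or audio file never reaches the extension tests.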
get_blob(user_id, location)
Retrieves binary blob from storage for user downloads.
Parameters:
user_id (str): User ID.
location (str): Storage location key.
Returns:
Binary content blob.
put_blob(user_id, location, blob)
Stores binary blob in user downloads storage.
Parameters:
user_id (str): User ID.
location (str): Storage location key.
blob (bytes): Binary content.
Returns:
Result from storage put operation.
Important Implementation Details and Algorithms
Database Layer: Uses Peewee ORM with the DB.connection_context() decorator, ensuring connection management for all DB operations.
Folder Hierarchy: Recursive methods (get_id_list_by_id, get_all_innermost_file_ids, create_folder) walk or build folder structures, enabling nested folder support.
File Size Calculation: Recursive DFS sums sizes of files within nested folders (get_folder_size).
Document Uploads:
Handles duplicate file naming with duplicate_name.
Supports PDF repair via read_potential_broken_pdf.
Generates thumbnails for supported files.
Uses pluggable storage backend (STORAGE_IMPL) for file blobs.
Enforces file type support and selects parser accordingly.
Parsing:
Uses a factory pattern to select parsers based on file type.
Supports concurrent parsing with thread pool for efficiency.
For images, optionally returns base64 encoded representation via GPT-4 CV model.
Error Handling:
Raises RuntimeError on database failures.
Logs exceptions during folder deletion and file moves.
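Duplicate-name handling, listed under Document Uploads above, usually means deriving a free name from the requested one. The sketch below shows one common scheme, a numeric suffix before the extension; `unique_name` is hypothetical, and the suffix format is an assumption rather than the behavior of duplicate_name.

```python
import os


def unique_name(name, existing):
    """Return name, or name with a numeric suffix, so it is not in existing."""
    if name not in existing:
        return name
    stem, ext = os.path.splitext(name)
    n = 1
    while f"{stem}({n}){ext}" in existing:
        n += 1  # try the next suffix until a free name is found
    return f"{stem}({n}){ext}"


taken = {"report.pdf", "report(1).pdf"}
fresh = unique_name("report.pdf", taken)
```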
Interaction with Other System Components
Database Models:
Interacts heavily with File, Document, File2Document, Knowledgebase models.
Uses services like DocumentService and File2DocumentService for document and file-document link management.
Storage:
Uses STORAGE_IMPL abstraction for blob storage operations (put/get/exist).
Stores files and thumbnails in tenant-specific storage buckets.
Parsing Modules:
Imports parsers for audio, email, picture, presentation, and a naive fallback from rag.app.
Uses GPT-4 CV model for image to base64 conversion.
Authentication:
Uses flask_login.current_user to determine current user for ownership and tenant context.
File Utilities:
Uses utility functions for filename type detection, PDF repair, and thumbnail generation.
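The put/get/exist operations listed under Storage suggest a small interface that any backend can satisfy. The sketch below is a hedged illustration of that idea: `BlobStorage`, `InMemoryStorage`, `store_once`, and the method names and signatures are all hypothetical and are not the actual STORAGE_IMPL API.

```python
from typing import Dict, Protocol, Tuple


class BlobStorage(Protocol):
    """Hypothetical interface; names and signatures are assumptions."""

    def put(self, bucket: str, key: str, blob: bytes) -> None: ...
    def get(self, bucket: str, key: str) -> bytes: ...
    def exist(self, bucket: str, key: str) -> bool: ...


class InMemoryStorage:
    """Toy backend that keeps blobs in a dict keyed by (bucket, key)."""

    def __init__(self) -> None:
        self._blobs: Dict[Tuple[str, str], bytes] = {}

    def put(self, bucket: str, key: str, blob: bytes) -> None:
        self._blobs[(bucket, key)] = blob

    def get(self, bucket: str, key: str) -> bytes:
        return self._blobs[(bucket, key)]

    def exist(self, bucket: str, key: str) -> bool:
        return (bucket, key) in self._blobs


def store_once(storage: BlobStorage, bucket: str, key: str, blob: bytes) -> bool:
    """Write blob only if the key is free; return True when a write happened."""
    if storage.exist(bucket, key):
        return False
    storage.put(bucket, key, blob)
    return True


storage = InMemoryStorage()
written = store_once(storage, "tenant123", "a.pdf", b"%PDF")
```

Using the tenant ID as the bucket name mirrors the tenant-specific buckets mentioned above; callers depend only on the three methods, so a different backend can be swapped in without changing them.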
Usage Example - Uploading Documents
from api.db.models import Knowledgebase
from api.db.services.file_service import FileService
# Assume kb is a Knowledgebase object, files is a list of uploaded FileStorage objects, and user_id is current user ID
errors, uploaded_files = FileService.upload_document(kb, files, user_id)
if errors:
    print("Some files failed to upload:", errors)
else:
    print("All files uploaded successfully.")
Mermaid Class Diagram
classDiagram
class FileService {
<<CommonService>>
+model: File
+get_by_pf_id(tenant_id, pf_id, page_number, items_per_page, orderby, desc, keywords)
+get_kb_id_by_file_id(file_id)
+get_by_pf_id_name(id, name)
+get_id_list_by_id(id, name, count, res)
+get_all_innermost_file_ids(folder_id, result_ids)
+create_folder(file, parent_id, name, count)
+is_parent_folder_exist(parent_id)
+get_root_folder(tenant_id)
+get_kb_folder(tenant_id)
+new_a_file_from_kb(tenant_id, name, parent_id, ty, size, location)
+init_knowledgebase_docs(root_id, tenant_id)
+get_parent_folder(file_id)
+get_all_parent_folders(start_id)
+insert(file)
+delete(file)
+delete_by_pf_id(folder_id)
+delete_folder_by_pf_id(user_id, folder_id)
+get_file_count(tenant_id)
+get_folder_size(folder_id)
+add_file_from_kb(doc, kb_folder_id, tenant_id)
+move_file(file_ids, folder_id)
+upload_document(kb, file_objs, user_id)
+parse_docs(file_objs, user_id)
+parse(filename, blob, img_base64, tenant_id)
+get_parser(doc_type, filename, default)
+get_blob(user_id, location)
+put_blob(user_id, location, blob)
}
Summary
file_service.py implements a robust file management service that supports multi-tenant hierarchical file storage integrated with knowledge bases in InfiniFlow. It provides CRUD operations for files and folders, document upload and parsing capabilities, and connects tightly with other database services and storage backends. The design emphasizes recursive folder handling, extensible parsing strategies, and seamless interaction with knowledge base documents.
This file plays a critical role in enabling users to organize, upload, parse, and manage knowledge documents efficiently within their tenant spaces, serving as a backbone for file-related workflows in the platform.