file_utils.py
Overview
file_utils.py is a utility module designed to support file handling operations within the InfiniFlow project. Its primary focus lies in:
Determining file paths relative to the project and deployment environments.
Loading, dumping, and rewriting configuration files in JSON and YAML formats.
Identifying file types based on filename extensions.
Generating thumbnails for various file types (PDFs, images, PowerPoint presentations).
Traversing directory trees to yield file paths.
Repairing potentially corrupted PDF files using Ghostscript.
Safely reading PDF files with automatic repair attempts when needed.
The module provides essential abstraction and convenience functions that facilitate consistent file management, configuration handling, and basic file content processing across the system.
Detailed Descriptions
Constants and Global Variables
PROJECT_BASE: Base directory of the project, derived from environment variables or inferred from the module location.
RAG_BASE: Base directory for RAG deployment, similarly derived.
LOCK_KEY_pdfplumber: A global threading lock key to ensure thread-safe operations when using pdfplumber (used for PDF processing).
Functions
get_project_base_directory(*args)
Returns the absolute path to the project base directory, optionally appending additional path components.
Parameters:
*args (str): Optional additional path segments to append.
Returns:
str: Absolute path to the project base or the appended subpath.
Usage Example:
config_path = get_project_base_directory("configs", "settings.json")
Implementation Notes:
Caches the base directory in the global PROJECT_BASE variable.
Falls back to the directory two levels above the current file if environment variables are not set.
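The lookup order described in these notes can be sketched as follows. This is a minimal illustration, not the module's actual code: the environment variable name EXAMPLE_PROJECT_BASE is a placeholder invented for the sketch, since the real variable name is not given here.

```python
import os

# Module-level cache, playing the role of PROJECT_BASE described above.
PROJECT_BASE = None

# Hypothetical environment variable name, used only in this sketch.
ENV_VAR = "EXAMPLE_PROJECT_BASE"


def get_project_base_directory(*args):
    """Return the project base directory, optionally joined with extra segments."""
    global PROJECT_BASE
    if PROJECT_BASE is None:
        base = os.getenv(ENV_VAR)
        if not base:
            # Fallback: two levels above the directory that contains this file.
            here = os.path.dirname(os.path.realpath(__file__))
            base = os.path.abspath(os.path.join(here, os.pardir, os.pardir))
        PROJECT_BASE = base  # cached for every later call
    if args:
        return os.path.join(PROJECT_BASE, *args)
    return PROJECT_BASE
```

Because the result is cached in a module-level variable, changing the environment variable after the first call has no effect in this sketch.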
get_rag_directory(*args)
Returns the absolute path to the RAG base directory, optionally appending additional path components.
Parameters:
*args (str): Optional additional path segments to append.
Returns:
str: Absolute path to the RAG base or appended path.
Usage Example:
rag_python_path = get_rag_directory("python", "scripts")
Implementation Notes:
Similar to get_project_base_directory, but goes three levels above the current file location.
get_rag_python_directory(*args)
Shortcut to get the RAG Python directory under the RAG base directory.
Parameters:
*args (str): Optional additional path segments.
Returns:
str: Path to <RAG_BASE>/python plus optional subdirectories.
get_home_cache_dir()
Returns the path to a home cache directory named .ragflow inside the user's home directory, creating it if it doesn't exist.
Returns:
str: Absolute path to the cache directory.
Usage Example:
cache_dir = get_home_cache_dir()
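A minimal sketch of the create-if-missing behaviour, assuming the directory is resolved with os.path.expanduser. The optional home parameter is hypothetical and exists only so the sketch can be tried against a scratch directory instead of the real home directory.

```python
import os


def get_home_cache_dir(home=None):
    """Return <home>/.ragflow, creating the directory when it is missing."""
    # `home` is a hypothetical parameter for this sketch; by default the
    # current user's home directory is used.
    base = home if home is not None else os.path.expanduser("~")
    cache_dir = os.path.join(base, ".ragflow")
    # exist_ok=True makes repeated calls safe.
    os.makedirs(cache_dir, exist_ok=True)
    return cache_dir
```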
load_json_conf(conf_path)
Loads a JSON configuration file with caching support (LRU cache size 10).
Parameters:
conf_path (str): Absolute or relative path to the JSON config file.
Returns:
dict: Parsed JSON content.
Raises:
EnvironmentError if loading fails.
Caching:
Uses cachetools.cached with an LRU cache, so repeated calls with the same path return the cached result instead of re-reading the file.
Usage Example:
config = load_json_conf("configs/settings.json")
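The caching behaviour can be illustrated with the standard library. The sketch below swaps in functools.lru_cache for cachetools.cached so that it runs without third-party packages, and it leaves out the resolution of relative paths against the project base.

```python
import functools
import json


@functools.lru_cache(maxsize=10)  # stand-in for cachetools.cached with an LRU cache
def load_json_conf(conf_path):
    """Parse a JSON file once per distinct path; later calls reuse the result."""
    try:
        with open(conf_path, encoding="utf-8") as f:
            return json.load(f)
    except Exception as exc:
        # Failures are not cached, so a later call retries the read.
        raise EnvironmentError(f"loading json config from '{conf_path}' failed") from exc
```

Two consequences of this design are worth noting: edits made to the file after the first load are not picked up, and every caller receives the same dict object, so mutating it affects later callers.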
dump_json_conf(config_data, conf_path)
Dumps a Python dictionary as formatted JSON into the specified file.
Parameters:
config_data (dict): Data to write.
conf_path (str): Target JSON file path.
Raises:
EnvironmentError if writing fails.
load_json_conf_real_time(conf_path)
Loads JSON configuration from file every time it’s called (no caching).
Parameters:
conf_path (str): Path to JSON config.
Returns:
dict: Parsed JSON content.
Raises:
EnvironmentError if loading fails.
load_yaml_conf(conf_path)
Loads a YAML configuration file in safe mode.
Parameters:
conf_path (str): Absolute or relative path to YAML file.
Returns:
Parsed YAML content as Python objects.
Raises:
EnvironmentError if loading fails.
Implementation Detail:
Uses ruamel.yaml with typ='safe' and pure=True for safe loading.
rewrite_yaml_conf(conf_path, config)
Writes a Python object back to a YAML file.
Parameters:
conf_path (str): YAML file path.
config: Python object to serialize.
Raises:
EnvironmentError if writing fails.
rewrite_json_file(filepath, json_data)
Writes JSON data to a file with indentation and separators for readability.
Parameters:
filepath (str): Target file path.
json_data (dict): JSON-serializable data.
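A short sketch of what writing with indentation and explicit separators might look like. The concrete values (four-space indent, "," and ": " separators) are assumptions for illustration, since the description above does not state them.

```python
import json


def rewrite_json_file(filepath, json_data):
    """Overwrite filepath with a human-readable JSON rendering of json_data."""
    with open(filepath, "w", encoding="utf-8") as f:
        # indent and separators are assumed values, not taken from the module.
        json.dump(json_data, f, indent=4, separators=(",", ": "))
```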
filename_type(filename)
Determines the file type based on the filename extension.
Parameters:
filename (str): Filename or path.
Returns:
FileType enum value indicating the file category:
PDF
DOC (documents and code)
AURAL (audio files)
VISUAL (images and videos)
OTHER (any other types)
Implementation Notes:
Uses regular expressions against common file extensions to categorize.
Usage Example:
file_type = filename_type("report.pdf") # Returns FileType.PDF.value
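The extension-based classification can be sketched as below. The FileType class here is a local stand-in for the enum in api.db, its string values are assumptions, and the extension lists are deliberately short rather than a copy of the module's patterns.

```python
import re
from enum import Enum


class FileType(Enum):
    # Stand-in for the FileType enum in api.db; the values are assumptions.
    PDF = "pdf"
    DOC = "doc"
    AURAL = "aural"
    VISUAL = "visual"
    OTHER = "other"


# Illustrative extension lists only; checked in order, first match wins.
_PATTERNS = [
    (re.compile(r".*\.pdf$"), FileType.PDF),
    (re.compile(r".*\.(docx?|xlsx?|pptx?|txt|md|py|json|csv)$"), FileType.DOC),
    (re.compile(r".*\.(wav|mp3|flac|aac|ogg)$"), FileType.AURAL),
    (re.compile(r".*\.(jpe?g|png|gif|bmp|svg|mp4|avi|mkv)$"), FileType.VISUAL),
]


def filename_type(filename):
    """Classify a filename by its extension, ignoring case."""
    name = filename.lower()
    for pattern, file_type in _PATTERNS:
        if pattern.match(name):
            return file_type.value
    return FileType.OTHER.value
```

Lower-casing the name first keeps the patterns simple, so "Song.MP3" and "song.mp3" fall into the same category.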
thumbnail_img(filename, blob)
Generates a raw thumbnail image (PNG format bytes) from the file content.
Parameters:
filename (str): Name of the file (used to decide processing method).
blob (bytes): Raw file content.
Returns:
bytes of PNG image data representing the thumbnail, or None if unsupported.
Supported File Types and Processing:
PDF: Uses pdfplumber to render the first page at adjustable resolution.
Iteratively reduces resolution if output image exceeds MySQL max length (~64KB).
Thread-safe via a global lock.
Images (jpg, png, gif, icons): Uses PIL to generate a 30x30 thumbnail.
PowerPoint (ppt, pptx): Uses Aspose Slides to generate thumbnails of first slide.
Also scales down to keep under size limits.
Other file types return None.
Implementation Details:
Handles image size constraints for database storage.
Uses in-memory buffers (BytesIO) to avoid file I/O.
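The iterative size reduction can be sketched independently of any rendering library. In the sketch below, render is a hypothetical callable that writes PNG bytes for a given resolution into a buffer; the starting resolution, the byte threshold, and the attempt limit are assumed values.

```python
from io import BytesIO

MAX_THUMBNAIL_BYTES = 64000  # assumed threshold approximating the ~64KB limit


def shrink_until_fits(render, resolution=32, max_bytes=MAX_THUMBNAIL_BYTES, max_attempts=10):
    """Call render(resolution, buffer), halving resolution until the output fits."""
    data = None
    for _ in range(max_attempts):
        buffer = BytesIO()  # fresh in-memory buffer for each attempt
        render(resolution, buffer)
        data = buffer.getvalue()
        if len(data) >= max_bytes and resolution >= 2:
            resolution = resolution / 2  # too large: retry at half the resolution
        else:
            break  # fits, or the minimum resolution has been reached
    return data
```

Note that the loop stops at a minimum resolution, so in this sketch the returned bytes can still exceed the threshold when even the smallest rendering is too large.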
thumbnail(filename, blob)
Generates a base64-encoded thumbnail string prefixed with IMG_BASE64_PREFIX suitable for embedding in HTML.
Parameters:
filename (str): File name.
blob (bytes): File content.
Returns:
str: Base64-encoded thumbnail string or empty string if thumbnail generation fails.
Usage Example:
img_tag_src = thumbnail("picture.png", image_bytes)
Implementation Detail:
Calls thumbnail_img internally and encodes the result.
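The encoding step alone can be sketched as follows. The prefix value shown is an assumption (a PNG data URI prefix); the real constant is imported from api.constants. The helper name encode_thumbnail is hypothetical and covers only the part that runs after thumbnail_img has returned.

```python
import base64

# Assumed value for illustration; the real constant lives in api.constants.
IMG_BASE64_PREFIX = "data:image/png;base64,"


def encode_thumbnail(img_bytes):
    """Turn raw PNG bytes into a prefixed base64 string, or '' when there are none."""
    if img_bytes is None:
        return ""
    return IMG_BASE64_PREFIX + base64.b64encode(img_bytes).decode("utf-8")
```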
traversal_files(base)
Generator that yields full file paths for all files recursively under the given base directory.
Parameters:
base (str): Root directory to traverse.
Yields:
str: Full path to each file found.
Usage Example:
for filepath in traversal_files("/data/documents"):
    print(filepath)
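A recursive generator of this kind is commonly built on os.walk. The following is a minimal sketch under that assumption, not necessarily how the module implements it.

```python
import os


def traversal_files(base):
    """Yield the full path of every file under base, descending into subdirectories."""
    for root, _dirs, files in os.walk(base):
        for name in files:
            yield os.path.join(root, name)
```

Because it is a generator, paths are produced lazily, which keeps memory use flat even for large directory trees.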
repair_pdf_with_ghostscript(input_bytes)
Attempts to repair a possibly corrupted PDF by re-processing it with Ghostscript.
Parameters:
input_bytes (bytes): Raw PDF data.
Returns:
bytes: Repaired PDF bytes if successful, original bytes if not.
Implementation Details:
Requires gs (Ghostscript) executable in system PATH.
Writes input to a temporary file and runs Ghostscript with /prepress settings to regenerate PDF.
Returns original input if Ghostscript is unavailable or the process fails.
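The steps above can be sketched as follows. This is an illustration under stated assumptions: the temporary file names, the 60-second timeout, and the use of a temporary directory are choices made for the sketch, while -o, -sDEVICE=pdfwrite and -dPDFSETTINGS=/prepress are standard Ghostscript options.

```python
import os
import shutil
import subprocess
import tempfile


def repair_pdf_with_ghostscript(input_bytes):
    """Re-write a PDF through Ghostscript; return the input unchanged on any failure."""
    if shutil.which("gs") is None:
        return input_bytes  # Ghostscript is not installed or not on PATH

    with tempfile.TemporaryDirectory() as workdir:
        src = os.path.join(workdir, "in.pdf")
        dst = os.path.join(workdir, "out.pdf")
        with open(src, "wb") as f:
            f.write(input_bytes)

        cmd = ["gs", "-o", dst, "-sDEVICE=pdfwrite", "-dPDFSETTINGS=/prepress", src]
        try:
            # The timeout is an assumption added so a stuck process cannot hang the caller.
            proc = subprocess.run(cmd, capture_output=True, timeout=60)
        except (OSError, subprocess.SubprocessError):
            return input_bytes

        if proc.returncode != 0 or not os.path.exists(dst):
            return input_bytes
        with open(dst, "rb") as f:
            return f.read()
```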
read_potential_broken_pdf(blob)
Reads a PDF file safely, attempting to repair it if it cannot be opened initially.
Parameters:
blob (bytes): Raw PDF content.
Returns:
bytes: PDF bytes that pdfplumber can open where possible; the original blob is returned if repair does not help.
Implementation Details:
Tries to open the PDF with pdfplumber.
If that fails, attempts repair using repair_pdf_with_ghostscript.
Returns repaired bytes if successful, otherwise returns original blob.
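The control flow can be shown without depending on pdfplumber or Ghostscript. In this sketch, can_open and repair are hypothetical injected callables standing in for "open with pdfplumber" and repair_pdf_with_ghostscript; the documented function takes only blob.

```python
def read_potential_broken_pdf(blob, can_open, repair):
    """Return bytes that can_open accepts if possible, else the original blob.

    can_open(bytes) -> bool and repair(bytes) -> bytes are hypothetical
    parameters used here to keep the sketch free of external dependencies.
    """
    if can_open(blob):
        return blob  # already readable, nothing to do
    repaired = repair(blob)
    if can_open(repaired):
        return repaired  # repair produced a readable PDF
    return blob  # repair did not help; hand back the original
```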
Important Implementation Details and Algorithms
Thread Safety with pdfplumber:
Because pdfplumber can have thread-safety issues, a global threading lock (LOCK_KEY_pdfplumber) is used to serialize access when generating PDF thumbnails.
Thumbnail Size Management:
Thumbnail generation for PDFs and PowerPoint files includes an iterative scaling approach to ensure the resulting image size fits within a 64KB limit (due to MySQL LongText constraints). The scale or resolution is halved repeatedly until the size constraint is met or a minimum scale is reached.
File Type Detection:
Uses regex matching on file extensions to classify files into broad categories (PDF, DOC, AURAL, VISUAL, OTHER), facilitating downstream processing decisions.
PDF Repair:
Incorporates an external dependency on Ghostscript to repair corrupted PDFs by reprocessing them, improving robustness in handling PDF files that might otherwise be unusable.
Configuration Loading with Caching:
JSON config files are loaded with caching to improve performance on repeated calls, while a non-cached variant is provided for real-time loading.
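The lock-based serialization mentioned for PDF thumbnails can be sketched with a plain threading.Lock. The helper name run_serialized and the module-level _pdf_lock are hypothetical; how the real lock is stored and looked up under LOCK_KEY_pdfplumber is not shown here.

```python
import threading

# Stand-in for the global lock associated with LOCK_KEY_pdfplumber.
_pdf_lock = threading.Lock()


def run_serialized(work, *args):
    """Run work(*args) while holding the shared lock.

    Only one thread at a time can be inside work, which is the property
    needed when the underlying library is not safe for concurrent use.
    """
    with _pdf_lock:
        return work(*args)
```

The trade-off is throughput: concurrent thumbnail requests queue behind one another instead of running in parallel.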
Integration and Interaction with Other System Components
Constants and Enums:
Imports IMG_BASE64_PREFIX from api.constants, used as the prefix in base64-encoded thumbnails.
Uses FileType enum from api.db to categorize files.
PDF Processing:
Uses pdfplumber for PDF page rendering and reading.
Image Processing:
Uses PIL (Pillow) for image thumbnail creation.
YAML Processing:
Uses ruamel.yaml for YAML configuration loading and dumping.
Caching:
Uses cachetools for JSON config caching.
External Binary Dependency:
Ghostscript (gs) is invoked as an external process for PDF repair.
Threading:
Uses Python's threading.Lock mechanism for synchronizing PDF processing.
This module is a foundational utility layer widely used by higher-level components that require file I/O, configuration management, and file content preview generation.
Visual Diagram
flowchart TD
A[get_project_base_directory] -->|uses| B[PROJECT_BASE]
C[get_rag_directory] -->|uses| D[RAG_BASE]
E[get_rag_python_directory] --> C
F[get_home_cache_dir]
G[load_json_conf] --> H[cachetools LRUCache]
I[load_json_conf_real_time]
J[dump_json_conf]
K[load_yaml_conf]
L[rewrite_yaml_conf]
M[rewrite_json_file]
N[filename_type]
O[thumbnail_img] -->|calls with lock| P[pdfplumber]
O -->|calls| Q[PIL.Image]
O -->|calls| R[Aspose Slides]
S[thumbnail] --> O
T[traversal_files]
U[repair_pdf_with_ghostscript] -->|calls| V["Ghostscript (gs)"]
W[read_potential_broken_pdf] --> U
W -->|calls| P
style P fill:#f9f,stroke:#333,stroke-width:1px
style Q fill:#bbf,stroke:#333,stroke-width:1px
style R fill:#bbf,stroke:#333,stroke-width:1px
style V fill:#bfb,stroke:#333,stroke-width:1px
Summary
The file_utils.py module provides a robust suite of file handling utilities crucial for the InfiniFlow ecosystem. It abstracts common file system operations, supports efficient config file management, enables safe PDF handling with repair capabilities, and facilitates thumbnail generation for multiple file types. Its design emphasizes performance (caching), safety (thread locks), and extensibility, making it an essential backbone for file-related workflows throughout the application.