file_utils.py


Overview

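file_utils.py is a utility module that supports file handling operations within the InfiniFlow project. Its primary focus lies in resolving project directories, loading and writing JSON and YAML configuration files, detecting file types, generating thumbnails, and reading or repairing PDF files.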

The module provides essential abstraction and convenience functions that facilitate consistent file management, configuration handling, and basic file content processing across the system.


Detailed Descriptions

Constants and Global Variables


Functions


get_project_base_directory(*args)

Returns the absolute path to the project base directory, optionally appending additional path components.
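As an illustration of the "base path plus optional components" behavior described above, here is a minimal sketch. The function name project_path is hypothetical, and PROJECT_BASE is set to the current working directory only as a stand-in; how the real module derives its base directory is not stated in this document.

```python
import os

# Hypothetical stand-in: the real module may derive PROJECT_BASE differently
# (for example from an environment variable or from its own file location).
PROJECT_BASE = os.path.abspath(os.getcwd())


def project_path(*args):
    """Return PROJECT_BASE, or PROJECT_BASE joined with the given components."""
    if args:
        return os.path.join(PROJECT_BASE, *args)
    return PROJECT_BASE


# Usage: no arguments gives the base itself, arguments are appended in order.
base = project_path()
conf_file = project_path("conf", "service_conf.yaml")  # file name is illustrative
```

Accepting *args keeps call sites short: callers pass path components instead of joining strings themselves.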


get_rag_directory(*args)

Returns the absolute path to the RAG base directory, optionally appending additional path components.


get_rag_python_directory(*args)

Shortcut to get the RAG Python directory under the RAG base directory.


get_home_cache_dir()

Returns the path to a home cache directory named .ragflow inside the user's home directory, creating it if it doesn't exist.
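The create-if-missing behavior can be sketched as follows. The name home_cache_dir and its home parameter are assumptions added here so the example can run against a temporary directory instead of the real home directory.

```python
import os
import tempfile


def home_cache_dir(home=None):
    """Return <home>/.ragflow, creating the directory if it is missing."""
    # `home` is a hypothetical parameter; by default the user's home is used.
    home = home or os.path.expanduser("~")
    cache_dir = os.path.join(home, ".ragflow")
    # exist_ok=True makes repeated calls safe when the directory already exists.
    os.makedirs(cache_dir, exist_ok=True)
    return cache_dir


# Usage against a throwaway directory so nothing is written to the real home.
with tempfile.TemporaryDirectory() as tmp:
    cache_dir = home_cache_dir(tmp)
    created = os.path.isdir(cache_dir)
    again = home_cache_dir(tmp)  # second call succeeds and returns the same path
```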


load_json_conf(conf_path)

Loads a JSON configuration file with caching support (LRU cache size 10).
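A rough sketch of what a cached loader implies for callers is shown below. The diagram in this document points to cachetools; this example deliberately swaps in the standard library's functools.lru_cache to stay dependency-free, and the name load_json_cached is hypothetical.

```python
import json
import os
import tempfile
from functools import lru_cache


@lru_cache(maxsize=10)
def load_json_cached(conf_path):
    """Parse a JSON file once per path; later calls return the cached object."""
    with open(conf_path, encoding="utf-8") as f:
        return json.load(f)


# Demo: an on-disk change is not visible until the cache entry is dropped.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "conf.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"port": 1}, f)
    first = load_json_cached(path)

    with open(path, "w", encoding="utf-8") as f:
        json.dump({"port": 2}, f)
    stale = load_json_cached(path)    # served from the cache, file is not re-read

    load_json_cached.cache_clear()
    fresh = load_json_cached(path)    # re-read from disk after clearing
```

The trade-off is the usual one for cached configuration: repeated reads are cheap, but edits made while the process is running go unnoticed unless the cache is bypassed or cleared. Because the cached object is shared between callers, mutating it affects every later reader.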


dump_json_conf(config_data, conf_path)

Dumps a Python dictionary as formatted JSON into the specified file.


load_json_conf_real_time(conf_path)

Loads JSON configuration from file every time it’s called (no caching).


load_yaml_conf(conf_path)

Loads a YAML configuration file in safe mode.


rewrite_yaml_conf(conf_path, config)

Writes a Python object back to a YAML file.


rewrite_json_file(filepath, json_data)

Writes JSON data to a file with indentation and separators for readability.
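For readers unfamiliar with the indentation and separator options of json.dump, a minimal sketch follows. The function name write_json_pretty and the specific values (indent=4, separators=(",", ": ")) are assumptions; the document does not state which values the module uses.

```python
import json
import os
import tempfile


def write_json_pretty(filepath, json_data):
    """Write json_data to filepath as indented, human-readable JSON."""
    with open(filepath, "w", encoding="utf-8") as f:
        # indent puts each key on its own line; separators controls the
        # characters emitted between items and between keys and values.
        json.dump(json_data, f, indent=4, separators=(",", ": "))


# Usage: write a small dictionary and read the raw text back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.json")
    write_json_pretty(path, {"a": 1})
    with open(path, encoding="utf-8") as f:
        text = f.read()
```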


filename_type(filename)

Determines the file type based on the filename extension.
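Extension-based detection can be sketched with a lookup table. Everything specific here is hypothetical: the function name guess_file_type, the category names, and the extension list are illustrative and may not match the categories the module actually returns.

```python
import os

# Hypothetical mapping from lowercase extension to a coarse category.
_EXTENSION_TYPES = {
    ".pdf": "pdf",
    ".docx": "document",
    ".txt": "document",
    ".md": "document",
    ".png": "image",
    ".jpg": "image",
    ".jpeg": "image",
    ".mp3": "audio",
    ".wav": "audio",
}


def guess_file_type(filename):
    """Classify a file by its extension, case-insensitively."""
    ext = os.path.splitext(filename.lower())[1]
    # Unknown or missing extensions fall into a catch-all category.
    return _EXTENSION_TYPES.get(ext, "other")
```

Note that this approach looks only at the name, never at the content, so a mislabeled file is classified by its label.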


thumbnail_img(filename, blob)

Generates a raw thumbnail image (PNG format bytes) from the file content.


thumbnail(filename, blob)

Generates a base64-encoded thumbnail string prefixed with IMG_BASE64_PREFIX suitable for embedding in HTML.
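The encoding step can be sketched with the standard library alone. The value assigned to IMG_BASE64_PREFIX below is an assumption (a PNG data URI prefix), and to_data_uri is a hypothetical name; the sketch takes already-rendered image bytes rather than producing a thumbnail itself.

```python
import base64

# Assumed value: the document names this constant but does not show its content.
IMG_BASE64_PREFIX = "data:image/png;base64,"


def to_data_uri(png_bytes):
    """Return a data URI string for the given PNG bytes, or "" if there are none."""
    if not png_bytes:
        return ""
    encoded = base64.b64encode(png_bytes).decode("utf-8")
    return IMG_BASE64_PREFIX + encoded
```

A string of this shape can be placed directly in the src attribute of an HTML img element, which avoids a separate request for the image.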


traversal_files(base)

Generator that yields full file paths for all files recursively under the given base directory.
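A recursive file generator of this kind is commonly built on os.walk. The sketch below uses the hypothetical name walk_files and does not claim to match the module's own traversal order or filtering.

```python
import os
import tempfile


def walk_files(base):
    """Yield the full path of every file under base, recursing into subdirectories."""
    for root, _dirs, files in os.walk(base):
        for name in files:
            yield os.path.join(root, name)


# Usage: build a tiny tree and list it relative to its root.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for rel in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(tmp, rel), "w", encoding="utf-8") as f:
            f.write("x")
    found = sorted(os.path.relpath(p, tmp) for p in walk_files(tmp))
```

Because the function is a generator, callers can start processing files before the whole tree has been scanned.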


repair_pdf_with_ghostscript(input_bytes)

Attempts to repair a possibly corrupted PDF by re-processing it with Ghostscript.
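One way such a repair pass could look is sketched below: write the bytes to a temporary file, let Ghostscript rewrite the document through its pdfwrite device, and read the result back. The function names, the timeout, and the choice to return the original bytes when Ghostscript is missing or fails are all assumptions, as is the exact set of flags.

```python
import os
import shutil
import subprocess
import tempfile


def build_gs_command(in_path, out_path):
    """Build a Ghostscript command line that rewrites a PDF via pdfwrite."""
    return ["gs", "-o", out_path, "-sDEVICE=pdfwrite",
            "-dPDFSETTINGS=/prepress", in_path]


def repair_pdf(input_bytes, timeout=120):
    """Return a re-written PDF, or the original bytes if repair is not possible."""
    if shutil.which("gs") is None:
        return input_bytes  # assumed fallback: Ghostscript is not installed
    with tempfile.TemporaryDirectory() as tmp:
        in_path = os.path.join(tmp, "in.pdf")
        out_path = os.path.join(tmp, "out.pdf")
        with open(in_path, "wb") as f:
            f.write(input_bytes)
        try:
            proc = subprocess.run(build_gs_command(in_path, out_path),
                                  capture_output=True, timeout=timeout)
        except (subprocess.TimeoutExpired, OSError):
            return input_bytes
        if proc.returncode != 0 or not os.path.exists(out_path):
            return input_bytes
        with open(out_path, "rb") as f:
            return f.read()
```

Returning the input unchanged on failure keeps the function safe to call unconditionally; the caller then decides what to do if the PDF still cannot be opened.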


read_potential_broken_pdf(blob)

Reads a PDF file safely, attempting to repair it if it cannot be opened initially.
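The open-then-repair-then-retry flow can be sketched independently of any PDF library. In the example below, open_with_repair is a hypothetical name, and open_pdf and repair are injected callables standing in for the real opener and the Ghostscript repair step; the stubs at the bottom exist only to make the sketch runnable.

```python
def open_with_repair(blob, open_pdf, repair):
    """Try to open blob; on failure, repair it once and try again."""
    try:
        return open_pdf(blob)
    except Exception:
        # A second failure is allowed to propagate to the caller.
        return open_pdf(repair(blob))


# Stubs for demonstration: a "valid" PDF is any blob starting with b"%PDF".
def fake_open(blob):
    if not blob.startswith(b"%PDF"):
        raise ValueError("not a PDF")
    return ("opened", blob)


def fake_repair(blob):
    return b"%PDF-1.7\n" + blob


ok = open_with_repair(b"%PDF-1.4 data", fake_open, fake_repair)
fixed = open_with_repair(b"broken data", fake_open, fake_repair)
```

Repair is attempted only after the first open fails, so well-formed files never pay the cost of an external Ghostscript call.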


Important Implementation Details and Algorithms


Integration and Interaction with Other System Components

This module is a foundational utility layer widely used by higher-level components that require file I/O, configuration management, and file content preview generation.


Visual Diagram

flowchart TD
    A[get_project_base_directory] -->|uses| B[PROJECT_BASE]
    C[get_rag_directory] -->|uses| D[RAG_BASE]
    E[get_rag_python_directory] --> C
    F[get_home_cache_dir]

    G[load_json_conf] --> H[cachetools LRUCache]
    I[load_json_conf_real_time]
    J[dump_json_conf]

    K[load_yaml_conf]
    L[rewrite_yaml_conf]

    M[rewrite_json_file]

    N[filename_type]

    O[thumbnail_img] -->|calls| P["pdfplumber (with Lock)"]
    O -->|calls| Q[PIL.Image]
    O -->|calls| R[Aspose Slides]
    S[thumbnail] --> O

    T[traversal_files]

    U[repair_pdf_with_ghostscript] -->|calls| V["Ghostscript (gs)"]
    W[read_potential_broken_pdf] --> U
    W -->|calls| P

    style P fill:#f9f,stroke:#333,stroke-width:1px
    style Q fill:#bbf,stroke:#333,stroke-width:1px
    style R fill:#bbf,stroke:#333,stroke-width:1px
    style V fill:#bfb,stroke:#333,stroke-width:1px

Summary

The file_utils.py module collects the file handling utilities used across InfiniFlow. It abstracts common file system operations, manages JSON and YAML configuration files, reads PDFs with a Ghostscript-based repair fallback, and generates thumbnails for multiple file types. Its design relies on caching for repeated configuration reads and on a lock around pdfplumber calls for thread safety.