__init__.py

Overview

This __init__.py file is the initialization module for a document analysis package within the InfiniFlow project. It re-exports the core recognition classes, prepares input images and output paths (for image and PDF files), and serializes concurrent access to pdfplumber through a shared lock.

Key functionalities include:

Importing and exposing major classes such as OCR, Recognizer, LayoutRecognizer, and TableStructureRecognizer as part of the package's public API.
Providing a utility function init_in_out that handles loading of input files (images or PDFs), converts PDF pages into images, and prepares corresponding output paths.
Implementing a thread-safe mechanism to handle concurrent PDF processing via a global lock.

The file connects the lower-level recognition modules to the higher-level application workflows that call them.

Classes and Functions

Imported Classes (Re-exported)

The following classes are imported from submodules and exposed as part of the package API:

Class Name	Source Module	Description
`OCR`	.ocr	Performs Optical Character Recognition (OCR) on document images.
`Recognizer`	.recognizer	General recognizer class for identifying document elements or content.
`LayoutRecognizer`	.layout_recognizer	Specialized layout analysis using a YOLOv10-based model to detect document structures.
`TableStructureRecognizer`	.table_structure_recognizer	Recognizes table structures within document images.

Global Variables

`LOCK_KEY_pdfplumber`

Type: str
Value: "global_shared_lock_pdfplumber"
Purpose: A unique key string used to store a global threading lock object in sys.modules to synchronize access to pdfplumber PDF processing.

`sys.modules[LOCK_KEY_pdfplumber]`

Type: threading.Lock (an instance created with threading.Lock())
Purpose: A global lock object used to ensure thread-safe operations when opening and processing PDFs with pdfplumber, which isn't inherently thread-safe.
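
The registration pattern described above can be sketched as follows. This is a minimal illustration that assumes the lock is created only when the key is not already present; with_pdf_lock is a hypothetical helper added for the example and is not part of the package.

```python
import sys
import threading

LOCK_KEY_pdfplumber = "global_shared_lock_pdfplumber"

# Register the lock only once; a later import or reload finds the key
# already present and reuses the same lock object.
if LOCK_KEY_pdfplumber not in sys.modules:
    sys.modules[LOCK_KEY_pdfplumber] = threading.Lock()


def with_pdf_lock(work):
    """Run work() while holding the shared lock (hypothetical helper)."""
    with sys.modules[LOCK_KEY_pdfplumber]:
        return work()
```

Because sys.modules is a single process-wide dictionary, any code that looks up the same key gets the same lock, even if the module that created it is loaded under more than one name.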

Function: `init_in_out(args)`

init_in_out(args) -> Tuple[List[PIL.Image.Image], List[str]]

Description

Prepares input images and corresponding output file paths based on the provided arguments. Supports both image files and multi-page PDFs by converting PDF pages into images. Ensures that output directories exist and manages thread-safe PDF processing.

Parameters

args: An object (typically a namespace or similar) with the following expected attributes:
- inputs (str): Path to the input file or directory containing images/PDFs.
- output_dir (str): Directory path where output files should be saved.

Returns

images (List[PIL.Image.Image]): A list of image objects loaded from input files or extracted from PDF pages.
outputs (List[str]): Corresponding output file paths where processed images or results can be saved.

Usage Example

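from argparse import Namespace

# Assumption: the package is importable as deepdoc.vision; adjust to your layout.
from deepdoc.vision import init_in_out

args = Namespace(
    inputs="/path/to/input_folder_or_file",
    output_dir="/path/to/output_folder",
)
images, outputs = init_in_out(args)

for img, out_path in zip(images, outputs):
    # Placeholder: run your own processing here and write results to out_path.
    print(img.size, "->", out_path)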
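
In a command-line script, the args object would more likely come from argparse than from a hand-built object. The sketch below shows one way to do that; the flag names simply mirror the attribute names documented above, and the default output directory is an assumption made for the example.

```python
import argparse


def parse_args(argv=None):
    """Build the args object from command-line flags (flag names are assumed)."""
    parser = argparse.ArgumentParser(description="Prepare inputs for document analysis")
    parser.add_argument("--inputs", required=True,
                        help="Image/PDF file, or a directory containing them")
    parser.add_argument("--output_dir", default="./outputs",
                        help="Directory where result files are written")
    return parser.parse_args(argv)


args = parse_args(["--inputs", "/path/to/input_folder_or_file",
                   "--output_dir", "/path/to/output_folder"])
print(args.inputs, args.output_dir)
```

The resulting namespace exposes inputs and output_dir as attributes, which is all that init_in_out is documented to read.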

Implementation Details

Handles input paths that are either directories or single files.
For directories, recursively traverses all files using the traversal_files utility.
For PDF files:
- Uses pdfplumber to open the PDF while holding the global lock, so that only one thread uses pdfplumber at a time.
- Converts each PDF page into an RGB image at a zoomed resolution (default zoom factor 3).
- Generates output filenames by appending the page index to the original filename.
For image files:
- Opens and converts images to RGB format using PIL.
- Handles exceptions gracefully with traceback logging.
Ensures the output directory exists or creates it.
Returns synchronized lists of images and their corresponding output file paths.
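
The PDF branch of the steps above can be sketched as follows. This is an illustration under stated assumptions, not the package's actual code: render_pdf_pages and its opener parameter are hypothetical (the hook defaults to pdfplumber.open and exists only so the function can be exercised without a real PDF), and the conversion uses pdfplumber's Page.to_image with its resolution argument.

```python
import sys
import threading

LOCK_KEY_pdfplumber = "global_shared_lock_pdfplumber"
if LOCK_KEY_pdfplumber not in sys.modules:
    sys.modules[LOCK_KEY_pdfplumber] = threading.Lock()


def render_pdf_pages(path, zoom=3, opener=None):
    """Return one RGB image per page of `path`, rendered at 72 * zoom dpi.

    `opener` is a hypothetical hook; by default pdfplumber.open is used.
    """
    if opener is None:
        import pdfplumber  # third-party dependency, imported lazily
        opener = pdfplumber.open

    # Hold the shared lock for both opening and rendering, so that no two
    # threads are inside pdfplumber at the same time.
    with sys.modules[LOCK_KEY_pdfplumber]:
        with opener(path) as pdf:
            return [
                page.to_image(resolution=72 * zoom).original.convert("RGB")
                for page in pdf.pages
            ]
```

With pdfplumber installed, render_pdf_pages("report.pdf") would return a list of PIL images, one per page. Holding the lock for the whole render keeps the critical section simple, at the cost of processing PDFs one at a time.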

Important Implementation Notes

Thread Safety for PDF Processing:
pdfplumber is not thread-safe by default, which can cause crashes or corrupted reads if accessed concurrently. This file introduces a global lock stored in sys.modules to serialize access to the PDF opening and page image extraction, ensuring safe multi-threaded usage.
Image Conversion:
PDF pages are converted to images with a resolution scaled by 3 (72 dpi × 3 = 216 dpi), which balances quality and processing speed.
Dynamic Output Naming:
For PDFs, output images are named with the original PDF filename plus a page index suffix (e.g., document.pdf_0.jpg). For images, the original filename is preserved.
Dependency on External Utilities:
The function depends on:
- traversal_files from api.utils.file_utils to list files recursively.
- PIL (Pillow) for image handling.
- pdfplumber for PDF page extraction.
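
The naming rule and the resolution arithmetic from the notes above can be checked with a small sketch. Both helpers are hypothetical and written only for illustration; in particular, the rounding of pixel sizes is an assumption, and pdfplumber's own rounding may differ by a pixel.

```python
import os


def output_names(fnm, n_pages=None):
    """Return output file names for one input (hypothetical helper).

    n_pages is the page count for a PDF, or None for a plain image.
    """
    base = os.path.basename(fnm)
    if n_pages is None:
        return [base]  # images keep their original file name
    return [f"{base}_{i}.jpg" for i in range(n_pages)]


def rendered_size(width_pt, height_pt, zoom=3):
    """Approximate pixel size of a page given in PDF points (1 pt = 1/72 inch)."""
    dpi = 72 * zoom
    return round(width_pt / 72 * dpi), round(height_pt / 72 * dpi)


print(output_names("/data/document.pdf", 2))
print(rendered_size(612, 792))  # US Letter page at zoom 3
```

A two-page document.pdf yields document.pdf_0.jpg and document.pdf_1.jpg, and a US Letter page (612 × 792 points) comes out at roughly 1836 × 2376 pixels at 216 dpi.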

Interaction with Other Modules

Imports core recognition components from sibling modules (ocr, recognizer, layout_recognizer, table_structure_recognizer) and exposes them in the package namespace.
Utilizes utilities from an external api.utils.file_utils module for directory traversal.
Provides foundational input/output preparation functionality used by higher-level workflows or scripts that perform document recognition and analysis.

Package Public API

The following names are exported as part of the package's __all__ list, indicating the public interface of this module:

__all__ = [
    "OCR",
    "Recognizer",
    "LayoutRecognizer",
    "TableStructureRecognizer",
    "init_in_out",
]

Mermaid Diagram: Flowchart of `init_in_out` Workflow

flowchart TD
    A["Start: receive args.inputs and args.output_dir"] --> B{"Is args.inputs a directory?"}

    B -- Yes --> C["Traverse files recursively using traversal_files"]
    C --> D["For each file: images_and_outputs(fnm)"]
    B -- No --> D2["images_and_outputs(args.inputs)"]
    D --> E
    D2 --> E

    subgraph IO["images_and_outputs(fnm)"]
        direction LR
        E{"Is file a PDF?"} -->|Yes| F1
        E -->|No| G["Open image file and convert to RGB"]

        subgraph PDF["pdf_pages(fnm)"]
            F1["Acquire LOCK_KEY_pdfplumber lock"]
            F2["Open PDF with pdfplumber"]
            F3["Convert each page to image with zoom factor"]
            F4["Append page images and output paths"]
            F5["Close PDF and release lock"]
            F1 --> F2 --> F3 --> F4 --> F5
        end
    end

    F5 --> H["Ensure output_dir exists or create it"]
    G --> H
    H --> I["Return images list and outputs list"]

Summary

This __init__.py initializes the document recognition package by exposing the key recognizer classes and providing the init_in_out function. That function preprocesses inputs (loading images, rendering PDF pages, and preparing output paths) and serves as the entry point into the document analysis pipeline. A global lock serializes PDF processing so that pdfplumber is never used by two threads at once.

This module interacts closely with internal recognition submodules and external utilities, forming a foundational layer in the system's document processing architecture.