__init__.py
Overview
This __init__.py file is the main initialization module for a document analysis package within the InfiniFlow project. It imports the core recognition components, prepares inputs and outputs (image and PDF files in particular), and controls concurrent access to PDF processing resources.
Key functionalities include:
Importing and exposing major classes such as OCR, Recognizer, LayoutRecognizer, and TableStructureRecognizer as part of the package's public API.
Providing a utility function, init_in_out, that loads input files (images or PDFs), converts PDF pages into images, and prepares the corresponding output paths.
Implementing a thread-safe mechanism to handle concurrent PDF processing via a global lock.
This file acts as a bridge connecting lower-level recognition modules with higher-level application workflows, enabling seamless document processing pipelines.
Classes and Functions
Imported Classes (Re-exported)
The following classes are imported from submodules and exposed as part of the package API:
| Class Name | Source Module | Description |
|---|---|---|
| OCR | ocr | Performs Optical Character Recognition (OCR) on document images. |
| Recognizer | recognizer | General recognizer class for identifying document elements or content. |
| LayoutRecognizer | layout_recognizer | Specialized layout analysis using a YOLOv10-based model for detecting document structures. |
| TableStructureRecognizer | table_structure_recognizer | Recognizes table structures within document images. |
Global Variables
LOCK_KEY_pdfplumber
Type: str
Value: "global_shared_lock_pdfplumber"
Purpose: A unique key string under which a global threading lock object is stored in sys.modules to synchronize access to pdfplumber PDF processing.
sys.modules[LOCK_KEY_pdfplumber]
Type: threading.Lock
Purpose: A global lock object that keeps opening and processing PDFs with pdfplumber thread-safe, since pdfplumber isn't inherently thread-safe.
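The lock registration described above can be sketched with the standard library alone. This is a minimal illustration, not the module's actual code: the membership check before creating the lock and the run_serialized helper are assumptions added for the example. Only the key name and the use of sys.modules come from the description.

```python
import sys
import threading

LOCK_KEY_pdfplumber = "global_shared_lock_pdfplumber"

# sys.modules is a process-wide dict, so every importer sees the same entry.
# Guarding the assignment (an assumption) avoids replacing a lock that another
# thread may already be holding.
if LOCK_KEY_pdfplumber not in sys.modules:
    sys.modules[LOCK_KEY_pdfplumber] = threading.Lock()


def run_serialized(fn, *args, **kwargs):
    """Call fn while holding the shared lock (hypothetical helper)."""
    with sys.modules[LOCK_KEY_pdfplumber]:
        return fn(*args, **kwargs)
```

Any code path that opens a PDF would then go through the same lock, for example run_serialized(open_pdf, path), where open_pdf is whatever callable performs the pdfplumber work.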
Function: init_in_out(args)
init_in_out(args) -> (List[PIL.Image.Image], List[str])
Description
Prepares input images and corresponding output file paths based on the provided arguments. Supports both image files and multi-page PDFs by converting PDF pages into images. Ensures that output directories exist and manages thread-safe PDF processing.
Parameters
args: An object (typically a namespace or similar) with the following expected attributes:
inputs (str): Path to the input file or directory containing images/PDFs.
output_dir (str): Directory path where output files should be saved.
Returns
images (List[PIL.Image.Image]): A list of image objects loaded from input files or extracted from PDF pages.
outputs (List[str]): Corresponding output file paths where processed images or results can be saved.
Usage Example
class Args:
    inputs = "/path/to/input_folder_or_file"
    output_dir = "/path/to/output_folder"
args = Args()
images, outputs = init_in_out(args)
for img, out_path in zip(images, outputs):
    # process and save_results are placeholders for your own code
    process(img)
    save_results(out_path)
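Since args only needs the inputs and output_dir attributes, a command-line script could build it with argparse instead of a hand-written class. The sketch below is one way to do that. The flag names mirror the attribute names, while the description text and the default output directory are assumptions made for the example.

```python
import argparse


def build_parser():
    """Build a parser whose namespace carries the attributes init_in_out reads."""
    parser = argparse.ArgumentParser(
        description="Prepare inputs for document analysis (illustrative)"
    )
    parser.add_argument("--inputs", required=True,
                        help="Input file, or a directory of images/PDFs")
    # The default below is hypothetical; choose whatever suits your project.
    parser.add_argument("--output_dir", default="./outputs",
                        help="Directory where results are written")
    return parser


# An explicit argument list keeps the example independent of sys.argv.
args = build_parser().parse_args([
    "--inputs", "/path/to/input_folder_or_file",
    "--output_dir", "/path/to/output_folder",
])
```

The resulting namespace can be passed as args in place of the Args instance shown in the usage example.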
Implementation Details
Handles input paths that are either directories or single files.
For directories, recursively traverses all files using the traversal_files utility.
For PDF files:
Uses pdfplumber to open the PDF within a global lock context to ensure no concurrent access issues.
Converts each PDF page into an RGB image at a zoomed resolution (default zoom factor 3).
Generates output filenames by appending the page index to the original filename.
For image files:
Opens and converts images to RGB format using PIL.
Handles exceptions gracefully with traceback logging.
Ensures the output directory exists or creates it.
Returns synchronized lists of images and their corresponding output file paths.
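The control flow listed above can be sketched without pdfplumber or PIL by passing the loaders in as callables. This is a simplified illustration under stated assumptions, not the package's implementation: os.walk stands in for traversal_files, and collect_inputs, prepare, load_pdf, and load_image are hypothetical names.

```python
import os
import traceback


def collect_inputs(inputs):
    """Return a flat list of file paths for a file or directory argument."""
    if os.path.isdir(inputs):
        found = []
        # os.walk stands in for the traversal_files utility (assumption).
        for root, _dirs, files in os.walk(inputs):
            for name in sorted(files):
                found.append(os.path.join(root, name))
        return found
    return [inputs]


def prepare(inputs, output_dir, load_pdf, load_image):
    """Dispatch each file to a PDF or image loader and build output paths.

    load_pdf(path) returns a list of page images; load_image(path) returns
    one image. Both are hypothetical callables supplied by the caller.
    """
    os.makedirs(output_dir, exist_ok=True)
    images, outputs = [], []
    for path in collect_inputs(inputs):
        name = os.path.basename(path)
        try:
            if path.lower().endswith(".pdf"):
                pages = load_pdf(path)
                images.extend(pages)
                # One output per page: original filename plus page index.
                outputs.extend(
                    os.path.join(output_dir, f"{name}_{i}.jpg")
                    for i in range(len(pages))
                )
            else:
                images.append(load_image(path))
                outputs.append(os.path.join(output_dir, name))
        except Exception:
            # Print the traceback and skip the file so the batch continues.
            traceback.print_exc()
    return images, outputs
```

Appending to both lists only after a loader succeeds keeps images and outputs the same length, which is what lets callers zip them together.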
Important Implementation Notes
Thread Safety for PDF Processing:
pdfplumber is not thread-safe by default, which can cause crashes or corrupted reads if accessed concurrently. This file introduces a global lock stored in sys.modules to serialize access to PDF opening and page image extraction, ensuring safe multi-threaded usage.
Image Conversion:
PDF pages are converted to images with a resolution scaled by 3 (72 dpi × 3 = 216 dpi), which balances quality and processing speed.
Dynamic Output Naming:
For PDFs, output images are named with the original PDF filename plus a page index suffix (e.g., document.pdf_0.jpg). For images, the original filename is preserved.
Dependency on External Utilities:
The function depends on:
traversal_files from api.utils.file_utils to list files recursively.
PIL (Pillow) for image handling.
pdfplumber for PDF page extraction.
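The resolution arithmetic in the Image Conversion note is easy to check by hand. The sketch below assumes the usual PDF convention of 72 points per inch. The helper names and the US Letter example are illustrative and do not come from the module.

```python
BASE_DPI = 72  # PDF page sizes are expressed in points, 72 per inch


def page_resolution(zoom=3):
    """Rendering resolution in dpi for a given zoom factor."""
    return BASE_DPI * zoom


def pixel_size(width_pt, height_pt, zoom=3):
    """Approximate pixel dimensions of a page rendered at the given zoom."""
    return round(width_pt * zoom), round(height_pt * zoom)


# A US Letter page is 612 x 792 points.
letter_pixels = pixel_size(612, 792)
```

At the default zoom of 3, a Letter page therefore renders to roughly 1836 × 2376 pixels. Raising the zoom increases detail, but the pixel count, and with it memory and processing time, grows with the square of the factor.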
Interaction with Other Modules
Imports core recognition components from sibling modules (ocr, recognizer, layout_recognizer, table_structure_recognizer) and exposes them in the package namespace.
Utilizes utilities from an external api.utils.file_utils module for directory traversal.
Provides foundational input/output preparation functionality used by higher-level workflows or scripts that perform document recognition and analysis.
Package Public API
The following names are exported as part of the package's __all__ list, indicating the public interface of this module:
__all__ = [
"OCR",
"Recognizer",
"LayoutRecognizer",
"TableStructureRecognizer",
"init_in_out",
]
Mermaid Diagram: Flowchart of init_in_out Workflow
flowchart TD
    A["Start: receive args.inputs and args.output_dir"] --> B{"Is args.inputs a directory?"}
    B -- Yes --> C["Traverse files recursively using traversal_files"]
    C --> D["For each file: images_and_outputs(file)"]
    B -- No --> D2["images_and_outputs(args.inputs)"]
    subgraph IO["images_and_outputs(fnm)"]
        direction LR
        E{"Is file a PDF?"} -->|Yes| F1
        E -->|No| G["Open and convert image file"]
        subgraph PP["pdf_pages(fnm)"]
            F1["Acquire LOCK_KEY_pdfplumber lock"]
            F2["Open PDF with pdfplumber"]
            F3["Convert each page to image with zoom factor"]
            F4["Append page images and output paths"]
            F5["Close PDF and release lock"]
            F1 --> F2 --> F3 --> F4 --> F5
        end
    end
    D --> E
    D2 --> E
    F5 --> H["Ensure output_dir exists or create it"]
    G --> H
    H --> I["Return images list and outputs list"]
Summary
This __init__.py initializes the document recognition package by exposing the key recognizer classes and providing the init_in_out function. That function handles input preprocessing (loading images and PDFs safely and preparing output paths) and serves as the entry point into the document analysis pipeline. The global lock serializes PDF processing so that it stays safe in concurrent environments.
This module interacts closely with internal recognition submodules and external utilities, forming a foundational layer in the system's document processing architecture.