t_ocr.py

Overview

t_ocr.py is a utility script that runs Optical Character Recognition (OCR) on a collection of input images or PDF files. It uses the OCR class from the deepdoc.vision package to detect text regions and recognize their content. When more than one CUDA-enabled GPU is visible, it processes images concurrently across those GPUs using the trio library. For each input it writes an annotated image with bounding boxes drawn around the detected text, plus a plain text file containing the recognized text.

This tool is primarily intended for batch OCR processing workflows where users provide a directory or individual files as input, and receive both visual and textual OCR outputs in a specified directory.

Detailed Explanation

Imports and Environment Setup

Modifies sys.path to include a relative path to the project root for module imports.
Imports draw_box from deepdoc.vision.seeit, and OCR and init_in_out from deepdoc.vision.
Sets CUDA device visibility through the CUDA_VISIBLE_DEVICES environment variable. The default is a single GPU ('0'); commented-out alternatives select multiple GPUs or CPU-only execution.
Uses argparse for CLI argument parsing.
Uses numpy for image array conversion.
Uses trio for asynchronous concurrency.
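
The setup steps above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the helper names add_project_root and select_devices are hypothetical, and the assumption that the project root sits two directories above the script is only a placeholder for whatever relative path the script really uses. CUDA_VISIBLE_DEVICES has to be set before any CUDA library initializes, which is why this kind of setup runs at the top of a script.

```python
import os
import sys


def add_project_root(script_path: str, levels_up: int = 2) -> str:
    """Put a directory above the script on sys.path and return it.

    levels_up=2 is an assumption made for this sketch only.
    """
    script_dir = os.path.dirname(os.path.abspath(script_path))
    root = os.path.abspath(os.path.join(script_dir, *([".."] * levels_up)))
    if root not in sys.path:
        sys.path.insert(0, root)
    return root


def select_devices(devices: str = "0") -> None:
    """Restrict which GPUs CUDA can see.

    "0" exposes one GPU, "0,2" exposes two, "" hides all GPUs (CPU-only).
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = devices
```

In a real script, add_project_root(__file__) and select_devices("0") would be called before importing any module that initializes CUDA.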

Main Components

Function: `main(args)`

Primary function that orchestrates the OCR workflow.

Parameters

args: The command-line arguments parsed by argparse. Expected to include:
- inputs: Path to input images/PDFs (directory or file).
- output_dir: Directory path where output images and text files will be saved.

Workflow

CUDA Device Detection
Uses torch.cuda.device_count() to determine the number of available GPUs.
Capacity Limiters for Concurrency
If multiple GPUs are available, creates a list of trio.CapacityLimiter instances, one per device, to restrict concurrency and avoid device contention.
OCR Initialization
Instantiates the OCR engine via ocr = OCR().
Input/Output Initialization
Calls init_in_out(args) which returns:
- images: a list of loaded images.
- outputs: corresponding output file paths.
OCR Task Definition (__ocr)
A synchronous function that:
- Receives the task index (i), the device ID (id), and the image (img).
- Converts the image to a NumPy array.
- Runs OCR to detect text lines and bounding boxes.
- Formats the OCR output into bounding boxes with coordinates and text.
- Draws bounding boxes on the original image.
- Saves the annotated image and a .txt file with extracted text.
Asynchronous OCR Wrapper (__ocr_thread)
An async wrapper that:
- If a limiter is supplied (the multi-GPU case), acquires it with async with limiter so that only one task uses that device at a time.
- Runs the synchronous __ocr function in a thread to avoid blocking the event loop.
OCR Launcher (__ocr_launcher)
- If multiple GPUs are detected, runs concurrent OCR tasks across devices using a nursery.
- If single GPU or CPU, runs OCR tasks sequentially.
- Uses await trio.sleep(0.1) to stagger task starts slightly.
Execution
Calls trio.run(__ocr_launcher) to run the async event loop and start OCR tasks.
Prints a completion message after all OCR tasks finish.

Returns

None. Outputs are saved to disk.

Usage Example

python t_ocr.py --inputs ./input_images --output_dir ./ocr_results

Command-Line Interface

The script uses argparse to define two arguments:

--inputs (required): Path to input directory or file (image or PDF).
--output_dir (optional): Directory for storing output images and text files. Defaults to ./ocr_outputs.
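
A parser matching the two arguments described above could look like the sketch below. The argument names, the required flag, and the default come from this document; the build_parser helper and the help strings are illustrative and may differ from the script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--inputs",
        required=True,
        help="Directory of images or PDFs, or a path to a single image or PDF",
    )
    parser.add_argument(
        "--output_dir",
        default="./ocr_outputs",
        help="Directory for the annotated images and text files",
    )
    return parser
```

Omitting --output_dir leaves args.output_dir at ./ocr_outputs, while omitting --inputs makes argparse exit with a usage error.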

Important Implementation Details and Algorithms

OCR Processing:
The OCR engine (OCR() class) is expected to return detected text lines along with bounding box coordinates. This script processes those outputs by extracting bounding box corners and text, then filters out invalid boxes where coordinates are inconsistent.
Bounding Box Drawing:
Uses the draw_box function to visually annotate images with OCR results, marking recognized text areas.
Asynchronous GPU Utilization:
The script uses the trio concurrency library to run multiple OCR tasks concurrently, distributing them across the available CUDA devices. Each GPU has its own CapacityLimiter(1), which allows only one task on that GPU at a time and so prevents resource conflicts.
Threading inside Async:
Because the OCR call is synchronous and blocking (likely CPU- or GPU-bound), it is run through trio.to_thread.run_sync so that it does not block the async event loop.
Input/Output Handling:
The init_in_out function (imported) presumably loads input images and prepares output file paths, abstracting away file I/O complexity.
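
The box post-processing described under OCR Processing above can be illustrated with plain Python. This sketch rests on several assumptions that this document does not confirm: each OCR result is taken to be a pair of (four corner points, (text, score)), the corners are assumed to be ordered top-left, top-right, bottom-right, bottom-left, and the dictionary keys text and bbox are chosen for illustration. A box is treated as invalid here when its left edge lies right of its right edge or its top lies below its bottom.

```python
def to_boxes(ocr_lines):
    """Convert raw OCR lines into axis-aligned boxes, dropping inconsistent ones."""
    boxes = []
    for quad, (text, _score) in ocr_lines:
        x0, y0 = quad[0]   # top-left corner (assumed ordering)
        x1 = quad[1][0]    # x of the top-right corner
        y1 = quad[-1][1]   # y of the bottom-left corner
        if x0 <= x1 and y0 <= y1:
            boxes.append({"text": text, "bbox": [x0, y0, x1, y1]})
    return boxes


def boxes_to_text(boxes):
    """Join the recognized lines the way a .txt output might store them."""
    return "\n".join(box["text"] for box in boxes)
```

The list returned by to_boxes is the kind of structure a drawing helper such as draw_box would consume, and boxes_to_text gives the content for the accompanying text file.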

Interactions with Other System Components

deepdoc.vision.OCR:
Core OCR engine performing text detection and recognition.
deepdoc.vision.seeit.draw_box:
Utility for drawing bounding boxes on images.
deepdoc.vision.init_in_out:
Responsible for input loading and output path setup.
torch.cuda:
Used to detect available GPUs for parallel processing.
trio:
Asynchronous concurrency framework used to parallelize OCR tasks.
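
As a rough illustration of how the torch.cuda device count drives the choice between concurrent and sequential execution, consider the sketch below. torch.cuda.device_count() is the call named in this document; the fallback to 0 when PyTorch is not installed and the helper names count_cuda_devices and execution_mode are additions made for this example only.

```python
def count_cuda_devices() -> int:
    """Return the number of visible CUDA devices, or 0 without PyTorch."""
    try:
        import torch
    except ImportError:
        # Fallback added for this sketch; the script itself may simply require torch.
        return 0
    return torch.cuda.device_count()


def execution_mode(device_count: int) -> str:
    """More than one GPU means concurrent tasks; otherwise run sequentially."""
    return "concurrent" if device_count > 1 else "sequential"
```

The count reflects CUDA_VISIBLE_DEVICES, so exposing a single GPU through that variable selects the sequential path even on a multi-GPU machine.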

This file acts as a CLI tool that integrates these components to perform batched OCR with GPU acceleration and asynchronous concurrency.

Visual Diagram

flowchart TD
    A[Start: Parse CLI Arguments]
    B[Initialize OCR Engine]
    C[Load Inputs & Prepare Outputs]
    D{Detect CUDA Devices}
    E["Create CapacityLimiters (if multiple GPUs)"]
    F[For Each Image]
    G{Multiple GPUs?}
    H[Run OCR Task with CapacityLimiter]
    I[Run OCR Task Sequentially]
    J["OCR Task (__ocr)"]
    K[Draw Bounding Boxes]
    L[Save Annotated Image & Text]
    M[All Tasks Completed]

    A --> B --> C --> D --> E
    E --> F
    F --> G
    G -- Yes --> H --> J --> K --> L
    G -- No --> I --> J --> K --> L
    L --> F
    F --> M

Summary

t_ocr.py is a GPU-accelerated, asynchronous batch OCR script that loads images or PDFs, applies OCR to extract text and bounding boxes, annotates the images, and saves both the visual and the textual results. When several GPUs are available, it runs one OCR task per GPU at a time, which raises throughput while avoiding device contention. The script is run from the command line with two arguments that specify the input and output locations. It relies on the core OCR and vision utilities of the deepdoc package and fits into document digitization or analysis pipelines.
