t_ocr.py
Overview
t_ocr.py is a utility script designed to perform Optical Character Recognition (OCR) on a collection of input images or PDF files. It leverages the OCR module from the deepdoc.vision package to detect and extract text regions and their content from images. The script supports parallel execution on multiple CUDA-enabled GPUs using asynchronous concurrency with the trio library. Results are output as annotated images with bounding boxes drawn around detected text, as well as corresponding plain text files containing the recognized text.
This tool is primarily intended for batch OCR processing workflows where users provide a directory or individual files as input, and receive both visual and textual OCR outputs in a specified directory.
Detailed Explanation
Imports and Environment Setup
Modifies sys.path to include a relative path to the project root for module imports.
Imports draw_box, OCR, and init_in_out from deepdoc.vision and deepdoc.vision.seeit.
Sets the CUDA device visibility via the environment variable CUDA_VISIBLE_DEVICES. Defaults to a single GPU ('0'), but has commented options for multiple GPUs or CPU-only.
Uses argparse for CLI argument parsing.
Uses numpy for image array conversion.
Uses trio for asynchronous concurrency.
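A minimal sketch of the device-visibility step: the variable generally has to be set before any CUDA-using library initializes, so it is placed ahead of those imports. The alternative values in the comments are illustrative assumptions, not a copy of the script's commented-out lines.

```python
import os

# Expose only GPU 0 to this process. Set this before importing any
# library that initializes CUDA, otherwise the setting may be ignored.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Illustrative alternatives (assumed values, not taken from the script):
# os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # expose two GPUs
# os.environ["CUDA_VISIBLE_DEVICES"] = ""     # hide all GPUs (CPU-only)

# Show which device ids this process will be allowed to see.
visible = [d for d in os.environ["CUDA_VISIBLE_DEVICES"].split(",") if d]
print(f"Visible CUDA devices: {visible}")
```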
Main Components
Function: main(args)
Primary function that orchestrates the OCR workflow.
Parameters
args: The command-line arguments parsed by argparse. Expected to include:
inputs: Path to input images/PDFs (directory or file).
output_dir: Directory path where output images and text files will be saved.
Workflow
CUDA Device Detection
Uses torch.cuda.device_count() to determine the number of available GPUs.
Capacity Limiters for Concurrency
If multiple GPUs are available, creates a list of trio.CapacityLimiter instances, one per device, to restrict concurrency and avoid device contention.
OCR Initialization
Instantiates the OCR engine via ocr = OCR().
Input/Output Initialization
Calls init_in_out(args) which returns:
images: a list of loaded images.
outputs: corresponding output file paths.
OCR Task Definition (__ocr)
A synchronous function that:
Receives task index i, device id id, and image img.
Converts the image to a NumPy array.
Runs OCR to detect text lines and bounding boxes.
Formats the OCR output into bounding boxes with coordinates and text.
Draws bounding boxes on the original image.
Saves the annotated image and a .txt file with extracted text.
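The formatting and filtering steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the script's actual code: it assumes each OCR result is a pair of four corner points (top-left, top-right, bottom-right, bottom-left) and a (text, confidence) tuple, and the helper names to_boxes and boxes_to_text are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Point = Sequence[float]
# Assumed result shape: (four corner points, (text, confidence)).
OcrLine = Tuple[Sequence[Point], Tuple[str, float]]


def to_boxes(lines: List[OcrLine]) -> List[Dict]:
    """Convert raw OCR lines to axis-aligned boxes, dropping inconsistent ones."""
    boxes = []
    for corners, (text, _score) in lines:
        # Corners assumed in order: top-left, top-right, bottom-right, bottom-left.
        left, top = corners[0]
        right = corners[1][0]
        bottom = corners[-1][1]
        # Skip boxes whose right edge precedes the left, or bottom precedes the top.
        if left > right or top > bottom:
            continue
        boxes.append({"text": text, "bbox": [left, top, right, bottom]})
    return boxes


def boxes_to_text(boxes: List[Dict]) -> str:
    """Join recognized text, one detected line per output line."""
    return "\n".join(b["text"] for b in boxes)


sample = [
    ([(10, 5), (90, 5), (90, 25), (10, 25)], ("Hello", 0.98)),
    ([(80, 40), (20, 40), (20, 60), (80, 60)], ("flipped", 0.40)),  # right < left
]
boxes = to_boxes(sample)
print(boxes_to_text(boxes))
```

The string returned by boxes_to_text is what would be written to the .txt file, while the box list would be handed to the drawing step.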
Asynchronous OCR Wrapper (__ocr_thread)
An async wrapper that:
If concurrency limiter is set (multiple GPUs), uses async with limiter to control access.
Runs the synchronous __ocr function in a thread to avoid blocking the event loop.
OCR Launcher (__ocr_launcher)
If multiple GPUs are detected, runs concurrent OCR tasks across devices using a nursery.
If single GPU or CPU, runs OCR tasks sequentially.
Uses await trio.sleep(0.1) to stagger task starts slightly.
Execution
Calls trio.run(__ocr_launcher) to run the async event loop and start OCR tasks.
Prints completion message after all OCR tasks finish.
Returns
None. Outputs are saved to disk.
Usage Example
python t_ocr.py --inputs ./input_images --output_dir ./ocr_results
Command-Line Interface
The script uses argparse to define two arguments:
--inputs (required): Path to input directory or file (image or PDF).
--output_dir (optional): Directory for storing output images and text files. Defaults to ./ocr_outputs.
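A parser matching the two arguments described above might look like the sketch below. The argument names and the default come from this document; the description and help strings, and the build_parser helper, are assumptions added for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Description and help strings are illustrative, not the script's own.
    parser = argparse.ArgumentParser(description="Batch OCR over images or PDFs")
    parser.add_argument(
        "--inputs",
        required=True,
        help="Path to an input directory or a single image/PDF file",
    )
    parser.add_argument(
        "--output_dir",
        default="./ocr_outputs",
        help="Directory for annotated images and text files",
    )
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(["--inputs", "./input_images"])
print(args.inputs, args.output_dir)
```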
Important Implementation Details and Algorithms
OCR Processing:
The OCR engine (OCR() class) is expected to return detected text lines along with bounding box coordinates. This script processes those outputs by extracting bounding box corners and text, then filters out invalid boxes where coordinates are inconsistent.
Bounding Box Drawing:
Uses the draw_box function to visually annotate images with OCR results, marking recognized text areas.
Asynchronous GPU Utilization:
The script uses the trio concurrency library to run multiple OCR tasks in parallel, distributing them across multiple CUDA devices where available. Each GPU has a CapacityLimiter(1), ensuring that at most one task runs on each GPU at a time and preventing resource conflicts.
Threading inside Async:
Since OCR is a synchronous process (likely CPU/GPU bound), it is run inside trio.to_thread.run_sync to avoid blocking the async event loop.
Input/Output Handling:
The init_in_out function (imported) presumably loads input images and prepares output file paths, abstracting away file I/O complexity.
Interactions with Other System Components
deepdoc.vision.OCR:
Core OCR engine performing text detection and recognition.
deepdoc.vision.seeit.draw_box:
Utility for drawing bounding boxes on images.
deepdoc.vision.init_in_out:
Responsible for input loading and output path setup.
torch.cuda:
Used to detect available GPUs for parallel processing.
trio:
Asynchronous concurrency framework used to parallelize OCR tasks.
This file acts as a CLI tool that integrates these components to perform batched OCR with GPU acceleration and asynchronous concurrency.
Visual Diagram
flowchart TD
A[Start: Parse CLI Arguments]
B[Initialize OCR Engine]
C[Load Inputs & Prepare Outputs]
D{Detect CUDA Devices}
E["Create CapacityLimiters (if multiple GPUs)"]
F[For Each Image]
G{Multiple GPUs?}
H[Run OCR Task with CapacityLimiter]
I[Run OCR Task Sequentially]
J["OCR Task (__ocr)"]
K[Draw Bounding Boxes]
L[Save Annotated Image & Text]
M[All Tasks Completed]
A --> B --> C --> D --> E
E --> F
F --> G
G -- Yes --> H --> J --> K --> L
G -- No --> I --> J --> K --> L
L --> F
F --> M
Summary
t_ocr.py is a GPU-accelerated, asynchronous batch OCR processing script that loads images/PDFs, applies OCR to extract text and bounding boxes, annotates images, and saves both visual and textual results. When multiple GPUs are available, it spreads tasks across them and limits each GPU to one task at a time, which raises throughput while avoiding device contention. The script is run from the command line with two arguments specifying input and output locations. It relies on core OCR and vision utilities from the deepdoc package and can serve as a component in document digitization or analysis pipelines.