ocr.py

Overview

ocr.py implements a modular Optical Character Recognition (OCR) pipeline for document and text images. It detects text regions in an image, recognizes the text within those regions, and coordinates the two steps in an end-to-end workflow. Model inference runs on ONNX Runtime on either CPU or GPU, and the module supports multi-device parallelism along with several image preprocessing strategies tailored to different OCR model architectures.

Key capabilities include:

The file appears to be part of a larger document-understanding system, possibly the InfiniFlow project, and depends on related modules such as operators and postprocess.


Detailed Documentation

Functions


transform(data, ops=None)

Applies a sequence of operator functions to input data.
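
The chaining behavior described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the early exit when an operator returns None is an assumption, and scale and drop_negative are hypothetical operators invented for the example.

```python
def transform(data, ops=None):
    """Pass `data` through each operator in order."""
    if ops is None:
        ops = []
    for op in ops:
        data = op(data)
        # Assumption: an operator may return None to signal that the
        # sample should be discarded, which stops the chain early.
        if data is None:
            return None
    return data


# Hypothetical operators, for illustration only.
def scale(sample):
    return {**sample, "value": sample["value"] * 2}


def drop_negative(sample):
    return sample if sample["value"] >= 0 else None


result = transform({"value": 3}, [scale, drop_negative])
skipped = transform({"value": -1}, [scale, drop_negative])
```

Treating every operator as a plain callable keeps the preprocessing chain easy to reorder or extend from configuration.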


create_operators(op_param_list, global_config=None)

Creates a list of operator instances based on configuration dictionaries.
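
One plausible shape for this factory is sketched below. It assumes each configuration dictionary has a single key naming the operator class, with its value holding the constructor arguments; the Resize and Normalize classes and the REGISTRY lookup are hypothetical stand-ins for whatever the operators module actually exports.

```python
class Resize:
    # Hypothetical operator used only for this sketch.
    def __init__(self, size=32, **kwargs):
        self.size = size


class Normalize:
    # Hypothetical operator used only for this sketch.
    def __init__(self, scale=1.0, **kwargs):
        self.scale = scale


# Assumption: operator names are resolved through a simple name-to-class map.
REGISTRY = {"Resize": Resize, "Normalize": Normalize}


def create_operators(op_param_list, global_config=None):
    """Build operator instances from a list of single-key config dicts."""
    ops = []
    for entry in op_param_list:
        if not isinstance(entry, dict) or len(entry) != 1:
            raise ValueError("each operator config must be a single-key dict")
        name, params = next(iter(entry.items()))
        params = dict(params or {})  # a None value means "use defaults"
        if global_config is not None:
            params.update(global_config)
        ops.append(REGISTRY[name](**params))
    return ops


ops = create_operators([{"Resize": {"size": 64}}, {"Normalize": None}])
```

The resulting list can be handed directly to transform as its ops argument.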


load_model(model_dir, nm, device_id: int | None = None)

Loads an ONNX model with optional GPU support and caches it for reuse.
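
The caching idea can be illustrated without ONNX Runtime itself. In the sketch below, load_model_cached, the loader argument, the ".onnx" file suffix, and the cache key format are all assumptions made for the example; in the real module the loader role would be played by code that creates an inference session.

```python
import os

# Module-level cache: one loaded model per (file path, device) combination.
_loaded = {}


def load_model_cached(model_dir, nm, device_id=None, loader=None):
    """Return a cached model, loading it only on first request."""
    path = os.path.join(model_dir, nm + ".onnx")  # assumed naming scheme
    key = path if device_id is None else f"{path}#{device_id}"
    if key not in _loaded:
        _loaded[key] = loader(path, device_id)
    return _loaded[key]


# Stand-in loader that records how often it is actually invoked.
calls = []


def fake_loader(path, device_id):
    calls.append((path, device_id))
    return object()


a = load_model_cached("models", "det", 0, fake_loader)
b = load_model_cached("models", "det", 0, fake_loader)  # served from cache
c = load_model_cached("models", "det", 1, fake_loader)  # different device
```

Including the device in the key lets each device hold its own model instance, which is one way multi-device parallelism could coexist with caching.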


Classes


TextRecognizer

Performs text recognition from cropped text images.
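
The class diagram lists a rec_batch_num attribute and a resize_norm_img(img, max_wh_ratio) method, which suggests that crops are recognized in fixed-size batches padded to a shared width. The sketch below shows one common way to organize that: sort crops by aspect ratio so each batch holds similarly shaped images, then restore the original order. The sorting step, the recognize_in_batches name, and the stub recognizer are assumptions, and each crop is represented only by its (width, height) pair.

```python
def recognize_in_batches(sizes, rec_batch_num, recognize_batch):
    """Run a batch recognizer over crops grouped by similar aspect ratio.

    `sizes` holds one (width, height) pair per crop; `recognize_batch`
    is a stand-in taking (batch, max_wh_ratio) and returning one result
    per crop in the batch.
    """
    # Indices ordered by width/height ratio, narrowest first.
    order = sorted(range(len(sizes)), key=lambda i: sizes[i][0] / sizes[i][1])
    results = [None] * len(sizes)
    for start in range(0, len(order), rec_batch_num):
        idx = order[start:start + rec_batch_num]
        # The widest crop in the batch sets the padded width for all of them.
        max_wh_ratio = max(sizes[i][0] / sizes[i][1] for i in idx)
        outputs = recognize_batch([sizes[i] for i in idx], max_wh_ratio)
        for i, out in zip(idx, outputs):
            results[i] = out  # write back at the crop's original position
    return results


seen_batches = []


def stub_recognizer(batch, max_wh_ratio):
    seen_batches.append((list(batch), max_wh_ratio))
    return [f"w{w}" for w, _ in batch]


texts = recognize_in_batches([(320, 32), (64, 32), (128, 32)], 2, stub_recognizer)
```

Grouping by aspect ratio limits how much padding the narrow crops in a batch receive.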


TextDetector

Detects text boxes in an image.
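
Two of the helper methods named in the class diagram, order_points_clockwise and clip_det_res, lend themselves to a short illustration. The versions below are pure-Python sketches written from the method names alone; the real implementations may differ in array types and in how ties are handled.

```python
def order_points_clockwise(pts):
    """Order four (x, y) corners as top-left, top-right, bottom-right, bottom-left.

    Image coordinates are assumed, with y increasing downward.
    """
    by_x = sorted(pts, key=lambda p: p[0])
    left, right = by_x[:2], by_x[2:]
    tl, bl = sorted(left, key=lambda p: p[1])   # upper point of the left pair first
    tr, br = sorted(right, key=lambda p: p[1])  # upper point of the right pair first
    return [tl, tr, br, bl]


def clip_det_res(points, img_h, img_w):
    """Clamp every point so it lies inside the image bounds."""
    return [
        (min(max(x, 0), img_w - 1), min(max(y, 0), img_h - 1))
        for x, y in points
    ]


ordered = order_points_clockwise([(90, 40), (10, 10), (12, 42), (88, 8)])
clipped = clip_det_res([(-5, 3), (120, 60)], img_h=50, img_w=100)
```

A consistent corner order matters downstream, because cropping and rotation correction need to know which edge of the box is the top.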


OCR

High-level class integrating detection and recognition into an end-to-end OCR pipeline.
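
The end-to-end flow can be sketched with stand-ins for the detector, cropper, and recognizer. Everything here is illustrative: run_ocr, the row_tolerance parameter, the default drop_score of 0.5, and the >= comparison are assumptions, and sorted_boxes is a simplified reading-order sort rather than the module's own method.

```python
def sorted_boxes(dt_boxes, row_tolerance=10):
    """Sort boxes top to bottom, then left to right within a row."""
    boxes = sorted(dt_boxes, key=lambda b: (b[0][1], b[0][0]))
    for i in range(len(boxes) - 1):
        for j in range(i, -1, -1):
            same_row = abs(boxes[j + 1][0][1] - boxes[j][0][1]) < row_tolerance
            if same_row and boxes[j + 1][0][0] < boxes[j][0][0]:
                boxes[j], boxes[j + 1] = boxes[j + 1], boxes[j]
            else:
                break
    return boxes


def run_ocr(img, detect, crop, recognize_batch, drop_score=0.5):
    """Detect boxes, crop each one, recognize text, drop low-confidence results."""
    boxes = sorted_boxes(detect(img))
    crops = [crop(img, box) for box in boxes]
    results = recognize_batch(crops)
    return [
        (box, (text, score))
        for box, (text, score) in zip(boxes, results)
        if score >= drop_score
    ]


# Stand-in components; each box is four (x, y) corners starting at top-left.
BOX_A = [(50, 10), (90, 10), (90, 30), (50, 30)]
BOX_B = [(10, 14), (40, 14), (40, 34), (10, 34)]
BOX_C = [(10, 100), (60, 100), (60, 120), (10, 120)]
FAKE_TEXT = {(10, 14): ("hello", 0.9), (50, 10): ("world", 0.3), (10, 100): ("bye", 0.8)}

output = run_ocr(
    img=None,
    detect=lambda img: [BOX_A, BOX_B, BOX_C],
    crop=lambda img, box: box[0],  # the "crop" is just the top-left corner
    recognize_batch=lambda crops: [FAKE_TEXT[c] for c in crops],
)
```

In this run the box reading "world" is recognized second but filtered out, since its score falls below the threshold.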


Important Implementation Details and Algorithms


Interaction with Other System Components


Visual Diagram

Below is a class diagram illustrating the main classes, their core methods, and relationships in ocr.py.

classDiagram
    class OCR {
        -text_detector: list~TextDetector~
        -text_recognizer: list~TextRecognizer~
        -drop_score: float
        +__init__(model_dir=None)
        +detect(img, device_id)
        +recognize(ori_im, box, device_id)
        +recognize_batch(img_list, device_id)
        +__call__(img, device_id, cls)
        +get_rotate_crop_image(img, points)
        +sorted_boxes(dt_boxes)
    }

    class TextDetector {
        -preprocess_op: list
        -postprocess_op
        -predictor
        -run_options
        -input_tensor
        +__init__(model_dir, device_id)
        +__call__(img)
        +order_points_clockwise(pts)
        +clip_det_res(points, img_h, img_w)
        +filter_tag_det_res(dt_boxes, image_shape)
        +filter_tag_det_res_only_clip(dt_boxes, image_shape)
        +close()
    }

    class TextRecognizer {
        -rec_image_shape: list
        -rec_batch_num: int
        -postprocess_op
        -predictor
        -run_options
        -input_tensor
        +__init__(model_dir, device_id)
        +__call__(img_list)
        +resize_norm_img(img, max_wh_ratio)
        +resize_norm_img_vl(img, image_shape)
        +resize_norm_img_srn(img, image_shape)
        +srn_other_inputs(image_shape, num_heads, max_text_length)
        +process_image_srn(img, image_shape, num_heads, max_text_length)
        +resize_norm_img_sar(img, image_shape, width_downsample_ratio)
        +resize_norm_img_spin(img)
        +resize_norm_img_svtr(img, image_shape)
        +resize_norm_img_abinet(img, image_shape)
        +norm_img_can(img, image_shape)
        +close()
    }

    OCR "1" *-- "*" TextDetector : has
    OCR "1" *-- "*" TextRecognizer : has

Summary

ocr.py implements an extensible OCR pipeline that runs detection and recognition models through ONNX Runtime. It offers configurable preprocessing, multi-device inference, and a modular design intended for integration into larger document processing systems. The OCR class is the primary interface: it coordinates text detection and recognition, and applies optimizations such as batched recognition and rotation-corrected cropping of detected regions.