layout_recognizer.py

Overview

layout_recognizer.py defines classes and methods for detecting and classifying the structural layout components of document images. It extends a general Recognizer class to identify various document regions such as text blocks, titles, figures, tables, headers, footers, references, and equations within scanned or digital documents.

The primary functionality includes:

Loading and initializing layout detection models from local resources or the Huggingface Hub.
Predicting and tagging layout elements in document images.
Post-processing and cleaning layout predictions to improve accuracy.
Specialized subclass LayoutRecognizer4YOLOv10 implementing preprocessing and postprocessing tailored for a YOLOv10-based layout detection model.

This module sits between OCR and later stages of a document digitization pipeline. It attaches structured layout information to OCR output, which downstream tasks such as semantic tagging and content extraction rely on.

Classes and Functions

Class: `LayoutRecognizer`

Inherits from deepdoc.vision.Recognizer.

Description

LayoutRecognizer encapsulates a deep learning model for document layout analysis. It predicts bounding boxes and categories of layout elements on pages and integrates OCR results to tag text blocks with layout types.

Attributes

labels (list of str): The predefined layout classes the model can recognize, including:
- background, Text, Title, Figure, Figure caption, Table, Table caption, Header, Footer, Reference, Equation.
garbage_layouts (list of str): Layout types considered as "garbage" or less relevant, e.g., "footer", "header", "reference".
client (optional): A client interface to a Deep Learning Accelerator (DLA) server if configured via environment variable TENSORRT_DLA_SVR.

Methods

`__init__(self, domain)`

Initializes the recognizer:

Attempts to load a local deepdoc model from the project directory.
If unavailable, downloads the model snapshot from the Huggingface Hub repository "InfiniFlow/deepdoc".
Optionally initializes a DLA client if environment variable TENSORRT_DLA_SVR is set.

Parameters

domain (str): The domain context for the recognizer (e.g., "layout").
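The local-first, download-as-fallback loading order described above can be sketched as follows. This is a minimal illustration, not the module's actual code: `resolve_model_dir` and `dla_server_address` are hypothetical helper names, and the download step is passed in as a callable so the sketch runs without network access.

```python
import os
from typing import Callable, Optional


def resolve_model_dir(local_dir: str, download: Callable[[], str]) -> str:
    """Return a directory that holds the model files.

    Prefer `local_dir` when it exists and is non-empty; otherwise call
    `download`, which is expected to fetch the files and return their path.
    """
    if os.path.isdir(local_dir) and os.listdir(local_dir):
        return local_dir
    return download()


def dla_server_address(env=os.environ) -> Optional[str]:
    """Return the TENSORRT_DLA_SVR value, or None when it is unset or empty."""
    return env.get("TENSORRT_DLA_SVR") or None
```

In a real setup, `download` could wrap `huggingface_hub.snapshot_download(repo_id="InfiniFlow/deepdoc", local_dir=...)`, and a non-`None` server address would be the signal to create the remote client instead of running inference locally.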

`__call__(self, image_list, ocr_res, scale_factor=3, thr=0.2, batch_size=16, drop=True) -> (list, list)`

Main inference method to predict and tag layout regions on a list of images.

Parameters

image_list (List[PIL.Image or np.ndarray]): List of document images to process.
ocr_res (List[List[dict]]): Corresponding OCR results per image; each OCR result is a list of text blocks with bounding boxes.
scale_factor (float, default=3): Scale factor to adjust coordinates between layout and OCR results.
thr (float, default=0.2): Confidence threshold for filtering layout predictions.
batch_size (int, default=16): Batch size for model inference.
drop (bool, default=True): Whether to remove "garbage" layout elements from results.

Returns

ocr_res (list): Updated OCR results with additional fields layout_type and layoutno tagging layout categories.
page_layout (list): List of detected layouts per page with bounding boxes and types.

Usage Example

from PIL import Image
from deepdoc.vision import LayoutRecognizer  # import path assumed from the base class location

layout_recognizer = LayoutRecognizer(domain="layout")

images = [Image.open("page1.png"), Image.open("page2.png")]
ocr_results = [...]  # OCR outputs for these pages

tagged_ocr, layouts = layout_recognizer(images, ocr_results)

Implementation Details

Uses either local model inference or a remote DLA client to get initial layout predictions.
Filters low-confidence and garbage layout types.
Sorts layouts top-to-bottom using sort_Y_firstly.
Cleans up overlapping or erroneous layouts with layouts_cleanup.
Tags OCR text blocks with identified layout types based on bounding box overlap.
Removes text blocks identified as "garbage" based on regex patterns and frequency counts.
Adds standalone figure/equation boxes if no overlapping text blocks exist.
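The sorting, tagging, and dropping steps above can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: the box keys `x0`, `x1`, `top`, `bottom`, the `"{type}-{index}"` format of `layoutno`, and the choice to measure overlap as the covered fraction of the OCR box are all hypothetical.

```python
def overlap_ratio(box, region):
    """Fraction of `box` area covered by `region` (0.0 when they are disjoint)."""
    w = min(box["x1"], region["x1"]) - max(box["x0"], region["x0"])
    h = min(box["bottom"], region["bottom"]) - max(box["top"], region["top"])
    if w <= 0 or h <= 0:
        return 0.0
    area = (box["x1"] - box["x0"]) * (box["bottom"] - box["top"])
    return (w * h) / area if area > 0 else 0.0


def tag_boxes(ocr_boxes, layouts, thr=0.4,
              garbage=("footer", "header", "reference"), drop=True):
    """Tag each OCR box with the layout region that covers it best."""
    # Sort regions top-to-bottom, then left-to-right.
    layouts = sorted(layouts, key=lambda r: (r["top"], r["x0"]))
    tagged = []
    for box in ocr_boxes:
        best_i, best = None, 0.0
        for i, region in enumerate(layouts):
            ratio = overlap_ratio(box, region)
            if ratio > best:
                best_i, best = i, ratio
        out = dict(box)  # copy so the caller's dict is not mutated
        if best_i is not None and best >= thr:
            layout_type = layouts[best_i]["type"]
            if drop and layout_type in garbage:
                continue  # skip text that falls in a "garbage" region
            out["layout_type"] = layout_type
            out["layoutno"] = f"{layout_type}-{best_i}"
        tagged.append(out)
    return tagged
```

Boxes that overlap no region strongly enough are kept without a tag, so later stages can still see their text.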

`forward(self, image_list, thr=0.7, batch_size=16)`

Simplified wrapper to call the parent class Recognizer's inference method directly.

Parameters

image_list (list): List of images.
thr (float): Confidence threshold.
batch_size (int): Batch size.

Returns

List of raw layout predictions.

Class: `LayoutRecognizer4YOLOv10`

Inherits from LayoutRecognizer.

Description

Specialized layout recognizer using a YOLOv10 architecture variant. Implements custom preprocessing and postprocessing compatible with YOLOv10 input/output formats.

Attributes

labels (list of str): Layout classes specific to YOLOv10 model.
auto, scaleFill, scaleup, center (bool) and stride (int): Configuration values controlling image scaling and padding before inference.

Methods

`__init__(self, domain)`

Initializes the recognizer and sets YOLOv10-specific parameters.

Sets domain to "layout".
Calls super constructor.
Initializes preprocessing flags.

`preprocess(self, image_list) -> list`

Prepares input images for YOLOv10 model inference.

Parameters

image_list (list of np.ndarray): List of images in BGR format.

Returns

inputs (list of dict): Each dict contains:
- Preprocessed image tensor under the input name key.
- scale_factor list encoding resizing and padding info for later coordinate adjustment.

Process

Resizes images to the model's expected input shape while preserving aspect ratio.
Adds padding with constant value (114) to meet fixed input size.
Normalizes pixel values to [0,1].
Transposes dimensions to channel-first format.
Stores scaling and padding factors for postprocessing.
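The arithmetic behind the resize-and-pad step can be sketched as follows. This is a minimal pure-Python illustration with hypothetical function names. It only computes the geometry and the normalization, leaving the actual pixel resizing and the constant-value padding to an image library.

```python
def letterbox_geometry(height, width, target_h, target_w, scaleup=True):
    """Compute the resize ratio and padding needed to fit an image into a
    fixed target size while preserving its aspect ratio."""
    ratio = min(target_h / height, target_w / width)
    if not scaleup:
        ratio = min(ratio, 1.0)  # only shrink, never enlarge
    new_h, new_w = round(height * ratio), round(width * ratio)
    pad_h, pad_w = target_h - new_h, target_w - new_w
    top, left = pad_h // 2, pad_w // 2  # centre the image in the canvas
    return {
        "ratio": ratio,
        "resized": (new_h, new_w),
        "pad": (top, pad_h - top, left, pad_w - left),  # top, bottom, left, right
    }


def normalize_chw(pixels):
    """Convert H x W x C nested lists of 0..255 values into
    C x H x W nested lists of 0..1 floats."""
    h, w, c = len(pixels), len(pixels[0]), len(pixels[0][0])
    return [[[pixels[y][x][k] / 255.0 for x in range(w)]
             for y in range(h)]
            for k in range(c)]
```

Keeping `ratio` and the left/top padding alongside the tensor is what lets postprocessing map predicted boxes back to the original image coordinates.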

`postprocess(self, boxes, inputs, thr) -> list`

Filters and refines raw YOLOv10 output bounding boxes.

Parameters

boxes (np.ndarray): Raw model predictions.
inputs (dict): Corresponding input preprocessing info.
thr (float): Confidence threshold (overridden internally to 0.08).

Returns

List of dicts with keys:
- "type": Layout class label (lowercase).
- "bbox": Bounding box coordinates [x0, y0, x1, y1].
- "score": Confidence score.

Process

Filters boxes below confidence threshold.
Adjusts bounding box coordinates inversely according to scaling and padding factors.
Applies Non-Maximum Suppression (NMS) separately per class to remove overlapping boxes.
Maps class indices to label strings.
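The four steps above can be sketched in pure Python as follows. This is an illustration, not the module's code: the raw detection layout `(x0, y0, x1, y1, score, class_index)`, the IoU threshold of 0.45, and all function names are assumptions made for the example.

```python
def iou(a, b):
    """Intersection over union of two [x0, y0, x1, y1] boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def unletterbox(bbox, ratio, pad_left, pad_top):
    """Map a box from padded model-input coordinates back to the original image."""
    x0, y0, x1, y1 = bbox
    return [(x0 - pad_left) / ratio, (y0 - pad_top) / ratio,
            (x1 - pad_left) / ratio, (y1 - pad_top) / ratio]


def postprocess(dets, labels, ratio, pad_left, pad_top, thr=0.08, iou_thr=0.45):
    """Filter by score, run greedy NMS per class, and rescale the survivors."""
    by_class = {}
    for x0, y0, x1, y1, score, cls in dets:
        if score >= thr:
            by_class.setdefault(int(cls), []).append(([x0, y0, x1, y1], score))

    results = []
    for cls, items in by_class.items():
        items.sort(key=lambda item: item[1], reverse=True)  # best score first
        survivors = []
        for bbox, score in items:
            # Keep a box only if it does not overlap a better box of its class.
            if all(iou(bbox, kept[0]) <= iou_thr for kept in survivors):
                survivors.append((bbox, score))
        for bbox, score in survivors:
            results.append({
                "type": labels[cls].lower(),
                "bbox": unletterbox(bbox, ratio, pad_left, pad_top),
                "score": score,
            })
    return results
```

Running NMS per class means a caption box is never suppressed by the figure it sits next to, even when the two overlap heavily.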

Important Implementation Details and Algorithms

Garbage Filtering: Uses regex patterns to detect and discard OCR text blocks likely to be noise (e.g., bullet points, page numbers, URLs).
Layout Cleanup: Applies a heuristic to sort detected layouts vertically and remove overlapping or spurious boxes.
Layout Tagging: Matches OCR boxes to detected layouts by spatial overlap, using a threshold of approximately 0.4 (an IoU-style ratio), and assigns layout types accordingly.
Deep Learning Models: Loads pre-trained models either locally or via Huggingface Hub snapshot; supports integration with DLA acceleration via RPC client.
YOLOv10 Pre/Postprocessing: Custom image resizing with padding to fixed input size, normalization, and channel reordering. Postprocessing includes coordinate re-scaling and class-wise NMS for accurate box selection.
Deepcopy Usage: Ensures layout boxes added without OCR text do not modify original detection dictionaries.
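As a sketch of the garbage-filtering idea, the snippet below discards OCR blocks whose whole text matches a noise pattern. The patterns are illustrative assumptions chosen to cover the examples named above (bullets, page counters, URLs). They are not the module's actual pattern list, and the `text` key is likewise assumed.

```python
import re

# Hypothetical noise patterns; each must match the entire stripped text.
GARBAGE_PATTERNS = [
    r"^•+$",                          # a line made only of bullet characters
    r"^[0-9]{1,2} ?/ ?[0-9]{1,2}$",   # page counters such as "3 / 12"
    r"^[0-9]{1,2} of [0-9]{1,2}$",    # page counters such as "2 of 10"
    r"^https?://\S{12,}$",            # a bare, long URL on its own line
]
_COMPILED = [re.compile(pattern) for pattern in GARBAGE_PATTERNS]


def is_garbage(text):
    """Return True when the whole text looks like layout noise."""
    stripped = text.strip()
    return any(pattern.search(stripped) for pattern in _COMPILED)


def drop_garbage(boxes):
    """Return only the boxes whose text is not classified as noise."""
    return [box for box in boxes if not is_garbage(box.get("text", ""))]
```

Anchoring every pattern at both ends keeps ordinary sentences that merely contain a URL or a fraction from being discarded.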

Interaction with Other System Components

deepdoc.vision.Recognizer: Base class providing generic model loading and inference capabilities.
api.utils.file_utils.get_project_base_directory: Utility to determine project root path for locating model resources.
huggingface_hub.snapshot_download: Fetches model snapshot from remote repository if local resources are missing.
deepdoc.vision.operators.nms: Performs Non-Maximum Suppression used in YOLOv10 postprocessing.
OCR Results Input: Expects external OCR outputs to enrich detected layout boxes with associated text and metadata.
Environment Variable TENSORRT_DLA_SVR: Enables accelerated inference via Deep Learning Accelerator client if set.

This file typically fits into a document processing pipeline downstream of OCR and upstream of semantic analysis or document understanding modules.

Visual Diagram

classDiagram
    class LayoutRecognizer {
        +labels: list
        +garbage_layouts: list
        +client: DLAClient or None
        +__init__(domain)
        +__call__(image_list, ocr_res, scale_factor=3, thr=0.2, batch_size=16, drop=True)
        +forward(image_list, thr=0.7, batch_size=16)
    }

    class LayoutRecognizer4YOLOv10 {
        +labels: list
        +auto: bool
        +scaleFill: bool
        +scaleup: bool
        +stride: int
        +center: bool
        +__init__(domain)
        +preprocess(image_list)
        +postprocess(boxes, inputs, thr)
    }

    LayoutRecognizer4YOLOv10 --|> LayoutRecognizer

Summary

The layout_recognizer.py module provides classes for detecting and classifying document layout regions using deep learning. It loads models locally or from the Huggingface Hub, can delegate inference to a DLA server, and integrates OCR results to produce enriched and cleaned layout annotations. The YOLOv10 subclass shows how a different model architecture can be supported by overriding preprocessing and postprocessing. This module supplies the structured layout information used to interpret document images within the InfiniFlow system.
