layout_recognizer.py
Overview
layout_recognizer.py defines classes and methods for detecting and classifying the structural layout components of document images. It extends a general Recognizer class to identify various document regions such as text blocks, titles, figures, tables, headers, footers, references, and equations within scanned or digital documents.
The primary functionality includes:
Loading and initializing layout detection models from local resources or the Huggingface Hub.
Predicting and tagging layout elements in document images.
Post-processing and cleaning layout predictions to improve accuracy.
Providing a specialized subclass, LayoutRecognizer4YOLOv10, that implements preprocessing and postprocessing tailored to a YOLOv10-based layout detection model.
This module plays a critical role in document digitization pipelines, enabling downstream tasks such as OCR, semantic tagging, and content extraction by providing structured layout information.
Classes and Functions
Class: LayoutRecognizer
Inherits from deepdoc.vision.Recognizer.
Description
LayoutRecognizer encapsulates a deep learning model for document layout analysis. It predicts bounding boxes and categories of layout elements on pages and integrates OCR results to tag text blocks with layout types.
Attributes
labels (list of str): The predefined layout classes the model can recognize: background, Text, Title, Figure, Figure caption, Table, Table caption, Header, Footer, Reference, Equation.
garbage_layouts (list of str): Layout types treated as "garbage", that is, less relevant to the main content, e.g., "footer", "header", "reference".
client (optional): A client interface to a Deep Learning Accelerator (DLA) server, created when the environment variable TENSORRT_DLA_SVR is set.
Methods
__init__(self, domain)
Initializes the recognizer:
Attempts to load a local deepdoc model from the project directory.
If unavailable, downloads the model snapshot from the Huggingface Hub repository "InfiniFlow/deepdoc".
Optionally initializes a DLA client if the environment variable TENSORRT_DLA_SVR is set.
Parameters
domain (str): The domain context for the recognizer (e.g., "layout").
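The local-first loading order described above can be sketched as follows. This is a minimal illustration, not the module's actual constructor: the helper names resolve_model_dir and dla_server_address are hypothetical, and the injectable download argument exists only so the fallback can be exercised without network access. The only external call used is huggingface_hub.snapshot_download with its repo_id and local_dir parameters.

```python
import os
from typing import Callable, Optional


def resolve_model_dir(
    local_dir: str,
    repo_id: str = "InfiniFlow/deepdoc",
    download: Optional[Callable[..., str]] = None,
) -> str:
    """Return a directory holding the model files, preferring a local copy.

    Hypothetical helper: the real constructor may locate files differently.
    """
    # Use the local directory when it exists and is not empty.
    if os.path.isdir(local_dir) and os.listdir(local_dir):
        return local_dir
    if download is None:
        # Imported lazily so the local path works without the dependency.
        from huggingface_hub import snapshot_download
        download = snapshot_download
    # snapshot_download returns the path of the downloaded snapshot.
    return download(repo_id=repo_id, local_dir=local_dir)


def dla_server_address() -> Optional[str]:
    """Return the DLA server address, or None when the variable is unset or empty."""
    return os.environ.get("TENSORRT_DLA_SVR") or None
```

Checking the local directory first keeps offline deployments working and avoids a network round trip on every start; the download only runs when no local files are found.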
__call__(self, image_list, ocr_res, scale_factor=3, thr=0.2, batch_size=16, drop=True) -> (list, list)
Main inference method to predict and tag layout regions on a list of images.
Parameters
image_list (List[PIL.Image or np.ndarray]): List of document images to process.
ocr_res (List[List[dict]]): Corresponding OCR results per image; each OCR result is a list of text blocks with bounding boxes.
scale_factor (float, default=3): Scale factor to adjust coordinates between layout and OCR results.
thr (float, default=0.2): Confidence threshold for filtering layout predictions.
batch_size (int, default=16): Batch size for model inference.
drop (bool, default=True): Whether to remove "garbage" layout elements from results.
Returns
ocr_res (list): Updated OCR results with the additional fields layout_type and layoutno, which tag each text block with its layout category.
page_layout (list): List of detected layouts per page with bounding boxes and types.
Usage Example
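from PIL import Image
# Import path assumed from the file name; adjust it to your installation.
from deepdoc.vision.layout_recognizer import LayoutRecognizer

layout_recognizer = LayoutRecognizer(domain="layout")
images = [Image.open("page1.png"), Image.open("page2.png")]
ocr_results = [...]  # OCR outputs for these pages, one list of text blocks per image
tagged_ocr, layouts = layout_recognizer(images, ocr_results)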
Implementation Details
Uses either local model inference or a remote DLA client to get initial layout predictions.
Filters low-confidence and garbage layout types.
Sorts layouts top-to-bottom using sort_Y_firstly.
Cleans up overlapping or erroneous layouts with layouts_cleanup.
Tags OCR text blocks with identified layout types based on bounding box overlap.
Removes text blocks identified as "garbage" based on regex patterns and frequency counts.
Adds standalone figure/equation boxes if no overlapping text blocks exist.
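The tagging and dropping steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the module's implementation: OCR blocks and layouts are both assumed to be dicts with a "bbox" of [x0, y0, x1, y1], the "type-index" format of layoutno is assumed, the 0.4 threshold is applied to the fraction of the OCR box covered by the layout region, and blocks are dropped by layout type only (the regex and frequency checks are omitted). The names overlap_ratio and tag_blocks are hypothetical.

```python
from typing import Dict, List, Sequence


def overlap_ratio(box: Sequence[float], region: Sequence[float]) -> float:
    """Fraction of `box` area covered by `region`; both are [x0, y0, x1, y1]."""
    w = min(box[2], region[2]) - max(box[0], region[0])
    h = min(box[3], region[3]) - max(box[1], region[1])
    area = (box[2] - box[0]) * (box[3] - box[1])
    if w <= 0 or h <= 0 or area <= 0:
        return 0.0
    return (w * h) / area


def tag_blocks(
    ocr_blocks: List[Dict],
    layouts: List[Dict],
    thr: float = 0.4,
    drop: bool = True,
    garbage: Sequence[str] = ("footer", "header", "reference"),
) -> List[Dict]:
    """Attach layout_type and layoutno to each OCR block (assumed data shapes)."""
    # Sort regions top-to-bottom, then left-to-right, so indices follow reading order.
    ordered = sorted(layouts, key=lambda r: (r["bbox"][1], r["bbox"][0]))
    for block in ocr_blocks:
        best_idx, best = -1, 0.0
        for idx, region in enumerate(ordered):
            ratio = overlap_ratio(block["bbox"], region["bbox"])
            if ratio > best:
                best_idx, best = idx, ratio
        if best_idx >= 0 and best >= thr:
            layout_type = ordered[best_idx]["type"]
            block["layout_type"] = layout_type
            block["layoutno"] = f"{layout_type}-{best_idx}"
        else:
            # No region covers enough of this block; leave it untagged.
            block["layout_type"] = ""
            block["layoutno"] = ""
    if drop:
        ocr_blocks = [b for b in ocr_blocks if b["layout_type"] not in garbage]
    return ocr_blocks
```

Measuring coverage of the OCR box rather than symmetric IoU suits this step because a text line is usually much smaller than the layout region that contains it, so its IoU with that region would be low even when it lies fully inside.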
forward(self, image_list, thr=0.7, batch_size=16)
Simplified wrapper to call the parent class Recognizer's inference method directly.
Parameters
image_list (list): List of images.
thr (float): Confidence threshold.
batch_size (int): Batch size.
Returns
List of raw layout predictions.
Class: LayoutRecognizer4YOLOv10
Inherits from LayoutRecognizer.
Description
Specialized layout recognizer using a YOLOv10 architecture variant. Implements custom preprocessing and postprocessing compatible with YOLOv10 input/output formats.
Attributes
labels (list of str): Layout classes specific to the YOLOv10 model.
auto, scaleFill, scaleup, center (bool) and stride (int): Configuration values controlling image scaling and padding before inference.
Methods
__init__(self, domain)
Initializes the recognizer and sets YOLOv10-specific parameters.
Sets domain to "layout".
Calls the superclass constructor.
Initializes preprocessing flags.
preprocess(self, image_list) -> list
Prepares input images for YOLOv10 model inference.
Parameters
image_list(list of np.ndarray): List of images in BGR format.
Returns
inputs (list of dict): Each dict contains:
The preprocessed image tensor under the input name key.
A scale_factor list encoding resizing and padding info for later coordinate adjustment.
Process
Resizes images to the model's expected input shape while preserving aspect ratio.
Adds padding with constant value (114) to meet fixed input size.
Normalizes pixel values to [0,1].
Transposes dimensions to channel-first format.
Stores scaling and padding factors for postprocessing.
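The resizing and padding arithmetic in the steps above can be sketched without any image library. This is an illustration of the usual letterbox geometry under assumptions, not the module's code: letterbox_params is a hypothetical name, and the auto, scaleFill, and stride options are left out. The actual pixel operations (resampling, filling the border with 114, dividing by 255, and transposing to channel-first) would be applied using the values this function returns.

```python
from typing import Tuple


def letterbox_params(
    src_hw: Tuple[int, int],
    dst_hw: Tuple[int, int],
    scaleup: bool = True,
    center: bool = True,
) -> Tuple[float, Tuple[int, int], Tuple[int, int]]:
    """Compute the resize ratio, resized (h, w), and top/left padding offsets."""
    src_h, src_w = src_hw
    dst_h, dst_w = dst_hw
    # One ratio for both axes keeps the aspect ratio unchanged.
    ratio = min(dst_h / src_h, dst_w / src_w)
    if not scaleup:
        # Only shrink; never enlarge small images.
        ratio = min(ratio, 1.0)
    new_h, new_w = round(src_h * ratio), round(src_w * ratio)
    # With center=True the padding is split between both sides;
    # otherwise the image sits in the top-left corner.
    top = (dst_h - new_h) // 2 if center else 0
    left = (dst_w - new_w) // 2 if center else 0
    return ratio, (new_h, new_w), (top, left)


# A 1000x500 (height x width) page fitted into a 640x640 model input.
ratio, resized_hw, offsets = letterbox_params((1000, 500), (640, 640))
```

Keeping ratio and the offsets alongside each input is what allows postprocessing to map predicted boxes back to the original image.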
postprocess(self, boxes, inputs, thr) -> list
Filters and refines raw YOLOv10 output bounding boxes.
Parameters
boxes (np.ndarray): Raw model predictions.
inputs (dict): Corresponding input preprocessing info.
thr (float): Confidence threshold. The passed value is not used; the method overrides it internally with 0.08.
Returns
List of dicts with keys:
"type": Layout class label (lowercase).
"bbox": Bounding box coordinates [x0, y0, x1, y1].
"score": Confidence score.
Process
Filters boxes below confidence threshold.
Adjusts bounding box coordinates inversely according to scaling and padding factors.
Applies Non-Maximum Suppression (NMS) separately per class to remove overlapping boxes.
Maps class indices to label strings.
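The four steps above can be sketched in plain Python. This is a simplified illustration, not the module's implementation, which relies on NumPy and deepdoc.vision.operators.nms. The assumptions are: each raw detection is a tuple (x0, y0, x1, y1, score, class_index) in model-input pixels, ratio, left, and top are the letterbox values saved during preprocessing, and the IoU threshold of 0.45 is an illustrative default that the source does not specify.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def iou(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection over union of two [x0, y0, x1, y1] boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    if w <= 0 or h <= 0:
        return 0.0
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def postprocess(
    dets: List[Tuple[float, float, float, float, float, int]],
    labels: Sequence[str],
    ratio: float,
    left: float,
    top: float,
    thr: float = 0.08,
    iou_thr: float = 0.45,  # assumed value, not taken from the source
) -> List[Dict]:
    by_class = defaultdict(list)
    for x0, y0, x1, y1, score, cls in dets:
        if score < thr:
            continue  # step 1: drop low-confidence boxes
        # Step 2: undo padding, then undo scaling.
        bbox = [(x0 - left) / ratio, (y0 - top) / ratio,
                (x1 - left) / ratio, (y1 - top) / ratio]
        by_class[int(cls)].append((score, bbox))
    results = []
    for cls, items in by_class.items():
        # Step 3: greedy NMS within one class, highest score first.
        items.sort(key=lambda item: item[0], reverse=True)
        kept = []
        for score, bbox in items:
            if all(iou(bbox, other) <= iou_thr for _, other in kept):
                kept.append((score, bbox))
        # Step 4: map the class index to a lowercase label.
        results.extend(
            {"type": labels[cls].lower(), "bbox": bbox, "score": score}
            for score, bbox in kept
        )
    return results
```

Running NMS per class means a box is only suppressed by a higher-scoring box of the same class, so a figure and its overlapping caption, for example, can both survive.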
Important Implementation Details and Algorithms
Garbage Filtering: Uses regex patterns to detect and discard OCR text blocks likely to be noise (e.g., bullet points, page numbers, URLs).
Layout Cleanup: Applies a heuristic to sort detected layouts vertically and remove overlapping or spurious boxes.
Layout Tagging: Matches OCR boxes to detected layouts based on spatial overlap, using an overlap threshold of roughly 0.4, and assigns layout types accordingly.
Deep Learning Models: Loads pre-trained models either locally or via Huggingface Hub snapshot; supports integration with DLA acceleration via RPC client.
YOLOv10 Pre/Postprocessing: Custom image resizing with padding to fixed input size, normalization, and channel reordering. Postprocessing includes coordinate re-scaling and class-wise NMS for accurate box selection.
Deepcopy Usage: Ensures layout boxes added without OCR text do not modify original detection dictionaries.
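As an illustration of the garbage filtering item above, a regex-based noise check might look like the sketch below. The patterns are examples written for this document and are not the ones the module uses; GARBAGE_PATTERNS and is_garbage are hypothetical names, and the frequency-count part of the filter is not shown.

```python
import re

# Illustrative patterns only; the module's actual expressions may differ.
GARBAGE_PATTERNS = [
    # A block consisting solely of bullet markers.
    re.compile(r"^[•·◦▪■\-\*]+$"),
    # A bare page number such as "12", "Page 3", "3 / 10" or "Page 3 of 10".
    re.compile(r"^(page\s+)?\d+(\s*(/|of)\s*\d+)?$", re.IGNORECASE),
    # A block that is nothing but a URL.
    re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE),
]


def is_garbage(text: str) -> bool:
    """Return True when the whole text block matches one of the noise patterns."""
    stripped = text.strip()
    return any(pattern.search(stripped) for pattern in GARBAGE_PATTERNS)
```

Anchoring every pattern to the whole block keeps the filter conservative: a sentence that merely contains a number or a link is left alone.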
Interaction with Other System Components
deepdoc.vision.Recognizer: Base class providing generic model loading and inference capabilities.
api.utils.file_utils.get_project_base_directory: Utility to determine the project root path for locating model resources.
huggingface_hub.snapshot_download: Fetches the model snapshot from the remote repository if local resources are missing.
deepdoc.vision.operators.nms: Performs the Non-Maximum Suppression used in YOLOv10 postprocessing.
OCR Results Input: Expects external OCR outputs to enrich detected layout boxes with associated text and metadata.
Environment Variable
TENSORRT_DLA_SVR: Enables accelerated inference via Deep Learning Accelerator client if set.
This file typically fits into a document processing pipeline downstream of OCR and upstream of semantic analysis or document understanding modules.
Visual Diagram
classDiagram
    class LayoutRecognizer {
        +labels: list
        +garbage_layouts: list
        +client: DLAClient or None
        +__init__(domain)
        +__call__(image_list, ocr_res, scale_factor=3, thr=0.2, batch_size=16, drop=True)
        +forward(image_list, thr=0.7, batch_size=16)
    }
    class LayoutRecognizer4YOLOv10 {
        +labels: list
        +auto: bool
        +scaleFill: bool
        +scaleup: bool
        +stride: int
        +center: bool
        +__init__(domain)
        +preprocess(image_list)
        +postprocess(boxes, inputs, thr)
    }
    LayoutRecognizer4YOLOv10 --|> LayoutRecognizer
Summary
The layout_recognizer.py module provides a robust framework for detecting and classifying document layout regions using deep learning. It supports flexible model loading, accelerated inference, and integrates OCR results to produce enriched and cleaned layout annotations. The specialized YOLOv10 subclass demonstrates extensibility for different model architectures with tailored data processing pipelines. This module is central to enabling structured understanding of complex document images within the InfiniFlow system.