t_recognizer.py

Overview

t_recognizer.py is a command-line tool designed to perform document image analysis focusing on two key tasks:

- Layout Recognition: identifying and classifying regions of a document image, such as paragraphs, titles, figures, and tables.
- Table Structure Recognition (TSR): detecting table components (columns, headers, rows, spanning cells) and reconstructing the table structure from the image.

The script leverages pre-trained deep learning models from the deepdoc library for layout and table structure recognition, and integrates Optical Character Recognition (OCR) to extract textual content for reconstructing tables as HTML.

It batch-processes images or PDFs from an input directory or a single file and saves each result as an image with bounding boxes drawn on it. In TSR mode, it also writes corresponding HTML files representing the detected table layouts.

Detailed Description

Main Functionalities

Initialize inputs and outputs using init_in_out from deepdoc.vision.
Run either Layout Recognition (LayoutRecognizer) or Table Structure Recognition (TableStructureRecognizer) based on the user’s mode choice.
For TSR mode, run OCR on each image, match the extracted text to the detected table components, and build an HTML table.
Save annotated images with bounding boxes drawn around detected regions.
For TSR mode, save the reconstructed tables as .html files.

Classes and Functions

1. `main(args)`

Purpose:
Entry function that loads the input images, runs the selected recognition mode, and then visualizes and saves the results.

Parameters:

args: Parsed command-line arguments containing:
- inputs: path to input images or PDFs
- output_dir: directory to save outputs
- threshold: detection confidence threshold
- mode: "layout" or "tsr"

Returns:

None (outputs files to disk)

Functionality:

Calls init_in_out(args) to load images and prepare output paths.
Depending on mode:
- "layout": uses LayoutRecognizer to detect layout components.
- "tsr": uses TableStructureRecognizer and OCR to detect and interpret table structures.
For TSR, calls get_table_html() to generate HTML representation of tables.
Draws bounding boxes on images using draw_box.
Saves annotated images and, for TSR, HTML files.
Logs output save locations.
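
The control flow above can be sketched without the deepdoc models. This is a minimal illustration, not the script's actual code: `run_pipeline`, `fake_recognizer`, and the result dictionary layout are hypothetical stand-ins, and the images are represented by file names rather than loaded pages.

```python
from typing import Callable, Dict, List, Optional

Detections = List[List[dict]]  # one list of detection dicts per image


def run_pipeline(
    mode: str,
    images: List[str],
    threshold: float,
    recognizers: Dict[str, Callable[[List[str], float], Detections]],
    to_html: Optional[Callable[[str, List[dict]], str]] = None,
) -> List[dict]:
    """Dispatch on mode, then collect one result record per image."""
    mode = mode.lower()
    if mode not in recognizers:
        raise ValueError(f"unknown mode: {mode!r}")
    detections = recognizers[mode](images, float(threshold))

    results = []
    for image, dets in zip(images, detections):
        # Only TSR mode produces an HTML table next to the annotated image.
        html = to_html(image, dets) if mode == "tsr" and to_html else None
        results.append({"image": image, "detections": dets, "html": html})
    return results


def fake_recognizer(images, thr):
    """Hypothetical stand-in for a recognizer: fixed candidates filtered by thr."""
    candidates = [
        {"label": "table row", "score": 0.9},
        {"label": "table row", "score": 0.2},
    ]
    return [[d for d in candidates if d["score"] >= thr] for _ in images]


results = run_pipeline(
    "tsr",
    ["page-1.png"],
    0.5,
    {"layout": fake_recognizer, "tsr": fake_recognizer},
    to_html=lambda image, dets: f"<table><!-- {len(dets)} components --></table>",
)
print(results[0]["html"])
```

In the real script the recognizers would be LayoutRecognizer and TableStructureRecognizer, and the HTML callback would be get_table_html with an OCR instance bound to it.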

Usage Example:

python t_recognizer.py --inputs ./docs/sample.pdf --output_dir ./results --mode tsr --threshold 0.5

2. `get_table_html(img, tb_cpns, ocr)`

Purpose:
Generates an HTML representation of a table detected within an image using OCR text extraction and component layout analysis.

Parameters:

img: PIL Image object of the source document page.
tb_cpns: List of detected table components (dicts with bounding box and label info).
ocr: A callable OCR instance that takes the image as a NumPy array and returns the detected text boxes.

Returns:

html: A string containing a complete HTML document representing the table structure.

Implementation Details:

Performs OCR on the image to get text bounding boxes.
Converts OCR output into a structured list of boxes sorted primarily by Y coordinate.
Defines a helper function gather() that filters and cleans up layout elements matching keywords (headers, rows, spans, columns).
Annotates each OCR box with indices pointing to overlapping or closest table components (rows, headers, columns, spanning cells) using layout utilities like find_overlapped_with_threshold and find_horizontally_tightest_fit.
Uses TableStructureRecognizer.construct_table() to convert the annotated boxes into an HTML table string.
Wraps the table HTML inside a styled HTML document for presentation.
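
The first two steps, turning OCR output into box records and ordering them top to bottom, might look like the sketch below. It assumes each OCR result is a `(quad, (text, score))` pair with four corner points, which is an assumption about the OCR output format, and `sort_y_first` is a simplified stand-in for `LayoutRecognizer.sort_Y_firstly`, not its actual implementation. The tolerance of one third of the mean box height is an illustrative choice.

```python
from statistics import mean


def to_boxes(ocr_result):
    """Flatten (quad, (text, score)) pairs into axis-aligned box dicts."""
    boxes = []
    for quad, (text, _score) in ocr_result:
        xs = [p[0] for p in quad]
        ys = [p[1] for p in quad]
        boxes.append({"x0": min(xs), "x1": max(xs),
                      "top": min(ys), "bottom": max(ys), "text": text})
    return boxes


def sort_y_first(boxes, tolerance):
    """Order boxes top to bottom, then left to right within a line.

    A box joins the current line when its top is less than `tolerance`
    below the top of the first box on that line.
    """
    lines = []
    for box in sorted(boxes, key=lambda b: b["top"]):
        if lines and box["top"] - lines[-1][0]["top"] < tolerance:
            lines[-1].append(box)
        else:
            lines.append([box])
    return [b for line in lines for b in sorted(line, key=lambda b: b["x0"])]


# Hypothetical OCR output for a two-row table; "Price" sits slightly lower
# than "Item" but belongs to the same visual line.
ocr_result = [
    ([(120, 12), (180, 12), (180, 30), (120, 30)], ("Price", 0.98)),
    ([(10, 10), (60, 10), (60, 28), (10, 28)], ("Item", 0.99)),
    ([(10, 50), (60, 50), (60, 68), (10, 68)], ("Pen", 0.97)),
]
boxes = to_boxes(ocr_result)
tolerance = mean(b["bottom"] - b["top"] for b in boxes) / 3
ordered = [b["text"] for b in sort_y_first(boxes, tolerance)]
print(ordered)
```

The tolerance matters because OCR boxes on the same text line rarely share an identical top coordinate; a strict sort by Y would interleave cells from neighbouring columns.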

Usage Context:
Called internally during TSR mode processing to produce user-readable table outputs.
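
The annotation step inside get_table_html, which ties each OCR box to a row and a column, can be illustrated as follows. This is a sketch under stated assumptions: `best_row` and `tightest_column` are simplified stand-ins for `find_overlapped_with_threshold` and `find_horizontally_tightest_fit`, the overlap measure and the 0.3 threshold are illustrative, and the `row` and `col` keys are hypothetical names.

```python
def vertical_overlap_ratio(box, region):
    """Fraction of the box's height that lies inside the region."""
    overlap = min(box["bottom"], region["bottom"]) - max(box["top"], region["top"])
    height = box["bottom"] - box["top"]
    return max(overlap, 0) / height if height > 0 else 0.0


def best_row(box, rows, thr=0.3):
    """Index of the row covering the largest share of the box,
    or None when no row reaches the threshold."""
    scored = [(vertical_overlap_ratio(box, r), i) for i, r in enumerate(rows)]
    ratio, index = max(scored, default=(0.0, None))
    return index if ratio >= thr else None


def tightest_column(box, columns):
    """Index of the column whose left and right edges are closest to the box's."""
    if not columns:
        return None

    def gap(col):
        return abs(col["x0"] - box["x0"]) + abs(col["x1"] - box["x1"])

    return min(range(len(columns)), key=lambda i: gap(columns[i]))


rows = [{"top": 0, "bottom": 30}, {"top": 30, "bottom": 60}]
columns = [{"x0": 0, "x1": 100}, {"x0": 100, "x1": 200}]
box = {"x0": 110, "x1": 170, "top": 34, "bottom": 52, "text": "4.50"}

# Attach the matched indices to the OCR box, as the annotation step does.
annotated = dict(box, row=best_row(box, rows), col=tightest_column(box, columns))
print(annotated)
```

A box that overlaps no row by at least the threshold is left unassigned (`None`) rather than forced into the nearest row, so stray text outside the table does not distort the reconstructed grid.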

Important Implementation Details

Dynamic Path Insertion:
The script modifies sys.path to ensure the deepdoc package (assumed to reside two directories above this file) is importable.
Threshold Usage:
The --threshold value is passed to the selected recognizer, which discards detections whose confidence falls below it.
Layout Sorting and Cleanup:
The LayoutRecognizer.sort_Y_firstly method sorts layout elements by vertical position to preserve reading order.
layouts_cleanup removes redundant or overlapping detections.
Box Overlap and Fit Analysis:
find_overlapped_with_threshold and find_horizontally_tightest_fit identify which OCR text boxes belong to which table components, based on spatial overlap and horizontal fit.
HTML Styling:
The generated HTML table uses embedded CSS for styling, including alternating row colors, hover effects, and header formatting.
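
As an illustration of the HTML styling described above, the sketch below wraps a table fragment in a page with embedded CSS. The template, colors, and the `rows_to_table` and `wrap_table` helpers are hypothetical; the script's actual stylesheet and the output of `construct_table` may differ.

```python
from html import escape

# Doubled braces are literal braces for str.format; {table} is the only field.
PAGE_TEMPLATE = """<html>
<head>
<style>
  table {{ border-collapse: collapse; }}
  th, td {{ border: 1px solid #ddd; padding: 6px 10px; }}
  th {{ background-color: #4caf50; color: white; }}
  tr:nth-child(even) {{ background-color: #f2f2f2; }}
  tr:hover {{ background-color: #ddd; }}
</style>
</head>
<body>
{table}
</body>
</html>
"""


def rows_to_table(rows):
    """Render a list of rows as a <table>; the first row becomes the header."""
    header, *body = rows
    head = "".join(f"<th>{escape(c)}</th>" for c in header)
    lines = [f"<tr>{head}</tr>"]
    for row in body:
        cells = "".join(f"<td>{escape(c)}</td>" for c in row)
        lines.append(f"<tr>{cells}</tr>")
    return "<table>\n" + "\n".join(lines) + "\n</table>"


def wrap_table(table_html):
    """Place a table fragment inside the styled page template."""
    return PAGE_TEMPLATE.format(table=table_html)


page = wrap_table(rows_to_table([["Item", "Price"], ["Pen & ink", "4.50"]]))
print(page)
```

Escaping the cell text keeps OCR output that contains characters such as `&` or `<` from breaking the markup.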

Interaction with Other System Components

deepdoc.vision Package:
- LayoutRecognizer, TableStructureRecognizer, and OCR are the core model classes used for layout recognition, table structure recognition, and text extraction.
- init_in_out handles input/output file preparations.
- draw_box visualizes detection results by drawing bounding boxes on images.
Command-Line Interface:
The script relies on argparse for user parameter input.
File System:
Reads inputs from user-specified directories or files, writes annotated images and HTML files to output directories.

Execution Flow Diagram

flowchart TD
    A[Start: Parse CLI Args] --> B[init_in_out: Load images and outputs]
    B --> C{Mode?}
    C -->|layout| D[Init LayoutRecognizer]
    C -->|tsr| E["Init TableStructureRecognizer and OCR"]
    D --> F["Run layout.forward(images, threshold)"]
    E --> G["Run tsr(images, threshold)"]
    G --> H[get_table_html for each image]
    F --> I[Draw bounding boxes on images]
    H --> I
    I --> J[Save annotated images]
    H --> K["Save HTML files (TSR mode only)"]
    J --> L[Log output paths]
    K --> L
    L --> M[End]

Class Diagram of Key Imported Models (Conceptual)

classDiagram
    class LayoutRecognizer {
        +forward(images, thr)
        +sort_Y_firstly(layouts, fuzzy)
        +layouts_cleanup(boxes, layouts, margin, portion)
        +find_overlapped_with_threshold(box, layouts, thr)
        +find_horizontally_tightest_fit(box, layouts)
        +labels
    }

    class TableStructureRecognizer {
        +__call__(images, thr)
        +construct_table(boxes, html)
    }

    class OCR {
        +__call__(image_array)
    }

    `t_recognizer.py` --> LayoutRecognizer
    `t_recognizer.py` --> TableStructureRecognizer
    `t_recognizer.py` --> OCR

Summary

t_recognizer.py is a utility script in the InfiniFlow project for document image analysis, covering layout recognition and table structure recognition. It runs an end-to-end pipeline: loading input images, running detection, extracting text with OCR in TSR mode, drawing the detections, and writing the outputs, including HTML tables. The script is designed to be extensible and relies heavily on the deepdoc library’s vision models and utilities.

Appendix: Command Line Arguments

| Argument | Description | Default | Required |
|---|---|---|---|
| `--inputs` | Input file or directory path containing images or PDFs | None | Yes |
| `--output_dir` | Directory to save output images and HTML | `./layouts_outputs` | No |
| `--threshold` | Confidence threshold for filtering detections | 0.5 | No |
| `--mode` | Task mode: `"layout"` for layout recognition or `"tsr"` for tables | `"layout"` | No |
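
An argparse parser matching the arguments listed above could be declared as in the sketch below. The flag names and defaults come from the table; `type=float`, the `choices` restriction, the help strings, and the `build_parser` helper are assumptions added for illustration and may not match the script exactly.

```python
import argparse


def build_parser():
    """Build a parser for the four documented command-line arguments."""
    parser = argparse.ArgumentParser(
        description="Layout and table structure recognition")
    parser.add_argument("--inputs", required=True,
                        help="Input file or directory containing images or PDFs")
    parser.add_argument("--output_dir", default="./layouts_outputs",
                        help="Directory to save output images and HTML")
    # type=float is an assumption; it converts the value at parse time.
    parser.add_argument("--threshold", type=float, default=0.5,
                        help="Confidence threshold for filtering detections")
    # choices is an assumption; it rejects any mode other than these two.
    parser.add_argument("--mode", choices=["layout", "tsr"], default="layout",
                        help="layout recognition or table structure recognition")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["--inputs", "./sample_docs", "--mode", "tsr", "--threshold", "0.6"])
print(args)
```

Passing an explicit list to `parse_args` makes the parser easy to exercise in tests without touching the real command line.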

Example Usage

python t_recognizer.py --inputs ./sample_docs --output_dir ./output --mode tsr --threshold 0.6

This command processes all images/PDFs in ./sample_docs with table structure recognition at a 0.6 confidence threshold, saving annotated images and HTML files to ./output.