t_recognizer.py
Overview
t_recognizer.py is a command-line tool designed to perform document image analysis focusing on two key tasks:
Layout Recognition: Identifying and classifying regions or components within document images, such as paragraphs, titles, figures, and tables.
Table Structure Recognition (TSR): Specifically detecting table components (columns, headers, rows, spanning cells) and reconstructing table structures from images.
The script leverages pre-trained deep learning models from the deepdoc library for layout and table structure recognition, and integrates Optical Character Recognition (OCR) to extract textual content for reconstructing tables as HTML.
It supports batch processing of images or PDFs from input directories or individual files, outputs visualized results with bounding boxes, and for the TSR mode, generates corresponding HTML files representing the detected table layouts.
Detailed Description
Main Functionalities
Initialize inputs and outputs using init_in_out from deepdoc.vision.
Run either Layout Recognition (LayoutRecognizer) or Table Structure Recognition (TableStructureRecognizer) based on the user’s mode choice.
For TSR mode, run OCR on detected components to extract text and build an HTML table.
Save annotated images with bounding boxes drawn around detected regions.
For TSR mode, save the reconstructed tables as .html files.
Classes and Functions
1. main(args)
Purpose:
Entry function that processes input images, runs the selected recognition mode, then visualizes and saves the results.
Parameters:
args: Parsed command-line arguments containing:
inputs: path to input images or PDFs
output_dir: directory to save outputs
threshold: detection confidence threshold
mode: "layout" or "tsr"
Returns:
None (outputs files to disk)
Functionality:
Calls init_in_out(args) to load images and prepare output paths.
Depending on mode:
"layout": uses LayoutRecognizer to detect layout components.
"tsr": uses TableStructureRecognizer and OCR to detect and interpret table structures.
For TSR, calls get_table_html() to generate HTML representation of tables.
Draws bounding boxes on images using draw_box.
Saves annotated images and, for TSR, HTML files.
Logs output save locations.
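The mode dispatch described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the script's actual code: StubLayoutRecognizer and StubTableStructureRecognizer are hypothetical stand-ins for the deepdoc classes (which need model weights to run), the detection dicts are made up, and run() is an assumed helper name. Only the calling conventions, forward(images, thr) for layout and a direct call for TSR, follow the diagrams later in this document.

```python
from types import SimpleNamespace


class StubLayoutRecognizer:
    """Hypothetical stand-in for deepdoc's LayoutRecognizer."""

    def forward(self, images, thr):
        # Pretend every image contains a single "title" region.
        return [[{"label": "title", "score": 0.9}] for _ in images]


class StubTableStructureRecognizer:
    """Hypothetical stand-in for deepdoc's TableStructureRecognizer."""

    def __call__(self, images, thr):
        # Pretend every image contains a single table row.
        return [[{"label": "table row", "score": 0.8}] for _ in images]


def run(args, images):
    """Pick a recognizer from args.mode and return per-image detections."""
    mode = args.mode.lower()
    thr = float(args.threshold)
    if mode == "layout":
        detections = StubLayoutRecognizer().forward(images, thr)
    elif mode == "tsr":
        detections = StubTableStructureRecognizer()(images, thr)
    else:
        raise ValueError(f"unknown mode: {args.mode!r}")
    # Assumption: low-confidence detections are dropped at this point.
    return [[d for d in page if d["score"] >= thr] for page in detections]


args = SimpleNamespace(mode="tsr", threshold=0.5)
result = run(args, images=["page-1", "page-2"])
```

In the real script the TSR branch would additionally run OCR and write an HTML file per image; that step is left out here to keep the dispatch itself visible.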
Usage Example:
python t_recognizer.py --inputs ./docs/sample.pdf --output_dir ./results --mode tsr --threshold 0.5
2. get_table_html(img, tb_cpns, ocr)
Purpose:
Generates an HTML representation of a table detected within an image using OCR text extraction and component layout analysis.
Parameters:
img: PIL Image object of the source document page.
tb_cpns: List of detected table components (dicts with bounding box and label info).
ocr: An OCR callable instance that takes a numpy array image and returns detected text boxes.
Returns:
html: A string containing a complete HTML document representing the table structure.
Implementation Details:
Performs OCR on the image to get text bounding boxes.
Converts OCR output into a structured list of boxes sorted primarily by Y coordinate.
Defines a helper function gather() that filters and cleans up layout elements matching keywords (headers, rows, spans, columns).
Annotates each OCR box with indices pointing to overlapping or closest table components (rows, headers, columns, spanning cells) using layout utilities like find_overlapped_with_threshold and find_horizontally_tightest_fit.
Uses TableStructureRecognizer.construct_table() to convert the annotated boxes into an HTML table string.
Wraps the table HTML inside a styled HTML document for presentation.
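The annotation step can be illustrated with a small overlap-based matcher. This is a sketch of the general idea only: the helper names overlap_ratio, best_overlap_index, and annotate, the "row"/"col" keys, the x0/x1/top/bottom box layout, and the 0.3 threshold are all assumptions made for this example, and deepdoc's find_overlapped_with_threshold may behave differently.

```python
def overlap_ratio(box, region):
    """Fraction of the box's area that lies inside the region."""
    w = min(box["x1"], region["x1"]) - max(box["x0"], region["x0"])
    h = min(box["bottom"], region["bottom"]) - max(box["top"], region["top"])
    if w <= 0 or h <= 0:
        return 0.0
    area = (box["x1"] - box["x0"]) * (box["bottom"] - box["top"])
    return (w * h) / area if area > 0 else 0.0


def best_overlap_index(box, regions, thr=0.3):
    """Index of the region covering most of the box, or None below thr."""
    best, best_ratio = None, 0.0
    for i, region in enumerate(regions):
        ratio = overlap_ratio(box, region)
        if ratio >= thr and ratio > best_ratio:
            best, best_ratio = i, ratio
    return best


def annotate(boxes, rows, cols, thr=0.3):
    """Attach row and column indices to each OCR box (None if unmatched)."""
    for b in boxes:
        b["row"] = best_overlap_index(b, rows, thr)
        b["col"] = best_overlap_index(b, cols, thr)
    return boxes


rows = [
    {"x0": 0, "x1": 100, "top": 0, "bottom": 20},
    {"x0": 0, "x1": 100, "top": 20, "bottom": 40},
]
cols = [
    {"x0": 0, "x1": 50, "top": 0, "bottom": 40},
    {"x0": 50, "x1": 100, "top": 0, "bottom": 40},
]
ocr_boxes = [
    {"text": "Name", "x0": 5, "x1": 45, "top": 2, "bottom": 18},
    {"text": "42", "x0": 55, "x1": 95, "top": 22, "bottom": 38},
    {"text": "stray", "x0": 200, "x1": 240, "top": 100, "bottom": 110},
]
annotated = annotate(ocr_boxes, rows, cols)
```

Once every text box carries a row and column index, a table builder can group boxes into cells; boxes with None indices fall outside the detected structure.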
Usage Context:
Called internally during TSR mode processing to produce user-readable table outputs.
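The final wrapping step, placing the table markup inside a styled HTML document, might look like the sketch below. The function name wrap_table_html, the TABLE_CSS constant, and every CSS value are assumptions chosen for illustration; the script's real stylesheet is not reproduced here.

```python
import html

# Hypothetical stylesheet: header formatting, alternating row colors, hover.
TABLE_CSS = """
table { border-collapse: collapse; width: 100%; }
th, td { border: 1px solid #ccc; padding: 6px 10px; }
th { background-color: #f2f2f2; font-weight: bold; }
tr:nth-child(even) { background-color: #fafafa; }
tr:hover { background-color: #e8f0fe; }
"""


def wrap_table_html(table_html, title="Table"):
    """Embed a <table> fragment in a complete, styled HTML document."""
    return (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        '<meta charset="utf-8">\n'
        f"<title>{html.escape(title)}</title>\n"
        f"<style>{TABLE_CSS}</style>\n"
        "</head>\n<body>\n"
        f"{table_html}\n"
        "</body>\n</html>\n"
    )


page = wrap_table_html(
    "<table><tr><th>Name</th></tr><tr><td>Ada</td></tr></table>"
)
```

Keeping the CSS in one constant makes the table fragment itself independent of presentation, so the same fragment could be embedded elsewhere without the wrapper.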
Important Implementation Details
Dynamic Path Insertion:
The script modifies sys.path to ensure the deepdoc package (assumed to reside two directories above this file) is importable.
Threshold Usage:
The detection confidence threshold is used to filter out low-confidence layout components.
Layout Sorting and Cleanup:
The LayoutRecognizer.sort_Y_firstly method sorts layout elements by vertical position to preserve reading order. layouts_cleanup removes redundant or overlapping detections.
Box Overlap and Fit Analysis:
Several helper functions identify which OCR text boxes correspond to which table components based on spatial overlap and proximity thresholds.
HTML Styling:
The generated HTML table uses embedded CSS for styling, including alternating row colors, hover effects, and header formatting.
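To make the "Layout Sorting and Cleanup" note more concrete, here is one way a Y-first sort with a vertical tolerance could work. It is a simplified sketch, not the sort_Y_firstly implementation: the function name sort_y_first, the line-grouping strategy, the box keys, and the default tolerance of 5 pixels are assumptions for this example.

```python
def sort_y_first(boxes, fuzzy=5):
    """Order boxes top-to-bottom, then left-to-right within a line.

    Boxes whose "top" is within `fuzzy` pixels of the first box on the
    current line are treated as belonging to that line.
    """
    ordered = sorted(boxes, key=lambda b: (b["top"], b["x0"]))
    lines = []
    for box in ordered:
        if lines and abs(box["top"] - lines[-1][0]["top"]) <= fuzzy:
            lines[-1].append(box)
        else:
            lines.append([box])
    # Flatten, reading each line from left to right.
    return [b for line in lines for b in sorted(line, key=lambda b: b["x0"])]


boxes = [
    {"text": "A", "top": 10, "x0": 100},
    {"text": "B", "top": 12, "x0": 10},
    {"text": "C", "top": 40, "x0": 5},
]
reading_order = [b["text"] for b in sort_y_first(boxes)]
```

Without the tolerance, box A would come before B because its top edge is two pixels higher, even though B sits to its left on the same visual line.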
Interaction with Other System Components
deepdoc.vision Package:
LayoutRecognizer, TableStructureRecognizer, and OCR are core model classes used for layout and text extraction tasks. init_in_out handles input/output file preparation. draw_box visualizes detection results by drawing bounding boxes on images.
Command-Line Interface:
The script relies on argparse for user parameter input.
File System:
Reads inputs from user-specified directories or files and writes annotated images and HTML files to output directories.
Execution Flow Diagram
flowchart TD
A[Start: Parse CLI Args] --> B[init_in_out: Load images and outputs]
B --> C{Mode?}
C -->|layout| D[Init LayoutRecognizer]
C -->|tsr| E[Init TableStructureRecognizer & OCR]
D --> F["Run layout.forward(images, threshold)"]
E --> G["Run tsr(images, threshold)"]
G --> H[get_table_html for each image]
F --> I[Draw bounding boxes on images]
H --> I
I --> J[Save annotated images]
H --> K["Save HTML files (TSR mode only)"]
J --> L[Log output paths]
K --> L
L --> M[End]
Class Diagram of Key Imported Models (Conceptual)
classDiagram
class LayoutRecognizer {
+forward(images, thr)
+sort_Y_firstly(layouts, fuzzy)
+layouts_cleanup(boxes, layouts, margin, portion)
+find_overlapped_with_threshold(box, layouts, thr)
+find_horizontally_tightest_fit(box, layouts)
+labels
}
class TableStructureRecognizer {
+__call__(images, thr)
+construct_table(boxes, html)
}
class OCR {
+__call__(image_array)
}
class t_recognizer["t_recognizer.py"]
t_recognizer --> LayoutRecognizer
t_recognizer --> TableStructureRecognizer
t_recognizer --> OCR
Summary
t_recognizer.py is a specialized utility script in the InfiniFlow project for document image analysis, focusing on layout and table structure recognition. It provides an end-to-end pipeline covering input loading, detection, OCR text extraction, visualization, and output generation, including HTML tables. The script is designed to be extensible and relies heavily on the deepdoc library’s vision models and utilities.
Appendix: Command Line Arguments
| Argument | Description | Default | Required |
|---|---|---|---|
| --inputs | Input file or directory path containing images or PDFs | None | Yes |
| --output_dir | Directory to save output images and HTML | | No |
| --threshold | Confidence threshold for filtering detections | 0.5 | No |
| --mode | Task mode: "layout" or "tsr" | | No |
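A parser matching the table above could be declared with argparse roughly as follows. This is a sketch rather than the script's own parser: build_parser is an assumed helper name, the help strings are paraphrased, and the defaults that the table does not show (for --output_dir and --mode) are deliberately left as None instead of being guessed.

```python
import argparse


def build_parser():
    """Build a CLI parser mirroring the documented arguments."""
    parser = argparse.ArgumentParser(
        description="Layout and table structure recognition (sketch)."
    )
    parser.add_argument(
        "--inputs", required=True,
        help="Input file or directory containing images or PDFs",
    )
    # Default not shown in the table above, so it is left unset here.
    parser.add_argument(
        "--output_dir", default=None,
        help="Directory to save output images and HTML",
    )
    parser.add_argument(
        "--threshold", type=float, default=0.5,
        help="Confidence threshold for filtering detections",
    )
    # Default not shown in the table above, so it is left unset here.
    parser.add_argument(
        "--mode", choices=["layout", "tsr"], default=None,
        help='Task mode: "layout" or "tsr"',
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["--inputs", "./sample_docs", "--output_dir", "./output",
     "--mode", "tsr", "--threshold", "0.6"]
)
```

Using choices for --mode lets argparse reject unsupported modes before any model is loaded.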
Example Usage
python t_recognizer.py --inputs ./sample_docs --output_dir ./output --mode tsr --threshold 0.6
This command processes all images/PDFs in ./sample_docs with table structure recognition at a 0.6 confidence threshold, saving annotated images and HTML files to ./output.