table_structure_recognizer.py

Overview

The table_structure_recognizer.py file defines a specialized class, TableStructureRecognizer, designed to identify and reconstruct table structures from images or detected layout blocks. This recognizer extends a generic Recognizer class and focuses on detecting various table-related components such as table cells, rows, columns, headers, and spanning cells. It processes spatial and textual information from layout blocks to organize detected elements into structured tables, optionally outputting them as HTML or descriptive text.

This module is part of the InfiniFlow project. It uses the rag_tokenizer utilities from the rag.nlp package (RAG stands for Retrieval-Augmented Generation) for text analysis and the Hugging Face Hub for obtaining model files. Its primary goal is to support downstream tasks such as table extraction, understanding, and conversion from document images or layout data.

Class: TableStructureRecognizer

Description

TableStructureRecognizer extends the base Recognizer class to specifically target table-related entities in document images or layouts. It supports:

Detection of table structure labels such as rows, columns, headers, and spanning cells.
Alignment and normalization of bounding boxes for rows and columns.
Construction of table data structures from detected blocks.
Output of reconstructed tables as HTML tables or as descriptive text formats.
Identification and processing of captions.
Determination of block types based on text patterns and tokenization.
Handling of multi-page tables and complex spanning cell arrangements.

Class Variable

labels: List of string labels representing different table structure components recognized by this class.

labels = [
    "table",
    "table column",
    "table row",
    "table column header",
    "table projected row header",
    "table spanning cell",
]

Methods

`__init__(self)`

Description:
Initializes the recognizer by invoking the parent Recognizer constructor with table-specific labels and a local model directory. If loading the local model fails, it downloads the model snapshot from Huggingface Hub and initializes again.

Parameters:
None.

Returns:
None.

Usage example:

tsr = TableStructureRecognizer()
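
The local-first loading with a download fallback described above can be sketched as follows. This is a minimal illustration, not the module's code: `load_with_fallback`, `load_local`, and `download` are hypothetical names, and in the real class the download step would go through the Hugging Face Hub.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def load_with_fallback(load_local: Callable[[str], T],
                       download: Callable[[], str],
                       local_dir: str) -> T:
    """Try the local model directory first; on failure, download and retry.

    Both callables are hypothetical stand-ins used only for illustration.
    """
    try:
        return load_local(local_dir)
    except Exception:
        # Local load failed: fetch a snapshot and load from where it landed.
        fetched_dir = download()
        return load_local(fetched_dir)
```

Passing the two steps in as callables keeps the fallback logic testable without network access.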

`__call__(self, images, thr=0.2)`

Description:
Processes images to detect table structures, normalizes bounding boxes for rows and columns by aligning left/right or top/bottom edges, and returns structured detection results.

Parameters:

images (list): A list of image objects or image data to be processed.
thr (float, optional): Confidence threshold for detection filtering. Default is 0.2.

Returns:

res (list): A list of tables, each table represented as a list of bounding boxes with normalized coordinates and labels.

Usage example:

results = tsr([image1, image2], thr=0.3)
for table in results:
    print(table)

Implementation details:

Calls the base class __call__ method to obtain the initial bounding box detections.
For each detected table, aligns the bounding boxes of rows and headers horizontally (left and right edges).
Aligns column bounding boxes vertically (top and bottom edges).
Uses means or medians depending on the number of elements to determine alignment boundaries.
Returns a refined list of bounding boxes with adjusted coordinates.
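
The horizontal alignment step can be sketched as below. The box keys (`label`, `x0`, `x1`), the function name, and the `min_for_mean` cutoff are assumptions made for illustration; the module's actual statistic and threshold may differ.

```python
from statistics import mean, median


def align_row_edges(boxes, min_for_mean=5):
    """Snap the left/right edges of row and header boxes to shared values.

    `min_for_mean` is an assumed cutoff, not a value taken from the module.
    """
    rows = [b for b in boxes if "row" in b["label"] or "header" in b["label"]]
    if not rows:
        return boxes
    # With few samples the median resists a single bad detection;
    # with more samples the mean is used instead.
    pick = mean if len(rows) >= min_for_mean else median
    left = pick(b["x0"] for b in rows)
    right = pick(b["x1"] for b in rows)
    for b in rows:
        b["x0"], b["x1"] = left, right
    return boxes
```

Column boxes would be treated the same way on their top and bottom edges.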

`is_caption(bx)`

Description:
Static method to detect if a bounding box block represents a table caption based on regex patterns or layout type.

Parameters:

bx (dict): A bounding box dictionary containing at least a "text" key and optionally "layout_type".

Returns:

bool: True if the block is identified as a caption, False otherwise.

Usage example:

caption_detected = TableStructureRecognizer.is_caption(block)

Details:
Uses regex patterns targeting Chinese or English caption formats and also checks the "layout_type" field for the substring "caption".
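
A caption check along these lines might look like the following sketch. The patterns and the name `looks_like_caption` are illustrative assumptions; the expressions used by the module itself are not reproduced here.

```python
import re

# Illustrative patterns only; the module's actual expressions may differ.
CAPTION_PATTERNS = [
    r"^(Table|Figure|Fig\.)\s*\d+",  # English, e.g. "Table 3: Results"
    r"^[图表]\s*\d+",                 # Chinese, e.g. "表 2 实验结果"
]


def looks_like_caption(bx):
    """Return True if the block text or its layout type suggests a caption."""
    text = bx.get("text", "").strip()
    if any(re.match(p, text) for p in CAPTION_PATTERNS):
        return True
    # Fall back to the layout classifier's verdict, when present.
    return "caption" in bx.get("layout_type", "")
```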

`blockType(b)`

Description:
Static method that classifies the textual content of a block into a simplified block type code based on regex matching and tokenization.

Parameters:

b (dict): A block dictionary with at least a "text" string.

Returns:

str: A block type code such as "Dt" (Date), "Nu" (Number), "Ca" (Capitalized), "En" (English text), "NE" (Mixed text), "Sg" (Single character), "Tx" (Short text), "Lx" (Long text), "Nr" (Named entity), or "Ot" (Other).

Usage example:

block_type = TableStructureRecognizer.blockType(block)

Details:
Uses a sequence of regex patterns to categorize text, falling back to tokenization and POS tagging via rag_tokenizer for more nuanced distinction.
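
The ordered-rules idea can be illustrated with the reduced sketch below. It covers only some of the codes listed above, the patterns are assumptions, and whitespace splitting stands in for rag_tokenizer, so it should not be read as the module's classifier.

```python
import re

# Ordered (pattern, code) pairs; the first full match wins.
# Patterns are illustrative, not copied from the module.
RULES = [
    (r"\d{4}[-/.]\d{1,2}([-/.]\d{1,2})?", "Dt"),  # date such as 2021-03-15
    (r"[+-]?[\d,]+(\.\d+)?%?", "Nu"),              # number, optional percent
    (r"[A-Z][A-Z0-9 ]*", "Ca"),                    # upper-case code or acronym
    (r"[A-Za-z][A-Za-z ,.'-]*", "En"),             # plain English text
]


def block_type(text, long_text_tokens=12):
    """Classify text into a coarse type code (hypothetical helper)."""
    text = text.strip()
    for pattern, code in RULES:
        if re.fullmatch(pattern, text):
            return code
    if len(text) == 1:
        return "Sg"
    # Whitespace splitting stands in for rag_tokenizer here.
    return "Lx" if len(text.split()) > long_text_tokens else "Tx"
```

Rule order matters: the date pattern is tried before the number pattern so that a value such as 2021-03-15 is not partially consumed as digits.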

`construct_table(boxes, is_english=False, html=True, **kwargs)`

Description:
Constructs a structured table representation from a list of bounding boxes representing detected table cells and related elements. It removes caption blocks, identifies block types, sorts blocks spatially, organizes them into rows and columns, handles spanning cells, removes singleton row/column anomalies, and outputs either an HTML table or a descriptive text list.

Parameters:

boxes (list): List of bounding box dictionaries representing table components.
is_english (bool, optional): Determines output language style for descriptions. Default is False.
html (bool, optional): If True, returns HTML string; if False, returns descriptive text list. Default is True.
**kwargs: Additional optional keyword arguments (currently unused).

Returns:

If html=True: A string containing an HTML table representation.
If html=False: A list of descriptive strings summarizing the table content.

Usage example:

html_table = TableStructureRecognizer.construct_table(detected_boxes, is_english=True)
desc_table = TableStructureRecognizer.construct_table(detected_boxes, html=False)

Implementation details:

Extracts and removes captions from the input blocks.
Assigns block types.
Sorts blocks by row (vertical) and column (horizontal) coordinates, respecting page breaks.
Groups blocks into rows and columns.
Detects rows or columns that contain only a single cell and tries to merge that cell into an adjacent row or column, removing the singleton.
Determines which rows are headers based on block types and header flags.
Calls helper methods to generate HTML or descriptive text output.
Handles complex spanning cells by calculating rowspans and colspans.
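
The sorting and row-grouping steps can be sketched as follows. The keys `top`, `bottom`, `x0`, and `page_number`, the function name, and the `tolerance` parameter are assumptions for this example; the module may rely on detected row and column indices rather than pure geometry.

```python
def group_into_rows(boxes, tolerance=5):
    """Group boxes into rows by vertical centre, then order each row by x0.

    Assumes each box has "top", "bottom", "x0" and an optional
    "page_number"; `tolerance` (in pixels) is a hypothetical parameter.
    """
    def centre(b):
        return (b["top"] + b["bottom"]) / 2

    def page(b):
        return b.get("page_number", 0)

    ordered = sorted(boxes, key=lambda b: (page(b), centre(b)))
    rows = []
    for b in ordered:
        prev = rows[-1][-1] if rows else None
        # A page break always starts a new row.
        if (prev is not None and page(prev) == page(b)
                and abs(centre(prev) - centre(b)) <= tolerance):
            rows[-1].append(b)
        else:
            rows.append([b])
    return [sorted(r, key=lambda b: b["x0"]) for r in rows]
```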

`__html_table(cap, hdset, tbl)`

Description:
Private static method that generates an HTML table string from the structured table data including captions and header row information.

Parameters:

cap (str): Caption text of the table.
hdset (set): Set of row indices that are headers.
tbl (list of lists): 2D list representing rows and columns, each cell is a list of blocks with text and span info.

Returns:

str: HTML string representing the table.
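
As a rough illustration of this kind of rendering, the sketch below builds an HTML string from a simplified table. The cell shape (`text`, `rowspan`, `colspan`, with `None` for covered positions) and the name `render_html` are assumptions; the private method's real data layout is richer.

```python
from html import escape


def render_html(caption, header_rows, table):
    """Render rows of cell dicts as an HTML table string.

    Each cell is assumed to be {"text": str, "rowspan": int, "colspan": int};
    None marks a grid position covered by a neighbouring span.
    """
    parts = ["<table>"]
    if caption:
        parts.append(f"<caption>{escape(caption)}</caption>")
    for i, row in enumerate(table):
        tag = "th" if i in header_rows else "td"
        cells = []
        for cell in row:
            if cell is None:
                continue  # covered by a span, emit nothing
            attrs = "".join(
                f' {name}="{cell[name]}"'
                for name in ("rowspan", "colspan")
                if cell.get(name, 1) > 1
            )
            cells.append(f"<{tag}{attrs}>{escape(cell['text'])}</{tag}>")
        parts.append("<tr>" + "".join(cells) + "</tr>")
    parts.append("</table>")
    return "".join(parts)
```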

`__desc_table(cap, hdr_rowno, tbl, is_english)`

Description:
Private static method that generates a descriptive text representation of the table suitable for text-based processing or display.

Parameters:

cap (str): Caption text.
hdr_rowno (set): Set of header row indices.
tbl (list of lists): Table data structure as rows and columns of blocks.
is_english (bool): Whether to phrase the descriptions in English rather than the default language style.

Returns:

list of strings: Each string describes a table row in a human-readable format.
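
One plausible shape for such a description is a list of "header: value" pairs per data row, as in the sketch below. The function name, the plain-string cells, and the separators chosen for each language style are assumptions, not the module's exact output format.

```python
def describe_rows(caption, header_rows, table, is_english=True):
    """Turn each data row into "header: value" pairs joined into one string.

    `table` is assumed to be a rectangular list of rows of plain strings;
    the separators chosen for each language style are illustrative.
    """
    if not table:
        return []
    sep, pair = ("; ", ": ") if is_english else ("；", "：")
    # Stack the text of all header rows to form one label per column.
    headers = [
        " ".join(table[r][c] for r in sorted(header_rows) if table[r][c])
        for c in range(len(table[0]))
    ]
    lines = []
    for i, row in enumerate(table):
        if i in header_rows:
            continue
        pairs = [f"{h}{pair}{v}" if h else v for h, v in zip(headers, row) if v]
        line = sep.join(pairs)
        if caption:
            line += f" ({caption})" if is_english else f"（{caption}）"
        lines.append(line)
    return lines
```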

`__cal_spans(boxes, rows, cols, tbl, html=True)`

Description:
Private static method to calculate and assign rowspan and colspan attributes for cells that span multiple rows or columns, updating the table data structure accordingly.

Parameters:

boxes (list): List of all blocks in the table with spatial info.
rows (list): List of rows, each a list of blocks.
cols (list): List of columns, each a list of blocks.
tbl (list of lists): The 2D table structure representation.
html (bool): If True, modifies tbl to be compatible with HTML output; otherwise for descriptive output.

Returns:

Modified tbl with spanning cells merged and span counts assigned.
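
The overlap test behind span calculation can be sketched as follows. The band representation, the key names, and the `min_overlap` threshold are assumptions made for this example rather than details taken from the module.

```python
def overlap_ratio(a0, a1, b0, b1):
    """Fraction of interval [b0, b1] covered by interval [a0, a1]."""
    inter = min(a1, b1) - max(a0, b0)
    return max(inter, 0) / (b1 - b0) if b1 > b0 else 0.0


def spans_for(cell, row_bands, col_bands, min_overlap=0.5):
    """Return (first_row, rowspan, first_col, colspan) for a spanning cell.

    `row_bands` holds (top, bottom) pairs and `col_bands` holds (x0, x1)
    pairs; `min_overlap` is an assumed threshold.
    """
    rows = [i for i, (top, bottom) in enumerate(row_bands)
            if overlap_ratio(cell["top"], cell["bottom"], top, bottom) >= min_overlap]
    cols = [j for j, (left, right) in enumerate(col_bands)
            if overlap_ratio(cell["x0"], cell["x1"], left, right) >= min_overlap]
    if not rows or not cols:
        return None
    return rows[0], len(rows), cols[0], len(cols)
```

A cell whose box covers most of two row bands and one column band would therefore get a rowspan of 2 and a colspan of 1.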

Important Implementation Details and Algorithms

Bounding Box Alignment:
The recognizer aligns bounding boxes spatially to correct minor detection inconsistencies by leveraging mean or median coordinate values based on sample size.
Block Type Classification:
Uses regex patterns and tokenization with POS tagging to classify blocks into semantic types like dates, numbers, English text, etc., which aids in header identification and table structuring.
Row and Column Sorting:
Blocks are sorted primarily by their row (R) and then column (C) coordinates, with fallback on spatial sorting by x or y coordinates. This sorting respects multi-page tables by checking page numbers.
Singleton Row/Column Removal:
To handle artifacts where a row or column contains only a single cell, the algorithm attempts to merge such cells into adjacent rows or columns based on spatial proximity and textual presence.
Header Row Identification:
Header rows are identified by the presence of header flags or by block types, with thresholds on the proportion of header cells per row.
Span Calculation:
Spanning cells are detected by checking overlaps in horizontal and vertical directions, and cells are merged with appropriate rowspan and colspan counts calculated.
Output Formats:
Supports HTML table generation with proper <th> and <td> tags and span attributes, or descriptive text output for accessibility or alternative uses.
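
The proportion threshold mentioned under Header Row Identification can be illustrated with the sketch below. The `is_header` key, the function name, and the `min_ratio` value are hypothetical; the module also consults block types, which this sketch leaves out.

```python
def header_row_indices(table, min_ratio=0.5):
    """Return indices of rows where enough cells carry a header flag.

    Cells are assumed to be dicts with an optional "is_header" key
    (a hypothetical name); None marks an empty grid position.
    """
    header_rows = set()
    for i, row in enumerate(table):
        cells = [c for c in row if c is not None]
        if not cells:
            continue
        flagged = sum(1 for c in cells if c.get("is_header"))
        # A row counts as a header only if flagged cells dominate it.
        if flagged / len(cells) > min_ratio:
            header_rows.add(i)
    return header_rows
```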

Interaction with Other System Components

Recognizer Base Class:
Inherits from the generic Recognizer class that provides foundational detection and handling capabilities. The parent class likely manages model loading, inference, and basic post-processing.
rag_tokenizer:
Utilizes the rag_tokenizer module from the rag.nlp package for tokenizing text and POS tagging to assist in block type classification.
Model Resources:
Loads or downloads pretrained model data from local directories or the Huggingface Hub repository InfiniFlow/deepdoc.
File Utilities:
Uses get_project_base_directory from api.utils.file_utils to resolve resource paths.

The class is intended to be used as part of a document understanding pipeline where images or layout data are passed in, and structured table data is extracted for further processing or presentation.

Visual Diagram: Class Structure

classDiagram
    class TableStructureRecognizer {
        -labels: list
        +__init__()
        +__call__(images, thr=0.2)
        +is_caption(bx) <<static>>
        +blockType(b) <<static>>
        +construct_table(boxes, is_english=False, html=True, **kwargs) <<static>>
        -__html_table(cap, hdset, tbl) <<static>>
        -__desc_table(cap, hdr_rowno, tbl, is_english) <<static>>
        -__cal_spans(boxes, rows, cols, tbl, html=True) <<static>>
    }
    TableStructureRecognizer --|> Recognizer

Summary

The table_structure_recognizer.py module provides a recognizer for extracting and reconstructing tables from images or layout blocks. It combines spatial heuristics, text pattern analysis, and cell grouping and spanning logic to produce structured table output, either as HTML or as descriptive text, for programmatic consumption or display. The class is intended as one component of document AI workflows focused on table extraction and understanding.
