table.py


Overview

The table.py file provides functionality to parse tabular data from Excel (.xls/.xlsx) and text-based CSV or TXT files into structured data chunks suitable for downstream NLP processing. It primarily focuses on intelligently extracting headers and data rows from Excel sheets, handling complex header structures including merged cells, and determining column data types. The file integrates with the InfiniFlow knowledge base system by updating parser configurations and preparing tokenized, normalized records for indexing or further analysis.


Classes and Functions

Class: Excel

Extends ExcelParser with Excel sheet parsing that detects and handles complex (multi-level) headers and extracts row data while accounting for merged cells.

Methods


__call__(self, fnm, binary=None, from_page=0, to_page=10000000000, callback=None)

Main entry point to parse an Excel file or binary content.

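# Assumes table.py and its dependencies are importable from the working directory.
from table import Excel

excel_parser = Excel()
dataframes = excel_parser("data.xlsx", from_page=0, to_page=100, callback=print)
for df in dataframes:
    print(df.head())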

_parse_headers(self, ws, rows)

Determines whether sheet headers are simple or complex (multi-level), then delegates to appropriate parser.


_has_complex_header_structure(self, ws, rows)

Checks if the sheet has merged cells in the first two rows indicating a complex header.
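
The check described above can be sketched without a spreadsheet library. This is a minimal illustration, not the module's implementation: the function name has_merged_header_cells, the tuple layout for merged ranges, and the header_limit parameter are all assumptions made for the example.

```python
from typing import Iterable, Tuple

# Hypothetical stand-in for a worksheet's merged ranges:
# (min_row, min_col, max_row, max_col), 1-based and inclusive.
MergedRange = Tuple[int, int, int, int]


def has_merged_header_cells(
    merged_ranges: Iterable[MergedRange], header_limit: int = 2
) -> bool:
    """Return True if any merged range starts within the first `header_limit` rows."""
    return any(min_row <= header_limit for (min_row, _, _, _) in merged_ranges)
```

A range such as (1, 2, 1, 4), which merges three cells in the first row, makes the function return True, while a merge that starts in row 5 does not.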


_row_looks_like_header(self, row)

Heuristic to determine if a given row resembles a header row based on cell content.


_parse_simple_headers(self, rows)

Parses a simple single-row header, assigning default column names if empty cells found.
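
As a rough sketch of the idea, the function below fills in a placeholder name wherever the header row has an empty cell. The name simple_headers and the "Column_N" naming pattern are assumptions for illustration; the module may use a different default.

```python
from typing import List, Optional, Sequence


def simple_headers(first_row: Sequence[Optional[object]]) -> List[str]:
    """Turn the first row into column names, replacing blanks with defaults."""
    headers = []
    for idx, value in enumerate(first_row, start=1):
        text = "" if value is None else str(value).strip()
        # "Column_<position>" is an assumed default name for empty header cells.
        headers.append(text if text else f"Column_{idx}")
    return headers
```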


_parse_multi_level_headers(self, ws, rows)

Handles headers spanning multiple rows, detecting header row count and building hierarchical header names joined by hyphens.


_detect_header_rows(self, rows)

Detects how many initial rows qualify as header rows based on heuristics.
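
One plausible heuristic, shown here only as a sketch, is to count leading rows until a row turns out to be mostly numeric. The function name detect_header_rows, the max_header_rows limit, the "at least half numeric" threshold, and the rule that at least one header row is always reported are assumptions, not details taken from the module.

```python
from typing import Optional, Sequence


def _is_numeric(text: str) -> bool:
    """Return True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def detect_header_rows(
    rows: Sequence[Sequence[Optional[object]]], max_header_rows: int = 3
) -> int:
    """Count leading rows that look like headers (assumed heuristic)."""
    count = 0
    for row in rows[:max_header_rows]:
        cells = [str(c).strip() for c in row if c is not None and str(c).strip()]
        if not cells:
            break  # an empty row ends the header block
        numeric = sum(_is_numeric(c) for c in cells)
        if numeric * 2 >= len(cells):
            break  # at least half the cells are numeric: treat as data
        count += 1
    # Assumption: a sheet always has at least one header row.
    return max(count, 1)
```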


_looks_like_header(self, value)

Checks if a string value looks like a header cell based on character content and symbols.


_looks_like_data(self, value)

Checks if a string looks like data (numbers, boolean flags, hex).
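
A minimal sketch of such a check follows. The patterns and the set of boolean flags are assumptions chosen to match the categories named above (numbers, boolean flags, hex); the module's actual rules may differ.

```python
import re

# Assumed patterns for illustration only.
_NUMBER = re.compile(r"[+-]?\d+(\.\d+)?")
_HEX = re.compile(r"0[xX][0-9a-fA-F]+")
_FLAGS = {"true", "false", "yes", "no", "y", "n"}


def looks_like_data(value: str) -> bool:
    """Return True if the string resembles a data value rather than a label."""
    text = value.strip()
    if not text:
        return False
    return bool(
        _NUMBER.fullmatch(text)
        or _HEX.fullmatch(text)
        or text.lower() in _FLAGS
    )
```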


_build_hierarchical_headers(self, ws, rows, header_rows)

Builds combined header names from multiple header rows, considering merged cells.
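
The sketch below illustrates one way to combine header rows into hyphen-joined names. It assumes merged cells appear as a value followed by None entries, so upper rows are forward-filled horizontally; the function name build_hierarchical_headers, that representation, and the "Column_N" fallback are assumptions for the example.

```python
from typing import List, Optional, Sequence


def build_hierarchical_headers(
    header_rows: Sequence[Sequence[Optional[str]]],
) -> List[str]:
    """Join the header parts found in each column with hyphens."""
    width = max(len(row) for row in header_rows)
    levels = []
    for depth, row in enumerate(header_rows):
        padded = list(row) + [None] * (width - len(row))
        if depth < len(header_rows) - 1:
            # Forward-fill upper rows to emulate horizontally merged cells.
            last = None
            for i, value in enumerate(padded):
                if value is None or not str(value).strip():
                    padded[i] = last
                else:
                    last = value
        levels.append(padded)

    headers = []
    for col in range(width):
        parts: List[str] = []
        for level in levels:
            value = level[col]
            text = "" if value is None else str(value).strip()
            if text and text not in parts:  # skip blanks and repeated parts
                parts.append(text)
        headers.append("-".join(parts) if parts else f"Column_{col + 1}")
    return headers
```

With the rows ["Name", "Sales", None] and [None, "Q1", "Q2"], the first column keeps the single part "Name" and the other two become "Sales-Q1" and "Sales-Q2".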


_is_valid_header_part(self, value)

Filters out header parts that are likely not meaningful (e.g., single-character flags or symbols).


_get_merged_cell_value(self, ws, row, col, merged_ranges)

Retrieves the value from a merged cell covering the specified row and column.
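
In spreadsheet libraries such as openpyxl, only the top-left cell of a merged range holds the value, so a lookup has to find the covering range and read its anchor cell. The sketch below models this with plain Python structures; the function name merged_cell_value, the dict-based cell store, and the tuple layout for ranges are assumptions for illustration.

```python
from typing import Dict, Iterable, Optional, Tuple

Coord = Tuple[int, int]                   # (row, col), 1-based
MergedRange = Tuple[int, int, int, int]   # (min_row, min_col, max_row, max_col)


def merged_cell_value(
    cells: Dict[Coord, object],
    merged_ranges: Iterable[MergedRange],
    row: int,
    col: int,
) -> Optional[object]:
    """Return the anchor value of the merged range covering (row, col), if any."""
    for min_row, min_col, max_row, max_col in merged_ranges:
        if min_row <= row <= max_row and min_col <= col <= max_col:
            # The value lives in the top-left cell of the merged range.
            return cells.get((min_row, min_col))
    return None
```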


_extract_row_data(self, ws, row, absolute_row_idx, expected_cols)

Extracts data for a row across all expected columns, considering merged cells and inherited values.


_get_inherited_value(self, ws, row, col, merged_ranges)

Helper to get the inherited value from merged cells if the current cell is empty.


_is_empty_row(self, row_data)

Checks if a row contains only empty or null values.
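
A check of this kind can be written in one expression. The sketch below treats None and whitespace-only strings as empty; whether the module applies exactly these rules is an assumption.

```python
from typing import Iterable, Optional


def is_empty_row(row_data: Iterable[Optional[object]]) -> bool:
    """Return True if every cell is None or blank after stripping whitespace."""
    return all(v is None or str(v).strip() == "" for v in row_data)
```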


Function: trans_datatime(s)

Attempts to parse a string into a datetime and format it as YYYY-MM-DD HH:mm:ss.
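
The normalization can be sketched with the standard library alone. This is not the module's code: the function name to_datetime_string, the list of accepted input formats, and the choice to return None for unparseable input are assumptions for the example.

```python
from datetime import datetime
from typing import Optional

# Assumed set of input formats, tried in order.
_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%Y/%m/%d", "%d/%m/%Y")


def to_datetime_string(s: str) -> Optional[str]:
    """Return the value as 'YYYY-MM-DD HH:MM:SS', or None if it does not parse."""
    text = s.strip()
    for fmt in _FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # try the next format
    return None
```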


Function: trans_bool(s)

Attempts to normalize boolean-like strings to "yes" or "no".
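
A minimal sketch of this normalization is shown below. The function name to_yes_no, the accepted tokens, and the None result for unrecognized input are assumptions; the module may recognize a different set of values.

```python
from typing import Optional

# Assumed token sets for illustration only.
_TRUE_TOKENS = {"true", "yes", "y"}
_FALSE_TOKENS = {"false", "no", "n"}


def to_yes_no(s: str) -> Optional[str]:
    """Map a boolean-like string to 'yes' or 'no'; return None otherwise."""
    token = s.strip().lower()
    if token in _TRUE_TOKENS:
        return "yes"
    if token in _FALSE_TOKENS:
        return "no"
    return None
```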


Function: column_data_type(arr)

Infers the predominant data type of a column by analyzing its values. Supports int, float, datetime, bool, and text.

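# Assumes table.py and its dependencies are importable from the working directory.
from table import column_data_type

values = ["100", "200", "300"]
converted, dtype = column_data_type(values)
print(dtype)  # 'int'
print(converted)  # [100, 200, 300]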
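
To illustrate what "predominant" could mean, the sketch below classifies each value and picks the most frequent label. It only returns the type name and does not convert values. The function names, the regular expressions, the boolean tokens, and the single date format are assumptions, not the module's actual rules.

```python
import re
from collections import Counter
from datetime import datetime
from typing import Iterable, Optional


def _classify(value: object) -> str:
    """Label one value as int, float, bool, datetime, or text (assumed rules)."""
    text = str(value).strip()
    if re.fullmatch(r"[+-]?\d+", text):
        return "int"
    if re.fullmatch(r"[+-]?(\d+\.\d*|\d*\.\d+)", text):
        return "float"
    if text.lower() in {"true", "false", "yes", "no"}:
        return "bool"
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return "datetime"
    except ValueError:
        return "text"


def infer_column_type(values: Iterable[Optional[object]]) -> str:
    """Return the most frequent label among non-null values, or 'text' if none."""
    counts = Counter(_classify(v) for v in values if v is not None)
    if not counts:
        return "text"
    return counts.most_common(1)[0][0]
```

A majority vote tolerates a few stray values: a column of integers with one free-text note is still reported as "int".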

Function: chunk(filename, binary=None, from_page=0, to_page=10000000000, lang="Chinese", callback=None, **kwargs)

Main function to parse a table file (Excel or text-based CSV/TXT) into tokenized chunks for NLP processing.

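# Assumes table.py and its dependencies are importable from the working directory.
from table import chunk

chunks = chunk("data.xlsx", lang="English", callback=print)
for chunk_data in chunks:
    print(chunk_data)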

Important Implementation Details and Algorithms


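Interaction with Other Parts of the System

Excel extends deepdoc.parser.ExcelParser. The chunk function uses Excel to read sheets, rag.nlp.rag_tokenizer and rag.nlp.tokenize for tokenization, xpinyin.Pinyin, and pandas.DataFrame for the parsed tables, and it updates the parser configuration through api.db.services.KnowledgebaseService.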


Visual Diagram

classDiagram
    class Excel {
        +__call__(fnm, binary=None, from_page=0, to_page=10000000000, callback=None)
        -_parse_headers(ws, rows)
        -_has_complex_header_structure(ws, rows)
        -_row_looks_like_header(row)
        -_parse_simple_headers(rows)
        -_parse_multi_level_headers(ws, rows)
        -_detect_header_rows(rows)
        -_looks_like_header(value)
        -_looks_like_data(value)
        -_build_hierarchical_headers(ws, rows, header_rows)
        -_is_valid_header_part(value)
        -_get_merged_cell_value(ws, row, col, merged_ranges)
        -_extract_row_data(ws, row, absolute_row_idx, expected_cols)
        -_get_inherited_value(ws, row, col, merged_ranges)
        -_is_empty_row(row_data)
    }

    class Functions {
        +trans_datatime(s)
        +trans_bool(s)
        +column_data_type(arr)
        +chunk(filename, binary=None, from_page=0, to_page=10000000000, lang="Chinese", callback=None, **kwargs)
    }

    Excel --|> deepdoc.parser.ExcelParser : extends
    chunk ..> Excel : uses
    chunk ..> rag.nlp.rag_tokenizer : uses
    chunk ..> rag.nlp.tokenize : uses
    chunk ..> api.db.services.KnowledgebaseService : updates
    chunk ..> xpinyin.Pinyin : uses
    chunk ..> pandas.DataFrame : creates

Summary

The table.py module extracts tabular data from Excel and CSV/TXT files, handles multi-level headers and merged cells, infers column data types, and prepares tokenized records for NLP processing within the InfiniFlow platform. It also supports knowledge base integration by updating the parser configuration with field mappings.

It serves workflows that need to ingest structured tabular data automatically, including multilingual content with mixed data types and complex table layouts.