table.py

Overview

The table.py file provides functionality to parse tabular data from Excel (.xls/.xlsx) and text-based CSV or TXT files into structured data chunks suitable for downstream NLP processing. It primarily focuses on intelligently extracting headers and data rows from Excel sheets, handling complex header structures including merged cells, and determining column data types. The file integrates with the InfiniFlow knowledge base system by updating parser configurations and preparing tokenized, normalized records for indexing or further analysis.

Key features:

Parsing Excel files with support for multi-level headers and merged cells.
Parsing CSV/TXT files with flexible delimiter support (default TAB).
Automatic column data type inference (int, float, bool, datetime, text).
Data normalization and tokenization using RAG tokenizer for NLP readiness.
Callback support for progress reporting.
Integration with KnowledgebaseService to update field mappings.

Classes and Functions

Class: `Excel`

Extends ExcelParser to provide advanced Excel sheet parsing capabilities including detection and handling of complex/multi-level headers and extraction of row data accounting for merged cells.

Methods

`call(self, fnm, binary=None, from_page=0, to_page=10000000000, callback=None)`

Main entry point to parse an Excel file or binary content.

Parameters:
- fnm (str): Filename or path to the Excel file.
- binary (bytes, optional): Raw binary content of the Excel file. If provided, file loading uses this instead of reading from disk.
- from_page (int, optional): Start row index (0-based) for data extraction.
- to_page (int, optional): End row index (exclusive) for data extraction.
- callback (callable, optional): Function to report progress and messages. Called with (progress_float, message_str).
Returns:
- List[pd.DataFrame]: List of pandas DataFrames, each representing a parsed sheet data block.
Usage Example:

excel_parser = Excel()
dataframes = excel_parser("data.xlsx", from_page=0, to_page=100, callback=print)
for df in dataframes:
    print(df.head())

Implementation Details:
- Loads workbook using either file path or binary data.
- Iterates sheets, counts total rows.
- Parses headers with _parse_headers (supports simple and multi-level headers).
- Extracts rows from specified range, skips empty or failed rows.
- Collects data rows into pandas DataFrame objects.
- Reports progress and returns list of DataFrames.

`_parse_headers(self, ws, rows)`

Determines whether sheet headers are simple or complex (multi-level), then delegates to appropriate parser.

Parameters:
- ws: Excel worksheet object.
- rows: List of rows in the sheet.
Returns:
- Tuple (headers: List[str], header_rows: int): List of header column names and the number of header rows processed.

`_has_complex_header_structure(self, ws, rows)`

Checks if the sheet has merged cells in the first two rows indicating a complex header.

Returns: bool

`_row_looks_like_header(self, row)`

Heuristic to determine if a given row resembles a header row based on cell content.

Returns: bool

`_parse_simple_headers(self, rows)`

Parses a simple single-row header, assigning default column names if empty cells found.

Returns: (headers: List[str], header_rows: int)

`_parse_multi_level_headers(self, ws, rows)`

Handles headers spanning multiple rows, detecting header row count and building hierarchical header names joined by hyphens.

Returns: (headers: List[str], header_rows: int)

`_detect_header_rows(self, rows)`

Detects how many initial rows qualify as header rows based on heuristics.

Returns: int number of header rows

`_looks_like_header(self, value)`

Checks if a string value looks like a header cell based on character content and symbols.

Returns: bool

`_looks_like_data(self, value)`

Checks if a string looks like data (numbers, boolean flags, hex).

Returns: bool

`_build_hierarchical_headers(self, ws, rows, header_rows)`

Builds combined header names from multiple header rows, considering merged cells.

Returns: List[str]

`_is_valid_header_part(self, value)`

Filters out header parts that are likely not meaningful (e.g., single-character flags or symbols).

Returns: bool

`_get_merged_cell_value(self, ws, row, col, merged_ranges)`

Retrieves the value from a merged cell covering the specified row and column.

Returns: Cell value or None

`_extract_row_data(self, ws, row, absolute_row_idx, expected_cols)`

Extracts data for a row across all expected columns, considering merged cells and inherited values.

Returns: List of cell values for the row.

`_get_inherited_value(self, ws, row, col, merged_ranges)`

Helper to get the inherited value from merged cells if the current cell is empty.

Returns: Cell value or None

`_is_empty_row(self, row_data)`

Checks if a row contains only empty or null values.

Returns: bool

Function: `trans_datatime(s)`

Attempts to parse a string into a datetime and format as YYYY-MM-DD HH:mm:ss.

Parameters:
- s (str): Input string.
Returns:
- Formatted datetime string or None on failure.

Function: `trans_bool(s)`

Attempts to normalize boolean-like strings to "yes" or "no".

Parameters:
- s (str): Input string.
Returns:
- "yes" or "no" or None if no match.

Function: `column_data_type(arr)`

Infers the predominant data type of a column by analyzing its values. Supports int, float, datetime, bool, and text.

Parameters:
- arr (iterable): Column values.
Returns:
- Tuple (transformed_list, data_type_str) where:
  - transformed_list: List of values converted to inferred type or None where conversion failed.
  - data_type_str: One of "int", "float", "datetime", "bool", "text".
Usage Example:

values = ["100", "200", "300"]
converted, dtype = column_data_type(values)
print(dtype)  # 'int'
print(converted)  # [100, 200, 300]

Implementation Details:
- Uses regex and helper functions to count type occurrences.
- Converts values to the most frequent type.
- Flags large integers as floats if exceeding 64-bit integer limits.

Function: `chunk(filename, binary=None, from_page=0, to_page=10000000000, lang="Chinese", callback=None, **kwargs)`

Main function to parse a table file (Excel or text-based CSV/TXT) into tokenized chunks for NLP processing.

Parameters:
- filename (str): File path or name.
- binary (bytes, optional): Raw file content.
- from_page (int, optional): Start row index.
- to_page (int, optional): End row index.
- lang (str, optional): Language for tokenization ("Chinese" or "English").
- callback (callable, optional): Progress reporting function.
- **kwargs: Additional options, e.g., delimiter for CSV/TXT.
Returns:
- List[dict]: List of dictionaries, each representing a tokenized data chunk with normalized fields.
Usage Example:

chunks = chunk("data.xlsx", lang="English", callback=print)
for chunk_data in chunks:
    print(chunk_data)

Implementation Details:
- Detects file type by extension.
- Uses Excel class for .xls/.xlsx, custom parsing for .csv/.txt.
- Validates headers, handles duplicate columns.
- Infers column types via column_data_type.
- Normalizes and tokenizes data fields using rag_tokenizer and tokenize.
- Updates knowledge base parser configuration with field mappings.
- Supports progress update callbacks.

Important Implementation Details and Algorithms

Header Parsing:
- Detects if headers are simple (single row) or complex (multi-level with merged cells).
- For multi-level headers, merges cell values vertically and horizontally to build compound header names.
- Uses heuristics based on character content to distinguish headers from data rows.
Row Extraction:
- Handles merged cells by fetching appropriate inherited values if a cell is empty.
- Skips empty rows.
- Tracks failures in data extraction to report problematic lines.
Data Type Inference:
- Counts occurrences of various data type patterns per column.
- Supports large integer detection and converts to float if exceeding 64-bit range.
- Converts values to detected types for downstream consistency.
Tokenization and NLP Preparation:
- Uses rag_tokenizer to tokenize text fields.
- Incorporates pinyin transliteration for Chinese headers.
- Organizes fields with suffixes indicating data types (_tks for text tokens, _long for integers, etc.).
- Prepares dictionaries representing document chunks with metadata for indexing.
Integration:
- Updates KnowledgebaseService with field mappings to keep track of field types and names for the knowledge base.

Interaction with Other Parts of the System

Dependencies:
- deepdoc.parser.ExcelParser: Base class for Excel parsing.
- rag.nlp.rag_tokenizer and rag.nlp.tokenize: For tokenizing extracted text.
- api.db.services.knowledgebase_service.KnowledgebaseService: To update parser configuration with field mappings.
- deepdoc.parser.utils.get_text: To read text content from files.
- xpinyin.Pinyin: To convert Chinese column headers to pinyin for standardized field names.
- pandas and numpy: For data structure management and transformations.
- dateutil.parser for datetime parsing.
Outputs:
- Structured and tokenized data chunks ready for indexing or NLP analysis.
- Updates to knowledge base parser configuration for field recognition.

Visual Diagram

classDiagram
    class Excel {
        +__call__(fnm, binary=None, from_page=0, to_page=10000000000, callback=None)
        -_parse_headers(ws, rows)
        -_has_complex_header_structure(ws, rows)
        -_row_looks_like_header(row)
        -_parse_simple_headers(rows)
        -_parse_multi_level_headers(ws, rows)
        -_detect_header_rows(rows)
        -_looks_like_header(value)
        -_looks_like_data(value)
        -_build_hierarchical_headers(ws, rows, header_rows)
        -_is_valid_header_part(value)
        -_get_merged_cell_value(ws, row, col, merged_ranges)
        -_extract_row_data(ws, row, absolute_row_idx, expected_cols)
        -_get_inherited_value(ws, row, col, merged_ranges)
        -_is_empty_row(row_data)
    }

    class Functions {
        +trans_datatime(s)
        +trans_bool(s)
        +column_data_type(arr)
        +chunk(filename, binary=None, from_page=0, to_page=10000000000, lang="Chinese", callback=None, **kwargs)
    }

    Excel ..> deepdoc.parser.ExcelParser : extends
    chunk ..> Excel : uses
    chunk ..> rag.nlp.rag_tokenizer : uses
    chunk ..> rag.nlp.tokenize : uses
    chunk ..> api.db.services.KnowledgebaseService : updates
    chunk ..> xpinyin.Pinyin : uses
    chunk ..> pandas.DataFrame : creates

Summary

The table.py module is a robust utility for extracting tabular data from Excel and CSV/TXT files, intelligently handling complex headers and data types, and preparing the data for NLP processing within the InfiniFlow platform. It bridges raw file input and structured semantic data chunks, supporting knowledge base integrations with field mapping updates and tokenization workflows.

This file is crucial for any workflow that requires automatic ingestion and understanding of structured tabular data formats, especially in multilingual environments with mixed data types and complex table layouts.

table.py

Overview

Classes and Functions

Class: Excel

Methods

__call__(self, fnm, binary=None, from_page=0, to_page=10000000000, callback=None)

_parse_headers(self, ws, rows)

_has_complex_header_structure(self, ws, rows)

_row_looks_like_header(self, row)

_parse_simple_headers(self, rows)

_parse_multi_level_headers(self, ws, rows)

_detect_header_rows(self, rows)

_looks_like_header(self, value)

_looks_like_data(self, value)

_build_hierarchical_headers(self, ws, rows, header_rows)

_is_valid_header_part(self, value)

_get_merged_cell_value(self, ws, row, col, merged_ranges)

_extract_row_data(self, ws, row, absolute_row_idx, expected_cols)

_get_inherited_value(self, ws, row, col, merged_ranges)

_is_empty_row(self, row_data)

Function: trans_datatime(s)

Function: trans_bool(s)

Function: column_data_type(arr)

Function: chunk(filename, binary=None, from_page=0, to_page=10000000000, lang="Chinese", callback=None, **kwargs)