table.py
Overview
The table.py file provides functionality to parse tabular data from Excel (.xls/.xlsx) and text-based CSV or TXT files into structured data chunks suitable for downstream NLP processing. It primarily focuses on intelligently extracting headers and data rows from Excel sheets, handling complex header structures including merged cells, and determining column data types. The file integrates with the InfiniFlow knowledge base system by updating parser configurations and preparing tokenized, normalized records for indexing or further analysis.
Key features:
Parsing Excel files with support for multi-level headers and merged cells.
Parsing CSV/TXT files with flexible delimiter support (default TAB).
Automatic column data type inference (int, float, bool, datetime, text).
Data normalization and tokenization using RAG tokenizer for NLP readiness.
Callback support for progress reporting.
Integration with
KnowledgebaseServiceto update field mappings.
Classes and Functions
Class: Excel
Extends ExcelParser to provide advanced Excel sheet parsing capabilities including detection and handling of complex/multi-level headers and extraction of row data accounting for merged cells.
Methods
__call__(self, fnm, binary=None, from_page=0, to_page=10000000000, callback=None)
Main entry point to parse an Excel file or binary content.
Parameters:
fnm(str): Filename or path to the Excel file.binary(bytes, optional): Raw binary content of the Excel file. If provided, file loading uses this instead of reading from disk.from_page(int, optional): Start row index (0-based) for data extraction.to_page(int, optional): End row index (exclusive) for data extraction.callback(callable, optional): Function to report progress and messages. Called with (progress_float, message_str).
Returns:
List[pd.DataFrame]: List of pandas DataFrames, each representing a parsed sheet data block.
Usage Example:
excel_parser = Excel()
dataframes = excel_parser("data.xlsx", from_page=0, to_page=100, callback=print)
for df in dataframes:
print(df.head())
Implementation Details:
Loads workbook using either file path or binary data.
Iterates sheets, counts total rows.
Parses headers with
_parse_headers(supports simple and multi-level headers).Extracts rows from specified range, skips empty or failed rows.
Collects data rows into pandas DataFrame objects.
Reports progress and returns list of DataFrames.
_parse_headers(self, ws, rows)
Determines whether sheet headers are simple or complex (multi-level), then delegates to appropriate parser.
Parameters:
ws: Excel worksheet object.rows: List of rows in the sheet.
Returns:
Tuple
(headers: List[str], header_rows: int): List of header column names and the number of header rows processed.
_has_complex_header_structure(self, ws, rows)
Checks if the sheet has merged cells in the first two rows indicating a complex header.
Returns:
bool
_row_looks_like_header(self, row)
Heuristic to determine if a given row resembles a header row based on cell content.
Returns:
bool
_parse_simple_headers(self, rows)
Parses a simple single-row header, assigning default column names if empty cells found.
Returns:
(headers: List[str], header_rows: int)
_parse_multi_level_headers(self, ws, rows)
Handles headers spanning multiple rows, detecting header row count and building hierarchical header names joined by hyphens.
Returns:
(headers: List[str], header_rows: int)
_detect_header_rows(self, rows)
Detects how many initial rows qualify as header rows based on heuristics.
Returns:
intnumber of header rows
_looks_like_header(self, value)
Checks if a string value looks like a header cell based on character content and symbols.
Returns:
bool
_looks_like_data(self, value)
Checks if a string looks like data (numbers, boolean flags, hex).
Returns:
bool
_build_hierarchical_headers(self, ws, rows, header_rows)
Builds combined header names from multiple header rows, considering merged cells.
Returns:
List[str]
_is_valid_header_part(self, value)
Filters out header parts that are likely not meaningful (e.g., single-character flags or symbols).
Returns:
bool
_get_merged_cell_value(self, ws, row, col, merged_ranges)
Retrieves the value from a merged cell covering the specified row and column.
Returns: Cell value or
None
_extract_row_data(self, ws, row, absolute_row_idx, expected_cols)
Extracts data for a row across all expected columns, considering merged cells and inherited values.
Returns:
Listof cell values for the row.
_get_inherited_value(self, ws, row, col, merged_ranges)
Helper to get the inherited value from merged cells if the current cell is empty.
Returns: Cell value or
None
_is_empty_row(self, row_data)
Checks if a row contains only empty or null values.
Returns:
bool
Function: trans_datatime(s)
Attempts to parse a string into a datetime and format as YYYY-MM-DD HH:mm:ss.
Parameters:
s(str): Input string.
Returns:
Formatted datetime string or
Noneon failure.
Function: trans_bool(s)
Attempts to normalize boolean-like strings to "yes" or "no".
Parameters:
s(str): Input string.
Returns:
"yes"or"no"orNoneif no match.
Function: column_data_type(arr)
Infers the predominant data type of a column by analyzing its values. Supports int, float, datetime, bool, and text.
Parameters:
arr(iterable): Column values.
Returns:
Tuple
(transformed_list, data_type_str)where:transformed_list: List of values converted to inferred type orNonewhere conversion failed.data_type_str: One of"int","float","datetime","bool","text".
Usage Example:
values = ["100", "200", "300"]
converted, dtype = column_data_type(values)
print(dtype) # 'int'
print(converted) # [100, 200, 300]
Implementation Details:
Uses regex and helper functions to count type occurrences.
Converts values to the most frequent type.
Flags large integers as floats if exceeding 64-bit integer limits.
Function: chunk(filename, binary=None, from_page=0, to_page=10000000000, lang="Chinese", callback=None, **kwargs)
Main function to parse a table file (Excel or text-based CSV/TXT) into tokenized chunks for NLP processing.
Parameters:
filename(str): File path or name.binary(bytes, optional): Raw file content.from_page(int, optional): Start row index.to_page(int, optional): End row index.lang(str, optional): Language for tokenization ("Chinese" or "English").callback(callable, optional): Progress reporting function.**kwargs: Additional options, e.g.,delimiterfor CSV/TXT.
Returns:
List[dict]: List of dictionaries, each representing a tokenized data chunk with normalized fields.
Usage Example:
chunks = chunk("data.xlsx", lang="English", callback=print)
for chunk_data in chunks:
print(chunk_data)
Implementation Details:
Detects file type by extension.
Uses
Excelclass for.xls/.xlsx, custom parsing for.csv/.txt.Validates headers, handles duplicate columns.
Infers column types via
column_data_type.Normalizes and tokenizes data fields using
rag_tokenizerandtokenize.Updates knowledge base parser configuration with field mappings.
Supports progress update callbacks.
Important Implementation Details and Algorithms
Header Parsing:
Detects if headers are simple (single row) or complex (multi-level with merged cells).
For multi-level headers, merges cell values vertically and horizontally to build compound header names.
Uses heuristics based on character content to distinguish headers from data rows.
Row Extraction:
Handles merged cells by fetching appropriate inherited values if a cell is empty.
Skips empty rows.
Tracks failures in data extraction to report problematic lines.
Data Type Inference:
Counts occurrences of various data type patterns per column.
Supports large integer detection and converts to float if exceeding 64-bit range.
Converts values to detected types for downstream consistency.
Tokenization and NLP Preparation:
Uses
rag_tokenizerto tokenize text fields.Incorporates pinyin transliteration for Chinese headers.
Organizes fields with suffixes indicating data types (
_tksfor text tokens,_longfor integers, etc.).Prepares dictionaries representing document chunks with metadata for indexing.
Integration:
Updates
KnowledgebaseServicewith field mappings to keep track of field types and names for the knowledge base.
Interaction with Other Parts of the System
Dependencies:
deepdoc.parser.ExcelParser: Base class for Excel parsing.rag.nlp.rag_tokenizerandrag.nlp.tokenize: For tokenizing extracted text.api.db.services.knowledgebase_service.KnowledgebaseService: To update parser configuration with field mappings.deepdoc.parser.utils.get_text: To read text content from files.xpinyin.Pinyin: To convert Chinese column headers to pinyin for standardized field names.pandasandnumpy: For data structure management and transformations.dateutil.parserfor datetime parsing.
Outputs:
Structured and tokenized data chunks ready for indexing or NLP analysis.
Updates to knowledge base parser configuration for field recognition.
Visual Diagram
classDiagram
class Excel {
+__call__(fnm, binary=None, from_page=0, to_page=10000000000, callback=None)
-_parse_headers(ws, rows)
-_has_complex_header_structure(ws, rows)
-_row_looks_like_header(row)
-_parse_simple_headers(rows)
-_parse_multi_level_headers(ws, rows)
-_detect_header_rows(rows)
-_looks_like_header(value)
-_looks_like_data(value)
-_build_hierarchical_headers(ws, rows, header_rows)
-_is_valid_header_part(value)
-_get_merged_cell_value(ws, row, col, merged_ranges)
-_extract_row_data(ws, row, absolute_row_idx, expected_cols)
-_get_inherited_value(ws, row, col, merged_ranges)
-_is_empty_row(row_data)
}
class Functions {
+trans_datatime(s)
+trans_bool(s)
+column_data_type(arr)
+chunk(filename, binary=None, from_page=0, to_page=10000000000, lang="Chinese", callback=None, **kwargs)
}
Excel ..> deepdoc.parser.ExcelParser : extends
chunk ..> Excel : uses
chunk ..> rag.nlp.rag_tokenizer : uses
chunk ..> rag.nlp.tokenize : uses
chunk ..> api.db.services.KnowledgebaseService : updates
chunk ..> xpinyin.Pinyin : uses
chunk ..> pandas.DataFrame : creates
Summary
The table.py module is a robust utility for extracting tabular data from Excel and CSV/TXT files, intelligently handling complex headers and data types, and preparing the data for NLP processing within the InfiniFlow platform. It bridges raw file input and structured semantic data chunks, supporting knowledge base integrations with field mapping updates and tokenization workflows.
This file is crucial for any workflow that requires automatic ingestion and understanding of structured tabular data formats, especially in multilingual environments with mixed data types and complex table layouts.