excel_parser.py

Overview

excel_parser.py provides a utility class RAGFlowExcelParser designed for parsing and processing Excel and CSV files. Its primary purpose is to load spreadsheet data from various file formats, clean and convert it into consistent in-memory Excel workbooks, and offer multiple output formats including HTML tables, Markdown, and plain text representations.

The parser supports both .xlsx and legacy .xls Excel files, as well as CSV files, automatically detecting the file type. It leverages multiple libraries to maximize compatibility and robustness: openpyxl for Excel loading, pandas for flexible reading and conversion, and fallback engines like calamine for edge cases. It also includes cleaning of illegal characters that might cause issues during processing.

This file is useful for applications that need to ingest spreadsheet data in diverse formats and produce consistent, clean, and display-friendly outputs without depending on the original file format.

Classes and Methods

Class: `RAGFlowExcelParser`

A parser class to load, clean, and transform Excel and CSV files into various output formats.

Static Methods

`_load_excel_to_workbook(file_like_object) -> openpyxl.Workbook`

Purpose:
Load spreadsheet data from a file-like object or bytes, returning an openpyxl.Workbook instance.

Parameters:

file_like_object (bytes or file-like object): Raw input data or file stream containing Excel or CSV content.

Returns:

openpyxl.Workbook: A workbook object containing the parsed spreadsheet data.

Details:

Determines file type by reading the first 4 bytes (magic numbers for XLSX or XLS files).
If not Excel, tries to parse as CSV using pandas.read_csv.
If Excel, tries openpyxl.load_workbook with data_only=True.
On failure, tries pandas.read_excel with fallback to the calamine engine.
Converts pandas.DataFrame objects back into openpyxl.Workbook via _dataframe_to_workbook.
Raises exceptions on total failure.

Usage Example:

with open("data.xlsx", "rb") as f:
    wb = RAGFlowExcelParser._load_excel_to_workbook(f)

`_clean_dataframe(df: pd.DataFrame) -> pd.DataFrame`

Purpose:
Remove illegal XML characters from all string cells in a DataFrame.

Parameters:

df (pandas.DataFrame): Input data frame to clean.

Returns:

pandas.DataFrame: Cleaned data frame with illegal characters replaced by spaces.

Implementation Details:

Uses a regex ILLEGAL_CHARACTERS_RE to match control characters disallowed in Excel XML.
Applies a cleaning function to each cell in the DataFrame.

`_dataframe_to_workbook(df: pd.DataFrame) -> openpyxl.Workbook`

Purpose:
Convert a cleaned pandas DataFrame into an openpyxl.Workbook.

Parameters:

df (pandas.DataFrame): Cleaned DataFrame to convert.

Returns:

openpyxl.Workbook: Workbook containing the data in a sheet named "Data".

Implementation Details:

Writes column headers in the first row.
Writes values row-wise starting from the second row.

Instance Methods

`html(self, fnm, chunk_rows=256) -> list[str]`

Purpose:
Render the spreadsheet data as a list of HTML table chunks, suitable for display in web pages.

Parameters:

fnm (bytes, str, or file-like): The file content or path.
chunk_rows (int, optional): Number of rows per HTML table chunk (default 256).

Returns:

list[str]: List of HTML strings, each representing a table chunk for a worksheet.

Implementation Details:

Uses _load_excel_to_workbook to get workbook.
Escapes HTML special characters in cell values.
Splits rows into chunks to avoid very large tables.
Each table includes a caption with the sheet name.
Empty cells rendered as empty <td> cells.

Usage Example:

parser = RAGFlowExcelParser()
html_tables = parser.html(excel_bytes)
for table_html in html_tables:
    print(table_html)

`markdown(self, fnm) -> str`

Purpose:
Convert the spreadsheet or CSV file content into a Markdown-formatted table.

Parameters:

fnm (bytes, str, or file-like): The file content or path.

Returns:

str: A Markdown string representation of the first sheet or CSV.

Implementation Details:

Tries pandas.read_excel, falls back to pandas.read_csv on failure.
Replaces empty or whitespace-only cells with empty strings.
Uses pandas.DataFrame.to_markdown for output.

`call(self, fnm) -> list[str]`

Purpose:
Parse the spreadsheet and return a list of semicolon-separated field strings representing rows.

Parameters:

fnm (bytes, str, or file-like): File content or file path.

Returns:

list[str]: List of strings in the format "Header: Value; Header2: Value2 ——SheetName".

Implementation Details:

Loads workbook with _load_excel_to_workbook.
Iterates sheets and rows, skipping empty cells.
Combines header and cell value with a colon separator, joins fields by semicolons.
Adds sheet name as suffix if sheet name does not contain "sheet".

Usage Example:

parser = RAGFlowExcelParser()
lines = parser(excel_bytes)
for line in lines:
    print(line)

`row_number(fnm: str, binary: bytes) -> int`

Purpose:
Calculate the total number of rows in the given spreadsheet or CSV file.

Parameters:

fnm (str): Filename or file identifier to infer file type.
binary (bytes): Raw file content.

Returns:

int: Total number of rows across all sheets or lines in the CSV.

Implementation Details:

For Excel files (.xls, .xlsx), loads workbook and sums rows from all sheets.
For CSV or TXT files, detects encoding using find_codec and counts newline characters.

Important Implementation Details

File Type Detection: The parser reads the first 4 bytes of the input to detect whether it is an Excel file by matching magic bytes for .xlsx (PK\x03\x04) or .xls (\xd0\xcf\x11\xe0). If neither matches, it treats the input as CSV.
Multiple Loading Strategies: To maximize compatibility, it uses openpyxl first, then pandas.read_excel with the default engine, and finally with the calamine engine as fallback.
Cleaning Illegal Characters: Excel's XML format disallows certain control characters. The parser cleans these using a regex before converting data into workbooks.
Chunked HTML Output: For large sheets, the HTML output divides rows into manageable chunks to avoid overwhelming browsers or clients.
Flexible Input: Methods accept raw bytes or file-like objects, allowing integration with in-memory data or files.
Integration with rag.nlp.find_codec: The row_number method uses an external function to detect text encoding for CSV files, indicating integration with a larger NLP or toolkit ecosystem.

Interaction with Other System Components

Imports find_codec from rag.nlp, suggesting this parser is part of a larger system that includes natural language processing or data ingestion utilities.
Uses standard libraries (openpyxl, pandas, logging) for file handling and data manipulation.
The class is designed to be instantiated and called directly or used for generating HTML and Markdown reports, making it suitable as a backend utility in web applications, CLI tools, or data pipelines.

Visual Diagram

classDiagram
    class RAGFlowExcelParser {
        <<static>>
        +_load_excel_to_workbook(file_like_object) Workbook
        +_clean_dataframe(df: DataFrame) DataFrame
        +_dataframe_to_workbook(df: DataFrame) Workbook
        --
        +html(fnm, chunk_rows=256) list~str~
        +markdown(fnm) str
        +__call__(fnm) list~str~
        +row_number(fnm: str, binary: bytes) int
    }

    RAGFlowExcelParser ..> pandas.DataFrame : uses
    RAGFlowExcelParser ..> openpyxl.Workbook : uses
    RAGFlowExcelParser ..> rag.nlp.find_codec : uses

Summary

excel_parser.py defines a robust and flexible parser class for Excel and CSV files, capable of handling multiple formats, cleaning data, and producing various output representations (HTML, Markdown, text). It integrates multiple libraries and fallback mechanisms, ensuring high compatibility with diverse spreadsheet inputs. This file serves as a core utility in a system that requires ingestion and display of tabular data from user uploads or external sources.

excel_parser.py

Overview

Classes and Methods

Class: RAGFlowExcelParser

Static Methods

_load_excel_to_workbook(file_like_object) -> openpyxl.Workbook

_clean_dataframe(df: pd.DataFrame) -> pd.DataFrame

_dataframe_to_workbook(df: pd.DataFrame) -> openpyxl.Workbook

Instance Methods

html(self, fnm, chunk_rows=256) -> list[str]

markdown(self, fnm) -> str

__call__(self, fnm) -> list[str]

row_number(fnm: str, binary: bytes) -> int

Important Implementation Details

Interaction with Other System Components

Visual Diagram

Summary

Class: `RAGFlowExcelParser`

`_load_excel_to_workbook(file_like_object) -> openpyxl.Workbook`

`_clean_dataframe(df: pd.DataFrame) -> pd.DataFrame`

`_dataframe_to_workbook(df: pd.DataFrame) -> openpyxl.Workbook`

`html(self, fnm, chunk_rows=256) -> list[str]`

`markdown(self, fnm) -> str`

`call(self, fnm) -> list[str]`

`row_number(fnm: str, binary: bytes) -> int`