excel_parser.py


Overview

excel_parser.py provides a utility class RAGFlowExcelParser designed for parsing and processing Excel and CSV files. Its primary purpose is to load spreadsheet data from various file formats, clean and convert it into consistent in-memory Excel workbooks, and offer multiple output formats including HTML tables, Markdown, and plain text representations.

The parser supports .xlsx and legacy .xls Excel files as well as CSV files, and detects the file type automatically. It loads Excel files with openpyxl, uses pandas for reading and conversion, and falls back to alternative engines such as calamine for edge cases. It also strips illegal characters that might cause problems during processing.

This file is useful for applications that need to ingest spreadsheet data in diverse formats and produce consistent, clean, and display-friendly outputs without depending on the original file format.


Classes and Methods

Class: RAGFlowExcelParser

A parser class to load, clean, and transform Excel and CSV files into various output formats.


Static Methods

_load_excel_to_workbook(file_like_object) -> openpyxl.Workbook

Purpose:
Load spreadsheet data from a file-like object or bytes, returning an openpyxl.Workbook instance.

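Parameters:
file_like_object: A file-like object or raw bytes containing the spreadsheet data.

Returns:
An openpyxl.Workbook instance holding the loaded data.

Details:
The file type (.xlsx, .xls, or CSV) is detected automatically. Excel files are loaded with openpyxl; pandas handles reading and conversion, with fallback engines such as calamine for edge cases.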

Usage Example:

with open("data.xlsx", "rb") as f:
    wb = RAGFlowExcelParser._load_excel_to_workbook(f)
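
The automatic file-type detection mentioned above could be approached by inspecting the first bytes of the input. The following is a minimal sketch under that assumption, not the parser's actual logic: sniff_spreadsheet_kind is a hypothetical helper, and treating every non-Excel input as CSV is also an assumption. The two signatures themselves are properties of the ZIP and OLE2 container formats.

```python
from io import BytesIO

# File signatures (magic numbers) of the container formats.
ZIP_MAGIC = b"PK\x03\x04"         # .xlsx files are ZIP archives
OLE2_MAGIC = b"\xd0\xcf\x11\xe0"  # legacy .xls files are OLE2 compound files


def sniff_spreadsheet_kind(source) -> str:
    """Hypothetical helper: classify input as 'xlsx', 'xls', or 'csv'."""
    # Accept raw bytes as well as file-like objects.
    stream = BytesIO(source) if isinstance(source, (bytes, bytearray)) else source
    stream.seek(0)
    head = stream.read(4)
    stream.seek(0)  # rewind so a real loader can read from the start
    if head.startswith(ZIP_MAGIC):
        return "xlsx"
    if head.startswith(OLE2_MAGIC):
        return "xls"
    # Assumption: anything else is treated as delimited text.
    return "csv"
```

A loader built on this idea would dispatch to openpyxl or pandas depending on the returned kind, and try the next engine if the first one raises.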

_clean_dataframe(df: pd.DataFrame) -> pd.DataFrame

Purpose:
Remove illegal XML characters from all string cells in a DataFrame.

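Parameters:
df: The pandas DataFrame to clean.

Returns:
A pd.DataFrame with illegal XML characters removed from its string cells.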

Implementation Details:
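
To illustrate the kind of cleaning described here, the sketch below strips control characters that XML 1.0 cannot represent. It works on plain Python rows rather than a DataFrame so that it runs without third-party packages. The names clean_cell and clean_rows are hypothetical, and replacing each illegal character with a space (rather than deleting it) is an assumption.

```python
import re

# Control characters that XML 1.0 cannot store: everything below 0x20
# except tab (0x09), line feed (0x0A), and carriage return (0x0D).
ILLEGAL_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_cell(value, replacement=" "):
    """Replace illegal characters in strings; leave other types untouched."""
    if isinstance(value, str):
        return ILLEGAL_XML_CHARS.sub(replacement, value)
    return value


def clean_rows(rows):
    """Apply clean_cell to every cell of a list of rows."""
    return [[clean_cell(v) for v in row] for row in rows]
```

With pandas, the same per-cell function could be applied to every column of a DataFrame; only string cells would change.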


_dataframe_to_workbook(df: pd.DataFrame) -> openpyxl.Workbook

Purpose:
Convert a cleaned pandas DataFrame into an openpyxl.Workbook.

Parameters:
df: The cleaned pandas DataFrame to convert.

Returns:
An openpyxl.Workbook containing the DataFrame's data.

Implementation Details:


Instance Methods

html(self, fnm, chunk_rows=256) -> list[str]

Purpose:
Render the spreadsheet data as a list of HTML table chunks, suitable for display in web pages.

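Parameters:
fnm: The spreadsheet input; the usage example below passes the file content as bytes.
chunk_rows: Number of rows per HTML table chunk (default 256).

Returns:
A list of strings, each holding one HTML table chunk.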

Implementation Details:

Usage Example:

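parser = RAGFlowExcelParser()
# Read the spreadsheet into memory; "data.xlsx" is a placeholder path.
with open("data.xlsx", "rb") as f:
    excel_bytes = f.read()
html_tables = parser.html(excel_bytes)
for table_html in html_tables:
    print(table_html)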
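
The chunking behavior can be pictured with a small standalone sketch. It is not the parser's implementation: rows_to_html_chunks is a hypothetical helper, the markup is deliberately minimal, and repeating the header row in every chunk is an assumption.

```python
from html import escape


def rows_to_html_chunks(header, rows, chunk_rows=256):
    """Hypothetical helper: one <table> string per chunk of data rows."""

    def render_row(cells, tag):
        # Escape cell text so that characters like '<' stay literal.
        return "<tr>" + "".join(
            f"<{tag}>{escape('' if c is None else str(c))}</{tag}>" for c in cells
        ) + "</tr>"

    head_html = render_row(header, "th")
    chunks = []
    for start in range(0, len(rows), chunk_rows):
        body = "".join(render_row(r, "td") for r in rows[start:start + chunk_rows])
        # Assumption: every chunk repeats the header row.
        chunks.append(f"<table>{head_html}{body}</table>")
    return chunks
```

Splitting a large sheet this way keeps each table small enough to display or process independently.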

markdown(self, fnm) -> str

Purpose:
Convert the spreadsheet or CSV file content into a Markdown-formatted table.

Parameters:
fnm: The spreadsheet or CSV file to convert.

Returns:
A string containing the Markdown-formatted table.

Implementation Details:
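
As an illustration of the target format, the following sketch builds a Markdown table from a header and a list of rows. It is a simplified stand-in written with the standard library only; rows_to_markdown is a hypothetical name, and escaping pipe characters inside cells is an assumption about how cell text should be protected.

```python
def rows_to_markdown(header, rows):
    """Hypothetical helper: render a header and rows as a Markdown table."""

    def fmt(cells):
        # Escape pipes so cell text cannot break the table layout.
        texts = ["" if c is None else str(c).replace("|", "\\|") for c in cells]
        return "| " + " | ".join(texts) + " |"

    # Header row, separator row, then one line per data row.
    lines = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines.extend(fmt(r) for r in rows)
    return "\n".join(lines)
```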


__call__(self, fnm) -> list[str]

Purpose:
Parse the spreadsheet and return a list of semicolon-separated field strings representing rows.

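Parameters:
fnm: The spreadsheet to parse.

Returns:
A list of strings, each a semicolon-separated sequence of fields representing one row.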

Implementation Details:

Usage Example:

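parser = RAGFlowExcelParser()
# Read the spreadsheet into memory; "data.xlsx" is a placeholder path.
with open("data.xlsx", "rb") as f:
    excel_bytes = f.read()
lines = parser(excel_bytes)
for line in lines:
    print(line)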
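
The row-to-text conversion can be sketched as follows. The exact field layout produced by __call__ is not specified here, so this example makes assumptions: rows_to_lines is a hypothetical helper, fields are joined with "; ", and empty cells and fully empty rows are skipped.

```python
def rows_to_lines(rows, sep="; "):
    """Hypothetical helper: one separator-joined string per non-empty row."""
    lines = []
    for row in rows:
        # Keep only cells that contain visible text after trimming.
        fields = [str(c).strip() for c in row if c is not None and str(c).strip()]
        if fields:  # skip rows that are entirely empty
            lines.append(sep.join(fields))
    return lines
```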

row_number(fnm: str, binary: bytes) -> int

Purpose:
Calculate the total number of rows in the given spreadsheet or CSV file.

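Parameters:
fnm: The file name, as a string.
binary: The file content, as bytes.

Returns:
The total number of rows, as an int.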

Implementation Details:
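
For the CSV case, row counting can be illustrated with the standard library alone. This is a sketch under stated assumptions rather than the method's actual logic: count_csv_rows is a hypothetical helper, the encoding is passed in explicitly (the class diagram lists rag.nlp.find_codec as a dependency, but its behavior is not described in this document), and the header line is counted like any other row.

```python
import csv
import io


def count_csv_rows(binary: bytes, encoding: str = "utf-8") -> int:
    """Hypothetical helper: count CSV records in raw bytes."""
    # Assumption: the caller knows the encoding; undecodable bytes are replaced.
    text = binary.decode(encoding, errors="replace")
    # newline="" lets the csv module handle quoted line breaks itself.
    return sum(1 for _ in csv.reader(io.StringIO(text, newline="")))
```

Counting parsed records instead of newline characters keeps quoted multi-line cells from inflating the total.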


Important Implementation Details


Interaction with Other System Components


Visual Diagram

classDiagram
    class RAGFlowExcelParser {
        +_load_excel_to_workbook(file_like_object) Workbook$
        +_clean_dataframe(df: DataFrame) DataFrame$
        +_dataframe_to_workbook(df: DataFrame) Workbook$
        +html(fnm, chunk_rows=256) list~str~
        +markdown(fnm) str
        +__call__(fnm) list~str~
        +row_number(fnm: str, binary: bytes) int
    }

    RAGFlowExcelParser ..> pandas.DataFrame : uses
    RAGFlowExcelParser ..> openpyxl.Workbook : uses
    RAGFlowExcelParser ..> rag.nlp.find_codec : uses

Summary

excel_parser.py defines a parser class for Excel and CSV files that handles multiple formats, cleans the data, and produces HTML, Markdown, and plain-text output. It combines several libraries with fallback mechanisms to stay compatible with diverse spreadsheet inputs, and serves as a core utility in a system that ingests and displays tabular data from user uploads or external sources.