excel_parser.py
Overview
excel_parser.py provides a utility class RAGFlowExcelParser designed for parsing and processing Excel and CSV files. Its primary purpose is to load spreadsheet data from various file formats, clean and convert it into consistent in-memory Excel workbooks, and offer multiple output formats including HTML tables, Markdown, and plain text representations.
The parser supports both .xlsx and legacy .xls Excel files, as well as CSV files, automatically detecting the file type. It leverages multiple libraries to maximize compatibility and robustness: openpyxl for Excel loading, pandas for flexible reading and conversion, and fallback engines like calamine for edge cases. It also includes cleaning of illegal characters that might cause issues during processing.
This file is useful for applications that need to ingest spreadsheet data in diverse formats and produce consistent, clean, and display-friendly outputs without depending on the original file format.
Classes and Methods
Class: RAGFlowExcelParser
A parser class to load, clean, and transform Excel and CSV files into various output formats.
Static Methods
_load_excel_to_workbook(file_like_object) -> openpyxl.Workbook
Purpose:
Load spreadsheet data from a file-like object or bytes, returning an openpyxl.Workbook instance.
Parameters:
file_like_object(bytesor file-like object): Raw input data or file stream containing Excel or CSV content.
Returns:
openpyxl.Workbook: A workbook object containing the parsed spreadsheet data.
Details:
Determines file type by reading the first 4 bytes (magic numbers for XLSX or XLS files).
If not Excel, tries to parse as CSV using
pandas.read_csv.If Excel, tries openpyxl.load_workbook with
data_only=True.On failure, tries
pandas.read_excelwith fallback to thecalamineengine.Converts
pandas.DataFrameobjects back intoopenpyxl.Workbookvia_dataframe_to_workbook.Raises exceptions on total failure.
Usage Example:
with open("data.xlsx", "rb") as f:
wb = RAGFlowExcelParser._load_excel_to_workbook(f)
_clean_dataframe(df: pd.DataFrame) -> pd.DataFrame
Purpose:
Remove illegal XML characters from all string cells in a DataFrame.
Parameters:
df(pandas.DataFrame): Input data frame to clean.
Returns:
pandas.DataFrame: Cleaned data frame with illegal characters replaced by spaces.
Implementation Details:
Uses a regex
ILLEGAL_CHARACTERS_REto match control characters disallowed in Excel XML.Applies a cleaning function to each cell in the DataFrame.
_dataframe_to_workbook(df: pd.DataFrame) -> openpyxl.Workbook
Purpose:
Convert a cleaned pandas DataFrame into an openpyxl.Workbook.
Parameters:
df(pandas.DataFrame): Cleaned DataFrame to convert.
Returns:
openpyxl.Workbook: Workbook containing the data in a sheet named "Data".
Implementation Details:
Writes column headers in the first row.
Writes values row-wise starting from the second row.
Instance Methods
html(self, fnm, chunk_rows=256) -> list[str]
Purpose:
Render the spreadsheet data as a list of HTML table chunks, suitable for display in web pages.
Parameters:
fnm(bytes,str, or file-like): The file content or path.chunk_rows(int, optional): Number of rows per HTML table chunk (default 256).
Returns:
list[str]: List of HTML strings, each representing a table chunk for a worksheet.
Implementation Details:
Uses
_load_excel_to_workbookto get workbook.Escapes HTML special characters in cell values.
Splits rows into chunks to avoid very large tables.
Each table includes a caption with the sheet name.
Empty cells rendered as empty
<td>cells.
Usage Example:
parser = RAGFlowExcelParser()
html_tables = parser.html(excel_bytes)
for table_html in html_tables:
print(table_html)
markdown(self, fnm) -> str
Purpose:
Convert the spreadsheet or CSV file content into a Markdown-formatted table.
Parameters:
fnm(bytes,str, or file-like): The file content or path.
Returns:
str: A Markdown string representation of the first sheet or CSV.
Implementation Details:
Tries
pandas.read_excel, falls back topandas.read_csvon failure.Replaces empty or whitespace-only cells with empty strings.
Uses
pandas.DataFrame.to_markdownfor output.
__call__(self, fnm) -> list[str]
Purpose:
Parse the spreadsheet and return a list of semicolon-separated field strings representing rows.
Parameters:
fnm(bytes,str, or file-like): File content or file path.
Returns:
list[str]: List of strings in the format"Header: Value; Header2: Value2 ——SheetName".
Implementation Details:
Loads workbook with
_load_excel_to_workbook.Iterates sheets and rows, skipping empty cells.
Combines header and cell value with a colon separator, joins fields by semicolons.
Adds sheet name as suffix if sheet name does not contain "sheet".
Usage Example:
parser = RAGFlowExcelParser()
lines = parser(excel_bytes)
for line in lines:
print(line)
row_number(fnm: str, binary: bytes) -> int
Purpose:
Calculate the total number of rows in the given spreadsheet or CSV file.
Parameters:
fnm(str): Filename or file identifier to infer file type.binary(bytes): Raw file content.
Returns:
int: Total number of rows across all sheets or lines in the CSV.
Implementation Details:
For Excel files (
.xls,.xlsx), loads workbook and sums rows from all sheets.For CSV or TXT files, detects encoding using
find_codecand counts newline characters.
Important Implementation Details
File Type Detection: The parser reads the first 4 bytes of the input to detect whether it is an Excel file by matching magic bytes for
.xlsx(PK\x03\x04) or.xls(\xd0\xcf\x11\xe0). If neither matches, it treats the input as CSV.Multiple Loading Strategies: To maximize compatibility, it uses
openpyxlfirst, thenpandas.read_excelwith the default engine, and finally with thecalamineengine as fallback.Cleaning Illegal Characters: Excel's XML format disallows certain control characters. The parser cleans these using a regex before converting data into workbooks.
Chunked HTML Output: For large sheets, the HTML output divides rows into manageable chunks to avoid overwhelming browsers or clients.
Flexible Input: Methods accept raw bytes or file-like objects, allowing integration with in-memory data or files.
Integration with
rag.nlp.find_codec: Therow_numbermethod uses an external function to detect text encoding for CSV files, indicating integration with a larger NLP or toolkit ecosystem.
Interaction with Other System Components
Imports
find_codecfromrag.nlp, suggesting this parser is part of a larger system that includes natural language processing or data ingestion utilities.Uses standard libraries (
openpyxl,pandas,logging) for file handling and data manipulation.The class is designed to be instantiated and called directly or used for generating HTML and Markdown reports, making it suitable as a backend utility in web applications, CLI tools, or data pipelines.
Visual Diagram
classDiagram
class RAGFlowExcelParser {
<<static>>
+_load_excel_to_workbook(file_like_object) Workbook
+_clean_dataframe(df: DataFrame) DataFrame
+_dataframe_to_workbook(df: DataFrame) Workbook
--
+html(fnm, chunk_rows=256) list~str~
+markdown(fnm) str
+__call__(fnm) list~str~
+row_number(fnm: str, binary: bytes) int
}
RAGFlowExcelParser ..> pandas.DataFrame : uses
RAGFlowExcelParser ..> openpyxl.Workbook : uses
RAGFlowExcelParser ..> rag.nlp.find_codec : uses
Summary
excel_parser.py defines a robust and flexible parser class for Excel and CSV files, capable of handling multiple formats, cleaning data, and producing various output representations (HTML, Markdown, text). It integrates multiple libraries and fallback mechanisms, ensuring high compatibility with diverse spreadsheet inputs. This file serves as a core utility in a system that requires ingestion and display of tabular data from user uploads or external sources.