html_parser.py
Overview
html_parser.py is a utility module designed to parse HTML content with the specific goal of extracting and chunking text and tables in a structured manner. It is part of the InfiniFlow project and facilitates the conversion of HTML documents into manageable text blocks optimized for token-based processing, presumably for downstream NLP tasks such as retrieval-augmented generation (RAG).
The core functionality includes:
Detecting file encoding for robust reading of HTML files.
Cleaning HTML content by removing <style> and <script> elements, inline style attributes, and comments.
Recursively extracting textual content and tables from HTML elements.
Merging and chunking extracted content into token-length bounded pieces.
Splitting large HTML tables into smaller chunks fitting token limits.
The parser is built around the main class RAGFlowHtmlParser, which provides a callable interface and several class methods that handle the individual parsing steps.
Classes and Methods
class RAGFlowHtmlParser
This is the principal class that encapsulates all parsing logic for HTML content. It can be used as a callable or by invoking its class methods individually.
__call__(fnm, binary=None, chunk_token_num=None) -> list[str]
Parses an HTML file or binary HTML content and returns a list of text chunks and table content strings, each chunk approximately bounded by chunk_token_num tokens.
Parameters:
fnm (str): File path to the HTML file. Required if binary is not provided.
binary (bytes, optional): Raw binary HTML data. If provided, file reading is skipped.
chunk_token_num (int, optional): Maximum number of tokens per chunk. Defaults to None (no chunking limit).
Returns:
List[str]: List of text chunks and table contents extracted from the HTML.
Usage Example:
parser = RAGFlowHtmlParser()
chunks = parser("example.html", chunk_token_num=512)
Implementation Notes:
If binary is provided, the encoding is detected via find_codec and the bytes are decoded accordingly. Otherwise, the file is opened and read with the encoding auto-detected by get_encoding.
@classmethod parser_txt(cls, txt, chunk_token_num) -> list[str]
Parses raw HTML text string and extracts structured content chunks.
Parameters:
txt (str): HTML content as a string.
chunk_token_num (int): Token limit for chunking.
Returns:
List[str]: List of text chunks and tables extracted from the HTML.
Raises:
TypeError if txt is not a string.
Description:
This method removes unwanted tags and attributes (<style> and <script> elements, inline style attributes, comments), then calls read_text_recursively to extract content recursively. It merges block-level text with merge_block_text and finally chunks the content with chunk_block. Tables are appended as-is.
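The cleaning step described above can be illustrated with a minimal sketch. It deliberately uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra dependencies; the names HtmlCleaner and clean_html are hypothetical and do not appear in the module. Declarations such as <!DOCTYPE> are also dropped here, which may differ from the real implementation.

```python
import html
from html.parser import HTMLParser

SKIPPED_TAGS = {"style", "script"}  # elements removed together with their content


class HtmlCleaner(HTMLParser):
    """Re-emits markup without <style>/<script>, style attributes, or comments."""

    def __init__(self):
        # Keep character references as written so they can be re-emitted verbatim.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.skipping = 0  # > 0 while inside a skipped element

    def _attrs(self, attrs):
        # Rebuild the attribute string, dropping any inline style attribute.
        return "".join(
            f" {name}" if value is None else f' {name}="{html.escape(value)}"'
            for name, value in attrs
            if name != "style"
        )

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skipping += 1
        elif not self.skipping:
            self.parts.append(f"<{tag}{self._attrs(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        if tag not in SKIPPED_TAGS and not self.skipping:
            self.parts.append(f"<{tag}{self._attrs(attrs)}/>")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skipping = max(0, self.skipping - 1)
        elif not self.skipping:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skipping:
            self.parts.append(data)

    def handle_entityref(self, name):
        if not self.skipping:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skipping:
            self.parts.append(f"&#{name};")

    # handle_comment is not overridden, so comments are silently discarded.


def clean_html(markup: str) -> str:
    cleaner = HtmlCleaner()
    cleaner.feed(markup)
    cleaner.close()
    return "".join(cleaner.parts)
```

Calling clean_html on a document returns the same markup minus the noise, which can then be handed to whatever extraction step follows.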
@classmethod split_table(cls, html_table, chunk_token_num=512) -> list[str]
Splits a large HTML <table> element into smaller tables, each fitting within the token limit.
Parameters:
html_table (str): HTML string representing a <table>.
chunk_token_num (int): Maximum tokens per chunk (default 512).
Returns:
List[str]: List of HTML strings representing smaller table chunks.
Usage Example:
small_tables = RAGFlowHtmlParser.split_table(large_table_html, 512)
Implementation Details:
Iterates over <tr> rows, accumulating rows until the token limit is reached, then starts a new chunk table.
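As a rough illustration of the row-accumulation idea, the sketch below splits a table by rows under a token budget. It is not the module's code: count_tokens is a whitespace-based stand-in for rag_tokenizer, split_table_rows is a hypothetical name, and rows are located with a regular expression, which assumes the table contains no nested tables.

```python
import re


def count_tokens(text: str) -> int:
    # Stand-in tokenizer (assumption): whitespace-separated words after stripping tags.
    return len(re.sub(r"<[^>]+>", " ", text).split())


def split_table_rows(html_table: str, chunk_token_num: int = 512) -> list[str]:
    # Collect each <tr>...</tr> row; nested tables are not supported by this pattern.
    rows = re.findall(r"<tr\b.*?</tr>", html_table, flags=re.S | re.I)
    tables, current, used = [], [], 0
    for row in rows:
        n = count_tokens(row)
        # Start a new table when adding this row would exceed the budget.
        if current and used + n > chunk_token_num:
            tables.append("<table>" + "".join(current) + "</table>")
            current, used = [], 0
        current.append(row)
        used += n
    if current:
        tables.append("<table>" + "".join(current) + "</table>")
    return tables
```

A single row that is larger than the budget still becomes its own table in this sketch, since rows are never split internally.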
@classmethod read_text_recursively(cls, element, parser_result, chunk_token_num=512, parent_name=None, block_id=None) -> list[dict]
Recursively traverses HTML elements, extracting text and tables into structured dictionaries.
Parameters:
element (bs4.element.Tag or NavigableString): Current HTML element or text node.
parser_result (list): Accumulates resulting parsed content dicts.
chunk_token_num (int): Token limit (used in some recursive calls, default 512).
parent_name (str, optional): Tag name of parent element.
block_id (str, optional): UUID string identifying a block of text.
Returns:
List[dict]: List of content dictionaries with keys:
"content" (str): Extracted text or HTML snippet.
"tag_name" (str): Tag name or "inner_text".
"metadata" (dict): Additional info, e.g. block or table IDs.
Description:
For text nodes (NavigableString), strips and returns content if valid.
For <table> tags, returns the entire table as a single content dict.
For block-level tags (from BLOCK_TAGS), assigns a unique block_id to group text.
Recurses into child elements, aggregating content dicts.
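The traversal rules above can be sketched on a small, well-formed snippet. This version swaps BeautifulSoup for the standard library's xml.etree.ElementTree, so it only handles well-formed markup; the function name walk, the contents of BLOCK_TAGS, and the "table_id" metadata key are assumptions made for illustration.

```python
import uuid
import xml.etree.ElementTree as ET

# Illustrative subset (assumption); the module defines its own BLOCK_TAGS list.
BLOCK_TAGS = {"div", "p", "li", "section", "h1", "h2", "h3", "h4", "h5", "h6"}


def walk(element, results, block_id=None):
    """Append one dict per text fragment or table found under element."""
    if element.tag == "table":
        # Serialize the whole table as one unit, without its trailing text.
        tail, element.tail = element.tail, None
        markup = ET.tostring(element, encoding="unicode", method="html")
        element.tail = tail
        results.append({"content": markup, "tag_name": "table",
                        "metadata": {"table_id": str(uuid.uuid4())}})
        return results
    if element.tag in BLOCK_TAGS:
        block_id = str(uuid.uuid4())  # new group for text inside this block
    text = (element.text or "").strip()
    if text:
        results.append({"content": text, "tag_name": element.tag,
                        "metadata": {"block_id": block_id}})
    for child in element:
        walk(child, results, block_id)
        # Text that follows a child still belongs to the current element.
        tail = (child.tail or "").strip()
        if tail:
            results.append({"content": tail, "tag_name": element.tag,
                            "metadata": {"block_id": block_id}})
    return results
```

Inline elements such as <b> inherit the block_id of the enclosing block, which is what later allows their text to be merged back into one string.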
@classmethod merge_block_text(cls, parser_result) -> tuple[list[str], list[dict]]
Merges text fragments belonging to the same block into single strings and separates tables.
Parameters:
parser_result (list): List of content dicts generated by read_text_recursively.
Returns:
Tuple of two elements:
List[str]: Merged block text strings.
List[dict]: List of table content dicts.
Implementation Details:
Groups content by block_id. Titles (h1-h6) are prefixed with markdown-style heading markers. Tables are extracted separately.
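A minimal sketch of this grouping step follows. It assumes the dict layout described for read_text_recursively, merges consecutive fragments that share a block_id, and joins them with a single space; the function name merge_blocks, the join character, and the exact TITLE_TAGS values are assumptions rather than details confirmed by the module.

```python
# Assumed mapping; the document only states that h1-h6 map to markdown-style prefixes.
TITLE_TAGS = {"h1": "#", "h2": "##", "h3": "###",
              "h4": "####", "h5": "#####", "h6": "######"}


def merge_blocks(parsed):
    """Return (merged block texts, table dicts) from a list of content dicts."""
    texts, tables = [], []
    current_id, current_parts = None, []
    for item in parsed:
        if item["tag_name"] == "table":
            tables.append(item)  # tables bypass text merging entirely
            continue
        content = item["content"]
        prefix = TITLE_TAGS.get(item["tag_name"])
        if prefix:
            content = f"{prefix} {content}"
        block_id = item["metadata"].get("block_id")
        # A change of block_id closes the block collected so far.
        if current_parts and block_id != current_id:
            texts.append(" ".join(current_parts))
            current_parts = []
        current_id = block_id
        current_parts.append(content)
    if current_parts:
        texts.append(" ".join(current_parts))
    return texts, tables
```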
@classmethod chunk_block(cls, block_txt_list, chunk_token_num=512) -> list[str]
Chunks large text blocks into smaller pieces constrained by token count.
Parameters:
block_txt_list (List[str]): List of merged text blocks.
chunk_token_num (int): Token limit per chunk.
Returns:
List[str]: List of chunked text strings.
Description:
For blocks exceeding the token limit, splits the tokens evenly into multiple chunks. Smaller blocks are concatenated until the token limit is reached, then appended as a chunk.
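The two cases described above (oversized blocks split evenly, small blocks packed together) can be sketched as follows. Tokens are approximated by whitespace-separated words instead of rag_tokenizer, small blocks are joined with a newline, and chunk_blocks is a hypothetical name; all three are assumptions made to keep the example self-contained.

```python
import math


def chunk_blocks(blocks, chunk_token_num=512):
    """Pack small blocks together and split oversized ones evenly."""
    chunks, buffer, buffered = [], [], 0
    for block in blocks:
        tokens = block.split()  # stand-in tokenizer (assumption)
        if len(tokens) > chunk_token_num:
            # Flush pending small blocks first so the original order is preserved.
            if buffer:
                chunks.append("\n".join(buffer))
                buffer, buffered = [], 0
            # Split into the fewest parts that fit, with near-equal sizes.
            parts = math.ceil(len(tokens) / chunk_token_num)
            size = math.ceil(len(tokens) / parts)
            for start in range(0, len(tokens), size):
                chunks.append(" ".join(tokens[start:start + size]))
            continue
        if buffer and buffered + len(tokens) > chunk_token_num:
            chunks.append("\n".join(buffer))
            buffer, buffered = [], 0
        buffer.append(block)
        buffered += len(tokens)
    if buffer:
        chunks.append("\n".join(buffer))
    return chunks
```

Splitting evenly rather than greedily avoids a very short trailing chunk when a block is only slightly over the limit.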
Utility Function
get_encoding(file) -> str
Detects and returns the character encoding of a file by reading its bytes.
Parameters:
file (str): Path to the file.
Returns:
str: Detected encoding name (e.g., "utf-8").
Usage Example:
encoding = get_encoding("example.html")
Implementation Details:
Uses the chardet library to detect the encoding from raw bytes.
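A hedged sketch of this kind of detection is shown below. The function name guess_encoding and the utf-8 fallback are assumptions added for robustness, not behavior documented for get_encoding; chardet is imported optionally so the example still runs where the library is not installed.

```python
try:
    import chardet  # third-party; treated as optional in this sketch
except ImportError:
    chardet = None


def guess_encoding(path, fallback="utf-8"):
    """Return a best-guess encoding name for the file at path."""
    with open(path, "rb") as f:
        raw = f.read()
    if chardet is not None:
        # chardet.detect returns a dict; "encoding" may be None for empty input.
        detected = chardet.detect(raw)["encoding"]
        if detected:
            return detected
    return fallback
```

The returned name can be passed straight to open(path, encoding=...) when reading the HTML text.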
Constants
BLOCK_TAGS: List of HTML tags treated as block-level elements for grouping text.
TITLE_TAGS: Mapping of header tags (h1-h6) to markdown-style heading prefixes.
Implementation Details and Algorithms
Encoding detection:
The file encoding is detected dynamically to correctly read diverse HTML files.
HTML cleaning:
Removes non-content elements such as <style>, <script>, inline styles, and comments to reduce noise.
Recursive parsing:
Uses BeautifulSoup to traverse the DOM tree. Text nodes are collected and grouped by parent block tags. Tables are handled distinctly as entire HTML chunks.
Block grouping:
Text fragments within the same block tag share a UUID block_id. This allows merging related texts while preserving structural context.
Token-based chunking:
Content is tokenized using rag_tokenizer. Large blocks or tables are split into chunks not exceeding the token limit to optimize processing in downstream NLP pipelines.
Integration and Interaction with Other Components
Dependencies:
rag.nlp.find_codec and rag.nlp.rag_tokenizer: External NLP utilities for encoding detection and tokenization.
beautifulsoup4: For HTML parsing and traversal.
chardet: For encoding detection on raw files.
html: For unescaping HTML entities.
Role in System:
This parser module provides a clean, token-chunked textual representation of HTML documents, which can be used for information retrieval, embedding generation, or language model inputs within the InfiniFlow RAG pipeline.
Data Flow:
Input HTML file or binary → Encoding detection → Read HTML text.
Clean HTML → Recursively extract text and tables.
Merge and chunk content based on token limits.
Output list of textual chunks and table HTML strings for downstream processing.
Usage Example
from html_parser import RAGFlowHtmlParser
parser = RAGFlowHtmlParser()
chunks = parser("document.html", chunk_token_num=512)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n{'-'*20}")
Mermaid Class Diagram
classDiagram
    class RAGFlowHtmlParser {
        +__call__(fnm, binary=None, chunk_token_num=None) list~str~
        +parser_txt(txt, chunk_token_num) list~str~
        +split_table(html_table, chunk_token_num=512) list~str~
        +read_text_recursively(element, parser_result, chunk_token_num=512, parent_name=None, block_id=None) list~dict~
        +merge_block_text(parser_result) tuple~list~str~, list~dict~~
        +chunk_block(block_txt_list, chunk_token_num=512) list~str~
    }
Summary
html_parser.py is a specialized HTML parsing utility focused on extracting and chunking both text and tables from HTML documents into token-bounded segments, suitable for NLP workflows. It handles encoding detection, content cleaning, recursive extraction, and token-aware chunking, making it a crucial preprocessing step in the InfiniFlow RAG pipeline.