html_parser.py

Overview

html_parser.py is a utility module that parses HTML content and extracts text and tables as structured chunks. It is part of the InfiniFlow project and converts HTML documents into text blocks sized for token-based processing in downstream NLP tasks such as retrieval-augmented generation (RAG).

The core functionality includes:

Detecting file encoding for robust reading of HTML files.
Cleaning HTML content by removing <style> and <script> elements, inline style attributes, and comments.
Recursively extracting textual content and tables from HTML elements.
Merging and chunking extracted content into token-length bounded pieces.
Splitting large HTML tables into smaller chunks fitting token limits.

The parser is built around the main class RAGFlowHtmlParser, which provides a callable interface and several class methods that each handle one stage of the parsing workflow.

Classes and Methods

`class RAGFlowHtmlParser`

This is the principal class that encapsulates all parsing logic for HTML content. It can be used as a callable or by invoking its class methods individually.

`__call__(fnm, binary=None, chunk_token_num=None) -> list[str]`

Parses an HTML file or binary HTML content and returns a list of text chunks and table content strings, each chunk approximately bounded by chunk_token_num tokens.

Parameters:
- fnm (str): File path to the HTML file. Required if binary is not provided.
- binary (bytes, optional): Raw binary HTML data. If provided, file reading is skipped.
- chunk_token_num (int, optional): Maximum number of tokens per chunk. Defaults to None (no chunking limit).
Returns:
List[str] — List of text chunks and table contents extracted from the HTML.

Usage Example:

parser = RAGFlowHtmlParser()
chunks = parser("example.html", chunk_token_num=512)

Implementation Notes:
If binary is provided, the encoding is detected via find_codec and decoded accordingly. Otherwise, the file is opened and read with auto-detected encoding via get_encoding.
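
The decode step for in-memory content can be sketched as follows. This is a minimal illustration and not the behavior of find_codec: the function name decode_html and the candidate codec list are assumptions, and the sketch simply tries each codec in order until one decodes without error.

```python
def decode_html(binary, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (text, codec) for the first candidate codec that decodes cleanly.

    Hypothetical stand-in for a real detector: the candidate list is an
    assumption chosen for illustration only.
    """
    for codec in candidates:
        try:
            return binary.decode(codec), codec
        except UnicodeDecodeError:
            continue
    # Lossy last resort if the caller supplied a list with no catch-all codec.
    return binary.decode("utf-8", errors="ignore"), "utf-8"
```

A statistical detector such as chardet usually gives better results than a fixed fallback order, because several single-byte codecs will accept almost any byte sequence.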

`@classmethod parser_txt(cls, txt, chunk_token_num) -> list[str]`

Parses raw HTML text string and extracts structured content chunks.

Parameters:
- txt (str): HTML content as a string.
- chunk_token_num (int): Token limit for chunking.
Returns:
List[str] — List of text chunks and tables extracted from the HTML.
Raises:
TypeError if txt is not a string.
Description:
This method cleans unwanted tags and attributes (style, script, inline styles, comments), then calls read_text_recursively to extract content recursively. It merges block-level text with merge_block_text and finally chunks the content with chunk_block. Tables are appended as-is.
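
The cleaning step described above can be approximated with the standard library alone. The sketch below swaps in html.parser for BeautifulSoup so that it runs without third-party packages; the names HtmlCleaner and clean_html are hypothetical, and it is not a full serializer (doctype declarations and self-closing syntax are not preserved).

```python
import html
from html.parser import HTMLParser


class HtmlCleaner(HTMLParser):
    """Re-emits markup while dropping script/style elements, style attributes, and comments."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a <script> or <style> element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1
        elif not self.skip_depth:
            kept = "".join(
                f" {name}" if value is None else f' {name}="{html.escape(value)}"'
                for name, value in attrs
                if name != "style"  # drop inline styles
            )
            self.parts.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # The parser already decoded entities, so re-escape the text.
            self.parts.append(html.escape(data, quote=False))

    # handle_comment is inherited as a no-op, so comments are dropped.


def clean_html(raw):
    cleaner = HtmlCleaner()
    cleaner.feed(raw)
    cleaner.close()  # flush any buffered text
    return "".join(cleaner.parts)
```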

`@classmethod split_table(cls, html_table, chunk_token_num=512) -> list[str]`

Splits a large HTML <table> element into smaller tables, each fitting within the token limit.

Parameters:
- html_table (str): HTML string representing a <table>.
- chunk_token_num (int): Maximum tokens per chunk (default 512).
Returns:
List[str] — List of HTML strings representing smaller table chunks.

Usage Example:

small_tables = RAGFlowHtmlParser.split_table(large_table_html, 512)

Implementation Details:
Iterates over <tr> rows, accumulating rows until token limit is reached, then starts a new chunk table.
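
The row-accumulation idea can be sketched as below. This is an illustration under stated assumptions, not the module's code: count_tokens is a hypothetical whitespace-based stand-in for rag_tokenizer, rows are located with a regular expression instead of BeautifulSoup, and nested tables are not handled.

```python
import re


def count_tokens(text):
    """Hypothetical tokenizer stand-in: counts whitespace-separated words outside tags."""
    return len(re.sub(r"<[^>]+>", " ", text).split())


def split_table_rows(html_table, chunk_token_num=512):
    """Group <tr> rows into smaller tables whose token counts stay within the limit."""
    rows = re.findall(r"<tr\b.*?</tr>", html_table, flags=re.DOTALL | re.IGNORECASE)
    chunks, current, current_tokens = [], [], 0
    for row in rows:
        row_tokens = count_tokens(row)
        # Start a new table when adding this row would exceed the limit.
        if current and current_tokens + row_tokens > chunk_token_num:
            chunks.append("<table>" + "".join(current) + "</table>")
            current, current_tokens = [], 0
        current.append(row)
        current_tokens += row_tokens
    if current:
        chunks.append("<table>" + "".join(current) + "</table>")
    return chunks
```

A single row that is larger than the limit still becomes its own chunk in this sketch, since rows are never split.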

`@classmethod read_text_recursively(cls, element, parser_result, chunk_token_num=512, parent_name=None, block_id=None) -> list[dict]`

Recursively traverses HTML elements, extracting text and tables into structured dictionaries.

Parameters:
- element (bs4.element.Tag or NavigableString): Current HTML element or text node.
- parser_result (list): Accumulates resulting parsed content dicts.
- chunk_token_num (int): Token limit (used in some recursive calls, default 512).
- parent_name (str, optional): Tag name of parent element.
- block_id (str, optional): UUID string identifying a block of text.
Returns:
List[dict] — List of content dictionaries with keys:
- "content" (str): Extracted text or HTML snippet.
- "tag_name" (str): Tag name or "inner_text".
- "metadata" (dict): Additional info, e.g. block or table IDs.
Description:
- For text nodes (NavigableString), strips and returns content if valid.
- For <table> tags, returns the entire table as a single content dict.
- For block-level tags (from BLOCK_TAGS), assigns a unique block_id to group text.
- Recurses into children elements, aggregating content dicts.
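
The traversal rules above can be illustrated on a toy tree. The Node class, the render helper, and the reduced BLOCK_TAGS set are all assumptions introduced for this sketch: plain strings play the role of NavigableString, Node plays the role of a bs4 Tag, and the metadata keys are illustrative only.

```python
import uuid
from dataclasses import dataclass, field

BLOCK_TAGS = {"div", "p", "li", "h1", "h2"}  # illustrative subset, not the module's list


@dataclass
class Node:
    """Hypothetical stand-in for a bs4 Tag; children are Node objects or plain strings."""
    tag: str
    children: list = field(default_factory=list)


def render(node):
    """Serialize a toy node back to markup (attributes are not modeled)."""
    if isinstance(node, str):
        return node
    return f"<{node.tag}>" + "".join(render(c) for c in node.children) + f"</{node.tag}>"


def read_text(node, results, parent_name=None, block_id=None):
    if isinstance(node, str):  # text node
        text = node.strip()
        if text:
            results.append({"content": text, "tag_name": "inner_text",
                            "metadata": {"block_id": block_id, "parent": parent_name}})
        return results
    if node.tag == "table":  # keep the whole table as one item
        results.append({"content": render(node), "tag_name": "table",
                        "metadata": {"table_id": str(uuid.uuid4())}})
        return results
    if node.tag in BLOCK_TAGS:  # a block tag opens a new text group
        block_id = str(uuid.uuid4())
    for child in node.children:
        read_text(child, results, node.tag, block_id)
    return results
```

Inline tags such as b inherit the enclosing block_id, which is what later allows their text to be merged with the surrounding block.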

`@classmethod merge_block_text(cls, parser_result) -> tuple[list[str], list[dict]]`

Merges text fragments belonging to the same block into single strings and separates tables.

Parameters:
- parser_result (list): List of content dicts generated by read_text_recursively.
Returns:
Tuple of two elements:
- List[str]: Merged block text strings.
- List[dict]: List of table content dicts.
Implementation Details:
Groups content by block_id. Titles (h1-h6) are prefixed with markdown-style heading markers. Tables are extracted separately.
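
A minimal sketch of this grouping step follows. It assumes that heading fragments carry their tag name (for example "h1") in tag_name and that TITLE_TAGS maps tags to runs of "#"; both are assumptions made for illustration, as is the function name merge_blocks.

```python
TITLE_TAGS = {"h1": "#", "h2": "##", "h3": "###",
              "h4": "####", "h5": "#####", "h6": "######"}  # assumed mapping


def merge_blocks(parser_result):
    """Join fragments that share a block_id and collect tables separately."""
    blocks = {}  # block_id -> list of text parts; dicts keep insertion order
    tables = []
    for item in parser_result:
        if item["tag_name"] == "table":
            tables.append(item)
            continue
        parts = blocks.setdefault(item["metadata"].get("block_id"), [])
        # Prefix a heading marker once, when the block's first fragment arrives.
        if not parts and item["tag_name"] in TITLE_TAGS:
            parts.append(TITLE_TAGS[item["tag_name"]])
        parts.append(item["content"])
    return [" ".join(parts) for parts in blocks.values()], tables
```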

`@classmethod chunk_block(cls, block_txt_list, chunk_token_num=512) -> list[str]`

Chunks large text blocks into smaller pieces constrained by token count.

Parameters:
- block_txt_list (List[str]): List of merged text blocks.
- chunk_token_num (int): Token limit per chunk.
Returns:
List[str] — List of chunked text strings.
Description:
For blocks exceeding token limits, splits tokens evenly into multiple chunks. Smaller blocks are concatenated until the token limit is reached, then appended as a chunk.
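
The two cases described above (splitting oversized blocks evenly and packing small blocks together) can be sketched as follows. Whitespace splitting stands in for rag_tokenizer here, and the function name chunk_blocks and the newline used to join packed blocks are assumptions.

```python
import math


def chunk_blocks(blocks, chunk_token_num=512):
    """Pack small blocks together and split oversized blocks into near-equal parts."""
    chunks, buffer, buffer_tokens = [], [], 0

    def flush():
        nonlocal buffer, buffer_tokens
        if buffer:
            chunks.append("\n".join(buffer))
            buffer, buffer_tokens = [], 0

    for block in blocks:
        tokens = block.split()  # hypothetical tokenizer: whitespace words
        if len(tokens) > chunk_token_num:
            flush()  # keep document order before emitting the split block
            n_parts = math.ceil(len(tokens) / chunk_token_num)
            size = math.ceil(len(tokens) / n_parts)  # never exceeds the limit
            for start in range(0, len(tokens), size):
                chunks.append(" ".join(tokens[start:start + size]))
            continue
        if buffer_tokens + len(tokens) > chunk_token_num:
            flush()
        buffer.append(block)
        buffer_tokens += len(tokens)
    flush()
    return chunks
```

Splitting into near-equal parts avoids a very short trailing chunk, which a fixed-size slice of chunk_token_num tokens would often leave behind.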

Utility Function

`get_encoding(file) -> str`

Detects and returns the character encoding of a file by reading its bytes.

Parameters:
- file (str): Path to the file.
Returns:
str — Detected encoding name (e.g., "utf-8").

Usage Example:

encoding = get_encoding("example.html")

Implementation Details:
Uses chardet library to detect encoding from raw bytes.

Constants

BLOCK_TAGS: List of HTML tags considered as block-level elements for grouping text.
TITLE_TAGS: Mapping of header tags (h1-h6) to markdown-style heading prefixes.

Implementation Details and Algorithms

Encoding detection:
The file encoding is detected dynamically to correctly read diverse HTML files.
HTML cleaning:
Removes non-content elements such as <style>, <script>, inline styles, and comments to reduce noise.
Recursive parsing:
Uses BeautifulSoup to traverse the DOM tree. Text nodes are collected and grouped by parent block tags. Tables are handled distinctly as entire HTML chunks.
Block grouping:
Text fragments within the same block tag share a UUID block_id. This allows merging related texts while preserving structural context.
Token-based chunking:
Content is tokenized using rag_tokenizer. Large blocks or tables are split into chunks not exceeding the token limit to optimize processing in downstream NLP pipelines.

Integration and Interaction with Other Components

Dependencies:
- rag.nlp.find_codec & rag.nlp.rag_tokenizer: External NLP utilities for encoding detection and tokenization.
- beautifulsoup4: For HTML parsing and traversal.
- chardet: For encoding detection on raw files.
- html (Python standard library): For unescaping HTML entities.
Role in System:
This parser module provides a clean, token-chunked textual representation of HTML documents which can be used for information retrieval, embedding generation, or language model inputs within the InfiniFlow RAG pipeline.
Data Flow:
1. Input HTML file or binary → Encoding detection → Read HTML text.
2. Clean HTML → Recursively extract text and tables.
3. Merge and chunk content based on token limits.
4. Output list of textual chunks and table HTML strings for downstream processing.

Usage Example

from html_parser import RAGFlowHtmlParser

parser = RAGFlowHtmlParser()
chunks = parser("document.html", chunk_token_num=512)

for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n{'-'*20}")

Mermaid Class Diagram

classDiagram
    class RAGFlowHtmlParser {
        +__call__(fnm, binary=None, chunk_token_num=None) list~str~
        +parser_txt(txt, chunk_token_num) list~str~
        +split_table(html_table, chunk_token_num=512) list~str~
        +read_text_recursively(element, parser_result, chunk_token_num=512, parent_name=None, block_id=None) list~dict~
        +merge_block_text(parser_result) tuple~list~str~, list~dict~~
        +chunk_block(block_txt_list, chunk_token_num=512) list~str~
    }

Summary

html_parser.py is a specialized HTML parsing utility focused on extracting and chunking both text and tables from HTML documents into token-bounded segments, suitable for NLP workflows. It handles encoding detection, content cleaning, recursive extraction, and token-aware chunking, making it a crucial preprocessing step in the InfiniFlow RAG pipeline.
