html_parser.py


Overview

html_parser.py is a utility module that parses HTML content, extracting text and tables and splitting them into structured chunks. It is part of the InfiniFlow project and converts HTML documents into text blocks sized for token-based processing, presumably for downstream NLP tasks such as retrieval-augmented generation (RAG).

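The core functionality includes encoding detection, content cleaning, recursive extraction of text and tables, and token-aware chunking.

The parser is built around the main class RAGFlowHtmlParser, which provides a callable interface and several class methods that handle the individual parsing steps.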


Classes and Methods

class RAGFlowHtmlParser

This is the principal class that encapsulates all parsing logic for HTML content. It can be used as a callable or by invoking its class methods individually.


__call__(fnm, binary=None, chunk_token_num=None) -> list[str]

Parses an HTML file or binary HTML content and returns a list of text chunks and table content strings, each chunk approximately bounded by chunk_token_num tokens.


@classmethod parser_txt(cls, txt, chunk_token_num) -> list[str]

Parses a raw HTML text string and extracts structured content chunks.
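
The cleaning part of this step can be illustrated with a minimal sketch. It is not the module's implementation: it uses only the standard library's html.parser, and the names TextExtractor and extract_text are hypothetical. It assumes that cleaning means dropping script and style content and HTML comments while keeping visible text.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text, skipping <script>/<style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach this method; they go to handle_comment,
        # which is a no-op by default.
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)


def extract_text(html: str) -> list[str]:
    """Return the non-empty visible text fragments of an HTML string."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()  # flush any buffered trailing text
    return extractor.parts
```

Each returned fragment corresponds to one run of text between tags, so later steps still have to group fragments into blocks.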


@classmethod split_table(cls, html_table, chunk_token_num=512) -> list[str]

Splits a large HTML <table> element into smaller tables, each fitting within the token limit.
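
A row-based split of this kind might look like the sketch below. It rests on several assumptions that the description above does not confirm: the table is well-formed markup that xml.etree.ElementTree can parse, it contains no nested tables, and tokens are approximated by whitespace-separated words. The names count_tokens and split_table_rows are hypothetical.

```python
import xml.etree.ElementTree as ET


def count_tokens(text: str) -> int:
    """Assumed stand-in tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def split_table_rows(table_html: str, chunk_token_num: int = 512) -> list[str]:
    """Pack the rows of one table into several smaller <table> strings."""
    table = ET.fromstring(table_html)
    pieces: list[str] = []
    current: list[str] = []  # serialized rows of the table being built
    used = 0                 # tokens already placed in `current`

    for row in table.iter("tr"):
        n = count_tokens(" ".join(row.itertext()))
        # Start a new table when this row would exceed the budget.
        if current and used + n > chunk_token_num:
            pieces.append("<table>" + "".join(current) + "</table>")
            current, used = [], 0
        # strip() drops whitespace that follows the row in the source markup.
        current.append(ET.tostring(row, encoding="unicode").strip())
        used += n

    if current:
        pieces.append("<table>" + "".join(current) + "</table>")
    return pieces
```

A single row larger than the budget still becomes its own table in this sketch, so the limit is a target rather than a hard guarantee.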


@classmethod read_text_recursively(cls, element, parser_result, chunk_token_num=512, parent_name=None, block_id=None) -> list[dict]

Recursively traverses HTML elements, extracting text and tables into structured dictionaries.
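
The traversal can be sketched as follows, under stated assumptions: the input is well-formed markup parsed with xml.etree.ElementTree (real-world HTML usually needs a more lenient parser), block boundaries are given by an assumed BLOCK_TAGS set, and each result is a dictionary with the hypothetical keys block_id, text, and is_table. The module's actual dictionary layout is not documented here.

```python
import itertools
import xml.etree.ElementTree as ET

# Assumed set of tags that start a new text block.
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "section", "article"}


def walk(element, results, ids, block_id=None):
    """Depth-first walk that appends one dict per text fragment or table."""
    if element.tag == "table":
        # Serialize the table as a unit; hide the tail so it is not included.
        tail, element.tail = element.tail, None
        html = ET.tostring(element, encoding="unicode")
        element.tail = tail
        results.append({"block_id": f"b{next(ids)}", "text": html, "is_table": True})
        return results

    if element.tag in BLOCK_TAGS:
        block_id = f"b{next(ids)}"  # text below belongs to this new block

    if element.text and element.text.strip():
        results.append({"block_id": block_id, "text": element.text.strip(), "is_table": False})

    for child in element:
        walk(child, results, ids, block_id)
        # Text after a child's closing tag belongs to the current element.
        if child.tail and child.tail.strip():
            results.append({"block_id": block_id, "text": child.tail.strip(), "is_table": False})
    return results
```

Usage: `walk(ET.fromstring(markup), [], itertools.count())` returns the fragment list in document order, with inline elements such as `<b>` inheriting the block ID of their enclosing block.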


@classmethod merge_block_text(cls, parser_result) -> tuple[list[str], list[dict]]

Merges text fragments belonging to the same block into single strings and separates tables.
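
As an illustration, the merge can be written as a single pass that groups fragments by block ID. The dictionary keys block_id, text, and is_table and the function name merge_fragments are assumptions made for this sketch, not the module's documented schema.

```python
def merge_fragments(fragments: list[dict]) -> tuple[list[str], list[dict]]:
    """Join fragments that share a block ID; return (block texts, tables)."""
    merged: dict[str, list[str]] = {}  # dicts keep insertion order
    tables: list[dict] = []

    for frag in fragments:
        if frag.get("is_table"):
            tables.append(frag)  # tables are passed through unchanged
            continue
        text = frag["text"].strip()
        if text:
            merged.setdefault(frag["block_id"], []).append(text)

    block_texts = [" ".join(parts) for parts in merged.values()]
    return block_texts, tables
```

Because the grouping dictionary preserves insertion order, the merged blocks come out in the order in which each block first appeared.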


@classmethod chunk_block(cls, block_txt_list, chunk_token_num=512) -> list[str]

Chunks large text blocks into smaller pieces constrained by token count.
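
One common way to do this is greedy packing, sketched below. This is an assumption about the general technique rather than a description of the module's code: tokens are approximated by whitespace-separated words, and the names count_tokens and chunk_blocks are hypothetical.

```python
def count_tokens(text: str) -> int:
    """Assumed stand-in tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def chunk_blocks(blocks: list[str], chunk_token_num: int = 512) -> list[str]:
    """Greedily pack blocks into chunks of at most chunk_token_num tokens."""
    chunks: list[str] = []
    current: list[str] = []  # blocks in the chunk being built
    used = 0                 # tokens already in `current`

    def flush() -> None:
        nonlocal current, used
        if current:
            chunks.append("\n".join(current))
        current, used = [], 0

    for block in blocks:
        n = count_tokens(block)
        if n > chunk_token_num:
            # An oversized block is split on word boundaries on its own.
            flush()
            words = block.split()
            for i in range(0, len(words), chunk_token_num):
                chunks.append(" ".join(words[i:i + chunk_token_num]))
            continue
        if used + n > chunk_token_num:
            flush()
        current.append(block)
        used += n

    flush()
    return chunks
```

With a real tokenizer the word-boundary split would be replaced by a split on token boundaries, but the packing logic stays the same.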


Utility Function

get_encoding(file) -> str

Detects and returns the character encoding of a file by reading its bytes.
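
The idea of byte-level detection can be sketched with the standard library alone. This is not the module's implementation, which may rely on a dedicated detection library; the function guess_encoding, its bytes-based signature, and the latin-1 fallback are all assumptions made for illustration.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Guess an encoding from a byte-order mark, else try UTF-8."""
    # A byte-order mark at the start is the strongest available hint.
    # UTF-32 marks are not handled in this sketch.
    boms = (
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    )
    for bom, name in boms:
        if data.startswith(bom):
            return name

    # Without a mark, accept UTF-8 if the bytes decode cleanly.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so it never fails.
        return "latin-1"
```

To apply it to a file, read the file in binary mode and pass the bytes, for example `guess_encoding(open(path, "rb").read())`.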


Constants


Implementation Details and Algorithms


Integration and Interaction with Other Components


Usage Example

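from html_parser import RAGFlowHtmlParser

# The parser instance is callable: pass a file path and a per-chunk token budget.
parser = RAGFlowHtmlParser()
chunks = parser("document.html", chunk_token_num=512)

# Print each chunk followed by a separator line.
for i, chunk in enumerate(chunks, start=1):
    print(f"Chunk {i}:\n{chunk}\n{'-' * 20}")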

Mermaid Class Diagram

classDiagram
    class RAGFlowHtmlParser {
        +__call__(fnm, binary=None, chunk_token_num=None) list~str~
        +parser_txt(txt, chunk_token_num) list~str~
        +split_table(html_table, chunk_token_num=512) list~str~
        +read_text_recursively(element, parser_result, chunk_token_num=512, parent_name=None, block_id=None) list~dict~
        +merge_block_text(parser_result) tuple~list~str~, list~dict~~
        +chunk_block(block_txt_list, chunk_token_num=512) list~str~
    }

Summary

html_parser.py is a specialized HTML parsing utility focused on extracting and chunking both text and tables from HTML documents into token-bounded segments, suitable for NLP workflows. It handles encoding detection, content cleaning, recursive extraction, and token-aware chunking, making it a crucial preprocessing step in the InfiniFlow RAG pipeline.