markdown_parser.py

Overview

markdown_parser.py provides utilities for parsing and processing Markdown content with a particular focus on handling tables and extracting distinct Markdown elements. The file contains two main classes:

- RAGFlowMarkdownParser: Parses Markdown text to identify and extract tables, optionally separating them from the rest of the content or replacing them with rendered HTML.
- MarkdownElementExtractor: Breaks down Markdown content into discrete elements such as headers, code blocks, lists, blockquotes, and text blocks by scanning the document line by line.

This functionality is useful in applications that analyze or manipulate Markdown documents, such as document ingestion pipelines, content extraction for natural language processing, or rendering workflows where table handling matters.

Classes and Methods

Class: `RAGFlowMarkdownParser`

Handles extraction and optional separation of tables from Markdown text.

Initialization

def __init__(self, chunk_token_num=128)

Parameters:
- chunk_token_num (int, optional): Number of tokens per chunk (default 128). The value is stored at initialization but is not used by the methods documented here; it may be intended for chunking features.

Method: `extract_tables_and_remainder`

def extract_tables_and_remainder(self, markdown_text, separate_tables=True)

Purpose: Extracts Markdown and HTML tables from input Markdown text. It returns the text with tables optionally removed or replaced by rendered HTML, and a list of the extracted raw tables.
Parameters:
- markdown_text (str): The raw Markdown content to process.
- separate_tables (bool, default=True): If True, tables are removed from the main text and collected separately. If False, tables stay in the text, either replaced inline by their rendered HTML or left as-is.
Returns:
- Tuple (working_text, tables):
  - working_text (str): Markdown text with tables removed or replaced.
  - tables (List[str]): List of raw table strings extracted from the input.
Usage Example:

parser = RAGFlowMarkdownParser()
clean_text, extracted_tables = parser.extract_tables_and_remainder(markdown_text, separate_tables=True)

Implementation Details:
- Uses regular expressions to detect:
  - Standard Markdown tables with borders.
  - Borderless Markdown tables.
  - HTML tables, including those nested inside <html> or <body> tags.
- Employs the markdown library to optionally render Markdown tables to HTML.
- As a performance shortcut, the method checks for table indicators (| for Markdown tables, <table> for HTML tables) before applying any regex.
- When separate_tables is True, tables are removed from the text and collected in a list. When False, tables are replaced inline with rendered HTML or left as-is.
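
The behavior described above can be sketched with the standard library alone. This is a minimal illustration, not the module's implementation: the pattern BORDERED_TABLE, the function split_tables, and its render callback are hypothetical names, and the regex is a simplified stand-in that only covers bordered Markdown tables.

```python
import re

# Simplified, hypothetical pattern (the module's real patterns are not shown here):
# a header row, a separator row, then any number of body rows, all wrapped in "|".
BORDERED_TABLE = re.compile(
    r"""
    ^\|.*\|[ \t]*\n               # header row
    \|[ \t:|-]+\|[ \t]*\n         # separator row
    (?:\|.*\|[ \t]*(?:\n|$))*     # zero or more body rows
    """,
    re.MULTILINE | re.VERBOSE,
)


def split_tables(text, separate_tables=True, render=None):
    """Return (working_text, tables) for bordered Markdown tables only."""
    tables = []
    if "|" not in text:  # cheap pre-check before running the regex
        return text, tables

    def _swap(match):
        raw = match.group()
        tables.append(raw)  # the raw table is always collected
        if separate_tables:
            return ""  # drop the table from the working text
        # keep it inline, rendered if a (hypothetical) renderer was supplied
        return render(raw) if render else raw

    return BORDERED_TABLE.sub(_swap, text), tables
```

Calling split_tables(text) on a document with one bordered table returns the text without that table plus a one-item list holding the raw table. Passing separate_tables=False with a render callback leaves the rendered form in place instead.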

Class: `MarkdownElementExtractor`

Parses Markdown content into discrete elements such as headers, code blocks, lists, blockquotes, and text blocks using line-by-line analysis.

Initialization

def __init__(self, markdown_content)

Parameters:
- markdown_content (str): Raw Markdown text to parse.
Attributes:
- self.lines (List[str]): Markdown content split into lines.
- self.ast_parser: A mistune Markdown parser instance configured to generate an Abstract Syntax Tree (AST). (Note: AST is parsed but currently not directly used in element extraction.)

Method: `extract_elements`

def extract_elements(self)

Purpose: Extracts and returns a list of Markdown elements, each representing a block such as a header, code block, list, blockquote, or text paragraph.
Returns:
- List[str]: List of Markdown content blocks as strings representing individual elements.
Usage Example:

extractor = MarkdownElementExtractor(markdown_content)
elements = extractor.extract_elements()
for element in elements:
    print(element)

Implementation Details:
- Processes the Markdown line-by-line with a loop.
- Uses regex and line characteristics to identify element types:
  - Headers: Lines starting with 1-6 # characters.
  - Code blocks: Lines starting with triple backticks ```.
  - Lists: Ordered and unordered list markers.
  - Blockquotes: Lines starting with >.
  - Text blocks: Paragraphs or inline text until next block element.
- For each element type, delegates to a private extraction method that determines the start and end lines of that element and returns the content.
- Skips empty lines between elements.
- Filters out any empty extracted elements before returning.
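
As an illustration of the line tests listed above, the following sketch classifies a single line. The helper name classify_line, the returned labels, and the exact regular expressions are assumptions made for this example; the module's own patterns may differ.

```python
import re

FENCE = "`" * 3  # triple backtick, built this way to keep the example readable

# Assumed patterns: 1-6 "#" followed by whitespace and text; "-", "*", "+" or "1." markers.
HEADER = re.compile(r"^#{1,6}\s+\S")
LIST_ITEM = re.compile(r"^\s*(?:[-*+]|\d+\.)\s+")


def classify_line(line):
    """Return a coarse block type for one line (hypothetical helper)."""
    stripped = line.strip()
    if not stripped:
        return "blank"
    if stripped.startswith(FENCE):
        return "code_fence"
    if HEADER.match(line):
        return "header"
    if LIST_ITEM.match(line):
        return "list"
    if stripped.startswith(">"):
        return "blockquote"
    return "text"  # anything else belongs to a text block


sample = "# Title\n\n- one\n- two\n\n> quoted\nplain text"
kinds = [classify_line(line) for line in sample.splitlines()]
```

The order of the checks matters: the fence test runs first so that a fenced line is never mistaken for another element, and the text label is the fallback.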

Private Extraction Methods

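Each method returns a dictionary with the keys "type", "content", "start_line", and "end_line". Because extract_elements returns plain strings, only the "content" values reach the caller.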

- _extract_header(start_pos): Extracts a single header line.
- _extract_code_block(start_pos): Extracts a fenced code block starting at start_pos.
- _extract_list_block(start_pos): Extracts a contiguous block of list items.
- _extract_blockquote(start_pos): Extracts a blockquote, including contiguous lines starting with > or blank lines within the block.
- _extract_text_block(start_pos): Extracts a paragraph or text block until the next block element or end of text.
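
To make the return shape concrete, here is a sketch of what a code-block extractor with this contract might look like. It is written as a standalone function taking the line list explicitly; the function name, the "code_block" type label, and the handling of an unclosed fence are assumptions, not details confirmed by the module.

```python
FENCE = "`" * 3  # triple backtick


def extract_code_block(lines, start_pos):
    """Collect the fenced block that opens at lines[start_pos]."""
    end_pos = start_pos
    for i in range(start_pos + 1, len(lines)):
        end_pos = i
        if lines[i].strip().startswith(FENCE):
            break  # closing fence found
    # If no closing fence exists, the block runs to the last line (assumed behavior).
    return {
        "type": "code_block",  # assumed label
        "content": "\n".join(lines[start_pos:end_pos + 1]),
        "start_line": start_pos,
        "end_line": end_pos,
    }


lines = ["intro", FENCE + "python", "x = 1", FENCE, "after"]
block = extract_code_block(lines, 1)
```

A caller can resume scanning at block["end_line"] + 1, which is how a start/end pair lets a line-by-line loop skip over a whole element at once.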

Implementation Highlights

- Regex Usage: Verbose, multiline regex patterns identify Markdown tables and block elements.
- Performance Optimizations: The presence of table-specific characters is checked before any regex is applied, which avoids unnecessary matching.
- Separation of Concerns: Table extraction and element extraction live in separate classes.
- Use of External Libraries:
  - mistune for Markdown AST parsing (though not fully leveraged).
  - markdown package to render Markdown tables to HTML when requested.
- Line-based Parsing: MarkdownElementExtractor relies on line-based parsing rather than AST traversal, which keeps the extraction logic simple and deterministic.
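
The line-based approach can be illustrated with a small driver loop. This sketch handles only fenced code and plain text, and the function name scan_blocks is hypothetical. It shows the general pattern of finding an element's last line and then resuming right after it, under the assumption that blank lines separate text blocks.

```python
FENCE = "`" * 3  # triple backtick


def scan_blocks(lines):
    """Group lines into blocks: fenced code stays together, blank lines split text."""
    blocks, i = [], 0
    while i < len(lines):
        if not lines[i].strip():
            i += 1  # skip blank lines between elements
            continue
        end = i
        if lines[i].strip().startswith(FENCE):
            # fenced code: run to the closing fence, or to the end of input
            end = i + 1
            while end < len(lines) and not lines[end].strip().startswith(FENCE):
                end += 1
            end = min(end, len(lines) - 1)
        else:
            # plain text: run until a blank line or an opening fence
            while (end + 1 < len(lines) and lines[end + 1].strip()
                   and not lines[end + 1].strip().startswith(FENCE)):
                end += 1
        blocks.append("\n".join(lines[i:end + 1]))
        i = end + 1  # resume right after the element just consumed
    return [b for b in blocks if b.strip()]  # drop empty results


doc = "para one\nstill para\n\n" + FENCE + "\ncode\n\nmore\n" + FENCE + "\ntail"
blocks = scan_blocks(doc.splitlines())
```

Because the loop index only moves forward and every branch consumes at least one line, the scan visits each line once and always terminates.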

Interaction with Other System Components

This module is likely a core component in an ingestion or preprocessing pipeline in the InfiniFlow project:
- Downstream components may use the separated tables for structured data processing.
- Extracted elements may feed into NLP pipelines, summarization, or indexing components.
- Rendered HTML tables can be used for visualization or exporting Markdown content.
- User interfaces that display Markdown content could use it for table handling or element-wise manipulation.
- The chunk_token_num parameter in RAGFlowMarkdownParser suggests possible integration with token-based chunking or embedding generation modules.

Visual Diagram

classDiagram
    class RAGFlowMarkdownParser {
        -chunk_token_num: int
        +__init__(chunk_token_num=128)
        +extract_tables_and_remainder(markdown_text, separate_tables=True) str, List[str]
    }

    class MarkdownElementExtractor {
        -markdown_content: str
        -lines: List[str]
        -ast_parser
        -ast_nodes
        +__init__(markdown_content)
        +extract_elements() List[str]
        -_extract_header(start_pos) dict
        -_extract_code_block(start_pos) dict
        -_extract_list_block(start_pos) dict
        -_extract_blockquote(start_pos) dict
        -_extract_text_block(start_pos) dict
    }

Summary

markdown_parser.py provides tools for parsing Markdown documents, with dedicated handling of tables and fine-grained extraction of block elements. It combines regex-based table detection, optional HTML rendering, and line-wise parsing. This makes it suitable for workflows that need tables separated cleanly from the surrounding text and content segmented into blocks, such as content analysis, rendering, or indexing within the InfiniFlow system.
