markdown_parser.py

Overview

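markdown_parser.py provides utilities for parsing and processing Markdown content, with a particular focus on handling tables and extracting distinct Markdown elements. The file contains two main classes: RAGFlowMarkdownParser, which extracts tables from Markdown text, and MarkdownElementExtractor, which splits Markdown content into discrete block elements.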

This functionality is useful in applications that analyze or manipulate Markdown documents, such as document ingestion pipelines, content extraction for natural language processing, or rendering workflows where table handling is critical.


Classes and Methods

Class: RAGFlowMarkdownParser

Handles extraction and optional separation of tables from Markdown text.

Initialization

def __init__(self, chunk_token_num=128)

Method: extract_tables_and_remainder

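def extract_tables_and_remainder(self, markdown_text, separate_tables=True)

Returns a pair: the text that remains after table extraction (str) and the list of extracted tables (List[str]).

Example:

parser = RAGFlowMarkdownParser()
clean_text, extracted_tables = parser.extract_tables_and_remainder(markdown_text, separate_tables=True)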
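
To make the idea of separating tables from the surrounding text concrete, here is a minimal sketch of a regex-based approach for pipe-delimited tables. It is not the implementation in markdown_parser.py: the pattern TABLE_RE and the function split_tables are hypothetical names introduced for illustration, and the sketch covers only the case where tables are removed from the text. How separate_tables=False behaves is not described in this document and is not modeled here.

```python
import re

# Hypothetical pattern (an assumption, not taken from markdown_parser.py):
# a header row, a delimiter row such as |---|:--:|, then any number of body rows.
TABLE_RE = re.compile(
    r"^\|.*\|[ \t]*\n"                                                 # header row
    r"\|[ \t]*:?-+:?[ \t]*(?:\|[ \t]*:?-+:?[ \t]*)*\|[ \t]*(?:\n|$)"   # delimiter row
    r"(?:\|.*\|[ \t]*(?:\n|$))*",                                      # body rows
    re.MULTILINE,
)


def split_tables(markdown_text):
    """Return (remainder, tables): the text without tables, and the tables found."""
    tables = [m.group(0).strip() for m in TABLE_RE.finditer(markdown_text)]
    remainder = TABLE_RE.sub("", markdown_text)
    return remainder, tables


sample = "Intro\n\n| a | b |\n|---|---|\n| 1 | 2 |\n\nOutro\n"
remainder, tables = split_tables(sample)
print(tables)     # the single table, as one string
print(remainder)  # the intro and outro text with the table removed
```

The sketch only recognizes tables whose rows start and end with a pipe character. Tables written without outer pipes would need an additional pattern.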

Class: MarkdownElementExtractor

Parses Markdown content into discrete elements such as headers, code blocks, lists, blockquotes, and text blocks using line-by-line analysis.

Initialization

def __init__(self, markdown_content)

Method: extract_elements

def extract_elements(self)

Returns the extracted elements as a list of strings.

Example:

extractor = MarkdownElementExtractor(markdown_content)
elements = extractor.extract_elements()
for element in elements:
    print(element)
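
The line-by-line analysis can be pictured as a dispatch step that inspects each line and decides which kind of element starts there. The following is a minimal sketch under assumed rules: the PATTERNS table and the classify_line function are hypothetical, and the actual detection rules in markdown_parser.py may differ.

```python
import re

# Hypothetical line patterns, checked in order (assumptions for illustration).
PATTERNS = [
    ("header", re.compile(r"^#{1,6}\s+\S")),                 # "# Title" to "###### Title"
    ("code_block", re.compile(r"^\s*```")),                  # opening code fence
    ("list_block", re.compile(r"^\s*(?:[-*+]|\d+\.)\s+")),   # bullet or numbered item
    ("blockquote", re.compile(r"^\s*>")),                    # quoted line
]


def classify_line(line):
    """Return the element type that starts on this line, or "text_block"."""
    for kind, pattern in PATTERNS:
        if pattern.match(line):
            return kind
    return "text_block"


for line in ["# Title", "- item", "> quote", "plain text"]:
    print(classify_line(line), "<-", line)
```

Anything that matches none of the patterns falls through to a plain text block, which mirrors the five element types listed above.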

Private Extraction Methods

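The private methods _extract_header, _extract_code_block, _extract_list_block, _extract_blockquote, and _extract_text_block each take a start position (start_pos) and return a dictionary with the keys "type", "content", "start_line", and "end_line".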
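
As an illustration of that return shape, here is a minimal sketch of how a fenced code block might be collected into such a dictionary. The standalone function extract_code_block is hypothetical, and it is an assumption that line positions are zero-based and that "end_line" is inclusive. The actual private methods may count lines differently.

```python
def extract_code_block(lines, start_pos):
    """Collect a fenced code block whose opening fence is lines[start_pos]."""
    end_pos = start_pos
    for i in range(start_pos + 1, len(lines)):
        end_pos = i
        if lines[i].strip().startswith("```"):
            break  # closing fence found
    # If no closing fence exists, the block runs to the last line.
    return {
        "type": "code_block",
        "content": "\n".join(lines[start_pos:end_pos + 1]),
        "start_line": start_pos,
        "end_line": end_pos,
    }


lines = ["intro", "```python", "print('hi')", "```", "after"]
element = extract_code_block(lines, 1)
print(element["start_line"], element["end_line"])
```

Returning the start and end positions alongside the content lets a caller resume scanning at end_line + 1 without re-reading the lines that were just consumed.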


Implementation Highlights


Interaction with Other System Components


Visual Diagram

classDiagram
    class RAGFlowMarkdownParser {
        -chunk_token_num: int
        +__init__(chunk_token_num=128)
        +extract_tables_and_remainder(markdown_text, separate_tables=True) str, List[str]
    }

    class MarkdownElementExtractor {
        -markdown_content: str
        -lines: List[str]
        -ast_parser
        -ast_nodes
        +__init__(markdown_content)
        +extract_elements() List[str]
        -_extract_header(start_pos) dict
        -_extract_code_block(start_pos) dict
        -_extract_list_block(start_pos) dict
        -_extract_blockquote(start_pos) dict
        -_extract_text_block(start_pos) dict
    }

Summary

markdown_parser.py provides tools for parsing Markdown documents, with dedicated handling of tables and fine-grained extraction of Markdown block elements. It combines regex-based detection with optional rendering and line-wise parsing. This makes it suitable for workflows that need tables separated cleanly from the surrounding text and content segmented into blocks, supporting downstream applications such as content analysis, rendering, or indexing within the InfiniFlow system.