# document_parsing.py
## Overview
The `document_parsing.py` module provides functionality to parse markdown files into structured textual units for further processing or analysis. It is designed to:
- Identify and separate sections based on markdown headers.
- Extract paragraphs and code blocks within markdown content.
- Handle multiline code blocks by temporarily replacing them with placeholders during parsing.
- Aggregate parsed units from multiple markdown files into a flat list of textual document units.
This module is useful in applications involving document analysis, natural language processing, or any workflow that requires structured breakdown of markdown documentation.
## Detailed Description of Classes, Functions, and Methods
### 1. `parse_markdown_units(md_text: str) -> List[Dict]`
**Purpose:**
Parses a given markdown text string into a list of content units. Each unit is a dictionary representing either a section with a header or a smaller content block such as a paragraph or code block.
**Parameters:**
- `md_text` (str): Raw markdown text to parse.
**Returns:**
- `List[Dict]`: A list of dictionaries. Each dictionary represents a document unit with keys including:
  - `"title"`: The header text for sections, or an empty string for paragraphs and code.
  - `"type"`: The type of unit (`"section"`, `"paragraph"`, or `"code"`).
  - `"text"`: The content text of the unit.
**Usage:**
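````python
from document_parsing import parse_markdown_units

markdown_text = """
# Introduction
This is an introductory paragraph.
```python
print("Hello, world!")
```
"""

units = parse_markdown_units(markdown_text)
for unit in units:
    # Print the kind of each unit and its header (empty for paragraphs and code)
    print(unit["type"], unit["title"])
````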
**Implementation Details:**
- Uses regex to temporarily replace fenced code blocks (triple backticks) with placeholders to protect code from being split during header-based partitioning.
- Splits text by markdown headers (`#`, `##`, ...).
- Restores code blocks from placeholders after splitting.
- Non-header leading text is further split into paragraphs or code units.
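
The steps above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the placeholder token `@@CODEBLOCK_n@@`, both regex patterns, and the function name `sketch_parse_sections` are assumptions made for the example, and leading text is kept as a single paragraph unit rather than being split further.

```python
import re
from typing import Dict, List

# Assumed patterns; the module's real regexes are not shown in this document.
FENCE_RE = re.compile(r"```.*?```", re.DOTALL)
HEADER_RE = re.compile(r"^(#{1,6})[ \t]+(.+)$", re.MULTILINE)


def sketch_parse_sections(md_text: str) -> List[Dict]:
    code_blocks: List[str] = []

    def stash(match: re.Match) -> str:
        # Swap each fenced block for a numbered placeholder (hypothetical format).
        code_blocks.append(match.group(0))
        return f"@@CODEBLOCK_{len(code_blocks) - 1}@@"

    protected = FENCE_RE.sub(stash, md_text)

    def restore(text: str) -> str:
        for i, block in enumerate(code_blocks):
            text = text.replace(f"@@CODEBLOCK_{i}@@", block)
        return text

    units: List[Dict] = []
    headers = list(HEADER_RE.finditer(protected))

    # Text before the first header has no title.
    lead_end = headers[0].start() if headers else len(protected)
    lead = protected[:lead_end].strip()
    if lead:
        units.append({"title": "", "type": "paragraph", "text": restore(lead)})

    # Each header owns the text up to the next header (or the end of input).
    for i, header in enumerate(headers):
        body_end = headers[i + 1].start() if i + 1 < len(headers) else len(protected)
        body = protected[header.end():body_end].strip()
        units.append(
            {"title": header.group(2).strip(), "type": "section", "text": restore(body)}
        )
    return units
```

Because headers are located on the protected text, a `#` comment inside a fenced block never reaches the header pattern.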
---
### 2. `_restore_codeblocks(text: str, code_blocks: List[str]) -> str`
**Purpose:**
Reinserts the original code blocks into the text by replacing each placeholder with its code content.
**Parameters:**
- `text` (str): Text containing code block placeholders.
- `code_blocks` (List[str]): List of extracted code block strings.
**Returns:**
- `str`: Text with placeholders replaced by actual code blocks.
**Usage:**
This function is internal and used by the parser functions to restore code snippets during processing.
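
As an illustration of placeholder restoration, here is one possible sketch. The token format `@@CODEBLOCK_n@@` and the name `restore_codeblocks_sketch` are hypothetical, and the module's own function may work differently, for example with plain string replacement.

```python
import re
from typing import List

# Hypothetical placeholder format; the real module may use a different token.
PLACEHOLDER_RE = re.compile(r"@@CODEBLOCK_(\d+)@@")


def restore_codeblocks_sketch(text: str, code_blocks: List[str]) -> str:
    def lookup(match: re.Match) -> str:
        index = int(match.group(1))
        # Leave unknown placeholders untouched rather than raising.
        return code_blocks[index] if index < len(code_blocks) else match.group(0)

    # A callable replacement is inserted literally, so backslashes in code survive.
    return PLACEHOLDER_RE.sub(lookup, text)
```

Passing a function to `re.sub` avoids escape processing of the replacement string, which matters when the restored code contains sequences such as `\n` or `\1`.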
---
### 3. `_split_md_text_to_units(md_segment: str, code_blocks: List[str]) -> List[Dict]`
**Purpose:**
Splits a markdown text segment that contains no headers into smaller units: paragraphs or code blocks.
**Parameters:**
- `md_segment` (str): Markdown text segment without headers.
- `code_blocks` (List[str]): List of code blocks extracted earlier.
**Returns:**
- `List[Dict]`: List of dictionaries representing paragraphs or code blocks with keys similar to `parse_markdown_units`.
**Usage:**
Called internally by `parse_markdown_units` to break down text sections that do not begin with headers.
**Implementation Details:**
- Restores code blocks.
- Splits on double newlines to separate paragraphs.
- Determines if a block is a code block by checking for fenced code markers.
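
A rough sketch of this kind of splitting is shown below. The placeholder token and the function name are assumptions. Note one deliberate difference from the order listed above: this sketch splits first and restores each chunk afterwards, so that blank lines inside a code block cannot break it apart. The module itself may restore first.

```python
import re
from typing import Dict, List

PLACEHOLDER_RE = re.compile(r"@@CODEBLOCK_(\d+)@@")  # hypothetical token format


def split_segment_sketch(md_segment: str, code_blocks: List[str]) -> List[Dict]:
    units: List[Dict] = []
    # A blank line (two or more newlines, optionally with spaces between) ends a block.
    for chunk in re.split(r"\n\s*\n", md_segment):
        chunk = chunk.strip()
        if not chunk:
            continue
        # Put the original code back in place of its placeholder.
        text = PLACEHOLDER_RE.sub(lambda m: code_blocks[int(m.group(1))], chunk)
        # A chunk that opens with a fence marker is treated as code.
        kind = "code" if text.startswith("```") else "paragraph"
        units.append({"title": "", "type": kind, "text": text})
    return units
```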
---
### 4. `chunk_markdown_files(md_paths: List[str]) -> List[str]`
**Purpose:**
Reads multiple markdown files and returns a flattened list of textual document units extracted from all files.
**Parameters:**
- `md_paths` (List[str]): List of paths to markdown files to process.
**Returns:**
- `List[str]`: List of strings, each representing a document unit (section or paragraph/code text).
**Usage:**
```python
from document_parsing import chunk_markdown_files

file_paths = ["docs/introduction.md", "docs/usage.md"]
units = chunk_markdown_files(file_paths)
for unit in units:
    print(unit)
```
**Implementation Details:**
- Uses `read_text_file` from `file_utils` to read file contents.
- Parses each file with `parse_markdown_units`.
- For sections, combines header and text with spacing.
- For paragraphs or code blocks, appends their text directly if non-empty.
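
The flattening step might look something like the sketch below. Everything here is illustrative: `read_text_file_stub` stands in for `file_utils.read_text_file` (whose real signature is not shown in this document), the parser is passed in as a parameter so the example runs on its own, and the blank line between header and body is an assumed choice of spacing.

```python
from pathlib import Path
from typing import Callable, Dict, List


def read_text_file_stub(path: str) -> str:
    # Stand-in for file_utils.read_text_file; the real helper may differ.
    return Path(path).read_text(encoding="utf-8")


def flatten_units(units: List[Dict]) -> List[str]:
    chunks: List[str] = []
    for unit in units:
        text = unit.get("text", "").strip()
        if unit.get("type") == "section":
            title = unit.get("title", "").strip()
            # Header and body joined by a blank line (assumed spacing).
            combined = f"{title}\n\n{text}".strip()
            if combined:
                chunks.append(combined)
        elif text:
            # Paragraph and code units contribute their text only if non-empty.
            chunks.append(text)
    return chunks


def chunk_files_sketch(
    md_paths: List[str],
    parse: Callable[[str], List[Dict]],
    read: Callable[[str], str] = read_text_file_stub,
) -> List[str]:
    chunks: List[str] = []
    for path in md_paths:
        chunks.extend(flatten_units(parse(read(path))))
    return chunks
```

Injecting `parse` and `read` keeps the sketch testable without touching the file system.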
## Important Implementation Details and Algorithms
- **Code block handling:** The parser protects fenced code blocks by replacing them with unique placeholders before splitting the markdown text. This prevents inadvertent splitting of code blocks during header detection or paragraph splitting. After splitting, these placeholders are replaced by the original code blocks.
- **Header-based splitting:** Markdown headers (`#` to `######`) are used as delimiters to segment the document into sections. Each detected header forms a `"section"` unit with the header as the title and the following text as its content.
- **Paragraph splitting:** Within non-header segments, text is split on blank lines (two or more consecutive newlines) to separate paragraphs.
- **Use of regex:** Multiple regex patterns are used to identify code blocks, headers, and paragraph boundaries.
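
To see why the placeholder step matters, consider a Python comment inside a fenced block: it starts with `#` and looks like a header to a line-based pattern. The short demonstration below uses assumed regexes and a hypothetical placeholder token; it is not taken from the module.

```python
import re

# Assumed patterns for illustration only.
HEADER_RE = re.compile(r"^#{1,6}[ \t]+.+$", re.MULTILINE)
FENCE_RE = re.compile(r"```.*?```", re.DOTALL)

sample = (
    "# Install\n"
    "Run the script:\n\n"
    "```python\n"
    "# configure logging first\n"
    "import logging\n"
    "```\n\n"
    "## Next steps\n"
    "Read the guide.\n"
)

# Without protection, the code comment is picked up as a header.
naive_headers = HEADER_RE.findall(sample)

blocks = []


def _stash(match):
    # Replace the fenced block with a numbered placeholder (hypothetical format).
    blocks.append(match.group(0))
    return f"@@CODEBLOCK_{len(blocks) - 1}@@"


protected = FENCE_RE.sub(_stash, sample)

# With the block hidden, only the real headers match.
protected_headers = HEADER_RE.findall(protected)

print(naive_headers)
print(protected_headers)
```

The first list contains three matches, including the code comment; the second contains only `# Install` and `## Next steps`.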
## Interaction with Other Parts of the Application
- **Dependency on `file_utils.read_text_file`:** The function `chunk_markdown_files` relies on an external utility to read the contents of markdown files. This implies tight integration with the file I/O subsystem managed by `file_utils`.
- **Downstream usage:** The parsed units returned by this module can feed into document indexing, search, summarization, or natural language understanding components that require structured input from markdown documentation.
## Visual Diagram
```mermaid
classDiagram
    class document_parsing {
        +parse_markdown_units(md_text: str) List~Dict~
        +chunk_markdown_files(md_paths: List~str~) List~str~
    }
    document_parsing ..> file_utils : uses
    document_parsing : -_restore_codeblocks(text: str, code_blocks: List~str~) str
    document_parsing : -_split_md_text_to_units(md_segment: str, code_blocks: List~str~) List~Dict~
```
## Summary
The `document_parsing.py` module is a utility for converting markdown documents into manageable, semantically meaningful units by leveraging regex-based parsing and careful handling of markdown syntax elements, especially code blocks and headers. It serves as a foundational step in workflows that analyze or repurpose markdown content programmatically.