# document_parsing.py


## Overview

The `document_parsing.py` module parses markdown files into structured textual units for further processing or analysis.

This module is useful in applications involving document analysis, natural language processing, or any workflow that requires structured breakdown of markdown documentation.


## Detailed Description of Classes, Functions, and Methods

### 1. `parse_markdown_units(md_text: str) -> List[Dict]`

**Purpose:** Parses a given markdown text string into a list of content units. Each unit is a dictionary representing either a section with a header or smaller content blocks such as paragraphs or code blocks.

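**Parameters:**
- `md_text` (str): Markdown text to parse.

**Returns:**
- `List[Dict]`: List of unit dictionaries, one per section, paragraph, or code block. Each dictionary carries at least `type` and `title` keys.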

**Usage:**

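````python
# Adjust the import path to match where the module lives in your project.
from document_parsing import parse_markdown_units

markdown_text = """
# Introduction

This is an introductory paragraph.

```python
print("Hello, world!")
```
"""

units = parse_markdown_units(markdown_text)
for unit in units:
    print(unit["type"], unit["title"])
````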


**Implementation Details:**  
- Uses regex to temporarily replace fenced code blocks (triple backticks) with placeholders to protect code from being split during header-based partitioning.
- Splits text by markdown headers (`#`, `##`, ...).
- Restores code blocks from placeholders after splitting.
- Non-header leading text is further split into paragraphs or code units.
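
The placeholder-then-split approach listed above can be sketched as follows. This is a minimal illustration, not the module's actual code: the regexes, the `@@CODEBLOCK_n@@` placeholder format, the helper names `protect_code_blocks` and `split_by_headers`, and the `body` key are all assumptions made for the example.

```python
import re
from typing import Dict, List, Tuple

FENCE = "`" * 3  # fenced code marker
# Assumed patterns; the module's real regexes and placeholder format may differ.
FENCE_RE = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)
HEADER_RE = re.compile(r"^(#{1,6})[ \t]+(.*)$", re.MULTILINE)


def protect_code_blocks(md_text: str) -> Tuple[str, List[str]]:
    """Swap each fenced block for an indexed placeholder and collect the originals."""
    code_blocks: List[str] = []

    def stash(match: "re.Match[str]") -> str:
        code_blocks.append(match.group(0))
        return f"@@CODEBLOCK_{len(code_blocks) - 1}@@"

    return FENCE_RE.sub(stash, md_text), code_blocks


def split_by_headers(protected: str) -> List[Dict[str, str]]:
    """Cut protected text at header lines; text before the first header gets an empty title."""
    sections: List[Dict[str, str]] = []
    matches = list(HEADER_RE.finditer(protected))
    lead = protected[: matches[0].start()] if matches else protected
    if lead.strip():
        sections.append({"title": "", "body": lead.strip()})
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(protected)
        sections.append({"title": m.group(2).strip(), "body": protected[m.end():end].strip()})
    return sections


# The "# not a header" line sits inside a code block, so it must not start a section.
sample = "Intro text.\n\n# Setup\n\n" + FENCE + "sh\n# not a header\npip install x\n" + FENCE + "\n"
protected, blocks = protect_code_blocks(sample)
sections = split_by_headers(protected)
```

Because the code block is hidden behind a placeholder before the header regex runs, a `#` comment inside the block cannot be mistaken for a header.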

---

### 2. `_restore_codeblocks(text: str, code_blocks: List[str]) -> str`

**Purpose:**  
Reinserts original code blocks back into text by replacing placeholders with actual code content.

**Parameters:**  
- `text` (str): Text containing code block placeholders.
- `code_blocks` (List[str]): List of extracted code block strings.

**Returns:**  
- `str`: Text with placeholders replaced by actual code blocks.

**Usage:**  
This function is internal and used by the parser functions to restore code snippets during processing.
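
A minimal sketch of placeholder restoration, assuming an indexed `@@CODEBLOCK_n@@` token (the real placeholder format is not given in this document, and `restore_codeblocks` here is an illustrative stand-in rather than the module's private function):

```python
import re
from typing import List

# Hypothetical placeholder format; the module's actual token may differ.
PLACEHOLDER_RE = re.compile(r"@@CODEBLOCK_(\d+)@@")


def restore_codeblocks(text: str, code_blocks: List[str]) -> str:
    """Replace each indexed placeholder with the code block stored at that index."""
    # A callable replacement is inserted verbatim, so backslashes in the code
    # are not interpreted as regex escape sequences.
    return PLACEHOLDER_RE.sub(lambda m: code_blocks[int(m.group(1))], text)


fence = "`" * 3
blocks = [f"{fence}python\nprint('a\\nb')\n{fence}"]
restored = restore_codeblocks("Before\n\n@@CODEBLOCK_0@@\n\nAfter", blocks)
```

Passing a function to `re.sub` rather than a replacement string matters here: code often contains sequences such as `\n` or `\1` that a string replacement would try to expand.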

---

### 3. `_split_md_text_to_units(md_segment: str, code_blocks: List[str]) -> List[Dict]`

**Purpose:**  
Splits a markdown text segment that contains no headers into smaller units: paragraphs or code blocks.

**Parameters:**  
- `md_segment` (str): Markdown text segment without headers.
- `code_blocks` (List[str]): List of code blocks extracted earlier.

**Returns:**  
- `List[Dict]`: List of dictionaries representing paragraphs or code blocks with keys similar to `parse_markdown_units`.

**Usage:**  
Called internally by `parse_markdown_units` to break down text sections that do not begin with headers.

**Implementation Details:**  
- Restores code blocks.
- Splits on double newlines to separate paragraphs.
- Determines if a block is a code block by checking for fenced code markers.
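
The splitting and classification steps above could look roughly like this. The sketch assumes the code blocks have already been restored, and the `type` values (`"code"`, `"paragraph"`) and the `content` key are illustrative guesses rather than the module's confirmed schema.

```python
import re
from typing import Dict, List

FENCE = "`" * 3  # fenced code marker


def split_segment(restored: str) -> List[Dict[str, str]]:
    """Split header-free markdown into paragraph and code units on blank lines."""
    units: List[Dict[str, str]] = []
    for block in re.split(r"\n{2,}", restored.strip()):
        block = block.strip()
        if not block:
            continue
        # A block counts as code when it opens and closes with a fence marker.
        is_code = block.startswith(FENCE) and block.endswith(FENCE)
        units.append({"type": "code" if is_code else "paragraph", "content": block})
    return units


segment = "First paragraph.\n\n" + FENCE + "python\nx = 1\n" + FENCE + "\n\nSecond paragraph."
units = split_segment(segment)
```

One caveat with this naive version: a fenced block that itself contains a blank line would be split across several units. Splitting while placeholders are still in place, and restoring each block afterwards, is one way to avoid that.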

---

### 4. `chunk_markdown_files(md_paths: List[str]) -> List[str]`

**Purpose:**  
Reads multiple markdown files and returns a flattened list of textual document units extracted from all files.

**Parameters:**  
- `md_paths` (List[str]): List of paths to markdown files to process.

**Returns:**  
- `List[str]`: List of strings, each representing a document unit (section or paragraph/code text).

**Usage:**  
```python
file_paths = ["docs/introduction.md", "docs/usage.md"]
units = chunk_markdown_files(file_paths)
for unit in units:
    print(unit)
```

**Implementation Details:**
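As a minimal sketch of one way such a function could work, assuming UTF-8 files, a parser passed in as a callable, and units that expose `title` and `content` keys (the `content` key, the `chunk_files` name, and the `parse` parameter are hypothetical and not taken from the module):

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


def chunk_files(md_paths: List[str], parse: Callable[[str], List[Dict]]) -> List[str]:
    """Read each file, parse it into units, and flatten every unit to one string."""
    chunks: List[str] = []
    for path in md_paths:
        text = Path(path).read_text(encoding="utf-8")
        for unit in parse(text):
            title = unit.get("title") or ""
            content = unit.get("content") or ""
            # Join the title and body, skipping whichever part is empty.
            chunk = "\n\n".join(part for part in (title, content) if part)
            if chunk:
                chunks.append(chunk)
    return chunks


def stub_parse(text: str) -> List[Dict]:
    """Stand-in parser for the demo; a real caller would pass the module's parser."""
    return [{"type": "section", "title": "Intro", "content": text.strip()}]


with tempfile.TemporaryDirectory() as tmp:
    md_file = Path(tmp) / "a.md"
    md_file.write_text("Hello.\n", encoding="utf-8")
    chunks = chunk_files([str(md_file)], stub_parse)
```

Taking the parser as an argument keeps the sketch independent of the module's internals and makes it easy to exercise with a stub.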


## Important Implementation Details and Algorithms


## Interaction with Other Parts of the Application


## Visual Diagram

```mermaid
classDiagram
    class document_parsing {
        +parse_markdown_units(md_text: str) List~Dict~
        +chunk_markdown_files(md_paths: List~str~) List~str~
    }
    document_parsing ..> file_utils : uses
    document_parsing : -_restore_codeblocks(text: str, code_blocks: List~str~) str
    document_parsing : -_split_md_text_to_units(md_segment: str, code_blocks: List~str~) List~Dict~
```

## Summary

The `document_parsing.py` module is a utility for converting markdown documents into manageable, semantically meaningful units by leveraging regex-based parsing and careful handling of markdown syntax elements, especially code blocks and headers. It serves as a foundational step in workflows that analyze or repurpose markdown content programmatically.