code_parsing.py


Overview

The `code_parsing.py` module extracts semantic code units (chunks) from source code files. It uses the **Tree-sitter** parsing library to parse a file and pull out syntactic units such as functions, classes, imports, exports, and variable declarations for supported programming languages.

When Tree-sitter support is not available for a given language, or when parsing fails, the module falls back to a heuristic approach that uses regular expressions and blank-line separation to segment the source code into logical chunks.

This file is primarily designed to support workflows that require granular code analysis, indexing, or transformation by supplying well-defined code units instead of entire source files.


Detailed Explanation of Components

Functions


detect_language_from_extension(path: str) -> Optional[str]

Detects the programming language of a file based on its file extension.
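
As a rough illustration of extension-based detection, the lookup could be sketched as below. The `EXTENSION_TO_LANGUAGE` table and its entries are hypothetical; the mapping the module actually uses is not shown in this document.

```python
import os
from typing import Optional

# Hypothetical mapping for illustration only; the module's real table
# may cover different extensions and language names.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".js": "javascript",
    ".ts": "typescript",
    ".go": "go",
    ".rs": "rust",
}


def detect_language_from_extension(path: str) -> Optional[str]:
    """Return a language name for the file extension, or None if unknown."""
    _, ext = os.path.splitext(path)
    # Lower-case the extension so "App.TS" and "app.ts" are treated alike.
    return EXTENSION_TO_LANGUAGE.get(ext.lower())
```

Returning `None` for an unknown extension gives the caller a clear signal to skip Tree-sitter and go straight to the heuristic fallback.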


extract_code_units_from_file(path: str) -> List[str]

Extracts semantic code units (functions, classes, imports, exports, variable declarations) from the source code file at `path`.
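
The parse-then-fall-back flow might look roughly like the sketch below. It assumes the optional `tree_sitter_languages` package supplies `get_parser`, and it uses node type names from the Tree-sitter Python grammar. The helper names `extract_units` and `blank_line_chunks` are hypothetical stand-ins and not the module's own code.

```python
import re
from typing import List

# Node types treated as units. These names come from the Tree-sitter
# Python grammar; other languages use different names (assumption).
UNIT_NODE_TYPES = {
    "function_definition",
    "class_definition",
    "import_statement",
    "import_from_statement",
}


def blank_line_chunks(source: str) -> List[str]:
    """Simplified stand-in for the heuristic fallback: split on blank lines."""
    parts = re.split(r"\n\s*\n", source)
    return [part.strip("\n") for part in parts if part.strip()]


def extract_units(source: str, lang_name: str) -> List[str]:
    """Try Tree-sitter first; fall back to blank-line chunks on any failure."""
    try:
        # Optional dependency; an ImportError triggers the fallback below.
        from tree_sitter_languages import get_parser

        parser = get_parser(lang_name)
        data = source.encode("utf-8")
        tree = parser.parse(data)
        units: List[str] = []

        def walk(node) -> None:
            if node.type in UNIT_NODE_TYPES:
                text = data[node.start_byte:node.end_byte]
                units.append(text.decode("utf-8"))
                return  # do not descend into a unit already captured
            for child in node.children:
                walk(child)

        walk(tree.root_node)
        if units:
            return units
    except Exception:
        # Missing package, unsupported language, or a parse error.
        pass
    return blank_line_chunks(source)
```

Slicing the encoded bytes rather than the string matters because Tree-sitter reports byte offsets, which differ from character offsets when the source contains non-ASCII text.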


heuristic_code_chunks(source: str) -> List[str]

Fallback method that splits source code into chunks at keyword matches (via regular expressions) and blank lines. Chunks longer than 200 lines are split further.
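
A minimal sketch of this kind of heuristic is shown below. The keyword pattern in `BOUNDARY` is an assumption, since the module's actual regular expressions are not shown here; the 200-line limit is taken from the diagram in this document. The function name `heuristic_chunks` is hypothetical.

```python
import re
from typing import List

MAX_CHUNK_LINES = 200  # limit taken from the diagram in this document

# Assumed keyword set; matches only unindented lines that start a unit.
BOUNDARY = re.compile(
    r"^(def|class|function|import|from|export|const|let|var)\b"
)


def heuristic_chunks(source: str, max_lines: int = MAX_CHUNK_LINES) -> List[str]:
    """Split source at blank lines and keyword lines, then cap chunk size."""
    chunks: List[List[str]] = []
    current: List[str] = []
    for line in source.splitlines():
        if not line.strip():
            # A blank line closes the chunk being built.
            if current:
                chunks.append(current)
                current = []
            continue
        if BOUNDARY.match(line) and current:
            # A top-level keyword starts a new chunk.
            chunks.append(current)
            current = []
        current.append(line)
    if current:
        chunks.append(current)

    # Break any oversized chunk into pieces of at most max_lines lines.
    result: List[str] = []
    for chunk in chunks:
        for start in range(0, len(chunk), max_lines):
            result.append("\n".join(chunk[start:start + max_lines]))
    return result
```

In this sketch every top-level keyword line opens a new chunk, so consecutive imports each become their own chunk; a real implementation may choose to group them.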


Important Implementation Details


Interaction with Other Parts of the System


Mermaid Diagram

The following flowchart illustrates the functional structure of `code_parsing.py`, focusing on its key functions and their relationships.

flowchart TD
    A["detect_language_from_extension(path)"] --> B["extract_code_units_from_file(path)"]
    B --> C["get_language(lang_name), get_parser(lang_name)"]
    B --> D["walk(root_node)"]
    B --> E["heuristic_code_chunks(source)"]
    E --> F["Split by keywords and blank lines"]
    E --> G["Split chunks larger than 200 lines"]

Summary

The `code_parsing.py` file is a utility module that turns source code files into semantic units. It uses Tree-sitter parsing when available and falls back to a regex and blank-line heuristic otherwise, so chunk extraction still works for languages and file types that Tree-sitter does not cover. This makes it a useful building block for systems that need structured code analysis, indexing, or transformation.