code_parsing.py
Overview
The `code_parsing.py` module provides functionality to extract meaningful code units or semantic chunks from source code files. It leverages the **Tree-sitter** parsing library to accurately parse and extract syntactic units such as functions, classes, imports, exports, and variable declarations from supported programming languages.
When Tree-sitter support is not available for a given language or parsing fails, the module falls back to a heuristic approach based on regular expressions and blank line separation to segment the source code into logical chunks.
This file is primarily designed to support workflows that require granular code analysis, indexing, or transformation by supplying well-defined code units instead of entire source files.
Detailed Explanation of Components
Functions
detect_language_from_extension(path: str) -> Optional[str]
Detects the programming language of a file based on its file extension.
Parameters:
`path` (`str`): The file path or filename whose language needs to be detected.
Returns:
Optional[str]: The language identifier string recognized by Tree-sitter (e.g., `"python"`, `"javascript"`), or `None` if the extension is unknown.
Description:
Uses the `EXT_LANG_MAP` dictionary (imported from `config`), which maps file extensions (like `.py`, `.js`) to language names. It extracts the file extension from the path, converts it to lowercase, and looks it up.
Example Usage:

    lang = detect_language_from_extension("example.py")
    print(lang)  # Output: python
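
The lookup described above can be sketched as follows. This is a minimal sketch rather than the module's actual code: the contents of `EXT_LANG_MAP` shown here are an illustrative assumption (the real mapping lives in `config`), and `os.path.splitext` is just one plausible way to extract the extension.

```python
import os
from typing import Optional

# Hypothetical subset of the mapping; the real EXT_LANG_MAP is defined in config.
EXT_LANG_MAP = {
    ".py": "python",
    ".js": "javascript",
    ".ts": "typescript",
}


def detect_language_from_extension(path: str) -> Optional[str]:
    """Sketch: return the language name for a path, or None if unknown."""
    # splitext keeps the leading dot, e.g. "src/App.JS" -> ".JS"
    _, ext = os.path.splitext(path)
    # Lowercase before the lookup so ".JS" and ".js" resolve identically.
    return EXT_LANG_MAP.get(ext.lower())
```

Using `dict.get` means a file without an extension, or with an unmapped one, yields `None` instead of raising `KeyError`.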
extract_code_units_from_file(path: str) -> List[str]
Extracts semantic code units (functions, classes, imports, exports, variable declarations) from the source code file at `path`.
Parameters:
`path` (`str`): The path to the source code file.
Returns:
List[str]: A list of strings, each representing a semantic unit or chunk of code extracted from the file.
Description:
Detects the language of the file via `detect_language_from_extension`.
Reads the file content using `read_text_file` (from `file_utils`).
If the language is supported by Tree-sitter:
Initializes the Tree-sitter parser for the language.
Parses the source code into an abstract syntax tree (AST).
Walks the AST recursively, extracting nodes corresponding to key code constructs:
Functions (`function_definition`, `function_declaration`, `method_declaration`, `method_definition`)
Classes (`class_definition`, `class_declaration`)
Imports/Exports (`import_statement`, `import_declaration`, `export_statement`, `export_declaration`)
Variable declarations (`variable_declaration`, `lexical_declaration`, `assignment`, `const_declaration`)
Extracts the source code snippet corresponding to each node and adds it to the result list.
If parsing fails or the language is unsupported:
Falls back to the `heuristic_code_chunks` method to segment the source code heuristically.
Error Handling:
Catches all exceptions raised during Tree-sitter parsing, prints a warning message, and then falls back to the heuristic.
Example Usage:

    units = extract_code_units_from_file("example.js")
    for unit in units:
        print(unit)
        print("---")
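
The recursive walk described above can be sketched as follows. This is a sketch under stated assumptions, not the module's code: `collect_units` and `FakeNode` are hypothetical names introduced for illustration, and `FakeNode` only mimics the `type`, `start_byte`, `end_byte`, and `children` attributes that Tree-sitter nodes expose in Python. The source does not say whether the real `walk` keeps descending into a node it has already collected; this sketch stops at the first match.

```python
from dataclasses import dataclass, field
from typing import List

# Node types named in the description above.
UNIT_NODE_TYPES = {
    "function_definition", "function_declaration",
    "method_declaration", "method_definition",
    "class_definition", "class_declaration",
    "import_statement", "import_declaration",
    "export_statement", "export_declaration",
    "variable_declaration", "lexical_declaration",
    "assignment", "const_declaration",
}


def collect_units(root, source_bytes: bytes) -> List[str]:
    """Collect the source text of every node whose type is in UNIT_NODE_TYPES."""
    units: List[str] = []

    def walk(node) -> None:
        if node.type in UNIT_NODE_TYPES:
            # Offsets are byte offsets, so slice the bytes and then decode.
            snippet = source_bytes[node.start_byte:node.end_byte]
            units.append(snippet.decode("utf-8", errors="replace"))
            return  # assumption: do not descend into a matched node
        for child in node.children:
            walk(child)

    walk(root)
    return units


@dataclass
class FakeNode:
    """Hypothetical stand-in for a Tree-sitter node, used only for the demo."""
    type: str
    start_byte: int
    end_byte: int
    children: List["FakeNode"] = field(default_factory=list)


source = b"import os\n\ndef foo():\n    pass\n"
root = FakeNode("module", 0, len(source), [
    FakeNode("import_statement", 0, 9),
    FakeNode("function_definition", 11, len(source) - 1),
])
units = collect_units(root, source)
```

Slicing the encoded bytes instead of the decoded string keeps the offsets correct when the file contains non-ASCII characters.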
heuristic_code_chunks(source: str) -> List[str]
Fallback method to split source code into chunks using regex heuristics and blank lines.
Parameters:
`source` (`str`): The raw source code text.
Returns:
List[str]: List of code chunks as strings.
Description:
Splits the source code into lines.
Starts a new chunk whenever a line matches certain keywords indicating the start of a code unit:
`def`, `class`, `function`, `func`, `export`, `import`, `public`, `private`
Also splits chunks at blank lines to separate logical blocks.
After initial chunking, splits any chunk exceeding 200 lines into smaller windows of max 200 lines.
Returns the list of non-empty chunks.
Implementation Notes:
This heuristic is language-agnostic and quite simplistic, designed as a fallback when Tree-sitter parsing is not possible.
Example Usage:

    code = '''
    import os

    def foo():
        pass

    class Bar:
        def method(self):
            pass
    '''
    chunks = heuristic_code_chunks(code)
    for c in chunks:
        print(c)
        print("---")
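
The chunking steps listed in the description can be sketched as follows. This is a minimal sketch, not the module's implementation: the exact regular expression is an assumption (in particular, it lets keywords match after leading whitespace), and the name `MAX_CHUNK_LINES` is introduced here for illustration.

```python
import re
from typing import List

MAX_CHUNK_LINES = 200  # window size stated in the description

# Keywords from the description; the precise pattern is an assumption.
UNIT_START = re.compile(
    r"^\s*(def|class|function|func|export|import|public|private)\b"
)


def heuristic_code_chunks(source: str) -> List[str]:
    """Sketch: split source on keyword lines and blank lines, then cap chunk size."""
    chunks: List[List[str]] = []
    current: List[str] = []

    def flush() -> None:
        # Close the chunk being built, if it has any lines.
        if current:
            chunks.append(current.copy())
            current.clear()

    for line in source.splitlines():
        if not line.strip():
            flush()  # a blank line closes the current chunk
            continue
        if UNIT_START.match(line):
            flush()  # a keyword line opens a new chunk
        current.append(line)
    flush()

    result: List[str] = []
    for chunk in chunks:
        # Cut oversized chunks into windows of at most MAX_CHUNK_LINES lines.
        for start in range(0, len(chunk), MAX_CHUNK_LINES):
            result.append("\n".join(chunk[start:start + MAX_CHUNK_LINES]))
    return result
```

Because this pattern also matches an indented `def`, a class header ends up in its own chunk, separate from its methods. That is one concrete way in which a heuristic like this is cruder than a syntax-tree walk.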
Important Implementation Details
Tree-sitter Integration:
The module uses the `tree_sitter_languages` package to fetch the language and parser objects for the detected language. This allows robust syntactic parsing of supported languages.
Recursion for AST Walk:
The function `walk(node)` is defined inside `extract_code_units_from_file` to recursively traverse the AST, collecting nodes of interest based on their type. This approach neatly encapsulates the traversal logic and allows filtering specific code constructs.
Fallback Strategy:
The fallback heuristic ensures the module is robust and can operate on any source code, even if no Tree-sitter grammar is available for the file's language.
Chunk Size Limiting:
Large chunks are split into 200-line windows to keep units manageable and prevent extremely large code snippets which may be hard to process downstream.
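
The fallback strategy can be sketched as a small wrapper that tries the structured path first and switches to the heuristic on any error. This is a sketch under assumptions: `extract_with_fallback`, `failing_parser`, and `blank_line_split` are hypothetical names, and the two stand-in functions only imitate the Tree-sitter path and `heuristic_code_chunks` so that the example runs on its own.

```python
from typing import Callable, List, Optional


def extract_with_fallback(
    source: str,
    lang_name: Optional[str],
    structured: Callable[[str, str], List[str]],
    fallback: Callable[[str], List[str]],
) -> List[str]:
    """Try the structured extractor first; use the heuristic on any failure."""
    if lang_name is not None:
        try:
            return structured(source, lang_name)
        except Exception as exc:  # broad on purpose: any parser error triggers the fallback
            print(f"Warning: structured parsing failed for {lang_name!r}: {exc}")
    # Reached when the language is unknown or the structured path raised.
    return fallback(source)


def failing_parser(source: str, lang_name: str) -> List[str]:
    # Hypothetical stand-in for the Tree-sitter path, e.g. a missing grammar.
    raise LookupError(f"no grammar for {lang_name}")


def blank_line_split(source: str) -> List[str]:
    # Hypothetical stand-in for heuristic_code_chunks.
    return [block for block in source.split("\n\n") if block.strip()]


units = extract_with_fallback("a = 1\n\nb = 2", "cobol", failing_parser, blank_line_split)
```

Passing both extractors in as callables keeps the control flow visible and lets the fallback be exercised without a Tree-sitter installation.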
Interaction with Other Parts of the System
`EXT_LANG_MAP` (from `config`):
Provides the mapping from file extensions to Tree-sitter language names. This mapping is crucial for detecting which language parser to use.
`read_text_file` (from `file_utils`):
Used to read the source code file content as a string.
`tree_sitter_languages` package:
Provides language definitions and parsers for various programming languages to perform syntactic analysis.
Downstream Usage:
The extracted code units can be used by other components of the system for indexing, code summarization, refactoring, or analysis tasks.
Mermaid Diagram
The following flowchart illustrates the functional structure of `code_parsing.py`, focusing on its key functions and their relationships.
flowchart TD
A["detect_language_from_extension(path)"] --> B["extract_code_units_from_file(path)"]
B --> C["get_language(lang_name), get_parser(lang_name)"]
B --> D["walk(root_node)"]
B --> E["heuristic_code_chunks(source)"]
E --> F["Split by keywords & blank lines"]
E --> G["Split large chunks > 200 lines"]
Summary
The `code_parsing.py` file is a utility module that splits source code files into semantic units. It uses Tree-sitter parsing when a grammar is available and falls back to a regex and blank-line heuristic otherwise, so it can extract chunks from source files in any language. Downstream components that need structured code analysis or processing can build on these units.