file_utils.py
Overview
The `file_utils.py` module provides utility functions to discover and read files within specified directories. Its primary purpose is to facilitate file system operations related to source code and documentation files by:
Recursively searching given directories for source code files and markdown documentation files.
Reading the contents of text files as UTF-8 in a single call, ignoring bytes that cannot be decoded.
This module is particularly useful in projects where automated processing or analysis of source code and documentation files is required, such as code documentation generation, static analysis, or project indexing.
Detailed Description of Functions
discover_files(code_dir: str, docs_dir: str) -> Tuple[List[str], List[str]]
Purpose
Discovers source code files within a given code directory and markdown documentation files within a documentation directory.
Parameters
`code_dir` (str): The root directory path where source code files are located.
`docs_dir` (str): The root directory path where markdown documentation files are stored.
Returns
Tuple[List[str], List[str]]: A tuple containing two sorted lists:
The first list includes full paths to source code files found in `code_dir`.
The second list includes full paths to markdown (`*.md`) files found in `docs_dir`.
Behavior and Implementation Details
Uses `os.walk()` to traverse directories recursively.
For source code files, it keeps only files whose extension appears in the external `EXT_LANG_MAP` dictionary imported from the `config` module. Because the filter is a membership check on the extension, this map presumably uses file extensions as keys and language names as values.
Files with unknown or unsupported extensions in the code directory are ignored.
For documentation files, only files ending with `.md` (case-insensitive) are collected.
Returned lists are sorted alphabetically for consistency.
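The behavior described above can be sketched as follows. This is a minimal illustration, not the module's actual source: the contents of `EXT_LANG_MAP`, its keying by dotted lowercase extension, the use of `os.path.splitext`, and the name `discover_files_sketch` are all assumptions made for the example.

```python
import os
from typing import Dict, List, Tuple

# Hypothetical stand-in for config.EXT_LANG_MAP; the real map's contents
# and key format (dotted, lowercase extensions) are assumptions.
EXT_LANG_MAP: Dict[str, str] = {".py": "python", ".js": "javascript"}


def discover_files_sketch(code_dir: str, docs_dir: str) -> Tuple[List[str], List[str]]:
    """Collect source files from code_dir and .md files from docs_dir."""
    code_paths: List[str] = []
    for root, _dirs, files in os.walk(code_dir):
        for name in files:
            ext = os.path.splitext(name)[1]
            if ext in EXT_LANG_MAP:  # unknown or missing extensions are skipped
                code_paths.append(os.path.join(root, name))

    doc_paths: List[str] = []
    for root, _dirs, files in os.walk(docs_dir):
        for name in files:
            if name.lower().endswith(".md"):  # case-insensitive match
                doc_paths.append(os.path.join(root, name))

    # Sorting makes the result independent of the order os.walk yields entries.
    return sorted(code_paths), sorted(doc_paths)
```

Note that the returned paths are built from the directory arguments as given, so a relative `code_dir` yields relative paths in this sketch.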
Usage Example
from file_utils import discover_files
code_files, doc_files = discover_files('/path/to/source', '/path/to/docs')
print("Source code files found:")
for file_path in code_files:
    print(file_path)
print("Documentation files found:")
for doc_path in doc_files:
    print(doc_path)
read_text_file(path: str) -> str
Purpose
Reads the entire content of a text file and returns it as a string.
Parameters
`path` (str): The full path to the text file to be read.
Returns
str: The content of the file as a Unicode string.
Behavior and Implementation Details
Opens the file in read mode with UTF-8 encoding.
Uses `errors="ignore"` to silently drop any bytes that cannot be decoded as UTF-8, so malformed files do not raise `UnicodeDecodeError`. Other errors, such as a missing file, still propagate.
Reads the entire file content at once and returns it.
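A minimal sketch of this reading strategy, assuming nothing beyond what is described above (the name `read_text_file_sketch` is hypothetical), shows what "ignoring" undecodable bytes means in practice:

```python
import os
import tempfile


def read_text_file_sketch(path: str) -> str:
    """Return the file's text, dropping bytes that are not valid UTF-8."""
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        return f.read()


# Write a file containing 0xFF, a byte that is never valid in UTF-8.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"abc\xffdef")
    sample_path = tmp.name

sample_text = read_text_file_sketch(sample_path)
os.remove(sample_path)
print(sample_text)  # the invalid byte is dropped without any error
```

The trade-off is silent data loss: callers cannot tell whether bytes were discarded. Python's `errors="replace"` is an alternative that substitutes U+FFFD for each undecodable byte, which keeps the damage visible.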
Usage Example
from file_utils import read_text_file
content = read_text_file('/path/to/some/file.md')
print(content)
Important Implementation Details
Dependency on `EXT_LANG_MAP`: The function `discover_files` relies on an external dictionary `EXT_LANG_MAP` imported from a `config` module. This map defines which file extensions are considered source code files. The structure and contents of `EXT_LANG_MAP` determine which files are discovered.
File Extension Filtering: The filtering is strict for source files and only recognizes extensions in `EXT_LANG_MAP`. Files without extensions or with unknown extensions are currently ignored.
Encoding Strategy: The `read_text_file` function uses UTF-8 encoding with error ignoring to maximize compatibility with various text files and avoid exceptions due to encoding issues.
Sorting: Both returned file lists from `discover_files` are sorted to provide predictable ordering, which is helpful for further processing or UI display.
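How strict extension filtering treats edge cases depends on how the extension is extracted, which this document does not specify. The sketch below assumes `os.path.splitext` and a map keyed by dotted lowercase extensions; the helper `has_known_extension` and the map contents are hypothetical.

```python
import os
from typing import Dict

# Hypothetical stand-in for config.EXT_LANG_MAP.
EXT_LANG_MAP: Dict[str, str] = {".py": "python"}


def has_known_extension(name: str, ext_map: Dict[str, str]) -> bool:
    """Return True if the file name's extension is a key of ext_map."""
    return os.path.splitext(name)[1] in ext_map


# Names without an extension, and dotfiles, have an empty extension
# according to os.path.splitext, so they never match.
for name in ["main.py", "Makefile", ".gitignore", "MAIN.PY", "notes.txt"]:
    print(name, has_known_extension(name, EXT_LANG_MAP))

# Extending the map changes which files qualify.
extended_map = {**EXT_LANG_MAP, ".txt": "text"}
print("notes.txt", has_known_extension("notes.txt", extended_map))
```

Under these assumptions an uppercase extension such as `.PY` would not match unless the lookup lowercases it first. Whether the real module normalizes case for source files is not stated here; only the `.md` check is documented as case-insensitive.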
Interaction with Other Parts of the System
Config Module: This module depends on the `config` module, specifically on the `EXT_LANG_MAP` variable, to identify valid source code file extensions. Changes in `EXT_LANG_MAP` will directly affect which files are discovered.
File Processing Pipelines: The lists generated by `discover_files` are typically used downstream in processing pipelines that analyze, index, or generate documentation for source code and markdown files.
Text Analysis or Documentation Tools: The `read_text_file` function supports reading content for further parsing or analysis by other components, such as syntax highlighters, linters, or doc generators.
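As an illustration of downstream use, the sketch below builds a simple index that maps each path to its line count. The helper `build_line_index` and its `reader` parameter are hypothetical and not part of `file_utils.py`; in a real pipeline the reader would be `read_text_file` and the paths would come from `discover_files`.

```python
from typing import Callable, Dict, Iterable


def build_line_index(paths: Iterable[str], reader: Callable[[str], str]) -> Dict[str, int]:
    """Map each path to the number of lines in the text returned by reader."""
    return {path: len(reader(path).splitlines()) for path in paths}


# Intended wiring with this module (not executed here):
#   from file_utils import discover_files, read_text_file
#   code_files, doc_files = discover_files("/path/to/source", "/path/to/docs")
#   index = build_line_index(code_files + doc_files, read_text_file)

# Stand-in reader so the example runs without touching the file system.
fake_contents = {"a.py": "import os\nprint(os.name)\n", "README.md": ""}
index = build_line_index(fake_contents, fake_contents.__getitem__)
print(index)
```

Passing the reader as a parameter keeps the indexing step independent of the file system, which makes it straightforward to test.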
Visual Diagram - Flowchart of Main Functions and Their Relationships
flowchart TD
A[Start] --> B["discover_files(code_dir, docs_dir)"]
B --> C[Traverse code_dir]
C -->|For each file| D{"Extension in EXT_LANG_MAP?"}
D -- Yes --> E[Add file path to code_paths]
D -- No --> F[Skip file]
C --> G[Traverse docs_dir]
G -->|For each file| H{"Name ends with .md?"}
H -- Yes --> I[Add file path to doc_paths]
H -- No --> J[Skip file]
E & I --> K[Sort code_paths and doc_paths]
K --> L["Return (code_paths, doc_paths)"]
M["read_text_file(path)"] --> N[Open file with UTF-8 encoding]
N --> O[Read entire content]
O --> P[Return content as string]
Summary
The `file_utils.py` module is a concise and focused utility for discovering source code and markdown documentation files within specified directories and reading text file contents safely. Its reliance on a configurable extension map allows adaptability to different programming languages and project structures. This module acts as a foundational component in systems that require organized access to project files for processing or documentation purposes.