Dimensionality Reduction and Clustering of Code Embeddings
Overview
This module addresses the challenge of analyzing and interpreting high-dimensional semantic embeddings derived from source code chunks. Since embeddings typically exist in high-dimensional vector spaces (e.g., hundreds of dimensions), it is difficult to directly visualize or extract meaningful groupings from them. Dimensionality reduction and clustering techniques reduce complexity and reveal latent structures and relationships within the codebase embeddings.
The main goals are:
Dimensionality Reduction: Project high-dimensional embeddings into lower-dimensional spaces (2D or 3D) for visualization and further analysis.
Clustering: Group semantically similar embeddings into clusters to identify related code chunks, files, or functional components.
This capability supports exploratory analysis, codebase understanding, and identification of semantic patterns across the project.
Core Functionalities and Workflows
Dimensionality Reduction
Dimensionality reduction transforms high-dimensional embedding vectors into a lower-dimensional space while preserving meaningful relationships. This module supports multiple algorithms, each suited for different data characteristics:
PCA (Principal Component Analysis): A linear method that captures directions of maximum variance.
t-SNE (t-Distributed Stochastic Neighbor Embedding): A nonlinear technique emphasizing local neighborhood preservation, useful for visual clustering.
UMAP (Uniform Manifold Approximation and Projection): A nonlinear method balancing local and global structure preservation, efficient on large datasets.
The reduction can be applied in multi-step pipelines (e.g., PCA followed by UMAP) for improved performance and interpretability.
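To make "directions of maximum variance" concrete, here is a minimal NumPy sketch of PCA computed through a singular value decomposition. The function name `pca_project` and the synthetic data are assumptions for illustration only; the module itself relies on library implementations.

```python
import numpy as np

def pca_project(embeddings, n_components):
    """Project rows onto the top principal components (illustrative helper)."""
    X = np.asarray(embeddings, dtype=float)
    X_centered = X - X.mean(axis=0)  # PCA operates on mean-centered data
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return X_centered @ vt[:n_components].T, explained[:n_components]

rng = np.random.default_rng(0)
# Synthetic stand-in for code embeddings: 200 vectors of 64 dimensions
# whose variance is concentrated along the first two axes.
scales = np.concatenate([[10.0, 5.0], np.ones(62)])
embeddings = rng.normal(size=(200, 64)) * scales

projected, ratio = pca_project(embeddings, n_components=2)
print(projected.shape, ratio.round(2))
```

The explained-variance ratios give a rough sense of how much structure a 2D linear projection retains. When they are low, a nonlinear method such as t-SNE or UMAP may show groupings that PCA flattens.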
**Example from `visualizer.py`:**
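import numpy as np
import umap  # provided by the umap-learn package
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def reduce_dimensions(embeddings, method, params):
    """Reduce embeddings with the named method and its parameter dictionary."""
    if method == "PCA":
        reducer = PCA(n_components=params["n_components"], random_state=params["random_state"])
    elif method == "t-SNE":
        # Note: recent scikit-learn releases rename n_iter to max_iter.
        reducer = TSNE(
            n_components=params["n_components"],
            perplexity=params["perplexity"],
            learning_rate=params["learning_rate"],
            n_iter=params["n_iter"],
            random_state=params["random_state"]
        )
    elif method == "UMAP":
        reducer = umap.UMAP(
            n_components=params["n_components"],
            n_neighbors=params["n_neighbors"],
            min_dist=params["min_dist"],
            metric=params["metric"],
            random_state=params["random_state"]
        )
    else:
        # Without this branch, an unknown name would leave `reducer` unbound.
        raise ValueError(f"Unsupported reduction method: {method}")
    return reducer.fit_transform(np.array(embeddings))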
Clustering
Clustering algorithms group embeddings into clusters to identify semantically related code segments. Multiple clustering methods are supported, offering flexibility depending on data distribution and desired cluster shapes:
KMeans: Partitioning into a fixed number of spherical clusters.
Agglomerative Clustering: Hierarchical clustering based on linkage criteria.
Spectral Clustering: Uses graph Laplacian for clustering.
DBSCAN: Density-based clustering identifying clusters of arbitrary shape.
HDBSCAN: Hierarchical density-based clustering for variable density clusters.
OPTICS: Density-based method capturing clustering structure with varying densities.
The module allows tuning of key parameters such as number of clusters, distance metrics, and density thresholds.
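As an illustration of how density thresholds shape the result, the sketch below runs scikit-learn's `DBSCAN` on a tiny synthetic dataset. The points and parameter values are assumptions chosen for clarity, not settings taken from the module.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups plus one isolated point (synthetic stand-in for embeddings).
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
    [20.0, 20.0],
])

# eps is the neighborhood radius; min_samples is the number of points
# (including the point itself) needed to form a dense region.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(points)

n_clusters = len(set(labels) - {-1})  # label -1 marks noise, not a cluster
n_noise = int(np.sum(labels == -1))
print(labels, n_clusters, n_noise)
```

Density-based methods in scikit-learn label unassigned points as `-1`, so any downstream code that counts clusters or assigns colors should treat that label separately.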
**Example from `visualizer.py`:**
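from sklearn.cluster import DBSCAN, KMeans

def apply_clustering(embeddings, method, params):
    """Cluster embeddings with the named method; returns one label per row."""
    if method == "KMeans":
        clusterer = KMeans(n_clusters=params["n_clusters"], random_state=params["random_state"])
    elif method == "DBSCAN":
        clusterer = DBSCAN(eps=params["eps"], min_samples=params["min_samples"], metric=params["metric"])
    # ... other methods ...
    else:
        # Fail clearly instead of hitting an unbound `clusterer` below.
        raise ValueError(f"Unsupported clustering method: {method}")
    return clusterer.fit_predict(embeddings)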
Visualization Pipeline
The typical workflow involves:
Pre-Clustering Reduction: Optionally applying one or more dimensionality reduction steps to embed the data in a manageable space before clustering.
Clustering: Applying a chosen clustering algorithm on the reduced data to assign cluster labels.
Visualization Reduction: Performing a final dimensionality reduction (usually to 2D or 3D) optimized for visualization.
Interactive Plotting: Rendering an interactive scatter plot with points colored by cluster assignment and enriched with metadata hover information.
This multi-stage approach balances computational efficiency and visualization clarity.
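The first three stages can be sketched end to end with scikit-learn alone. This is a minimal sketch under stated assumptions: the synthetic data, the dimension counts, and the choice of PCA and KMeans at each stage are illustrative, and the interactive plotting stage is omitted.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Synthetic stand-in: three groups of 50 vectors in 128 dimensions.
centers = rng.normal(scale=10.0, size=(3, 128))
embeddings = np.vstack([c + rng.normal(size=(50, 128)) for c in centers])

# 1. Pre-clustering reduction to a manageable number of dimensions.
pre_reduced = PCA(n_components=20, random_state=0).fit_transform(embeddings)

# 2. Clustering on the reduced data.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(pre_reduced)

# 3. Final reduction to 2D, used only for plotting.
coords_2d = PCA(n_components=2, random_state=0).fit_transform(pre_reduced)

print(coords_2d.shape, len(set(labels)))
```

Clustering on the 20-dimensional data rather than on the 2D coordinates keeps the cluster assignments independent of whatever distortion the final visualization step introduces.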
Interaction with Other System Components
Input Data: Receives embeddings and corresponding metadata generated by the semantic splitting and embedding computation module (`tools.py` and `main.py`).
Embedding Source: Embeddings are produced by transformer models and stored or loaded from pickled files or the PostgreSQL database.
Visualization Frontend: The `visualizer.py` module integrates dimensionality reduction and clustering with the Streamlit UI, facilitating user interaction for method selection and parameter tuning.
Backend Orchestration: The `main.py` file handles fetching embeddings, optionally computing them, and can invoke visualization functions for exploratory analysis.
This module thus sits between embedding generation and interactive exploration, acting as the analytical core that transforms raw embeddings into interpretable clusters and visualizations.
Design Patterns and Unique Approaches
Modular and Extensible Design
The system uses separate well-defined functions for reduction and clustering, enabling easy addition of new algorithms or parameter options.
Reduction and clustering steps accept method names and parameter dictionaries, allowing dynamic configuration through the UI.
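One way to keep the "method name plus parameter dictionary" interface extensible is a small registry of factories, shown below. This is a hypothetical alternative to an `if`/`elif` chain, not the module's actual structure; `REDUCERS`, `register_reducer`, and `reduce_with` are names invented for this sketch, and `TruncatedSVD` stands in for "a newly added algorithm".

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

def make_pca(p):
    return PCA(n_components=p["n_components"], random_state=p.get("random_state"))

def make_truncated_svd(p):
    return TruncatedSVD(n_components=p["n_components"], random_state=p.get("random_state"))

# Hypothetical registry: method name -> factory taking a parameter dictionary.
REDUCERS = {"PCA": make_pca}

def register_reducer(name, factory):
    """Add a new algorithm without touching the dispatch logic."""
    REDUCERS[name] = factory

def reduce_with(method, params, embeddings):
    if method not in REDUCERS:
        raise ValueError(f"Unknown reduction method: {method!r}")
    return REDUCERS[method](params).fit_transform(np.asarray(embeddings))

register_reducer("TruncatedSVD", make_truncated_svd)

data = np.random.default_rng(0).normal(size=(30, 10))
reduced = reduce_with("TruncatedSVD", {"n_components": 3}, data)
print(reduced.shape)
```

With this layout, a UI selectbox could be populated from `REDUCERS.keys()`, so the list of choices and the dispatch table cannot drift apart.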
Multi-Step Dimensionality Reduction Pipeline
The ability to chain multiple reduction steps (e.g., PCA followed by UMAP) is implemented to improve clustering quality and visualization.
This staged approach can reduce noise and computational cost for complex embeddings.
Integration with Interactive UI
Parameters for all algorithms are exposed interactively via Streamlit sidebar controls, allowing real-time experimentation and fine-tuning.
Errors during reduction or clustering are caught and displayed to the user for better debugging and usability.
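The error-handling behavior can be sketched without any UI dependency. The helper `run_step` and the toy `truncate_rows` function are assumptions for illustration; in a Streamlit app, the returned message would typically be passed to `st.error`.

```python
def run_step(step_name, func, *args, **kwargs):
    """Run one pipeline step and return (result, error_message)."""
    try:
        return func(*args, **kwargs), None
    except Exception as exc:  # broad on purpose: surface any library failure
        return None, f"{step_name} failed: {type(exc).__name__}: {exc}"

def truncate_rows(rows, n_components):
    """Toy stand-in for a reduction step that can reject bad parameters."""
    if n_components > len(rows[0]):
        raise ValueError("n_components exceeds the input dimensionality")
    return [row[:n_components] for row in rows]

ok_result, ok_error = run_step("Reduction", truncate_rows, [[1.0, 2.0, 3.0]], n_components=2)
bad_result, bad_error = run_step("Reduction", truncate_rows, [[1.0, 2.0]], n_components=3)
print(ok_result, ok_error)
print(bad_result, bad_error)
```

Returning the message instead of re-raising lets the app keep rendering its controls, so the user can adjust the offending parameter and try again.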
Key Code Snippets and Their Roles
Loading Data and Preparing Embeddings
The module loads embeddings and metadata from files or example datasets, preparing them for analysis:
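import pickle
import streamlit as st

@st.cache_data
def load_data(embeddings_path=None, metadata_path=None):
    # Excerpt: only the file-based path is shown; both paths must be provided
    # here, and the example-dataset fallback is not part of this snippet.
    with open(embeddings_path, "rb") as f:
        embeddings = pickle.load(f)
    with open(metadata_path, "rb") as f:
        metadata = pickle.load(f)
    return embeddings, metadata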
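A loader of this kind benefits from a consistency check, since every embedding needs exactly one metadata record for the hover information to line up. The sketch below is an assumption-laden illustration: `load_pickled_pair`, the file names, and the metadata layout (a list of dictionaries with a `file` key) are hypothetical.

```python
import os
import pickle
import tempfile

def load_pickled_pair(embeddings_path, metadata_path):
    """Load embeddings and metadata and check that they align one-to-one."""
    with open(embeddings_path, "rb") as f:
        embeddings = pickle.load(f)
    with open(metadata_path, "rb") as f:
        metadata = pickle.load(f)
    if len(embeddings) != len(metadata):
        raise ValueError(
            f"{len(embeddings)} embeddings but {len(metadata)} metadata records"
        )
    return embeddings, metadata

# Round-trip demo with throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    emb_path = os.path.join(tmp, "embeddings.pkl")
    meta_path = os.path.join(tmp, "metadata.pkl")
    with open(emb_path, "wb") as f:
        pickle.dump([[0.1, 0.2], [0.3, 0.4]], f)
    with open(meta_path, "wb") as f:
        pickle.dump([{"file": "a.py"}, {"file": "b.py"}], f)
    loaded_embeddings, loaded_metadata = load_pickled_pair(emb_path, meta_path)

print(len(loaded_embeddings), loaded_metadata[0]["file"])
```

Because `pickle.load` can execute arbitrary code, files from untrusted sources should not be loaded this way.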
Applying Multi-Step Dimensionality Reduction
Users can configure up to three sequential reduction steps, each selectable as PCA or UMAP, with parameter sliders:
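# Collect the configured steps; each step may not exceed the
# dimensionality produced by the previous one.
for step in range(1, max_steps + 1):
    use_step = st.sidebar.checkbox(f"Step {step}: Apply Reduction", value=(step == 1))
    if use_step:
        method, params = configure_reduction_step(step, current_max_components)
        reduction_pipeline.append((method, params))
        current_max_components = params["n_components"]
    else:
        break  # steps are sequential, so stop at the first disabled one

# Apply the steps in order, feeding each output into the next step.
for i, (method, params) in enumerate(reduction_pipeline):
    reduced = reduce_dimensions(current_data, method, params)
    current_data = reduced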
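The constraint tracked by `current_max_components` (each step must not ask for more dimensions than the previous step produced) can also be checked up front, before any computation starts. The function `validate_pipeline` and the input dimensionality of 384 are assumptions for this sketch.

```python
def validate_pipeline(pipeline, input_dim):
    """Check that n_components never grows along the pipeline.

    `pipeline` is a list of (method_name, params) tuples, matching the
    structure of `reduction_pipeline`. Returns the final dimensionality.
    """
    current_max = input_dim
    for index, (method, params) in enumerate(pipeline, start=1):
        n = params["n_components"]
        if not 1 <= n <= current_max:
            raise ValueError(
                f"Step {index} ({method}): n_components={n} "
                f"must be between 1 and {current_max}"
            )
        current_max = n
    return current_max

pipeline = [("PCA", {"n_components": 50}), ("UMAP", {"n_components": 2})]
final_dim = validate_pipeline(pipeline, input_dim=384)
print(final_dim)
```

Validating the whole configuration first avoids running an expensive early step only to fail on a later one.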
Clustering with User-Selected Algorithms and Parameters
Depending on the selected clustering technique, the module dynamically builds the clustering object with parameters, then fits and predicts cluster labels:
if use_clustering:
    clustering_method = st.sidebar.selectbox("Clustering Method", [...])
    clustering_params = {...}  # Set based on method
    cluster_labels = apply_clustering(pre_reduced_data, clustering_method, clustering_params)
Interactive Visualization with Metadata
The reduced embeddings and cluster labels are visualized with hover data showing file metadata and cluster assignments:
fig = px.scatter(
    x=reduced_data[:, 0],
    y=reduced_data[:, 1],
    hover_data=hover_data,
    color=color,
    labels={'x': 'Component 1', 'y': 'Component 2'},
    title="File Overview Embeddings Visualization",
)
st.plotly_chart(fig, use_container_width=True)
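How `hover_data` and `color` are assembled is not shown above. One plausible preparation step is sketched here in plain Python. The function `build_plot_columns` and the metadata layout (dictionaries with a `file` key) are hypothetical. Converting labels to strings is a deliberate choice: Plotly Express treats string values as categories with discrete colors, whereas integer labels would be mapped onto a continuous color scale.

```python
def build_plot_columns(coords, labels, metadata):
    """Assemble aligned columns for a 2D scatter plot (illustrative helper)."""
    if not (len(coords) == len(labels) == len(metadata)):
        raise ValueError("coords, labels, and metadata must have equal length")
    return {
        "x": [float(c[0]) for c in coords],
        "y": [float(c[1]) for c in coords],
        # Density-based methods use -1 for noise; give it a readable name.
        "cluster": ["noise" if int(l) == -1 else f"cluster {int(l)}" for l in labels],
        "file": [m["file"] for m in metadata],
    }

columns = build_plot_columns(
    coords=[(0.0, 1.0), (2.0, 3.0), (9.0, 9.0)],
    labels=[0, 0, -1],
    metadata=[{"file": "a.py"}, {"file": "b.py"}, {"file": "c.py"}],
)
print(columns["cluster"])
```

A dictionary of equal-length columns like this can be passed to Plotly Express as the data argument, with `color="cluster"` and `hover_data=["file"]` selecting the relevant columns.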
Summary of File Roles in This Module
visualizer.py
Implements the interactive Streamlit app that orchestrates loading embeddings, configuring and applying dimensionality reduction and clustering, and rendering visualizations. It exposes user controls for all relevant parameters and handles error reporting and multi-step pipelines.
main.py
Contains backend functions to fetch embeddings and metadata (from database or files), perform preliminary dimensionality reduction and clustering experiments, and create static or exploratory visualizations. It also shows examples of chaining PCA, UMAP, and clustering (e.g., SpectralClustering), illustrating alternative analysis workflows.
Together, these files implement a flexible and user-friendly system for reducing, clustering, and visualizing code embeddings, supporting exploratory data analysis and insight discovery in large codebases.