Dimensionality Reduction and Clustering of Code Embeddings

Overview

This module addresses the challenge of analyzing and interpreting high-dimensional semantic embeddings derived from source code chunks. Since embeddings typically exist in high-dimensional vector spaces (e.g., hundreds of dimensions), it is difficult to directly visualize or extract meaningful groupings from them. Dimensionality reduction and clustering techniques reduce complexity and reveal latent structures and relationships within the codebase embeddings.

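The main goals are to reduce embeddings to a dimensionality that can be clustered and plotted efficiently, to group semantically related code chunks, and to present the result as an interactive visualization.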

This capability supports exploratory analysis, codebase understanding, and identification of semantic patterns across the project.


Core Functionalities and Workflows

Dimensionality Reduction

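Dimensionality reduction transforms high-dimensional embedding vectors into a lower-dimensional space while preserving meaningful relationships. This module supports PCA, t-SNE, and UMAP, each suited to different data characteristics: PCA is a fast linear projection, while t-SNE and UMAP are nonlinear methods that emphasize local neighborhood structure.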

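Reduction steps can be chained into a multi-step pipeline (e.g., PCA followed by UMAP). A cheap linear first step lowers the dimensionality that the slower nonlinear step has to process, which shortens run time and tends to make the final layout easier to interpret.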
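To make a single reduction step concrete, the following is a minimal NumPy sketch of what a PCA projection computes. It is an illustration only, not the module's implementation (the excerpt below delegates to scikit-learn). The helper name `pca_project` and the random stand-in data are assumptions made for this example.

```python
import numpy as np


def pca_project(embeddings, n_components):
    """Project rows of `embeddings` onto their top principal components.

    Hypothetical helper for illustration. Returns the projected data and
    the fraction of total variance captured by each kept component.
    """
    X = np.asarray(embeddings, dtype=float)
    X_centered = X - X.mean(axis=0)  # PCA operates on mean-centered data
    # Rows of Vt are the principal directions, ordered by singular value.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    projected = X_centered @ Vt[:n_components].T
    variance = singular_values ** 2
    explained_ratio = variance[:n_components] / variance.sum()
    return projected, explained_ratio


rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(200, 64))  # stand-in for real embeddings
reduced, ratio = pca_project(fake_embeddings, n_components=10)
print(reduced.shape)  # (200, 10)
```

The explained-variance ratio gives a rough sense of how much structure a given `n_components` keeps, which can help when choosing the size of a first pipeline step.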

**Example from `visualizer.py`:**

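import numpy as np
import umap
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def reduce_dimensions(embeddings, method, params):
    """Reduce `embeddings` with the named method and return the transformed array."""
    if method == "PCA":
        reducer = PCA(n_components=params["n_components"], random_state=params["random_state"])
    elif method == "t-SNE":
        reducer = TSNE(
            n_components=params["n_components"],
            perplexity=params["perplexity"],
            learning_rate=params["learning_rate"],
            n_iter=params["n_iter"],  # renamed to max_iter in scikit-learn 1.5
            random_state=params["random_state"]
        )
    elif method == "UMAP":
        reducer = umap.UMAP(
            n_components=params["n_components"],
            n_neighbors=params["n_neighbors"],
            min_dist=params["min_dist"],
            metric=params["metric"],
            random_state=params["random_state"]
        )
    else:
        # Without this branch an unknown method fails later with an unbound `reducer`.
        raise ValueError(f"Unsupported reduction method: {method}")
    return reducer.fit_transform(np.array(embeddings))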

Clustering

Clustering algorithms group embeddings into clusters to identify semantically related code segments. Several clustering methods are supported, including KMeans and DBSCAN, so the choice can follow the data distribution and the expected cluster shapes.

The module allows tuning of key parameters such as number of clusters, distance metrics, and density thresholds.

**Example from `visualizer.py`:**

from sklearn.cluster import DBSCAN, KMeans


def apply_clustering(embeddings, method, params):
    """Fit the selected clustering algorithm and return one label per embedding."""
    if method == "KMeans":
        clusterer = KMeans(n_clusters=params["n_clusters"], random_state=params["random_state"])
    elif method == "DBSCAN":
        clusterer = DBSCAN(eps=params["eps"], min_samples=params["min_samples"], metric=params["metric"])
    # ... other methods ...
    return clusterer.fit_predict(embeddings)
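The label array returned by `fit_predict` is easier to reason about once it is summarized. The sketch below counts points per cluster and separates out noise, relying on the fact that scikit-learn's DBSCAN assigns the label -1 to noise points. The helper name `summarize_clusters` is an assumption for illustration and is not taken from `visualizer.py`.

```python
from collections import Counter


def summarize_clusters(labels):
    """Return (sizes, noise_count) for a sequence of cluster labels.

    DBSCAN marks noise points with -1; KMeans never emits -1, so
    noise_count is 0 for its output.
    """
    counts = Counter(int(label) for label in labels)
    noise_count = counts.pop(-1, 0)
    return dict(sorted(counts.items())), noise_count


sizes, noise = summarize_clusters([0, 0, 1, -1, 1, 1, -1, 2])
print(sizes)  # {0: 2, 1: 3, 2: 1}
print(noise)  # 2
```

A large noise count after DBSCAN is usually a hint that `eps` is too small or `min_samples` too large for the reduced data.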

Visualization Pipeline

The typical workflow involves:

  1. Pre-Clustering Reduction: Optionally applying one or more dimensionality reduction steps to embed the data in a manageable space before clustering.

  2. Clustering: Applying a chosen clustering algorithm on the reduced data to assign cluster labels.

  3. Visualization Reduction: Performing a final dimensionality reduction (usually to 2D or 3D) optimized for visualization.

  4. Interactive Plotting: Rendering an interactive scatter plot with points colored by cluster assignment and enriched with metadata hover information.

This multi-stage approach balances computational efficiency and visualization clarity.
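Steps 1 to 3 of the workflow can be sketched end to end with scikit-learn. This is a minimal illustration under stated assumptions, not the module's code: the function name `run_pipeline`, the choice of PCA and KMeans for every stage, the default sizes, and the synthetic data are all made up for the example. It also assumes the visualization reduction is applied to the pre-reduced data, which the description above does not specify. Plotting (step 4) is left out.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def run_pipeline(embeddings, pre_dims=10, n_clusters=3, seed=0):
    """Hypothetical helper mirroring steps 1-3 of the workflow."""
    X = np.asarray(embeddings, dtype=float)
    # Step 1: pre-clustering reduction to a manageable number of dimensions.
    pre_reduced = PCA(n_components=pre_dims, random_state=seed).fit_transform(X)
    # Step 2: cluster in the pre-reduced space.
    clusterer = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = clusterer.fit_predict(pre_reduced)
    # Step 3: a separate reduction to 2D, used only for plotting.
    coords_2d = PCA(n_components=2, random_state=seed).fit_transform(pre_reduced)
    return labels, coords_2d


rng = np.random.default_rng(0)
# Three synthetic, well-separated groups standing in for real embeddings.
centers = rng.normal(scale=10.0, size=(3, 32))
fake = np.vstack([center + rng.normal(size=(50, 32)) for center in centers])
labels, coords = run_pipeline(fake)
print(labels.shape, coords.shape)  # (150,) (150, 2)
```

Keeping the clustering input and the plotting coordinates as separate arrays means the 2D layout can be changed without altering cluster assignments.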


Interaction with Other System Components

This module sits between embedding generation and interactive exploration, acting as the analytical core that transforms raw embeddings into interpretable clusters and visualizations.


Design Patterns and Unique Approaches

Modular and Extensible Design

Multi-Step Dimensionality Reduction Pipeline

Integration with Interactive UI


Key Code Snippets and Their Roles

Loading Data and Preparing Embeddings

The module loads embeddings and metadata from files or example datasets, preparing them for analysis:

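import pickle

import streamlit as st


@st.cache_data  # cache the result so Streamlit reruns do not reload the files
def load_data(embeddings_path=None, metadata_path=None):
    # Both paths are required in this excerpt; the example-dataset fallback is not shown.
    with open(embeddings_path, "rb") as f:
        embeddings = pickle.load(f)
    with open(metadata_path, "rb") as f:
        metadata = pickle.load(f)
    return embeddings, metadata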
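Because later stages pair each embedding with its metadata record for hover information, it can help to check that the two loaded objects line up before any reduction runs. The following sketch assumes both are plain sequences of equal length and that each embedding is a sequence of numbers; the helper name `check_alignment` and the sample records are hypothetical.

```python
def check_alignment(embeddings, metadata):
    """Raise ValueError if embeddings and metadata cannot be paired row by row.

    Hypothetical helper. Returns the shared embedding dimensionality
    (0 when there are no embeddings).
    """
    if len(embeddings) != len(metadata):
        raise ValueError(
            f"{len(embeddings)} embeddings but {len(metadata)} metadata records"
        )
    dims = {len(vector) for vector in embeddings}
    if len(dims) > 1:
        raise ValueError(f"inconsistent embedding dimensions: {sorted(dims)}")
    return dims.pop() if dims else 0


dim = check_alignment(
    [[0.1, 0.2], [0.3, 0.4]],
    [{"file": "a.py"}, {"file": "b.py"}],
)
print(dim)  # 2
```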

Applying Multi-Step Dimensionality Reduction

Users can configure up to three sequential reduction steps, each selectable as PCA or UMAP, with parameter sliders:

for step in range(1, max_steps + 1):
    use_step = st.sidebar.checkbox(f"Step {step}: Apply Reduction", value=(step == 1))
    if use_step:
        method, params = configure_reduction_step(step, current_max_components)
        reduction_pipeline.append((method, params))
        current_max_components = params["n_components"]
    else:
        break

# Feed the output of each step into the next one.
for method, params in reduction_pipeline:
    current_data = reduce_dimensions(current_data, method, params)
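The loop above narrows `current_max_components` after every step, which implies that each step may not ask for more components than the previous step produced. A small standalone check of that constraint could look like the sketch below. The function name `validate_pipeline` and the sample dimensions are assumptions for illustration, and the UI may enforce the same rule through its slider limits instead.

```python
def validate_pipeline(pipeline, input_dim):
    """Check that each step reduces to no more dimensions than it receives.

    `pipeline` is a list of (method, params) tuples.
    Returns the final output dimensionality.
    """
    current_dim = input_dim
    for position, (method, params) in enumerate(pipeline, start=1):
        n_components = params["n_components"]
        if not 1 <= n_components <= current_dim:
            raise ValueError(
                f"step {position} ({method}): n_components={n_components} "
                f"must be between 1 and {current_dim}"
            )
        current_dim = n_components
    return current_dim


steps = [("PCA", {"n_components": 50}), ("UMAP", {"n_components": 2})]
print(validate_pipeline(steps, input_dim=768))  # 2
```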

Clustering with User-Selected Algorithms and Parameters

Depending on the selected clustering technique, the module dynamically builds the clustering object with parameters, then fits and predicts cluster labels:

if use_clustering:
    clustering_method = st.sidebar.selectbox("Clustering Method", [...])
    clustering_params = {...}  # Set based on method
    cluster_labels = apply_clustering(pre_reduced_data, clustering_method, clustering_params)
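Since the parameter dictionary is assembled per method, a missing key only surfaces as a `KeyError` inside `apply_clustering`. One way to catch that earlier is sketched below. The required keys are the ones `apply_clustering` reads for KMeans and DBSCAN in the excerpt shown in this document; the names `REQUIRED_PARAMS` and `missing_params` are hypothetical, and other supported methods are deliberately left out because their parameters are not shown.

```python
# Keys read by apply_clustering for the two methods shown in this document.
REQUIRED_PARAMS = {
    "KMeans": {"n_clusters", "random_state"},
    "DBSCAN": {"eps", "min_samples", "metric"},
}


def missing_params(method, params):
    """Return the sorted list of required keys absent from `params`."""
    if method not in REQUIRED_PARAMS:
        raise ValueError(f"unsupported clustering method: {method!r}")
    return sorted(REQUIRED_PARAMS[method] - set(params))


print(missing_params("DBSCAN", {"eps": 0.5}))  # ['metric', 'min_samples']
```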

Interactive Visualization with Metadata

The reduced embeddings and cluster labels are visualized with hover data showing file metadata and cluster assignments:

fig = px.scatter(
    x=reduced_data[:, 0],
    y=reduced_data[:, 1],
    hover_data=hover_data,
    color=color,
    labels={'x': 'Component 1', 'y': 'Component 2'},
    title="File Overview Embeddings Visualization",
)
st.plotly_chart(fig, use_container_width=True)
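The `hover_data` and `color` arguments have to be built from the metadata records and cluster labels first. The sketch below shows one possible way to do that. The function name `build_hover_and_color` and the metadata field names `file_path` and `chunk_type` are assumptions, since the actual metadata schema is not shown here. Labels are converted to strings so that Plotly Express treats clusters as categories with discrete colors rather than as a continuous numeric scale.

```python
def build_hover_and_color(metadata, cluster_labels, fields=("file_path", "chunk_type")):
    """Build column-oriented hover data plus a categorical color column.

    Hypothetical helper; `fields` are illustrative names. Missing keys
    become empty strings, and the label -1 is shown as "noise".
    """
    hover_data = {
        field: [str(record.get(field, "")) for record in metadata]
        for field in fields
    }
    color = [
        "noise" if int(label) == -1 else f"cluster {int(label)}"
        for label in cluster_labels
    ]
    return hover_data, color


meta = [{"file_path": "a.py"}, {"file_path": "b.py", "chunk_type": "function"}]
hover, color = build_hover_and_color(meta, [0, -1])
print(color)  # ['cluster 0', 'noise']
```

Passing a dictionary of equal-length lists as `hover_data` lets `px.scatter` be called with plain arrays for `x` and `y`, without first building a DataFrame.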

Summary of File Roles in This Module

Together, these files implement a flexible and user-friendly system for reducing, clustering, and visualizing code embeddings, supporting exploratory data analysis and insight discovery in large codebases.