Interactive Visualization and Exploration of Codebase Structures

Overview

This module implements an interactive Streamlit web application for exploring and visualizing precomputed embeddings and clustering results for source code files. It provides an interface for inspecting the semantic relationships between code chunks and the clusters they fall into, which helps when analyzing the structure of a codebase.

Dimensionality reduction and clustering parameters can be configured at runtime, so users can tailor the visualization to their data. Hovering over a point in the plot shows the metadata and cluster assignment of the corresponding code chunk.

Core Concepts and Purpose

Functionality and Workflow

Data Loading

Users can upload two pickled files, one containing the embeddings and one containing the metadata. The module loads example data if it is available, or user-provided files via Streamlit's sidebar upload widgets:

embeddings_file = st.sidebar.file_uploader("Upload embeddings pickle file", type=["pkl"])
metadata_file = st.sidebar.file_uploader("Upload metadata pickle file", type=["pkl"])

Loading is cached with `@st.cache_data`, so Streamlit reuses the already loaded data across reruns instead of reading the files again.
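The loading step can be sketched as follows. This is a minimal illustration, not the module's actual code: `load_pickle` is a hypothetical helper name, and the in-memory buffer stands in for the object returned by `st.sidebar.file_uploader`, which is a binary file-like object. In the app, a function like this would presumably be the one decorated with `@st.cache_data`. Note that unpickling executes arbitrary code, so only trusted files should be loaded.

```python
import io
import pickle


def load_pickle(file_obj):
    """Deserialize one pickled object from an open binary file-like object.

    Hypothetical helper: the module's real loader is not shown in the source.
    """
    file_obj.seek(0)  # the buffer may already have been read on an earlier rerun
    return pickle.load(file_obj)


# Stand-in for an uploaded file: two 3-dimensional embeddings (made-up values).
buffer = io.BytesIO(pickle.dumps([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]))
embeddings = load_pickle(buffer)
```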

Dimensionality Reduction Configuration

Two stages of dimensionality reduction are supported:

  1. Pre-Clustering Reduction Pipeline:

    • Users can chain up to three reduction steps (e.g., PCA followed by UMAP).

    • Each step is configurable via sidebar controls for method and parameters.

    • The output of one step feeds as input to the next, progressively reducing dimensionality before clustering.

  2. Visualization Reduction:

    • A final reduction step projects embeddings into 2D or 3D space for plotting.

    • Users select from PCA, t-SNE, or UMAP, with adjustable parameters such as the number of components, perplexity (t-SNE), and number of neighbors and minimum distance (UMAP).
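The chaining described for the pre-clustering pipeline can be sketched as below. This is an illustration under assumptions: `apply_reduction_chain` and the list-of-factories representation are hypothetical, and both steps use scikit-learn's `PCA` only to keep dependencies small. Any reducer exposing `fit_transform`, such as `umap.UMAP` from umap-learn, could take the place of a step.

```python
import numpy as np
from sklearn.decomposition import PCA

MAX_STEPS = 3  # the text above allows chaining up to three reduction steps


def apply_reduction_chain(embeddings, steps):
    """Apply reducers in order; each step's output is the next step's input.

    `steps` is a list of zero-argument callables that build a reducer
    (a hypothetical representation chosen for this sketch).
    """
    if len(steps) > MAX_STEPS:
        raise ValueError(f"at most {MAX_STEPS} reduction steps are supported")
    data = np.asarray(embeddings, dtype=float)
    for make_reducer in steps:
        data = make_reducer().fit_transform(data)
    return data


# Synthetic embeddings: 50 chunks with 32 dimensions each.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 32))

# 32 -> 10 -> 3 dimensions before clustering.
steps = [lambda: PCA(n_components=10), lambda: PCA(n_components=3)]
reduced = apply_reduction_chain(embeddings, steps)
```

The same function could also serve the visualization stage by passing a single step with two or three components.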

Each reduction step validates its parameters and handles errors, so an invalid setting does not crash the app.
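The source does not say which checks the module performs. As one plausible example, requested values can be clamped to what the data allows: scikit-learn's PCA cannot produce more components than min(n_samples, n_features), and its t-SNE requires the perplexity to be smaller than the number of samples. The helper name `clamp_reduction_params` is hypothetical.

```python
def clamp_reduction_params(n_samples, n_features, n_components, perplexity=None):
    """Return (n_components, perplexity) adjusted to fit the data shape.

    Hypothetical helper illustrating one kind of validation; the module's
    actual checks are not shown in the source.
    """
    if n_samples < 2:
        raise ValueError("at least two samples are needed for reduction")

    # Never request more components than the data can provide.
    safe_components = max(1, min(n_components, n_samples, n_features))

    safe_perplexity = None
    if perplexity is not None:
        # t-SNE in scikit-learn requires perplexity < n_samples.
        safe_perplexity = min(float(perplexity), float(n_samples - 1))

    return safe_components, safe_perplexity


# 20 samples with 8 features: both requested values are too large.
components, perplexity = clamp_reduction_params(20, 8, n_components=50, perplexity=30)
```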

Clustering Configuration

Clustering can be toggled on or off. When enabled, users choose from several algorithms, including KMeans, DBSCAN, and HDBSCAN. Each algorithm exposes its relevant parameters via sidebar widgets, e.g., number of clusters for KMeans, epsilon for DBSCAN, and minimum cluster size for HDBSCAN.

The clustering is applied to the embeddings after pre-clustering reduction, allowing flexible exploration of cluster structures at different abstraction levels.
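A dispatch over the selected algorithm might look like the sketch below. The function name, the string keys, and the default parameter values are assumptions made for illustration. Only KMeans and DBSCAN from scikit-learn are shown, and HDBSCAN is left out to keep the example short.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans


def cluster_embeddings(reduced, algorithm, **params):
    """Cluster already-reduced embeddings and return one label per row.

    Hypothetical dispatch; the defaults below are illustrative only.
    """
    if algorithm == "KMeans":
        model = KMeans(
            n_clusters=params.get("n_clusters", 5), n_init=10, random_state=0
        )
    elif algorithm == "DBSCAN":
        model = DBSCAN(
            eps=params.get("eps", 0.5), min_samples=params.get("min_samples", 5)
        )
    else:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    return model.fit_predict(reduced)


# Two well-separated synthetic blobs of 20 points each.
rng = np.random.default_rng(0)
reduced = np.vstack(
    [rng.normal(0.0, 0.1, size=(20, 2)), rng.normal(5.0, 0.1, size=(20, 2))]
)
kmeans_labels = cluster_embeddings(reduced, "KMeans", n_clusters=2)
dbscan_labels = cluster_embeddings(reduced, "DBSCAN", eps=0.5)
```

DBSCAN labels noise points as -1, so a hover entry reading "Cluster -1" would denote an unclustered point rather than a real cluster.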

Visualization

The core output is an interactive scatter plot created with Plotly Express and embedded in the Streamlit app. Each point carries hover data assembled from the metadata, with a cluster label added when clustering is enabled:

hover_data = {
    'file_id': [meta[0] for meta in metadata],
    'filename': [meta[3] for meta in metadata],
    'path': [meta[2] for meta in metadata],
    'page_id': [meta[4] for meta in metadata],
    'overview': [meta[1][:100] + "..." if len(meta[1]) > 100 else meta[1] for meta in metadata]
}

if cluster_labels is not None:
    hover_data['cluster'] = [f"Cluster {label}" for label in cluster_labels]

The plot configuration includes customized hover templates and styling for clarity and usability.
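To make the hover fields concrete, the sketch below wraps the same logic in a function and runs it on two made-up records. The tuple order (file_id, overview, path, filename, page_id) is inferred from the indices used above. The function name and all sample values are hypothetical.

```python
def build_hover_data(metadata, cluster_labels=None, max_len=100):
    """Build per-point hover fields from metadata tuples.

    Each tuple is assumed to be (file_id, overview, path, filename, page_id).
    """
    hover_data = {
        "file_id": [meta[0] for meta in metadata],
        "filename": [meta[3] for meta in metadata],
        "path": [meta[2] for meta in metadata],
        "page_id": [meta[4] for meta in metadata],
        # Overviews longer than max_len are cut and marked with an ellipsis.
        "overview": [
            meta[1][:max_len] + "..." if len(meta[1]) > max_len else meta[1]
            for meta in metadata
        ],
    }
    if cluster_labels is not None:
        hover_data["cluster"] = [f"Cluster {label}" for label in cluster_labels]
    return hover_data


# Made-up records for illustration only.
metadata = [
    (1, "x" * 150, "src/app.py", "app.py", 10),
    (2, "Short overview", "src/util.py", "util.py", 11),
]
hover = build_hover_data(metadata, cluster_labels=[0, -1])
```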

User Interaction Flow

  1. Upload Data: User uploads embeddings and metadata or chooses example data.

  2. Configure Pre-Reduction: Optionally chain multiple dimensionality reduction steps to preprocess data before clustering.

  3. Configure Visualization Reduction: Select method and parameters for final 2D/3D projection.

  4. Configure Clustering: Enable clustering and select algorithm and parameters.

  5. Generate Visualization: The system processes data as per configuration and renders the interactive scatter plot.

  6. Explore: Users hover over points to inspect metadata and cluster membership, gaining insights into codebase semantic structure.

Module Interactions and Relationships

Design Patterns and Approaches


This module is the project's interactive frontend: it turns precomputed embeddings and clustering results into visualizations for understanding source code semantics and structure.