Interactive Visualization and Exploration of Codebase Structures

Overview

This module implements an interactive web application for exploring and visualizing precomputed embeddings of source code files, together with clusterings computed from them. It provides an interface for inspecting semantic relationships and cluster groupings among code chunks, which helps when analyzing how a codebase is organized.

Dimensionality reduction and clustering parameters can be changed at runtime, so users can adapt the visualization to their data. Hovering over a point in the plot shows the metadata and cluster assignment of the corresponding code chunk.

Core Concepts and Purpose

Interactive Exploration: The module offers a graphical user interface where users can load embeddings and metadata, adjust reduction and clustering algorithms and parameters, and immediately see the impact on the visualization.
Dimensionality Reduction: High-dimensional embeddings (often hundreds of dimensions) are projected into 2D or 3D space for visual interpretation using techniques like PCA, t-SNE, and UMAP.
Clustering: Various clustering algorithms identify groups of semantically similar code chunks, helping to uncover latent structure and relationships within the codebase.
Rich Metadata Display: Hover tooltips on visualization points show file-level metadata such as filename, path, overview snippet, and cluster assignment, bridging the gap between abstract embeddings and concrete source code.
Flexibility: Multiple reduction steps can be chained, allowing complex preprocessing pipelines before final visualization, and clustering can be toggled on or off.

Functionality and Workflow

Data Loading

Users can upload pickled files containing:

Embeddings: Numeric vector representations of source code chunks.
Metadata: Corresponding descriptive information for each embedding (e.g., file ID, filename, file path, page ID, overview snippet).

The module loads example data when it is available, or user-provided files via Streamlit's sidebar upload widgets:

import streamlit as st
# Each uploader returns None until the user provides a .pkl file.
embeddings_file = st.sidebar.file_uploader("Upload embeddings pickle file", type=["pkl"])
metadata_file = st.sidebar.file_uploader("Upload metadata pickle file", type=["pkl"])

Loaded data is cached with `@st.cache_data`, so the pickle files are not re-read each time Streamlit reruns the script after a widget change.
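A minimal sketch of the loading step, assuming each uploaded file is a pickled sequence and that embeddings and metadata are aligned by index. The helper name `load_pickled_pair` is hypothetical; in the app, a function like this would be the natural place for the `@st.cache_data` decorator.

```python
import pickle


def load_pickled_pair(embeddings_bytes: bytes, metadata_bytes: bytes):
    """Unpickle embeddings and metadata and check that they line up.

    Hypothetical helper, not the module's own loader.
    Only unpickle files that come from a trusted source.
    """
    embeddings = pickle.loads(embeddings_bytes)
    metadata = pickle.loads(metadata_bytes)
    # One metadata row is expected per embedding vector.
    if len(embeddings) != len(metadata):
        raise ValueError(
            f"{len(embeddings)} embeddings but {len(metadata)} metadata rows"
        )
    return embeddings, metadata
```

With Streamlit's upload widgets, the raw bytes would come from `embeddings_file.getvalue()` and `metadata_file.getvalue()`.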

Dimensionality Reduction Configuration

Two stages of dimensionality reduction are supported:

Pre-Clustering Reduction Pipeline:
- Users can chain up to three reduction steps (e.g., PCA followed by UMAP).
- Each step is configurable via sidebar controls for method and parameters.
- The output of one step feeds as input to the next, progressively reducing dimensionality before clustering.
Visualization Reduction:
- A final reduction step projects embeddings into 2D or 3D space for plotting.
- Users select from PCA, t-SNE, or UMAP with adjustable parameters such as number of components, perplexity (t-SNE), neighbors and min distance (UMAP).
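The chaining described above can be sketched as follows. This is an illustration under stated assumptions, not the module's code: the `steps` config format and the `run_reduction_pipeline` name are hypothetical, and only PCA is wired in to keep the example small (t-SNE and UMAP steps would slot into the same dispatch).

```python
import numpy as np
from sklearn.decomposition import PCA


def run_reduction_pipeline(embeddings, steps, max_steps=3):
    """Apply reduction steps in order; each step's output feeds the next.

    `steps` is a hypothetical config format: a list of
    {"method": str, "n_components": int} dicts.
    """
    if len(steps) > max_steps:
        raise ValueError(f"At most {max_steps} reduction steps are supported")
    data = np.asarray(embeddings, dtype=float)
    for step in steps:
        if step["method"] == "PCA":
            reducer = PCA(n_components=step["n_components"], random_state=0)
        else:
            raise ValueError(f"Method not wired into this sketch: {step['method']}")
        data = reducer.fit_transform(data)
    return data


rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 64))  # random stand-in for real embeddings
reduced = run_reduction_pipeline(
    embeddings,
    [{"method": "PCA", "n_components": 16}, {"method": "PCA", "n_components": 4}],
)
print(reduced.shape)  # (200, 4)
```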

Reduction parameters are validated before use, and failures are caught and reported to the user rather than crashing the app.
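As an illustration of the kind of validation involved, the sketch below clamps a few parameters to what the data can support. The function name and the exact rules are assumptions rather than the module's code. The constraints themselves are general: PCA cannot return more components than min(n_samples, n_features), and scikit-learn's t-SNE requires the perplexity to be smaller than the number of samples.

```python
def clamp_reduction_params(method, params, n_samples, n_features):
    """Return a copy of `params` adjusted to what the data can support."""
    if n_samples < 3:
        raise ValueError("Need at least 3 samples for a meaningful projection")
    safe = dict(params)
    if method == "PCA":
        # PCA cannot produce more components than min(n_samples, n_features).
        limit = min(n_samples, n_features)
        safe["n_components"] = max(1, min(safe.get("n_components", 2), limit))
    elif method == "t-SNE":
        # Perplexity must stay strictly below the number of samples.
        safe["perplexity"] = min(safe.get("perplexity", 30.0), n_samples - 1.0)
    elif method == "UMAP":
        # A point cannot have more neighbors than there are other points.
        safe["n_neighbors"] = max(2, min(safe.get("n_neighbors", 15), n_samples - 1))
    else:
        raise ValueError(f"Unknown reduction method: {method}")
    return safe
```

For example, requesting 50 PCA components on a 10-sample dataset would be reduced to 10 components instead of raising an error inside the reducer.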

Clustering Configuration

Clustering can be toggled on or off. When enabled, users choose from several algorithms:

KMeans
Agglomerative Clustering
Spectral Clustering
DBSCAN
HDBSCAN
OPTICS

Each algorithm exposes relevant parameters via sidebar widgets, e.g., number of clusters for KMeans, epsilon for DBSCAN, minimum cluster size for HDBSCAN.

The clustering is applied to the embeddings after pre-clustering reduction, allowing flexible exploration of cluster structures at different abstraction levels.
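A sketch of how the algorithm choice might be dispatched to scikit-learn estimators. The `cluster_embeddings` name and its keyword arguments are hypothetical, only three of the six algorithms are wired in, and the fallback values mirror scikit-learn's own defaults rather than the app's sidebar defaults.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans


def cluster_embeddings(data, algorithm, **params):
    """Return one integer label per row; DBSCAN marks noise points with -1."""
    if algorithm == "KMeans":
        model = KMeans(n_clusters=params.get("n_clusters", 8), n_init=10, random_state=0)
    elif algorithm == "Agglomerative":
        model = AgglomerativeClustering(n_clusters=params.get("n_clusters", 2))
    elif algorithm == "DBSCAN":
        model = DBSCAN(eps=params.get("eps", 0.5), min_samples=params.get("min_samples", 5))
    else:
        raise ValueError(f"Algorithm not wired into this sketch: {algorithm}")
    return model.fit_predict(data)


# Two well-separated synthetic blobs stand in for pre-reduced embeddings.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(30, 5))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(30, 5))
data = np.vstack([blob_a, blob_b])
labels = cluster_embeddings(data, "KMeans", n_clusters=2)
```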

Visualization

The core output is an interactive scatter plot created with Plotly Express, embedded in the Streamlit app. Features include:

Points Representing Embeddings: Positioned according to the chosen dimensionality reduction.
Color Coding: Points colored by cluster assignment or default grouping.
Hover Information: Displays metadata fields such as:
- File ID
- Filename
- File path
- Page ID
- Overview snippet (truncated for brevity)
- Cluster label (if clustering applied)

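# Each metadata row is indexed as (file_id, overview, path, filename, page_id).
hover_data = {
    'file_id': [meta[0] for meta in metadata],
    'filename': [meta[3] for meta in metadata],
    'path': [meta[2] for meta in metadata],
    'page_id': [meta[4] for meta in metadata],
    # Truncate overviews longer than 100 characters so tooltips stay readable.
    'overview': [meta[1][:100] + "..." if len(meta[1]) > 100 else meta[1] for meta in metadata]
}

# Add the cluster field only when clustering is enabled.
if cluster_labels is not None:
    hover_data['cluster'] = [f"Cluster {label}" for label in cluster_labels]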
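One detail worth handling when building the hover text: DBSCAN, HDBSCAN, and OPTICS assign the label -1 to points they treat as noise, so a tooltip reading "Cluster -1" would be misleading. The helpers below are a small sketch with hypothetical names (`truncate`, `cluster_caption`); they are not part of the module.

```python
def truncate(text, limit=100):
    """Shorten text to `limit` characters, marking the cut with an ellipsis."""
    return text[:limit] + "..." if len(text) > limit else text


def cluster_caption(label):
    """Readable cluster label; -1 is the noise label of density-based methods."""
    return "Noise" if label == -1 else f"Cluster {label}"


labels = [0, 1, -1]
print([cluster_caption(label) for label in labels])  # ['Cluster 0', 'Cluster 1', 'Noise']
print(len(truncate("x" * 250)))  # 103
```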

The plot configuration includes customized hover templates and styling for clarity and usability.

User Interaction Flow

Upload Data: User uploads embeddings and metadata or chooses example data.
Configure Pre-Reduction: Optionally chain multiple dimensionality reduction steps to preprocess data before clustering.
Configure Visualization Reduction: Select method and parameters for final 2D/3D projection.
Configure Clustering: Enable clustering and select algorithm and parameters.
Generate Visualization: The system processes data as per configuration and renders the interactive scatter plot.
Explore: Users hover over points to inspect metadata and cluster membership, gaining insights into codebase semantic structure.

Module Interactions and Relationships

Input Data: This module consumes embeddings and metadata generated by earlier processing steps (semantic splitting and embedding computation).
Dimensionality Reduction and Clustering Libraries: Uses algorithms from the scikit-learn, umap-learn, and hdbscan libraries to perform reduction and clustering.
Visualization Backend: Leverages Plotly Express for rendering interactive plots and Streamlit for web UI components.
File Management: Loads data from local pickle files and from temporary files created for user uploads, and removes the temporary files afterwards.
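The temporary-file handling for uploads might look like the sketch below. It assumes the upload is available as raw bytes (Streamlit's uploaded file objects provide `getvalue()`); the helper name `load_upload_via_tempfile` is hypothetical.

```python
import os
import pickle
import tempfile


def load_upload_via_tempfile(upload_bytes: bytes):
    """Write uploaded bytes to a temp file, load it, and always remove the file."""
    tmp = tempfile.NamedTemporaryFile(suffix=".pkl", delete=False)
    try:
        tmp.write(upload_bytes)
        tmp.close()  # close before reopening so this also works on Windows
        with open(tmp.name, "rb") as fh:
            return pickle.load(fh)
    finally:
        tmp.close()  # harmless if already closed
        if os.path.exists(tmp.name):
            os.remove(tmp.name)
```

The `finally` block runs whether or not unpickling succeeds, so a malformed upload does not leave a stray file behind.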

Design Patterns and Approaches

Modular Pipeline: The separation of pre-clustering reduction, clustering, and visualization reduction steps allows flexible experimentation and progressive refinement of data representations.
User-Centric Configuration: Extensive use of sidebar controls with sensible defaults enables both novice and expert users to tailor analyses without code changes.
Caching for Performance: Data loading is cached to minimize latency when adjusting parameters repeatedly.
Robust Error Handling: Try-except blocks and user feedback mechanisms ensure graceful handling of invalid inputs or processing failures.
Interactive Feedback: Use of spinners and status messages keeps users informed of ongoing operations and results.
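The error-handling pattern can be sketched without any UI code. The `run_step` helper is hypothetical: it converts any exception from a processing step into a message, which the app could then pass to a Streamlit status call such as `st.error`.

```python
def run_step(name, func, *args, **kwargs):
    """Run one processing step; return (result, None) or (None, message)."""
    try:
        return func(*args, **kwargs), None
    except Exception as exc:  # broad on purpose: any failure becomes a user message
        return None, f"{name} failed: {exc}"


result, error = run_step("Clustering", lambda n: 10 / n, 0)
if error is not None:
    print(error)  # in the app: st.error(error)
```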

This module is the project's interactive frontend: it turns the embedding and clustering computations into visualizations that help users understand the semantics and structure of source code.