visualizer.py


Overview

`visualizer.py` is an interactive Streamlit web application for exploring, analyzing, and visualizing high-dimensional embeddings of source code chunks. It lets users (developers, data scientists, or researchers) explore semantic relationships and cluster structure within a codebase's embeddings.

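The file provides functionality to:

  • Load embeddings and metadata from pickle files.

  • Reduce dimensionality with PCA, t-SNE, or UMAP, including up to three sequential pre-clustering reduction steps (PCA or UMAP).

  • Cluster the embeddings with one of six algorithms.

  • Render an interactive Plotly scatter plot with cluster coloring and metadata hover information.

This supports exploratory analysis of code semantics and structure through visual patterns and clusters.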


Detailed Documentation of Components

Imports and Configuration


Function: load_data(embeddings_path=None, metadata_path=None)

Loads embeddings and metadata from pickle files.
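
As a rough illustration of what loading two pickle files can look like, here is a minimal sketch. The name `load_data_sketch` and the choice to return `(None, None)` when a path is missing are assumptions made for this example; the actual `load_data` may handle defaults and errors differently.

```python
import pickle
from pathlib import Path


def load_data_sketch(embeddings_path=None, metadata_path=None):
    """Return (embeddings, metadata), or (None, None) if either file is unavailable.

    Hypothetical stand-in for load_data; the fallback behaviour is an assumption.
    """
    if embeddings_path is None or metadata_path is None:
        return None, None

    emb_file = Path(embeddings_path)
    meta_file = Path(metadata_path)
    if not (emb_file.is_file() and meta_file.is_file()):
        return None, None

    # Only unpickle files you trust: pickle can execute arbitrary code on load.
    with emb_file.open("rb") as f:
        embeddings = pickle.load(f)
    with meta_file.open("rb") as f:
        metadata = pickle.load(f)
    return embeddings, metadata
```

Returning a sentinel pair instead of raising lets a caller show an informational message when no data is available, which matches the "Shows info if data is missing" step described for main().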


Function: reduce_dimensions(embeddings, method, params)

Applies dimensionality reduction to embeddings using the specified method and parameters.
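
A dispatch on the method name is one plausible way to implement this. The sketch below assumes scikit-learn for PCA and t-SNE and the umap-learn package for UMAP; the source does not confirm which libraries or which method strings (other than "PCA", which appears in the example) are used, so treat `reduce_dimensions_sketch` and the "t-SNE"/"UMAP" keys as hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def reduce_dimensions_sketch(embeddings, method, params):
    """Project embeddings with the named method; params are passed to the reducer."""
    data = np.asarray(embeddings, dtype=float)

    if method == "PCA":
        reducer = PCA(**params)
    elif method == "t-SNE":
        reducer = TSNE(**params)
    elif method == "UMAP":
        # Imported lazily so the sketch still runs without umap-learn installed.
        import umap

        reducer = umap.UMAP(**params)
    else:
        raise ValueError(f"Unknown reduction method: {method!r}")

    # All three reducers expose fit_transform, so the call site stays uniform.
    return reducer.fit_transform(data)
```

Passing `params` straight through keeps the function small, at the cost of surfacing the library's own error if a parameter name is not valid for the chosen method.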


Function: apply_clustering(embeddings, method, params)

Clusters embeddings using the specified clustering algorithm and parameters.
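
The following sketch shows one way such a function could be structured, assuming the algorithm names map to scikit-learn classes of the same name. HDBSCAN is left out because the source does not say which implementation is used. `apply_clustering_sketch`, `CLUSTERERS`, and `count_clusters` are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn.cluster import (
    DBSCAN,
    OPTICS,
    AgglomerativeClustering,
    KMeans,
    SpectralClustering,
)

# Assumed mapping from method name to scikit-learn estimator class.
CLUSTERERS = {
    "KMeans": KMeans,
    "AgglomerativeClustering": AgglomerativeClustering,
    "SpectralClustering": SpectralClustering,
    "DBSCAN": DBSCAN,
    "OPTICS": OPTICS,
}


def apply_clustering_sketch(embeddings, method, params):
    """Return one integer cluster label per row of embeddings."""
    if method not in CLUSTERERS:
        raise ValueError(f"Unknown clustering method: {method!r}")
    model = CLUSTERERS[method](**params)
    return model.fit_predict(np.asarray(embeddings, dtype=float))


def count_clusters(labels):
    """Count distinct clusters, ignoring the noise label -1 used by density-based methods."""
    return len({int(label) for label in labels} - {-1})
```

Excluding -1 matters when reporting the number of clusters found, since DBSCAN and OPTICS assign that label to points they treat as noise.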


Function: visualize_data(reduced_data, cluster_labels, metadata)

Creates an interactive Plotly scatter plot of embeddings with cluster coloring and metadata hover info.
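
To make the idea concrete, here is a minimal sketch that builds hover strings from metadata and draws a 2D or 3D scatter with `plotly.graph_objects`. The metadata layout (a list of dicts with `file_path` and `chunk_index` keys) is purely an assumption for this example, as are the function names; the real schema is not described in this document.

```python
def build_hover_text(metadata):
    """One hover string per point. The keys used here are hypothetical."""
    texts = []
    for item in metadata:
        path = item.get("file_path", "<unknown>")
        chunk = item.get("chunk_index", "?")
        texts.append(f"{path} (chunk {chunk})")
    return texts


def visualize_data_sketch(reduced_data, cluster_labels, metadata):
    """Return a Plotly figure for 2- or 3-component data, colored by cluster label."""
    # Transpose rows of coordinates into one list per axis.
    columns = [list(col) for col in zip(*reduced_data)]
    if len(columns) not in (2, 3):
        raise ValueError("reduced_data must have 2 or 3 components per point")

    import plotly.graph_objects as go  # lazy import: only needed when plotting

    if cluster_labels is None:
        cluster_labels = [0] * len(columns[0])

    common = dict(
        mode="markers",
        text=build_hover_text(metadata),
        hoverinfo="text",
        marker=dict(
            color=[int(c) for c in cluster_labels],
            colorscale="Viridis",
            size=5,
        ),
    )
    if len(columns) == 3:
        trace = go.Scatter3d(x=columns[0], y=columns[1], z=columns[2], **common)
    else:
        trace = go.Scatter(x=columns[0], y=columns[1], **common)
    return go.Figure(data=[trace])
```

In a Streamlit app the returned figure would typically be shown with `st.plotly_chart(fig)`.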


Function: configure_reduction_step(step_num, max_components)

Generates UI controls in the Streamlit sidebar for configuring a dimensionality reduction step.
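
The sketch below illustrates how such controls might be generated. It takes the widget container as an explicit `ui` argument (for example `st.sidebar`) so the logic can be exercised without a running Streamlit session; that extra parameter, the widget labels, the default of 50 components, and the returned dictionary shape are all assumptions, not the signature documented above.

```python
def configure_reduction_step_sketch(step_num, max_components, ui):
    """Render controls for one reduction step on `ui` and return its settings.

    `ui` is any object with Streamlit-style checkbox/selectbox/slider methods,
    such as `st.sidebar`. Returns None when the step is disabled.
    """
    enabled = ui.checkbox(
        f"Enable reduction step {step_num}",
        value=False,
        key=f"reduction_{step_num}_enabled",
    )
    if not enabled:
        return None

    method = ui.selectbox(
        f"Step {step_num} method",
        ["PCA", "UMAP"],
        key=f"reduction_{step_num}_method",
    )

    if max_components > 2:
        n_components = ui.slider(
            f"Step {step_num} components",
            min_value=2,
            max_value=max_components,
            value=min(50, max_components),
            key=f"reduction_{step_num}_components",
        )
    else:
        # A slider needs min_value < max_value, so skip it when there is no range.
        n_components = max_components

    return {"method": method, "params": {"n_components": n_components}}
```

Under these assumptions, a call inside the app could look like `configure_reduction_step_sketch(1, embeddings.shape[1], st.sidebar)`. Giving every widget a key derived from `step_num` avoids duplicate-widget conflicts when the same controls are rendered for several steps.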


Main Application: main()

The main Streamlit app function that orchestrates the full workflow:

  1. UI Title and Sidebar Setup:

    • Upload widgets for embeddings and metadata files.

    • Option to use example data if available.

  2. Data Loading:

    • Loads embeddings and metadata from files or example data.

    • Handles temporary file saving/removal for uploaded files.

    • Shows an info message if data is missing.

  3. Pre-Clustering Dimensionality Reduction Pipeline:

    • User can enable/disable up to 3 sequential reduction steps (PCA or UMAP).

    • Each step is configured with configure_reduction_step.

    • Applies each step in sequence, updating the data and providing feedback.

  4. Visualization Dimensionality Reduction:

    • Selects from PCA, t-SNE, or UMAP.

    • Configures parameters (e.g., perplexity, learning rate, iterations for t-SNE; neighbors, min_dist, metric for UMAP).

    • Allows 2 or 3 components for visualization.

  5. Clustering Configuration:

    • Optionally apply clustering.

    • Select clustering method from six algorithms: KMeans, AgglomerativeClustering, SpectralClustering, DBSCAN, HDBSCAN, OPTICS.

    • Configure parameters specific to the selected clustering method:

      • For KMeans, AgglomerativeClustering, SpectralClustering: number of clusters and random state; AgglomerativeClustering additionally takes a metric and linkage method.

      • For DBSCAN: epsilon, min samples, metric.

      • For HDBSCAN: min cluster size, min samples, metric.

      • For OPTICS: min samples, metric.

  6. Processing and Visualization:

    • Applies clustering on pre-reduced data if enabled, reporting the number of clusters found.

    • Applies final dimensionality reduction on the original embeddings for visualization.

    • Calls visualize_data to generate the interactive Plotly scatter plot.

    • Displays plot or error messages accordingly.

  7. Error Handling and Feedback:

    • Uses Streamlit spinners and success, warning, error, and info messages to guide the user.

  8. Entry Point:

    • Runs main() if the script is executed directly.
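
The data flow in steps 3 to 6 can be sketched as follows: pre-clustering steps are chained, clustering runs on the pre-reduced data, and the visualization reduction starts again from the original embeddings. For brevity this sketch fixes the methods to scikit-learn's PCA and KMeans; `run_pipeline_sketch` and its parameters are hypothetical and only illustrate the ordering, not the app's actual code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def run_pipeline_sketch(embeddings, pre_steps, n_clusters, vis_components=2):
    """Return (pre_reduced, labels, plot_coords) for the given embeddings.

    pre_steps is a list of component counts, one per sequential PCA step.
    """
    original = np.asarray(embeddings, dtype=float)

    # Pre-clustering pipeline: each step consumes the previous step's output.
    reduced = original
    for n_components in pre_steps:
        reduced = PCA(n_components=n_components, random_state=42).fit_transform(reduced)

    # Clustering runs on the pre-reduced data.
    labels = KMeans(n_clusters=n_clusters, random_state=42, n_init=10).fit_predict(reduced)

    # The visualization reduction is applied to the original embeddings,
    # independently of the pre-clustering steps.
    plot_coords = PCA(n_components=vis_components, random_state=42).fit_transform(original)
    return reduced, labels, plot_coords
```

Because the labels and the plot coordinates are both indexed by the original row order, they can be combined in the scatter plot even though they were computed from different representations.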


Important Implementation Details and Algorithms


Interaction with Other System Components


Usage Summary

To use this tool:

  1. Launch the app (streamlit run visualizer.py).

  2. Upload embeddings and metadata pickle files or select example data if available.

  3. Configure optional multi-step pre-clustering dimensionality reduction in the sidebar, applying up to three sequential PCA or UMAP steps.

  4. Choose a visualization dimensionality reduction method (PCA, t-SNE, or UMAP) and configure its parameters.

  5. Optionally enable clustering and configure clustering parameters, selecting from six clustering algorithms.

  6. View the interactive scatter plot showing embeddings colored by cluster and enriched with metadata.

  7. Hover over points to inspect detailed file information.

  8. Adjust parameters to explore different views and clusterings dynamically.


Example

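# Example: reduce to 10 dimensions with PCA, cluster with KMeans (5 clusters),
# then project to 2 dimensions for plotting.

import streamlit as st

from visualizer import load_data, reduce_dimensions, apply_clustering, visualize_data

# Load embeddings and metadata
embeddings, metadata = load_data("embeddings.pkl", "metadata.pkl")

# Reduce dimensions with PCA before clustering
reduced_embeddings = reduce_dimensions(embeddings, "PCA", {"n_components": 10, "random_state": 42})

# Cluster with KMeans on the pre-reduced data
cluster_labels = apply_clustering(reduced_embeddings, "KMeans", {"n_clusters": 5, "random_state": 42})

# Reduce the original embeddings to 2 components, since the plot uses 2 or 3
plot_data = reduce_dimensions(embeddings, "PCA", {"n_components": 2, "random_state": 42})

# Visualize
fig = visualize_data(plot_data, cluster_labels, metadata)
st.plotly_chart(fig)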

Summary

`visualizer.py` is a user-configurable visualization and clustering tool for high-dimensional code embeddings. Within the overall system, it turns raw embeddings into visual patterns and cluster structures that support semantic understanding of large codebases.

