visualizer.py
Overview
`visualizer.py` is an interactive Streamlit web application designed for the exploration, analysis, and visualization of high-dimensional embeddings representing source code chunks. Its core purpose is to enable users—such as developers, data scientists, or researchers—to intuitively explore semantic relationships and cluster structures within codebase embeddings.
The file provides functionality to:
Load embeddings and corresponding metadata from uploaded files or example datasets.
Apply configurable multi-step dimensionality reduction pipelines (using PCA, t-SNE, UMAP) to reduce embedding dimensionality.
Perform clustering on embeddings with multiple algorithms (e.g., KMeans, Agglomerative Clustering, Spectral Clustering, DBSCAN, HDBSCAN, OPTICS).
Visualize embeddings interactively in 2D or 3D scatter plots with rich metadata shown on hover.
Provide a user-friendly UI for parameter tuning and immediate feedback on the effects of different reduction and clustering settings.
This facilitates exploratory data analysis and insight generation into code semantics and structure through visual patterns and clusters.
Detailed Documentation of Components
Imports and Configuration
Standard library: `os`, `pickle`.
Numerical computing: `numpy`.
Visualization and UI: `streamlit`, `plotly.express`.
Dimensionality reduction: `sklearn.manifold.TSNE`, `sklearn.decomposition.PCA`, `umap`.
Clustering: `KMeans`, `AgglomerativeClustering`, `SpectralClustering`, `DBSCAN` (from `sklearn`), `HDBSCAN`, `OPTICS`.
The Streamlit page is configured for wide layout and titled "Embeddings Visualizer".
Function: load_data(embeddings_path=None, metadata_path=None)
Loads embeddings and metadata from pickle files.
Parameters:
`embeddings_path` (str or None): Path to the pickle file containing embeddings.
`metadata_path` (str or None): Path to the pickle file containing metadata.
Returns: Tuple `(embeddings, metadata)`, where each element is loaded from the respective file or is `None` if loading fails.
Caching: Uses `@st.cache_data` to cache results for faster reloads.
Usage Example:
`embeddings, metadata = load_data("embeddings.pkl", "metadata.pkl")`
Details:
Checks that each file exists and catches load errors.
If a file is missing or invalid, shows a Streamlit error message.
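The loading behavior described above can be sketched as follows. This is a minimal illustration rather than the file's actual code: the helper name `load_pickle_pair` is hypothetical, and errors are collected in a list instead of being shown through Streamlit so that the snippet runs on its own.

```python
import os
import pickle


def load_pickle_pair(embeddings_path=None, metadata_path=None):
    """Hypothetical stand-in for load_data: returns (embeddings, metadata, errors)."""
    errors = []
    loaded = []
    for label, path in (("embeddings", embeddings_path), ("metadata", metadata_path)):
        # A missing path or file yields None for that element.
        if not path or not os.path.exists(path):
            errors.append(f"{label} file not found: {path}")
            loaded.append(None)
            continue
        try:
            with open(path, "rb") as handle:
                loaded.append(pickle.load(handle))
        except Exception as exc:  # unpickling can raise many exception types
            errors.append(f"could not read {label} file: {exc}")
            loaded.append(None)
    return loaded[0], loaded[1], errors
```

In the real app the collected messages would presumably be passed to `st.error`, and the function would be wrapped in `@st.cache_data` as noted above.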
Function: reduce_dimensions(embeddings, method, params)
Applies dimensionality reduction to embeddings using the specified method and parameters.
Parameters:
`embeddings` (list or np.ndarray): High-dimensional embedding vectors.
`method` (str): Reduction method, one of `"PCA"`, `"t-SNE"`, `"UMAP"`.
`params` (dict): Method-specific parameters:
For PCA: `n_components`, `random_state`.
For t-SNE: `n_components`, `perplexity`, `learning_rate`, `n_iter`, `random_state`.
For UMAP: `n_components`, `n_neighbors`, `min_dist`, `metric`, `random_state`.
Returns: `np.ndarray` of reduced embeddings, or `None` on failure.
Usage Example:
`reduced = reduce_dimensions(embeddings, "PCA", {"n_components": 10, "random_state": 42})`
Implementation Notes:
Converts the input to a NumPy array.
Caps parameters such as `n_components` and `perplexity` to valid ranges based on data size.
Wraps calls in try-except to catch errors and report them via Streamlit.
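The parameter capping mentioned in the implementation notes could look something like the sketch below. The function name `cap_reduction_params` and the exact bounds are assumptions made for illustration; the source only states that values are capped based on data size.

```python
def cap_reduction_params(method, params, n_samples, n_features):
    """Hypothetical helper: return a copy of params clamped to what the data allows."""
    capped = dict(params)
    if "n_components" in capped:
        # Conservative bound applied to every method here (an assumption);
        # PCA in particular cannot exceed min(n_samples, n_features) components.
        limit = min(n_samples, n_features)
        capped["n_components"] = max(1, min(capped["n_components"], limit))
    if method == "t-SNE" and "perplexity" in capped:
        # scikit-learn requires perplexity to be strictly less than n_samples.
        capped["perplexity"] = max(1, min(capped["perplexity"], n_samples - 1))
    if method == "UMAP" and "n_neighbors" in capped:
        # A point has at most n_samples - 1 other points to use as neighbors.
        capped["n_neighbors"] = min(capped["n_neighbors"], max(2, n_samples - 1))
    return capped


# With only 20 samples, a requested perplexity of 30 is lowered to 19.
tsne_params = cap_reduction_params("t-SNE", {"n_components": 2, "perplexity": 30}, 20, 384)
```

Capping before the estimator is constructed means small uploads degrade to smaller settings instead of raising an error.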
Function: apply_clustering(embeddings, method, params)
Clusters embeddings using the specified clustering algorithm and parameters.
Parameters:
`embeddings` (np.ndarray): Embeddings to cluster.
`method` (str): Clustering method, one of `"KMeans"`, `"AgglomerativeClustering"`, `"SpectralClustering"`, `"DBSCAN"`, `"HDBSCAN"`, `"OPTICS"`.
`params` (dict): Method-specific parameters such as:
`n_clusters`, `random_state` (for KMeans, Agglomerative, Spectral; note that scikit-learn's `AgglomerativeClustering` itself takes no `random_state` argument, so that key can only affect the other two).
`metric`, `linkage` (for AgglomerativeClustering).
`eps`, `min_samples`, `metric` (for DBSCAN).
`min_cluster_size`, `min_samples`, `metric` (for HDBSCAN).
`min_samples`, `metric` (for OPTICS).
Returns: Array of cluster labels (int), or `None` if clustering fails.
Usage Example:
`labels = apply_clustering(embeddings, "KMeans", {"n_clusters": 5, "random_state": 42})`
Details:
Dynamically constructs the clustering object for the selected method.
Applies safeguards to parameters (e.g., minimum cluster size is capped by data size).
Errors are caught and reported.
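One way to picture the dynamic construction and the parameter safeguards is a lookup table of accepted keyword arguments per method. The sketch below is an assumption about structure, not the file's code: `ALLOWED_PARAMS` and `clustering_kwargs` are hypothetical names, and `random_state` is left out for `AgglomerativeClustering` because scikit-learn's class does not accept it.

```python
# Keyword arguments accepted per method, following the parameter list above.
ALLOWED_PARAMS = {
    "KMeans": {"n_clusters", "random_state"},
    "AgglomerativeClustering": {"n_clusters", "metric", "linkage"},
    "SpectralClustering": {"n_clusters", "random_state"},
    "DBSCAN": {"eps", "min_samples", "metric"},
    "HDBSCAN": {"min_cluster_size", "min_samples", "metric"},
    "OPTICS": {"min_samples", "metric"},
}


def clustering_kwargs(method, params, n_samples):
    """Hypothetical helper: select and bound the kwargs for one clustering method."""
    if method not in ALLOWED_PARAMS:
        raise ValueError(f"unknown clustering method: {method}")
    # Drop any keys the chosen method does not understand.
    kwargs = {k: v for k, v in params.items() if k in ALLOWED_PARAMS[method]}
    # Count-like parameters cannot exceed the number of points.
    for name in ("n_clusters", "min_cluster_size", "min_samples"):
        if kwargs.get(name) is not None:
            kwargs[name] = max(1, min(kwargs[name], n_samples))
    return kwargs
```

The resulting dictionary would then be unpacked into the matching constructor, for example `KMeans(**kwargs)`, inside the try-except block the section describes.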
Function: visualize_data(reduced_data, cluster_labels, metadata)
Creates an interactive Plotly scatter plot of embeddings with cluster coloring and metadata hover info.
Parameters:
`reduced_data` (np.ndarray): 2D or 3D reduced embeddings (usually 2D).
`cluster_labels` (array-like or None): Cluster assignments for points. If `None`, no coloring by cluster.
`metadata` (list): List of metadata tuples corresponding to embeddings. Expected format per item: `(file_id, overview, path, filename, page_id, ...)`
Returns: Plotly Figure object, or `None` if visualization fails.
Usage Example:
`fig = visualize_data(reduced_data, cluster_labels, metadata)`
`st.plotly_chart(fig)`
Implementation Details:
Prepares a hover data dictionary with file info and a truncated overview.
Colors points by cluster if labels are provided; otherwise uses cluster info from metadata.
Customizes hover templates to show detailed info including file name, path, IDs, overview, and cluster.
Sets marker style and layout size for clarity.
Handles exceptions gracefully.
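The hover data preparation can be illustrated with a small sketch. The helper name `build_hover_columns` and the 100-character truncation length are assumptions; the source states only that the overview is truncated and that metadata items follow the positional layout given above.

```python
def build_hover_columns(metadata, max_overview_chars=100):
    """Hypothetical helper: turn metadata tuples into per-column hover lists."""
    columns = {"file_id": [], "overview": [], "path": [], "filename": [], "page_id": []}
    for item in metadata:
        # Only the first five positional fields are used; extras are ignored.
        file_id, overview, path, filename, page_id = item[:5]
        text = str(overview)
        if len(text) > max_overview_chars:
            # Truncate long overviews so the tooltip stays readable.
            text = text[:max_overview_chars] + "..."
        columns["file_id"].append(file_id)
        columns["overview"].append(text)
        columns["path"].append(path)
        columns["filename"].append(filename)
        columns["page_id"].append(page_id)
    return columns
```

A dictionary of equal-length lists like this can be combined with the reduced coordinates and handed to Plotly Express as the source for hover fields.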
Function: configure_reduction_step(step_num, max_components)
Generates UI controls in the Streamlit sidebar for configuring a dimensionality reduction step.
Parameters:
`step_num` (int): The sequence number of the reduction step (used in labels/keys).
`max_components` (int): Maximum allowed components (dimensions) for this step.
Returns: Tuple `(method, params)` where:
`method` is the selected reduction method (`"PCA"` or `"UMAP"`).
`params` is a dictionary of method parameters as configured through sidebar widgets.
Usage Example:
`method, params = configure_reduction_step(1, 50)`
Details:
Provides a method selector (PCA or UMAP).
For PCA: Slider for number of components.
For UMAP: Sliders/selectors for components, neighbors, min_dist, metric.
Uses a fixed random state of 42 for reproducibility.
Main Application: main()
The main Streamlit app function that orchestrates the full workflow:
UI Title and Sidebar Setup:
Upload widgets for embeddings and metadata files.
Option to use example data if available.
Data Loading:
Loads embeddings and metadata from files or example data.
Handles temporary file saving/removal for uploaded files.
Shows info if data is missing.
Pre-Clustering Dimensionality Reduction Pipeline:
User can enable/disable up to 3 sequential reduction steps (PCA or UMAP).
Each step is configured with `configure_reduction_step`.
Applies each step in sequence, updating the data and providing feedback.
Visualization Dimensionality Reduction:
Selects from PCA, t-SNE, or UMAP.
Configures parameters (e.g., perplexity, learning rate, iterations for t-SNE; neighbors, min_dist, metric for UMAP).
Allows 2 or 3 components for visualization.
Clustering Configuration:
Optionally apply clustering.
Select clustering method from six algorithms: KMeans, AgglomerativeClustering, SpectralClustering, DBSCAN, HDBSCAN, OPTICS.
Configure parameters specific to the selected clustering method:
For KMeans, AgglomerativeClustering, SpectralClustering: number of clusters and random state; AgglomerativeClustering additionally takes a metric and a linkage method.
For DBSCAN: epsilon, min samples, metric.
For HDBSCAN: min cluster size, min samples, metric.
For OPTICS: min samples, metric.
Processing and Visualization:
Applies clustering on pre-reduced data if enabled, reporting the number of clusters found.
Applies final dimensionality reduction on the original embeddings for visualization.
Calls `visualize_data` to generate the interactive Plotly scatter plot.
Displays the plot or error messages accordingly.
Error Handling and Feedback:
Uses Streamlit spinners and success, warning, error, and info messages to guide the user.
Entry Point:
Runs `main()` if the script is executed directly.
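The sequential pre-clustering reduction and the cluster count reported by `main()` can be sketched roughly as below. Everything here is illustrative: `run_reduction_steps`, `count_clusters`, and the toy reducer `keep_first` are hypothetical names, the toy reducer merely stands in for PCA or UMAP, and stopping at the first failed step is an assumption about how failures are handled.

```python
def run_reduction_steps(data, steps):
    """Apply (name, reducer) steps in order; stop early if a reducer returns None."""
    current = data
    log = []
    for name, reducer in steps:
        result = reducer(current)
        if result is None:
            log.append(f"{name}: failed, keeping previous output")
            break
        log.append(f"{name}: {len(current[0])} -> {len(result[0])} dims")
        current = result
    return current, log


def count_clusters(labels):
    """Count distinct clusters, ignoring -1, the noise label used by DBSCAN, HDBSCAN, and OPTICS."""
    return len(set(labels) - {-1})


def keep_first(k):
    """Toy reducer standing in for PCA/UMAP: keeps the first k columns of each row."""
    return lambda rows: [row[:k] for row in rows]


points = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
reduced, log = run_reduction_steps(points, [("step 1", keep_first(3)), ("step 2", keep_first(2))])
```

Each step consumes the output of the previous one, which is why the maximum number of components offered for a step depends on the dimensionality left by the step before it.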
Important Implementation Details and Algorithms
Multi-Step Dimensionality Reduction:
Supports chaining up to three dimensionality reduction steps before clustering. This allows noise reduction and feature extraction in stages, potentially improving clustering quality.
Dynamic Parameter Bounding:
Parameters like `n_components`, `perplexity`, `n_neighbors`, and others are capped dynamically based on dataset size to prevent invalid configurations.
Robust Error Handling:
Almost all critical operations are wrapped in try-except blocks, with errors reported via the UI to avoid crashes.
Caching Data Loading:
Uses Streamlit's `@st.cache_data` decorator to cache embedding and metadata loading, speeding up repeated runs with the same data.
Integration with Streamlit UI:
The entire pipeline is controlled via sidebar widgets, enabling interactive parameter tuning and immediate visualization updates.
Metadata Usage:
Metadata is assumed to be a list of tuples with specific positional fields (file ID, overview text, file path, filename, page ID). This metadata enriches the visualizations with informative hover tooltips.
Use of Plotly Express for Visualization:
Enables rich interactive scatter plots with custom hover templates and coloring by cluster labels.
Expanded Clustering Options:
Supports six clustering methods, including OPTICS.
Temporary File Handling:
Uploaded files are saved temporarily and deleted after loading, so no stray files are left on disk.
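The temporary file handling could be implemented along the following lines. This is a sketch under assumptions: the helper name `load_uploaded_pickle` and its `directory` argument are hypothetical, and the uploaded content is taken as raw bytes, however the app obtains them from the upload widget.

```python
import os
import pickle
import tempfile


def load_uploaded_pickle(raw_bytes, directory=None):
    """Write uploaded bytes to a temp file, unpickle it, and always delete the file."""
    handle = tempfile.NamedTemporaryFile(suffix=".pkl", delete=False, dir=directory)
    try:
        with handle:
            handle.write(raw_bytes)
        # pickle.load can execute code embedded in the file, so this assumes trusted uploads.
        with open(handle.name, "rb") as source:
            return pickle.load(source)
    finally:
        # Runs on success and on failure, so the temp file never outlives the call.
        os.remove(handle.name)
```

Putting the removal in a `finally` block is what guarantees cleanup even when the uploaded file turns out not to be a valid pickle.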
Interaction with Other System Components
Input Data Source:
Embeddings and metadata are typically generated by upstream modules responsible for semantic splitting and embedding generation (e.g., `tools.py`, `main.py` in the project).
Data Storage:
Supports loading embeddings/metadata from pickle files (uploaded or example files). Temporary files are managed internally for uploaded data.
Visualization Frontend:
Acts as the interactive frontend for embedding analysis, using Streamlit for UI and Plotly for plotting.
Dimensionality Reduction and Clustering Libraries:
Relies on the scikit-learn, umap-learn, and HDBSCAN libraries for the core ML algorithms.
User Interaction:
Enables users to experiment with different reduction and clustering methods and parameters without modifying code, facilitating exploratory data analysis.
Usage Summary
To use this tool:
Launch the app (`streamlit run visualizer.py`).
Upload embeddings and metadata pickle files or select example data if available.
Configure optional multi-step pre-clustering dimensionality reduction in the sidebar, applying up to three sequential PCA or UMAP steps.
Choose a visualization dimensionality reduction method (PCA, t-SNE, or UMAP) and configure its parameters.
Optionally enable clustering and configure clustering parameters, selecting from six clustering algorithms.
View the interactive scatter plot showing embeddings colored by cluster and enriched with metadata.
Hover over points to inspect detailed file information.
Adjust parameters to explore different views and clusterings dynamically.
Example
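# Example: pre-reduce with PCA (10 components), cluster with KMeans (5 clusters), plot in 2D.
import streamlit as st
# Assumes visualizer.py is importable from where this snippet runs
from visualizer import load_data, reduce_dimensions, apply_clustering, visualize_data
# Load embeddings and metadata
embeddings, metadata = load_data("embeddings.pkl", "metadata.pkl")
# Reduce dimensions with PCA before clustering
reduced_embeddings = reduce_dimensions(embeddings, "PCA", {"n_components": 10, "random_state": 42})
# Cluster with KMeans on the pre-reduced data
cluster_labels = apply_clustering(reduced_embeddings, "KMeans", {"n_clusters": 5, "random_state": 42})
# Project the original embeddings to 2D, since visualize_data expects 2D or 3D input
plot_data = reduce_dimensions(embeddings, "PCA", {"n_components": 2, "random_state": 42})
# Visualize
fig = visualize_data(plot_data, cluster_labels, metadata)
st.plotly_chart(fig)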
Summary
`visualizer.py` is a comprehensive, user-configurable visualization and clustering tool for high-dimensional code embeddings. It:
Supports flexible multi-step reduction pipelines.
Provides diverse clustering algorithms including OPTICS.
Integrates detailed metadata for meaningful interpretation.
Utilizes an interactive web UI for dynamic exploration.
Handles errors gracefully and caches data for performance.
It plays a central role in the overall system by turning raw embeddings into actionable insights via visual patterns and cluster structures, thereby facilitating semantic understanding of large codebases.