plotting.py
Overview
The `plotting.py` file provides visualization utilities designed to analyze and display relationships between **code embeddings** and **documentation embeddings** in a software project. Its primary focus is on semantic coverage and similarity metrics between code and associated documentation units.
There are two main visualization functions:
plotly_radar_and_bar: Generates a combined radar (polar) chart and bar chart summarizing key metrics related to coverage, relevance, and novelty, as well as a breakdown of code unit coverage based on similarity thresholds.
plot_semantic_scatter: Creates a 2D scatter plot representing semantic relationships between code and documentation units. It uses dimensionality reduction (t-SNE or UMAP) to map high-dimensional embeddings into a 2D space, highlighting closely related pairs and clusters.
Both functions leverage Plotly for interactive visualization and allow saving the plots as HTML files.
Detailed Documentation
Imports and Constants
External Libraries:
numpy for array operations.
plotly.graph_objects and plotly.subplots for interactive plotting.
scipy.spatial.ConvexHull to compute convex hulls around clusters.
sklearn.manifold.TSNE for dimensionality reduction.
umap for alternative dimensionality reduction (UMAP).
Internal Modules:
config, providing threshold and parameter constants (SIM_THRESHOLD, PARTIAL_THRESHOLD, TSNE_PERPLEXITY).
metrics, providing a cosine_similarity function to measure similarity between embeddings.
Function: plotly_radar_and_bar
def plotly_radar_and_bar(results_ces: Dict,
                         code_embeddings: List[np.ndarray],
                         doc_embeddings: List[np.ndarray],
                         out_path: Optional[str] = None)
Purpose
Visualizes summary metrics and coverage distribution of code units against documentation using:
A radar chart for CES (Coverage, Relevance, Novelty) metrics.
A bar chart showing counts of code units categorized as Covered, Partial, or Missing based on similarity thresholds.
Parameters
results_ces (Dict): A dictionary containing CES metrics with keys "DirectCoverage", "Relevance", and "Novelty", each mapped to a float value in [0, 1].
code_embeddings (List[np.ndarray]): List of embeddings representing code units.
doc_embeddings (List[np.ndarray]): List of embeddings representing documentation units.
out_path (Optional[str]): If provided, the plot is saved as an HTML file at this path; otherwise, it is shown interactively.
Returns
None. It either shows the plot or saves it to a file.
Usage Example
results = {"DirectCoverage": 0.75, "Relevance": 0.65, "Novelty": 0.40}
plotly_radar_and_bar(results, code_embeddings, doc_embeddings, out_path="coverage.html")
Implementation Details
Radar Chart:
Plots the three CES metrics on a polar plot.
The categories loop back to the start to close the radar shape.
Radial axis normalized between 0 and 1.
Coverage Calculation:
For each code embedding, computes max cosine similarity to any doc embedding.
Categorizes each code unit as:
Covered: similarity ≥ SIM_THRESHOLD.
Partial: similarity ≥ PARTIAL_THRESHOLD but < SIM_THRESHOLD.
Missing: similarity < PARTIAL_THRESHOLD.
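The categorization above can be sketched with NumPy as follows. This is a minimal illustration, not the module's actual code: the function name `coverage_counts` and the threshold values are assumptions (the real values come from `config`), and every embedding is assumed to be a non-zero vector.

```python
import numpy as np

# Hypothetical threshold values; the real ones are defined in the config module.
SIM_THRESHOLD = 0.75
PARTIAL_THRESHOLD = 0.5


def coverage_counts(code_embeddings, doc_embeddings,
                    sim_threshold=SIM_THRESHOLD,
                    partial_threshold=PARTIAL_THRESHOLD):
    """Count code units that are Covered, Partial, or Missing."""
    code = np.asarray(code_embeddings, dtype=float)
    docs = np.asarray(doc_embeddings, dtype=float)

    # Normalise each row so that a dot product equals cosine similarity.
    code = code / np.linalg.norm(code, axis=1, keepdims=True)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)

    # best[i] is the highest similarity between code unit i and any doc unit.
    best = (code @ docs.T).max(axis=1)

    covered = int((best >= sim_threshold).sum())
    partial = int(((best >= partial_threshold) & (best < sim_threshold)).sum())
    missing = int((best < partial_threshold).sum())
    return {"Covered": covered, "Partial": partial, "Missing": missing}
```

The three counts always sum to the number of code units, which makes them suitable as the heights of the bar chart.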
Bar Chart:
Shows counts of Covered, Partial, and Missing code units with color coding (green, orange, red).
Plot Layout:
Arranges radar and bar charts side-by-side.
Configured for clarity and interactive display or saving.
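As a rough sketch of how the two charts might be assembled, the snippet below closes the radar loop and places both traces in a one-row, two-column Plotly figure. The helper names `close_loop` and `build_radar_and_bar` are hypothetical, and `counts` is assumed to be a dictionary with the keys "Covered", "Partial", and "Missing"; the actual function may be organized differently.

```python
def close_loop(categories, values):
    """Repeat the first point at the end so the radar outline closes."""
    return list(categories) + [categories[0]], list(values) + [values[0]]


def build_radar_and_bar(results_ces, counts, out_path=None):
    # Plotly is imported here so that close_loop stays usable without it.
    import plotly.graph_objects as go
    from plotly.subplots import make_subplots

    keys = ["DirectCoverage", "Relevance", "Novelty"]
    theta, r = close_loop(keys, [results_ces[k] for k in keys])

    # A polar subplot for the radar chart next to a cartesian one for the bars.
    fig = make_subplots(rows=1, cols=2,
                        specs=[[{"type": "polar"}, {"type": "xy"}]])
    fig.add_trace(go.Scatterpolar(r=r, theta=theta, fill="toself", name="CES"),
                  row=1, col=1)

    labels = ["Covered", "Partial", "Missing"]
    fig.add_trace(go.Bar(x=labels, y=[counts[k] for k in labels],
                         marker_color=["green", "orange", "red"],
                         name="Coverage"),
                  row=1, col=2)

    # Keep the radial axis on the same [0, 1] scale as the CES metrics.
    fig.update_layout(polar=dict(radialaxis=dict(range=[0, 1])),
                      showlegend=False)

    if out_path:
        fig.write_html(out_path)
    else:
        fig.show()
    return fig
```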
Function: plot_semantic_scatter
def plot_semantic_scatter(code_embeddings: List[np.ndarray],
                          doc_embeddings: List[np.ndarray],
                          code_units: List[str],
                          doc_units: List[str],
                          out_path: Optional[str] = None,
                          use_umap: bool = False)
Purpose
Creates an interactive 2D scatter plot visualizing semantic relationships between code and documentation units in a shared embedding space.
Parameters
code_embeddings (List[np.ndarray]): Embeddings for code units.
doc_embeddings (List[np.ndarray]): Embeddings for documentation units.
code_units (List[str]): String labels or identifiers for code units (used for hover text).
doc_units (List[str]): String labels or identifiers for documentation units (used for hover text).
out_path (Optional[str]): Optional path to save the plot as an HTML file.
use_umap (bool): If True, uses UMAP for dimensionality reduction; otherwise uses t-SNE.
Returns
None. Displays or saves the visualization.
Usage Example
plot_semantic_scatter(code_embeddings, doc_embeddings, code_names, doc_names, out_path="semantic_map.html")
Implementation Details
Dimensionality Reduction:
Combines code and doc embeddings into one array.
Applies either UMAP or t-SNE to project embeddings into 2D.
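The t-SNE perplexity starts from the TSNE_PERPLEXITY value in config and is adjusted to the number of points, because scikit-learn requires the perplexity to be smaller than the sample count.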
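One way such a perplexity adjustment could work is shown below. The exact rule used in `plotting.py` is not documented here, so the cap of `(n_samples - 1) / 3` and the name `effective_perplexity` are assumptions chosen only to keep the value valid for small inputs.

```python
def effective_perplexity(configured, n_samples):
    """Cap the configured perplexity so it stays below the sample count."""
    if n_samples < 2:
        raise ValueError("t-SNE needs at least two points")
    # Hypothetical rule: allow at most (n_samples - 1) / 3, but never below 1.
    ceiling = max(1.0, (n_samples - 1) / 3)
    return float(min(configured, ceiling))
```

With a configured perplexity of 30, ten points would use a perplexity of 3.0, while a thousand points would keep the configured value.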
Scatter Plot:
Code units plotted as blue markers; documentation units as red markers.
Hover text shows truncated unit content or name for context.
Similarity Lines:
For each code unit, finds the doc unit with max cosine similarity.
If similarity ≥ SIM_THRESHOLD, draws a green line connecting the two points.
Line width and opacity scale with similarity strength.
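The matching step can be sketched like this. The function names, the default threshold of 0.75, and the linear width and opacity mapping are all assumptions for illustration; the source only states that both properties scale with similarity. Embeddings are assumed to be non-zero, and the threshold is assumed to be below 1.

```python
import numpy as np


def best_matches(code_embeddings, doc_embeddings, sim_threshold=0.75):
    """Return (code_idx, doc_idx, similarity) for matches at or above the threshold."""
    code = np.asarray(code_embeddings, dtype=float)
    docs = np.asarray(doc_embeddings, dtype=float)
    code = code / np.linalg.norm(code, axis=1, keepdims=True)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)

    sims = code @ docs.T                      # pairwise cosine similarities
    best_doc = sims.argmax(axis=1)            # closest doc unit per code unit
    best_sim = sims[np.arange(len(code)), best_doc]
    return [(i, int(j), float(s))
            for i, (j, s) in enumerate(zip(best_doc, best_sim))
            if s >= sim_threshold]


def line_style(similarity, sim_threshold=0.75):
    """Map a similarity in [threshold, 1] to a hypothetical width and opacity."""
    strength = (similarity - sim_threshold) / (1.0 - sim_threshold)
    strength = min(max(strength, 0.0), 1.0)
    return {"width": 1.0 + 3.0 * strength, "opacity": 0.3 + 0.7 * strength}
```

Each returned triple identifies one line segment: its endpoints are the 2D coordinates of the code unit and of its best-matching documentation unit.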
Convex Hull:
Computes convex hull around covered code points (those above SIM_THRESHOLD).
Hull is drawn as a green dashed polygon with translucent fill to highlight coverage cluster.
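A hull outline for the covered points could be computed roughly as follows. The function name `hull_outline` and the guard for degenerate input are assumptions; whether the real code handles fewer than three points or collinear points the same way is not stated in this document.

```python
import numpy as np
from scipy.spatial import ConvexHull


def hull_outline(points_2d):
    """Return the closed hull polygon as an (m + 1, 2) array, or None."""
    pts = np.asarray(points_2d, dtype=float)
    # A 2D hull needs at least three points that are not all on one line.
    if len(pts) < 3 or np.linalg.matrix_rank(pts - pts[0]) < 2:
        return None
    vertices = ConvexHull(pts).vertices       # counter-clockwise order in 2D
    closed = np.append(vertices, vertices[0])  # repeat first vertex to close
    return pts[closed]
```

The returned x and y columns can be passed to a Plotly line trace with a dashed green line and a translucent fill to draw the polygon.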
Plot Layout:
Sets axis titles and plot dimensions for clear visualization.
Supports interactive exploration or export.
Key Algorithms and Concepts
Cosine Similarity for Coverage:
Used to measure semantic similarity between vector representations of code and documentation units.
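For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The function below is a stand-in written for this document, not the `cosine_similarity` implementation from `metrics`; returning 0.0 for a zero vector is an assumption.

```python
import numpy as np


def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # Assumed convention: a zero vector is treated as similar to nothing.
    return float(a @ b / denom) if denom else 0.0
```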
Dimensionality Reduction:
t-SNE and UMAP reduce high-dimensional embeddings to 2D for visualization.
t-SNE perplexity parameter is adaptive to dataset size.
UMAP is typically faster than t-SNE on larger datasets and often preserves more of the global structure.
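Code and documentation embeddings are reduced together so that both land in the same 2D space, then split back apart. The sketch below shows that combine-and-split step with the reducer passed in as a callable; `project_2d` and `reduce_fn` are hypothetical names, and the real function selects the reducer through `use_umap` instead.

```python
import numpy as np


def project_2d(code_embeddings, doc_embeddings, reduce_fn):
    """Reduce both sets in one pass, then split the 2D coordinates again."""
    code = np.asarray(code_embeddings, dtype=float)
    docs = np.asarray(doc_embeddings, dtype=float)

    combined = np.vstack([code, docs])        # code rows first, then doc rows
    coords = np.asarray(reduce_fn(combined))  # expected shape: (n_total, 2)

    return coords[:len(code)], coords[len(code):]


# With scikit-learn or umap-learn installed, reduce_fn could be, for example:
#   TSNE(n_components=2, perplexity=5).fit_transform
#   umap.UMAP(n_components=2).fit_transform
```

Reducing the two sets separately would place them in unrelated coordinate systems, so distances between code and documentation points would carry no meaning.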
Convex Hull:
Encapsulates the cluster of code units considered covered by documentation.
Visualizes the semantic "coverage area" on the scatter plot.
Interactions with Other Modules
config module: Provides constants affecting thresholds and parameters:
SIM_THRESHOLD: similarity cutoff to consider a code unit covered.
PARTIAL_THRESHOLD: lower similarity cutoff for partial coverage.
TSNE_PERPLEXITY: controls perplexity parameter for t-SNE.
metrics module: Provides the cosine_similarity function to compute similarity scores between embeddings.
umap package: Used optionally for dimensionality reduction instead of t-SNE.
Plotly library:
Used extensively for creating interactive visualizations.
Through these dependencies, the plotting functions act as the final stage of a larger analysis pipeline that generates embeddings, computes metrics, and visualizes the results.
Mermaid Diagram: Function Flowchart for plotting.py
flowchart TD
A[plotly_radar_and_bar] --> B[Calculate coverage counts]
B --> C[Create radar chart for CES metrics]
B --> D[Create bar chart for coverage counts]
C & D --> E[Display or save plot]
F[plot_semantic_scatter] --> G[Combine embeddings]
G --> H{Use UMAP?}
H -- Yes --> I[Apply UMAP]
H -- No --> J[Apply t-SNE]
I & J --> K[Split reduced coords]
K --> L[Plot code units scatter]
K --> M[Plot doc units scatter]
L & M --> N[Draw similarity lines]
N --> O[Compute convex hull on covered code units]
O --> P[Add hull polygon]
P --> Q[Display or save plot]
Summary
`plotting.py` visualizes semantic coverage and relationships between code and documentation embeddings. Its two functions produce interactive Plotly figures: a radar and bar chart summarizing the CES metrics and coverage counts, and a 2D scatter plot built with t-SNE or UMAP that links each covered code unit to its closest documentation unit. Together they help developers and analysts see where documentation coverage is strong, partial, or missing.