Project Overview

Project Purpose and Objectives

This project supports the exploration, analysis, and visualization of embeddings generated from source code files. It converts code into semantically meaningful vector representations, then applies dimensionality reduction and clustering to uncover latent structure and relationships within the codebase.

The system is designed to support developers and analysts in understanding large codebases by revealing structural and semantic patterns in the code, aiding tasks such as code review, refactoring, and architecture analysis.


Example Workflows and Use Cases

Workflow 1: Generating and Visualizing Embeddings for a Codebase

  1. Load Source Code Files
    The system fetches file overviews and metadata from the PostgreSQL database.

  2. Semantic Splitting
    Each source code file is split into semantically coherent chunks using the semantic_split function in tools.py, which leverages sentence embeddings and cosine similarity.

  3. Embedding Computation
    Each chunk is embedded using a pretrained transformer model (via HuggingFace in tools.py), producing vector representations that capture semantic content.

  4. Dimensionality Reduction & Clustering
    Reduction algorithms such as UMAP or t-SNE, invoked from visualizer.py or main.py, project the embeddings into a lower-dimensional space. Clustering algorithms such as HDBSCAN or KMeans then group similar chunks.

  5. Interactive Visualization
    The clustered and reduced embeddings are visualized in an interactive Plotly scatter plot within a Streamlit app (visualizer.py), where users can hover over points to view metadata or cluster assignment.
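
The semantic splitting in step 2 can be illustrated with a minimal sketch. It is not the project's semantic_split function: the names split_by_similarity and toy_embed, the 0.2 threshold, and the bag-of-words stand-in for a sentence-embedding model are all assumptions made for illustration. The idea shown is only the general one the step describes: compare the embeddings of consecutive segments with cosine similarity and start a new chunk when the similarity drops.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as token -> count mappings."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_embed(text):
    """Hypothetical stand-in for a sentence-embedding model: a bag-of-words vector."""
    return Counter(text.lower().split())


def split_by_similarity(segments, embed=toy_embed, threshold=0.2):
    """Group consecutive segments; open a new chunk when similarity falls below threshold."""
    chunks = []
    previous = None
    for segment in segments:
        vector = embed(segment)
        if previous is None or cosine_similarity(previous, vector) < threshold:
            chunks.append([segment])      # similarity dropped: start a new chunk
        else:
            chunks[-1].append(segment)    # still similar: extend the current chunk
        previous = vector
    return chunks


segments = [
    "def load user from database",
    "return user from database cache",
    "plot cluster scatter figure",
    "show scatter figure legend",
]
chunks = split_by_similarity(segments)
```

With a real model, embed would return dense vectors and the threshold would need tuning against the codebase being analyzed.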

Workflow 2: Exploring Existing Embeddings via Web App

  1. Upload Embeddings & Metadata
    Users upload pickled embeddings and metadata files through the Streamlit sidebar.

  2. Select Reduction and Clustering Methods
    The app exposes the parameters for reduction (PCA, t-SNE, UMAP) and clustering (several scikit-learn algorithms and HDBSCAN) as interactive controls.

  3. Visual Exploration
    Results update dynamically, presenting visual clusters and data distributions for in-depth investigation.
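
The reduction and clustering choice in step 2 can be sketched outside the app with scikit-learn. This is an illustration under stated assumptions, not the code in visualizer.py: the function name reduce_and_cluster is hypothetical, the embeddings are synthetic, PCA and KMeans are picked only because they are the simplest of the listed options, and clustering is run on the reduced coordinates (the project may instead cluster the original vectors).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def reduce_and_cluster(embeddings, n_components=2, n_clusters=2, seed=0):
    """Project embeddings with PCA, then cluster the projection with KMeans."""
    X = np.asarray(embeddings, dtype=float)
    if X.ndim != 2:
        raise ValueError("embeddings must have shape (n_chunks, dim)")
    # Reduce to a low-dimensional space suitable for a scatter plot.
    coords = PCA(n_components=n_components, random_state=seed).fit_transform(X)
    # Assumption: cluster the 2-D projection rather than the raw vectors.
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(coords)
    return coords, labels


# Synthetic stand-in for uploaded embeddings: two well-separated groups.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.1, size=(20, 16))
group_b = rng.normal(loc=5.0, scale=0.1, size=(20, 16))
coords, labels = reduce_and_cluster(np.vstack([group_a, group_b]))
```

The returned coords and labels are the two arrays a scatter plot needs: one point per chunk, colored by cluster.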


Stack and Technologies

**Rationale:** The stack pairs pretrained transformer embeddings with a range of clustering and visualization tools. Streamlit and Plotly allow an interactive frontend to be built quickly. PostgreSQL stores file metadata and embeddings. Offering several reduction and clustering methods lets users match the method to the characteristics of each dataset.


High-Level Architecture

+-----------------------+       +----------------------+       +------------------------------+
|                       |       |                      |       | Embedding Computation        |
|   Source Code Files   | ----> |  Semantic Splitting  | ----> | (tools.py)                   |
| (Codebase + Metadata) |       |      (tools.py)      |       | (transformers + PyTorch)     |
+-----------------------+       +----------------------+       +------------------------------+
                                                                              |
                                                                              v
                                                               +------------------------------+
                                                               | Dimensionality Reduction &   |
                                                               | Clustering (main.py &        |
                                                               | visualizer.py)               |
                                                               | - PCA, t-SNE, UMAP           |
                                                               | - KMeans, Agglomerative,     |
                                                               |   DBSCAN, HDBSCAN, OPTICS    |
                                                               +------------------------------+
                                                                              |
                                                                              v
                                                               +------------------------------+
                                                               | Interactive Visualization    |
                                                               | (visualizer.py using         |
                                                               | Streamlit and Plotly)        |
                                                               +------------------------------+
                                                                              |
                                                                              v
                                                               +------------------------------+
                                                               | PostgreSQL Database          |
                                                               | - Stores file metadata       |
                                                               | - Stores embeddings          |
                                                               +------------------------------+

**Component Interactions:**


Developer Navigation

Frontend Developers Start Here

Backend Developers Focus Here

NLP / Embedding Specialists

Data Scientists / Algorithm Developers


This overview provides a concise yet thorough roadmap for contributors to quickly comprehend project goals, architecture, workflows, and entry points for development. It balances technical depth with clarity suitable for both new and experienced developers.