main.py

{'page_id': '55996', 'language_code': 'en', 'content': '# main.py\n\n---\n\n## Overview\n\nmain.py serves as a central orchestration script within the code embedding and analysis system. Its primary purpose is to:\n\n* Fetch metadata about source code files from a PostgreSQL database.\n\n* Load or compute semantic embeddings for those files by splitting them into semantically coherent chunks.\n\n* Store embeddings and metadata persistently.\n\n* Provide functions to visualize embeddings, both raw and clustered, via interactive plots.\n\n* Manage database connections and data retrieval.\n\n* Illustrate the use of transformer-based models (specifically CodeBERT) for embedding source code chunks.\n\n* Demonstrate preliminary clustering and dimensionality reduction workflows on computed embeddings.\n\nThis file acts as the glue between the database layer, embedding computation (leveraging the semantic_split tool), and visualization components (using Plotly). It integrates various machine learning and NLP libraries such as Hugging Face Transformers, scikit-learn, UMAP, and t-SNE to enable exploratory analysis of code semantics at a chunk and file level.\n\n---\n\n## Detailed Breakdown of Classes, Functions, and Methods\n\nmain.py does not define any classes but includes several key functions and a main() execution routine. Below is a detailed explanation of each function along with usage notes and examples where applicable.\n\n---\n\n### visualize_embeddings(embeddings, metadata)\n\nCreates an interactive 2D scatter plot visualization of semantic embeddings using t-SNE for dimensionality reduction.\n\n#### Parameters\n\n* embeddings (List[List[float]]): A list of high-dimensional embedding vectors, typically generated from code chunks or file overviews.\n\n* metadata (List[Tuple]): A list of tuples, each containing metadata fields about the corresponding embedding. 
Expected fields in each tuple:\n\n * file_id (int)\n\n * overview_text (str)\n\n * path (str)\n\n * name (str)\n\n * page_id (int)\n\n#### Returns\n\n* None. The function displays an interactive Plotly scatter plot.\n\n#### Description\n\n* Converts embeddings to a NumPy array.\n\n* Applies t-SNE to reduce embeddings to 2D.\n\n* Creates hover information combining file metadata and truncated overview text.\n\n* Displays the plot with customized marker styles and hover templates.\n\n#### Usage Example\n\npython\nvisualize_embeddings(embeddings, metadata)\n\n\n---\n\n### get_db_connection()\n\nEstablishes and returns a connection to a PostgreSQL database using environment variables for configuration.\n\n#### Parameters\n\n* None\n\n#### Returns\n\n* psycopg.Connection: A live database connection object.\n\n#### Description\n\n* Reads DB connection parameters from environment variables: DB_HOST, DB_PORT, DB_NAME, DB_USER, DB_PASSWORD.\n\n* Connects to the PostgreSQL database using psycopg.connect().\n\n#### Usage Example\n\npython\nconn = get_db_connection()\n\n\n---\n\n### get_file_overviews(db: psycopg.Connection, project_id: int)\n\nFetches distinct file overview records for a specified project from the database, including associated page content.\n\n#### Parameters\n\n* db (psycopg.Connection): Active database connection.\n\n* project_id (int): Identifier of the project whose file overviews are requested.\n\n#### Returns\n\n* List[Tuple]: A list of tuples, each containing (file_id, overview, path, name, page_id, content).\n\n#### Description\n\n* Executes a SQL query joining file_overviews, project_files, and page_content tables.\n\n* Filters by the given project_id.\n\n* Returns all matching rows, including the page content associated with each file.\n\n#### Usage Example\n\npython\nconn = get_db_connection()\noverviews = get_file_overviews(conn, project_id=30)\n\n\n---\n\n### visualize_clustered_embeddings(embeddings, overviews)\n\nCreates an interactive 2D 
visualization of embeddings with cluster assignments, combining dimensionality reduction and clustering.\n\n#### Parameters\n\n* embeddings (List[List[float]]): List of embedding vectors.\n\n* overviews (List[Tuple]): Metadata tuples corresponding to embeddings (same format as in visualize_embeddings).\n\n#### Returns\n\n* None. Displays a Plotly scatter plot with clusters.\n\n#### Description\n\n* Converts embeddings to NumPy array.\n\n* Performs multi-step dimensionality reduction:\n\n * PCA reduces to 79 components.\n\n * UMAP reduces to 50 components.\n\n* Applies Spectral Clustering (fixed at 8 clusters).\n\n* Performs final t-SNE reduction to 2D for visualization.\n\n* Prepares hover metadata including cluster labels.\n\n* Displays an interactive scatter plot with points colored by cluster.\n\n#### Important Notes about Implementation\n\n* Several alternative clustering algorithms (HDBSCAN, AgglomerativeClustering, KMeans, OPTICS) are commented out, showing exploration of different approaches.\n\n* The pipeline reduces embedding dimensionality progressively before clustering to improve cluster quality and computational efficiency.\n\n#### Usage Example\n\npython\nvisualize_clustered_embeddings(embeddings, overviews)\n\n\n---\n\n### main()\n\nMain entry point of the script orchestrating the entire workflow:\n\n1. Attempts to load precomputed embeddings from embeddings.pkl.\n\n2. Establishes a database connection.\n\n3. Fetches file overviews for a specific project ID (hardcoded as 30).\n\n4. Saves overviews metadata to a local pickle file for caching (./overviews/coverage_overviews.pkl).\n\n5. 
If embeddings are not loaded, computes embeddings by:\n\n * Loading CodeBERT tokenizer and model.\n\n * Iterating over each file overview:\n\n * Reads source code file from disk.\n\n * Splits the file content into semantically coherent chunks using semantic_split.\n\n * For each chunk, computes an embedding by mean pooling the last hidden states of the transformer model.\n\n * Accumulates embeddings and metadata.\n\n6. Saves computed embeddings and metadata to disk under ./embeddings/code_embeddings.pkl and ./embeddings/metadata.pkl.\n\n7. Optionally calls visualization functions (currently commented out).\n\n8. Closes the database connection.\n\n#### Parameters\n\n* None\n\n#### Returns\n\n* None\n\n#### Usage\n\nRun directly as a script:\n\nbash\npython main.py\n\n\n#### Important Implementation Details\n\n* The base_path for source code files is currently set as Path(\"\") (project root) and should be configured as needed.\n\n* Embeddings are stored in ./embeddings/code_embeddings.pkl.\n\n* Metadata is stored in ./embeddings/metadata.pkl.\n\n* The semantic splitting logic is encapsulated in the imported semantic_split function from the tools module (see project documentation for details).\n\n* Uses Hugging Face's microsoft/codebert-base model, specialized for source code embedding.\n\n---\n\n## Important Implementation Details and Algorithms\n\n### Semantic Splitting and Embedding Computation\n\n* The file content is split into semantically coherent chunks using the semantic_split function (imported from tools), which utilizes sentence tokenization, embedding similarity, and token length constraints to form chunks.\n\n* Each chunk’s embedding is computed by tokenizing the chunk and passing it through the CodeBERT model, then averaging the token embeddings (last_hidden_state.mean(dim=1)).\n\n* This chunk-level embedding approach enables fine-grained semantic representation beyond file-level embeddings.\n\n### Dimensionality Reduction and Clustering\n\n* For 
visualization, high-dimensional embeddings are reduced to 2D using t-SNE.\n\n* For clustering, a multi-step pipeline combines PCA (to 79 dims), UMAP (to 50 dims), then Spectral Clustering (8 clusters).\n\n* Final visualization again uses t-SNE for plotting.\n\n* Different clustering algorithms and parameters have been explored (some commented out), indicating experimental flexibility.\n\n### Database Interaction\n\n* Uses psycopg (PostgreSQL client) to retrieve file overview metadata.\n\n* SQL query joins file_overviews, project_files, and page_content tables ensuring data consistency and enriching metadata with page content.\n\n* Connection parameters are loaded from environment variables managed by dotenv.\n\n### Data Persistence\n\n* Embeddings and metadata are serialized using Python’s pickle for quick load/save.\n\n* Overviews are also cached locally to avoid repeated database queries.\n\n---\n\n## Interactions With Other Parts of the System\n\n* Database Layer: Retrieves file metadata (file IDs, paths, overviews, page content) from PostgreSQL. This metadata drives which files are processed and annotated in embeddings.\n\n* Semantic Splitting (tools.semantic_split): The semantic splitting logic is implemented externally in the tools module. 
It is a crucial preprocessing step before embedding computation.\n\n* Pretrained Transformer Models: Leverages Hugging Face’s CodeBERT model for embedding code chunks, ensuring semantically meaningful vector representations.\n\n* Embedding Storage and Loading: Embeddings and metadata are saved to disk, enabling reuse and integration with visualization and clustering modules.\n\n* Visualization Pipeline: The plotting functions use Plotly Express for interactive embedding visualization, facilitating exploratory data analysis.\n\n* Downstream Analysis: The computed embeddings can be used by other scripts or modules (e.g., visualizer.py) for advanced dimensionality reduction, clustering, and user interface presentation.\n\n---\n\n## Summary\n\nThe main.py file is a key orchestration script responsible for:\n\n* Connecting to the project database and retrieving file overview metadata, including page content.\n\n* Computing semantic embeddings for source code files using a transformer model and semantic chunking.\n\n* Storing embeddings and metadata for reuse.\n\n* Providing visualization functions to interactively explore embeddings and clustering results.\n\n* Demonstrating exploratory clustering pipelines for grouping semantically similar code chunks.\n\nIt integrates database access, advanced NLP embedding techniques, dimensionality reduction, clustering algorithms, and interactive visualization into a cohesive workflow for analyzing source code semantics at scale.\n\n---\n\n## Example Workflow for Running main.py\n\nbash\n# Ensure environment variables for DB connection are set, for example in a .env file\nDB_HOST=localhost\nDB_PORT=5432\nDB_NAME=my_database\nDB_USER=my_user\nDB_PASSWORD=my_password\n\n# Activate virtual environment with required packages installed\n\n# Run main script\npython main.py\n\n# This will:\n# - connect to PostgreSQL\n# - fetch file overviews for project_id=30\n# - load or compute embeddings for source files\n# - save 
embeddings and metadata to disk\n# - (optionally) visualize embeddings\n', 'created_at': '2025-06-19 16:21:19.446360', 'updated_at': '2025-06-19 16:23:15.179517'}