Semantic Splitting and Embedding Computation

Overview and Purpose

This module transforms raw source code files into semantically meaningful vector representations. It breaks each file into coherent chunks that preserve semantic context, then computes a numerical embedding for each chunk with a pretrained transformer model. These embeddings are the input to downstream analysis such as clustering, dimensionality reduction, and visualization.

**Why this module exists:**

Source code files can be large and contain multiple logical components. Embedding an entire file as a single unit blends those components into one vector and loses the distinctions between them.
Splitting code into semantically coherent chunks allows for finer-grained analysis, enabling better understanding of code structure and relationships.
Generating high-quality embeddings captures the semantic content of code chunks, facilitating tasks like similarity search, clustering, and visualization.

Core Concepts

Semantic Splitting

Semantic splitting divides a file’s content into smaller chunks that are internally coherent and distinct from one another based on semantic similarity. Unlike naive splitting by lines or fixed sizes, this approach uses sentence embeddings and similarity metrics to determine chunk boundaries dynamically.

Embedding Computation

Each chunk is converted into a fixed-size numeric vector (embedding) using a pretrained transformer model. This embedding encodes semantic information about the chunk's content, enabling comparison and clustering across chunks.

How the Module Works

Semantic Splitting Process

Input: Raw file content as a string, alongside a pretrained tokenizer and model for embedding computation.
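Sentence Tokenization: The file content is split into individual sentences using a sentence tokenizer (nltk.sent_tokenize). This breaks the content into small units that can be embedded one at a time. Note that nltk.sent_tokenize is designed for natural-language prose, so the "sentences" it finds in source code are approximate segments rather than syntactic units such as functions or statements.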
Sentence Embeddings: Each sentence is embedded independently using the transformer model, producing a vector representation.
Chunk Building: Sentences are aggregated into chunks by evaluating two criteria:
1. Token Length Limit: The combined tokens of the current chunk plus the new sentence should not exceed a predefined maximum (512 tokens here).
2. Semantic Similarity Threshold: The cosine similarity between the current chunk's embedding and the candidate sentence's embedding must exceed a threshold (0.80 in this system).
If either criterion fails, the current chunk is closed and a new chunk starts with the new sentence. Otherwise, the sentence is appended to the current chunk, and the chunk’s embedding is updated as a weighted average of its constituent sentence embeddings, weighted by sentence length for better semantic representation.
Output: A list of text chunks that represent semantically coherent parts of the original file.

This approach balances chunk size constraints with semantic cohesion, ensuring chunks neither become too large nor semantically fragmented.
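The chunk-building loop described above can be sketched in plain Python. This is a minimal illustration, not the code in tools.py: the names `build_chunks`, `embed`, and `count_tokens` are hypothetical, the embedding function is replaced by a toy lookup table, sentence length is assumed to be measured in characters, and sentences are assumed to be joined with a single space.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_chunks(
    sentences: Sequence[str],
    embed: Callable[[str], Vector],       # hypothetical: sentence -> embedding
    count_tokens: Callable[[str], int],   # hypothetical: sentence -> token count
    max_tokens: int = 512,
    threshold: float = 0.80,
) -> List[str]:
    """Greedily group sentences into chunks by token budget and similarity."""
    chunks: List[str] = []
    current: List[str] = []   # sentences in the open chunk
    chunk_vec: Vector = []    # length-weighted mean of their embeddings
    chunk_weight = 0          # sum of sentence lengths in the open chunk
    chunk_tokens = 0          # sum of token counts in the open chunk

    for sentence in sentences:
        vec = embed(sentence)
        n_tokens = count_tokens(sentence)
        weight = max(len(sentence), 1)  # assumption: length in characters

        if current:
            fits = chunk_tokens + n_tokens <= max_tokens
            similar = cosine_similarity(chunk_vec, vec) > threshold
            if fits and similar:
                # Update the running weighted average and extend the chunk.
                total = chunk_weight + weight
                chunk_vec = [
                    (chunk_weight * c + weight * v) / total
                    for c, v in zip(chunk_vec, vec)
                ]
                current.append(sentence)
                chunk_weight = total
                chunk_tokens += n_tokens
                continue
            # Either criterion failed: close the open chunk.
            chunks.append(" ".join(current))

        # Start a new chunk with this sentence.
        current, chunk_vec = [sentence], list(vec)
        chunk_weight, chunk_tokens = weight, n_tokens

    if current:
        chunks.append(" ".join(current))
    return chunks


# Toy embeddings standing in for the transformer model.
toy_vectors = {
    "import os": [1.0, 0.0],
    "import sys": [0.9, 0.1],
    "def main(): pass": [0.0, 1.0],
}
chunks = build_chunks(
    list(toy_vectors),
    embed=toy_vectors.__getitem__,
    count_tokens=lambda s: len(s.split()),
)
print(chunks)
```

With these toy vectors the two import lines are similar enough to merge, while the function definition falls below the threshold and starts a second chunk. Lowering `max_tokens` forces a split even between similar sentences, which is how the two criteria interact.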

Embedding Computation for Sentences and Chunks

Sentence Embeddings:
Each sentence is tokenized and passed through the transformer model to obtain token-level embeddings. The model's output (last_hidden_state) is mean-pooled across tokens (excluding padding tokens) to produce a fixed-size sentence embedding vector.
Chunk Embeddings:
During chunk construction, sentence embeddings are combined into a length-weighted average that represents the chunk while it is being built and is used to decide chunk boundaries. The final chunk embeddings are computed separately (in main.py): each chunk is tokenized again and passed through the model, and the mean of the last hidden states over the token dimension is used as the embedding.
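The two pooling variants mentioned here, a mean that excludes padding tokens and a plain mean over all positions, differ only when padding is present. The sketch below shows the arithmetic in plain Python on hand-written numbers. The function names are hypothetical, and the real module would presumably operate on the model's last_hidden_state tensor and the tokenizer's attention mask rather than on lists.

```python
from typing import List, Sequence


def masked_mean_pool(
    token_vectors: Sequence[Sequence[float]],
    attention_mask: Sequence[int],
) -> List[float]:
    """Average the token vectors whose mask is 1, ignoring padding positions."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def plain_mean_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average over every position, padding included."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


# Two real tokens followed by one padding position (illustrative values).
hidden = [[1.0, 3.0], [3.0, 5.0], [0.0, 0.0]]
mask = [1, 1, 0]

masked = masked_mean_pool(hidden, mask)
plain = plain_mean_pool(hidden)
print(masked, plain)
```

When a single chunk is tokenized on its own, `padding=True` pads only to the longest sequence in the batch, so no padding positions are added and the plain mean coincides with the masked mean. The distinction matters once several texts of different lengths are embedded in one batch.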

Integration in Workflow (Reference to main.py)

The main.py script orchestrates the loading of source file metadata, reading file contents, applying semantic splitting, computing embeddings, and storing the results.
It retrieves file overviews and paths from the database, reads each file from disk, and applies the semantic_split function with a pretrained tokenizer and model (microsoft/codebert-base).
For each chunk obtained, it computes an embedding by tokenizing and feeding the chunk through the model, averaging the output token embeddings.
Embeddings and associated metadata for each chunk are accumulated and serialized for later use in visualization and analysis.

**Excerpt illustrating chunk splitting and embedding computation:**

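# semantic_split returns a list of text chunks for one file
chunks = semantic_split(file_content, tokenizer, model)
for chunk in chunks:
    # Re-tokenize the chunk on its own; truncation caps it at the model's maximum input length
    inputs = tokenizer(str(chunk), return_tensors="pt", truncation=True, padding=True)
    outputs = model(**inputs)
    # Mean over the token dimension gives one fixed-size vector per chunk
    embeddings.append(outputs.last_hidden_state.mean(dim=1).squeeze().tolist())
    # One metadata row per chunk ties the embedding back to its source file
    metadata.append((file_id, overview_text, path, name, page_id))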

Interaction with Other Parts of the System

Database Layer:
The module depends on metadata (file IDs, paths, overviews) fetched from the PostgreSQL database in main.py. This metadata guides which files to process and associates embeddings back to source files.
Embedding Storage:
After computation, embeddings and associated metadata are saved as pickle files for persistence and later loading by visualization or clustering modules.
Downstream Analysis:
The embeddings generated here feed into dimensionality reduction and clustering algorithms implemented in main.py and visualizer.py. Thus, the quality and granularity of semantic splitting directly impact the effectiveness of subsequent analysis.
Visualization Module:
Embeddings computed for semantically split chunks enable detailed exploration of code semantics in the interactive visualization powered by visualizer.py. Users can inspect clusters and relationships at chunk-level granularity.
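The pickle-based hand-off between this module and the downstream consumers can be illustrated with a small round trip. This is a sketch under assumptions: the helper names, the single-file dictionary layout, the file name, and the sample values are all hypothetical, and the project may well store embeddings and metadata in separate pickle files.

```python
import pickle
import tempfile
from pathlib import Path


def save_embeddings(path, embeddings, metadata):
    """Write embeddings and their per-chunk metadata rows to one pickle file."""
    if len(embeddings) != len(metadata):
        raise ValueError("expected one metadata row per embedding")
    with open(path, "wb") as fh:
        pickle.dump({"embeddings": embeddings, "metadata": metadata}, fh)


def load_embeddings(path):
    """Read back the (embeddings, metadata) pair written by save_embeddings."""
    with open(path, "rb") as fh:
        payload = pickle.load(fh)
    return payload["embeddings"], payload["metadata"]


# Illustrative data: two chunks from the same file, so the metadata repeats.
embeddings = [[0.1, 0.2], [0.3, 0.4]]
metadata = [
    (1, "overview text", "src/a.py", "a.py", 10),
    (1, "overview text", "src/a.py", "a.py", 10),
]

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "embeddings.pkl"
    save_embeddings(target, embeddings, metadata)
    loaded_embeddings, loaded_metadata = load_embeddings(target)

print(len(loaded_embeddings), loaded_metadata[0])
```

Keeping embeddings and metadata index-aligned is what lets a visualization or clustering step map each point back to its file. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.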

Important Concepts and Design Patterns

Sentence-Level Embedding and Aggregation:
Using sentence embeddings as building blocks for semantic chunks leverages transformer models' strength in capturing fine-grained semantic details. Weighted averaging to update chunk embeddings preserves semantic relevance as chunks grow.
Semantic Similarity Guided Chunking:
The use of cosine similarity to decide chunk boundaries enforces semantic coherence, as only sentences closely related to the current chunk are included.
Dynamic Chunk Size Management:
Combining token length thresholds with semantic similarity ensures chunks are neither too large (which could dilute embeddings) nor semantically heterogeneous.
Use of Pretrained Transformer Models for Embeddings:
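The module employs a pretrained code-focused transformer model (microsoft/codebert-base) to generate embeddings. Because this model was pretrained on paired natural-language and programming-language data, it is expected to represent code semantics better than a language model trained on prose alone.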
Separation of Concerns:
The semantic splitting and embedding logic reside in tools.py, encapsulating NLP and embedding-specific functionality. The orchestration, database interaction, and file handling are in main.py, enabling modular development and easier maintenance.

Summary

The Semantic Splitting and Embedding Computation module transforms raw source code files into semantically coherent chunks and computes dense vector embeddings for each chunk. This process underpins the project's ability to analyze, cluster, and visualize code semantics at a granular level, enabling better insight into code structure and relationships. The module leverages sentence tokenization, transformer-based sentence embeddings, semantic similarity metrics, and dynamic chunk building to achieve this goal, integrating closely with the database backend and downstream analytical components.