Semantic Splitting and Embedding Computation

Overview and Purpose

This module transforms raw source code files into semantically meaningful vector representations. It addresses two problems: splitting each file into coherent chunks that preserve semantic context, and computing numerical embeddings for those chunks with a pretrained transformer model. The embeddings are the input to downstream analysis such as clustering, dimensionality reduction, and visualization.

**Why this module exists:**

Core Concepts

Semantic Splitting

Semantic splitting divides a file’s content into smaller chunks that are internally coherent and distinct from one another based on semantic similarity. Unlike naive splitting by lines or fixed sizes, this approach uses sentence embeddings and similarity metrics to determine chunk boundaries dynamically.
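As a minimal sketch of the idea, assume sentence embeddings are already available as plain lists of floats. A chunk boundary can then be placed wherever the cosine similarity between adjacent sentences falls below a threshold. The names `cosine_similarity` and `find_boundaries` and the default `threshold` of 0.5 are illustrative assumptions, not the module's actual API.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def find_boundaries(sentence_embeddings, threshold=0.5):
    """Return each index i at which a new chunk would start (hypothetical helper)."""
    boundaries = []
    for i in range(1, len(sentence_embeddings)):
        similarity = cosine_similarity(sentence_embeddings[i - 1], sentence_embeddings[i])
        if similarity < threshold:
            boundaries.append(i)
    return boundaries

# Toy 2-D "embeddings": the first two point the same way, the third is orthogonal.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(find_boundaries(toy))  # [2]
```

In practice the threshold would need tuning per model, since similarity scores from different embedding models are not on a common scale.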

Embedding Computation

Each chunk is converted into a fixed-size numeric vector (embedding) using a pretrained transformer model. This embedding encodes semantic information about the chunk's content, enabling comparison and clustering across chunks.
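One common way to get a fixed-size vector from a variable number of token vectors is mean pooling. The sketch below shows the idea in plain Python and skips padded positions by using an attention mask. The function name `masked_mean_pool` and the toy inputs are assumptions for illustration; whether the module itself masks padding is not stated here.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors, counting only positions whose mask value is 1."""
    if len(token_vectors) != len(attention_mask):
        raise ValueError("token_vectors and attention_mask must have the same length")
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    # The output length depends only on the hidden size, not on the token count.
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Three token positions with hidden size 2; the last position is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 0]
print(masked_mean_pool(tokens, mask))  # [2.0, 3.0]
```

When a single sequence is encoded on its own, there is no padding, so a plain mean over the token axis gives the same result as the masked version.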

How the Module Works

Semantic Splitting Process

The splitting process balances chunk-size constraints against semantic cohesion, so that chunks neither grow too large nor become semantically fragmented.
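The trade-off between size and cohesion can be sketched as a greedy chunk builder. It starts a new chunk when adding the next sentence would exceed a size cap, or when that sentence drifts too far from the running chunk's centroid. Everything here is a hypothetical illustration: `build_chunks`, `max_chars`, and `min_similarity` are assumed names and defaults, and the real `semantic_split` may use different rules.

```python
import math

def _cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def _centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def build_chunks(sentences, embeddings, max_chars=200, min_similarity=0.5):
    """Greedily group sentences under a size cap and a cohesion threshold."""
    chunks = []
    current, current_vecs, current_len = [], [], 0
    for sentence, vec in zip(sentences, embeddings):
        if current:
            too_long = current_len + len(sentence) > max_chars
            drifted = _cosine(_centroid(current_vecs), vec) < min_similarity
            if too_long or drifted:
                # Close the running chunk and start a fresh one.
                chunks.append("\n".join(current))
                current, current_vecs, current_len = [], [], 0
        current.append(sentence)
        current_vecs.append(vec)
        current_len += len(sentence)
    if current:
        chunks.append("\n".join(current))
    return chunks

sentences = ["import os", "import sys", "def main(): pass"]
vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(build_chunks(sentences, vectors))
```

With the defaults, the two import lines stay together and the function definition starts a new chunk. Lowering `max_chars` to 12 forces every sentence into its own chunk, which shows the size cap overriding cohesion.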

Embedding Computation for Sentences and Chunks

Integration in Workflow (Reference to main.py)

**Excerpt illustrating chunk splitting and embedding computation:**

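import torch

embeddings, metadata = [], []  # accumulators; defined once, before any per-file loop
chunks = semantic_split(file_content, tokenizer, model)
for chunk in chunks:
    # Tokenize one chunk; truncation caps it at the model's maximum input length.
    inputs = tokenizer(str(chunk), return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():  # inference only, so skip gradient tracking
        outputs = model(**inputs)
    # Mean-pool the token states into a single vector for the chunk.
    embeddings.append(outputs.last_hidden_state.mean(dim=1).squeeze().tolist())
    metadata.append((file_id, overview_text, path, name, page_id))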

Interaction with Other Parts of the System

Important Concepts and Design Patterns

Summary

The Semantic Splitting and Embedding Computation module transforms raw source code files into semantically coherent chunks and computes dense vector embeddings for each chunk. This process underpins the project's ability to analyze, cluster, and visualize code semantics at a granular level, enabling better insight into code structure and relationships. The module leverages sentence tokenization, transformer-based sentence embeddings, semantic similarity metrics, and dynamic chunk building to achieve this goal, integrating closely with the database backend and downstream analytical components.