tools.py

Overview

`tools.py` is a utility module that splits textual content, particularly source code files, into semantically coherent chunks and generates dense vector embeddings for those chunks with pretrained transformer models. It combines NLTK sentence tokenization with transformer-based sentence embeddings so that large inputs become segments suitable for downstream tasks such as clustering, similarity search, and visualization of code semantics.

Specifically, this file contains two main functions:

semantic_split: Splits a large text into smaller chunks based on semantic similarity and token length constraints.
get_sentence_embedding: Computes embedding vectors for individual sentences using a pretrained HuggingFace transformer model.

The rest of the system relies on this module to prepare raw source file content for semantic analysis.

Detailed Documentation

Imports

torch: PyTorch framework for tensor computations and model inference.
numpy: Numerical operations, especially arrays.
typing.List: Type hinting for list return types.
transformers: HuggingFace transformers library for tokenization and embedding models.
nltk: Natural Language Toolkit, used here for sentence tokenization.
sklearn.metrics.pairwise.cosine_similarity: To compute cosine similarity between embedding vectors.

Function: `semantic_split`

def semantic_split(file_content: str, tokenizer: PreTrainedTokenizer, model: PreTrainedModel) -> List[str]:

Purpose

Splits a long text string into smaller semantically coherent chunks by:

Tokenizing the text into sentences.
Computing embeddings for each sentence.
Aggregating sentences into chunks based on token limits and semantic similarity thresholds.

This approach aims to keep chunks both bounded in size (at most 512 tokens) and semantically consistent, which is particularly useful for processing source code or other technical documents.

Parameters

file_content (str): The raw text content to be split into chunks.
tokenizer (PreTrainedTokenizer): A HuggingFace tokenizer for tokenizing sentences and chunks.
model (PreTrainedModel): A HuggingFace transformer model used to generate vector embeddings for sentences.

Returns

List[str]: A list of text chunks, each representing a semantically coherent segment of the original text.

Implementation Details

Sentence Tokenization: Uses NLTK's sent_tokenize to split text into sentences. Downloads the 'punkt' tokenizer if not present.
Sentence Embedding: Calls get_sentence_embedding on each sentence to obtain a numerical vector.
Chunking Algorithm:
- Initializes the first chunk with the first sentence and its embedding.
- Iterates over subsequent sentences, deciding whether to append the sentence to the current chunk or start a new chunk based on:
  - Token limit: Concatenated chunk must not exceed 512 tokens.
  - Semantic similarity: Cosine similarity between the current chunk embedding and the candidate sentence embedding must be ≥ 0.80.
- If either condition fails, closes the current chunk and starts a new one with the candidate sentence.
- When adding sentences to a chunk, updates the chunk embedding as a weighted average based on sentence lengths (number of characters).
Returns all accumulated chunks as a list of strings.
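
The chunking steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the module's actual code: `greedy_semantic_chunks`, `embed`, and `count_tokens` are hypothetical names, with `embed` standing in for get_sentence_embedding and `count_tokens` for a tokenizer-based length check. Joining sentences with a single space and weighting the running chunk by its character count are also assumptions.

```python
import math
from typing import Callable, List, Sequence

MAX_TOKENS = 512             # token limit described above
SIMILARITY_THRESHOLD = 0.80  # cosine similarity cutoff described above


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def greedy_semantic_chunks(
    sentences: List[str],
    embed: Callable[[str], Sequence[float]],  # hypothetical stand-in for get_sentence_embedding
    count_tokens: Callable[[str], int],       # hypothetical stand-in for a tokenizer length check
) -> List[str]:
    if not sentences:
        return []

    chunks: List[str] = []
    # The first chunk starts with the first sentence and its embedding.
    current = sentences[0]
    current_emb = list(embed(current))

    for sentence in sentences[1:]:
        emb = list(embed(sentence))
        candidate = current + " " + sentence  # assumed joiner
        fits = count_tokens(candidate) <= MAX_TOKENS
        similar = cosine(current_emb, emb) >= SIMILARITY_THRESHOLD

        if fits and similar:
            # Update the chunk embedding as a character-length weighted average.
            w_old, w_new = len(current), len(sentence)
            current_emb = [
                (w_old * c + w_new * e) / (w_old + w_new)
                for c, e in zip(current_emb, emb)
            ]
            current = candidate
        else:
            # Either condition failed: close the chunk and start a new one.
            chunks.append(current)
            current, current_emb = sentence, emb

    chunks.append(current)
    return chunks
```

Passing the embedding and token-counting steps in as callables keeps the sketch independent of any particular model, so the grouping logic can be exercised with toy vectors before wiring in a real tokenizer and model.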

Usage Example

from transformers import AutoTokenizer, AutoModel
from tools import semantic_split  # assumes tools.py is on the import path

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

file_content = """
def add(a, b):
    return a + b

# This function subtracts two numbers
def subtract(a, b):
    return a - b
"""

chunks = semantic_split(file_content, tokenizer, model)
for chunk in chunks:
    print("Chunk:")
    print(chunk)

Function: `get_sentence_embedding`

def get_sentence_embedding(sentence: str, tokenizer: PreTrainedTokenizer, model: PreTrainedModel) -> np.ndarray:

Purpose

Generates a fixed-size numerical embedding vector for a single sentence by passing it through a pretrained transformer model.

Parameters

sentence (str): The input sentence to embed.
tokenizer (PreTrainedTokenizer): HuggingFace tokenizer to convert the sentence into tokens.
model (PreTrainedModel): HuggingFace model to compute token embeddings.

Returns

np.ndarray: A 1D numpy array representing the sentence embedding vector.

Implementation Details

Tokenizes the sentence with padding and truncation.
Passes tokens through the transformer model in evaluation mode (no gradient computation).
Extracts the last_hidden_state from the model output, which contains token embeddings.
Applies mean pooling across all token embeddings, weighted by the attention mask to ignore padding tokens:
- Multiplies token embeddings by the expanded attention mask.
- Sums the embeddings and divides by the sum of the mask to get an average.
If last_hidden_state is not available, falls back to pooler_output (for BERT-style models, the [CLS] token's hidden state passed through a dense layer and a tanh activation).
Converts the resulting PyTorch tensor to a numpy array for downstream compatibility.
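
The masked mean pooling step can be illustrated with plain Python lists. This is a sketch of the arithmetic only; the module performs the same operation on PyTorch tensors. The function name `masked_mean_pool` is hypothetical, and the small lower bound on the denominator is an assumption added here to avoid division by zero, not something the source states.

```python
from typing import List


def masked_mean_pool(token_embeddings: List[List[float]],
                     attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1, ignoring padding (mask 0)."""
    hidden_size = len(token_embeddings[0])
    summed = [0.0] * hidden_size
    for vector, mask in zip(token_embeddings, attention_mask):
        for i, value in enumerate(vector):
            summed[i] += value * mask  # padding rows contribute zero
    # Assumed guard: keep the denominator positive for an all-padding input.
    denom = max(sum(attention_mask), 1e-9)
    return [s / denom for s in summed]


# Two real tokens followed by one padding row that must not affect the result.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(masked_mean_pool(tokens, mask))  # [2.0, 3.0]
```

Without the mask, the padding row would pull the average toward its own values, which is why the sum is divided by the number of real tokens rather than the padded sequence length.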

Usage Example

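from tools import get_sentence_embedding  # assumes tools.py is on the import path

# tokenizer and model are loaded as in the semantic_split example
sentence = "def add(a, b): return a + b"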
embedding = get_sentence_embedding(sentence, tokenizer, model)
print(embedding.shape)  # e.g., (768,)

Important Implementation Notes

Semantic Chunk Embedding Update: When a sentence is added to a chunk, the chunk embedding is updated as a weighted average whose weights are sentence lengths in characters, not tokens. Under this heuristic, longer sentences contribute proportionally more to the chunk embedding.
Token Limit: The maximum chunk size is set to 512 tokens to avoid overly large inputs that can degrade embedding quality or exceed model limits.
Similarity Threshold: The cosine similarity cutoff of 0.80 is a tunable parameter. A sentence joins the current chunk only if its embedding is at least this similar to the chunk embedding.
NLTK Dependency: The module relies on the punkt tokenizer from NLTK for sentence segmentation. It automatically downloads the resource if missing.
Model Agnostic: Though examples use microsoft/codebert-base, the functions accept any HuggingFace tokenizer and model compatible with the interface.
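
As a worked example of the character-weighted embedding update noted above, the sketch below blends a chunk embedding with a new sentence embedding. The helper name `update_chunk_embedding` is hypothetical, and weighting the existing chunk by its total character count is an assumption about how the heuristic is applied.

```python
from typing import List, Sequence


def update_chunk_embedding(chunk_emb: Sequence[float], chunk_chars: int,
                           sent_emb: Sequence[float], sent_chars: int) -> List[float]:
    """Blend two embeddings, weighting each by its text length in characters."""
    total = chunk_chars + sent_chars
    return [
        (chunk_chars * c + sent_chars * s) / total
        for c, s in zip(chunk_emb, sent_emb)
    ]


# A 30-character chunk absorbs a 10-character sentence,
# so the chunk keeps 75% of the weight.
print(update_chunk_embedding([1.0, 0.0], 30, [0.0, 1.0], 10))  # [0.75, 0.25]
```

Character counts are cheap to compute and need no extra tokenizer call, which may be why they are used here instead of token counts.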

Interaction With Other System Components

main.py: This module is primarily invoked by the main orchestration script, which:
- Reads raw source code files from disk.
- Calls semantic_split to divide files into semantic chunks.
- Computes chunk embeddings for later use (sometimes re-tokenizing the chunk for final embedding).
- Stores embeddings and metadata for visualization and clustering.
Database Layer: The system fetches file metadata (e.g., file paths, IDs) from a database to locate source files for processing.
Embedding Storage: Generated embeddings are serialized and stored, enabling efficient retrieval for downstream analysis.
Visualizer Module (visualizer.py): Uses the embeddings produced by this module to support interactive exploration of code semantics, cluster inspection, and visualization.

Summary

`tools.py` encapsulates the core logic for semantically splitting source code or text files into meaningful chunks and generating their vector embeddings with pretrained transformer models. It employs sentence tokenization, transformer-based embedding, and a heuristic chunking algorithm balancing token count and semantic similarity. This modular approach supports fine-grained semantic analysis of code, enabling improved clustering, visualization, and understanding within the broader software project.

References

HuggingFace Transformers Documentation: https://huggingface.co/docs/transformers/index
NLTK Tokenization: https://www.nltk.org/api/nltk.tokenize.html
Scikit-learn Cosine Similarity: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
PyTorch: https://pytorch.org/docs/stable/index.html
