tools.py
Overview
`tools.py` is a utility module designed to process textual content—particularly source code files—by splitting them into semantically coherent chunks and generating dense vector embeddings for these chunks using pretrained transformer models. The module leverages natural language processing techniques such as sentence tokenization and transformer-based sentence embeddings to break down large text data into meaningful segments, facilitating downstream tasks like clustering, similarity search, and visualization of code semantics.
Specifically, this file contains two main functions:
`semantic_split`: Splits a large text into smaller chunks based on semantic similarity and token length constraints.
`get_sentence_embedding`: Computes embedding vectors for individual sentences using a pretrained HuggingFace transformer model.
This module plays a critical role in preparing raw source file content for semantic analysis within the broader system.
Detailed Documentation
Imports
`torch`: PyTorch framework for tensor computations and model inference.
`numpy`: Numerical operations, especially arrays.
`typing.List`: Type hinting for list return types.
`transformers`: HuggingFace transformers library for tokenization and embedding models.
`nltk`: Natural Language Toolkit, used here for sentence tokenization.
`sklearn.metrics.pairwise.cosine_similarity`: To compute cosine similarity between embedding vectors.
Function: semantic_split
def semantic_split(file_content: str, tokenizer: PreTrainedTokenizer, model: PreTrainedModel) -> List[str]:
Purpose
Splits a long text string into smaller semantically coherent chunks by:
Tokenizing the text into sentences.
Computing embeddings for each sentence.
Aggregating sentences into chunks based on token limits and semantic similarity thresholds.
This approach ensures that chunks are both manageable in size and semantically consistent, which is particularly useful for processing source code or other technical documents.
Parameters
`file_content` (str): The raw text content to be split into chunks.
`tokenizer` (PreTrainedTokenizer): A HuggingFace tokenizer for tokenizing sentences and chunks.
`model` (PreTrainedModel): A HuggingFace transformer model used to generate vector embeddings for sentences.
Returns
List[str]: A list of text chunks, each representing a semantically coherent segment of the original text.
Implementation Details
Sentence Tokenization: Uses NLTK's `sent_tokenize` to split text into sentences. Downloads the 'punkt' tokenizer if not present.
Sentence Embedding: Calls `get_sentence_embedding` on each sentence to obtain a numerical vector.
Chunking Algorithm:
Initializes the first chunk with the first sentence and its embedding.
Iterates over subsequent sentences, deciding whether to append the sentence to the current chunk or start a new chunk based on:
Token limit: Concatenated chunk must not exceed 512 tokens.
Semantic similarity: Cosine similarity between the current chunk embedding and the candidate sentence embedding must be ≥ 0.80.
If either condition fails, closes the current chunk and starts a new one.
When adding sentences to a chunk, updates the chunk embedding as a weighted average based on sentence lengths (number of characters).
Returns all accumulated chunks as a list of strings.
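The chunking steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the module's actual code: the helper names `cosine` and `greedy_chunks`, the injected `count_tokens` callable, the single-space join between sentences, and the exact form of the character-weighted average are all hypothetical. Embeddings are passed in precomputed so the sketch runs without a model.

```python
from typing import Callable, List, Sequence

import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1D vectors (0.0 if either is all zeros)."""
    denom = float(np.linalg.norm(a) * np.linalg.norm(b))
    return float(np.dot(a, b)) / denom if denom else 0.0


def greedy_chunks(
    sentences: Sequence[str],
    embeddings: Sequence[np.ndarray],
    count_tokens: Callable[[str], int],
    max_tokens: int = 512,
    threshold: float = 0.80,
) -> List[str]:
    """Greedily merge consecutive sentences into chunks (hypothetical helper)."""
    if not sentences:
        return []
    chunks: List[str] = []
    # Start the first chunk from the first sentence and its embedding.
    text = sentences[0]
    emb = np.asarray(embeddings[0], dtype=float)
    length = len(text)  # running character count used as the averaging weight
    for sent, sent_emb in zip(sentences[1:], embeddings[1:]):
        sent_emb = np.asarray(sent_emb, dtype=float)
        candidate = text + " " + sent  # assumed join; the real separator may differ
        fits = count_tokens(candidate) <= max_tokens
        similar = cosine(emb, sent_emb) >= threshold
        if fits and similar:
            # Character-length-weighted running average (assumed weighting).
            emb = (emb * length + sent_emb * len(sent)) / (length + len(sent))
            text, length = candidate, length + len(sent)
        else:
            # Either condition failed: close the chunk and start a new one.
            chunks.append(text)
            text, emb, length = sent, sent_emb, len(sent)
    chunks.append(text)
    return chunks


# Toy run: two similar "code" sentences followed by an unrelated one.
sentences = ["def add(a, b): return a + b", "def sub(a, b): return a - b", "Bananas are yellow."]
embeddings = [np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([0.0, 1.0])]
chunks = greedy_chunks(sentences, embeddings, count_tokens=lambda s: len(s.split()))
print(len(chunks))  # 2
```

In practice `count_tokens` would wrap the HuggingFace tokenizer, for example `lambda s: len(tokenizer.encode(s))`. The whitespace split here only keeps the toy run free of model downloads.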
Usage Example
from transformers import AutoTokenizer, AutoModel
from tools import semantic_split  # assumes tools.py is importable from the working directory

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

file_content = """
def add(a, b):
    return a + b

# This function subtracts two numbers
def subtract(a, b):
    return a - b
"""

chunks = semantic_split(file_content, tokenizer, model)
for chunk in chunks:
    print("Chunk:")
    print(chunk)
Function: get_sentence_embedding
def get_sentence_embedding(sentence: str, tokenizer: PreTrainedTokenizer, model: PreTrainedModel) -> np.ndarray:
Purpose
Generates a fixed-size numerical embedding vector for a single sentence by passing it through a pretrained transformer model.
Parameters
`sentence` (str): The input sentence to embed.
`tokenizer` (PreTrainedTokenizer): HuggingFace tokenizer to convert the sentence into tokens.
`model` (PreTrainedModel): HuggingFace model to compute token embeddings.
Returns
np.ndarray: A 1D numpy array representing the sentence embedding vector.
Implementation Details
Tokenizes the sentence with padding and truncation.
Passes tokens through the transformer model in evaluation mode (no gradient computation).
Extracts the `last_hidden_state` from the model output, which contains token embeddings.
Applies mean pooling across all token embeddings, weighted by the attention mask to ignore padding tokens:
Multiplies token embeddings by the expanded attention mask.
Sums the embeddings and divides by the sum of the mask to get an average.
If `last_hidden_state` is not available, falls back to using the `pooler_output` (usually a transformed embedding of the CLS token).
Converts the resulting PyTorch tensor to a numpy array for downstream compatibility.
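The masked mean pooling described above can be sketched in NumPy. This is an illustrative stand-in rather than the module's code: the module is described as working on PyTorch tensors, and the function name `masked_mean_pool`, the division-by-zero guard, and the toy arrays are assumptions made for the example.

```python
import numpy as np


def masked_mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors over real (non-padding) positions.

    token_embeddings has shape (batch, seq_len, hidden);
    attention_mask has shape (batch, seq_len) with 1 for real tokens, 0 for padding.
    """
    # Expand the mask to (batch, seq_len, 1) so it broadcasts over the hidden axis.
    mask = attention_mask[..., np.newaxis].astype(token_embeddings.dtype)
    # Padded positions are multiplied by zero and drop out of the sum.
    summed = (token_embeddings * mask).sum(axis=1)
    # Number of real tokens per sequence, clipped to avoid division by zero.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts


# Toy batch: one sequence of three positions, the last one being padding.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
pooled = masked_mean_pool(hidden, mask)
print(pooled)  # [[2. 3.]]
```

The padding vector `[100.0, 100.0]` has no effect on the result, which is the point of weighting by the attention mask: a plain mean over all three positions would be dominated by it.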
Usage Example
sentence = "def add(a, b): return a + b"
embedding = get_sentence_embedding(sentence, tokenizer, model)
print(embedding.shape) # e.g., (768,)
Important Implementation Notes
Semantic Chunk Embedding Update: When adding sentences to a chunk, the embedding is updated as a weighted average based on the length of sentences in characters, not tokens. This heuristic gives longer sentences proportionally more influence on the chunk embedding.
Token Limit: The maximum chunk size is set to 512 tokens to avoid overly large inputs that can degrade embedding quality or exceed model limits.
Similarity Threshold: The cosine similarity cutoff of 0.80 is a tunable parameter ensuring that only semantically related sentences aggregate into the same chunk.
NLTK Dependency: The module relies on the `punkt` tokenizer from NLTK for sentence segmentation. It automatically downloads the resource if missing.
Model Agnostic: Though examples use `microsoft/codebert-base`, the functions accept any HuggingFace tokenizer and model compatible with the interface.
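As a worked illustration of the embedding update and the similarity threshold noted above, one plausible form of the rule is written out below. The symbols are assumptions introduced for the example, and the exact weighting used in the code may differ: $\mathbf{e}_c$ and $\ell_c$ are the current chunk's embedding and character count, and $\mathbf{e}_s$ and $\ell_s$ are those of the candidate sentence.

```latex
% Merge condition (both parts must hold):
\cos(\mathbf{e}_c, \mathbf{e}_s)
  = \frac{\mathbf{e}_c \cdot \mathbf{e}_s}{\lVert \mathbf{e}_c \rVert \, \lVert \mathbf{e}_s \rVert}
  \ge 0.80
\quad \text{and} \quad
\text{tokens}(\text{chunk} + \text{sentence}) \le 512

% Character-weighted update applied on merge:
\mathbf{e}_c \leftarrow \frac{\ell_c \, \mathbf{e}_c + \ell_s \, \mathbf{e}_s}{\ell_c + \ell_s},
\qquad
\ell_c \leftarrow \ell_c + \ell_s

% Example: \ell_c = 30,\ \mathbf{e}_c = (1, 0),\ \ell_s = 10,\ \mathbf{e}_s = (0.8, 0.6)
% \cos(\mathbf{e}_c, \mathbf{e}_s) = 0.8 \ge 0.80, so the sentence is merged (token limit permitting):
\mathbf{e}_c \leftarrow \frac{30\,(1, 0) + 10\,(0.8, 0.6)}{40} = (0.95,\ 0.15),
\qquad
\ell_c \leftarrow 40
```

In this example the longer existing chunk pulls the result toward $(1, 0)$. With equal weights the average would instead be $(0.9, 0.3)$.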
Interaction With Other System Components
main.py: This module is primarily invoked by the main orchestration script, which:
Reads raw source code files from disk.
Calls `semantic_split` to divide files into semantic chunks.
Computes chunk embeddings for later use (sometimes re-tokenizing the chunk for final embedding).
Stores embeddings and metadata for visualization and clustering.
Database Layer: The system fetches file metadata (e.g., file paths, IDs) from a database to locate source files for processing.
Embedding Storage: Generated embeddings are serialized and stored, enabling efficient retrieval for downstream analysis.
Visualizer Module (`visualizer.py`): Uses the embeddings produced by this module to support interactive exploration of code semantics, cluster inspection, and visualization.
Summary
`tools.py` encapsulates the core logic for semantically splitting source code or text files into meaningful chunks and generating their vector embeddings with pretrained transformer models. It employs sentence tokenization, transformer-based embedding, and a heuristic chunking algorithm balancing token count and semantic similarity. This modular approach supports fine-grained semantic analysis of code, enabling improved clustering, visualization, and understanding within the broader software project.
References
HuggingFace Transformers Documentation: https://huggingface.co/docs/transformers/index
NLTK Tokenization: https://www.nltk.org/api/nltk.tokenize.html
Scikit-learn Cosine Similarity: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html