embeddings.py
Overview
The `embeddings.py` file provides utilities for generating vector embeddings from code snippets using the Ollama embedding model. It handles preprocessing of code units by chunking large pieces of code into manageable segments, batching requests to the Ollama API to optimize performance, and gracefully handling errors during embedding generation.
Key functionalities include:
Splitting large code units into smaller chunks to maintain embedding quality.
Batching multiple texts to reduce overhead in API calls.
Fetching embeddings from Ollama, with a zero-vector fallback when a request fails.
Minimal delay insertion to avoid overloading the embedding service.
This file is primarily designed to be used as a utility module within a larger system that requires converting code snippets into numerical vector representations for tasks such as code search, classification, or similarity analysis.
Detailed Documentation
Functions
chunk_code_units_for_embedding(code_units: List[str], max_lines: int = 50) -> List[str]
Splits long code units into smaller chunks limited by a maximum number of lines per chunk to keep embeddings focused and consistent in size.
Parameters:
`code_units` (List[str]): A list of code strings, each potentially containing multiple lines.
`max_lines` (int, optional): The maximum number of lines allowed per chunk (default is 50).
Returns:
List[str]: A list of code chunks, where each chunk contains at most `max_lines` lines.
Usage example:
    code_snippets = [
        "def foo():\n    pass\n" * 30,  # 60 lines of code
        "print('hello world')",
    ]
    chunks = chunk_code_units_for_embedding(code_snippets, max_lines=50)
    # chunks will contain three items: the first 50 lines of the long unit,
    # its remaining 10 lines, and the print statement
Implementation details:
Each code unit is split by lines.
If the unit is smaller than or equal to `max_lines`, it is added as-is.
Otherwise, it is split into multiple chunks, each containing at most `max_lines` lines.
Empty or whitespace-only chunks are filtered out before returning.
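The steps above can be sketched as follows. This is a minimal illustration and not the module's actual source: the name `chunk_by_lines` is hypothetical, and the use of `str.splitlines()` and `"\n".join(...)` is an assumption about how lines are split and rejoined.

```python
from typing import List


def chunk_by_lines(code_units: List[str], max_lines: int = 50) -> List[str]:
    """Split each code unit into chunks of at most `max_lines` lines."""
    chunks: List[str] = []
    for unit in code_units:
        lines = unit.splitlines()
        if len(lines) <= max_lines:
            # Short units are kept as-is.
            chunks.append(unit)
        else:
            # Long units are cut into consecutive windows of max_lines lines.
            for start in range(0, len(lines), max_lines):
                chunks.append("\n".join(lines[start:start + max_lines]))
    # Drop chunks that are empty or contain only whitespace.
    return [chunk for chunk in chunks if chunk.strip()]
```

Note that a purely line-based split can cut through the middle of a function or class body; the documentation above does not describe any syntax-aware splitting, so the sketch does not attempt it either.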
batch(iterable, n=EMBED_BATCH)
A generator function that yields successive batches of size `n` from the input iterable.
Parameters:
`iterable` (sequence type): The input list or sequence to be batched.
`n` (int, optional): Batch size (default is the constant `EMBED_BATCH` imported from config).
Yields:
List: Subsequences of `iterable`, each of length `n` (except possibly the last one).
Usage example:
    items = [1, 2, 3, 4, 5, 6, 7]
    for batch_items in batch(items, n=3):
        print(batch_items)
    # Output:
    # [1, 2, 3]
    # [4, 5, 6]
    # [7]
Implementation details:
Simple slicing-based batching.
Uses a `for` loop incremented by `n` to yield chunks.
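A slicing-based generator of this kind might look like the sketch below. The value `16` for `EMBED_BATCH` is a placeholder assumption; the real constant is defined in the `config` module and its value is not stated here.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")

# Hypothetical stand-in for the EMBED_BATCH constant imported from config.
EMBED_BATCH = 16


def batch(iterable: Sequence[T], n: int = EMBED_BATCH) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `iterable`, each of length `n` or less."""
    for start in range(0, len(iterable), n):
        yield iterable[start:start + n]
```

Because this approach relies on `len()` and slicing, it works for lists, tuples, and strings but not for arbitrary iterators or generators, which matches the "sequence type" requirement on the `iterable` parameter.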
get_embeddings_ollama(texts: List[str], model: str = EMBED_MODEL) -> List[np.ndarray]
Generates embeddings for a list of text strings (code chunks) using the Ollama API in batches.
Parameters:
`texts` (List[str]): List of strings to embed.
`model` (str, optional): The embedding model name to use (default is `EMBED_MODEL` from config).
Returns:
List[np.ndarray]: A list of NumPy arrays representing the embedding vectors for each input text.
Usage example:
    code_chunks = ["def foo():\n    pass", "print('hello')"]
    embeddings = get_embeddings_ollama(code_chunks)
    for emb in embeddings:
        print(emb.shape)  # e.g., (768,)
Implementation details:
The function processes texts in batches defined by `EMBED_BATCH`.
For each text in a batch:
Calls `ollama.embeddings()` with the specified model and text.
Converts the returned embedding to a NumPy array of floats.
If embedding generation fails for any text (e.g., API failure), it logs a warning and substitutes a zero vector of length 768 for that text, so the output always has one vector per input.
Adds a short sleep (0.01 seconds) after processing each batch to prevent overwhelming the Ollama service.
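The control flow described above (per-text calls inside batches, a zero-vector fallback, and a short pause between batches) can be sketched as follows. To keep the example runnable without an Ollama server, the embedding call is injected as a hypothetical `embed_one` callable, and plain lists of floats stand in for NumPy arrays; the function name, the batch size of 16, and the stub embedder are all assumptions made for illustration.

```python
import logging
import time
from typing import Callable, List, Sequence

logger = logging.getLogger(__name__)

EMBED_DIM = 768    # fallback vector length stated in the documentation
EMBED_BATCH = 16   # hypothetical value; the real one lives in config


def embed_with_fallback(
    texts: Sequence[str],
    embed_one: Callable[[str], Sequence[float]],
    batch_size: int = EMBED_BATCH,
    delay: float = 0.01,
) -> List[List[float]]:
    """Embed texts batch by batch, substituting zeros for failed texts."""
    results: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        for text in texts[start:start + batch_size]:
            try:
                vector = [float(x) for x in embed_one(text)]
            except Exception as exc:  # broad on purpose: any failure falls back
                logger.warning("Embedding failed, using zero vector: %s", exc)
                vector = [0.0] * EMBED_DIM
            results.append(vector)
        # Short pause between batches to avoid flooding the service.
        time.sleep(delay)
    return results


def fake_embed(text: str) -> List[float]:
    """Stub embedder for the demo: rejects empty input, else returns ones."""
    if not text:
        raise ValueError("empty text")
    return [1.0] * EMBED_DIM


# The second text fails, so vectors[1] is a zero vector of length 768.
vectors = embed_with_fallback(["a", "", "b"], fake_embed, batch_size=2, delay=0.0)
```

With the real client, `embed_one` would presumably wrap the `ollama.embeddings()` call named above, passing the configured model and the text and extracting the embedding from the response.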
Important Implementation Details and Algorithms
Chunking algorithm: The `chunk_code_units_for_embedding` function ensures that embeddings are generated on reasonably sized code segments (no longer than 50 lines by default). This improves the quality and focus of embeddings since extremely long texts may dilute the representation.
Batch processing: The `batch` generator allows the embedding calls to be performed in fixed-size groups, optimizing API usage and improving throughput.
Error handling and fallback: `get_embeddings_ollama` catches any exceptions during embedding generation, logs a warning, and substitutes a zero vector embedding to maintain consistent output length and avoid breaking downstream processes.
Rate limiting: A minimal delay is introduced between batches to avoid overwhelming the embedding service, which is a practical consideration for robust system behavior.
Interaction with Other Parts of the System
Imports:
`EMBED_MODEL` and `EMBED_BATCH` are imported from a separate `config` module, indicating that model parameters and batch sizes are configurable at the application level.
The `ollama` module is used to interface with an external embedding service or API.
`numpy` is used for numerical array handling of embeddings.
Usage context:
This file acts as a utility within a larger pipeline that processes code units (possibly source files or code snippets) and converts them into embeddings.
Other system components likely call `chunk_code_units_for_embedding` to preprocess raw code, then `get_embeddings_ollama` to obtain embeddings for downstream ML models or search systems.
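As an illustration of how a downstream search component might consume these embeddings, the sketch below ranks chunk vectors against a query vector by cosine similarity. It is not part of `embeddings.py`: the names `cosine` and `rank_chunks` are hypothetical, and plain lists stand in for NumPy arrays. The zero-norm guard matters here because a zero-vector fallback would otherwise cause a division by zero.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector (e.g. a failed-embedding fallback) matches nothing.
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def rank_chunks(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
) -> List[Tuple[int, float]]:
    """Return (chunk index, score) pairs sorted by descending similarity."""
    scores = [(i, cosine(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Treating fallback vectors as a similarity of 0.0 is one possible design choice; a caller could equally filter them out before ranking.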
Structure Diagram
classDiagram
class embeddings {
+chunk_code_units_for_embedding(code_units: List[str], max_lines: int=50) List[str]
+batch(iterable, n: int=EMBED_BATCH) generator
+get_embeddings_ollama(texts: List[str], model: str=EMBED_MODEL) List[np.ndarray]
}
This diagram illustrates the `embeddings.py` file as a module containing three main functions. There are no classes defined in this file.
Summary
The `embeddings.py` file is a focused utility module that prepares and generates numerical embeddings from code snippets using the Ollama embedding API. It efficiently manages input chunking, batch processing, and error handling to provide reliable embeddings for downstream tasks in code analysis or machine learning workflows.