embeddings.py

Overview

The `embeddings.py` file provides utilities for generating vector embeddings from code snippets with an Ollama embedding model. It splits large code units into manageable chunks, processes those chunks in fixed-size batches, and falls back to a zero vector when embedding generation fails for a text.

Key functionalities include:

- Splitting large code units into smaller chunks to maintain embedding quality.
- Grouping texts into fixed-size batches so that work is sent to the service in controlled bursts.
- Fetching embeddings from Ollama, with a zero-vector fallback when a request fails.
- Inserting a minimal delay between batches to avoid overloading the embedding service.

The module is intended as a utility within a larger system that converts code snippets into numerical vector representations for tasks such as code search, classification, or similarity analysis.

Detailed Documentation

Functions

`chunk_code_units_for_embedding(code_units: List[str], max_lines: int = 50) -> List[str]`

Splits long code units into smaller chunks limited by a maximum number of lines per chunk to keep embeddings focused and consistent in size.

Parameters:
- code_units (List[str]): A list of code strings, each potentially containing multiple lines.
- max_lines (int, optional): The maximum number of lines allowed per chunk (default is 50).
Returns:
- List[str]: A list of code chunks, where each chunk contains at most max_lines lines.

Usage example:

code_snippets = [
    "\n".join(["x = 1"] * 60),  # 60 lines of code
    "print('hello world')"
]
chunks = chunk_code_units_for_embedding(code_snippets, max_lines=50)
# chunks will contain three items: the first 50 lines, the remaining 10 lines, and the print statement

Implementation details:
- Each code unit is split by lines.
- If the unit is smaller than or equal to max_lines, it is added as-is.
- Otherwise, it is split into multiple chunks of at most max_lines lines each (the last chunk may be shorter).
- Empty or whitespace-only chunks are filtered out before returning.
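The steps above can be sketched roughly as follows. This is a minimal illustration, not the module's actual source: the function name `chunk_code_units_sketch` is hypothetical, and the use of `splitlines()` is an assumption (the real code may split on "\n" instead).

```python
from typing import List


def chunk_code_units_sketch(code_units: List[str], max_lines: int = 50) -> List[str]:
    """Split each code unit into chunks of at most max_lines lines (illustrative only)."""
    chunks: List[str] = []
    for unit in code_units:
        lines = unit.splitlines()  # assumption: the original may split on "\n" instead
        if len(lines) <= max_lines:
            chunks.append(unit)  # short units are kept as-is
        else:
            # Walk the unit in steps of max_lines; the last slice may be shorter.
            for start in range(0, len(lines), max_lines):
                chunks.append("\n".join(lines[start:start + max_lines]))
    # Drop chunks that are empty or contain only whitespace.
    return [chunk for chunk in chunks if chunk.strip()]


units = ["\n".join(f"x{i} = {i}" for i in range(120)), "print('hello world')", "   \n"]
result = chunk_code_units_sketch(units, max_lines=50)
print([len(chunk.splitlines()) for chunk in result])  # [50, 50, 20, 1]
```

The 120-line unit becomes chunks of 50, 50, and 20 lines, the one-line unit passes through unchanged, and the whitespace-only unit is filtered out.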

`batch(iterable, n=EMBED_BATCH)`

A generator function that yields successive batches of size `n` from the input iterable.

Parameters:
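- iterable (sequence type): The input list or sequence to be batched. Because the implementation is slicing-based, the argument should support len() and slicing; a generator would need to be converted to a list first.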
- n (int, optional): Batch size (default is the constant EMBED_BATCH imported from config).
Yields:
- List: Subsequences of iterable each of length n (except possibly the last one).

Usage example:

items = [1,2,3,4,5,6,7]
for batch_items in batch(items, n=3):
    print(batch_items)
# Output:
# [1, 2, 3]
# [4, 5, 6]
# [7]

Implementation details:
- Simple slicing based batching.
- Uses a for loop incremented by n to yield chunks.
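A slicing-based batcher of this kind might look like the sketch below. The name `batch_sketch` and the constant `DEFAULT_BATCH_SIZE` are hypothetical stand-ins; the real default comes from EMBED_BATCH in the config module, whose value is not given here.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")

DEFAULT_BATCH_SIZE = 3  # hypothetical stand-in for EMBED_BATCH from the config module


def batch_sketch(items: Sequence[T], n: int = DEFAULT_BATCH_SIZE) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most n items (illustrative only)."""
    for start in range(0, len(items), n):
        yield items[start:start + n]


batches = list(batch_sketch([1, 2, 3, 4, 5, 6, 7], n=3))
print(batches)  # [[1, 2, 3], [4, 5, 6], [7]]
```

Because the loop relies on `len()` and slicing, an empty sequence simply yields nothing, and the final batch carries whatever remainder is left.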

`get_embeddings_ollama(texts: List[str], model: str = EMBED_MODEL) -> List[np.ndarray]`

Generates embeddings for a list of text strings (code chunks) using the Ollama API in batches.

Parameters:
- texts (List[str]): List of strings to embed.
- model (str, optional): The embedding model name to use (default is EMBED_MODEL from config).
Returns:
- List[np.ndarray]: A list of NumPy arrays representing the embedding vectors for each input text.

Usage example:

code_chunks = ["def foo():\n    pass", "print('hello')"]
embeddings = get_embeddings_ollama(code_chunks)
for emb in embeddings:
    print(emb.shape)  # e.g., (768,)

Implementation details:
- The function processes texts in batches defined by EMBED_BATCH.
- For each text in a batch:
  - Calls ollama.embeddings() with the specified model and text.
  - Converts the returned embedding to a NumPy array of floats.
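- If embedding generation fails for any text (e.g., API failure), it logs a warning and substitutes a zero vector of length 768 for that text. Because this length is fixed, the fallback only matches the real embeddings when the chosen model also produces 768-dimensional vectors.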
- Adds a short sleep (0.01 seconds) after processing each batch to prevent overwhelming the Ollama service.
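The control flow described above could be sketched as follows. To keep the example runnable without an Ollama server, the per-text call is passed in as a hypothetical `embed_one` hook; in the real module that role is played by `ollama.embeddings()`, whose exact arguments are not shown in this document. The names `embed_texts_sketch`, `fake_embed`, and `FALLBACK_DIM` are likewise illustrative.

```python
import logging
import time
from typing import Callable, List, Sequence

import numpy as np

logger = logging.getLogger(__name__)

FALLBACK_DIM = 768  # fallback length stated in this document


def embed_texts_sketch(
    texts: Sequence[str],
    embed_one: Callable[[str], Sequence[float]],  # hypothetical hook for the real client call
    batch_size: int = 2,  # stand-in for EMBED_BATCH
    delay: float = 0.01,
) -> List[np.ndarray]:
    """Embed texts one at a time, substituting a zero vector on failure."""
    vectors: List[np.ndarray] = []
    for start in range(0, len(texts), batch_size):
        for text in texts[start:start + batch_size]:
            try:
                vectors.append(np.asarray(embed_one(text), dtype=float))
            except Exception as exc:
                logger.warning("Embedding failed, using zero vector: %s", exc)
                vectors.append(np.zeros(FALLBACK_DIM, dtype=float))
        time.sleep(delay)  # brief pause after each batch
    return vectors


def fake_embed(text: str) -> List[float]:
    """Test double: fails on empty input, otherwise returns a constant vector."""
    if not text:
        raise ValueError("empty input")
    return [float(len(text))] * FALLBACK_DIM


result = embed_texts_sketch(["def foo(): pass", "", "print('hi')"], fake_embed)
print([bool(v.any()) for v in result])  # [True, False, True]
```

The output list always has one entry per input text, and the failed (empty) input shows up as an all-zero vector in the same position.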

Important Implementation Details and Algorithms

- Chunking algorithm: The chunk_code_units_for_embedding function ensures that embeddings are generated on reasonably sized code segments (no longer than 50 lines by default). This keeps each embedding focused, since very long texts may dilute the representation.
- Batch processing: The batch generator groups the texts into fixed-size batches. Each text is still embedded with its own ollama.embeddings() call, so batching does not reduce the number of requests; it defines the points at which the delay between groups is applied.
- Error handling and fallback: get_embeddings_ollama catches any exception raised during embedding generation, logs a warning, and substitutes a zero vector so that the output keeps one entry per input text and downstream processes do not break.
- Rate limiting: A minimal delay (0.01 seconds) is introduced after each batch to avoid overwhelming the embedding service.
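One consequence of the zero-vector fallback is that failed texts are not flagged in the return value. A zero vector has no direction, so cosine similarity against it is undefined, and callers may want to detect such entries before indexing or comparing them. The helper below is a hypothetical sketch of that check and is not part of `embeddings.py`.

```python
from typing import List, Sequence

import numpy as np


def fallback_indices(vectors: Sequence[np.ndarray]) -> List[int]:
    """Return positions whose vector is all zeros (likely a failed embedding)."""
    return [i for i, vec in enumerate(vectors) if not np.any(vec)]


vectors = [np.ones(4), np.zeros(4), np.array([0.0, 0.5, 0.0, 0.0])]
failed = fallback_indices(vectors)
print(failed)  # [1]
```

The returned indices can be used to retry those texts or to exclude them from similarity search.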

Interaction with Other Parts of the System

Imports:
- EMBED_MODEL and EMBED_BATCH are imported from a separate config module, indicating that model parameters and batch sizes are configurable at the application level.
- The ollama module is used to call the Ollama embedding service.
- numpy is used for numerical array handling of embeddings.
Usage context:
- This file acts as a utility within a larger pipeline that processes code units (possibly source files or code snippets) and converts them into embeddings.
- Other system components likely call chunk_code_units_for_embedding to preprocess raw code, then get_embeddings_ollama to obtain embeddings for downstream ML models or search systems.

Structure Diagram

classDiagram
    class embeddings {
        +chunk_code_units_for_embedding(code_units: List[str], max_lines: int=50) List[str]
        +batch(iterable, n: int=EMBED_BATCH) generator
        +get_embeddings_ollama(texts: List[str], model: str=EMBED_MODEL) List[np.ndarray]
    }

This diagram uses class notation to represent the `embeddings.py` module and its three functions. The file itself defines no classes.

Summary

The `embeddings.py` file is a focused utility module that prepares code snippets and generates numerical embeddings for them using the Ollama embedding API. It combines input chunking, batch processing, and a zero-vector fallback so that callers always receive one vector per input text for downstream code analysis or machine learning workflows.