config.py

Overview

The `config.py` file is the centralized configuration module for a code analysis and summarization system. It defines a set of constants and one mapping that control embedding generation, code summarization, similarity thresholds, batch sizes, visualization, and the lookup from file extension to language. Other modules import these values so that models, thresholds, and processing parameters stay consistent across the system.

This file does not contain executable functions or classes but provides important parameters that influence the runtime behavior and performance of the application components related to code embeddings, summarization, and evaluation.

Configuration Constants and Their Usage

EMBED_MODEL

EMBED_MODEL = "unclemusclez/jina-embeddings-v2-base-code"

Type: str
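Description: Identifier of the model used to generate embeddings for code snippets. The name points to a community-published build of jina-embeddings-v2-base-code, an embedding model trained on source code, which makes it a natural fit for semantic similarity computations. The `user/model` form appears to follow the naming used for community models in the Ollama registry.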
Usage: Passed to embedding generation components to instantiate or select the embedding model.

SUMMARY_MODEL

SUMMARY_MODEL = "hf.co/ertghiu256/qwen3-4b-code-reasoning-gguf:Q4_K_M"

Type: str
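Description: Specifies the Ollama chat model used for summarizing code units. The `hf.co/` prefix is Ollama's syntax for pulling a GGUF model directly from Hugging Face, and the `Q4_K_M` tag selects a 4-bit quantized variant. The repository name indicates a Qwen3 4B model tuned for code reasoning.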
Usage: Used by summarization components to load or reference the summarization model.

SIM_THRESHOLD

SIM_THRESHOLD = 0.7

Type: float
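Description: The minimum similarity score (presumably cosine similarity, as in the example usage) at which two code embeddings are considered similar. A score equal to the threshold counts as similar. Used in tasks such as clustering, duplicate detection, or similarity filtering.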
Usage: Applied to similarity scores to make decisions about code similarity.

PARTIAL_THRESHOLD

PARTIAL_THRESHOLD = 0.5

Type: float
Description: A lower similarity threshold potentially used for partial matching or fuzzy matching scenarios.
Usage: May be used in workflows where partial similarity suffices or for initial filtering.
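
One way to read the two thresholds together is as a three-band classification. The sketch below is an assumption about how they might be combined, not the project's confirmed logic: `classify_similarity`, `cosine_similarity`, and the band labels are hypothetical names, and the constants are copied locally so the block runs on its own.

```python
import math

# Values copied from config.py so the example is self-contained.
SIM_THRESHOLD = 0.7
PARTIAL_THRESHOLD = 0.5


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_similarity(score):
    """Hypothetical three-band reading of the two thresholds."""
    if score >= SIM_THRESHOLD:
        return "similar"
    if score >= PARTIAL_THRESHOLD:
        return "partial"
    return "different"
```

Keeping the comparison logic in one function means a change to either threshold in `config.py` shifts every caller at once.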

BERTSCORE_LANG

BERTSCORE_LANG = "en"

Type: str
Description: Language code passed to BERTScore, a metric that compares generated text with reference text using contextual embeddings and is often used to assess summary quality. The value `en` selects the English scoring model.
Usage: Passed to BERTScore evaluation modules to specify the language context.

EMBED_BATCH

EMBED_BATCH = 16

Type: int
Description: Defines the batch size when generating embeddings. Adjusting this affects performance and memory usage.
Usage: Used to batch code snippets for embedding generation to optimize throughput.
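
A minimal sketch of how a batch size like this is typically applied, assuming the embedding code slices its input list before each model call. The helper name `iter_batches` is hypothetical, and the constant is copied locally so the block runs on its own.

```python
# Value copied from config.py so the example is self-contained.
EMBED_BATCH = 16


def iter_batches(items, batch_size=EMBED_BATCH):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


snippets = [f"def f{i}(): pass" for i in range(40)]
batch_sizes = [len(batch) for batch in iter_batches(snippets)]
print(batch_sizes)  # [16, 16, 8]
```

The last batch is simply shorter when the input length is not a multiple of the batch size, so no padding is needed.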

SUMMARIZE_BATCH

SUMMARIZE_BATCH = 1

Type: int
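Description: Number of code units to summarize at a time. With the current value of 1, units are summarized one by one; a higher value yields parallelism only if the summarization component implements concurrent processing.
Usage: Controls the size of the summarization workload handled in one step.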
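
If the setting is interpreted as a concurrency limit, it could bound a worker pool as in the sketch below. This is an assumption about the design, not the project's confirmed implementation: `summarize_unit` is a placeholder standing in for the real model call, and `summarize_all` is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor

# Value copied from config.py so the example is self-contained.
SUMMARIZE_BATCH = 1


def summarize_unit(code):
    """Placeholder for the real model call; returns the first line of the unit."""
    stripped = code.strip()
    return stripped.splitlines()[0] if stripped else ""


def summarize_all(units, workers=SUMMARIZE_BATCH):
    """Summarize units with at most `workers` in flight; results keep input order."""
    with ThreadPoolExecutor(max_workers=max(1, workers)) as pool:
        return list(pool.map(summarize_unit, units))
```

With `SUMMARIZE_BATCH = 1` the pool has a single worker, so the units are processed sequentially, which matches the current configuration.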

TSNE_PERPLEXITY

TSNE_PERPLEXITY = 30

Type: int
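Description: Perplexity parameter for t-SNE, the dimensionality-reduction algorithm used to visualize embeddings. Perplexity roughly corresponds to the number of neighbors each point is compared against and balances local versus global structure in the projection.
Usage: Passed to the t-SNE call in the visualization code. t-SNE projects points for plotting and does not itself cluster them; the perplexity only changes how tightly groups appear in the plot.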
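
A perplexity of 30 can be too large for small inputs. If the visualization uses scikit-learn's `TSNE` (an assumption; the source does not name the library), recent versions reject a perplexity that is not smaller than the number of samples. The hypothetical helper below sketches one way to guard against that by capping the configured value.

```python
# Value copied from config.py so the example is self-contained.
TSNE_PERPLEXITY = 30


def effective_perplexity(n_samples, configured=TSNE_PERPLEXITY):
    """Return a perplexity strictly below n_samples, capped at the configured value."""
    if n_samples < 2:
        raise ValueError("t-SNE needs at least 2 samples")
    return min(configured, n_samples - 1)
```

For example, a repository with only 10 embedded code units would use a perplexity of 9, while any input with more than 30 units keeps the configured 30.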

EXT_LANG_MAP

EXT_LANG_MAP = {
    ".py": "python", ".js": "javascript", ".ts": "typescript", ".jsx": "javascript",
    ".tsx": "typescript", ".java": "java", ".rs": "rust", ".cs": "c_sharp",
    ".cpp": "cpp", ".c": "c", ".h": "c", ".html": "html", ".css": "css",
    ".go": "go", ".php": "php",
}

Type: dict[str, str]
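Description: Maps file extensions to the programming language names used in the system. This mapping drives language-specific processing, such as syntax highlighting, parsing, or model selection. Note that `.h` is mapped to `c`, so C++ headers with that extension are treated as C, and extensions not listed here have no entry.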
Usage: Used when processing files to determine the language context based on the extension.
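
As an illustration of the lookup, the sketch below tallies a list of paths by language. The function name `count_languages`, the `"unknown"` fallback label, and the lowercasing of extensions are assumptions made for this example; the mapping is copied locally so the block runs on its own.

```python
from collections import Counter
from pathlib import PurePath

# Mapping copied from config.py so the example is self-contained.
EXT_LANG_MAP = {
    ".py": "python", ".js": "javascript", ".ts": "typescript", ".jsx": "javascript",
    ".tsx": "typescript", ".java": "java", ".rs": "rust", ".cs": "c_sharp",
    ".cpp": "cpp", ".c": "c", ".h": "c", ".html": "html", ".css": "css",
    ".go": "go", ".php": "php",
}


def count_languages(paths):
    """Count files per language; unmapped or missing extensions count as 'unknown'."""
    counts = Counter()
    for path in paths:
        # Lowercase the suffix so "Main.PY" is treated like "main.py".
        ext = PurePath(path).suffix.lower()
        counts[EXT_LANG_MAP.get(ext, "unknown")] += 1
    return counts
```

Using `dict.get` with a fallback keeps unsupported files from raising `KeyError`, which lets the caller decide whether to skip or report them.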

Implementation Details

The file is purely declarative, containing only constants and a dictionary.
The constants are designed to be imported and used globally across the system to ensure configuration consistency.
The chosen values (a 4-bit quantized summarization model, an embedding batch of 16, and a summarization batch of 1) suggest a setup that favors modest memory use over maximum throughput.
The language map supports a variety of common programming languages, enabling multi-language support in the application.

Interaction With Other Parts of the System

Embedding Components: EMBED_MODEL and EMBED_BATCH influence how code snippets are converted into vector representations.
Summarization Modules: SUMMARY_MODEL and SUMMARIZE_BATCH control code summarization behavior and concurrency.
Similarity and Evaluation: SIM_THRESHOLD, PARTIAL_THRESHOLD, and BERTSCORE_LANG are used in similarity detection and evaluation metrics.
Visualization: TSNE_PERPLEXITY is used in visualization modules that reduce embedding dimensionality for display.
File Handling and Parsing: EXT_LANG_MAP is essential for correctly identifying the programming language of input files, affecting downstream parsing, analysis, and model selection.

This modular configuration allows other modules to import these constants and adapt their behavior accordingly without hardcoding values.

Example Usage

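import math
import os

from config import SIM_THRESHOLD, EXT_LANG_MAP

def compute_cosine_similarity(a, b):  # stand-in helper, not part of config.py
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))  # assumes non-zero vectors

def is_similar(embedding1, embedding2):
    # Two embeddings count as similar when their score reaches SIM_THRESHOLD.
    similarity = compute_cosine_similarity(embedding1, embedding2)
    return similarity >= SIM_THRESHOLD

def get_language_from_extension(filename):
    # Extensions missing from EXT_LANG_MAP fall back to "unknown".
    ext = os.path.splitext(filename)[1]
    return EXT_LANG_MAP.get(ext, "unknown")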

Diagram: Flowchart of Configuration Relationships

flowchart TD
    A[config.py] --> B[Embedding Components]
    A --> C[Summarization Modules]
    A --> D[Similarity & Evaluation]
    A --> E[Visualization Modules]
    A --> F[File Parsing & Language Detection]

    B --> B1[Uses EMBED_MODEL]
    B --> B2[Uses EMBED_BATCH]

    C --> C1[Uses SUMMARY_MODEL]
    C --> C2[Uses SUMMARIZE_BATCH]

    D --> D1[Uses SIM_THRESHOLD]
    D --> D2[Uses PARTIAL_THRESHOLD]
    D --> D3[Uses BERTSCORE_LANG]

    E --> E1[Uses TSNE_PERPLEXITY]

    F --> F1[Uses EXT_LANG_MAP]

Summary

The `config.py` file is a key configuration resource in a code analysis system, defining important constants for embedding generation, code summarization, similarity evaluation, batch processing, visualization, and file language mapping. It ensures parameter consistency and flexibility across the system without embedding hardcoded values in the implementation logic.