Alerting Configuration

Purpose

This section covers how automated alerts are set up and managed for the ShapeShift Unchained platform’s infrastructure and services. The goal is to detect critical and warning conditions affecting blockchain node daemons, indexers, API services, and Kubernetes resources, and to notify operators promptly. Alertmanager is integrated with Discord so that alerts are grouped, routed, and delivered to the appropriate channel, which supports fast incident response.

Functionality

This configuration defines how Prometheus alerts are handled, grouped, inhibited, and routed to different Discord channels based on severity and environment. The key functionalities include:

Alert Routing: Alerts are classified and routed to receivers corresponding to critical, warning, or development-level notifications. For example, critical issues in the production namespace are sent to the discord_critical receiver, while warnings are sent to discord_warning.
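Grouping and Deduplication: Alerts that share the grouping labels (such as alertname and namespace) are batched into a single notification to avoid flooding. Three settings control timing: group_wait is how long Alertmanager waits before sending the first notification for a new group, group_interval is the minimum time before it notifies about changes to an existing group, and repeat_interval is how long it waits before re-sending a notification for alerts that are still firing.
Inhibition Rules: Certain alerts suppress others to reduce noise. For instance, a firing critical alert inhibits warning and info alerts that share its namespace and alert name, preventing redundant notifications.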
Discord Notification Templates: Custom templates format alert messages for Discord with clear titles and detailed descriptions including severity, summary, and labels to provide context.
Environment-Specific Handling: Separate routes and receivers handle alerts for production (unchained) and development (unchained-dev) namespaces with distinct timing and grouping parameters.
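
The inhibition behaviour described above could be expressed along these lines. This is a minimal sketch using Alertmanager's `inhibit_rules` syntax; the exact matchers and `equal` labels are inferred from the description and are assumptions, not a copy of the project's `config.yaml`.

```yaml
# Hypothetical sketch: the real inhibit rules in config.yaml may differ.
inhibit_rules:
  # While a critical alert is firing, mute warning and info alerts...
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity =~ "warning|info"
    # ...but only when source and target carry identical values
    # for every label listed here.
    equal: ["alertname", "namespace"]
```

The `equal` list is what keeps a critical alert in one namespace from silencing unrelated warnings elsewhere.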

Key Configuration Elements

Route Configuration:
Routes specify matching criteria for alerts (e.g., alert name, namespace, severity) and assign them to specific Discord receivers. Each route also defines grouping and timing parameters to control notification behavior.
Receivers:
Each receiver posts notifications to a Discord webhook URL. Separate receivers exist for critical, warning, and development alerts.
Inhibit Rules:
Inhibit rules suppress lower-severity alerts while a higher-severity alert is active for the same issue, which prevents alert storms.
Templates:
The discord.tmpl file uses Go templating to generate human-readable messages for Discord, enhancing alert clarity.
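
To show how receivers and templates might fit together, here is a sketch that assumes Alertmanager's native `discord_configs` receiver type (available in recent Alertmanager releases). The template path and the webhook URLs are placeholders, and the real `config.yaml` may wire Discord differently.

```yaml
# Sketch only: the path and webhook URLs below are placeholders.
templates:
  - "/etc/alertmanager/templates/discord.tmpl"   # assumed location

receivers:
  - name: "discord_critical"
    discord_configs:
      - webhook_url: "https://discord.com/api/webhooks/<id>/<token>"
        # Render the named templates defined in discord.tmpl.
        title: '{{ template "discord.title" . }}'
        message: '{{ template "discord.message" . }}'
  - name: "discord_warning"
    discord_configs:
      - webhook_url: "https://discord.com/api/webhooks/<id>/<token>"
        title: '{{ template "discord.title" . }}'
        message: '{{ template "discord.message" . }}'
```

Keeping one receiver per channel lets the routing tree pick a destination by name without repeating webhook details in every route.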

Integration

This alerting configuration complements the overall Prometheus & Grafana monitoring setup by handling the notification aspect of observability. While Prometheus collects metrics and evaluates alerting rules (defined in `rules.json`), Alertmanager processes these alerts with this configuration to ensure the right stakeholders are informed.

It integrates tightly with:

Metrics Collection: Alerts trigger based on Prometheus metrics collected from Kubernetes resources and blockchain services. For example, alerts like UnchainedStatefulSetDown rely on Kubernetes replica availability metrics.
Prometheus Alerting Rules: The alert definitions in rules.json generate alert events that Alertmanager routes according to this configuration.
Discord Channels: Using webhook URLs, notifications are sent directly to Discord, facilitating team awareness and quick response without requiring direct access to the monitoring system.

This setup enhances operational visibility by bridging metric collection with actionable notifications, ensuring alerts are meaningful, timely, and directed to appropriate teams.

Code Snippets Illustrating Core Interactions

Alert Routing Example (from `config.yaml`)

routes:
  - receiver: "discord_critical"
    group_by: ["alertname", "namespace", "statefulset"]
    group_wait: 5m
    group_interval: 30m
    repeat_interval: 1h
    matchers:
      - alertname = "UnchainedStatefulSetDown"
      - namespace = "unchained"
      - severity = "critical"

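This route sends critical `UnchainedStatefulSetDown` alerts from the production namespace to the `discord_critical` receiver, grouping them by alert name, namespace, and StatefulSet. Alertmanager waits 5 minutes before the first notification for a new group, waits at least 30 minutes before notifying about changes to that group, and repeats the notification every hour while the alert keeps firing.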
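
For context, a route like this sits inside a routing tree under the top-level `route` key. The sketch below illustrates how production warnings and development alerts might be separated. The receiver name `discord_dev`, the default receiver, and every timing value are hypothetical and chosen only for illustration.

```yaml
# Illustrative routing tree: names and timings are assumptions.
route:
  receiver: "discord_warning"        # assumed fallback receiver
  group_by: ["alertname", "namespace"]
  routes:
    # Production warnings.
    - receiver: "discord_warning"
      matchers:
        - namespace = "unchained"
        - severity = "warning"
    # Everything from the development namespace.
    - receiver: "discord_dev"        # hypothetical receiver name
      group_wait: 10m                # illustrative values only
      group_interval: 1h
      repeat_interval: 12h
      matchers:
        - namespace = "unchained-dev"
```

Child routes are evaluated in order and the first match wins unless a route sets `continue: true`, so more specific routes belong before broader ones.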

Discord Notification Template (from `discord.tmpl`)

{{ define "discord.title" }}
Unchained Alert {{ .Status | title }}: {{ .GroupLabels.alertname }}
{{ end }}

{{ define "discord.message" }}
{{ range .Alerts }}
**{{ .Labels.severity | toUpper }}**

**Alert:** {{ .Annotations.summary }}
**Description:** {{ .Annotations.description }}

**Details:**
{{ range .Labels.SortedPairs }}- {{ .Name }}: {{ .Value }}
{{ end }}
{{ end }}
{{ end }}

This template formats alert titles and messages sent to Discord channels, clearly showing status, severity, summary, description, and key labels for context.

Alert Rule Example (from `rules.json`)

{
  "alert": "UnchainedStatefulSetDown",
  "annotations": {
    "summary": "Unchained stateful set is currently down",
    "description": "Service {{ $labels.statefulset }} has been down for more than 15 minutes"
  },
  "expr": "kube_statefulset_status_replicas_available == 0",
  "for": "15m",
  "labels": {
    "severity": "critical"
  }
}

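This Prometheus alert rule fires a critical alert when a Kubernetes StatefulSet has had zero available replicas continuously for 15 minutes. As written, the expression has no namespace selector, so it matches StatefulSets in any namespace; the namespace label on the resulting alert is what the Alertmanager routes use to tell production from development.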
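
Prometheus rule files are organised as named groups of rules. As a rough illustration of where the fragment above would sit, here is the same rule in the standard YAML rule-file layout. The group name is hypothetical, and the project itself keeps its rules in `rules.json`, whose surrounding structure is not shown in this section.

```yaml
groups:
  - name: "unchained"   # hypothetical group name
    rules:
      - alert: UnchainedStatefulSetDown
        # Fires once the condition has held for the full 'for' duration.
        expr: kube_statefulset_status_replicas_available == 0
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Unchained stateful set is currently down"
          description: "Service {{ $labels.statefulset }} has been down for more than 15 minutes"
```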

Diagram

flowchart TD
    Prometheus[Prometheus] -->|Evaluates Alert Rules| Alertmanager[Alertmanager]
    Alertmanager -->|Applies Routing & Inhibition| Router[Routing Logic]
    Router -->|Sends Notification| DiscordCritical[Discord Critical Channel]
    Router -->|Sends Notification| DiscordWarning[Discord Warning Channel]
    Router -->|Sends Notification| DiscordDev[Discord Dev Channel]
    Alertmanager -->|Uses Templates| TemplateEngine[Message Templates]
    Prometheus -->|Scrapes Metrics| Kubernetes[Kubernetes & Services]

    classDef prod fill:#f96,stroke:#333,stroke-width:1px;
    classDef dev fill:#bbf,stroke:#333,stroke-width:1px;

    DiscordCritical:::prod
    DiscordWarning:::prod
    DiscordDev:::dev

This flowchart illustrates how Prometheus sends alerts to Alertmanager, which applies routing and inhibition rules, formats messages with templates, and dispatches notifications to appropriate Discord channels based on alert severity and environment.