rules.json


Overview

The `rules.json` file defines Prometheus alerting and recording rules for monitoring the ShapeShift Unchained platform's Kubernetes infrastructure and API services. It specifies conditions under which alerts are triggered based on metrics collected from Kubernetes resources like StatefulSets, Deployments, and Pods, as well as API request metrics.

This file plays a critical role in the observability stack by enabling proactive detection of service downtime, pod instability, and elevated API error rates. Prometheus sends the alerts defined here to Alertmanager, whose configuration routes notifications to the appropriate channels (e.g., Discord) so that operators can respond promptly.


Structure and Functionality

The file is a JSON object containing a `"groups"` array. Each group represents a logical set of rules, organized by their scope or subsystem.

Group: general

This group defines alerting rules related to Kubernetes resource availability and pod health.

Group: api

This group focuses on API metrics and error rates.
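
To make the layout described above concrete, the following sketch parses a trimmed-down rules document and lists the rules in each group. The sample is an assumption assembled from rule names mentioned on this page, with expressions and other fields left out; `summarize_groups` is a hypothetical helper and not part of the project's tooling.

```python
import json

# Trimmed-down stand-in for rules.json (assumption: only the group and rule
# names come from this page; expressions and other fields are omitted).
SAMPLE = """
{
  "groups": [
    {"name": "general", "rules": [
      {"alert": "UnchainedStatefulSetDown"},
      {"alert": "UnchainedDeploymentDown"},
      {"alert": "UnchainedHighPodRestartCount"}
    ]},
    {"name": "api", "rules": [
      {"record": "namespace_coinstack:unchained_http_request_count:sum_rate"},
      {"alert": "UnchainedHigh5xxApiErrorRate"}
    ]}
  ]
}
"""


def summarize_groups(doc):
    """Map each group name to a list of (kind, rule_name) pairs."""
    summary = {}
    for group in doc["groups"]:
        entries = []
        for rule in group["rules"]:
            # Alerting rules carry an "alert" key, recording rules a "record" key.
            kind = "alert" if "alert" in rule else "record"
            entries.append((kind, rule[kind]))
        summary[group["name"]] = entries
    return summary


if __name__ == "__main__":
    for name, rules in summarize_groups(json.loads(SAMPLE)).items():
        print(name, rules)
```

A helper of this kind can serve as a quick sanity check that every rule sits in the intended group before the file is deployed.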


Detailed Explanation of Rules

Alerting Rules

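Each alerting rule has the following components: `alert` (the alert name), `expr` (the PromQL condition), `for` (how long the condition must hold before the alert fires), `labels` (such as `severity`), and `annotations` (the human-readable `summary` and `description`).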

Example: UnchainedStatefulSetDown

{
  "alert": "UnchainedStatefulSetDown",
  "expr": "kube_statefulset_status_replicas_available == 0",
  "for": "15m",
  "labels": {
    "severity": "critical"
  },
  "annotations": {
    "summary": "Unchained stateful set is currently down",
    "description": "Service {{ $labels.statefulset }} has been down for more than 15 minutes"
  }
}

This alert notifies operators when a StatefulSet has reported zero available replicas for 15 consecutive minutes. The `description` annotation uses the `statefulset` label to name the affected service.
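
The `{{ $labels.statefulset }}` placeholder in the description is filled in from the labels of the series that triggered the alert. Prometheus does this with Go templating; the sketch below only imitates that one substitution with a regular expression, so `render_annotation` is a hypothetical stand-in rather than how Prometheus is implemented.

```python
import re

# Matches placeholders of the form {{ $labels.<name> }}.
LABEL_PLACEHOLDER = re.compile(r"\{\{\s*\$labels\.(\w+)\s*\}\}")


def render_annotation(template, labels):
    """Replace each {{ $labels.<name> }} placeholder with the label's value."""
    # Labels missing from the mapping render as an empty string in this sketch.
    return LABEL_PLACEHOLDER.sub(lambda m: labels.get(m.group(1), ""), template)


if __name__ == "__main__":
    description = "Service {{ $labels.statefulset }} has been down for more than 15 minutes"
    print(render_annotation(description, {"statefulset": "unchained-indexer"}))
```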

Recording Rules

Recording rules precompute frequently used PromQL expressions to optimize and simplify queries in dashboards or alert rules.

Example: namespace_coinstack:unchained_http_request_count:sum_rate
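
The rule's actual expression is not reproduced on this page. As a rough illustration of what a `sum_rate` recording rule precomputes, the sketch below derives a per-second rate for each counter series and sums the rates per group key. Grouping by namespace and coinstack is an assumption suggested by the rule name, the sample values and coinstack names are made up, and the rate calculation is deliberately simplified: unlike PromQL's `rate()`, it ignores counter resets and extrapolation.

```python
def per_second_rate(samples):
    """Average per-second increase between the first and last (timestamp, value) sample."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need at least two samples at increasing timestamps")
    return (v1 - v0) / (t1 - t0)


def sum_rate(series):
    """Sum the per-second rates of all series that share a group key."""
    totals = {}
    for key, samples in series:
        totals[key] = totals.get(key, 0.0) + per_second_rate(samples)
    return totals


if __name__ == "__main__":
    # Hypothetical request counters, keyed by (namespace, coinstack), over 5 minutes.
    series = [
        (("unchained", "bitcoin"), [(0, 100), (300, 400)]),   # 1.0 req/s
        (("unchained", "bitcoin"), [(0, 50), (300, 200)]),    # 0.5 req/s
        (("unchained", "ethereum"), [(0, 0), (300, 600)]),    # 2.0 req/s
    ]
    print(sum_rate(series))
```

Storing the summed result under a single name means dashboards and alert expressions can read one precomputed series instead of re-evaluating the aggregation on every query.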


Implementation Details and Algorithms


Integration and Interaction


Usage Example

Suppose a StatefulSet named `unchained-indexer` in the `unchained` namespace goes down (zero replicas) for more than 15 minutes:

  1. Prometheus evaluates kube_statefulset_status_replicas_available == 0 and finds it true for unchained-indexer.

  2. The alert enters the pending state. Once the condition has held continuously for 15 minutes, the UnchainedStatefulSetDown alert fires.

  3. Alertmanager receives the alert and routes it to the critical notification channel (e.g., Discord).

  4. Operators receive a message:
    Summary: Unchained stateful set is currently down
    Description: Service unchained-indexer has been down for more than 15 minutes

This enables prompt investigation and remediation.
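
The timing in steps 1 and 2 can be sketched as a small state machine. This is a simplified model of the `for` clause, not Prometheus's implementation: `alert_states` is a hypothetical helper, and the 5-minute evaluation interval in the example is an assumption chosen for readability.

```python
FOR_SECONDS = 15 * 60  # mirrors the rule's "for": "15m"


def alert_states(evaluations, for_seconds=FOR_SECONDS):
    """Return the alert state after each (timestamp, condition_true) evaluation."""
    states = []
    active_since = None  # when the condition first became true in the current run
    for ts, condition_true in evaluations:
        if not condition_true:
            # Any false evaluation resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts
        elapsed = ts - active_since
        states.append("firing" if elapsed >= for_seconds else "pending")
    return states


if __name__ == "__main__":
    # Hypothetical evaluations: replicas drop to zero at t=300s and stay there.
    evaluations = [(0, False), (300, True), (600, True), (900, True), (1200, True)]
    print(alert_states(evaluations))
```

Because a single false evaluation resets the timer, a StatefulSet that briefly recovers and then fails again starts a fresh 15-minute window.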


Mermaid Diagram: Rule Groups and Their Rules

flowchart TD
    subgraph General["General Rules"]
        GS1[UnchainedStatefulSetDown]
        GS2[UnchainedDeploymentDown]
        GS3[UnchainedHighPodRestartCount]
    end

    subgraph API["API Rules"]
        API1["Record: namespace_coinstack:unchained_http_request_count:sum_rate"]
        API2["Record: namespace_coinstack:unchained_http_request_count_5xx:sum_rate"]
        API3["Alert: UnchainedHigh5xxApiErrorRate"]
    end

    GS1 -->|Alert on| kube_statefulset_status_replicas_available
    GS2 -->|Alert on| kube_deployment_status_replicas_available
    GS3 -->|Alert on| kube_pod_container_status_restarts_total

    API1 -->|Record sum rate| M1["unchained_http_request_count"]
    API2 -->|Record 5xx rate| M2["unchained_http_request_count{statusCode=~#quot;5.*#quot;}"]
    API3 -->|Alert on error rate| API1 & API2

    style General fill:#f9f,stroke:#333,stroke-width:1px
    style API fill:#bbf,stroke:#333,stroke-width:1px

Summary


For further details on how these rules fit into the overall alerting and notification architecture, see the related Alertmanager configuration and Discord notification templates.