rules.json

Overview

The `rules.json` file defines Prometheus alerting and recording rules for monitoring the ShapeShift Unchained platform's Kubernetes infrastructure and API services. It specifies conditions under which alerts are triggered based on metrics collected from Kubernetes resources like StatefulSets, Deployments, and Pods, as well as API request metrics.

This file plays a critical role in the observability stack by enabling proactive detection of service downtime, pod instability, and elevated API error rates. Alerts generated here are sent to Alertmanager, whose configuration routes notifications to the appropriate channels (e.g., Discord) so that operators can respond to incidents promptly.

Structure and Functionality

The file is a JSON object containing a `"groups"` array. Each group represents a logical set of rules, organized by their scope or subsystem.

groups: An array of rule groups.
- Each group has a "name" (string) and a "rules" array.
- Each rule is either an alerting rule or a recording rule.
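
As an illustration of this layout, the following minimal sketch walks a trimmed-down document and classifies each rule. The sample JSON is hypothetical and much shorter than the real file. The sketch assumes that recording rules carry a `record` key, which is the standard Prometheus convention but is not shown in this document.

```python
import json

# Hypothetical, trimmed-down document in the shape described above;
# the real rules.json contains more rules and fields.
SAMPLE = """
{
  "groups": [
    {
      "name": "general",
      "rules": [
        {"alert": "UnchainedStatefulSetDown",
         "expr": "kube_statefulset_status_replicas_available == 0",
         "for": "15m",
         "labels": {"severity": "critical"}}
      ]
    },
    {
      "name": "api",
      "rules": [
        {"record": "namespace_coinstack:unchained_http_request_count:sum_rate",
         "expr": "sum(rate(unchained_http_request_count[5m])) by (namespace, coinstack)"}
      ]
    }
  ]
}
"""


def classify_rules(document):
    """Return (group name, rule kind, rule name) for every rule in the document."""
    result = []
    for group in document["groups"]:
        for rule in group["rules"]:
            if "alert" in rule:
                result.append((group["name"], "alert", rule["alert"]))
            elif "record" in rule:  # assumed key for recording rules
                result.append((group["name"], "record", rule["record"]))
            else:
                raise ValueError(
                    f"rule in group {group['name']!r} has neither 'alert' nor 'record'"
                )
    return result


rules = classify_rules(json.loads(SAMPLE))
for group_name, kind, name in rules:
    print(f"{group_name}: {kind} {name}")
```

A check like this can be run before deployment to catch rules that are missing both keys.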

Group: `general`

This group defines alerting rules related to Kubernetes resource availability and pod health.

Alert: UnchainedStatefulSetDown
- Triggered when any stateful set has zero available replicas for at least 15 minutes.
- Expression: kube_statefulset_status_replicas_available == 0
- Severity: critical
- Annotations provide summary and description templates using label placeholders.
Alert: UnchainedDeploymentDown
- Triggered when any deployment has zero available replicas for at least 15 minutes.
- Expression: kube_deployment_status_replicas_available == 0
- Severity: critical
Alert: UnchainedHighPodRestartCount
- Triggered when any pod container restarts 5 or more times within the last 15 minutes.
- Expression: increase(kube_pod_container_status_restarts_total[15m]) >= 5
- Severity: warning
- Fires if condition persists for 1 minute.

Group: `api`

This group focuses on API metrics and error rates.

Recording Rule: namespace_coinstack:unchained_http_request_count:sum_rate
- Records the per-second HTTP request rate, averaged over the last 5 minutes and summed by Kubernetes namespace and coinstack label.
- Expression: sum(rate(unchained_http_request_count[5m])) by (namespace, coinstack)
Recording Rule: namespace_coinstack:unchained_http_request_count_5xx:sum_rate
- Records the per-second rate of HTTP 5xx responses, averaged over the last 5 minutes and summed by namespace and coinstack.
- Expression: sum(rate(unchained_http_request_count{statusCode=~"5.*"}[5m])) by (namespace, coinstack)
Alert: UnchainedHigh5xxApiErrorRate
- Fires when the 5xx error rate exceeds 1% over 15 minutes.
- Expression calculates the ratio of 5xx error rate over total request rate multiplied by 100 (percentage).
- Severity: warning
- Annotations include a summary and description using label and value placeholders.
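
The percentage calculation behind UnchainedHigh5xxApiErrorRate can be sketched in plain Python. This is an illustration of the arithmetic only, not the rule's actual PromQL expression, which is not reproduced in this document. The namespace, coinstack names, and rate values below are hypothetical.

```python
from typing import Optional

THRESHOLD_PERCENT = 1.0  # from the rule description: fire above 1%


def error_percentage(rate_5xx: float, rate_total: float) -> Optional[float]:
    """Share of requests answered with 5xx, as a percentage; None without traffic."""
    if rate_total == 0:
        # In PromQL, 0/0 yields NaN, which never satisfies a threshold comparison.
        return None
    return rate_5xx / rate_total * 100


# Hypothetical recorded values in requests per second, keyed by (namespace, coinstack).
total_rate = {("unchained", "bitcoin"): 200.0, ("unchained", "ethereum"): 50.0}
error_rate = {("unchained", "bitcoin"): 1.0, ("unchained", "ethereum"): 2.0}

breaching = []
for key, total in total_rate.items():
    # A missing 5xx series is treated as zero errors here; either way it cannot breach.
    pct = error_percentage(error_rate.get(key, 0.0), total)
    if pct is not None and pct > THRESHOLD_PERCENT:
        breaching.append(key)

print(breaching)
```

With these sample values, only the second coinstack is above the threshold (4% versus 0.5%).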

Detailed Explanation of Rules

Alerting Rules

Each alerting rule has the following components:

alert: The alert name.
expr: PromQL expression, evaluated once per rule evaluation interval.
for: Duration for which the expression must be true before firing.
labels: Key-value pairs assigning metadata such as severity.
annotations: Human-readable information for alert summary and description, using Go templating syntax for dynamic label substitution.

Example: `UnchainedStatefulSetDown`

Purpose: Detect when a Kubernetes StatefulSet has zero available replicas, indicating a service outage.
Expression: Checks if the metric kube_statefulset_status_replicas_available equals 0.
Duration: 15 minutes sustained condition.
Severity: Critical.
Usage:

{
  "alert": "UnchainedStatefulSetDown",
  "expr": "kube_statefulset_status_replicas_available == 0",
  "for": "15m",
  "labels": {
    "severity": "critical"
  },
  "annotations": {
    "summary": "Unchained stateful set is currently down",
    "description": "Service {{ $labels.statefulset }} has been down for more than 15 minutes"
  }
}

This alert enables operators to be notified if a critical stateful service in Kubernetes becomes unavailable.
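
To show how the `{{ $labels.statefulset }}` placeholder resolves, here is a minimal sketch that mimics the label lookup with a regular expression. Prometheus itself renders annotations with Go templates, so this is a simplified stand-in that only handles `$labels` references. The label values are hypothetical.

```python
import re

# Matches {{ $labels.<name> }} with optional whitespace inside the braces.
_LABEL_REF = re.compile(r"\{\{\s*\$labels\.(\w+)\s*\}\}")


def render_annotation(template: str, labels: dict) -> str:
    """Replace each {{ $labels.<name> }} with the label's value.

    Simplification: a label that is not present becomes an empty string.
    """
    return _LABEL_REF.sub(lambda m: labels.get(m.group(1), ""), template)


description = render_annotation(
    "Service {{ $labels.statefulset }} has been down for more than 15 minutes",
    {"statefulset": "unchained-indexer", "namespace": "unchained"},  # hypothetical labels
)
print(description)
```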

Recording Rules

Recording rules precompute frequently used PromQL expressions to optimize and simplify queries in dashboards or alert rules.

Example: `namespace_coinstack:unchained_http_request_count:sum_rate`

Purpose: Aggregate rate of HTTP requests per namespace and coinstack over 5 minutes.
Expression: sum(rate(unchained_http_request_count[5m])) by (namespace, coinstack)
Usage: The recorded series can be reused in alert expressions, for example as the denominator when calculating the 5xx error rate.
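
The aggregation performed by this recording rule can be approximated outside Prometheus as follows. This is a simplified sketch: `simple_rate` takes the difference between the first and last sample in the window and, unlike PromQL's `rate()`, ignores counter resets and extrapolation. The series, the `pod` label, and the sample values are hypothetical.

```python
from collections import defaultdict


def simple_rate(samples):
    """Per-second increase between the first and last (timestamp, value) sample.

    Simplification: no counter-reset handling and no extrapolation.
    """
    if len(samples) < 2:
        return None
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def sum_rate_by(series, by):
    """Sum per-series rates, grouped by the given label names."""
    totals = defaultdict(float)
    for labels, samples in series:
        rate = simple_rate(samples)
        if rate is None:
            continue  # not enough samples in the window
        key = tuple(labels.get(name, "") for name in by)
        totals[key] += rate
    return dict(totals)


# Hypothetical counter samples over a 300-second (5-minute) window.
series = [
    ({"namespace": "unchained", "coinstack": "bitcoin", "pod": "api-0"}, [(0, 0), (300, 600)]),
    ({"namespace": "unchained", "coinstack": "bitcoin", "pod": "api-1"}, [(0, 100), (300, 400)]),
    ({"namespace": "unchained", "coinstack": "ethereum", "pod": "api-0"}, [(0, 0), (300, 150)]),
]

result = sum_rate_by(series, by=("namespace", "coinstack"))
print(result)
```

The two pods of the first coinstack collapse into one value, which is the effect of `by (namespace, coinstack)` in the rule.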

Implementation Details and Algorithms

PromQL Expressions: The file relies heavily on PromQL to evaluate metrics exposed by Kubernetes exporters (kube-state-metrics) and application metrics (unchained_http_request_count).
Alerting Logic:
- Alerts use threshold-based conditions (e.g., zero replicas, restart counts, error rates).
- The for field reduces alert flapping: the condition must hold at every evaluation for the specified duration before the alert moves from pending to firing.
- Label placeholders in annotations allow dynamic insertion of relevant entity names (e.g., pod, deployment, coinstack).
Aggregation by Labels:
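- Recording rules aggregate by namespace and coinstack, while the Kubernetes alerts keep the labels of the underlying series (such as namespace, pod, deployment, and statefulset), so each alert is scoped to a single resource.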
- This granularity supports targeted notifications and operational troubleshooting.
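
The effect of the for field can be illustrated with a small state simulation. This is a sketch of the pending-to-firing behavior only. The 5-minute evaluation interval is an assumption chosen for readability, not a value taken from the actual configuration.

```python
def alert_states(condition_by_eval, interval_s, for_s):
    """Return the alert state after each evaluation.

    condition_by_eval: whether the expression matched at each evaluation.
    interval_s: seconds between evaluations (assumed value, see above).
    for_s: the rule's 'for' duration in seconds.
    """
    states = []
    active_since = None
    for i, matched in enumerate(condition_by_eval):
        now = i * interval_s
        if not matched:
            active_since = None  # any gap resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


# A 15-minute 'for' with evaluations every 5 minutes; one gap at the third evaluation.
states = alert_states(
    [True, True, False, True, True, True, True], interval_s=300, for_s=900
)
print(states)
```

The single non-matching evaluation resets the timer, so the alert only fires after the condition has held for a further full 15 minutes.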

Integration and Interaction

With Prometheus:
This file is loaded into Prometheus as part of its alerting rules configuration. Prometheus periodically evaluates these rules against ingested metrics.
With Alertmanager:
Alerts fired by Prometheus based on these rules are sent to Alertmanager, which applies routing, grouping, inhibition, and notification policies (e.g., sending alerts to Discord channels).
With Kubernetes:
Metrics used in the rules come from Kubernetes resource exporters such as kube-state-metrics, giving near-real-time visibility (limited by how often metrics are scraped) into the health of stateful sets, deployments, and pods.
With API Metrics:
Custom application metrics (unchained_http_request_count) are used to monitor API reliability and error rates per namespace and coinstack.

Usage Example

Suppose a StatefulSet named `unchained-indexer` in the `unchained` namespace goes down (zero replicas) for more than 15 minutes:

- Prometheus evaluates kube_statefulset_status_replicas_available == 0 and finds it true for unchained-indexer.
- After 15 minutes, the UnchainedStatefulSetDown alert fires.
- Alertmanager receives the alert and routes it to the critical notification channel (e.g., Discord).
- Operators receive a message:
  - Summary: Unchained stateful set is currently down
  - Description: Service unchained-indexer has been down for more than 15 minutes

This enables prompt investigation and remediation.

Mermaid Diagram: Rule Groups and Their Rules

flowchart TD
    subgraph General["General Rules"]
        GS1[UnchainedStatefulSetDown]
        GS2[UnchainedDeploymentDown]
        GS3[UnchainedHighPodRestartCount]
    end

    subgraph API["API Rules"]
        API1["Record: namespace_coinstack:unchained_http_request_count:sum_rate"]
        API2["Record: namespace_coinstack:unchained_http_request_count_5xx:sum_rate"]
        API3["Alert: UnchainedHigh5xxApiErrorRate"]
    end

    GS1 -->|Alert on| kube_statefulset_status_replicas_available
    GS2 -->|Alert on| kube_deployment_status_replicas_available
    GS3 -->|Alert on| kube_pod_container_status_restarts_total

    API1 -->|Record sum rate| unchained_http_request_count
    API2 -->|Record 5xx rate| M5xx["unchained_http_request_count{statusCode=~#quot;5.*#quot;}"]
    API3 -->|Alert on error rate| API1 & API2

    style General fill:#f9f,stroke:#333,stroke-width:1px
    style API fill:#bbf,stroke:#333,stroke-width:1px

Summary

Purpose: Defines Prometheus alerting and recording rules to monitor Kubernetes resources and API error rates in the Unchained platform.
Key Alerts: Detect service unavailability (UnchainedStatefulSetDown, UnchainedDeploymentDown), pod instability (UnchainedHighPodRestartCount), and API reliability issues (UnchainedHigh5xxApiErrorRate).
Function: Uses PromQL expressions to evaluate live metrics, triggering alerts after sustained conditions.
Integration: Part of the monitoring stack feeding into Alertmanager for notification routing.
Benefits: Enables early detection of infrastructure and service issues, improving system reliability and operational awareness.

For further details on how these rules fit into the overall alerting and notification architecture, see the related Alertmanager configuration and Discord notification templates.