rules.json
Overview
The `rules.json` file defines Prometheus alerting and recording rules for monitoring the ShapeShift Unchained platform's Kubernetes infrastructure and API services. It specifies conditions under which alerts are triggered based on metrics collected from Kubernetes resources like StatefulSets, Deployments, and Pods, as well as API request metrics.
This file plays a critical role in the observability stack by enabling proactive detection of service downtime, pod instability, and elevated API error rates. Alerts generated here are consumed by the Alertmanager configuration, which routes notifications to appropriate channels (e.g., Discord) to inform operators for timely incident response.
Structure and Functionality
The file is a JSON object containing a `"groups"` array. Each group represents a logical set of rules, organized by their scope or subsystem.
groups: An array of rule groups.
Each group has a `"name"` (string) and a `"rules"` array.
Each rule is either an alerting rule or a recording rule.
Group: general
This group defines alerting rules related to Kubernetes resource availability and pod health.
Alert: `UnchainedStatefulSetDown`
Triggered when any stateful set has zero available replicas for at least 15 minutes.
Expression: `kube_statefulset_status_replicas_available == 0`
Severity: critical
Annotations provide summary and description templates using label placeholders.
Alert: `UnchainedDeploymentDown`
Triggered when any deployment has zero available replicas for at least 15 minutes.
Expression: `kube_deployment_status_replicas_available == 0`
Severity: critical
Alert: `UnchainedHighPodRestartCount`
Triggered when any pod container restarts 5 or more times within the last 15 minutes.
Expression: `increase(kube_pod_container_status_restarts_total[15m]) >= 5`
Severity: warning
Fires if the condition persists for 1 minute.
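As a rough illustration of what the restart-count condition measures, the following sketch computes the increase of a counter across a window of samples and compares it to the threshold of 5. It is a simplification and not how Prometheus implements `increase()`: Prometheus also extrapolates to the window boundaries, which this sketch omits. The function names and sample values are hypothetical.

```python
from typing import Sequence


def counter_increase(samples: Sequence[float]) -> float:
    """Sum the increase between consecutive samples of a counter.

    A drop between two samples is treated as a counter reset, so the
    value after the reset is counted in full. No extrapolation is done
    (assumption: simplified compared to Prometheus).
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


def restart_condition(samples: Sequence[float], threshold: float = 5.0) -> bool:
    """True when the counter grew by at least `threshold` within the window."""
    return counter_increase(samples) >= threshold


# Hypothetical restart counter samples scraped during one 15-minute window.
window = [2, 3, 5, 7]
print(counter_increase(window), restart_condition(window))
```

In this sketch a container whose restart counter moves from 2 to 7 inside the window satisfies the condition, while one that moves from 0 to 2 does not.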
Group: api
This group focuses on API metrics and error rates.
Recording Rule: `namespace_coinstack:unchained_http_request_count:sum_rate`
Records the sum rate of HTTP requests over the last 5 minutes, grouped by Kubernetes namespace and `coinstack` label.
Expression: `sum(rate(unchained_http_request_count[5m])) by (namespace, coinstack)`
Recording Rule: `namespace_coinstack:unchained_http_request_count_5xx:sum_rate`
Records the sum rate of HTTP 5xx errors over the last 5 minutes, grouped by namespace and coinstack.
Expression: `sum(rate(unchained_http_request_count{statusCode=~"5.*"}[5m])) by (namespace, coinstack)`
Alert: `UnchainedHigh5xxApiErrorRate`
Fires when the 5xx error rate exceeds 1% over 15 minutes.
The expression divides the 5xx request rate by the total request rate and multiplies the result by 100 to yield a percentage.
Severity: warning
Annotations include a summary and description using label and value placeholders.
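The percentage calculation behind the 5xx alert can be sketched in a few lines. This is an illustrative model of the arithmetic only, not the rule's actual PromQL. The function names are hypothetical, and returning `None` for zero traffic is an assumption chosen to mirror a query that yields no usable value.

```python
from typing import Optional


def error_rate_percent(rate_5xx: float, rate_total: float) -> Optional[float]:
    """Return the 5xx share of all requests as a percentage.

    Returns None when there is no traffic, since the ratio is undefined.
    """
    if rate_total <= 0:
        return None
    return 100.0 * rate_5xx / rate_total


def exceeds_threshold(rate_5xx: float, rate_total: float,
                      threshold_percent: float = 1.0) -> bool:
    """True when the error percentage is strictly above the threshold."""
    pct = error_rate_percent(rate_5xx, rate_total)
    return pct is not None and pct > threshold_percent


# Hypothetical rates in requests per second for one (namespace, coinstack) pair.
print(error_rate_percent(0.5, 20.0), exceeds_threshold(0.5, 20.0))
```

With 0.5 failing requests per second out of 20 in total, the ratio is 2.5%, which is above the 1% threshold.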
Detailed Explanation of Rules
Alerting Rules
Each alerting rule has the following components:
alert: The alert name.
expr: PromQL expression evaluated at every rule evaluation interval.
for: Duration for which the expression must be true before firing.
labels: Key-value pairs assigning metadata such as severity.
annotations: Human-readable information for alert summary and description, using Go templating syntax for dynamic label substitution.
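The component list above can be turned into a small structural check. The sketch below is a hypothetical helper, not part of the platform's tooling. It assumes that recording rules use the standard Prometheus `record` key (the source does not show one) and that only `alert` and `expr` are mandatory for an alerting rule.

```python
from typing import Any, Dict, List, Tuple

REQUIRED_FIELDS = ("alert", "expr")
OPTIONAL_FIELDS = ("for", "labels", "annotations")


def rule_kind(rule: Dict[str, Any]) -> str:
    """Classify a rule by its identifying key."""
    if "alert" in rule:
        return "alerting"
    if "record" in rule:  # assumption: standard Prometheus key name
        return "recording"
    return "unknown"


def check_alerting_rule(rule: Dict[str, Any]) -> Tuple[List[str], List[str]]:
    """Return (missing required fields, absent optional fields)."""
    missing = [k for k in REQUIRED_FIELDS if k not in rule]
    absent = [k for k in OPTIONAL_FIELDS if k not in rule]
    return missing, absent


# Rule content taken from the UnchainedStatefulSetDown example in this document.
rule = {
    "alert": "UnchainedStatefulSetDown",
    "expr": "kube_statefulset_status_replicas_available == 0",
    "for": "15m",
    "labels": {"severity": "critical"},
    "annotations": {"summary": "Unchained stateful set is currently down"},
}
print(rule_kind(rule), check_alerting_rule(rule))
```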
Example: UnchainedStatefulSetDown
Purpose: Detect when a Kubernetes StatefulSet has zero available replicas, indicating a service outage.
Expression: Checks if the metric `kube_statefulset_status_replicas_available` equals 0.
Duration: The condition must be sustained for 15 minutes.
Severity: Critical.
Usage:
```json
{
  "alert": "UnchainedStatefulSetDown",
  "expr": "kube_statefulset_status_replicas_available == 0",
  "for": "15m",
  "labels": {
    "severity": "critical"
  },
  "annotations": {
    "summary": "Unchained stateful set is currently down",
    "description": "Service {{ $labels.statefulset }} has been down for more than 15 minutes"
  }
}
```
This alert enables operators to be notified if a critical stateful service in Kubernetes becomes unavailable.
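To show how a label placeholder in an annotation resolves, here is a minimal stand-in for the substitution step. It handles only the `{{ $labels.<name> }}` form with a regular expression and is not the Go template engine that Prometheus uses. The function name is hypothetical, and rendering unknown labels as empty strings is a choice made for this sketch.

```python
import re
from typing import Mapping

# Matches placeholders such as "{{ $labels.statefulset }}".
_LABEL_REF = re.compile(r"\{\{\s*\$labels\.(\w+)\s*\}\}")


def render_annotation(template: str, labels: Mapping[str, str]) -> str:
    """Replace each label placeholder with the matching label value."""
    return _LABEL_REF.sub(lambda m: labels.get(m.group(1), ""), template)


description = "Service {{ $labels.statefulset }} has been down for more than 15 minutes"
print(render_annotation(description, {"statefulset": "unchained-indexer"}))
```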
Recording Rules
Recording rules precompute frequently used PromQL expressions to optimize and simplify queries in dashboards or alert rules.
Example: namespace_coinstack:unchained_http_request_count:sum_rate
Purpose: Aggregate rate of HTTP requests per namespace and coinstack over 5 minutes.
Expression:
`sum(rate(unchained_http_request_count[5m])) by (namespace, coinstack)`
Usage: This metric is then used in further alert expressions like calculating error rates.
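The `sum ... by (namespace, coinstack)` aggregation can be pictured as grouping per-series rates by two labels and adding them up. The sketch below models that grouping only. The rate values, pod names, and coinstack names are hypothetical, and `sum_by` is an illustrative helper rather than anything Prometheus exposes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Sequence, Tuple

Series = Tuple[Mapping[str, str], float]


def sum_by(series: Iterable[Series],
           keys: Sequence[str] = ("namespace", "coinstack")) -> Dict[Tuple[str, ...], float]:
    """Add up series values that share the same values for `keys`."""
    totals: Dict[Tuple[str, ...], float] = defaultdict(float)
    for labels, value in series:
        group = tuple(labels.get(k, "") for k in keys)
        totals[group] += value
    return dict(totals)


# Hypothetical per-pod request rates (requests per second).
rates = [
    ({"namespace": "unchained", "coinstack": "bitcoin", "pod": "api-0"}, 1.5),
    ({"namespace": "unchained", "coinstack": "bitcoin", "pod": "api-1"}, 2.5),
    ({"namespace": "unchained", "coinstack": "ethereum", "pod": "api-0"}, 3.0),
]
print(sum_by(rates))
```

The `pod` label disappears from the result because it is not one of the grouping keys, which is why the recorded series carries only `namespace` and `coinstack`.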
Implementation Details and Algorithms
PromQL Expressions: The file relies heavily on PromQL to evaluate metrics exposed by Kubernetes exporters (`kube-state-metrics`) and application metrics (`unchained_http_request_count`).
Alerting Logic:
Alerts use threshold-based conditions (e.g., zero replicas, restart counts, error rates).
The `for` field prevents alert flapping by requiring conditions to hold for a specified duration.
Label placeholders in annotations allow dynamic insertion of relevant entity names (e.g., pod, deployment, coinstack).
Aggregation by Labels:
Metrics are grouped by labels such as `namespace`, `coinstack`, `pod`, `deployment`, and `statefulset` to generate scoped alerts.
This granularity supports targeted notifications and operational troubleshooting.
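The effect of the `for` field on flapping can be illustrated with a small state model: an alert stays pending while its condition holds and only fires once the condition has held for the full duration. This is a simplified sketch and not Prometheus's own implementation. The 5-minute evaluation interval and the function name are assumptions made for the example.

```python
from typing import Iterable, List, Optional, Tuple


def alert_states(evaluations: Iterable[Tuple[int, bool]], for_seconds: int) -> List[str]:
    """Return 'inactive', 'pending', or 'firing' for each evaluation.

    `evaluations` holds (timestamp in seconds, condition result) pairs.
    Any false result resets the pending timer.
    """
    states: List[str] = []
    active_since: Optional[int] = None
    for ts, is_true in evaluations:
        if not is_true:
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts
        held_for = ts - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# Hypothetical evaluations every 300 seconds with a 15-minute (900 s) duration.
steady = [(0, True), (300, True), (600, True), (900, True)]
flapping = [(0, True), (300, False), (600, True), (900, True)]
print(alert_states(steady, 900))
print(alert_states(flapping, 900))
```

In the flapping series the brief recovery resets the timer, so no notification is sent even though the condition was true in three of four evaluations.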
Integration and Interaction
With Prometheus: This file is loaded into Prometheus as part of its alerting rules configuration. Prometheus periodically evaluates these rules against ingested metrics.
With Alertmanager: Alerts fired by Prometheus based on these rules are sent to Alertmanager, which applies routing, grouping, inhibition, and notification policies (e.g., sending alerts to Discord channels).
With Kubernetes: Metrics used in the rules come from Kubernetes resource exporters, ensuring real-time visibility into the health of stateful sets, deployments, and pods.
With API Metrics: Custom application metrics (`unchained_http_request_count`) are used to monitor API reliability and error rates per namespace and coinstack.
Usage Example
Suppose a StatefulSet named `unchained-indexer` in the `unchained` namespace goes down (zero replicas) for more than 15 minutes:
Prometheus evaluates `kube_statefulset_status_replicas_available == 0` and finds it true for `unchained-indexer`.
After 15 minutes, the `UnchainedStatefulSetDown` alert fires.
Alertmanager receives the alert and routes it to the critical notification channel (e.g., Discord).
Operators receive a message:
Summary: Unchained stateful set is currently down
Description: Service unchained-indexer has been down for more than 15 minutes
This enables prompt investigation and remediation.
Mermaid Diagram: Rule Groups and Their Rules
```mermaid
flowchart TD
    subgraph General["General Rules"]
        GS1[UnchainedStatefulSetDown]
        GS2[UnchainedDeploymentDown]
        GS3[UnchainedHighPodRestartCount]
    end
    subgraph API["API Rules"]
        API1["Record: namespace_coinstack:unchained_http_request_count:sum_rate"]
        API2["Record: namespace_coinstack:unchained_http_request_count_5xx:sum_rate"]
        API3["Alert: UnchainedHigh5xxApiErrorRate"]
    end
    GS1 -->|Alert on| M1["kube_statefulset_status_replicas_available"]
    GS2 -->|Alert on| M2["kube_deployment_status_replicas_available"]
    GS3 -->|Alert on| M3["kube_pod_container_status_restarts_total"]
    API1 -->|Record sum rate| M4["unchained_http_request_count"]
    API2 -->|Record 5xx rate| M5["unchained_http_request_count{statusCode=~#quot;5.*#quot;}"]
    API3 -->|Alert on error rate| API1 & API2
    style General fill:#f9f,stroke:#333,stroke-width:1px
    style API fill:#bbf,stroke:#333,stroke-width:1px
```
Summary
Purpose: Defines Prometheus alerting and recording rules to monitor Kubernetes resources and API error rates in the Unchained platform.
Key Alerts: Detect service unavailability (`UnchainedStatefulSetDown`, `UnchainedDeploymentDown`), pod instability (`UnchainedHighPodRestartCount`), and API reliability issues (`UnchainedHigh5xxApiErrorRate`).
Function: Uses PromQL expressions to evaluate live metrics, triggering alerts after sustained conditions.
Integration: Part of the monitoring stack feeding into Alertmanager for notification routing.
Benefits: Enables early detection of infrastructure and service issues, improving system reliability and operational awareness.
For further details on how these rules fit into the overall alerting and notification architecture, see the related Alertmanager configuration and Discord notification templates.