Prometheus & Grafana Monitoring

Overview

The Prometheus & Grafana Monitoring module provides observability for the ShapeShift Unchained platform’s blockchain services and infrastructure. It pairs Prometheus for metrics collection with Grafana for visualization, so operators and developers can monitor system health, performance, and resource usage across all deployed blockchain coinstacks and their supporting components.

The module collects runtime metrics from Kubernetes-managed services (daemons, indexers, APIs), consolidates them in Prometheus, and displays them through tailored Grafana dashboards. It also supports alerting rules that notify teams about system anomalies through communication channels like Discord.

Core Concepts and Purpose

Why This Module Exists

Centralized Observability: The system runs multiple blockchain coinstacks, each consisting of numerous services. Monitoring all these components in a centralized, standardized manner is essential for maintaining high availability and performance.
Health and Performance Tracking: Continuous tracking of pod readiness, resource utilization (CPU, memory, storage), request counts, and response times helps identify issues early.
Alerting: Automated alerts ensure timely notifications for critical or warning conditions, reducing downtime and enabling quick remediation.
Audit and Historical Analysis: Retaining metrics data over extended periods supports trend analysis and capacity planning.

Architecture and How It Works

Prometheus Stack Deployment

The monitoring infrastructure is deployed into a dedicated Kubernetes namespace (e.g., `unchained-monitoring`). The main components deployed are:

Prometheus: Collects and stores metrics data from Kubernetes nodes, pods, and application endpoints.
Grafana: Provides dashboards for data visualization and querying.
Alertmanager: Manages alert notifications based on Prometheus rules.
Kube-State-Metrics: Exposes Kubernetes cluster state metrics for Prometheus consumption.

Deployment is automated and configured through Pulumi, an infrastructure-as-code framework used here with TypeScript, as seen in `monitoring/src/index.ts`.

Key Deployment Configuration Highlights

The Helm chart kube-prometheus-stack (version 52.1.0) is used for deploying the full Prometheus ecosystem, including Prometheus, Grafana, Alertmanager, and exporters.
Persistent volume claims use the AWS EBS gp3 storage class with IOPS/throughput annotations to provide reliable storage for metrics retention.
Grafana is configured with:
- A custom admin password (defaulting to `unchained` if not provided).
- A persistent volume for dashboard and data storage.
- Dashboard providers and preloaded dashboards (notably the overview.json dashboard).
- GitHub OAuth authentication enabling access control based on GitHub organizations.
Alertmanager loads a custom configuration file (`config.yaml`) and templates (`discord.tmpl`) that support Discord webhook integration for alert notifications.
Additional Prometheus alerting rules are loaded from alertmanager/rules.json.
Kubernetes monitoring settings are adjusted with label allowlists for kube-state-metrics.
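
The storage settings above can be sanity-checked with the sizing rule of thumb from the Prometheus storage documentation (disk bytes ≈ retention seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample). The sketch below is only an estimate: the 60-day retention and 1000Gi volume come from the deployment snippet later in this document, while the 2 bytes per sample and the 50,000 samples/s load are assumptions for illustration, not measured values.

```typescript
// Rough Prometheus disk sizing:
//   needed_bytes = retention_seconds * samples_per_second * bytes_per_sample

const SECONDS_PER_DAY = 86400;
const GIB = 1024 * 1024 * 1024;

/** Estimated on-disk size in bytes for a retention window and ingestion rate. */
function estimateDiskBytes(
  retentionDays: number,
  samplesPerSecond: number,
  bytesPerSample: number = 2 // assumption: upper end of the 1-2 byte rule of thumb
): number {
  return retentionDays * SECONDS_PER_DAY * samplesPerSecond * bytesPerSample;
}

/** Highest sustained ingestion rate that still fits in a volume of the given size. */
function maxSamplesPerSecond(
  volumeGiB: number,
  retentionDays: number,
  bytesPerSample: number = 2
): number {
  return (volumeGiB * GIB) / (retentionDays * SECONDS_PER_DAY * bytesPerSample);
}

// 1000Gi volume with 60d retention (values from the deployment snippet).
const ceiling = maxSamplesPerSecond(1000, 60);

// Hypothetical load of 50,000 samples/s over the same retention window.
const usedGiB = estimateDiskBytes(60, 50000) / GIB;
```

Under these assumptions the volume tolerates an ingestion rate on the order of 100,000 samples per second, and the hypothetical 50,000 samples/s load would use a little under half of it. The actual ingestion rate can be read from Prometheus itself and substituted for the assumed figure.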

Grafana Ingress and TLS Setup

The Grafana service is exposed externally via Traefik ingress routes with TLS certificates provisioned by cert-manager using Let’s Encrypt. This is implemented in `monitoring/src/grafana.ts`:

A Certificate resource is created for the domain `monitoring.{domain}`.
An IngressRoute resource defines Traefik routing rules for the HTTP and HTTPS entry points.
A fallback Kubernetes Ingress resource provides basic routing.
An optional additional domain is supported, allowing flexible DNS configurations.
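
To make the shape of these resources concrete, the following sketch builds the Certificate and IngressRoute manifests as plain objects. It is not the code in `monitoring/src/grafana.ts`: the resource names, the TLS secret name, the `ClusterIssuer` kind, the `web`/`websecure` entry point names, and the `traefik.io/v1alpha1` API version (older Traefik releases use `traefik.containo.us/v1alpha1`) are all assumptions chosen for illustration.

```typescript
const TLS_SECRET = 'grafana-tls'; // hypothetical secret name shared by both resources

/** Host names to serve: monitoring.<domain>, plus the optional additional domain. */
function monitoringHosts(domain: string, additionalDomain?: string): string[] {
  const hosts = ['monitoring.' + domain];
  if (additionalDomain) hosts.push('monitoring.' + additionalDomain);
  return hosts;
}

/** cert-manager Certificate requesting a TLS certificate for the given hosts. */
function grafanaCertificate(namespace: string, hosts: string[], issuer: string) {
  return {
    apiVersion: 'cert-manager.io/v1',
    kind: 'Certificate',
    metadata: { name: 'grafana', namespace: namespace },
    spec: {
      secretName: TLS_SECRET, // cert-manager writes the issued certificate here
      dnsNames: hosts,
      issuerRef: { name: issuer, kind: 'ClusterIssuer' }, // assumed issuer kind
    },
  };
}

/** Traefik IngressRoute sending matching hosts to the Grafana service. */
function grafanaIngressRoute(
  namespace: string,
  hosts: string[],
  serviceName: string,
  servicePort: number
) {
  // Traefik rule syntax: Host(`a`) || Host(`b`)
  const match = hosts.map((host) => 'Host(`' + host + '`)').join(' || ');
  return {
    apiVersion: 'traefik.io/v1alpha1',
    kind: 'IngressRoute',
    metadata: { name: 'grafana', namespace: namespace },
    spec: {
      entryPoints: ['web', 'websecure'], // assumed names for the HTTP/HTTPS entry points
      routes: [{ kind: 'Rule', match: match, services: [{ name: serviceName, port: servicePort }] }],
      tls: { secretName: TLS_SECRET },
    },
  };
}
```

In a Pulumi program, objects like these could be passed to a custom resource constructor; they are kept as plain data here so the structure is easy to inspect.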

Metrics Collection

Metrics are gathered from multiple sources:

Kubernetes Metrics: Pod readiness, number of replicas, and resource limits/usage are scraped from Kubernetes APIs and kubelet endpoints.
Application Metrics: Blockchain coinstack services expose HTTP metrics endpoints instrumented for Prometheus. Metrics include request counts, response durations, and WebSocket connection counts.
Custom Metrics: Namespaced metrics such as unchained_http_request_count and unchained_ws_client_count provide granular insights into API usage and real-time connection status.
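
As an illustration of what a scraped endpoint returns, the sketch below renders a counter in the Prometheus text exposition format. It is a minimal stand-in written without any client library, not the instrumentation the coinstack services actually use; treating `unchained_http_request_count` as a counter and the label names `coinstack` and `statusCode` are assumptions.

```typescript
type Labels = { [name: string]: string };

/** Escape a label value: backslash, double quote, and newline must be escaped. */
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, '\\\\').replace(/"/g, '\\"').replace(/\n/g, '\\n');
}

/** Render labels as a="1",b="2", sorted by name so label order never creates a duplicate series. */
function renderLabels(labels: Labels): string {
  return Object.keys(labels)
    .sort()
    .map((name) => name + '="' + escapeLabelValue(labels[name]) + '"')
    .join(',');
}

class Counter {
  // One running total per distinct label set, keyed by the rendered labels.
  private series: { [renderedLabels: string]: number } = {};

  constructor(private name: string, private help: string) {}

  inc(labels: Labels = {}, amount: number = 1): void {
    if (amount < 0) throw new Error('counters can only increase');
    const key = renderLabels(labels);
    this.series[key] = (this.series[key] || 0) + amount;
  }

  /** Body of a metrics endpoint response for this one metric. */
  render(): string {
    const lines = ['# HELP ' + this.name + ' ' + this.help, '# TYPE ' + this.name + ' counter'];
    for (const key of Object.keys(this.series).sort()) {
      const selector = key === '' ? '' : '{' + key + '}';
      lines.push(this.name + selector + ' ' + this.series[key]);
    }
    return lines.join('\n') + '\n';
  }
}
```

A request handler would call `inc` once per request with that request's labels, and the metrics route would return `render()` as plain text for Prometheus to scrape.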

The Grafana dashboards embed Prometheus queries (PromQL) that aggregate these metrics for display.

Grafana Dashboards

A rich set of preconfigured dashboards is provided, with `overview.json` being a primary example (located in `monitoring/src/dashboards/overview.json`). These dashboards feature:

Health Gauges: Display the percentage of ready replicas for both API deployments and StatefulSets per coinstack.
Resource Usage Graphs: Timeseries charts showing CPU and memory usage as a percentage of allocated resources.
Request Metrics: Timeseries and tables showing request count and average request duration, segmented by endpoint and HTTP status.
WebSocket Metrics: Real-time counts of active WebSocket client connections per coinstack.
Storage Utilization: Gauges displaying volume space usage for daemon and indexer persistent storage.
Restart Counts: Timeseries of pod/container restart events indicating stability issues.

Dashboards are organized per coinstack (e.g., Arbitrum, Avalanche, Bitcoin, Ethereum) and include repeated panels that dynamically adapt to deployed coinstacks.
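
The panels described above are driven by PromQL expressions. The sketch below shows how queries of this kind could be assembled; it does not reproduce the expressions in `overview.json`. The metric names come from kube-state-metrics and cAdvisor, while the namespace and workload name patterns passed in are hypothetical.

```typescript
type WorkloadKind = 'deployment' | 'statefulset';

/** PromQL selector limited to one namespace, with a regex match on one label. */
function selector(namespace: string, label: string, pattern: string, extra: string = ''): string {
  return '{namespace="' + namespace + '",' + label + '=~"' + pattern + '"' + extra + '}';
}

/** Percentage of desired replicas that are ready (the health gauge). */
function readyPercent(kind: WorkloadKind, namespace: string, pattern: string): string {
  const sel = selector(namespace, kind, pattern);
  const ready = 'kube_' + kind + '_status_replicas_ready' + sel;
  // kube-state-metrics names the desired-replica metric differently per workload kind.
  const desired =
    (kind === 'deployment' ? 'kube_deployment_spec_replicas' : 'kube_statefulset_replicas') + sel;
  return '100 * sum(' + ready + ') / sum(' + desired + ')';
}

/** CPU usage as a percentage of the CPU limit for matching pods. */
function cpuPercentOfLimit(namespace: string, podPattern: string): string {
  const usage = 'container_cpu_usage_seconds_total' + selector(namespace, 'pod', podPattern);
  const limit =
    'kube_pod_container_resource_limits' + selector(namespace, 'pod', podPattern, ',resource="cpu"');
  return '100 * sum(rate(' + usage + '[5m])) / sum(' + limit + ')';
}
```

In a repeated panel, the pattern argument would typically come from a Grafana template variable rather than a fixed string, which is what lets one panel definition serve every deployed coinstack.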

Alerting Configuration

Alerting is managed by Prometheus Alertmanager, using a custom configuration (`alertmanager/config.yaml`) and alerting rules (`alertmanager/rules.json`):

The alerting rules define conditions on metrics such as pod readiness, CPU/memory saturation, and service availability.
Alertmanager sends notifications to Discord channels using webhooks configured via environment variables (DISCORD_WEBHOOK_URL_CRITICAL, DISCORD_WEBHOOK_URL_WARNING, etc.).
Custom templates (discord.tmpl) format alert messages for Discord, enhancing readability and context.
The alerting setup ensures operational teams receive timely notifications for critical incidents or warning states.
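
For readers unfamiliar with Prometheus rule files, the sketch below builds a small rule group of the kind such a file could contain. The alert names, thresholds, durations, and the top-level `groups` layout are assumptions for illustration; the actual rules live in `alertmanager/rules.json` and are not reproduced here.

```typescript
interface AlertRule {
  alert: string; // alert name shown in notifications
  expr: string; // PromQL condition; the alert is active while it returns results
  for: string; // how long the condition must hold before the alert fires
  labels: { severity: 'critical' | 'warning' };
  annotations: { summary: string; description: string };
}

interface RuleFile {
  groups: { name: string; rules: AlertRule[] }[];
}

/** Fires when a pod in the namespace reports that it is not ready. */
function podNotReadyRule(namespace: string): AlertRule {
  return {
    alert: 'PodNotReady', // hypothetical alert name
    expr: 'kube_pod_status_ready{namespace="' + namespace + '",condition="true"} == 0',
    for: '10m',
    labels: { severity: 'critical' },
    annotations: {
      summary: 'Pod {{ $labels.pod }} is not ready',
      description: 'Pod {{ $labels.pod }} in {{ $labels.namespace }} has not been ready for 10 minutes.',
    },
  };
}

/** Fires when a pod's memory working set exceeds a percentage of its limit. */
function memorySaturationRule(namespace: string, thresholdPercent: number): AlertRule {
  const ns = 'namespace="' + namespace + '"';
  return {
    alert: 'MemoryNearLimit', // hypothetical alert name
    expr:
      '100 * sum by (pod) (container_memory_working_set_bytes{' + ns + ',container!=""})' +
      ' / sum by (pod) (kube_pod_container_resource_limits{' + ns + ',resource="memory"})' +
      ' > ' + thresholdPercent,
    for: '15m',
    labels: { severity: 'warning' },
    annotations: {
      summary: 'Pod {{ $labels.pod }} is close to its memory limit',
      description: 'Memory usage has stayed above ' + thresholdPercent + '% of the limit for 15 minutes.',
    },
  };
}

function buildRuleFile(namespace: string): RuleFile {
  return {
    groups: [{ name: 'unchained', rules: [podNotReadyRule(namespace), memorySaturationRule(namespace, 90)] }],
  };
}
```

The `severity` label is what Alertmanager routing would typically match on to choose between the critical and warning Discord webhooks.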

Interactions with Other System Parts

Kubernetes Cluster: The monitoring components scrape metrics from all namespaces where blockchain coinstacks are deployed.
Blockchain Services: Each coinstack’s API servers and daemon pods expose Prometheus metrics endpoints that are scraped.
Pulumi Infrastructure: The monitoring stack deployment is managed via Pulumi scripts, ensuring consistent and repeatable provisioning alongside other infrastructure components.
Traefik: Acts as a reverse proxy routing external traffic to the Grafana service with TLS termination.
Alert Notification Channels: Discord is integrated as an alert recipient, linking observability with communication tools.

Example Snippet from Deployment Code (`monitoring/src/index.ts`)

This snippet demonstrates the configuration of the Prometheus stack Helm release with custom values:

import { readFileSync } from 'fs'
import * as k8s from '@pulumi/kubernetes'

// `name`, `namespace`, `domain`, `additionalDomain`, and `provider` are defined elsewhere in the module.
new k8s.helm.v3.Release(
  name,
  {
    name,
    chart: 'kube-prometheus-stack',
    version: '52.1.0',
    repositoryOpts: { repo: 'https://prometheus-community.github.io/helm-charts' },
    namespace,
    values: {
      prometheus: {
        prometheusSpec: {
          // Keep 60 days of metrics on a dedicated EBS gp3 volume.
          retention: '60d',
          storageSpec: {
            volumeClaimTemplate: {
              metadata: { annotations: { 'ebs.csi.aws.com/iops': '3000' } },
              spec: { storageClassName: 'gp3', accessModes: ['ReadWriteOnce'], resources: { requests: { storage: '1000Gi' } } },
            },
          },
        },
      },
      grafana: {
        // Falls back to 'unchained' when GRAFANA_ADMIN_PASSWORD is not set.
        adminPassword: process.env.GRAFANA_ADMIN_PASSWORD ?? 'unchained',
        persistence: { enabled: true, size: '10Gi', storageClassName: 'gp3' },
        // Preload the overview dashboard from the repository.
        dashboards: {
          default: {
            overview: { json: readFileSync('./dashboards/overview.json').toString() },
          },
        },
        'grafana.ini': {
          server: { root_url: `https://monitoring.${additionalDomain ?? domain}` },
          // GitHub OAuth login, restricted to the configured organization(s).
          'auth.github': {
            enabled: true,
            allowed_organizations: process.env.GITHUB_ORG,
            client_id: process.env.GITHUB_OAUTH_CLIENT_ID,
            client_secret: process.env.GITHUB_OAUTH_SECRET,
          },
        },
      },
      alertmanager: {
        // Substitute the webhook placeholder in config.yaml; only the critical webhook is shown here.
        stringConfig: readFileSync('./alertmanager/config.yaml').toString()
          .replace('<<DISCORD_WEBHOOK_URL_CRITICAL>>', process.env.DISCORD_WEBHOOK_URL_CRITICAL ?? ''),
        // The chart only uses stringConfig when tplConfig is true.
        tplConfig: true,
        templateFiles: { 'discord.tmpl': readFileSync('./alertmanager/discord.tmpl').toString() },
      },
      // Extra alerting rules, loaded from the repository.
      additionalPrometheusRulesMap: {
        unchained: JSON.parse(readFileSync('./alertmanager/rules.json').toString()),
      },
    },
  },
  { provider }
)
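
The snippet substitutes a single `<<DISCORD_WEBHOOK_URL_CRITICAL>>` placeholder. When a configuration file contains several such placeholders, one way to handle them uniformly is sketched below. This is not the module's code: the function name is hypothetical, and unlike the snippet (which falls back to an empty string) this variant fails when a value is missing, so an empty webhook URL cannot reach Alertmanager unnoticed.

```typescript
type Env = { [name: string]: string | undefined };

/**
 * Replace every <<NAME>> placeholder in the template with env[NAME].
 * Throws if a referenced variable is unset or empty.
 */
function renderPlaceholders(template: string, env: Env): string {
  // A replacer function is used so that characters such as `$` in a value
  // are inserted literally instead of being read as replacement patterns.
  return template.replace(/<<([A-Z0-9_]+)>>/g, (_match: string, name: string): string => {
    const value = env[name];
    if (value === undefined || value === '') {
      throw new Error('missing value for placeholder ' + name);
    }
    return value;
  });
}
```

In the deployment code this could be called with the contents of `config.yaml` and `process.env`; the global regular expression also replaces repeated occurrences of the same placeholder, which a string-pattern `replace` does not.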

Mermaid Diagram: Deployment and Data Flow of Prometheus & Grafana Monitoring

flowchart TD
  K8s[Kubernetes Cluster] -->|Metrics scraped| Prometheus[Prometheus Server]
  Prometheus -->|Stores metrics| Storage[(Persistent Volume)]
  Prometheus -->|Sends alerts| Alertmanager[Alertmanager]
  Alertmanager -->|Notifies| Discord[Discord Webhooks]
  Prometheus -->|Data queries| Grafana[Grafana]
  Grafana -->|Dashboard UI| Operator[Operators/Developers]
  Traefik[Traefik Proxy] -->|Route HTTPS| Grafana
  CertManager[Cert-Manager] -->|Provides TLS| Traefik

Summary of Key Points

The module deploys a Prometheus monitoring stack including Prometheus, Grafana, Alertmanager, and supporting exporters via a Helm chart managed by Pulumi.
Persistent storage and AWS EBS-specific tuning ensure durability and performance of metrics data.
Grafana dashboards provide detailed visualizations of service health, resource usage, request metrics, and WebSocket connections per blockchain coinstack.
Alertmanager is configured with custom alert rules and Discord integration for timely incident notifications.
TLS-secured external access to Grafana is provided via Traefik and cert-manager.
The monitoring system integrates closely with the Kubernetes cluster and application services by scraping metrics endpoints and Kubernetes state data.

Together, these capabilities give the ShapeShift Unchained platform the observability and alerting it needs to operate reliably.