Prometheus & Grafana Monitoring

Overview

The Prometheus & Grafana Monitoring module provides observability for the ShapeShift Unchained platform’s blockchain services and infrastructure. It pairs Prometheus for metrics collection with Grafana for visualization, so operators and developers can monitor system health, performance, and resource usage across all deployed blockchain coinstacks and their supporting components.

This module addresses the need for proactive system monitoring, alerting, and diagnostics by collecting critical runtime metrics from Kubernetes-managed services (daemons, indexers, APIs), consolidating these metrics, and displaying them through tailored Grafana dashboards. It also supports alerting rules that notify teams about system anomalies through communication channels like Discord.


Core Concepts and Purpose

Why This Module Exists


Architecture and How It Works

Prometheus Stack Deployment

The monitoring infrastructure is deployed into a dedicated Kubernetes namespace (e.g., `unchained-monitoring`). The main components, installed together from the `kube-prometheus-stack` Helm chart, are Prometheus, Grafana, and Alertmanager.

Deployment is automated and configured through Pulumi, an infrastructure-as-code framework that is used here with TypeScript, as seen in `monitoring/src/index.ts`.

Key Deployment Configuration Highlights

Grafana Ingress and TLS Setup

The Grafana service is exposed externally via Traefik ingress routes, with TLS certificates provisioned by cert-manager using Let’s Encrypt. This is implemented in `monitoring/src/grafana.ts`.
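
To make the ingress and TLS arrangement concrete, the following is a minimal sketch of the two Kubernetes manifests involved, written as plain TypeScript objects. It is not taken from `monitoring/src/grafana.ts`. The resource names, the `websecure` entry point, the Grafana Service name and port, the issuer name, and the Traefik API version are all assumptions for illustration.

```typescript
// Illustrative inputs; every value is supplied by the caller.
interface GrafanaExposure {
  host: string               // e.g. monitoring.<domain>
  namespace: string          // e.g. unchained-monitoring
  serviceName: string        // assumed name of the Grafana Service
  servicePort: number        // assumed port of the Grafana Service
  issuerName: string         // hypothetical ClusterIssuer backed by Let's Encrypt
  traefikApiVersion?: string // depends on the installed Traefik version
}

// Hypothetical name shared by the Certificate and the IngressRoute TLS block.
const TLS_SECRET = 'grafana-cert'

// cert-manager Certificate: asks the issuer for a certificate covering the
// host and stores it in the TLS_SECRET secret.
function grafanaCertificate(o: GrafanaExposure) {
  return {
    apiVersion: 'cert-manager.io/v1',
    kind: 'Certificate',
    metadata: { name: TLS_SECRET, namespace: o.namespace },
    spec: {
      secretName: TLS_SECRET,
      dnsNames: [o.host],
      issuerRef: { name: o.issuerName, kind: 'ClusterIssuer' },
    },
  }
}

// Traefik IngressRoute: routes HTTPS requests for the host to Grafana and
// terminates TLS with the secret created above.
function grafanaIngressRoute(o: GrafanaExposure) {
  return {
    apiVersion: o.traefikApiVersion ?? 'traefik.io/v1alpha1',
    kind: 'IngressRoute',
    metadata: { name: 'grafana', namespace: o.namespace },
    spec: {
      entryPoints: ['websecure'], // assumed name of the HTTPS entry point
      routes: [
        {
          match: `Host(\`${o.host}\`)`,
          kind: 'Rule',
          services: [{ name: o.serviceName, port: o.servicePort }],
        },
      ],
      tls: { secretName: TLS_SECRET },
    },
  }
}
```

In a Pulumi program, objects like these could be passed to a custom resource; older Traefik releases use the `traefik.containo.us/v1alpha1` API group instead, which is why the version is left configurable here.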


Metrics Collection

Metrics are gathered from multiple sources:

The module’s configuration includes Prometheus queries embedded in Grafana dashboards to present these metrics meaningfully.


Grafana Dashboards

A rich set of preconfigured dashboards is provided, with `overview.json` being a primary example (located in `monitoring/src/dashboards/overview.json`). These dashboards feature:

Dashboards are organized per coinstack (e.g., Arbitrum, Avalanche, Bitcoin, Ethereum) and include repeated panels that dynamically adapt to deployed coinstacks.
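
As a rough illustration of how a repeated panel adapts to the deployed coinstacks, the sketch below builds a fragment of a Grafana dashboard model with one template variable and one panel that repeats over it. It is not the content of `overview.json`. The variable name `coinstack`, the panel title, and the pod naming pattern in the query are assumptions, and the result is a fragment rather than a complete dashboard.

```typescript
// Minimal subset of the Grafana panel model used in this sketch.
interface Panel {
  title: string
  type: string
  repeat?: string               // name of the variable to repeat over
  repeatDirection?: 'h' | 'v'
  targets: { expr: string; legendFormat?: string }[]
}

function coinstackDashboard(coinstacks: string[]) {
  const restartsPanel: Panel = {
    title: 'Pod restarts: $coinstack',
    type: 'timeseries',
    repeat: 'coinstack',        // Grafana renders one copy per selected value
    repeatDirection: 'h',
    targets: [
      {
        // Assumes pods are named "<coinstack>-..."; adjust to the real labels.
        expr: 'sum(increase(kube_pod_container_status_restarts_total{pod=~"$coinstack-.*"}[1h]))',
        legendFormat: '$coinstack',
      },
    ],
  }

  return {
    title: 'Coinstacks (sketch)',
    templating: {
      list: [
        {
          name: 'coinstack',
          type: 'custom',       // custom variables take comma-separated values
          query: coinstacks.join(','),
          includeAll: true,
          multi: true,
        },
      ],
    },
    panels: [restartsPanel],
  }
}

// Example: a model with values for two coinstacks.
const dashboardSketch = coinstackDashboard(['bitcoin', 'ethereum'])
```

Adding a coinstack to the variable is then enough for a new panel to appear, with no change to the panel definition itself.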


Alerting Configuration

Alerting rules (`alertmanager/rules.json`) are evaluated by Prometheus. Alerts that fire are routed by Prometheus Alertmanager, which uses a custom configuration (`alertmanager/config.yaml`) to deliver notifications.
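
For orientation, the sketch below shows the general shape of a Prometheus rule file (groups containing alerting rules) together with a small sanity check. The rule shown is a hypothetical example, not one from `alertmanager/rules.json`, and `validateRules` is an illustrative helper rather than part of the module.

```typescript
// Shape of a Prometheus alerting rule and rule file.
interface AlertRule {
  alert: string                         // alert name
  expr: string                          // PromQL expression that triggers the alert
  for?: string                          // how long expr must hold before firing
  labels?: Record<string, string>
  annotations?: Record<string, string>
}

interface RuleFile {
  groups: { name: string; rules: AlertRule[] }[]
}

// Hypothetical rule: a container restarted more than 3 times in 15 minutes.
const exampleRules: RuleFile = {
  groups: [
    {
      name: 'coinstack.rules',
      rules: [
        {
          alert: 'PodRestartingFrequently',
          expr: 'increase(kube_pod_container_status_restarts_total[15m]) > 3',
          for: '5m',
          labels: { severity: 'critical' },
          annotations: { summary: 'Pod {{ $labels.pod }} is restarting frequently' },
        },
      ],
    },
  ],
}

// Returns a list of human-readable problems; an empty list means the file
// passed these basic checks. PromQL syntax itself is not validated.
function validateRules(file: RuleFile): string[] {
  const problems: string[] = []
  for (const group of file.groups) {
    if (!group.name) problems.push('group without a name')
    for (const rule of group.rules) {
      if (!rule.alert) problems.push(`rule in "${group.name}" has no alert name`)
      if (!rule.expr.trim()) problems.push(`rule "${rule.alert}" has an empty expr`)
    }
  }
  return problems
}
```

A check like this could run before deployment so that a malformed rules file is caught early rather than rejected inside the cluster.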


Interactions with Other System Parts


Example Snippet from Deployment Code (monitoring/src/index.ts)

This snippet demonstrates the configuration of the Prometheus stack Helm release with custom values:

```typescript
import { readFileSync } from 'fs'
import * as k8s from '@pulumi/kubernetes'

// `name`, `namespace`, `domain`, `additionalDomain`, and `provider` are defined
// elsewhere in the file and are not shown in this excerpt.
new k8s.helm.v3.Release(
  name,
  {
    name,
    chart: 'kube-prometheus-stack',
    version: '52.1.0',
    repositoryOpts: { repo: 'https://prometheus-community.github.io/helm-charts' },
    namespace,
    values: {
      prometheus: {
        prometheusSpec: {
          // Keep 60 days of metrics on a 1000Gi gp3 volume
          retention: '60d',
          storageSpec: {
            volumeClaimTemplate: {
              metadata: { annotations: { 'ebs.csi.aws.com/iops': '3000' } },
              spec: { storageClassName: 'gp3', accessModes: ['ReadWriteOnce'], resources: { requests: { storage: '1000Gi' } } },
            },
          },
        },
      },
      grafana: {
        // Falls back to a default password when GRAFANA_ADMIN_PASSWORD is unset
        adminPassword: process.env.GRAFANA_ADMIN_PASSWORD ?? 'unchained',
        persistence: { enabled: true, size: '10Gi', storageClassName: 'gp3' },
        // Preconfigured dashboards loaded from JSON files
        dashboards: {
          default: {
            overview: { json: readFileSync('./dashboards/overview.json').toString() },
          },
        },
        'grafana.ini': {
          server: { root_url: `https://monitoring.${additionalDomain ?? domain}` },
          // GitHub OAuth login restricted to the configured organization
          'auth.github': {
            enabled: true,
            allowed_organizations: process.env.GITHUB_ORG,
            client_id: process.env.GITHUB_OAUTH_CLIENT_ID,
            client_secret: process.env.GITHUB_OAUTH_SECRET,
          },
        },
      },
      alertmanager: {
        // Inject the Discord webhook URL into the Alertmanager configuration
        stringConfig: readFileSync('./alertmanager/config.yaml').toString()
          .replace('<<DISCORD_WEBHOOK_URL_CRITICAL>>', process.env.DISCORD_WEBHOOK_URL_CRITICAL ?? ''),
        tplConfig: true,
        templateFiles: { 'discord.tmpl': readFileSync('./alertmanager/discord.tmpl').toString() },
      },
      // Custom alerting rules evaluated by Prometheus
      additionalPrometheusRulesMap: {
        unchained: JSON.parse(readFileSync('./alertmanager/rules.json').toString()),
      },
    },
  },
  { provider }
)
```
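
The `alertmanager.stringConfig` entry above replaces the first occurrence of one placeholder and falls back to an empty string when the environment variable is unset. As a possible variation, and not the module's actual behavior, the sketch below replaces every `<<NAME>>` placeholder and fails loudly when a value is missing. The function name `renderAlertmanagerConfig` and the choice to pass the environment in as a parameter are assumptions made for illustration.

```typescript
// Replaces every <<NAME>> placeholder in the template with env[NAME].
// Throws instead of silently producing a config with an empty value.
function renderAlertmanagerConfig(
  template: string,
  env: Record<string, string | undefined>,
): string {
  return template.replace(/<<([A-Z0-9_]+)>>/g, (_match, key: string) => {
    const value = env[key]
    if (!value) throw new Error(`missing environment variable ${key}`)
    return value
  })
}

// Example with a one-line template and a placeholder URL.
const renderedConfig = renderAlertmanagerConfig(
  'webhook_url: <<DISCORD_WEBHOOK_URL_CRITICAL>>',
  { DISCORD_WEBHOOK_URL_CRITICAL: 'https://example.invalid/hook' },
)
```

In a Pulumi program the second argument would typically be `process.env`. Failing at deploy time trades convenience for safety: a missing webhook URL stops the deployment instead of leaving critical alerts with nowhere to go.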

Mermaid Diagram: Deployment and Data Flow of Prometheus & Grafana Monitoring

```mermaid
flowchart TD
  K8s[Kubernetes Cluster] -->|Metrics scraped| Prometheus[Prometheus Server]
  Prometheus -->|Stores metrics| Storage[(Persistent Volume)]
  Prometheus -->|Sends alerts| Alertmanager[Alertmanager]
  Alertmanager -->|Notifies| Discord[Discord Webhooks]
  Prometheus -->|Data queries| Grafana[Grafana]
  Grafana -->|Dashboard UI| Operator[Operators/Developers]
  Traefik[Traefik Proxy] -->|Route HTTPS| Grafana
  CertManager[Cert-Manager] -->|Provides TLS| Traefik
```

Summary of Key Points

This module underpins the reliability of the ShapeShift Unchained platform: Prometheus collects and stores metrics from the Kubernetes-managed services, Grafana presents them in per-coinstack dashboards, and Alertmanager notifies the team on Discord when alerting rules fire.