Metrics Validation

Run chaos and automatically evaluate Prometheus metrics for a clear pass or fail without manual inspection.

This journey is well suited to CI/CD pipelines where you cannot watch the cluster in real time.

What you need

Steps

  1. Install krknctl by following the installation guide.

  2. Create your alerts profile at config/alerts.yaml. This defines the PromQL expressions Krkn evaluates after each scenario:

    - expr: avg_over_time(histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))[5m:]) > 0.01
      description: "etcd fsync latency too high: {{$value}}"
      severity: error
    
    - expr: sum(kube_pod_status_phase{phase="Failed"}) > 5
      description: "Too many failed pods: {{$value}}"
      severity: error
    

    When a query with severity: error returns a result, Krkn exits with a non-zero code. Matches for queries with severity: info are only logged.

  3. Run a scenario with the alerts profile mounted:

    krknctl run pod-scenarios --alerts-profile config/alerts.yaml
    

    Krkn evaluates the alerts profile at the end of each scenario and reports pass or fail.
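
In a CI/CD job, the pass or fail result can be consumed through the process exit code described in step 2. The following is a minimal sketch, assuming the job runs in a POSIX-compatible shell. `run_gate` is a hypothetical helper name and is not part of krknctl; the krknctl command in the usage comment is the one from step 3.

```shell
#!/bin/sh
# Sketch only: run_gate is a hypothetical helper, not a krknctl feature.
# It runs the command given as arguments, prints a one-line verdict,
# and returns the command's exit status so the CI step fails with it.
run_gate() {
  if "$@"; then
    echo "PASS: metrics validation succeeded"
    return 0
  else
    rc=$?  # exit status of the wrapped command
    echo "FAIL: command exited with status $rc"
    return "$rc"
  fi
}

# Usage in a pipeline step (command taken from step 3):
# run_gate krknctl run pod-scenarios --alerts-profile config/alerts.yaml
```

Because the helper returns the original exit status, most CI systems will mark the step as failed without further configuration whenever a severity: error query fires.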

Reference docs

Next steps

To persist metrics long-term for regression analysis across releases, continue to Long-Term Storage.

Last modified April 21, 2026: adding user journeys (#285) (9a4d731)