Choose Your Path

Find the right Krkn setup for your goals — from a quick first run to multi-cluster resilience pipelines.

Whether you’re running your first chaos scenario or building a production resilience pipeline, there’s a path here for you. Pick the journey that matches your experience level and goals — each one builds on the previous, so you can start simple and add complexity when you’re ready.

New to Krkn? Start with Basic Run — no configuration files, no metrics, just run a scenario and see what happens.

Familiar with the basics? Add automatic pass/fail evaluation with Metrics Validation, then layer in a Resilience Score to get a percentage-based view of how your system held up.

Running chaos regularly? Move to Long-Term Storage to persist metrics across runs and spot regressions between releases.

Operating at scale? Use Multi-Cluster Orchestration to drive chaos across multiple clusters or cloud environments from a single control point.

Journey                       I want to…                                            Experience level   Tools needed
Basic Run                     Inject chaos and observe results manually             Beginner           krknctl
Metrics Validation            Automatically pass/fail based on Prometheus metrics   Intermediate       krknctl + Prometheus
Resilience Score              Generate a scored report to validate an environment   Intermediate       krknctl + Prometheus
Long-Term Storage             Store metrics across runs for regression analysis     Advanced           krknctl + Prometheus + Elasticsearch
Multi-Cluster Orchestration   Run chaos across multiple clusters or clouds          Advanced           krkn-operator

1 - Basic Run

Goal: Run a chaos scenario and observe what happens — no metrics, no scoring, no pipeline.

This is the best starting point if you are new to Krkn or want to explore a specific scenario quickly.

What you need

  • A running Kubernetes or OpenShift cluster
  • A kubeconfig with cluster access
  • krknctl (recommended) or krkn installed

Steps

  1. Install krknctl — follow the installation guide.

  2. List available scenarios to find one that fits your target:

    krknctl list
    
  3. Run a scenario — for example, to kill pods matching a label:

    krknctl run pod-scenarios
    

    krknctl will prompt you for required inputs interactively, or you can pass them as flags.

  4. Observe results in your cluster using kubectl or your existing monitoring tools. Krkn logs pass/fail and recovery status to stdout.

    For pod scenarios specifically, you can confirm the pod was killed and recovered by checking its age. A restarted pod will show a much shorter uptime than its neighbours:

    kubectl get pods -A
    

    Example output after a pod scenario targeting the my-app namespace:

    NAMESPACE     NAME                           READY   STATUS    RESTARTS   AGE
    my-app        frontend-7d9f8b6c4-xk2pq       1/1     Running   0          8s
    my-app        backend-5c6d7f8b9-lm3rt        1/1     Running   0          4d2h
    kube-system   coredns-787d4945fb-nqpzj       1/1     Running   0          4d2h
    

    The 8s age on frontend shows it was recently restarted by the scenario while all other pods remain unaffected.
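If you save this listing, a quick filter can surface recently restarted pods for you. The helper below is hypothetical (it is not part of krknctl); it simply matches AGE values reported in bare seconds:

```shell
# Sample listing from above, pasted inline; in practice, capture it with:
#   pods=$(kubectl get pods -A)
pods='NAMESPACE     NAME                          READY   STATUS    RESTARTS   AGE
my-app        frontend-7d9f8b6c4-xk2pq      1/1     Running   0          8s
my-app        backend-5c6d7f8b9-lm3rt       1/1     Running   0          4d2h
kube-system   coredns-787d4945fb-nqpzj      1/1     Running   0          4d2h'

# Column 6 is AGE; an age in bare seconds means a very recent restart.
recent=$(echo "$pods" | awk 'NR > 1 && $6 ~ /^[0-9]+s$/ {print $2}')
echo "recently restarted: $recent"
```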

Next steps

To evaluate results automatically instead of inspecting them by hand, continue to Metrics Validation.

2 - Metrics Validation

Goal: Run chaos and automatically evaluate Prometheus metrics — getting a clear pass or fail without manual inspection.

This journey is well suited to CI/CD pipelines where you cannot watch the cluster in real time.

What you need

  • Everything from Basic Run (a running cluster, a kubeconfig, and krknctl)
  • A Prometheus instance scraping the target cluster

Steps

  1. Install krknctl — follow the installation guide.

  2. Create your alerts profile at config/alerts.yaml. This defines the PromQL expressions Krkn evaluates after each scenario:

    - expr: avg_over_time(histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))[5m:]) > 0.01
      description: "etcd fsync latency too high: {{$value}}"
      severity: error
    
    - expr: sum(kube_pod_status_phase{phase="Failed"}) > 5
      description: "Too many failed pods: {{$value}}"
      severity: error
    

    Queries with severity: error cause Krkn to exit with a non-zero code. Queries with severity: info are logged only.

  3. Run a scenario with the alerts profile mounted:

    krknctl run pod-scenarios --alerts-profile config/alerts.yaml
    

    Krkn evaluates the alert profile at the end of each scenario and reports pass or fail.
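In a CI pipeline, the exit code alone can gate a stage. The sketch below is self-contained: the `run_chaos` stub stands in for the real `krknctl run pod-scenarios --alerts-profile config/alerts.yaml` call, and its non-zero return simulates a severity: error alert firing:

```shell
# Stub standing in for the krknctl invocation; returns 1 as if an
# error-severity alert had fired during the run.
run_chaos() { return 1; }

if run_chaos; then
  result=pass
else
  result=fail
fi
echo "chaos gate: $result"
```

In a real pipeline you would replace the stub with the actual krknctl command and let a non-zero exit fail the job.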

Next steps

To persist metrics long-term for regression analysis across releases, continue to Long-Term Storage.

3 - Long-Term Storage

Goal: Persist metrics from every chaos run into Elasticsearch so you can compare behavior across releases, dates, or cluster configurations.

This journey enables regression analysis — for example, detecting that API server latency during a node failure has increased between software versions.
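As an illustration of what that comparison can look like once snapshots are stored, the sketch below checks one metric across two runs. The metric name, the values, and the 25% tolerance are all hypothetical, not Krkn's actual stored schema:

```shell
# etcdFsyncLatencyP99 snapshots from two runs (illustrative values).
baseline=0.006   # previous release
current=0.011    # this release

# Flag the metric if it grew more than 25% over the baseline.
regressed=$(awk -v b="$baseline" -v c="$current" \
  'BEGIN { print (c > b * 1.25) ? "yes" : "no" }')
echo "etcd fsync latency regressed: $regressed"
```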

What you need

  • Everything from Metrics Validation (krknctl and Prometheus)
  • An Elasticsearch instance reachable from where Krkn runs

Steps

  1. Complete Metrics Validation first to confirm Prometheus evaluation is working.

  2. Define your metrics profile at config/metrics.yaml. This controls which Prometheus metrics are snapshotted and stored per run:

    - query: irate(apiserver_request_total{verb="POST"}[2m])
      metricName: apiserverRequestRate
    
    - query: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))
      metricName: etcdFsyncLatencyP99
    
  3. Run a scenario with both profiles mounted:

    krknctl run pod-scenarios \
      --metrics-profile config/metrics.yaml \
      --alerts-profile config/alerts.yaml
    

    After each scenario, metrics snapshots are stored alongside run metadata (scenario type, duration, cluster version, exit status).

  4. Deploy krkn-visualize to query your data through pre-built Grafana dashboards for API performance, etcd health, node and pod scenarios, and more:

    krknctl visualize \
      --es-url https://elasticsearch.example.com \
      --es-username elastic \
      --es-password <your-password> \
      --prometheus-url https://prometheus.example.com \
      --prometheus-bearer-token <your-token> \
      --grafana-password <grafana-admin-password>
    

    This deploys krkn-visualize to your cluster and wires it to both Elasticsearch and Prometheus. To tear it down later:

    krknctl visualize --delete
    

Next steps

To generate a numerical resilience score on top of your Prometheus data, continue to Resilience Score.

4 - Resilience Score

Goal: Generate a numerical score (0–100%) that represents how well your environment held up during chaos — giving you more signal than a binary pass/fail.

A resilience score lets you track improvement over time, compare environments, and set score thresholds as release gates.

What you need

  • krknctl installed
  • A Prometheus instance scraping the target cluster (see Metrics Validation)

How scoring works

After a chaos scenario completes, Krkn evaluates a set of SLOs (defined as PromQL expressions) over the chaos time window. Each SLO is weighted by severity:

  • Warning SLOs — 1 point each
  • Critical SLOs — 3 points each

The final score is (points passed / total possible points) × 100. A score of 95% indicates a robust system with minor degradation; 60% signals significant issues that need investigation even if the scenario technically passed.
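The arithmetic can be sketched directly. The counts below are hypothetical: four critical SLOs of which three pass, plus three warning SLOs that all pass:

```shell
# Critical SLOs are worth 3 points each, warnings 1 point each.
total_points=$(( 4 * 3 + 3 * 1 ))    # 15 possible
passed_points=$(( 3 * 3 + 3 * 1 ))   # 12 earned

score=$(awk -v p="$passed_points" -v t="$total_points" \
  'BEGIN { printf "%.0f", p / t * 100 }')
echo "Resiliency Score: ${score}%"
```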

When running via krknctl, resiliency scoring runs automatically in controller mode — per-scenario scores are captured and aggregated across all scenarios in the run.

Steps

  1. Complete Metrics Validation to confirm Prometheus evaluation is working.

  2. Define your SLOs in config/alerts.yaml. Add a severity to each entry — scoring weights them automatically:

    - expr: avg_over_time(histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))[5m:]) > 0.01
      description: "etcd fsync latency above 10ms"
      severity: critical        # 3 points
    
    - expr: sum(kube_pod_status_phase{phase="Failed"}) > 0
      description: "any pods in Failed phase"
      severity: warning         # 1 point
    
    - expr: increase(apiserver_request_total{code=~"5.."}[5m]) > 10
      description: "API server 5xx errors during chaos"
      severity: critical        # 3 points
    

    You can also set a custom weight: on any entry to override the severity default — see custom weights and a complete example profile.

  3. Run a scenario with the alerts profile mounted:

    krknctl run pod-scenarios --resiliency-file config/alerts.yaml
    

    The resiliency score is printed at the end of the run and written to kraken.report and resiliency-report.json.

    Resiliency Score: 87% (13/15 SLOs passed)
    
  4. Use the score as a gate — in CI, check the exit code and parse the score from the output to enforce a minimum threshold before promoting a build.
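A minimal gate can pull the score out of the saved run output and compare it against a threshold. The line format matches the example output above; the threshold value and the idea of grepping a saved log are assumptions:

```shell
THRESHOLD=80
# In a real pipeline this would come from the saved output, e.g.:
#   line=$(grep 'Resiliency Score' run.log)
line="Resiliency Score: 87% (13/15 SLOs passed)"

score=$(echo "$line" | sed -E 's/.*Score: ([0-9]+)%.*/\1/')
if [ "$score" -lt "$THRESHOLD" ]; then
  echo "resilience gate failed: ${score}% < ${THRESHOLD}%"
  exit 1
fi
echo "resilience gate passed: ${score}%"
```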

Next steps

To orchestrate chaos across multiple clusters from a single control point, continue to Multi-Cluster Orchestration.

5 - Multi-Cluster Orchestration

Goal: Run chaos scenarios across multiple clusters or cloud environments from a single control point — useful for validating a multi-region application or comparing cluster configurations.

The recommended approach for multi-cluster orchestration is krkn-operator: a Kubernetes operator that runs on a dedicated control plane cluster and dispatches chaos scenarios to any number of registered target clusters, without distributing credentials to individual users.

How krkn-operator works

  • A control plane cluster runs the operator and its web console
  • Target clusters are registered once by an administrator (via kubeconfig, service account token, or username/password)
  • Users select one or more targets through the web UI and launch scenarios — they never handle cluster credentials directly
  • Scenarios run in parallel across all selected targets

This design preserves the original Krkn architecture (chaos runs from outside the cluster) while adding a secure, centralized orchestration layer.

What you need

  • A Kubernetes or OpenShift cluster to host the operator (the control plane cluster)
  • Helm 3.0+
  • kubeconfig or service account credentials for each target cluster (held by the admin, not shared with users)

Steps

  1. Install krkn-operator on your control plane cluster using Helm:

    helm install krkn-operator oci://quay.io/krkn-chaos/charts/krkn-operator \
      --version <VERSION> \
      --namespace krkn-operator-system \
      --create-namespace
    

    For production deployments with HA, external access, and monitoring, see the full installation guide.

  2. Access the web console — for local testing use port-forwarding; for production expose it via Ingress, Gateway API, or OpenShift Route:

    kubectl port-forward svc/krkn-operator-console 3000:3000 -n krkn-operator-system
    
  3. Register target clusters — as an administrator, open Admin Settings → Cluster Targets → Add Target and provide the cluster name and credentials for each cluster you want to target. See Configuration for the three supported auth methods (kubeconfig, service account token, username/password).

  4. Run a scenario across multiple clusters — click Run Scenario, select one or more registered target clusters, choose a scenario, configure its parameters, and launch. The operator executes the scenario on all selected targets concurrently.

  5. Monitor in real time — the home dashboard shows all active runs across all clusters. Click any run to see live log streaming and execution status.

ACM/OCM integration

If your organization uses Red Hat Advanced Cluster Management (ACM) or Open Cluster Management (OCM), install the operator with ACM integration enabled. It will automatically discover and sync all ACM-managed clusters as chaos targets — no manual credential management required:

helm install krkn-operator oci://quay.io/krkn-chaos/charts/krkn-operator \
  --version <VERSION> \
  --namespace krkn-operator-system \
  --create-namespace \
  --set acm.enabled=true
