Choose Your Path
Find the right Krkn setup for your goals — from a quick first run to multi-cluster resilience pipelines.
Whether you’re running your first chaos scenario or building a production resilience pipeline, there’s a path here for you. Pick the journey that matches your experience level and goals — each one builds on the previous, so you can start simple and add complexity when you’re ready.
New to Krkn? Start with Basic Run — no configuration files, no metrics, just run a scenario and see what happens.
Familiar with the basics? Add automatic pass/fail evaluation with Metrics Validation, then layer in a Resilience Score to get a percentage-based view of how your system held up.
Running chaos regularly? Move to Long-Term Storage to persist metrics across runs and spot regressions between releases.
Operating at scale? Use Multi-Cluster Orchestration to drive chaos across multiple clusters or cloud environments from a single control point.
| Journey | I want to… | Experience level | Tools needed |
|---|---|---|---|
| Basic Run | Inject chaos and observe results manually | Beginner | krknctl |
| Metrics Validation | Automatically pass/fail based on Prometheus metrics | Intermediate | krknctl + Prometheus |
| Resilience Score | Generate a scored report to validate an environment | Intermediate | krknctl + Prometheus |
| Long-Term Storage | Store metrics across runs for regression analysis | Advanced | krknctl + Prometheus + Elasticsearch |
| Multi-Cluster Orchestration | Run chaos across multiple clusters or clouds | Advanced | krkn-operator |
1 - Basic Run
Goal: Run a chaos scenario and observe what happens — no metrics, no scoring, no pipeline.
This is the best starting point if you are new to Krkn or want to explore a specific scenario quickly.
What you need
- A running Kubernetes or OpenShift cluster
- A kubeconfig with cluster access
- krknctl (recommended) or krkn installed
Steps
Install krknctl — follow the installation guide.
List available scenarios to find one that fits your target.
Run a scenario — for example, to kill pods matching a label:
krknctl run pod-scenarios
krknctl will prompt you for required inputs interactively, or you can pass them as flags.
Observe results in your cluster using kubectl or your existing monitoring tools. Krkn logs pass/fail and recovery status to stdout.
For pod scenarios specifically, you can confirm the pod was killed and recovered by checking its age. A restarted pod will show a much shorter uptime than its neighbours:
Example output of kubectl get pods --all-namespaces after a pod scenario targeting the my-app namespace:
NAMESPACE NAME READY STATUS RESTARTS AGE
my-app frontend-7d9f8b6c4-xk2pq 1/1 Running 0 8s
my-app backend-5c6d7f8b9-lm3rt 1/1 Running 0 4d2h
kube-system coredns-787d4945fb-nqpzj 1/1 Running 0 4d2h
The 8s age on frontend shows it was recently restarted by the scenario while all other pods remain unaffected.
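If you want to script this check rather than eyeball the output, a small shell sketch can flag pods whose age is still measured in seconds. The sample output is embedded in a heredoc for illustration; in practice you would pipe in kubectl get pods --all-namespaces:

```shell
# Print the names of pods whose AGE column is in seconds,
# i.e. pods recreated moments ago. Skips the header row.
flag_recent_restarts() {
  awk 'NR > 1 && $6 ~ /^[0-9]+s$/ { print $2 }'
}

# Sample kubectl output, hardcoded so the snippet is self-contained.
recent=$(flag_recent_restarts <<'EOF'
NAMESPACE     NAME                        READY   STATUS    RESTARTS   AGE
my-app        frontend-7d9f8b6c4-xk2pq    1/1     Running   0          8s
my-app        backend-5c6d7f8b9-lm3rt     1/1     Running   0          4d2h
kube-system   coredns-787d4945fb-nqpzj    1/1     Running   0          4d2h
EOF
)
echo "recently restarted: $recent"
```

With the real cluster, replace the heredoc with a pipe: kubectl get pods --all-namespaces | flag_recent_restarts.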
Next steps
To add automatic pass/fail evaluation with Prometheus, continue to Metrics Validation.
2 - Metrics Validation
Goal: Run chaos and automatically evaluate Prometheus metrics — getting a clear pass or fail without manual inspection.
This journey is well suited to CI/CD pipelines where you cannot watch the cluster in real time.
What you need
- A running Kubernetes or OpenShift cluster with a Prometheus instance Krkn can query
- A kubeconfig with cluster access
- krknctl installed
Steps
Install krknctl — follow the installation guide.
Create your alerts profile at config/alerts.yaml. This defines the PromQL expressions Krkn evaluates after each scenario:
- expr: avg_over_time(histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))[5m:]) > 0.01
  description: "etcd fsync latency too high: {{$value}}"
  severity: error
- expr: sum(kube_pod_status_phase{phase="Failed"}) > 5
  description: "Too many failed pods: {{$value}}"
  severity: error
Queries with severity: error cause Krkn to exit with a non-zero code. Queries with severity: info are logged only.
Run a scenario with the alerts profile mounted:
krknctl run pod-scenarios --alerts-profile config/alerts.yaml
Krkn evaluates the alert profile at the end of each scenario and reports pass or fail.
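In a CI pipeline, the non-zero exit code is the gate. The sketch below shows the shape of such a gate; run_chaos is a stand-in for the real krknctl invocation, simulated here so the snippet is self-contained:

```shell
# Stand-in for:
#   krknctl run pod-scenarios --alerts-profile config/alerts.yaml
# krknctl exits non-zero when any severity: error query fires;
# we simulate that outcome here.
run_chaos() {
  return 1   # replace this body with the real krknctl command
}

if run_chaos; then
  result="pass"
else
  result="fail"
fi
echo "chaos gate: $result"
```

In a real pipeline you would `exit 1` on failure so the build stops; here we only record the result.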
Reference docs
Next steps
To persist metrics long-term for regression analysis across releases, continue to Long-Term Storage.
3 - Long-Term Storage
Goal: Persist metrics from every chaos run into Elasticsearch so you can compare behavior across releases, dates, or cluster configurations.
This journey enables regression analysis — for example, detecting that API server latency during a node failure has increased between software versions.
What you need
- Everything from Metrics Validation (a cluster, Prometheus, and krknctl)
- An Elasticsearch instance reachable from the machine running krknctl
Steps
Complete Metrics Validation first to confirm Prometheus evaluation is working.
Define your metrics profile at config/metrics.yaml. This controls which Prometheus metrics are snapshotted and stored per run:
- query: irate(apiserver_request_total{verb="POST"}[2m])
  metricName: apiserverRequestRate
- query: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))
  metricName: etcdFsyncLatencyP99
Run a scenario with both profiles mounted:
krknctl run pod-scenarios \
  --metrics-profile config/metrics.yaml \
  --alerts-profile config/alerts.yaml
After each scenario, metrics snapshots are stored alongside run metadata (scenario type, duration, cluster version, exit status).
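As a sketch of the regression analysis this enables, compare one stored metric across two runs and flag a significant increase. The values below are hardcoded stand-ins for numbers you would pull from Elasticsearch, and the 20% tolerance is an arbitrary example threshold:

```shell
# Hypothetical etcdFsyncLatencyP99 snapshots from two stored runs
# (in practice, queried from Elasticsearch by run metadata).
baseline_ms=7    # run against the previous release
current_ms=10    # run against the latest release

# Flag a regression if the current value exceeds the baseline by more than 20%.
limit=$(( baseline_ms * 120 / 100 ))
if [ "$current_ms" -gt "$limit" ]; then
  status="REGRESSION: p99 fsync latency ${baseline_ms}ms -> ${current_ms}ms"
else
  status="OK"
fi
echo "$status"
```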
Deploy krkn-visualize to query your data through pre-built Grafana dashboards for API performance, etcd health, node and pod scenarios, and more:
krknctl visualize \
  --es-url https://elasticsearch.example.com \
  --es-username elastic \
  --es-password <your-password> \
  --prometheus-url https://prometheus.example.com \
  --prometheus-bearer-token <your-token> \
  --grafana-password <grafana-admin-password>
This deploys krkn-visualize to your cluster and wires it to both Elasticsearch and Prometheus. To tear it down later:
krknctl visualize --delete
Reference docs
Next steps
To generate a numerical resilience score on top of your Prometheus data, continue to Resilience Score.
4 - Resilience Score
Goal: Generate a numerical score (0–100%) that represents how well your environment held up during chaos — giving you more signal than a binary pass/fail.
A resilience score lets you track improvement over time, compare environments, and set score thresholds as release gates.
Beta Feature
Resiliency Scoring is currently in Beta. The configuration format and scoring behavior may change in future releases.
What you need
- Everything from Metrics Validation (a cluster, Prometheus, and krknctl)
How scoring works
After a chaos scenario completes, Krkn evaluates a set of SLOs (defined as PromQL expressions) over the chaos time window. Each SLO is weighted by severity:
- Warning SLOs — 1 point each
- Critical SLOs — 3 points each
The final score is (points passed / total possible points) × 100. A score of 95% indicates a robust system with minor degradation; 60% signals significant issues that need investigation even if the scenario technically passed.
When running via krknctl, resiliency scoring runs automatically in controller mode — per-scenario scores are captured and aggregated across all scenarios in the run.
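The arithmetic can be illustrated with made-up SLO counts, assuming the default weights of 3 points per critical SLO and 1 point per warning SLO:

```shell
# Hypothetical profile: 2 critical SLOs, 3 warning SLOs.
crit_total=2;  warn_total=3
crit_passed=2; warn_passed=2   # one warning SLO failed during chaos

total=$(( 3 * crit_total + warn_total ))        # 9 possible points
passed=$(( 3 * crit_passed + warn_passed ))     # 8 points earned
score=$(( (passed * 100 + total / 2) / total )) # rounded integer percent
echo "Resiliency Score: ${score}%"
```

Here a single failed warning SLO costs only 1 of 9 points, so the score stays high; a failed critical SLO would cost 3.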
Steps
Complete Metrics Validation to confirm Prometheus evaluation is working.
Define your SLOs in config/alerts.yaml. Add a severity to each entry — scoring weights them automatically:
- expr: avg_over_time(histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))[5m:]) > 0.01
  description: "etcd fsync latency above 10ms"
  severity: critical # 3 points
- expr: sum(kube_pod_status_phase{phase="Failed"}) > 0
  description: "any pods in Failed phase"
  severity: warning # 1 point
- expr: increase(apiserver_request_total{code=~"5.."}[5m]) > 10
  description: "API server 5xx errors during chaos"
  severity: critical # 3 points
You can also set a custom weight: on any entry to override the severity default — see custom weights and a complete example profile.
Run a scenario with the alerts profile mounted:
krknctl run pod-scenarios --resiliency-file config/alerts.yaml
The resiliency score is printed at the end of the run and written to kraken.report and resiliency-report.json.
Resiliency Score: 87% (13/15 SLOs passed)
Use the score as a gate — in CI, check the exit code and parse the score from the output to enforce a minimum threshold before promoting a build.
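One way to sketch such a gate in shell, parsing the score out of a sample report line (hardcoded here; in CI you would grep it from the captured run output):

```shell
# Sample report line, mirroring the format printed at the end of a run.
line="Resiliency Score: 87% (13/15 SLOs passed)"

# Extract the integer percentage and compare against a minimum threshold.
score=$(echo "$line" | sed 's/.*Score: \([0-9]*\)%.*/\1/')
min=80

if [ "$score" -ge "$min" ]; then
  echo "score ${score}% meets threshold ${min}%"
else
  echo "score ${score}% below threshold ${min}%"
  exit 1
fi
```

The exit 1 on the failing branch is what actually blocks the promotion step in a pipeline.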
Reference docs
Next steps
To orchestrate chaos across multiple clusters from a single control point, continue to Multi-Cluster Orchestration.
5 - Multi-Cluster Orchestration
Goal: Run chaos scenarios across multiple clusters or cloud environments from a single control point — useful for validating a multi-region application or comparing cluster configurations.
The recommended approach for multi-cluster orchestration is krkn-operator: a Kubernetes operator that runs on a dedicated control plane cluster and dispatches chaos scenarios to any number of registered target clusters, without distributing credentials to individual users.
How krkn-operator works
- A control plane cluster runs the operator and its web console
- Target clusters are registered once by an administrator (via kubeconfig, service account token, or username/password)
- Users select one or more targets through the web UI and launch scenarios — they never handle cluster credentials directly
- Scenarios run in parallel across all selected targets
This design preserves the original Krkn architecture (chaos runs from outside the cluster) while adding a secure, centralized orchestration layer.
What you need
- A Kubernetes or OpenShift cluster to host the operator (the control plane cluster)
- Helm 3.0+
- kubeconfig or service account credentials for each target cluster (held by the admin, not shared with users)
Steps
Install krkn-operator on your control plane cluster using Helm:
helm install krkn-operator oci://quay.io/krkn-chaos/charts/krkn-operator \
  --version <VERSION> \
  --namespace krkn-operator-system \
  --create-namespace
For production deployments with HA, external access, and monitoring, see the full installation guide.
Access the web console — for local testing use port-forwarding; for production expose it via Ingress, Gateway API, or OpenShift Route:
kubectl port-forward svc/krkn-operator-console 3000:3000 -n krkn-operator-system
Register target clusters — as an administrator, open Admin Settings → Cluster Targets → Add Target and provide the cluster name and credentials for each cluster you want to target. See Configuration for the three supported auth methods (kubeconfig, service account token, username/password).
Run a scenario across multiple clusters — click Run Scenario, select one or more registered target clusters, choose a scenario, configure its parameters, and launch. The operator executes the scenario on all selected targets concurrently.
Monitor in real time — the home dashboard shows all active runs across all clusters. Click any run to see live log streaming and execution status.
ACM/OCM integration
If your organization uses Red Hat Advanced Cluster Management (ACM) or Open Cluster Management (OCM), install the operator with ACM integration enabled. It will automatically discover and sync all ACM-managed clusters as chaos targets — no manual credential management required:
helm install krkn-operator oci://quay.io/krkn-chaos/charts/krkn-operator \
  --version <VERSION> \
  --namespace krkn-operator-system \
  --create-namespace \
  --set acm.enabled=true
Reference docs