Choose Your Path

Find the right Krkn setup for your goals — from a quick first run to multi-cluster resilience pipelines.

Whether you’re running your first chaos scenario or building a production resilience pipeline, there’s a path here for you. Pick the journey that matches your experience level and goals — each one builds on the previous, so you can start simple and add complexity when you’re ready.

New to Krkn? Start with Basic Run — no configuration files, no metrics, just run a scenario and see what happens.

Familiar with the basics? Add automatic pass/fail evaluation with Metrics Validation, then layer in a Resilience Score to get a percentage-based view of how your system held up.

Running chaos regularly? Move to Long-Term Storage to persist metrics across runs and spot regressions between releases.

Operating at scale? Use Multi-Cluster Orchestration to drive chaos across multiple clusters or cloud environments from a single control point.

Journey                       I want to…                                            Experience level   Tools needed
Basic Run                     Inject chaos and observe results manually             Beginner           krknctl
Metrics Validation            Automatically pass/fail based on Prometheus metrics   Intermediate       krknctl + Prometheus
Resilience Score              Generate a scored report to validate an environment   Intermediate       krknctl + Prometheus
Long-Term Storage             Store metrics across runs for regression analysis     Advanced           krknctl + Prometheus + Elasticsearch
Multi-Cluster Orchestration   Run chaos across multiple clusters or clouds          Advanced           krkn-operator

1 - Basic Run

Goal: Run a chaos scenario and observe what happens — no metrics, no scoring, no pipeline.

This is the best starting point if you are new to Krkn or want to explore a specific scenario quickly.

What you need

  • A running Kubernetes or OpenShift cluster
  • A kubeconfig with cluster access
  • krknctl (recommended) or krkn installed

Steps

  1. Install krknctl — follow the installation guide.

  2. List available scenarios to find one that fits your target:

    krknctl list
    
  3. Run a scenario — for example, to kill pods matching a label:

    krknctl run pod-scenarios
    

    krknctl will prompt you for required inputs interactively, or you can pass them as flags.

  4. Observe results in your cluster using kubectl or your existing monitoring tools. Krkn logs pass/fail and recovery status to stdout.

    For pod scenarios specifically, you can confirm the pod was killed and recovered by checking its age. A restarted pod will show a much shorter uptime than its neighbours:

    kubectl get pods -A
    

    Example output after a pod scenario targeting the my-app namespace:

    NAMESPACE     NAME                           READY   STATUS    RESTARTS   AGE
    my-app        frontend-7d9f8b6c4-xk2pq       1/1     Running   0          8s
    my-app        backend-5c6d7f8b9-lm3rt        1/1     Running   0          4d2h
    kube-system   coredns-787d4945fb-nqpzj       1/1     Running   0          4d2h
    

    The 8s age on frontend shows it was recently restarted by the scenario while all other pods remain unaffected.
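If you save this listing, a quick filter can surface recently restarted pods for you. The helper below is hypothetical (it is not part of krknctl); it simply matches AGE values reported in bare seconds:

```shell
# Sample listing from above, pasted inline; in practice, capture it with:
#   pods=$(kubectl get pods -A)
pods='NAMESPACE     NAME                          READY   STATUS    RESTARTS   AGE
my-app        frontend-7d9f8b6c4-xk2pq      1/1     Running   0          8s
my-app        backend-5c6d7f8b9-lm3rt       1/1     Running   0          4d2h
kube-system   coredns-787d4945fb-nqpzj      1/1     Running   0          4d2h'

# Column 6 is AGE; an age in bare seconds means a very recent restart.
recent=$(echo "$pods" | awk 'NR > 1 && $6 ~ /^[0-9]+s$/ {print $2}')
echo "recently restarted: $recent"
```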

Next steps

To evaluate results automatically instead of inspecting them by hand, continue to Metrics Validation.

2 - Metrics Validation

Goal: Run chaos and automatically evaluate Prometheus metrics — getting a clear pass or fail without manual inspection.

This journey is well suited to CI/CD pipelines where you cannot watch the cluster in real time.

What you need

  • Everything from Basic Run (a running cluster, a kubeconfig, and krknctl)
  • A Prometheus instance scraping the target cluster

Steps

  1. Install krknctl — follow the installation guide.

  2. Create your alerts profile at config/alerts.yaml. This defines the PromQL expressions Krkn evaluates after each scenario:

    - expr: avg_over_time(histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))[5m:]) > 0.01
      description: "etcd fsync latency too high: {{$value}}"
      severity: error
    
    - expr: sum(kube_pod_status_phase{phase="Failed"}) > 5
      description: "Too many failed pods: {{$value}}"
      severity: error
    

    Queries with severity: error cause Krkn to exit with a non-zero code. Queries with severity: info are logged only.

  3. Run a scenario with the alerts profile mounted:

    krknctl run pod-scenarios --alerts-profile config/alerts.yaml
    

    Krkn evaluates the alert profile at the end of each scenario and reports pass or fail.
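In a CI pipeline, the exit code alone can gate a stage. The sketch below is self-contained: the `run_chaos` stub stands in for the real `krknctl run pod-scenarios --alerts-profile config/alerts.yaml` call, and its non-zero return simulates a severity: error alert firing:

```shell
# Stub standing in for the krknctl invocation; returns 1 as if an
# error-severity alert had fired during the run.
run_chaos() { return 1; }

if run_chaos; then
  result=pass
else
  result=fail
fi
echo "chaos gate: $result"
```

In a real pipeline you would replace the stub with the actual krknctl command and let a non-zero exit fail the job.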

Next steps

To persist metrics long-term for regression analysis across releases, continue to Long-Term Storage.

3 - Long-Term Storage

Goal: Persist metrics from every chaos run into Elasticsearch so you can compare behavior across releases, dates, or cluster configurations.

This journey enables regression analysis — for example, detecting that API server latency during a node failure has increased between software versions.
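As an illustration of what that comparison can look like once snapshots are stored, the sketch below checks one metric across two runs. The metric name, the values, and the 25% tolerance are all hypothetical, not Krkn's actual stored schema:

```shell
# etcdFsyncLatencyP99 snapshots from two runs (illustrative values).
baseline=0.006   # previous release
current=0.011    # this release

# Flag the metric if it grew more than 25% over the baseline.
regressed=$(awk -v b="$baseline" -v c="$current" \
  'BEGIN { print (c > b * 1.25) ? "yes" : "no" }')
echo "etcd fsync latency regressed: $regressed"
```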

What you need

  • Everything from Metrics Validation (krknctl and Prometheus)
  • An Elasticsearch instance reachable from where Krkn runs

Steps

  1. Complete Metrics Validation first to confirm Prometheus evaluation is working.

  2. Define your metrics profile at config/metrics.yaml. This controls which Prometheus metrics are snapshotted and stored per run:

    - query: irate(apiserver_request_total{verb="POST"}[2m])
      metricName: apiserverRequestRate
    
    - query: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))
      metricName: etcdFsyncLatencyP99
    
  3. Run a scenario with both profiles mounted:

    krknctl run pod-scenarios \
      --metrics-profile config/metrics.yaml \
      --alerts-profile config/alerts.yaml
    

    After each scenario, metrics snapshots are stored alongside run metadata (scenario type, duration, cluster version, exit status).

  4. Deploy krkn-visualize to query your data through pre-built Grafana dashboards for API performance, etcd health, node and pod scenarios, and more:

    krknctl visualize \
      --es-url https://elasticsearch.example.com \
      --es-username elastic \
      --es-password <your-password> \
      --prometheus-url https://prometheus.example.com \
      --prometheus-bearer-token <your-token> \
      --grafana-password <grafana-admin-password>
    

    This deploys krkn-visualize to your cluster and wires it to both Elasticsearch and Prometheus. To tear it down later:

    krknctl visualize --delete
    

Next steps

To generate a numerical resilience score on top of your Prometheus data, continue to Resilience Score.

4 - Resilience Score

Goal: Generate a numerical score (0–100%) that represents how well your environment held up during chaos — giving you more signal than a binary pass/fail.

A resilience score lets you track improvement over time, compare environments, and set score thresholds as release gates.

What you need

  • krknctl installed
  • A Prometheus instance scraping the target cluster (see Metrics Validation)

How scoring works

After a chaos scenario completes, Krkn evaluates a set of SLOs (defined as PromQL expressions) over the chaos time window. Each SLO is weighted by severity:

  • Warning SLOs — 1 point each
  • Critical SLOs — 3 points each

The final score is (points passed / total possible points) × 100. A score of 95% indicates a robust system with minor degradation; 60% signals significant issues that need investigation even if the scenario technically passed.
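The arithmetic can be sketched directly. The counts below are hypothetical: four critical SLOs of which three pass, plus three warning SLOs that all pass:

```shell
# Critical SLOs are worth 3 points each, warnings 1 point each.
total_points=$(( 4 * 3 + 3 * 1 ))    # 15 possible
passed_points=$(( 3 * 3 + 3 * 1 ))   # 12 earned

score=$(awk -v p="$passed_points" -v t="$total_points" \
  'BEGIN { printf "%.0f", p / t * 100 }')
echo "Resiliency Score: ${score}%"
```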

When running via krknctl, resiliency scoring runs automatically in controller mode — per-scenario scores are captured and aggregated across all scenarios in the run.

Steps

  1. Complete Metrics Validation to confirm Prometheus evaluation is working.

  2. Define your SLOs in config/alerts.yaml. Add a severity to each entry — scoring weights them automatically:

    - expr: avg_over_time(histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))[5m:]) > 0.01
      description: "etcd fsync latency above 10ms"
      severity: critical        # 3 points
    
    - expr: sum(kube_pod_status_phase{phase="Failed"}) > 0
      description: "any pods in Failed phase"
      severity: warning         # 1 point
    
    - expr: increase(apiserver_request_total{code=~"5.."}[5m]) > 10
      description: "API server 5xx errors during chaos"
      severity: critical        # 3 points
    

    You can also set a custom weight: on any entry to override the severity default — see custom weights and a complete example profile.

  3. Run a scenario with the alerts profile mounted:

    krknctl run pod-scenarios --resiliency-file config/alerts.yaml
    

    The resiliency score is printed at the end of the run and written to kraken.report and resiliency-report.json.

    Resiliency Score: 87% (13/15 SLOs passed)
    
  4. Use the score as a gate — in CI, check the exit code and parse the score from the output to enforce a minimum threshold before promoting a build.
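A minimal gate can pull the score out of the saved run output and compare it against a threshold. The line format matches the example output above; the threshold value and the idea of grepping a saved log are assumptions:

```shell
THRESHOLD=80
# In a real pipeline this would come from the saved output, e.g.:
#   line=$(grep 'Resiliency Score' run.log)
line="Resiliency Score: 87% (13/15 SLOs passed)"

score=$(echo "$line" | sed -E 's/.*Score: ([0-9]+)%.*/\1/')
if [ "$score" -lt "$THRESHOLD" ]; then
  echo "resilience gate failed: ${score}% < ${THRESHOLD}%"
  exit 1
fi
echo "resilience gate passed: ${score}%"
```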

Next steps

To orchestrate chaos across multiple clusters from a single control point, continue to Multi-Cluster Orchestration.

5 - Multi-Cluster Orchestration

Goal: Run chaos scenarios across multiple clusters or cloud environments from a single control point — useful for validating a multi-region application or comparing cluster configurations.

The recommended approach for multi-cluster orchestration is krkn-operator: a Kubernetes operator that runs on a dedicated control plane cluster and dispatches chaos scenarios to any number of registered target clusters, without distributing credentials to individual users.

How krkn-operator works

  • A control plane cluster runs the operator and its web console
  • Target clusters are registered once by an administrator (via kubeconfig, service account token, or username/password)
  • Users select one or more targets through the web UI and launch scenarios — they never handle cluster credentials directly
  • Scenarios run in parallel across all selected targets

This design preserves the original Krkn architecture (chaos runs from outside the cluster) while adding a secure, centralized orchestration layer.

What you need

  • A Kubernetes or OpenShift cluster to host the operator (the control plane cluster)
  • Helm 3.0+
  • kubeconfig or service account credentials for each target cluster (held by the admin, not shared with users)

Steps

  1. Install krkn-operator on your control plane cluster using Helm:

    helm install krkn-operator oci://quay.io/krkn-chaos/charts/krkn-operator \
      --version <VERSION> \
      --namespace krkn-operator-system \
      --create-namespace
    

    For production deployments with HA, external access, and monitoring, see the full installation guide.

  2. Access the web console — for local testing use port-forwarding; for production expose it via Ingress, Gateway API, or OpenShift Route:

    kubectl port-forward svc/krkn-operator-console 3000:3000 -n krkn-operator-system
    
  3. Register target clusters — as an administrator, open Admin Settings → Cluster Targets → Add Target and provide the cluster name and credentials for each cluster you want to target. See Configuration for the three supported auth methods (kubeconfig, service account token, username/password).

  4. Run a scenario across multiple clusters — click Run Scenario, select one or more registered target clusters, choose a scenario, configure its parameters, and launch. The operator executes the scenario on all selected targets concurrently.

  5. Monitor in real time — the home dashboard shows all active runs across all clusters. Click any run to see live log streaming and execution status.

ACM/OCM integration

If your organization uses Red Hat Advanced Cluster Management (ACM) or Open Cluster Management (OCM), install the operator with ACM integration enabled. It will automatically discover and sync all ACM-managed clusters as chaos targets — no manual credential management required:

helm install krkn-operator oci://quay.io/krkn-chaos/charts/krkn-operator \
  --version <VERSION> \
  --namespace krkn-operator-system \
  --create-namespace \
  --set acm.enabled=true
