Getting Started

Getting started with Krkn-chaos

1: Metrics Validation
2: Running a Chaos Scenario with Krkn
3: Long-Term Storage
4: Resilience Score
5: Multi-Cluster Orchestration

TL;DR

# 1. Install krknctl
curl -fsSL https://raw.githubusercontent.com/krkn-chaos/krknctl/refs/heads/main/install.sh | bash

# 2. Create a test workload
kubectl create namespace chaos-test
kubectl create deployment nginx-test --image=nginx --replicas=3 -n chaos-test

# 3. Run your first chaos scenario (pod disruption)
krknctl run pod-scenarios --namespace chaos-test --pod-label "app=nginx-test" --disruption-count 1

# 4. Verify pods recovered
kubectl get pods -n chaos-test -l app=nginx-test

What you need

Requirement	Minimum Version	Check Command
Kubernetes or OpenShift cluster	1.21+	`kubectl version`
kubeconfig with cluster-admin access	—	`kubectl get nodes`
Docker or Podman	Docker 20.10+ / Podman 4.0+	`docker --version` or `podman --version`

Basic Run

This is the best starting point if you are new to Krkn or want to explore a specific scenario quickly. No metrics, no scoring, no pipeline — just run a scenario and see what happens.

1. Install krknctl

curl -fsSL https://raw.githubusercontent.com/krkn-chaos/krknctl/refs/heads/main/install.sh | bash

Verify the installation:

krknctl --version

Tip

Enable shell auto-completion for the best experience:

Bash: source <(krknctl completion bash)

Zsh: autoload -Uz compinit && compinit && source <(krknctl completion zsh)

2. Create a test workload

kubectl create namespace chaos-test
kubectl create deployment nginx-test --image=nginx --replicas=3 -n chaos-test
kubectl wait --for=condition=Available deployment/nginx-test -n chaos-test --timeout=60s

3. List available scenarios

krknctl list

This shows all chaos scenarios you can run. For your first test, we will use pod-scenarios.

4. Run a scenario

krknctl run pod-scenarios \
  --namespace chaos-test \
  --pod-label "app=nginx-test" \
  --disruption-count 1 \
  --kill-timeout 180 \
  --expected-recovery-time 120

krknctl will prompt you for required inputs interactively, or you can pass them as flags.

The scenario will:

Find pods matching the label app=nginx-test in the chaos-test namespace
Disrupt 1 pod (delete it)
Wait up to 180 seconds for the pod to be removed
Monitor recovery for up to 120 seconds

5. Observe results

In a separate terminal, watch the pods recover:

kubectl get pods -n chaos-test -l app=nginx-test -w

You can confirm that the pod was killed and replaced by checking the AGE column. The replacement pod is much younger than its neighbours:

NAME                          READY   STATUS    RESTARTS   AGE
nginx-test-7d9f8b6c4-xk2pq    1/1     Running   0          8s
nginx-test-5c6d7f8b9-lm3rt    1/1     Running   0          4d2h
nginx-test-787d4945fb-nqpzj   1/1     Running   0          4d2h

The 8s age shows that the scenario deleted a pod and the Deployment recreated it, while the other pods were unaffected. RESTARTS stays at 0 because this is a new pod, not a restarted container.

What success looks like: The disrupted pod is deleted and Kubernetes recreates it. The new pod reaches Ready state within the --expected-recovery-time window. The scenario exits with code 0.

{
  "recovered": [
    {
      "pod_name": "nginx-test-7d9f8b6c4-xk2pq",
      "namespace": "chaos-test",
      "pod_rescheduling_time": 2.3,
      "pod_readiness_time": 5.7,
      "total_recovery_time": 8.0
    }
  ],
  "unrecovered": []
}

What failure looks like: The pod does not recover within the timeout. The scenario exits with a non-zero code and logs an error.

{
  "recovered": [],
  "unrecovered": [
    {
      "pod_name": "nginx-test-7d9f8b6c4-xk2pq",
      "namespace": "chaos-test",
      "pod_rescheduling_time": 0.0,
      "pod_readiness_time": 0.0,
      "total_recovery_time": 0.0
    }
  ]
}
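
The two reports above can also be checked programmatically, for example in a CI job. The following is a minimal Python sketch, assuming the report has exactly the shape shown in the samples (`recovered` and `unrecovered` lists whose entries carry `pod_name` and `total_recovery_time`). The function name `summarize_recovery` and the `max_recovery_seconds` parameter are hypothetical and are not part of krknctl.

```python
import json


def summarize_recovery(report_json: str, max_recovery_seconds: float = 120.0) -> dict:
    """Summarize a recovery report shaped like the samples above."""
    report = json.loads(report_json)
    recovered = report.get("recovered", [])
    unrecovered = report.get("unrecovered", [])

    # Pods that came back, but more slowly than the allowed window.
    slow = [
        pod["pod_name"]
        for pod in recovered
        if pod["total_recovery_time"] > max_recovery_seconds
    ]

    return {
        "passed": not unrecovered and not slow,
        "unrecovered": [pod["pod_name"] for pod in unrecovered],
        "slow": slow,
    }


# Sample input copied from the success case above.
SUCCESS_REPORT = """
{
  "recovered": [
    {
      "pod_name": "nginx-test-7d9f8b6c4-xk2pq",
      "namespace": "chaos-test",
      "pod_rescheduling_time": 2.3,
      "pod_readiness_time": 5.7,
      "total_recovery_time": 8.0
    }
  ],
  "unrecovered": []
}
"""

print(summarize_recovery(SUCCESS_REPORT))
```

The default of 120 seconds mirrors the --expected-recovery-time value used in step 4; the scenario's own exit code remains the authoritative result.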

6. Clean up

kubectl delete namespace chaos-test
krknctl clean

Where to go next

Whether you’re running your first scenario or building a production resilience pipeline, pick the journey that matches your goals:

Journey	I want to…	Experience level	Tools needed
Metrics Validation	Automatically pass/fail based on Prometheus metrics	Intermediate	krknctl + Prometheus
Resilience Score	Generate a scored report to validate an environment	Intermediate	krknctl + Prometheus
Long-Term Storage	Store metrics across runs for regression analysis	Advanced	krknctl + Prometheus + Elasticsearch
Multi-Cluster Orchestration	Run chaos across multiple clusters or clouds	Advanced	krkn-operator

Alternative Methods

Krkn-hub (Containerized)

Krkn-hub runs scenarios as container images — ideal for CI/CD pipelines. Each scenario is a pre-built image on quay.io/krkn-chaos/krkn-hub.

podman run --net=host \
  -v ~/.kube/config:/home/krkn/.kube/config:Z \
  -e NAMESPACE=default \
  -e POD_LABEL="app=my-app" \
  -d quay.io/krkn-chaos/krkn-hub:pod-scenarios

See the krkn-hub installation guide for full setup instructions.

Note: Krkn-hub runs one scenario type at a time per container.

Krkn (Standalone Python)

Krkn is the core chaos engine — a Python program that can run multiple scenario types in a single execution using config files.

See the krkn installation guide and configuration hints to get started.

Note: Krkn allows running multiple different scenario types and scenario files in one execution, unlike krkn-hub and krknctl.

1 - Metrics Validation

Goal: Run chaos and automatically evaluate Prometheus metrics — getting a clear pass or fail without manual inspection.

This journey is well suited to CI/CD pipelines where you cannot watch the cluster in real time.

What you need

Everything from the Getting Started guide (install krknctl, create a test workload, and run a first scenario)
A Prometheus instance accessible from where Krkn runs (auto-detected on OpenShift; set via scenario flags on Kubernetes) — need to set one up? See installing Prometheus on a kind cluster
krknctl installed

Steps

Install krknctl — follow the installation guide.

Create your alerts profile at config/alerts.yaml. This defines the PromQL expressions Krkn evaluates after each scenario:

- expr: avg_over_time(histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))[5m:]) > 0.01
  description: "etcd fsync latency too high: {{$value}}"
  severity: error

- expr: sum(kube_pod_status_phase{phase="Failed"}) > 5
  description: "Too many failed pods: {{$value}}"
  severity: error

Queries with severity: error cause Krkn to exit with a non-zero code. Queries with severity: info are logged only.

Run a scenario with the alerts profile mounted:
```
krknctl run pod-scenarios --alerts-profile config/alerts.yaml
```
Krkn evaluates the alert profile at the end of each scenario and reports pass or fail.
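
The severity rule described in the steps above (error fails the run, info is only logged) can be pictured with a short Python sketch. This illustrates the rule as stated here and is not Krkn's implementation. The `fired` key is an assumption that stands for "the PromQL expression returned at least one result", and `exit_code_for` is a hypothetical name.

```python
import logging

logger = logging.getLogger("alerts-sketch")


def exit_code_for(alert_results) -> int:
    """Return 1 if any fired alert has severity 'error', else 0."""
    code = 0
    for alert in alert_results:
        if not alert["fired"]:
            continue  # Expression returned no result: nothing to report.
        if alert["severity"] == "error":
            logger.error("alert fired: %s", alert["description"])
            code = 1
        else:
            # Assumption: every non-error severity is treated as log-only.
            logger.info("alert fired: %s", alert["description"])
    return code


# Hypothetical evaluation results for the two alerts in the profile above.
results = [
    {"description": "etcd fsync latency too high", "severity": "error", "fired": False},
    {"description": "Too many failed pods", "severity": "error", "fired": True},
]
print(exit_code_for(results))  # prints 1
```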

Reference docs

SLO Validation — full details on alert profiles and PromQL configuration
krknctl usage — full flag reference for run
Installing Prometheus on a kind cluster — Helm-based setup for local testing

Next steps

To persist metrics long-term for regression analysis across releases, continue to Long-Term Storage.

2 - Running a Chaos Scenario with Krkn

Getting Started Running Chaos Scenarios

Config

Instructions on how to set up the config, along with all supported options, can be found at Config.

In all the examples below, replace <scenario_type> with the scenario plugin type listed in the second column here.

Running a Single Scenario

To run a single scenario, edit the krkn config file so that chaos_scenarios contains exactly one item:

kraken:
    ...
    chaos_scenarios:
        - <scenario_type>:
            - scenarios/<scenario_file>
    ...

Running Multiple Scenarios

To run multiple scenarios, edit the krkn config file and add more entries to chaos_scenarios. To run several scenario files of the same scenario type, list them under that scenario_type. To run different scenario types, add each type as a separate item under chaos_scenarios:

kraken:
    ...
    chaos_scenarios:
        - <scenario_type>:
            - scenarios/<scenario_file_1>
            - scenarios/<scenario_file_2>
        - <scenario_type_2>:
            - scenarios/<scenario_file_3>
            - scenarios/<scenario_file_4>
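
To make the nesting concrete: once parsed, chaos_scenarios is a list of single-key mappings, each mapping a scenario type to a list of scenario files. The Python sketch below walks that structure using a hand-written dictionary in place of the parsed YAML. The scenario type and file names are hypothetical stand-ins for the placeholders above, and `iter_scenarios` is not a Krkn function.

```python
def iter_scenarios(config: dict):
    """Yield (scenario_type, scenario_file) pairs from a parsed krkn config."""
    for entry in config["kraken"]["chaos_scenarios"]:
        # Each list item is a mapping with a single key: the scenario type.
        for scenario_type, files in entry.items():
            for scenario_file in files:
                yield scenario_type, scenario_file


# Hypothetical names standing in for the placeholders in the YAML above.
config = {
    "kraken": {
        "chaos_scenarios": [
            {"scenario_type_1": ["scenarios/file_1.yaml", "scenarios/file_2.yaml"]},
            {"scenario_type_2": ["scenarios/file_3.yaml", "scenarios/file_4.yaml"]},
        ]
    }
}

for scenario_type, scenario_file in iter_scenarios(config):
    print(f"{scenario_type}: {scenario_file}")
```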

Creating a Scenario File

You can either copy an existing scenario yaml file and make it your own, or fill in one of the templates below to suit your needs.

Common Scenario Edits

If you just want to make small changes to pre-existing scenarios, feel free to edit the scenario file itself.

Example of Quick Pod Scenario Edit:

If you want to kill 2 pods instead of 1 in any of the pre-existing scenarios, you can either edit the iterations number in the config or change the kill count in the scenario file:

- id: kill-pods
  config:
    namespace_pattern: ^kube-system$
    name_pattern: .*
    kill: 2  # changed from 1
    krkn_pod_recovery_time: 120

Example of Quick Nodes Scenario Edit:

If your cluster is built on GCP instead of AWS, just change the cloud type in the node_scenarios_example.yml file.

node_scenarios:
  - actions:
    - node_reboot_scenario
    node_name:
    label_selector: node-role.kubernetes.io/worker
    instance_count: 1
    timeout: 120
    cloud_type: gcp  # changed from aws
    parallel: true
    kube_check: true

Templates

Pod Scenario Yaml Template

To add a pod-level scenario for a new application, use the sample below to see which fields are required and what goes in each:

# yaml-language-server: $schema=../plugin.schema.json
- id: kill-pods
  config:
    namespace_pattern: ^<namespace>$
    label_selector: <pod label>
    kill: <number of pods to kill>
    krkn_pod_recovery_time: <expected time for the pod to become ready>

Node Scenario Yaml Template

node_scenarios:
  - actions:  # Node chaos scenarios to be injected.
    - <chaos scenario>
    - <chaos scenario>
    node_name: <node name>  # Can be left blank.
    label_selector: <node label>
    instance_count: <number of nodes on which to perform action>
    timeout: <duration to wait for completion>
    cloud_type: <cloud provider>

Time Chaos Scenario Template

time_scenarios:
  - action: <skew_time or skew_date>
    object_type: <pod or node>
    label_selector: <label of pod or node>

RBAC

Depending on the type of chaos test, some scenarios require elevated privileges. The RBAC authorization needed for each Krkn scenario is described in detail at the following link: Krkn RBAC

3 - Long-Term Storage

Goal: Persist metrics from every chaos run into Elasticsearch so you can compare behavior across releases, dates, or cluster configurations.

This journey enables regression analysis — for example, detecting that API server latency during a node failure has increased between software versions.

What you need

Everything from Metrics Validation
An Elasticsearch instance (self-hosted or managed) — need to set one up? See installing Elasticsearch on a kind cluster
krknctl installed

Steps

Complete Metrics Validation first to confirm Prometheus evaluation is working.

Define your metrics profile at config/metrics.yaml. This controls which Prometheus metrics are snapshotted and stored per run:

- query: irate(apiserver_request_total{verb="POST"}[2m])
  metricName: apiserverRequestRate

- query: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))
  metricName: etcdFsyncLatencyP99

Run a scenario with both profiles mounted:
```
krknctl run pod-scenarios \
  --metrics-profile config/metrics.yaml \
  --alerts-profile config/alerts.yaml
```
After each scenario, metrics snapshots are stored alongside run metadata (scenario type, duration, cluster version, exit status).

Deploy krkn-visualize to query your data through pre-built Grafana dashboards for API performance, etcd health, node and pod scenarios, and more:

krknctl visualize \
  --es-url https://elasticsearch.example.com \
  --es-username elastic \
  --es-password <your-password> \
  --prometheus-url https://prometheus.example.com \
  --prometheus-bearer-token <your-token> \
  --grafana-password <grafana-admin-password>

This deploys krkn-visualize to your cluster and wires it to both Elasticsearch and Prometheus. To tear it down later:

krknctl visualize --delete

Reference docs

krknctl usage — full flag reference for run and visualize
Performance Dashboards — krkn-visualize dashboards and manual deploy script
Telemetry — understanding the data Krkn captures and stores per run
Installing Elasticsearch on a kind cluster — Helm-based setup for local testing

Next steps

To generate a numerical resilience score on top of your Prometheus data, continue to Resilience Score.

4 - Resilience Score

Goal: Generate a numerical score (0–100%) that represents how well your environment held up during chaos — giving you more signal than a binary pass/fail.

A resilience score lets you track improvement over time, compare environments, and set score thresholds as release gates.

Beta Feature

Resiliency Scoring is currently in Beta. The configuration format and scoring behavior may change in future releases.

What you need

Everything from Metrics Validation
krknctl installed

How scoring works

After a chaos scenario completes, Krkn evaluates a set of SLOs (defined as PromQL expressions) over the chaos time window. Each SLO is weighted by severity:

Warning SLOs — 1 point each
Critical SLOs — 3 points each

The final score is (points passed / total possible points) × 100. A score of 95% indicates a robust system with minor degradation; 60% signals significant issues that need investigation even if the scenario technically passed.

When running via krknctl, resiliency scoring runs automatically in controller mode — per-scenario scores are captured and aggregated across all scenarios in the run.
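
The weighting described above can be worked through in a few lines of Python. This is a sketch of the formula as stated in this section, not Krkn's scoring code. It ignores custom weights, and `resilience_score` is a hypothetical name.

```python
# Points per severity, as listed above.
SEVERITY_POINTS = {"warning": 1, "critical": 3}


def resilience_score(slo_results) -> float:
    """Compute (points passed / total possible points) * 100.

    slo_results is an iterable of (severity, passed) pairs.
    """
    results = list(slo_results)
    total = sum(SEVERITY_POINTS[severity] for severity, _ in results)
    if total == 0:
        raise ValueError("at least one SLO is required")
    earned = sum(SEVERITY_POINTS[severity] for severity, passed in results if passed)
    return 100.0 * earned / total


# Two critical SLOs pass and one warning SLO fails: 6 of 7 points.
results = [("critical", True), ("warning", False), ("critical", True)]
print(f"{resilience_score(results):.1f}%")  # prints 85.7%
```

A failed critical SLO costs three times as much as a failed warning SLO, so the same number of failures can produce quite different scores.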

Steps

Complete Metrics Validation to confirm Prometheus evaluation is working.

Define your SLOs in config/alerts.yaml. Add a severity to each entry — scoring weights them automatically:

- expr: avg_over_time(histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))[5m:]) > 0.01
  description: "etcd fsync latency above 10ms"
  severity: critical        # 3 points

- expr: sum(kube_pod_status_phase{phase="Failed"}) > 0
  description: "any pods in Failed phase"
  severity: warning         # 1 point

- expr: increase(apiserver_request_total{code=~"5.."}[5m]) > 10
  description: "API server 5xx errors during chaos"
  severity: critical        # 3 points

You can also set a custom weight: on any entry to override the severity default — see custom weights and a complete example profile.

Run a scenario with the alerts profile mounted:
```
krknctl run pod-scenarios --resiliency-file config/alerts.yaml
```
The resiliency score is printed at the end of the run and written to kraken.report and resiliency-report.json.
```
Resiliency Score: 87% (13/15 SLOs passed)
```
Use the score as a gate — in CI, check the exit code and parse the score from the output to enforce a minimum threshold before promoting a build.
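
One way to implement such a gate is to parse the score line from the captured output. The Python sketch below assumes the line looks exactly like the sample above ("Resiliency Score: 87% ..."). The function names and the threshold values are hypothetical, and the output format may change while the feature is in Beta.

```python
import re

# Matches the percentage in a line such as "Resiliency Score: 87% (...)".
SCORE_PATTERN = re.compile(r"Resiliency Score:\s*(\d+(?:\.\d+)?)%")


def extract_score(output: str):
    """Return the score as a float, or None if no score line is found."""
    match = SCORE_PATTERN.search(output)
    return float(match.group(1)) if match else None


def gate_exit_code(output: str, minimum_score: float) -> int:
    """Return 0 when the score meets the threshold, 1 otherwise."""
    score = extract_score(output)
    if score is None:
        return 1  # Treat a missing score as a failure rather than a pass.
    return 0 if score >= minimum_score else 1


sample_output = "Resiliency Score: 87% (13/15 SLOs passed)"
print(extract_score(sample_output))         # prints 87.0
print(gate_exit_code(sample_output, 80.0))  # prints 0
print(gate_exit_code(sample_output, 90.0))  # prints 1
```

In a pipeline, the value returned by `gate_exit_code` would be passed to `sys.exit` after reading the captured krknctl output. Parsing resiliency-report.json instead may be more robust, but its schema is not shown here.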

Reference docs

Resiliency Scoring — full algorithm, custom weights, and configuration reference
SLO Validation — PromQL alert configuration that feeds into scoring
krknctl usage — full flag reference for run

Next steps

To orchestrate chaos across multiple clusters from a single control point, continue to Multi-Cluster Orchestration.

5 - Multi-Cluster Orchestration

Goal: Run chaos scenarios across multiple clusters or cloud environments from a single control point — useful for validating a multi-region application or comparing cluster configurations.

The recommended approach for multi-cluster orchestration is krkn-operator: a Kubernetes operator that runs on a dedicated control plane cluster and dispatches chaos scenarios to any number of registered target clusters, without distributing credentials to individual users.

How krkn-operator works

A control plane cluster runs the operator and its web console
Target clusters are registered once by an administrator (via kubeconfig, service account token, or username/password)
Users select one or more targets through the web UI and launch scenarios — they never handle cluster credentials directly
Scenarios run in parallel across all selected targets

This design preserves the original Krkn architecture (chaos runs from outside the cluster) while adding a secure, centralized orchestration layer.

What you need

A Kubernetes or OpenShift cluster to host the operator (the control plane cluster)
Helm 3.0+
kubeconfig or service account credentials for each target cluster (held by the admin, not shared with users)

Steps

Install krkn-operator on your control plane cluster using Helm:

helm install krkn-operator oci://quay.io/krkn-chaos/charts/krkn-operator \
  --version <VERSION> \
  --namespace krkn-operator-system \
  --create-namespace

For production deployments with HA, external access, and monitoring, see the full installation guide.

Access the web console — for local testing use port-forwarding; for production expose it via Ingress, Gateway API, or OpenShift Route:
```
kubectl port-forward svc/krkn-operator-console 3000:3000 -n krkn-operator-system
```
Register target clusters — as an administrator, open Admin Settings → Cluster Targets → Add Target and provide the cluster name and credentials for each cluster you want to target. See Cluster Management for details.
Run a scenario across multiple clusters — click Run Scenario, select one or more registered target clusters, choose a scenario, configure its parameters, and launch. The operator executes the scenario on all selected targets concurrently.
Monitor in real time — the home dashboard shows all active runs across all clusters. Click any run to see live log streaming and execution status.

ACM/OCM integration

If your organization uses Red Hat Advanced Cluster Management (ACM) or Open Cluster Management (OCM), install the operator with ACM integration enabled. It will automatically discover and sync all ACM-managed clusters as chaos targets — no manual credential management required:

helm install krkn-operator oci://quay.io/krkn-chaos/charts/krkn-operator \
  --version <VERSION> \
  --namespace krkn-operator-system \
  --create-namespace \
  --set acm.enabled=true

Reference docs

krkn-operator overview — architecture and security model
Installation — Helm values for Kubernetes, OpenShift, and ACM
Administration — managing clusters, users, registries and providers
Usage — running and monitoring scenarios via the web console
