Getting Started

Getting started with Krkn-chaos

TL;DR

# 1. Install krknctl
curl -fsSL https://raw.githubusercontent.com/krkn-chaos/krknctl/refs/heads/main/install.sh | bash

# 2. Create a test workload
kubectl create namespace chaos-test
kubectl create deployment nginx-test --image=nginx --replicas=3 -n chaos-test

# 3. Run your first chaos scenario (pod disruption)
krknctl run pod-scenarios --namespace chaos-test --pod-label "app=nginx-test" --disruption-count 1

# 4. Verify pods recovered
kubectl get pods -n chaos-test -l app=nginx-test

What you need

| Requirement | Minimum Version | Check Command |
|---|---|---|
| Kubernetes or OpenShift cluster | 1.21+ | kubectl version |
| kubeconfig with cluster-admin access | | kubectl get nodes |
| Docker or Podman | Docker 20.10+ / Podman 4.0+ | docker --version or podman --version |

Basic Run

This is the best starting point if you are new to Krkn or want to explore a specific scenario quickly. No metrics, no scoring, no pipeline — just run a scenario and see what happens.

1. Install krknctl

curl -fsSL https://raw.githubusercontent.com/krkn-chaos/krknctl/refs/heads/main/install.sh | bash

Verify the installation:

krknctl --version

2. Create a test workload

kubectl create namespace chaos-test
kubectl create deployment nginx-test --image=nginx --replicas=3 -n chaos-test
kubectl wait --for=condition=Available deployment/nginx-test -n chaos-test --timeout=60s

3. List available scenarios

krknctl list

This shows all chaos scenarios you can run. For your first test, we will use pod-scenarios.

4. Run a scenario

krknctl run pod-scenarios \
  --namespace chaos-test \
  --pod-label "app=nginx-test" \
  --disruption-count 1 \
  --kill-timeout 180 \
  --expected-recovery-time 120

krknctl will prompt you for required inputs interactively, or you can pass them as flags.

The scenario will:

  1. Find pods matching the label app=nginx-test in the chaos-test namespace
  2. Disrupt 1 pod (delete it)
  3. Wait up to 180 seconds for the pod to be removed
  4. Monitor recovery for up to 120 seconds
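
Both timeouts in the steps above follow a poll-until-deadline pattern. The sketch below illustrates that pattern in Python. It is not krknctl's implementation: `wait_until`, its parameters, and the injectable `clock` and `sleep` arguments are hypothetical names chosen for illustration.

```python
import time


def wait_until(predicate, timeout_s, interval_s=1.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll `predicate` until it returns True or `timeout_s` seconds elapse.

    Returns True if the condition was met in time, False otherwise.
    `clock` and `sleep` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Under this shape, `--kill-timeout 180` would correspond to something like `wait_until(pod_is_gone, 180)` and `--expected-recovery-time 120` to `wait_until(pod_is_ready, 120)`, where both predicates are placeholders you would supply.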

5. Observe results

In a separate terminal, watch the pods recover:

kubectl get pods -n chaos-test -l app=nginx-test -w

You can confirm the pod was killed and recreated by checking its age. The replacement pod shows a much younger age than its neighbours:

NAMESPACE     NAME                          READY   STATUS    RESTARTS   AGE
chaos-test    nginx-test-7d9f8b6c4-xk2pq   1/1     Running   0          8s
chaos-test    nginx-test-5c6d7f8b9-lm3rt   1/1     Running   0          4d2h
chaos-test    nginx-test-787d4945fb-nqpzj   1/1     Running   0          4d2h

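The 8s age shows that this pod was recently created to replace the one the scenario deleted, while the others were unaffected. RESTARTS stays at 0 because this is a new pod rather than a restarted container.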

What success looks like: The disrupted pod is deleted and Kubernetes recreates it. The new pod reaches Ready state within the --expected-recovery-time window. The scenario exits with code 0.

{
  "recovered": [
    {
      "pod_name": "nginx-test-7d9f8b6c4-xk2pq",
      "namespace": "chaos-test",
      "pod_rescheduling_time": 2.3,
      "pod_readiness_time": 5.7,
      "total_recovery_time": 8.0
    }
  ],
  "unrecovered": []
}

What failure looks like: The pod does not recover within the timeout. The scenario exits with a non-zero code and logs an error.

{
  "recovered": [],
  "unrecovered": [
    {
      "pod_name": "nginx-test-7d9f8b6c4-xk2pq",
      "namespace": "chaos-test",
      "pod_rescheduling_time": 0.0,
      "pod_readiness_time": 0.0,
      "total_recovery_time": 0.0
    }
  ]
}
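
If you capture a summary shaped like the two samples above, a few lines of Python can turn it into a pass/fail decision. This is a minimal sketch that assumes only the field names shown in those samples. Where the JSON comes from (a log line, a file) depends on your setup, and `summarize_recovery` is a hypothetical helper name.

```python
import json


def summarize_recovery(summary_text):
    """Return (passed, slowest_recovery_seconds) for a recovery summary.

    `passed` is True only when at least one pod recovered and none are
    listed as unrecovered. `slowest_recovery_seconds` is None when no pod
    recovered.
    """
    summary = json.loads(summary_text)
    recovered = summary.get("recovered", [])
    unrecovered = summary.get("unrecovered", [])
    passed = len(recovered) > 0 and len(unrecovered) == 0
    slowest = max((pod["total_recovery_time"] for pod in recovered), default=None)
    return passed, slowest
```

Comparing the slowest recovery time against your own target is one way to track a stricter threshold than the scenario's timeout.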

6. Clean up

kubectl delete namespace chaos-test
krknctl clean

Where to go next

Whether you’re running your first scenario or building a production resilience pipeline, pick the journey that matches your goals:

| Journey | I want to… | Experience level | Tools needed |
|---|---|---|---|
| Metrics Validation | Automatically pass/fail based on Prometheus metrics | Intermediate | krknctl + Prometheus |
| Resilience Score | Generate a scored report to validate an environment | Intermediate | krknctl + Prometheus |
| Long-Term Storage | Store metrics across runs for regression analysis | Advanced | krknctl + Prometheus + Elasticsearch |
| Multi-Cluster Orchestration | Run chaos across multiple clusters or clouds | Advanced | krkn-operator |

Alternative Methods

Krkn-hub (Containerized)

Krkn-hub runs scenarios as container images — ideal for CI/CD pipelines. Each scenario is a pre-built image on quay.io/krkn-chaos/krkn-hub.

podman run --net=host \
  -v ~/.kube/config:/home/krkn/.kube/config:Z \
  -e NAMESPACE=default \
  -e POD_LABEL="app=my-app" \
  -d quay.io/krkn-chaos/krkn-hub:pod-scenarios

See the krkn-hub installation guide for full setup instructions.

Note: Krkn-hub runs one scenario type at a time per container.

Krkn (Standalone Python)

Krkn is the core chaos engine — a Python program that can run multiple scenario types in a single execution using config files.

See the krkn installation guide and configuration hints to get started.

Note: Krkn allows running multiple different scenario types and scenario files in one execution, unlike krkn-hub and krknctl.


Further Reading

1 - Metrics Validation

Run chaos and automatically evaluate Prometheus metrics for a clear pass or fail without manual inspection.

Goal: Run chaos and automatically evaluate Prometheus metrics — getting a clear pass or fail without manual inspection.

This journey is well suited to CI/CD pipelines where you cannot watch the cluster in real time.

What you need

Steps

  1. Install krknctl — follow the installation guide.

  2. Create your alerts profile at config/alerts.yaml. This defines the PromQL expressions Krkn evaluates after each scenario:

    - expr: avg_over_time(histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))[5m:]) > 0.01
      description: "etcd fsync latency too high: {{$value}}"
      severity: error
    
    - expr: sum(kube_pod_status_phase{phase="Failed"}) > 5
      description: "Too many failed pods: {{$value}}"
      severity: error
    

    Queries with severity: error cause Krkn to exit with a non-zero code. Queries with severity: info are logged only.

  3. Run a scenario with the alerts profile mounted:

    krknctl run pod-scenarios --alerts-profile config/alerts.yaml
    

    Krkn evaluates the alert profile at the end of each scenario and reports pass or fail.
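
The severity rule described in step 2 can be pictured as a small decision function. The sketch below is illustrative only and is not Krkn's code: the `fired` field, the `exit_code` helper, and the choice of `1` as the non-zero code are assumptions made for the example.

```python
def exit_code(evaluated_alerts):
    """Map evaluated alerts to a process exit code.

    `evaluated_alerts` is a list of dicts with the keys 'severity',
    'description', and 'fired' (True when the PromQL expression matched).
    Every fired alert is logged; only fired 'error' alerts fail the run.
    """
    failed = False
    for alert in evaluated_alerts:
        if not alert["fired"]:
            continue
        print(f"[{alert['severity']}] {alert['description']}")
        if alert["severity"] == "error":
            failed = True
    return 1 if failed else 0
```

In a CI job, the practical consequence is that `severity: error` entries act as gates while `severity: info` entries only add context to the log.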

Reference docs

Next steps

To persist metrics long-term for regression analysis across releases, continue to Long-Term Storage.

2 - Running a Chaos Scenario with Krkn

Getting Started Running Chaos Scenarios

Config

Instructions for setting up the config, along with all supported options, can be found at Config.

In all the examples below, replace scenario_type with the scenario plugin type, which is listed in the second column here.

Running a Single Scenario

To run a single scenario, edit the krkn config file so that the chaos_scenarios list contains only one item:

kraken:
    ...
    chaos_scenarios:
        - <scenario_type>:
            - scenarios/<scenario_file>
    ...

Running Multiple Scenarios

To run multiple scenarios, edit the krkn config file and add multiple entries to chaos_scenarios. To run several scenario files of the same scenario type, list them under that scenario_type. To run different scenario types, add each type as its own entry under chaos_scenarios:

kraken:
    ...
    chaos_scenarios:
        - <scenario_type>:
            - scenarios/<scenario_file_1>
            - scenarios/<scenario_file_2>
        - <scenario_type_2>:
            - scenarios/<scenario_file_3>
            - scenarios/<scenario_file_4>
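
The nesting above is a list of single-key maps, each mapping a scenario type to a list of files. The Python sketch below mirrors that structure as a plain dict, keeping the same placeholders as the YAML, and flattens it into (type, file) pairs. `planned_runs` is a hypothetical helper for illustration, not part of Krkn.

```python
def planned_runs(krkn_config):
    """Return (scenario_type, scenario_file) pairs in the order they are listed."""
    runs = []
    for entry in krkn_config["kraken"]["chaos_scenarios"]:
        for scenario_type, files in entry.items():
            for scenario_file in files:
                runs.append((scenario_type, scenario_file))
    return runs


# Same shape as the YAML above, with the placeholders left in place.
config = {
    "kraken": {
        "chaos_scenarios": [
            {"<scenario_type>": ["scenarios/<scenario_file_1>",
                                 "scenarios/<scenario_file_2>"]},
            {"<scenario_type_2>": ["scenarios/<scenario_file_3>",
                                   "scenarios/<scenario_file_4>"]},
        ]
    }
}

for scenario_type, scenario_file in planned_runs(config):
    print(scenario_type, scenario_file)
```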

Creating a Scenario File

You can either copy an existing scenario yaml file and make it your own, or fill in one of the templates below to suit your needs.

Common Scenario Edits

If you just want to make small changes to pre-existing scenarios, feel free to edit the scenario file itself.

Example of Quick Pod Scenario Edit:

If you want to kill 2 pods instead of 1 in any of the pre-existing scenarios, you can either edit the iterations number in the config or edit the kill count in the scenario file:

- id: kill-pods
  config:
    namespace_pattern: ^kube-system$
    name_pattern: .*
    kill: 2  # changed from 1
    krkn_pod_recovery_time: 120
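
Because namespace_pattern is a regular expression, the `^` and `$` anchors in the example decide whether look-alike namespaces are also selected. The sketch below uses Python's `re` module to show the difference. It assumes search-style matching, which may not be exactly how Krkn applies the pattern, and the namespace names are made up for the example.

```python
import re


def matching_namespaces(pattern, namespaces):
    """Return the namespaces in which `pattern` is found (search semantics)."""
    regex = re.compile(pattern)
    return [ns for ns in namespaces if regex.search(ns)]


# Hypothetical namespace names, for illustration only.
namespaces = ["kube-system", "kube-system-test", "my-kube-system"]

print(matching_namespaces(r"^kube-system$", namespaces))
print(matching_namespaces(r"kube-system", namespaces))
```

The anchored pattern selects only `kube-system`, while the unanchored one selects all three names.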

Example of Quick Nodes Scenario Edit:

If your cluster is built on GCP instead of AWS, just change the cloud type in the node_scenarios_example.yml file.

node_scenarios:
  - actions:
    - node_reboot_scenario
    node_name:
    label_selector: node-role.kubernetes.io/worker
    instance_count: 1
    timeout: 120
    cloud_type: gcp  # changed from aws
    parallel: true
    kube_check: true

Templates

Pod Scenario Yaml Template

To add a pod-level scenario for a new application, use the sample scenario below to see which fields are required and what to put in each one:

# yaml-language-server: $schema=../plugin.schema.json
- id: kill-pods
  config:
    namespace_pattern: ^<namespace>$
    label_selector: <pod label>
    kill: <number of pods to kill>
    krkn_pod_recovery_time: <expected time for the pod to become ready>

Node Scenario Yaml Template

node_scenarios:
  - actions:  # Node chaos scenarios to be injected.
    - <chaos scenario>
    - <chaos scenario>
    node_name: <node name>  # Can be left blank.
    label_selector: <node label>
    instance_count: <number of nodes on which to perform action>
    timeout: <duration to wait for completion>
    cloud_type: <cloud provider>

Time Chaos Scenario Template

time_scenarios:
  - action: <skew_time or skew_date>
    object_type: <pod or node>
    label_selector: <label of pod or node>

RBAC

Depending on the type of chaos test being executed, certain scenarios may require elevated privileges. The specific RBAC authorization needed for each Krkn scenario is outlined in detail at the following link: Krkn RBAC

3 - Long-Term Storage

Persist metrics from every chaos run into Elasticsearch to compare behavior across releases, dates, or cluster configurations.

Goal: Persist metrics from every chaos run into Elasticsearch so you can compare behavior across releases, dates, or cluster configurations.

This journey enables regression analysis — for example, detecting that API server latency during a node failure has increased between software versions.
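
As a rough illustration of what such a regression check could look like once two runs' values are retrieved, the Python sketch below compares a metric against a baseline. The `regressed` helper, the 10% default tolerance, and the sample values in the usage note are all hypothetical. Querying Elasticsearch for the stored values is out of scope here.

```python
def regressed(baseline, current, tolerance=0.10):
    """Return True when `current` exceeds `baseline` by more than `tolerance`.

    `tolerance` is a fraction of the baseline (0.10 means 10%). The helper
    suits "lower is better" metrics such as latency.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (current - baseline) / baseline > tolerance
```

For example, `regressed(0.20, 0.25)` flags a latency that rose from 0.20s to 0.25s, while `regressed(0.20, 0.21)` stays within the tolerance.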

What you need

Steps

  1. Complete Metrics Validation first to confirm Prometheus evaluation is working.

  2. Define your metrics profile at config/metrics.yaml. This controls which Prometheus metrics are snapshotted and stored per run:

    - query: irate(apiserver_request_total{verb="POST"}[2m])
      metricName: apiserverRequestRate
    
    - query: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))
      metricName: etcdFsyncLatencyP99
    
  3. Run a scenario with both profiles mounted:

    krknctl run pod-scenarios \
      --metrics-profile config/metrics.yaml \
      --alerts-profile config/alerts.yaml
    

    After each scenario, metrics snapshots are stored alongside run metadata (scenario type, duration, cluster version, exit status).

  4. Deploy krkn-visualize to query your data through pre-built Grafana dashboards for API performance, etcd health, node and pod scenarios, and more:

    krknctl visualize \
      --es-url https://elasticsearch.example.com \
      --es-username elastic \
      --es-password <your-password> \
      --prometheus-url https://prometheus.example.com \
      --prometheus-bearer-token <your-token> \
      --grafana-password <grafana-admin-password>
    

    This deploys krkn-visualize to your cluster and wires it to both Elasticsearch and Prometheus. To tear it down later:

    krknctl visualize --delete
    

Reference docs

Next steps

To generate a numerical resilience score on top of your Prometheus data, continue to Resilience Score.

4 - Resilience Score

Generate a numerical score (0–100%) that represents how well your environment held up during chaos.

Goal: Generate a numerical score (0–100%) that represents how well your environment held up during chaos — giving you more signal than a binary pass/fail.

A resilience score lets you track improvement over time, compare environments, and set score thresholds as release gates.

What you need

How scoring works

After a chaos scenario completes, Krkn evaluates a set of SLOs (defined as PromQL expressions) over the chaos time window. Each SLO is weighted by severity:

  • Warning SLOs — 1 point each
  • Critical SLOs — 3 points each

The final score is (points passed / total possible points) × 100. A score of 95% indicates a robust system with minor degradation; 60% signals significant issues that need investigation even if the scenario technically passed.
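
The weighting and formula above can be worked through in a few lines of Python. This is a sketch of the arithmetic as described in this section, not Krkn's scoring code. `WEIGHTS` and `resilience_score` are names chosen for the example.

```python
# Points per severity, as described above.
WEIGHTS = {"warning": 1, "critical": 3}


def resilience_score(slo_results):
    """Compute (points passed / total possible points) * 100.

    `slo_results` is a list of (severity, passed) pairs.
    """
    total = sum(WEIGHTS[severity] for severity, _ in slo_results)
    if total == 0:
        raise ValueError("at least one SLO is required")
    earned = sum(WEIGHTS[severity] for severity, passed in slo_results if passed)
    return earned / total * 100


# Two critical SLOs pass and one warning SLO fails: 6 of 7 points.
print(round(resilience_score(
    [("critical", True), ("warning", False), ("critical", True)]), 1))
```

With two critical SLOs passing and one warning SLO failing, 6 of 7 possible points are earned, so the score is about 85.7%. Failing a single critical SLO instead would cost three points and drop the score to about 57.1%.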

When running via krknctl, resiliency scoring runs automatically in controller mode — per-scenario scores are captured and aggregated across all scenarios in the run.

Steps

  1. Complete Metrics Validation to confirm Prometheus evaluation is working.

  2. Define your SLOs in config/alerts.yaml. Add a severity to each entry — scoring weights them automatically:

    - expr: avg_over_time(histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))[5m:]) > 0.01
      description: "etcd fsync latency above 10ms"
      severity: critical        # 3 points
    
    - expr: sum(kube_pod_status_phase{phase="Failed"}) > 0
      description: "any pods in Failed phase"
      severity: warning         # 1 point
    
    - expr: increase(apiserver_request_total{code=~"5.."}[5m]) > 10
      description: "API server 5xx errors during chaos"
      severity: critical        # 3 points
    

    You can also set a custom weight: on any entry to override the severity default — see custom weights and a complete example profile.

  3. Run a scenario with the alerts profile mounted:

    krknctl run pod-scenarios --resiliency-file config/alerts.yaml
    

    The resiliency score is printed at the end of the run and written to kraken.report and resiliency-report.json.

    Resiliency Score: 87% (13/15 SLOs passed)
    
  4. Use the score as a gate — in CI, check the exit code and parse the score from the output to enforce a minimum threshold before promoting a build.
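
One way to implement the gate from step 4 is to extract the percentage from the captured output and compare it with a minimum. The Python sketch below assumes the score line looks like the sample shown in step 3, so adjust the pattern if your output differs. `parse_score` and `gate` are hypothetical helper names.

```python
import re
from typing import Optional

# Matches a line such as "Resiliency Score: 87% (13/15 SLOs passed)".
SCORE_LINE = re.compile(r"Resiliency Score:\s*(\d+(?:\.\d+)?)%")


def parse_score(output: str) -> Optional[float]:
    """Return the score found in `output`, or None when no score line exists."""
    match = SCORE_LINE.search(output)
    return float(match.group(1)) if match else None


def gate(output: str, exit_code: int, minimum: float) -> bool:
    """Pass only when the run exited cleanly and the score meets `minimum`."""
    score = parse_score(output)
    return exit_code == 0 and score is not None and score >= minimum
```

Treating a missing score line as a failure keeps the gate from passing silently when scoring did not run.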

Reference docs

Next steps

To orchestrate chaos across multiple clusters from a single control point, continue to Multi-Cluster Orchestration.

5 - Multi-Cluster Orchestration

Run chaos scenarios across multiple clusters or cloud environments from a single control point using krkn-operator.

Goal: Run chaos scenarios across multiple clusters or cloud environments from a single control point — useful for validating a multi-region application or comparing cluster configurations.

The recommended approach for multi-cluster orchestration is krkn-operator: a Kubernetes operator that runs on a dedicated control plane cluster and dispatches chaos scenarios to any number of registered target clusters, without distributing credentials to individual users.

How krkn-operator works

  • A control plane cluster runs the operator and its web console
  • Target clusters are registered once by an administrator (via kubeconfig, service account token, or username/password)
  • Users select one or more targets through the web UI and launch scenarios — they never handle cluster credentials directly
  • Scenarios run in parallel across all selected targets

This design preserves the original Krkn architecture (chaos runs from outside the cluster) while adding a secure, centralized orchestration layer.
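
The parallel fan-out described in the list above can be pictured with a short Python sketch. It only illustrates the idea of dispatching one scenario to several targets concurrently. It is not how krkn-operator is implemented, and `run_on_targets`, `run_scenario`, and the target names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_on_targets(run_scenario, targets):
    """Call `run_scenario(target)` for every target concurrently.

    Returns a dict mapping each target to its result. An exception raised
    for any target propagates to the caller.
    """
    if not targets:
        return {}
    with ThreadPoolExecutor(max_workers=len(targets)) as pool:
        results = pool.map(run_scenario, targets)  # results keep target order
        return dict(zip(targets, results))


# Stand-in for a real scenario launch; target names are made up.
print(run_on_targets(lambda target: f"{target}: ok", ["cluster-a", "cluster-b"]))
```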

What you need

  • A Kubernetes or OpenShift cluster to host the operator (the control plane cluster)
  • Helm 3.0+
  • kubeconfig or service account credentials for each target cluster (held by the admin, not shared with users)

Steps

  1. Install krkn-operator on your control plane cluster using Helm:

    helm install krkn-operator oci://quay.io/krkn-chaos/charts/krkn-operator \
      --version <VERSION> \
      --namespace krkn-operator-system \
      --create-namespace
    

    For production deployments with HA, external access, and monitoring, see the full installation guide.

  2. Access the web console — for local testing use port-forwarding; for production expose it via Ingress, Gateway API, or OpenShift Route:

    kubectl port-forward svc/krkn-operator-console 3000:3000 -n krkn-operator-system
    
  3. Register target clusters — as an administrator, open Admin Settings → Cluster Targets → Add Target and provide the cluster name and credentials for each cluster you want to target. See Configuration for the three supported auth methods (kubeconfig, service account token, username/password).

  4. Run a scenario across multiple clusters — click Run Scenario, select one or more registered target clusters, choose a scenario, configure its parameters, and launch. The operator executes the scenario on all selected targets concurrently.

  5. Monitor in real time — the home dashboard shows all active runs across all clusters. Click any run to see live log streaming and execution status.

ACM/OCM integration

If your organization uses Red Hat Advanced Cluster Management (ACM) or Open Cluster Management (OCM), install the operator with ACM integration enabled. It will automatically discover and sync all ACM-managed clusters as chaos targets — no manual credential management required:

helm install krkn-operator oci://quay.io/krkn-chaos/charts/krkn-operator \
  --version <VERSION> \
  --namespace krkn-operator-system \
  --create-namespace \
  --set acm.enabled=true

Reference docs