Chaos and resiliency testing tool for Kubernetes with a focus on improving performance under failure conditions. A CNCF sandbox project. GitHub
krkn-chaos
- 1: krkn
- 2: Installation
- 3: Scenarios
- 3.1: Application Outage Scenarios
- 3.2: Arcaflow Scenarios
- 3.3: Container Scenarios
- 3.4: CPU Hog Scenario
- 3.5: IO Hog Scenario
- 3.6: ManagedCluster Scenarios
- 3.7: Memory Hog Scenario
- 3.8: Network Chaos Scenario
- 3.9: Node Scenarios
- 3.9.1: Node Scenarios using Krkn
- 3.9.2: Node Scenarios using Krkn-Hub
- 3.10: Pod Network Scenarios
- 3.11: Pod Scenarios
- 3.11.1: Pod Scenarios using Krkn
- 3.11.2: Pod Scenarios using Krkn-hub
- 3.12: Power Outage Scenarios
- 3.13: PVC Scenario
- 3.13.1: PVC Scenario using Krkn
- 3.13.2: PVC Scenario using Krkn-Hub
- 3.14: Service Disruption Scenarios
- 3.15: Service Hijacking Scenario
- 3.16: Time Scenarios
- 3.16.1: Time Scenarios using Krkn
- 3.16.2: Time Skew Scenarios using Krkn-Hub
- 3.17: Zone Outage Scenarios
- 3.18: All Scenarios Variables
- 3.19: Supported Cloud Providers
- 4: Chaos Testing Guide
- 5: Cerberus
- 5.1: Installation
- 5.2: Config
- 5.3: Example Report
- 5.4: Usage
- 5.5: Alerts
- 5.6: Node Problem Detector
- 5.7: Slack Integration
- 5.8: Contribute
- 6: Chaos Recommendation Tool
- 7: Contribution Guidelines
- 7.1: Testing your changes
- 7.2: Contributions
- 8: Krkn Roadmap
1 - krkn
krkn is a chaos and resiliency testing tool for Kubernetes. Kraken injects deliberate failures into Kubernetes clusters to check whether they are resilient to turbulent conditions.
Why do I want it?
There are a number of false assumptions that users might have when operating and running their applications in distributed systems:
- The network is reliable
- There is zero latency
- Bandwidth is infinite
- The network is secure
- Topology never changes
- The network is homogeneous
- Consistent resource usage with no spikes
- All shared resources are available from all places
These assumptions have led to a number of outages in production environments in the past. The services suffered from poor performance or were inaccessible to customers, leading to missed Service Level Agreement uptime promises, revenue loss, and a degradation in the perceived reliability of said services.
How can we best avoid this from happening? This is where chaos testing can add value.
Workflow
How to Get Started
Instructions on how to setup, configure and run Kraken can be found at Installation.
You may consider utilizing the chaos recommendation tool prior to initiating the chaos runs to profile the application service(s) under test. This tool discovers a list of Krkn scenarios with a high probability of causing failures or disruptions to your application service(s). The tool can be accessed at Chaos-Recommender.
See the getting started doc for details on how to get started with your own custom scenario or on editing current scenarios for your specific usage.
After installation, refer back to the below sections for supported scenarios and how to tweak the kraken config to load them on your cluster.
Running Kraken with minimal configuration tweaks
For cases where you want to run Kraken with minimal configuration changes, refer to krkn-hub. One use case is CI integration where you do not want to carry around different configuration files for the scenarios.
Config
Instructions on how to setup the config and the options supported can be found at Config.
Kraken scenario pass/fail criteria and report
It is important to make sure to check if the targeted component recovered from the chaos injection and also if the Kubernetes cluster is healthy as failures in one component can have an adverse impact on other components. Kraken does this by:
- Having built-in checks for pod and node based scenarios to ensure the expected number of replicas and nodes are up. It also supports running custom scripts with the checks.
- Leveraging Cerberus to monitor the cluster under test and consuming the aggregated go/no-go signal to determine pass/fail post chaos. It is highly recommended to turn on the Cerberus health check feature available in Kraken. Instructions on installing and setting up Cerberus can be found here, or it can be installed from Kraken using the instructions. Once Cerberus is up and running, set cerberus_enabled to True and cerberus_url to the URL where Cerberus publishes the go/no-go signal in the Kraken config file. Cerberus can monitor application routes during the chaos and fails the run if it encounters downtime, as that would be a potential downtime in a customer's or user's environment as well. This is especially important during control plane chaos scenarios involving the API server, Etcd, Ingress etc. It can be enabled by setting
check_applicaton_routes: True
in the Kraken config, provided application routes are being monitored in the cerberus config (see the sketch after this list).
- Leveraging the built-in alert collection feature to fail the runs in case of critical alerts.
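As an illustrative sketch of the Cerberus-related settings mentioned above (the key names follow the text, but the exact section layout of the shipped Kraken config may differ, so treat this as an assumption):
cerberus:
    cerberus_enabled: True                          # Consume the Cerberus go/no-go signal post chaos
    cerberus_url: http://0.0.0.0:8080               # URL where Cerberus publishes the go/no-go signal (illustrative value)
    check_applicaton_routes: True                   # Fail the run if monitored application routes see downtime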
Signaling
In CI runs or any external job, it is useful to stop Kraken once a certain test or state is reached. We created a way to signal Kraken to pause the chaos or stop it completely using a signal posted to a port of your choice.
For example, if we have a test run loading the cluster and Kraken running separately, we want to be able to start/stop the Kraken run based on when the test run completes or reaches a certain loaded state.
More detailed information on enabling and leveraging this feature can be found here.
Performance monitoring
Monitoring the Kubernetes/OpenShift cluster to observe the impact of Kraken chaos scenarios on various components is key to finding out the bottlenecks, as it is important to make sure the cluster is healthy in terms of both recovery and performance during/after the failure has been injected. Instructions on enabling it can be found here.
SLOs validation during and post chaos
- In addition to checking the recovery and health of the cluster and components under test, Kraken takes in a profile with Prometheus expressions to validate and alert on, and exits with a non-zero return code depending on the severity set. This feature can be used to determine pass/fail or to alert on abnormalities observed in the cluster based on the metrics (a sketch of such a profile entry follows below).
- Kraken also provides the ability to check if any critical alerts are firing in the cluster post chaos and passes/fails the run accordingly.
Information on enabling and leveraging this feature can be found here.
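As a sketch of what such a profile entry can contain (the expression and threshold below are illustrative, and the field names are assumptions based on the alert profiles shipped with Kraken):
- expr: increase(kube_pod_container_status_restarts_total{namespace=~"openshift-.*"}[10m]) > 5   # Prometheus expression to evaluate
  description: Pods are restarting frequently in system namespaces                               # summary reported when the expression fires
  severity: warning                                                                              # severity level; higher severities (e.g. critical) can fail the run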
OCM / ACM integration
Kraken supports injecting faults into Open Cluster Management (OCM) and Red Hat Advanced Cluster Management for Kubernetes (ACM) managed clusters through ManagedCluster Scenarios.
Where should I go next?
- Installation: Get started using krkn!
- Scenarios: Check out the scenarios we offer!
2 - Installation
The following ways are supported to run Kraken:
- Standalone python program through Git.
- Containerized version using either Podman or Docker as the runtime via Krkn-hub
- Kubernetes or OpenShift deployment ( unsupported )
Note
It is recommended to run Kraken external to the cluster ( Standalone or Containerized ) hitting the Kubernetes/OpenShift API, as running it internal to the cluster might be disruptive to itself and also might not report back the results if the chaos leads to the cluster's API server instability.
Note
To run Kraken on Power (ppc64le) architecture, build and run a containerized version by following the instructions given here.
Note
Helper functions for interactions in Krkn are part of krkn-lib. Please feel free to reuse and expand them as you see fit when adding a new scenario or expanding the capabilities of the current supported scenarios.
2.1 - Krkn
Installation
Git
Clone the repository
$ git clone https://github.com/krkn-chaos/krkn.git --branch <release version>
$ cd krkn
Install the dependencies
$ python3.9 -m venv chaos
$ source chaos/bin/activate
$ pip3.9 install -r requirements.txt
Note
Make sure python3-devel and the latest pip versions are installed on the system. The dependency installation has been tested with pip >= 21.1.3.
Running Krkn
$ python3.9 run_kraken.py --config <config_file_location>
Run containerized version
Krkn-hub is a wrapper that allows running Krkn chaos scenarios via podman or docker runtime with scenario parameters/configuration defined as environment variables.
2.2 - krkn-hub
Hosts container images and wrappers for running scenarios supported by Krkn, a chaos testing tool for Kubernetes clusters to ensure they are resilient to failures. All you need to do is run the containers with the respective environment variables defined as supported by the scenarios, without having to maintain and tweak files!
Set Up
You can use docker or podman to run krkn-hub.
Install Podman on your operating system based on these instructions
or
Install Docker on your system.
Docker is also supported, but all variables you want to set (separate from the defaults) need to be set at the command line in the form -e <VARIABLE>=<value>.
You can take advantage of the get_docker_params.sh script to create your parameters string. This will take all environment variables and put them in the form "-e <VARIABLE>=<value>" to make a long string that can get passed to the command.
For example: docker run $(./get_docker_params.sh) --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/redhat-chaos/krkn-hub:power-outages
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
3 - Scenarios
Supported chaos scenarios
Scenario | Description |
---|---|
Pod failures | Injects pod failures |
Container failures | Injects container failures based on the provided kill signal |
Node failures | Injects node failure through OpenShift/Kubernetes, cloud APIs |
Zone outages | Creates zone outage to observe the impact on the cluster, applications |
Time skew | Skews the time and date |
Node cpu hog | Hogs CPU on the targeted nodes |
Node memory hog | Hogs memory on the targeted nodes |
Node IO hog | Hogs io on the targeted nodes |
Service Disruption | Deleting all objects within a namespace |
Application outages | Isolates application Ingress/Egress traffic to observe the impact on dependent applications and recovery/initialization timing |
Power Outages | Shuts down the cluster for the specified duration and turns it back on to check the cluster health |
PVC disk fill | Fills up a given PersistentVolumeClaim by creating a temp file on the PVC from a pod associated with it |
Network Chaos | Introduces network latency, packet loss, bandwidth restriction in the egress traffic of a Node’s interface using tc and Netem |
Pod Network Chaos | Introduces network chaos at pod level |
Service Hijacking | Hijacks a service's HTTP traffic to simulate custom HTTP responses |
3.1 - Application Outage Scenarios
Application outages
Scenario to block the traffic ( Ingress/Egress ) of an application matching the labels for the specified duration of time, to understand the behavior of the service and of other services which depend on it during downtime. This helps with planning the requirements accordingly, be it improving the timeouts or tweaking the alerts etc.
3.1.1 - Application Outage Scenarios using Krkn
Sample scenario config
application_outage: # Scenario to create an outage of an application by blocking traffic
duration: 600 # Duration in seconds after which the routes will be accessible
namespace: <namespace-with-application> # Namespace to target - all application routes will go inaccessible if pod selector is empty
pod_selector: {app: foo} # Pods to target
block: [Ingress, Egress] # It can be Ingress or Egress or Ingress, Egress
Debugging steps in case of failures
Kraken creates a network policy blocking the ingress/egress traffic to create an outage. In case of failures before reverting back the network policy, you can delete it manually by executing the following command to stop the outage:
$ oc delete networkpolicy/kraken-deny -n <targeted-namespace>
3.1.2 - Application outage Scenario using Krkn-hub
This scenario disrupts the traffic to the specified application to help understand the impact of the outage on the dependent services and user experience. Refer to the docs for more details.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos and set the CERBERUS_ENABLED
environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:application-outages
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs the exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:application-outages
OR
$ docker run -e <VARIABLE>=<value> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:application-outages
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs the exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
See the list of variables that apply to all scenarios here; they can be used/set in addition to these scenario-specific variables.
Parameter | Description | Default |
---|---|---|
DURATION | Duration in seconds after which the routes will be accessible | 600 |
NAMESPACE | Namespace to target - all application routes will go inaccessible if pod selector is empty ( Required ) | No default |
POD_SELECTOR | Pods to target. For example “{app: foo}” | No default |
BLOCK_TRAFFIC_TYPE | It can be Ingress or Egress or Ingress, Egress ( needs to be a list ) | [Ingress, Egress] |
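For example, a hypothetical set of exports for this scenario (the values shown are illustrative) could be:
$ export NAMESPACE=<namespace-with-application>     # Required: namespace to target
$ export POD_SELECTOR="{app: foo}"                   # Optional: pods to target
$ export BLOCK_TRAFFIC_TYPE="[Ingress, Egress]"      # Traffic direction(s) to block
$ export DURATION=600                                # Outage duration in seconds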
Note
Defining the NAMESPACE parameter is required for running this scenario, while the pod_selector is optional. In case of using a pod selector to target a particular application, make sure to define it using the following format with a space between key and value: "{key: value}".
Note
In case of using a custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:application-outages
Demo
You can find a link to a demo of the scenario here
3.2 - Arcaflow Scenarios
Arcaflow is a workflow engine in development which provides the ability to execute workflow steps in sequence, in parallel, repeatedly, etc. The main difference from competitors such as Netflix Conductor is the ability to run ad-hoc workflows without requiring an infrastructure setup.
The engine uses containers to execute plugins and runs them either locally in Docker/Podman or remotely on a Kubernetes cluster. The workflow system is strongly typed and allows for generating JSON schema and OpenAPI documents for all data formats involved.
Available Scenarios
Hog scenarios:
Prerequisites
Arcaflow supports three deployment technologies:
- Docker
- Podman
- Kubernetes
Docker
In order to run Arcaflow Scenarios with the Docker deployer, be sure that:
- Docker is correctly installed in your Operating System (to find instructions on how to install docker please refer to Docker Documentation)
- The Docker daemon is running
Podman
The podman deployer is built around the podman CLI and does not necessarily need to be run along with the podman daemon. To run Arcaflow Scenarios in your Operating System, be sure that:
- podman is correctly installed in your Operating System (to find instructions on how to install podman refer to Podman Documentation)
- the podman CLI is in your shell PATH
Kubernetes
The kubernetes deployer integrates directly with the Kubernetes API client and needs only a valid kubeconfig file and a reachable Kubernetes/OpenShift cluster.
3.2.1 - Arcaflow Scenarios using Krkn
Usage
To enable arcaflow scenarios, edit the kraken config file: go to the section kraken -> chaos_scenarios of the yaml structure, add a new element to the list named arcaflow_scenarios, and then add the desired scenario pointing to the input.yaml file.
kraken:
...
chaos_scenarios:
- arcaflow_scenarios:
- scenarios/arcaflow/cpu-hog/input.yaml
input.yaml
The implemented scenarios can be found in the scenarios/arcaflow/<scenario_name> folder. The entrypoint of each scenario is the input.yaml file, which contains all the options to set up the scenario according to the desired target.
config.yaml
The arcaflow config file. Here you can set the arcaflow deployer and the arcaflow log level. The supported deployers are:
- Docker
- Podman (podman daemon not needed, suggested option)
- Kubernetes
The supported log levels are:
- debug
- info
- warning
- error
workflow.yaml
This file contains the steps that will be executed to perform the scenario against the target. Each step is represented by a container that will be executed by the deployer, along with its options. Note that we provide the scenarios as a template, but they can be manipulated to define more complex workflows. For more details regarding the arcaflow workflows architecture and syntax, it is suggested to refer to the Arcaflow Documentation.
This edit is no longer in the quay image. A fix is being worked on in ticket https://issues.redhat.com/browse/CHAOS-494. This will affect all versions 4.12 and higher of OpenShift.
3.3 - Container Scenarios
Kraken uses the oc exec
command to kill
specific containers in a pod.
This can be based on the pods namespace or labels. If you know the exact object you want to kill, you can also specify the specific container name or pod name in the scenario yaml file.
These scenarios are in a simple yaml format that you can manipulate to run your specific tests or use the pre-existing scenarios to see how it works.
3.3.1 - Container Scenarios using Krkn
Example Config
The following are the components of Kubernetes for which a basic chaos scenario config exists today.
scenarios:
- name: "<name of scenario>"
namespace: "<specific namespace>" # can specify "*" if you want to find in all namespaces
label_selector: "<label of pod(s)>"
container_name: "<specific container name>" # This is optional; if omitted, all containers in the pods found under the namespace and label will be killed
pod_names: # This is optional; if omitted, all pods with the given namespace and label will be selected
- <pod_name>
count: <number of containers to disrupt, default=1>
action: <kill signal to run. For example 1 ( hang up ) or 9. Default is set to 1>
expected_recovery_time: <number of seconds to wait for container to be running again> (defaults to 120 seconds)
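For illustration, a filled-in scenario along these lines would target the etcd containers; the namespace, label, and container name below are assumptions borrowed from the defaults listed in the Krkn-hub section that follows:
scenarios:
- name: "kill etcd container"
  namespace: "openshift-etcd"
  label_selector: "k8s-app=etcd"
  container_name: "etcd"
  count: 1
  action: 9                                # SIGKILL
  expected_recovery_time: 120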
Post Action
In all scenarios we do a post chaos check to wait and verify the specific component.
Here there are two options:
- Pass a custom script in the main config scenario list that will run before the chaos and verify that the output matches the post chaos output.
See scenarios/post_action_etcd_container.py for an example.
- container_scenarios: # List of chaos pod scenarios to load.
    - - scenarios/container_etcd.yml
      - scenarios/post_action_etcd_container.py
- Allow kraken to wait and check the killed containers until they become ready again. Kraken keeps a list of the specific containers that were killed as well as the namespaces and pods to verify all containers that were affected recover properly.
expected_recovery_time: <seconds to wait for container to recover>
3.3.2 - Container Scenarios using Krkn-hub
This scenario disrupts the containers matching the label in the specified namespace on a Kubernetes/OpenShift cluster.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos and set the CERBERUS_ENABLED
environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:container-scenarios
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs the exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:container-scenarios
OR
$ docker run -e <VARIABLE>=<value> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:container-scenarios
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs the exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
See the list of variables that apply to all scenarios here; they can be used/set in addition to these scenario-specific variables.
Parameter | Description | Default |
---|---|---|
NAMESPACE | Targeted namespace in the cluster | openshift-etcd |
LABEL_SELECTOR | Label of the container(s) to target | k8s-app=etcd |
DISRUPTION_COUNT | Number of containers to disrupt | 1 |
CONTAINER_NAME | Name of the container to disrupt | etcd |
ACTION | kill signal to run. For example 1 ( hang up ) or 9 | 1 |
EXPECTED_RECOVERY_TIME | Time to wait before checking if all containers that were affected recover properly | 60 |
Note
Set the NAMESPACE environment variable to openshift-.* to pick and disrupt pods randomly in openshift system namespaces; the DAEMON_MODE can also be enabled to disrupt the pods every x seconds in the background to check the reliability.
Note
In case of using a custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:container-scenarios
Demo
You can find a link to a demo of the scenario here
3.4 - CPU Hog Scenario
This scenario is based on the arcaflow arcaflow-plugin-stressng plugin. The purpose of this scenario is to create cpu pressure on a particular node of the Kubernetes/OpenShift cluster for a time span.
3.4.1 - CPU Hog Scenarios using Krkn
To enable this plugin add the pointer to the scenario input file scenarios/arcaflow/cpu-hog/input.yaml
as described in the
Usage section.
This scenario takes a list of objects named input_list
with the following properties:
- kubeconfig : string the kubeconfig needed by the deployer to deploy the sysbench plugin in the target cluster
- namespace : string the namespace where the scenario container will be deployed
Note: this parameter will be automatically filled by kraken if the kubeconfig_path property is correctly set
- node_selector : key-value map the node label that will be used as nodeSelector by the pod to target a specific cluster node
- duration : string stop stress test after N seconds. One can also specify the units of time in seconds, minutes, hours, days or years with the suffix s, m, h, d or y.
- cpu_count : int the number of CPU cores to be used (0 means all)
- cpu_method : string a fine-grained control of which cpu stressors to use (ackermann, cfloat etc. see manpage for all the cpu_method options)
- cpu_load_percentage : int the CPU load by percentage
To perform several load tests in the same run simultaneously (e.g. stress two or more nodes in the same run), add another item to the input_list with the same properties (and possibly different values, e.g. different node_selectors to schedule the pod on different nodes). To reduce (or increase) the parallelism, change the value of parallelism in the workload.yaml file.
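Putting these properties together, a hypothetical input_list entry in input.yaml might look like the sketch below; treat the layout and values as illustrative rather than the exact shipped format:
input_list:
  - kubeconfig: ""                          # filled automatically by kraken when kubeconfig_path is set
    namespace: default                      # namespace where the stress container is deployed
    node_selector:
      node-role.kubernetes.io/worker: ""    # label used as nodeSelector for the target node
    duration: 60s                           # stop the stress test after 60 seconds
    cpu_count: 0                            # 0 means use all CPU cores
    cpu_method: all                         # cpu stressor selection (see the stress-ng manpage)
    cpu_load_percentage: 80                 # target CPU load in percent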
Usage
To enable arcaflow scenarios, edit the kraken config file: go to the section kraken -> chaos_scenarios of the yaml structure, add a new element to the list named arcaflow_scenarios, and then add the desired scenario pointing to the input.yaml file.
kraken:
...
chaos_scenarios:
- arcaflow_scenarios:
- scenarios/arcaflow/cpu-hog/input.yaml
input.yaml
The implemented scenarios can be found in the scenarios/arcaflow/<scenario_name> folder. The entrypoint of each scenario is the input.yaml file, which contains all the options to set up the scenario according to the desired target.
config.yaml
The arcaflow config file. Here you can set the arcaflow deployer and the arcaflow log level. The supported deployers are:
- Docker
- Podman (podman daemon not needed, suggested option)
- Kubernetes
The supported log levels are:
- debug
- info
- warning
- error
workflow.yaml
This file contains the steps that will be executed to perform the scenario against the target. Each step is represented by a container that will be executed by the deployer, along with its options. Note that we provide the scenarios as a template, but they can be manipulated to define more complex workflows. For more details regarding the arcaflow workflows architecture and syntax, it is suggested to refer to the Arcaflow Documentation.
This edit is no longer in the quay image. A fix is being worked on in ticket https://issues.redhat.com/browse/CHAOS-494. This will affect all versions 4.12 and higher of OpenShift.
3.4.2 - CPU Hog Scenario using Krkn-Hub
This scenario hogs the cpu on the specified node on a Kubernetes/OpenShift cluster for a specified duration. For more information, refer to the following documentation.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos and set the CERBERUS_ENABLED
environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-cpu-hog
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs the exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-cpu-hog
OR
$ docker run -e <VARIABLE>=<value> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-cpu-hog
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs the exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
See the list of variables that apply to all scenarios here; they can be used/set in addition to these scenario-specific variables.
Parameter | Description | Default |
---|---|---|
TOTAL_CHAOS_DURATION | Set chaos duration (in sec) as desired | 60 |
NODE_CPU_CORE | Number of cores (workers) of node CPU to be consumed | 2 |
NODE_CPU_PERCENTAGE | Percentage of total cpu to be consumed | 50 |
NAMESPACE | Namespace where the scenario container will be deployed | default |
NODE_SELECTORS | Node selectors where the scenario containers will be scheduled, in the format "<selector>=<value>". NOTE: This value can be specified as a list of node selectors separated by ";". A container will be instantiated per node selector with the same scenario options. This option is meant to run one or more stress scenarios simultaneously on different nodes; Kubernetes will schedule the pods on the target nodes according to the specified selectors. Specifying the same selector multiple times will instantiate as many scenario containers as the number of times the selector is specified, on the same node | "" |
Note
In case of using a custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-cpu-hog
Demo
You can find a link to a demo of the scenario here
3.5 - IO Hog Scenario
This scenario is based on the arcaflow arcaflow-plugin-stressng plugin.
The purpose of this scenario is to create disk pressure on a particular node of the Kubernetes/OpenShift cluster for a time span.
The scenario allows attaching a node path to the pod as a hostPath volume.
3.5.1 - IO Hog Scenarios using Krkn
To enable this plugin add the pointer to the scenario input file scenarios/arcaflow/io-hog/input.yaml
as described in the
Usage section.
This scenario takes a list of objects named input_list
with the following properties:
- kubeconfig : string the kubeconfig needed by the deployer to deploy the sysbench plugin in the target cluster
- namespace : string the namespace where the scenario container will be deployed
Note: this parameter will be automatically filled by kraken if the kubeconfig_path property is correctly set
- node_selector : key-value map the node label that will be used as nodeSelector by the pod to target a specific cluster node
- duration : string stop stress test after N seconds. One can also specify the units of time in seconds, minutes, hours, days or years with the suffix s, m, h, d or y.
- target_pod_folder : string the path in the pod where the volume is mounted
- target_pod_volume : object the hostPath volume definition in the Kubernetes/OpenShift format, that will be attached to the pod as a volume
- io_write_bytes : string writes N bytes for each hdd process. The size can be expressed as % of free space on the file system or in units of Bytes, KBytes, MBytes and GBytes using the suffix b, k, m or g
- io_block_size : string size of each write in bytes. Size can be from 1 byte to 4m.
To perform several load tests in the same run simultaneously (e.g. stress two or more nodes in the same run), add another item to the input_list with the same properties (and possibly different values, e.g. different node_selectors to schedule the pod on different nodes). To reduce (or increase) the parallelism, change the value of parallelism in the workload.yaml file.
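A hypothetical input_list entry combining these properties might look like the sketch below; the hostPath definition, paths, and sizes are illustrative only:
input_list:
  - kubeconfig: ""                          # filled automatically by kraken when kubeconfig_path is set
    namespace: default
    node_selector:
      node-role.kubernetes.io/worker: ""
    duration: 180s
    target_pod_folder: /hog-data            # path inside the pod where the volume is mounted
    target_pod_volume:                      # hostPath volume attached to the pod (Kubernetes format)
      name: node-volume
      hostPath:
        path: /tmp
    io_write_bytes: 10m                     # bytes written by each hdd stressor
    io_block_size: 1m                       # size of each write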
Usage
To enable arcaflow scenarios, edit the kraken config file: go to the section kraken -> chaos_scenarios of the yaml structure, add a new element to the list named arcaflow_scenarios, and then add the desired scenario pointing to the input.yaml file.
kraken:
...
chaos_scenarios:
- arcaflow_scenarios:
- scenarios/arcaflow/cpu-hog/input.yaml
input.yaml
The implemented scenarios can be found in the scenarios/arcaflow/<scenario_name> folder. The entrypoint of each scenario is the input.yaml file, which contains all the options to set up the scenario according to the desired target.
config.yaml
The arcaflow config file. Here you can set the arcaflow deployer and the arcaflow log level. The supported deployers are:
- Docker
- Podman (podman daemon not needed, suggested option)
- Kubernetes
The supported log levels are:
- debug
- info
- warning
- error
workflow.yaml
This file contains the steps that will be executed to perform the scenario against the target. Each step is represented by a container that will be executed by the deployer, along with its options. Note that we provide the scenarios as a template, but they can be manipulated to define more complex workflows. For more details regarding the arcaflow workflows architecture and syntax, it is suggested to refer to the Arcaflow Documentation.
This edit is no longer in the quay image. A fix is being worked on in ticket https://issues.redhat.com/browse/CHAOS-494. This will affect all versions 4.12 and higher of OpenShift.
3.5.2 - IO Hog Scenario using Krkn-Hub
This scenario hogs the IO on the specified node on a Kubernetes/OpenShift cluster for a specified duration. For more information, refer to the following documentation.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos and set the CERBERUS_ENABLED
environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/root/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-io-hog
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs the exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/root/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-io-hog
OR
$ docker run -e <VARIABLE>=<value> --net=host -v <path-to-kube-config>:/root/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-io-hog
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs the exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
See the list of variables that apply to all scenarios here; they can be used/set in addition to these scenario-specific variables.
Parameter | Description | Default |
---|---|---|
TOTAL_CHAOS_DURATION | Set chaos duration (in sec) as desired | 180 |
IO_BLOCK_SIZE | string size of each write in bytes. Size can be from 1 byte to 4m | 1m |
IO_WORKERS | Number of stressors | 5 |
IO_WRITE_BYTES | string writes N bytes for each hdd process. The size can be expressed as % of free space on the file system or in units of Bytes, KBytes, MBytes and GBytes using the suffix b, k, m or g | 10m |
NAMESPACE | Namespace where the scenario container will be deployed | default |
NODE_SELECTORS | Node selectors where the scenario containers will be scheduled, in the format "<selector>=<value>". NOTE: This value can be specified as a list of node selectors separated by ";". A container will be instantiated per node selector with the same scenario options. This option is meant to run one or more stress scenarios simultaneously on different nodes; Kubernetes will schedule the pods on the target nodes according to the specified selectors. Specifying the same selector multiple times will instantiate as many scenario containers as the number of times the selector is specified, on the same node | "" |
Note
In case of using a custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/root/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/root/kraken/config/alerts -v <path-to-kube-config>:/root/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-io-hog
3.6 - ManagedCluster Scenarios
ManagedCluster scenarios provide a way to integrate kraken with Open Cluster Management (OCM) and Red Hat Advanced Cluster Management for Kubernetes (ACM).
ManagedCluster scenarios leverage ManifestWorks to inject faults into the ManagedClusters.
The following ManagedCluster chaos scenarios are supported:
- managedcluster_start_scenario: Scenario to start the ManagedCluster instance.
- managedcluster_stop_scenario: Scenario to stop the ManagedCluster instance.
- managedcluster_stop_start_scenario: Scenario to stop and then start the ManagedCluster instance.
- start_klusterlet_scenario: Scenario to start the klusterlet of the ManagedCluster instance.
- stop_klusterlet_scenario: Scenario to stop the klusterlet of the ManagedCluster instance.
- stop_start_klusterlet_scenario: Scenario to stop and start the klusterlet of the ManagedCluster instance.
ManagedCluster scenarios can be injected by placing the ManagedCluster scenarios config files under managedcluster_scenarios
option in the Kraken config. Refer to managedcluster_scenarios_example config file.
managedcluster_scenarios:
- actions: # ManagedCluster chaos scenarios to be injected
- managedcluster_stop_start_scenario
managedcluster_name: cluster1 # ManagedCluster on which scenario has to be injected; can set multiple names separated by comma
# label_selector: # When managedcluster_name is not specified, a ManagedCluster with matching label_selector is selected for ManagedCluster chaos scenario injection
instance_count: 1 # Number of managedcluster to perform action/select that match the label selector
runs: 1 # Number of times to inject each scenario under actions (will perform on same ManagedCluster each time)
timeout: 420 # Duration to wait for completion of ManagedCluster scenario injection
# For OCM to detect a ManagedCluster as unavailable, have to wait 5*leaseDurationSeconds
# (default leaseDurationSeconds = 60 sec)
- actions:
- stop_start_klusterlet_scenario
managedcluster_name: cluster1
# label_selector:
instance_count: 1
runs: 1
timeout: 60
3.7 - Memory Hog Scenario
This scenario is based on the arcaflow arcaflow-plugin-stressng plugin. The purpose of this scenario is to create Virtual Memory pressure on a particular node of the Kubernetes/OpenShift cluster for a time span.
3.7.1 - Memory Hog Scenarios using Krkn
To enable this plugin add the pointer to the scenario input file scenarios/arcaflow/memory-hog/input.yaml
as described in the
Usage section.
This scenario takes a list of objects named input_list
with the following properties:
- kubeconfig : string the kubeconfig needed by the deployer to deploy the sysbench plugin in the target cluster
- namespace : string the namespace where the scenario container will be deployed
Note: this parameter will be automatically filled by kraken if the kubeconfig_path property is correctly set
- node_selector : key-value map the node label that will be used as nodeSelector by the pod to target a specific cluster node
- duration : string stop stress test after N seconds. One can also specify the units of time in seconds, minutes, hours, days or years with the suffix s, m, h, d or y.
- vm_bytes : string N bytes per vm process or percentage of memory used (using the % symbol). The size can be expressed in units of Bytes, KBytes, MBytes and GBytes using the suffix b, k, m or g.
- vm_workers : int Number of VM stressors to be run (0 means 1 stressor per CPU)
To perform several load tests in the same run simultaneously (e.g. stress two or more nodes in the same run), add another item to the input_list with the same properties (and possibly different values, e.g. different node_selectors to schedule the pod on different nodes). To reduce (or increase) the parallelism, change the value of parallelism in the workload.yaml file.
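A hypothetical input_list entry for this scenario might look like the sketch below; treat the layout and values as illustrative:
input_list:
  - kubeconfig: ""                          # filled automatically by kraken when kubeconfig_path is set
    namespace: default
    node_selector:
      node-role.kubernetes.io/worker: ""
    duration: 60s
    vm_bytes: 90%                           # memory to consume per vm stressor (percentage or absolute size)
    vm_workers: 1                           # number of vm stressors (0 means one per CPU)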
Usage
To enable arcaflow scenarios, edit the kraken config file: go to the section kraken -> chaos_scenarios of the yaml structure, add a new element to the list named arcaflow_scenarios, and then add the desired scenario pointing to the input.yaml file.
kraken:
...
chaos_scenarios:
- arcaflow_scenarios:
- scenarios/arcaflow/cpu-hog/input.yaml
input.yaml
The implemented scenarios can be found in the scenarios/arcaflow/<scenario_name> folder. The entrypoint of each scenario is the input.yaml file, which contains all the options to set up the scenario according to the desired target.
config.yaml
The arcaflow config file. Here you can set the arcaflow deployer and the arcaflow log level. The supported deployers are:
- Docker
- Podman (podman daemon not needed, suggested option)
- Kubernetes
The supported log levels are:
- debug
- info
- warning
- error
workflow.yaml
This file contains the steps that will be executed to perform the scenario against the target. Each step is represented by a container that will be executed by the deployer, along with its options. Note that we provide the scenarios as a template, but they can be manipulated to define more complex workflows. For more details regarding the arcaflow workflows architecture and syntax, it is suggested to refer to the Arcaflow Documentation.
This edit is no longer in the quay image. A fix is being worked on in ticket https://issues.redhat.com/browse/CHAOS-494. This will affect all versions 4.12 and higher of OpenShift.
3.7.2 - Memory Hog Scenario using Krkn-Hub
This scenario hogs the memory on the specified node on a Kubernetes/OpenShift cluster for a specified duration. For more information, refer to the following documentation.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos and set the CERBERUS_ENABLED
environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-memory-hog
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs the exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-memory-hog
OR
$ docker run -e <VARIABLE>=<value> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-memory-hog
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs the exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
See the list of variables that apply to all scenarios here; they can be used/set in addition to these scenario-specific variables.
Parameter | Description | Default |
---|---|---|
TOTAL_CHAOS_DURATION | Set chaos duration (in sec) as desired | 60 |
MEMORY_CONSUMPTION_PERCENTAGE | percentage (expressed with the suffix %) or amount (expressed with the suffix b, k, m or g) of memory to be consumed by the scenario | 90% |
NUMBER_OF_WORKERS | Total number of workers (stress-ng threads) | 1 |
NAMESPACE | Namespace where the scenario container will be deployed | default |
NODE_SELECTORS | Node selectors where the scenario containers will be scheduled, in the format "<selector>=<value>". NOTE: This value can be specified as a list of node selectors separated by ";". A container will be instantiated per node selector with the same scenario options. This option is meant to run one or more stress scenarios simultaneously on different nodes; Kubernetes will schedule the pods on the target nodes according to the specified selectors. Specifying the same selector multiple times will instantiate as many scenario containers as the number of times the selector is specified, on the same node | "" |
Note
In case of using a custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-memory-hog
Demo
You can find a link to a demo of the scenario here
3.8 - Network Chaos Scenario
Scenario to introduce network latency, packet loss, and bandwidth restriction in the Node’s host network interface. The purpose of this scenario is to observe faults caused by random variations in the network.
3.8.1 - Network Chaos Scenario using Krkn
Sample scenario config for egress traffic shaping
network_chaos: # Scenario to create an outage by simulating random variations in the network.
duration: 300 # In seconds - duration network chaos will be applied.
node_name: # Comma separated node names on which scenario has to be injected.
label_selector: node-role.kubernetes.io/master # When node_name is not specified, a node with matching label_selector is selected for running the scenario.
instance_count: 1 # Number of nodes in which to execute network chaos.
interfaces: # List of interface on which to apply the network restriction.
- "ens5" # Interface name would be the Kernel host network interface name.
execution: serial|parallel # Execute each of the egress options as a single scenario(parallel) or as separate scenario(serial).
egress:
latency: 500ms
loss: 50% # percentage
bandwidth: 10mbit
Sample scenario config for ingress traffic shaping (using a plugin)
- id: network_chaos
config:
node_interface_name: # Dictionary with key as node name(s) and value as a list of its interfaces to test
ip-10-0-128-153.us-west-2.compute.internal:
- ens5
- genev_sys_6081
label_selector: node-role.kubernetes.io/master # When node_interface_name is not specified, nodes with matching label_selector is selected for node chaos scenario injection
instance_count: 1 # Number of nodes to perform action/select that match the label selector
kubeconfig_path: ~/.kube/config # Path to kubernetes config file. If not specified, it defaults to ~/.kube/config
execution_type: parallel # Execute each of the ingress options as a single scenario(parallel) or as separate scenario(serial).
network_params:
latency: 500ms
loss: '50%'
bandwidth: 10mbit
wait_duration: 120
test_duration: 60
Note: For ingress traffic shaping, ensure that your node doesn't have any IFB (https://wiki.linuxfoundation.org/networking/ifb) interfaces already present. The scenario relies on creating IFBs to do the shaping, and they are deleted at the end of the scenario.
Steps
- Pick the nodes to introduce the network anomaly either from node_name or label_selector.
- Verify the interface list in one of the nodes, or use the interface with a default route as the test interface if no interface is specified by the user.
- Set the traffic shaping config on the node's interface using tc and netem (see the illustrative command after this list).
- Wait for the duration time.
- Remove the traffic shaping config on the node's interface.
- Remove the job that spawned the pod.
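For reference, the egress shaping applied on the node's interface is roughly equivalent to a tc/netem invocation like the one below; the interface name and values mirror the sample egress config above, and the exact command issued by the scenario may differ:
$ tc qdisc add dev ens5 root netem delay 500ms loss 50% rate 10mbit   # add latency, packet loss and a bandwidth cap
$ tc qdisc del dev ens5 root                                           # revert the shaping after the test duration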
3.8.2 - Network Chaos Scenario using Krkn-Hub
This scenario introduces network latency, packet loss, and bandwidth restriction in the egress traffic of a Node's interface using tc and netem. For more information, refer to the following documentation.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos and set the CERBERUS_ENABLED
environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:network-chaos
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs the exit code which can be considered as pass/fail for the scenario
$ docker run -e <VARIABLE>=<value> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:network-chaos
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs the exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
Note
Set export TRAFFIC_TYPE=egress for Egress scenarios and export TRAFFIC_TYPE=ingress for Ingress scenarios.
See the list of variables that apply to all scenarios here; they can be used/set in addition to these scenario-specific variables.
Egress Scenarios
Parameter | Description | Default |
---|---|---|
DURATION | Duration in seconds during which network chaos will be applied. | 300 |
NODE_NAME | Node name to inject faults in case of targeting a specific node; Can set multiple node names separated by a comma | "" |
LABEL_SELECTOR | When NODE_NAME is not specified, a node with matching label_selector is selected for running. | node-role.kubernetes.io/master |
INSTANCE_COUNT | Targeted instance count matching the label selector | 1 |
INTERFACES | List of interface on which to apply the network restriction. | [] |
EXECUTION | Execute each of the egress options as a single scenario (parallel) or as separate scenarios (serial). | parallel |
EGRESS | Dictionary of values to set network latency (latency: 50ms), packet loss (loss: 0.02), bandwidth restriction (bandwidth: 100mbit) | {bandwidth: 100mbit} |
Ingress Scenarios
Parameter | Description | Default |
---|---|---|
DURATION | Duration in seconds during which network chaos will be applied. | 300 |
TARGET_NODE_AND_INTERFACE | # Dictionary with key as node name(s) and value as a list of its interfaces to test. For example: {ip-10-0-216-2.us-west-2.compute.internal: [ens5]} | "" |
LABEL_SELECTOR | When NODE_NAME is not specified, a node with matching label_selector is selected for running. | node-role.kubernetes.io/master |
INSTANCE_COUNT | Targeted instance count matching the label selector | 1 |
EXECUTION | Used to specify whether you want to apply filters on interfaces one at a time or all at once. | parallel |
NETWORK_PARAMS | latency, loss and bandwidth are the three supported network parameters to alter for the chaos test. For example: {latency: 50ms, loss: ‘0.02’} | "" |
WAIT_DURATION | Ensure that it is at least about twice the test_duration | 300 |
Note
In case of using a custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:network-chaos
3.9 - Node Scenarios
This scenario disrupts the node(s) matching the label on a Kubernetes/OpenShift cluster.
3.9.1 - Node Scenarios using Krkn
The following node chaos scenarios are supported:
- node_start_scenario: Scenario to start the node instance.
- node_stop_scenario: Scenario to stop the node instance.
- node_stop_start_scenario: Scenario to stop and then start the node instance. Not supported on VMware.
- node_termination_scenario: Scenario to terminate the node instance.
- node_reboot_scenario: Scenario to reboot the node instance.
- stop_kubelet_scenario: Scenario to stop the kubelet of the node instance.
- stop_start_kubelet_scenario: Scenario to stop and start the kubelet of the node instance.
- restart_kubelet_scenario: Scenario to restart the kubelet of the node instance.
- node_crash_scenario: Scenario to crash the node instance.
- stop_start_helper_node_scenario: Scenario to stop and start the helper node and check service status.
Note
If the node does not recover from the node_crash_scenario injection, reboot the node to get it back to Ready state.
Note
node_start_scenario, node_stop_scenario, node_stop_start_scenario, node_termination_scenario, node_reboot_scenario and stop_start_kubelet_scenario are supported on AWS, Azure, OpenStack, BareMetal, GCP, VMware and Alibaba.
AWS
Cloud setup instructions can be found here. Sample scenario config can be found here.
Baremetal
Sample scenario config can be found here.
Note
Baremetal requires setting the IPMI user and password to power on, off, and reboot nodes, using the config options bm_user and bm_password. They can either be set in the root of the entry in the scenarios config, or they can be set per machine.
If no per-machine addresses are specified, kraken attempts to use the BMC value in the BareMetalHost object. To list them, you can do 'oc get bmh -o wide --all-namespaces'. If the BMC values are blank, you must specify them per-machine using the config option 'bmc_addr' as specified below.
For per-machine settings, add a “bmc_info” section to the entry in the scenarios config. Inside there, add a configuration section using the node name. In that, add per-machine settings. Valid settings are ‘bmc_user’, ‘bmc_password’, and ‘bmc_addr’. See the example node scenario or the example below.
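A rough sketch of how those settings might be laid out inside a baremetal node scenario entry (node name, address, and credentials are placeholders; verify the exact nesting against the example node scenario):
bm_user: admin                        # default IPMI/BMC user for all machines
bm_password: password                 # default IPMI/BMC password for all machines
bmc_info:
  node-1.example.internal:            # per-machine overrides keyed by node name
    bmc_addr: 10.0.0.5
    bmc_user: node1-admin
    bmc_password: node1-password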
Note
Baremetal requires oc (the OpenShift client) to be installed on the machine running Kraken.
Note
Baremetal machines are fragile. Some node actions can occasionally corrupt the filesystem if the node does not shut down properly, and sometimes the kubelet does not start properly.
Docker
The Docker provider can be used to run node scenarios against kind clusters.
kind is a tool for running local Kubernetes clusters using Docker container “nodes”.
kind was primarily designed for testing Kubernetes itself, but may be used for local development or CI.
GCP
Cloud setup instructions can be found here. Sample scenario config can be found here.
Openstack
How to set up Openstack cli to run node scenarios is defined here.
The supported node-level chaos scenarios on an OpenStack cloud are node_stop_start_scenario, stop_start_kubelet_scenario and node_reboot_scenario.
Note
For stop_start_helper_node_scenario, visit here to learn more about the helper node and its usage.
To execute the scenario, ensure the value for ssh_private_key in the node scenarios config file is set to the correct private key file path for the ssh connection to the helper node. Ensure passwordless ssh is configured between the host running Kraken and the helper node to avoid connection errors.
Azure
Cloud setup instructions can be found here. Sample scenario config can be found here.
Alibaba
How to set up Alibaba cli to run node scenarios is defined here.
Note
There is no "terminating" concept in Alibaba, so any scenario with terminate will "release" the node. Releasing a node is a 2-step process: stopping the node and then releasing it.
VMware
How to set up VMware vSphere to run node scenarios is defined here.
This cloud type uses a different configuration style; see the actions below and the example config file.
- vmware-node-terminate
- vmware-node-reboot
- vmware-node-stop
- vmware-node-start
IBMCloud
How to set up IBMCloud to run node scenarios is defined here.
This cloud type uses a different configuration style; see the actions below and the example config file.
- ibmcloud-node-terminate
- ibmcloud-node-reboot
- ibmcloud-node-stop
- ibmcloud-node-start
General
Note
The node_crash_scenario and stop_kubelet_scenario scenarios are supported independent of the cloud platform.
Use 'generic' or do not add the 'cloud_type' key to your scenario if your cluster is not set up using one of the currently supported cloud types.
3.9.2 - Node Scenarios using Krkn-Hub
This scenario disrupts the node(s) matching the label on a Kubernetes/OpenShift cluster. Actions/disruptions supported are listed here
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos and set the CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-scenarios
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-scenarios
OR
$ docker run -e <VARIABLE>=<value> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-scenarios
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
Parameter | Description | Default |
---|---|---|
ACTION | Action/disruption to run; can be any of the node actions listed above | node_stop_start_scenario for aws, vmware-node-reboot for vmware, ibmcloud-node-reboot for ibmcloud |
LABEL_SELECTOR | Node label to target | node-role.kubernetes.io/worker |
NODE_NAME | Node name to inject faults in case of targeting a specific node; Can set multiple node names separated by a comma | "" |
INSTANCE_COUNT | Targeted instance count matching the label selector | 1 |
RUNS | Iterations to perform action on a single node | 1 |
CLOUD_TYPE | Cloud platform on top of which cluster is running, supported platforms - aws, vmware, ibmcloud, bm | aws |
TIMEOUT | Duration to wait for completion of node scenario injection | 180 |
DURATION | Duration to stop the node before running the start action - not supported for vmware and ibm cloud type | 120 |
VERIFY_SESSION | Only needed for vmware - Set to True if you want to verify the vSphere client session using certificates | False |
SKIP_OPENSHIFT_CHECKS | Only needed for vmware - Set to True if you don’t want to wait for the status of the nodes to change on OpenShift before passing the scenario | False |
BMC_USER | Only needed for Baremetal ( bm ) - IPMI/bmc username | "" |
BMC_PASSWORD | Only needed for Baremetal ( bm ) - IPMI/bmc password | "" |
BMC_ADDR | Only needed for Baremetal ( bm ) - IPMI/bmc address | "" |
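For example, to stop and restart a single worker node on AWS (values are illustrative), export the variables before starting the node-scenarios container as shown in the Run section above:
$ export CLOUD_TYPE=aws
$ export ACTION=node_stop_start_scenario
$ export LABEL_SELECTOR=node-role.kubernetes.io/worker
$ export INSTANCE_COUNT=1
$ export DURATION=120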
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:container-scenarios
The following environment variables need to be set for the scenarios that require interacting with the cloud platform API to perform the actions:
Amazon Web Services
$ export AWS_ACCESS_KEY_ID=<>
$ export AWS_SECRET_ACCESS_KEY=<>
$ export AWS_DEFAULT_REGION=<>
VMware Vsphere
$ export VSPHERE_IP=<vSphere_client_IP_address>
$ export VSPHERE_USERNAME=<vSphere_client_username>
$ export VSPHERE_PASSWORD=<vSphere_client_password>
Ibmcloud
$ export IBMC_URL=https://<region>.iaas.cloud.ibm.com/v1
$ export IBMC_APIKEY=<ibmcloud_api_key>
Baremetal
$ export BMC_USER=<bmc/IPMI user>
$ export BMC_PASSWORD=<bmc/IPMI password>
$ export BMC_ADDR=<bmc address>
Google Cloud Platform
TBD
Azure
$ export AZURE_TENANT_ID=<>
$ export AZURE_CLIENT_SECRET=<>
$ export AZURE_CLIENT_ID=<>
OpenStack
TBD
Demo
You can find a link to a demo of the scenario here
3.10 - Pod Network Scenarios
Pod outage
Scenario to block the traffic (Ingress/Egress) of a pod matching the labels for the specified duration of time, to understand the behavior of the service and of the other services that depend on it during downtime. This helps with planning the requirements accordingly, be it improving the timeouts or tweaking the alerts. With the current network policies, it is not possible to explicitly block ports which are enabled by an allowed network policy rule. This chaos scenario addresses this issue by using OVS flow rules to block ports related to the pod. It supports OpenShiftSDN and OVNKubernetes based networks.
3.10.1 - Pod Network Scenarios using Krkn
Sample scenario config (using a plugin)
- id: pod_network_outage
config:
namespace: openshift-console # Required - Namespace of the pod to which filters need to be applied
direction: # Optional - List of directions to apply filters
- ingress # Blocks ingress traffic. Default: both egress and ingress
ingress_ports: # Optional - List of ports to block traffic on
- 8443 # Blocks 8443, Default [], i.e. all ports.
label_selector: 'component=ui' # Blocks access to openshift console
Pod Network shaping
Scenario to introduce network latency, packet loss, and bandwidth restriction in the Pod’s network interface. The purpose of this scenario is to observe faults caused by random variations in the network.
Sample scenario config for egress traffic shaping (using plugin)
- id: pod_egress_shaping
config:
namespace: openshift-console # Required - Namespace of the pod to which filters need to be applied.
label_selector: 'component=ui' # Applies traffic shaping to access openshift console.
network_params:
latency: 500ms # Add 500ms latency to egress traffic from the pod.
Sample scenario config for ingress traffic shaping (using plugin)
- id: pod_ingress_shaping
config:
namespace: openshift-console # Required - Namespace of the pod to which filters need to be applied.
label_selector: 'component=ui' # Applies traffic shaping to access openshift console.
network_params:
latency: 500ms # Add 500ms latency to ingress traffic to the pod.
Steps
- Pick the pods to introduce the network anomaly either from label_selector or pod_name.
- Identify the pod interface name on the node.
- Set traffic shaping config on pod’s interface using tc and netem (a rough sketch of the underlying commands follows this list).
- Wait for the duration time.
- Remove traffic shaping config on pod’s interface.
- Remove the job that spawned the pod.
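As a rough sketch of the kind of tc/netem commands applied in the shaping step (the interface name eth0 and the values below are assumptions for illustration; Krkn derives the actual interface and parameters from the scenario config):
# apply 500ms latency and 0.02% packet loss on the pod's interface
tc qdisc add dev eth0 root netem delay 500ms loss 0.02%
# remove the shaping once the test duration elapses
tc qdisc del dev eth0 root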
3.10.2 - Pod Network Chaos Scenarios using Krkn-hub
This scenario runs network chaos at the pod level on a Kubernetes/OpenShift cluster.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos and set the CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pod-network-chaos
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pod-network-chaos
OR
$ docker run -e <VARIABLE>=<value> --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pod-network-chaos
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
Parameter | Description | Default |
---|---|---|
NAMESPACE | Required - Namespace of the pod to which filters need to be applied | "" |
LABEL_SELECTOR | Label of the pod(s) to target | "" |
POD_NAME | When label_selector is not specified, pod matching the name will be selected for the chaos scenario | "" |
INSTANCE_COUNT | Number of pods to perform action/select that match the label selector | 1 |
TRAFFIC_TYPE | List of directions to apply filters - egress/ingress ( needs to be a list ) | [ingress, egress] |
INGRESS_PORTS | Ingress ports to block ( needs to be a list ) | [] i.e all ports |
EGRESS_PORTS | Egress ports to block ( needs to be a list ) | [] i.e all ports |
WAIT_DURATION | Ensure that it is at least about twice the test_duration | 300 |
TEST_DURATION | Duration of the test run | 120 |
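For example, to block ingress traffic on port 8443 for pods labelled component=ui in the openshift-console namespace (values are illustrative, reusing the Krkn sample config above):
$ export NAMESPACE=openshift-console
$ export LABEL_SELECTOR='component=ui'
$ export TRAFFIC_TYPE='[ingress]'
$ export INGRESS_PORTS='[8443]'
$ export TEST_DURATION=120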
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pod-network-chaos
3.11 - Pod Scenarios
Krkn recently replaced PowerfulSeal with its own internal pod scenarios using a plugin system. This scenario disrupts the pods matching the label in the specified namespace on a Kubernetes/OpenShift cluster.
3.11.1 - Pod Scenarios using Krkn
Example Config
To run plugin-based pod scenarios, list the scenario file(s) under plugin_scenarios in the Kraken config:
kraken:
chaos_scenarios:
- plugin_scenarios:
- path/to/scenario.yaml
You can then create the scenario file with the following contents:
# yaml-language-server: $schema=../plugin.schema.json
- id: kill-pods
config:
namespace_pattern: ^kube-system$
label_selector: k8s-app=kube-scheduler
krkn_pod_recovery_time: 120
Please adjust the schema reference to point to the schema file. This file will give you code completion and documentation for the available options in your IDE.
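For instance, a hypothetical scenario file targeting Prometheus pods (the namespace pattern and label below are assumptions for illustration; adjust them to your cluster):
# yaml-language-server: $schema=../plugin.schema.json
- id: kill-pods
  config:
    namespace_pattern: ^openshift-monitoring$            # illustrative namespace pattern
    label_selector: app.kubernetes.io/name=prometheus    # illustrative pod label
    krkn_pod_recovery_time: 120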
Pod Chaos Scenarios
The following are the components of Kubernetes/OpenShift for which a basic chaos scenario config exists today.
Component | Description | Working |
---|---|---|
Basic pod scenario | Kill a pod. | :heavy_check_mark: |
Etcd | Kills a single/multiple etcd replicas. | :heavy_check_mark: |
Kube ApiServer | Kills a single/multiple kube-apiserver replicas. | :heavy_check_mark: |
ApiServer | Kills a single/multiple apiserver replicas. | :heavy_check_mark: |
Prometheus | Kills a single/multiple prometheus replicas. | :heavy_check_mark: |
OpenShift System Pods | Kills random pods running in the OpenShift system namespaces. | :heavy_check_mark: |
3.11.2 - Pod Scenarios using Krkn-hub
This scenario disrupts the pods matching the label in the specified namespace on a Kubernetes/OpenShift cluster.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos and set the CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pod-scenarios
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pod-scenarios
OR
$ docker run -e <VARIABLE>=<value> --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pod-scenarios
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
Parameter | Description | Default |
---|---|---|
NAMESPACE | Targeted namespace in the cluster ( supports regex ) | openshift-.* |
POD_LABEL | Label of the pod(s) to target | "" |
NAME_PATTERN | Regex pattern to match the pods in NAMESPACE when POD_LABEL is not specified | .* |
DISRUPTION_COUNT | Number of pods to disrupt | 1 |
KILL_TIMEOUT | Timeout to wait for the target pod(s) to be removed in seconds | 180 |
EXPECTED_RECOVERY_TIME | Fails if the disrupted pod(s) do not recover within the timeout set | 120 |
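For example, to disrupt a single etcd pod (the namespace and label values are illustrative):
$ export NAMESPACE=openshift-etcd
$ export POD_LABEL='k8s-app=etcd'
$ export DISRUPTION_COUNT=1
$ export EXPECTED_RECOVERY_TIME=120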
Note
Set the NAMESPACE environment variable to openshift-.* to pick and disrupt pods randomly in openshift system namespaces; DAEMON_MODE can also be enabled to disrupt the pods every x seconds in the background to check the reliability.
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:container-scenarios
Demo
You can find a link to a demo of the scenario here
3.12 - Power Outage Scenarios
This scenario shuts down the Kubernetes/OpenShift cluster for the specified duration to simulate power outages, brings it back online, and checks if it’s healthy.
3.12.1 - Power Outage Scenario using Krkn
The power outage / cluster shutdown scenario can be injected by placing the shut_down config file under the cluster_shut_down_scenario option in the kraken config. Refer to the cluster_shut_down_scenario config file.
Refer to cloud setup to configure your cli properly for the cloud provider of the cluster you want to shut down.
Current accepted cloud types:
cluster_shut_down_scenario: # Scenario to stop all the nodes for specified duration and restart the nodes.
runs: 1 # Number of times to execute the cluster_shut_down scenario.
shut_down_duration: 120 # Duration in seconds to shut down the cluster.
cloud_type: aws # Cloud type on which Kubernetes/OpenShift runs.
3.12.2 - Power Outage Scenario using Krkn-Hub
This scenario shuts down the Kubernetes/OpenShift cluster for the specified duration to simulate power outages, brings it back online, and checks if it’s healthy. More information can be found here
Right now, power outage and cluster shutdown are one and the same. We originally created this scenario to stop all the nodes and then start them back up, the way a customer would shut their cluster down and bring it back.
In a real-life chaos scenario, though, this is close to what would happen if the power went out on the AWS side and all of the EC2 nodes were stopped/powered off. We looked into whether the AWS CLI had a way to forcefully power off the nodes (not gracefully), but it does not currently support that, so this scenario is as close as we can get to “pulling the plug”.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos and set the CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:power-outages
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:power-outages
OR
$ docker run -e <VARIABLE>=<value> --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:power-outages
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
Parameter | Description | Default |
---|---|---|
SHUTDOWN_DURATION | Duration in seconds to shut down the cluster | 1200 |
CLOUD_TYPE | Cloud platform on top of which cluster is running, supported cloud platforms | aws |
TIMEOUT | Time in seconds to wait for each node to be stopped or running after the cluster comes back | 600 |
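For example, to simulate a 20-minute outage on an AWS cluster (values are illustrative):
$ export CLOUD_TYPE=aws
$ export SHUTDOWN_DURATION=1200
$ export TIMEOUT=600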
The following environment variables need to be set for the scenarios that require interacting with the cloud platform API to perform the actions:
Amazon Web Services
$ export AWS_ACCESS_KEY_ID=<>
$ export AWS_SECRET_ACCESS_KEY=<>
$ export AWS_DEFAULT_REGION=<>
Google Cloud Platform
TBD
Azure
TBD
OpenStack
TBD
Baremetal
TBD
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:container-scenarios
Demo
You can find a link to a demo of the scenario here
3.13 - PVC Scenario
Scenario to fill up a given PersistentVolumeClaim by creating a temp file on the PVC from a pod associated with it. The purpose of this scenario is to fill up a volume to understand faults caused by the application using this volume.
3.13.1 - PVC Scenario using Krkn
Sample scenario config
pvc_scenario:
pvc_name: <pvc_name> # Name of the target PVC.
pod_name: <pod_name> # Name of the pod where the PVC is mounted. It will be ignored if the pvc_name is defined.
namespace: <namespace_name> # Namespace where the PVC is.
fill_percentage: 50 # Target percentage to fill up the PVC. Value must be higher than the current used percentage. Valid values are between 0 and 99.
duration: 60 # Duration in seconds for the fault.
Steps
- Get the pod name where the PVC is mounted.
- Get the volume name mounted in the container pod.
- Get the container name where the PVC is mounted.
- Get the mount path where the PVC is mounted in the pod.
- Get the PVC capacity and current used capacity.
- Calculate the file size needed to fill the PVC to the target fill_percentage (a worked example follows this list).
- Connect to the pod.
- Create a temp file kraken.tmp with random data on the mount path:
dd bs=1024 count=$file_size </dev/urandom > /mount_path/kraken.tmp
- Wait for the duration time.
- Remove the temp file created:
rm kraken.tmp
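A worked example of the size calculation under illustrative numbers (a 10 GiB PVC with 2 GiB already used and fill_percentage: 50); the exact formula Kraken uses internally may differ slightly:
# capacity and usage in KiB (illustrative values)
capacity_kb=10485760   # 10 GiB PVC
used_kb=2097152        # 2 GiB already used
fill_percentage=50
file_size=$(( capacity_kb * fill_percentage / 100 - used_kb ))   # 3145728 KiB (~3 GiB)
dd bs=1024 count=$file_size </dev/urandom > /mount_path/kraken.tmp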
3.13.2 - PVC Scenario using Krkn-Hub
This scenario fills up a given PersistentVolumeClaim by creating a temp file on the PVC from a pod associated with it. The purpose of this scenario is to fill up a volume to understand faults caused by the application using this volume. For more information refer to the following documentation.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos and set the CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pvc-scenarios
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pvc-scenarios
OR
$ docker run -e <VARIABLE>=<value> --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pvc-scenarios
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
If both PVC_NAME and POD_NAME are defined, the POD_NAME value will be overridden by the Mounted By: value from the PVC definition.
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
Parameter | Description | Default |
---|---|---|
PVC_NAME | Targeted PersistentVolumeClaim in the cluster (if null, POD_NAME is required) | |
POD_NAME | Targeted pod in the cluster (if null, PVC_NAME is required) | |
NAMESPACE | Targeted namespace in the cluster (required) | |
FILL_PERCENTAGE | Targeted percentage to be filled up in the PVC | 50 |
DURATION | Duration in seconds with the PVC filled up | 60 |
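For example, to fill a specific PVC to 50% for one minute (names are placeholders):
$ export PVC_NAME=<pvc_name>
$ export NAMESPACE=<namespace>
$ export FILL_PERCENTAGE=50
$ export DURATION=60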
Note
Set the NAMESPACE environment variable to openshift-.* to pick and disrupt pods randomly in openshift system namespaces; DAEMON_MODE can also be enabled to disrupt the pods every x seconds in the background to check the reliability.
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pvc-scenarios
3.14 - Service Disruption Scenarios
Using this type of scenario configuration one is able to delete crucial objects in a specific namespace, or a namespace matching a certain regex string.
3.14.1 - Service Disruption Scenarios using Krkn
Configuration Options:
namespace: Specific namespace or regex style namespace of what you want to delete. Gets all namespaces if not specified; set to "" if you want to use the label_selector field.
Set to ‘^.*$’ and label_selector to "" to randomly select any namespace in your cluster.
label_selector: Label on the namespace you want to delete. Set to "" if you are using the namespace variable.
delete_count: Number of namespaces to kill in each run. Based on matching namespace and label specified, default is 1.
runs: Number of runs/iterations to kill namespaces, default is 1.
sleep: Number of seconds to wait between each iteration/count of killing namespaces. Defaults to 10 seconds if not set
Refer to namespace_scenarios_example config file.
scenarios:
- namespace: "^.*$"
runs: 1
- namespace: "^.*ingress.*$"
runs: 1
sleep: 15
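A hypothetical variant that selects the namespace by label instead of by name (the label value is an assumption for illustration; leave namespace empty when using label_selector, per the configuration options above):
scenarios:
- namespace: ""
  label_selector: "chaos-target=true"
  runs: 1
  sleep: 15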
Steps
This scenario will select one or more namespaces, depending on the configuration, kill all of the below object types in those namespaces, and wait for them to be Running again in the post action:
- Services
- Daemonsets
- Statefulsets
- Replicasets
- Deployments
Post Action
We do a post-chaos check to wait and verify that the specific objects in each namespace are Ready.
There are two options here:
- Pass a custom script in the main config scenario list that will run before the chaos and verify that its output matches post chaos.
See scenarios/post_action_namespace.py for an example
- namespace_scenarios:
- - scenarios/regex_namespace.yaml
- scenarios/post_action_namespace.py
- Allow kraken to wait and check all killed objects in the namespaces become ‘Running’ again. Kraken keeps a list of the specific objects in namespaces that were killed to verify all that were affected recover properly.
wait_time: <seconds to wait for namespace to recover>
3.14.2 - Service Disruption Scenario using Krkn-Hub
This scenario deletes main objects within a namespace in your Kubernetes/OpenShift cluster. More information can be found here.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos and set the CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:service-disruption-scenarios
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:service-disruption-scenarios
OR
$ docker run -e <VARIABLE>=<value> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:service-disruption-scenarios
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
Parameter | Description | Default |
---|---|---|
LABEL_SELECTOR | Label of the namespace to target. Set this parameter only if NAMESPACE is not set | "" |
NAMESPACE | Name of the namespace you want to target. Set this parameter only if LABEL_SELECTOR is not set | “openshift-etcd” |
SLEEP | Number of seconds to wait before polling to see if namespace exists again | 15 |
DELETE_COUNT | Number of namespaces to kill in each run, based on matching namespace and label specified | 1 |
RUNS | Number of runs to execute the action | 1 |
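For example, to disrupt the openshift-etcd namespace once (values are illustrative):
$ export NAMESPACE=openshift-etcd
$ export DELETE_COUNT=1
$ export RUNS=1
$ export SLEEP=15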
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:service-disruption-scenarios
Demo
You can find a link to a demo of the scenario here
3.15 - Service Hijacking Scenario
Service Hijacking Scenarios aim to simulate fake HTTP responses from a workload targeted by a Service already deployed in the cluster. This scenario is executed by deploying a custom-made web service and modifying the target Service selector to direct traffic to this web service for a specified duration.
3.15.1 - Service Hijacking Scenarios using Krkn
The web service’s source code is available here. It employs a time-based test plan from the scenario configuration file, which specifies the behavior of resources during the chaos scenario as follows:
service_target_port: http-web-svc # The port of the service to be hijacked (can be named or numeric, based on the workload and service configuration).
service_name: nginx-service # The name of the service that will be hijacked.
service_namespace: default # The namespace where the target service is located.
image: quay.io/krkn-chaos/krkn-service-hijacking:v0.1.3 # Image of the krkn web service to be deployed to receive traffic.
chaos_duration: 30 # Total duration of the chaos scenario in seconds.
plan:
- resource: "/list/index.php" # Specifies the resource or path to respond to in the scenario. For paths, both the path and query parameters are captured but ignored. For resources, only query parameters are captured.
steps: # A time-based plan consisting of steps can be defined for each resource.
GET: # One or more HTTP methods can be specified for each step. Note: Non-standard methods are supported for fully custom web services (e.g., using NONEXISTENT instead of POST).
- duration: 15 # Duration in seconds for this step before moving to the next one, if defined. Otherwise, this step will continue until the chaos scenario ends.
status: 500 # HTTP status code to be returned in this step.
mime_type: "application/json" # MIME type of the response for this step.
payload: | # The response payload for this step.
{
"status":"internal server error"
}
- duration: 15
status: 201
mime_type: "application/json"
payload: |
{
"status":"resource created"
}
POST:
- duration: 15
status: 401
mime_type: "application/json"
payload: |
{
"status": "unauthorized"
}
- duration: 15
status: 404
mime_type: "text/plain"
payload: "not found"
The scenario will focus on the service_name within the service_namespace, substituting the selector with a randomly generated one, which is added as a label in the mock service manifest. This allows multiple scenarios to be executed in the same namespace, each targeting different services without causing conflicts.
The newly deployed mock web service will expose a service_target_port, which can be either a named or numeric port based on the service configuration. This ensures that the Service correctly routes HTTP traffic to the mock web service during the chaos run.
Each step will last for duration seconds from the deployment of the mock web service in the cluster. For each HTTP resource, defined as a top-level YAML property of the plan (it could be a specific resource, e.g., /list/index.php, or a path-based resource typical in MVC frameworks), one or more HTTP request methods can be specified. Both standard and custom request methods are supported.
During this time frame, the web service will respond with:
- status: The HTTP status code (can be standard or custom).
- mime_type: The MIME type (can be standard or custom).
- payload: The response body to be returned to the client.
At the end of the step duration, the web service will proceed to the next step (if available) until the global chaos_duration concludes. At this point, the original service will be restored, and the custom web service and its resources will be undeployed.
NOTE: Some clients (e.g., cURL, jQuery) may optimize queries using lightweight methods (like HEAD or OPTIONS) to probe API behavior. If these methods are not defined in the test plan, the web service may respond with a 405 or 404 status code. If you encounter unexpected behavior, consider this use case.
3.15.2 - Service Hijacking Scenario using Krkn-Hub
This scenario reroutes traffic intended for a target service to a custom web service that is automatically deployed by Krkn. This web service responds with user-defined HTTP statuses, MIME types, and bodies. For more details, please refer to the following documentation.
Run
Unlike other krkn-hub scenarios, this one requires a specific configuration due to its unique structure. You must set up the scenario in a local file following the scenario syntax, and then pass this file’s base64-encoded content to the container via the SCENARIO_BASE64 variable.
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos and set the CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> \
-e SCENARIO_BASE64="$(base64 -w0 <scenario_file>)" \
-v <path_to_kubeconfig>:/home/krkn/.kube/config:Z quay.io/krkn-chaos/krkn-hub:service-hijacking
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
$ export SCENARIO_BASE64="$(base64 -w0 <scenario_file>)"
$ docker run $(./get_docker_params.sh) --name=<container_name> \
--net=host \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d quay.io/krkn-chaos/krkn-hub:service-hijacking
OR
$ docker run --name=<container_name> -e SCENARIO_BASE64="$(base64 -w0 <scenario_file>)" \
--net=host \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d quay.io/krkn-chaos/krkn-hub:service-hijacking
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
Parameter | Description |
---|---|
SCENARIO_BASE64 | Base64 encoded service-hijacking scenario file. Note that the -w0 option in the command substitution SCENARIO_BASE64="$(base64 -w0 <scenario_file>)" is mandatory in order to remove line breaks from the base64 command output |
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
$ podman run -e SCENARIO_BASE64="$(base64 -w0 <scenario_file>)" \
--name=<container_name> \
--net=host \
--env-host=true \
-v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml \
-v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d quay.io/krkn-chaos/krkn-hub:service-hijacking
3.16 - Time Scenarios
Using this type of scenario configuration, one is able to change the time and/or date of the system for pods or nodes.
3.16.1 - Time Scenarios using Krkn
Configuration Options:
action: skew_time or skew_date.
object_type: pod or node.
namespace: namespace of the pods you want to skew. Needs to be set if setting a specific pod name.
label_selector: Label on the nodes or pods you want to skew.
container_name: Container name in pod you want to reset time on. If left blank it will randomly select one.
object_name: List of the names of pods or nodes you want to skew.
Refer to time_scenarios_example config file.
time_scenarios:
- action: skew_time
object_type: pod
object_name:
- apiserver-868595fcbb-6qnsc
- apiserver-868595fcbb-mb9j5
namespace: openshift-apiserver
container_name: openshift-apiserver
- action: skew_date
object_type: node
label_selector: node-role.kubernetes.io/worker
3.16.2 - Time Skew Scenarios using Krkn-Hub
This scenario skews the date and time of the nodes and pods matching the label on a Kubernetes/OpenShift cluster. More information can be found here.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos and set the CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:time-scenarios
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:time-scenarios
OR
$ docker run -e <VARIABLE>=<value> --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:time-scenarios
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
Parameter | Description | Default |
---|---|---|
OBJECT_TYPE | Object to target. Supported options: pod, node | pod |
LABEL_SELECTOR | Label of the container(s) or nodes to target | k8s-app=etcd |
ACTION | Action to run. Supported actions: skew_time, skew_date | skew_date |
OBJECT_NAME | List of the names of pods or nodes you want to skew ( optional parameter ) | [] |
CONTAINER_NAME | Container in the specified pod to target in case the pod has multiple containers running. Random container is picked if empty | "" |
NAMESPACE | Namespace of the pods you want to skew; needs to be set only if setting a specific pod name | "" |
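For example, to skew the time on worker nodes (values are illustrative):
$ export OBJECT_TYPE=node
$ export ACTION=skew_time
$ export LABEL_SELECTOR=node-role.kubernetes.io/worker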
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:container-scenarios
Demo
You can find a link to a demo of the scenario here
3.17 - Zone Outage Scenarios
Scenario to create an outage in a targeted zone in the public cloud to understand the impact on both the Kubernetes/OpenShift control plane and the applications running on the worker nodes in that zone. It tweaks the network ACL of the zone to simulate the failure, which in turn stops both ingress and egress traffic from all the nodes in that zone for the specified duration, and then reverts the ACL back to its previous state.
3.17.1 - Zone Outage Scenarios using Krkn
Zone outage can be injected by placing the zone_outage config file under zone_outages option in the kraken config. Refer to zone_outage_scenario config file for the parameters that need to be defined.
Refer to cloud setup to configure your cli properly for the cloud provider of the cluster you want to shut down.
Current accepted cloud types:
Sample scenario config
zone_outage: # Scenario to create an outage of a zone by tweaking network ACL.
cloud_type: aws # Cloud type on which Kubernetes/OpenShift runs. aws is the only platform supported currently for this scenario.
duration: 600 # Duration in seconds after which the zone will be back online.
vpc_id: # Cluster virtual private network to target.
subnet_id: [subnet1, subnet2] # List of subnet-id's to deny both ingress and egress traffic.
Note
vpc_id and subnet_id can be obtained from the cloud web console by selecting one of the instances in the targeted zone (us-west-2a for example).
Note
Multiple zones will experience downtime when targeting multiple subnets, which might have an impact on cluster health, especially if the zones have control plane components deployed.
Debugging steps in case of failures
In case of failures during the steps which revert the network ACL to allow traffic and bring back the cluster nodes in the zone, the nodes in that zone will be in the NotReady condition. Here is how to fix it:
- OpenShift by default deploys the nodes in different zones for fault tolerance, for example us-west-2a, us-west-2b, us-west-2c. The cluster is associated with a virtual private network and each zone has its own subnet with a network acl which defines the ingress and egress traffic rules at the zone level unlike security groups which are at an instance level.
- From the cloud web console, select one of the instances in the zone which is down and go to the subnet_id specified in the config.
- Look at the network acl associated with the subnet and you will see both ingress and egress traffic being denied which is expected as Kraken deliberately injects it.
- Kraken just switches the network acl while still keeping the original or default network acl around; switching to the default network acl from the drop-down menu will bring the nodes in the targeted zone back into the Ready state.
3.17.2 - Zone Outage Scenarios using Krkn-Hub
This scenario disrupts a targeted zone in the public cloud by blocking egress and ingress traffic to understand the impact on both the Kubernetes/OpenShift control plane as well as the applications running on the worker nodes in that zone. More information is documented here
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos and set the CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:zone-outages
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:zone-outages
OR
$ docker run -e <VARIABLE>=<value> --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:zone-outages
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
Parameter | Description | Default |
---|---|---|
CLOUD_TYPE | Cloud platform on top of which cluster is running, supported cloud platforms | aws |
DURATION | Duration in seconds after which the zone will be back online | 600 |
VPC_ID | cluster virtual private network to target ( REQUIRED ) | "" |
SUBNET_ID | subnet-id to deny both ingress and egress traffic ( REQUIRED ). Format: [subnet1, subnet2] | "" |
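For example, to take a zone's subnets offline for 10 minutes on AWS (the IDs are placeholders):
$ export CLOUD_TYPE=aws
$ export DURATION=600
$ export VPC_ID=<vpc_id>
$ export SUBNET_ID='[subnet1, subnet2]'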
The following environment variables need to be set for the scenarios that require interacting with the cloud platform API to perform the actions:
Amazon Web Services
$ export AWS_ACCESS_KEY_ID=<>
$ export AWS_SECRET_ACCESS_KEY=<>
$ export AWS_DEFAULT_REGION=<>
Google Cloud Platform
TBD
Azure
TBD
OpenStack
TBD
Baremetal
TBD
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:container-scenarios
Demo
You can find a link to a demo of the scenario here
3.18 - All Scenarios Variables
These variables are to be used for the top-level configuration template that is shared by all the scenarios.
See the description and default values below
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
Parameter | Description | Default |
---|---|---|
CERBERUS_ENABLED | Set this to true if cerberus is running and monitoring the cluster | False |
CERBERUS_URL | URL to poll for the go/no-go signal | http://0.0.0.0:8080 |
WAIT_DURATION | Duration in seconds to wait between each chaos scenario | 60 |
ITERATIONS | Number of times to execute the scenarios | 1 |
DAEMON_MODE | Iterations are set to infinity which means that the kraken will cause chaos forever | False |
PUBLISH_KRAKEN_STATUS | Whether to publish the kraken status at the signal address/port | True |
SIGNAL_ADDRESS | Address to print kraken status to | 0.0.0.0 |
PORT | Port to print kraken status to | 8081 |
SIGNAL_STATE | Waits for the RUN signal when set to PAUSE before running the scenarios, refer docs for more details | RUN |
DEPLOY_DASHBOARDS | Deploys mutable grafana loaded with dashboards visualizing performance metrics pulled from in-cluster prometheus. The dashboard will be exposed as a route. | False |
CAPTURE_METRICS | Captures metrics as specified in the profile from in-cluster prometheus. Default metrics captures are listed here | False |
ENABLE_ALERTS | Evaluates expressions from in-cluster prometheus and exits 0 or 1 based on the severity set. Default profile. More details can be found here | False |
ALERTS_PATH | Path to the alerts file to use when ENABLE_ALERTS is set | config/alerts |
CHECK_CRITICAL_ALERTS | When enabled will check prometheus for critical alerts firing post chaos | False |
TELEMETRY_ENABLED | Enable/disables the telemetry collection feature | False |
TELEMETRY_API_URL | telemetry service endpoint | https://ulnmf9xv7j.execute-api.us-west-2.amazonaws.com/production |
TELEMETRY_USERNAME | telemetry service username | redhat-chaos |
TELEMETRY_PASSWORD | telemetry service password | No default |
TELEMETRY_PROMETHEUS_BACKUP | enables/disables prometheus data collection | True |
TELEMTRY_FULL_PROMETHEUS_BACKUP | if set to False, only the /prometheus/wal folder will be downloaded | False |
TELEMETRY_BACKUP_THREADS | number of telemetry download/upload threads | 5 |
TELEMETRY_ARCHIVE_PATH | local path where the archive files will be temporarily stored | /tmp |
TELEMETRY_MAX_RETRIES | maximum number of upload retries (if 0 will retry forever) | 0 |
TELEMETRY_RUN_TAG | if set, this will be appended to the run folder in the bucket (useful to group the runs) | chaos |
TELEMETRY_GROUP | if set will archive the telemetry in the S3 bucket on a folder named after the value | default |
TELEMETRY_ARCHIVE_SIZE | the size of each prometheus data archive chunk in KB; the lower the size, the higher the number of archive files produced (see the note below) | 1000 |
TELEMETRY_LOGS_BACKUP | Logs backup to s3 | False |
TELEMETRY_FILTER_PATTER | Filter logs based on certain time stamp patterns | ["(\w{3}\s\d{1,2}\s\d{2}:\d{2}:\d{2}\.\d+).+",“kinit (\d+/\d+/\d+\s\d{2}:\d{2}:\d{2})\s+”,"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z).+"] |
TELEMETRY_CLI_PATH | OC CLI path; if not specified it will be searched for in $PATH | blank |
ELASTIC_SERVER | Enables tracking telemetry data in Elasticsearch; this is the URL of the Elasticsearch data storage | blank |
ELASTIC_INDEX | Elastic search index pattern to post results to | blank |
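For example (values are illustrative), the common settings shared by all scenarios can be exported the same way before starting any scenario container:
$ export CERBERUS_ENABLED=True
$ export CERBERUS_URL=http://0.0.0.0:8080
$ export WAIT_DURATION=60
$ export ITERATIONS=1
$ export ENABLE_ALERTS=True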
Note
The lower the TELEMETRY_ARCHIVE_SIZE, the higher the number of archive files that will be produced and uploaded (and processed by the backup threads simultaneously). For unstable/slow connections it is better to keep this value low and increase the number of backup_threads; this way, on upload failure, the retry will happen only on the failed chunk without affecting the whole upload.
3.19 - Supported Cloud Providers
AWS
NOTE: For clusters with AWS make sure AWS CLI is installed and properly configured using an AWS account
GCP
NOTE: For clusters with GCP make sure GCP CLI is installed.
A google service account is required to give proper authentication to GCP for node actions. See here for how to create a service account.
NOTE: A user with ‘resourcemanager.projects.setIamPolicy’ permission is required to grant project-level permissions to the service account.
After creating the service account you will need to enable the account using the following: export GOOGLE_APPLICATION_CREDENTIALS="<serviceaccount.json>"
Openstack
NOTE: For clusters with Openstack Cloud, ensure to create and source the OPENSTACK RC file to set the OPENSTACK environment variables from the server where Kraken runs.
Azure
NOTE: You will need to create a service principal and give it the correct access; see here for creating the service principal and setting the proper permissions.
To run properly, the service principal requires the “Azure Active Directory Graph/Application.ReadWrite.OwnedBy” API permission and the “User Access Administrator” role.
Before running you will need to set the following:
export AZURE_SUBSCRIPTION_ID=<subscription_id>
export AZURE_TENANT_ID=<tenant_id>
export AZURE_CLIENT_SECRET=<client secret>
export AZURE_CLIENT_ID=<client id>
Alibaba
See the Installation guide to install alicloud cli.
export ALIBABA_ID=<access_key_id>
export ALIBABA_SECRET=<access key secret>
export ALIBABA_REGION_ID=<region id>
Refer to the region and zone page to get the region id for the region you are running on.
Set cloud_type to either alibaba or alicloud in your node scenario yaml file.
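As an illustration, a node scenario entry with the cloud type set for Alibaba might look like the following minimal sketch; the exact actions and fields supported depend on your Krkn version, so check the node scenario documentation before using it:
node_scenarios:
  - actions:                                        # node chaos actions to inject
      - node_stop_start_scenario
    label_selector: node-role.kubernetes.io/worker  # pick target nodes by label
    instance_count: 1                               # number of nodes to act on
    timeout: 360                                    # seconds to wait for the node to recover
    cloud_type: alibaba                             # or alicloud, as noted above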
VMware
Set the following environment variables
export VSPHERE_IP=<vSphere_client_IP_address>
export VSPHERE_USERNAME=<vSphere_client_username>
export VSPHERE_PASSWORD=<vSphere_client_password>
These are the credentials that you would normally use to access the vSphere client.
IBMCloud
If you do not have an API key set up with the proper VPC resource permissions, create the following:
- Access group
- Service id with the following access
- With policy VPC Infrastructure Services
- Resources = All
- Roles:
- Editor
- Administrator
- Operator
- Viewer
- API Key
Set the following environment variables
export IBMC_URL=https://<region>.iaas.cloud.ibm.com/v1
export IBMC_APIKEY=<ibmcloud_api_key>
4 - Chaos Testing Guide
Table of Contents
- Test Strategies and Methodology
- Best Practices
- Tooling
- Scenarios
- Test Environment Recommendations - how and where to run chaos tests
- Chaos testing in Practice
Test Strategies and Methodology
Failures in production are costly. To help mitigate risk to service health, consider the following strategies and approaches to service testing:
Be proactive vs reactive. We have different types of test suites in place - unit, integration and end-to-end - that help expose bugs in code in a controlled environment. Through implementation of a chaos engineering strategy, we can discover potential causes of service degradation. We need to understand the systems’ behavior under unpredictable conditions in order to find the areas to harden, and use performance data points to size the clusters to handle failures in order to keep downtime to a minimum.
Test the resiliency of a system under turbulent conditions by running tests that are designed to disrupt while monitoring the system's adaptability and performance:
- Establish and define your steady state and metrics - understand the behavior and performance under stable conditions and define the metrics that will be used to evaluate the system’s behavior. Then decide on acceptable outcomes before injecting chaos.
- Analyze the statuses and metrics of all components during the chaos test runs.
- Improve the areas that are not resilient and performant by comparing the key metrics and Service Level Objectives (SLOs) to the stable conditions before the chaos. For example: evaluating the API server latency or application uptime to see if the key performance indicators and service level indicators are still within acceptable limits.
Best Practices
Now that we understand the test methodology, let us take a look at the best practices for a Kubernetes cluster. On that platform there are user applications and cluster workloads that need to be designed for stability and to provide the best user experience possible:
Alerts with appropriate severity should get fired.
- Alerts are key to identify when a component starts degrading, and can help focus the investigation effort on affected system components.
- Alerts should have proper severity, description, notification policy, escalation policy, and SOP in order to reduce MTTR for responding SRE or Ops resources.
- Detailed information on the alerts consistency can be found here.
Minimal performance impact - Network, CPU, Memory, Disk, Throughput etc.
- The system, as well as the applications, should be designed to have minimal performance impact during disruptions to ensure stability and also to avoid hogging resources that other applications can use. We want to look at this in terms of CPU, Memory, Disk, Throughput, Network etc.
Appropriate CPU/Memory limits set to avoid performance throttling and OOM kills.
- There might be rogue applications hogging resources ( CPU/Memory ) on the nodes which might lead to applications underperforming or worse getting OOM killed. It is important to ensure that applications and system components have reserved resources for the kube-scheduler to take into consideration in order to keep them performing at the expected levels.
Services dependent on the system under test need to handle the failure gracefully to avoid performance degradation and downtime - appropriate timeouts.
- In a distributed system, services deployed coordinate with each other and might have external dependencies. Each of the services deployed as a deployment, pod, or container, need to handle the downtime of other dependent services gracefully instead of crashing due to not having appropriate timeouts, fallback logic etc.
Proper node sizing to avoid cascading failures and ensure cluster stability especially when the cluster is large and dense
- The platform needs to be sized taking into account the resource usage spikes that might occur during chaotic events. For example, if one of the main nodes goes down, the other two main nodes need to have enough resources to handle the load. The resource usage depends on the load, i.e. the number of objects running on the cluster and being managed by the Control Plane ( API Server, Etcd, Controller and Scheduler ). As such, it's critical to test such conditions, understand the behavior, and leverage the data to size the platform appropriately. This can help keep the applications stable during unplanned events without the control plane undergoing cascading failures which can potentially bring down the entire cluster.
Proper node sizing to avoid application failures and maintain stability.
- An application pod might use more resources during reinitialization after a crash, so it is important to take that into account for sizing the nodes in the cluster to accommodate it. For example, monitoring solutions like Prometheus need high amounts of memory to replay the write ahead log ( WAL ) when it restarts. As such, it’s critical to test such conditions, understand the behavior, and leverage the data to size the platform appropriately. This can help keep the application stable during unplanned events without undergoing degradation in performance or even worse hog the resources on the node which can impact other applications and system pods.
Minimal initialization time and fast recovery logic.
- The controller watching the component should recognize a failure as soon as possible. The component needs to have minimal initialization time to avoid extended downtime or overloading the replicas if it is a highly available configuration. The cause of failure can be because of issues with the infrastructure on top of which it is running, application failures, or because of service failures that it depends on.
High Availability deployment strategy.
- There should be multiple replicas ( both Kubernetes and application control planes ) running preferably in different availability zones to survive outages while still serving the user/system requests. Avoid single points of failure.
Backed by persistent storage
- It is important to have the system/application backed by persistent storage. This is especially important in cases where the application is a database or a stateful application given that a node, pod, or container failure will wipe off the data.
There should be fallback routes to the backend in case of using CDN, for example, Akamai in case of console.redhat.com - a managed service deployed on top of Kubernetes dedicated:
- Content delivery networks (CDNs) are commonly used to host resources such as images, JavaScript files, and CSS. The average web page is nearly 2 MB in size, and offloading heavy resources to third-parties is extremely effective for reducing backend server traffic and latency. However, this makes each CDN an additional point of failure for every site that relies on it. If the CDN fails, its customers could also fail.
- To test how the application reacts to failures, drop all network traffic between the system and CDN. The application should still serve the content to the user irrespective of the failure.
Appropriate caching and Content Delivery Network should be enabled to be performant and usable when there is a latency on the client side.
- Not every user or machine has access to unlimited bandwidth, there might be a delay on the user side ( client ) to access the API’s due to limited bandwidth, throttling or latency depending on the geographic location. It is important to inject latency between the client and API calls to understand the behavior and optimize things including caching wherever possible, using CDN’s or opting for different protocols like HTTP/2 or HTTP/3 vs HTTP.
Tooling
Now that we have looked at the best practices, this section goes through how Kraken - a chaos testing framework - can help test the resilience of Kubernetes and make sure the applications and services follow them.
Cluster recovery checks, metrics evaluation and pass/fail criteria
Most of the scenarios have built-in checks to verify whether the targeted component recovered from the failure within the specified duration, but a failure in one component can also impact other components, so it is extremely important to make sure that the system/application is healthy as a whole post chaos. This is exactly where Cerberus comes to the rescue. When the Cerberus monitoring tool is enabled, its health signal is consumed to decide whether to continue running chaos.
Apart from checking the recovery and cluster health status, it's equally important to evaluate performance metrics like latency, resource usage spikes, throughput, and etcd health such as disk fsync and leader elections. To help with this, Kraken has a way to evaluate PromQL expressions from the in-cluster prometheus and set the exit status to 0 or 1 based on the severity set for each of the queries. Details on how to use this feature can be found here.
The overall pass or fail of kraken is based on the recovery of the specific component (within a certain amount of time), the cerberus health signal which tracks the health of the entire cluster, and the metrics evaluation from the in-cluster prometheus.
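For illustration, an entry in the alerts file pairs a PromQL expression with a description and a severity. This is a minimal sketch assuming the expr/description/severity layout used by recent Krkn releases; check the linked documentation for the exact schema and which severities fail the run:
- expr: avg_over_time(histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))[5m:]) > 0.01  # PromQL evaluated against the in-cluster prometheus
  description: 5 minutes avg. etcd fsync latency on {{$labels.pod}} higher than 10ms {{$value}}
  severity: warning  # the severity determines how a firing expression affects the exit status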
Scenarios
Let us take a look at how to run the chaos scenarios on your Kubernetes clusters using Kraken-hub - a lightweight wrapper around Kraken that eases the runs by letting you launch them as container images using podman, with parameters set as environment variables. This eliminates the need to carry around and edit configuration files and makes it easy for any CI framework integration. Here are the scenarios supported:
Pod Scenarios (Documentation)
- Disrupts Kubernetes and applications deployed as pods:
- Helps understand the availability of the application, the initialization timing and recovery status.
- Demo
Container Scenarios (Documentation)
- Disrupts Kubernetes and applications deployed as containers running as part of a pod(s) using a specified kill signal to mimic failures:
- Helps understand the impact and recovery timing when the programs/processes running in the containers are disrupted - hung, paused, killed etc. - using various kill signals, i.e. SIGHUP, SIGTERM, SIGKILL etc.
- Demo
Node Scenarios (Documentation)
- Disrupts nodes as part of the cluster infrastructure by talking to the cloud API. AWS, Azure, GCP, OpenStack and Baremetal are the supported platforms as of now. Possible disruptions include:
- Terminate nodes
- Fork bomb inside the node
- Stop the node
- Crash the kubelet running on the node
- etc.
- Demo
Zone Outages (Documentation)
- Creates an outage of availability zone(s) in a targeted region of the public cloud where the Kubernetes cluster is running by tweaking the network ACL of the zone to simulate the failure; this stops both ingress and egress traffic from all nodes in that zone for the specified duration and then reverts the ACL back to its previous state.
- Helps understand the impact on both the Kubernetes control plane as well as applications and services running on the worker nodes in that zone.
- Currently, only set up for the AWS cloud platform: 1 VPC and multiple subnets within the VPC can be specified.
- Demo
Application Outages (Documentation)
- Scenario to block the traffic ( Ingress/Egress ) of an application matching the labels for the specified duration of time to understand the behavior of the service/other services which depend on it during the downtime.
- Helps understand how the dependent services react to the unavailability.
- Demo
Power Outages (Documentation)
- This scenario imitates a power outage by shutting down the entire cluster for a specified duration of time, then restarts all the nodes after the specified time and checks the health of the cluster.
- There are various use cases in the customer environments. For example, when some of the clusters are shutdown in cases where the applications are not needed to run in a particular time/season in order to save costs.
- The nodes are stopped in parallel to mimic a power outage i.e., pulling off the plug
- Demo
Resource Hog (Documentation)
- Hogs CPU, Memory and IO on the targeted nodes
- Helps understand if the application/system components have reserved resources to not get disrupted because of rogue applications, or get performance throttled.
- CPU Hog (Documentation, Demo)
- Memory Hog (Documentation, Demo)
Time Skewing (Documentation)
- Manipulate the system time and/or date of specific pods/nodes.
- Verify scheduling of objects so they continue to work.
- Verify time gets reset properly.
Namespace Failures (Documentation)
- Delete namespaces for the specified duration.
- Helps understand the impact on other components and tests/improves recovery time of the components in the targeted namespace.
Persistent Volume Fill (Documentation)
- Fills up the persistent volumes, up to a given percentage, used by the pod for the specified duration.
- Helps understand how an application deals when it is no longer able to write data to the disk. For example, kafka’s behavior when it is not able to commit data to the disk.
Network Chaos (Documentation)
- Supported scenarios include:
- Network latency
- Packet loss
- Interface flapping
- DNS errors
- Packet corruption
- Bandwidth limitation
Pod Network Scenario (Documentation)
- Scenario to block the traffic ( Ingress/Egress ) of a pod matching the labels for the specified duration of time to understand the behavior of the service/other services which depend on it during downtime. This helps with planning the requirements accordingly, be it improving the timeouts or tweaking the alerts etc.
- With the current network policies, it is not possible to explicitly block ports which are enabled by allowed network policy rule. This chaos scenario addresses this issue by using OVS flow rules to block ports related to the pod. It supports OpenShiftSDN and OVNKubernetes based networks.
Service Disruption Scenarios (Documentation)
- Using this type of scenario configuration one is able to delete crucial objects in a specific namespace, or a namespace matching a certain regex string.
Service Hijacking Scenarios (Documentation)
- Service Hijacking Scenarios aim to simulate fake HTTP responses from a workload targeted by a Service already deployed in the cluster. This scenario is executed by deploying a custom-made web service and modifying the target Service selector to direct traffic to this web service for a specified duration.
Test Environment Recommendations - how and where to run chaos tests
Let us take a look at a few recommendations on how and where to run the chaos tests:
Run the chaos tests continuously in your test pipelines:
- Software, systems, and infrastructure do change, and the condition/health of each can change pretty rapidly. A good place to run tests is in your CI/CD pipeline running on a regular cadence.
Run the chaos tests manually to learn from the system:
- When running a Chaos scenario or Fault tests, it is more important to understand how the system responds and reacts, rather than mark the execution as pass or fail.
- It is important to define the scope of the test before the execution to avoid some issues from masking others.
Run the chaos tests in production environments or mimic the load in staging environments:
- As scary as the thought of testing in production is, production is the environment that users are in and traffic spikes/load are real. To fully test the robustness/resilience of a production system, running Chaos Engineering experiments in a production environment will provide needed insights. A couple of things to keep in mind:
- Minimize blast radius and have a backup plan in place to make sure the users and customers do not undergo downtime.
- Mimic the load in a staging environment in case Service Level Agreements are too tight to cover any downtime.
Enable Observability:
- Chaos Engineering Without Observability … Is Just Chaos.
- Make sure to have logging and monitoring installed on the cluster to help with understanding the behaviour and why it is happening. When running the tests in CI, where it is not humanly possible to monitor the cluster all the time, it is recommended to leverage Cerberus to capture the state during the runs and the metrics collection in Kraken to store metrics long term even after the cluster is gone.
- Kraken ships with dashboards that will help understand API, Etcd and Kubernetes cluster level stats and performance metrics.
- Pay attention to Prometheus alerts. Check if they are firing as expected.
Run multiple chaos tests at once to mimic the production outages:
- For example, hogging both IO and Network at the same time instead of running them separately to observe the impact.
- You might have existing test cases, be it related to Performance, Scalability or QE. Run the chaos in the background during the test runs to observe the impact. Signaling feature in Kraken can help with coordinating the chaos runs i.e., start, stop, pause the scenarios based on the state of the other test jobs.
Chaos testing in Practice
OpenShift organization
Within the OpenShift organization we use kraken to perform chaos testing throughout a release before the code is available to customers.
1. We execute kraken during our regression test suite.
i. We cover each of the chaos scenarios across different clouds.
a. Our testing is predominantly done on AWS, Azure and GCP.
2. We run the chaos scenarios during a long running reliability test.
i. During this test we perform different types of tasks by different users on the cluster.
ii. We have added the execution of kraken to perform at certain times throughout the long running test and monitor the health of the cluster.
iii. This test can be seen here: https://github.com/openshift/svt/tree/master/reliability-v2
3. We are starting to add in test cases that perform chaos testing during an upgrade (not many iterations of this have been completed).
startx-lab
NOTE: Requests for enhancements and any issues need to be filed at the mentioned links given that they are not natively supported in Kraken.
The following content covers the implementation details around how Startx is leveraging Kraken:
- Using kraken as part of a tekton pipeline
You can find the kraken-scenario tekton-task on artifacthub.io, which can be used to start a kraken chaos scenario as part of a chaos pipeline.
To use this task, you must have:
- OpenShift Pipelines enabled (or the Tekton CRDs loaded for Kubernetes clusters)
- 1 Secret named kraken-aws-creds for scenarios using aws
- 1 ConfigMap named kraken-kubeconfig with credentials to the targeted cluster
- 1 ConfigMap named kraken-config-example with the kraken configuration file (config.yaml)
- 1 ConfigMap named kraken-common-example with all kraken related files
- The pipeline SA authorized to run with the privileged SCC
You can create these resources using the following sequence:
oc project default
oc adm policy add-scc-to-user privileged -z pipeline
oc apply -f https://github.com/startxfr/tekton-catalog/raw/stable/task/kraken-scenario/0.1/samples/common.yaml
Then you must change the content of the kraken-aws-creds secret, and of the kraken-kubeconfig and kraken-config-example ConfigMaps, to reflect your cluster configuration. Refer to the kraken configuration and configuration examples for details on how to configure these resources.
- Start as a single taskrun
oc apply -f https://github.com/startxfr/tekton-catalog/raw/stable/task/kraken-scenario/0.1/samples/taskrun.yaml
- Start as a pipelinerun
oc apply -f https://github.com/startxfr/tekton-catalog/raw/stable/task/kraken-scenario/0.1/samples/pipelinerun.yaml
- Deploying kraken using a helm-chart
You can find the chaos-kraken helm-chart on artifacthub.io, which can be used to deploy kraken chaos scenarios.
The default configuration creates the following resources:
- 1 project named chaos-kraken
- 1 scc with privileged context for kraken deployment
- 1 configmap with 21 generic kraken scenarios, various scripts and configuration
- 1 configmap with kubeconfig of the targeted cluster
- 1 job named kraken-test-xxx
- 1 service to the kraken pods
- 1 route to the kraken service
# Install the startx helm repository
helm repo add startx https://startxfr.github.io/helm-repository/packages/
# Install the kraken project
helm install --set project.enabled=true chaos-kraken-project startx/chaos-kraken
# Deploy the kraken instance
helm install \
--set kraken.enabled=true \
--set kraken.aws.credentials.region="eu-west-3" \
--set kraken.aws.credentials.key_id="AKIAXXXXXXXXXXXXXXXX" \
--set kraken.aws.credentials.secret="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" \
--set kraken.kubeconfig.token.server="https://api.mycluster:6443" \
--set kraken.kubeconfig.token.token="sha256~XXXXXXXXXX_PUT_YOUR_TOKEN_HERE_XXXXXXXXXXXX" \
-n chaos-kraken \
chaos-kraken-instance startx/chaos-kraken
5 - Cerberus
Cerberus
Guardian of Kubernetes and OpenShift Clusters
Cerberus watches the Kubernetes/OpenShift clusters for dead nodes and system component failures/health and exposes a go or no-go signal which can be consumed by other workload generators or applications in the cluster so they can act accordingly.
Workflow
Installation
Instructions on how to setup, configure and run Cerberus can be found at Installation.
What Kubernetes/OpenShift components can Cerberus monitor?
The following are the components of Kubernetes/OpenShift that Cerberus can monitor today; we will be adding more soon.
Component | Description | Working |
---|---|---|
Nodes | Watches all the nodes including masters, workers as well as nodes created using custom MachineSets | :heavy_check_mark: |
Namespaces | Watches all the pods including containers running inside the pods in the namespaces specified in the config | :heavy_check_mark: |
Cluster Operators | Watches all Cluster Operators | :heavy_check_mark: |
Masters Schedulability | Watches and warns if master nodes are marked as schedulable | :heavy_check_mark: |
Routes | Watches specified routes | :heavy_check_mark: |
CSRs | Warns if any CSRs are not approved | :heavy_check_mark: |
Critical Alerts | Warns the user on observing abnormal behavior which might affect the health of the cluster | :heavy_check_mark: |
Bring your own checks | Users can bring their own checks and Cerberus runs and includes them in the reporting as well as the go/no-go signal | :heavy_check_mark: |
An explanation of all the components that Cerberus can monitor can be found here
How does Cerberus report cluster health?
Cerberus exposes the cluster health and failures through a go/no-go signal, report and metrics API.
Go or no-go signal
When Cerberus is configured to run in daemon mode, it will continuously monitor the specified components, run a lightweight http server at http://0.0.0.0:8080 and publish the signal, i.e. True or False, depending on the component status. Tools can consume the signal and act accordingly.
Report
The report is generated in the run directory and it contains the information about each check/monitored component status per iteration with timestamps. It also displays information about the components in case of failure. Refer report for example.
You can use the “-o <file_path_name>” option to change the location of the created report
Metrics API
Cerberus exposes the metrics including the failures observed during the run through an API. Tools consuming Cerberus can query the API to get a blob of json with the observed failures to scrape and act accordingly. For example, we can query for etcd failures within a start and end time and take actions to determine pass/fail for test cases or report whether the cluster is healthy or unhealthy for that duration.
- The failures in the past 1 hour can be retrieved in the json format by visiting http://0.0.0.0:8080/history.
- The failures in a specific time window can be retrieved in the json format by visiting http://0.0.0.0:8080/history?loopback=.
- The failures between two timestamps, the failures of specific issue types and the failures related to specific components can be retrieved in the json format by visiting the http://0.0.0.0:8080/analyze url. The filters have to be applied to scrape the failures accordingly.
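As a rough sketch of how an external tool could consume these endpoints (assuming Cerberus is reachable at http://0.0.0.0:8080, that the root URL returns the plain True/False signal described above, and that the requests library is available):
import requests

CERBERUS = "http://0.0.0.0:8080"

# The root endpoint publishes the go/no-go signal as the string "True" or "False"
signal = requests.get(CERBERUS, timeout=10).text.strip()
if signal == "False":
    print("Cluster reported unhealthy - pausing the chaos/workload run")

# /history returns the failures observed in the past hour as JSON
history = requests.get(CERBERUS + "/history", timeout=10).json()
print(history)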
Slack integration
Cerberus supports reporting failures in slack. Refer slack integration for information on how to set it up.
Node Problem Detector
Cerberus also consumes node-problem-detector to detect various failures in Kubernetes/OpenShift nodes. More information on setting it up can be found at node-problem-detector
Bring your own checks
Users can add additional checks to monitor components that are not being monitored by Cerberus and consume them as part of the go/no-go signal. This can be accomplished by placing the relative paths of files containing the additional checks under custom_checks in the config file. All the checks should be placed within the main function of the file. If the additional checks need to be considered in determining the go/no-go signal of Cerberus, the main function can return a boolean value for the same. Returning a dict of the format {'status': status, 'message': message} will send the signal to Cerberus along with a message to be displayed in the slack notification. However, it's optional to return a value. Refer to example_check for an example custom check file.
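As an example, a minimal custom check file following the contract described above could look like the sketch below; the endpoint being probed is purely hypothetical and only the main() return convention matters:
import requests

def main():
    # Probe a hypothetical application endpoint that Cerberus does not monitor natively
    try:
        response = requests.get("http://my-app.my-namespace.svc:8080/healthz", timeout=5)
        healthy = response.status_code == 200
    except requests.RequestException:
        healthy = False
    # Returning a plain boolean also works; the dict form additionally carries a slack message
    message = "my-app is healthy" if healthy else "my-app health check failed"
    return {"status": healthy, "message": message}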
Alerts
Monitoring metrics and alerting on abnormal behavior is critical as they are the indicators for clusters health. Information on supported alerts can be found at alerts.
Use cases
There can be a number of use cases; here are some of them:
We run tools to push the limits of Kubernetes/OpenShift to look at the performance and scalability. There are a number of instances where system components or nodes start to degrade, which invalidates the results and the workload generator continues to push the cluster until it is unrecoverable.
When running chaos experiments on a kubernetes/OpenShift cluster, they can potentially break the components unrelated to the targeted components which means that the chaos experiment won’t be able to find it. The go/no-go signal can be used here to decide whether the cluster recovered from the failure injection as well as to decide whether to continue with the next chaos scenario.
Tools consuming Cerberus
Benchmark Operator: The intent of this Operator is to deploy common workloads to establish a performance baseline of Kubernetes cluster on your provider. Benchmark Operator consumes Cerberus to determine if the cluster was healthy during the benchmark run. More information can be found at cerberus-integration.
Kraken: Tool to inject deliberate failures into Kubernetes/OpenShift clusters to check if it is resilient. Kraken consumes Cerberus to determine if the cluster is healthy as a whole in addition to the targeted component during chaos testing. More information can be found at cerberus-integration.
Blogs and other useful resources
- https://www.openshift.com/blog/openshift-scale-ci-part-4-introduction-to-cerberus-guardian-of-kubernetes/openshift-clouds
- https://www.openshift.com/blog/reinforcing-cerberus-guardian-of-openshift/kubernetes-clusters
Contributions
We are always looking for more enhancements, fixes to make it better, any contributions are most welcome. Feel free to report or work on the issues filed on github.
More information on how to Contribute
Community
Key Members(slack_usernames): paige, rook, mffiedler, mohit, dry923, rsevilla, ravi
Credits
Thanks to Mary Shakshober ( https://github.com/maryshak1996 ) for designing the logo.
5.1 - Installation
The following ways are supported to run Cerberus:
- Standalone python program through Git or python package
- Containerized version using either Podman or Docker as the runtime
- Kubernetes or OpenShift deployment
Note
Only OpenShift 4.x versions are tested.
Git
Pick the latest stable release to install here.
$ git clone https://github.com/redhat-chaos/cerberus.git --branch <release>
Install the dependencies
NOTE: It is recommended to use a virtual environment (pyenv, venv) to prevent conflicts with already installed packages.
$ pip3 install -r requirements.txt
Configure and Run
Setup the config according to your requirements. Information on the available options can be found at usage.
Run
$ python3 start_cerberus.py --config <config_file_location>
NOTE: When config file location is not passed, default config is used.
Python Package
Cerberus is also available as a python package to ease the installation and setup.
To install the latest release:
$ pip3 install cerberus-client
Configure and Run
Setup the config according to your requirements. Information on the available options can be found at usage.
Run
$ cerberus_client -c <config_file_location>
Note
When config_file_location is not passed, the default config is used.
Note
It's recommended to run Cerberus using either the containerized or the github version to be able to use the latest enhancements and fixes.
Containerized version
Assuming docker ( 17.05 or greater with multi-build support ) is installed on the host, run:
$ docker pull quay.io/redhat-chaos/cerberus
# Setup the [config](https://github.com/redhat-chaos/cerberus/tree/master/config) according to your requirements. Information on the available options can be found at [usage](usage.md).
$ docker run --name=cerberus --net=host -v <path_to_kubeconfig>:/root/.kube/config -v <path_to_cerberus_config>:/root/cerberus/config/config.yaml -d quay.io/redhat-chaos/cerberus:latest
$ docker logs -f cerberus
Similarly, podman can be used to achieve the same:
$ podman pull quay.io/redhat-chaos/cerberus
# Setup the [config](https://github.com/redhat-chaos/cerberus/tree/master/config) according to your requirements. Information on the available options can be found at [usage](usage.md).
$ podman run --name=cerberus --net=host -v <path_to_kubeconfig>:/root/.kube/config:Z -v <path_to_cerberus_config>:/root/cerberus/config/config.yaml:Z -d quay.io/redhat-chaos/cerberus:latest
$ podman logs -f cerberus
The go/no-go signal ( True or False ) gets published at http://<hostname>:8080. Note that cerberus only supports ipv4 for the time being.
Note
The report is generated at /root/cerberus/cerberus.report inside the container; it can be mounted to a directory on the host in case we want to capture it. If you want to build your own Cerberus image, see here. To run Cerberus on Power (ppc64le) architecture, build and run a containerized version by following the instructions given here.
Run containerized Cerberus as a Kubernetes/OpenShift deployment
Refer to the instructions for information on how to run cerberus as a Kubernetes or OpenShift application.
5.2 - Config
Cerberus Config Components Explained
- Sample Config
- Watch Nodes
- Watch Operators
- Watch Routes
- Watch Master Schedulable Status
- Watch Namespaces
- Watch Terminating Namespaces
- Publish Status
- Inspect Components
- Custom Checks
Config
Set the components to monitor and the tunings like duration to wait between each check in the config file located at config/config.yaml. A sample config looks like:
cerberus:
distribution: openshift # Distribution can be kubernetes or openshift
kubeconfig_path: /root/.kube/config # Path to kubeconfig
port: 8081 # http server port where cerberus status is published
watch_nodes: True # Set to True for the cerberus to monitor the cluster nodes
watch_cluster_operators: True # Set to True for cerberus to monitor cluster operators
watch_terminating_namespaces: True # Set to True to monitor if any namespaces (set below under 'watch_namespaces') start terminating
watch_url_routes:
# Route url's you want to monitor, this is a double array with the url and optional authorization parameter
watch_master_schedulable: # When enabled checks for the schedulable master nodes with given label.
enabled: True
label: node-role.kubernetes.io/master
watch_namespaces: # List of namespaces to be monitored
- openshift-etcd
- openshift-apiserver
- openshift-kube-apiserver
- openshift-monitoring
- openshift-kube-controller-manager
- openshift-machine-api
- openshift-kube-scheduler
- openshift-ingress
- openshift-sdn # When enabled, it will check for the cluster sdn and monitor that namespace
watch_namespaces_ignore_pattern: [] # Ignores pods matching the regex pattern in the namespaces specified under watch_namespaces
cerberus_publish_status: True # When enabled, cerberus starts a light weight http server and publishes the status
inspect_components: False # Enable it only when OpenShift client is supported to run
# When enabled, cerberus collects logs, events and metrics of failed components
prometheus_url: # The prometheus url/route is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes.
prometheus_bearer_token: # The bearer token is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes. This is needed to authenticate with prometheus.
# This enables Cerberus to query prometheus and alert on observing high Kube API Server latencies.
slack_integration: False # When enabled, cerberus reports the failed iterations in the slack channel
# The following env vars needs to be set: SLACK_API_TOKEN ( Bot User OAuth Access Token ) and SLACK_CHANNEL ( channel to send notifications in case of failures )
# When slack_integration is enabled, a watcher can be assigned for each day. The watcher of the day is tagged while reporting failures in the slack channel. Values are slack member ID's.
watcher_slack_ID: # (NOTE: Defining the watcher id's is optional and when the watcher slack id's are not defined, the slack_team_alias tag is used if it is set else no tag is used while reporting failures in the slack channel.)
Monday:
Tuesday:
Wednesday:
Thursday:
Friday:
Saturday:
Sunday:
slack_team_alias: # The slack team alias to be tagged while reporting failures in the slack channel when no watcher is assigned
custom_checks:
- custom_checks/custom_check_sample.py # Relative paths of files containing additional user defined checks
tunings:
timeout: 20 # Number of seconds before requests fail
iterations: 1 # Iterations to loop before stopping the watch, it will be replaced with infinity when the daemon mode is enabled
sleep_time: 3 # Sleep duration between each iteration
kube_api_request_chunk_size: 250 # Large requests will be broken into the specified chunk size to reduce the load on API server and improve responsiveness.
daemon_mode: True # Iterations are set to infinity which means that the cerberus will monitor the resources forever
cores_usage_percentage: 0.5 # Set the fraction of cores to be used for multiprocessing
database:
database_path: /tmp/cerberus.db # Path where cerberus database needs to be stored
reuse_database: False # When enabled, the database is reused to store the failures
Watch Nodes
This flag returns any nodes where the KernelDeadlock is not set to False and does not have a Ready status
Watch Cluster Operators
When watch_cluster_operators is set to True, this will monitor the degraded status of all the cluster operators and report a failure if any are degraded. If set to False, it will not query or report the status of the cluster operators.
Watch Routes
This parameter expects a double array with each item having the url and an optional bearer token or authorization needed to properly connect to each of the urls.
For example:
watch_url_routes:
- - <url>
- <authorization> (optional)
- - https://prometheus-k8s-openshift-monitoring.apps.****.devcluster.openshift.com
- Bearer ****
- - http://nodejs-mongodb-example-default.apps.****.devcluster.openshift.com
Watch Master Schedulable Status
When this check is enabled, cerberus queries each of the nodes for the given label and verifies the taint effect does not equal “NoSchedule”
watch_master_schedulable: # When enabled checks for the schedulable master nodes with given label.
enabled: True
label: <label of master nodes>
Watch Namespaces
It supports monitoring pods in any namespaces specified in the config, the watch is enabled for system components mentioned in the config by default as they are critical for running the operations on Kubernetes/OpenShift clusters.
watch_namespaces support regex patterns. Any valid regex pattern can be used to watch all the namespaces matching the regex pattern. For example, ^openshift-.*$ can be used to watch all namespaces that start with openshift-, or openshift can be used to watch all namespaces that have openshift in it. Or you can use ^.*$ to watch all namespaces in your cluster.
Watch Terminating Namespaces
When watch_terminating_namespaces is set to True, this will monitor the status of all the namespaces defined under watch_namespaces and report a failure if any are terminating. If set to False, it will not query or report the status of the terminating namespaces.
Publish Status
Parameter to set if you want to publish the go/no-go signal to the http server
Inspect Components
inspect_components, if set to True, will perform an oc adm inspect namespace <namespace> when any namespace has any failing pods.
Custom Checks
Users can add additional checks to monitor components that are not being monitored by Cerberus and consume them as part of the go/no-go signal. This can be accomplished by placing the relative paths of files containing the additional checks under custom_checks in the config file. All the checks should be placed within the main function of the file. If the additional checks need to be considered in determining the go/no-go signal of Cerberus, the main function can return a boolean value for the same. Returning a dict of the format {'status': status, 'message': message} will send the signal to Cerberus along with a message to be displayed in the slack notification. However, it's optional to return a value.
Refer to example_check for an example custom check file.
5.3 - Example Report
2020-03-26 22:05:06,393 [INFO] Starting ceberus
2020-03-26 22:05:06,401 [INFO] Initializing client to talk to the Kubernetes cluster
2020-03-26 22:05:06,434 [INFO] Fetching cluster info
2020-03-26 22:05:06,739 [INFO] Publishing cerberus status at http://0.0.0.0:8080
2020-03-26 22:05:06,753 [INFO] Starting http server at http://0.0.0.0:8080
2020-03-26 22:05:06,753 [INFO] Daemon mode enabled, cerberus will monitor forever
2020-03-26 22:05:06,753 [INFO] Ignoring the iterations set
2020-03-26 22:05:25,104 [INFO] Iteration 4: Node status: True
2020-03-26 22:05:25,133 [INFO] Iteration 4: Etcd member pods status: True
2020-03-26 22:05:25,161 [INFO] Iteration 4: OpenShift apiserver status: True
2020-03-26 22:05:25,546 [INFO] Iteration 4: Kube ApiServer status: True
2020-03-26 22:05:25,717 [INFO] Iteration 4: Monitoring stack status: True
2020-03-26 22:05:25,720 [INFO] Iteration 4: Kube controller status: True
2020-03-26 22:05:25,746 [INFO] Iteration 4: Machine API components status: True
2020-03-26 22:05:25,945 [INFO] Iteration 4: Kube scheduler status: True
2020-03-26 22:05:25,963 [INFO] Iteration 4: OpenShift ingress status: True
2020-03-26 22:05:26,077 [INFO] Iteration 4: OpenShift SDN status: True
2020-03-26 22:05:26,077 [INFO] HTTP requests served: 0
2020-03-26 22:05:26,077 [INFO] Sleeping for the specified duration: 5
2020-03-26 22:05:31,134 [INFO] Iteration 5: Node status: True
2020-03-26 22:05:31,162 [INFO] Iteration 5: Etcd member pods status: True
2020-03-26 22:05:31,190 [INFO] Iteration 5: OpenShift apiserver status: True
127.0.0.1 - - [26/Mar/2020 22:05:31] "GET / HTTP/1.1" 200 -
2020-03-26 22:05:31,588 [INFO] Iteration 5: Kube ApiServer status: True
2020-03-26 22:05:31,759 [INFO] Iteration 5: Monitoring stack status: True
2020-03-26 22:05:31,763 [INFO] Iteration 5: Kube controller status: True
2020-03-26 22:05:31,788 [INFO] Iteration 5: Machine API components status: True
2020-03-26 22:05:31,989 [INFO] Iteration 5: Kube scheduler status: True
2020-03-26 22:05:32,007 [INFO] Iteration 5: OpenShift ingress status: True
2020-03-26 22:05:32,118 [INFO] Iteration 5: OpenShift SDN status: False
2020-03-26 22:05:32,118 [INFO] HTTP requests served: 1
2020-03-26 22:05:32,118 [INFO] Sleeping for the specified duration: 5
+--------------------------------------------------Failed Components--------------------------------------------------+
2020-03-26 22:05:37,123 [INFO] Failed openshfit sdn components: ['sdn-xmqhd']
2020-05-23 23:26:43,041 [INFO] ------------------------- Iteration Stats ---------------------------------------------
2020-05-23 23:26:43,041 [INFO] Time taken to run watch_nodes in iteration 1: 0.0996248722076416 seconds
2020-05-23 23:26:43,041 [INFO] Time taken to run watch_cluster_operators in iteration 1: 0.3672499656677246 seconds
2020-05-23 23:26:43,041 [INFO] Time taken to run watch_namespaces in iteration 1: 1.085144281387329 seconds
2020-05-23 23:26:43,041 [INFO] Time taken to run entire_iteration in iteration 1: 4.107403039932251 seconds
2020-05-23 23:26:43,041 [INFO] ---------------------------------------------------------------------------------------
5.4 - Usage
Config
Set the supported components to monitor and the tunings like number of iterations to monitor and duration to wait between each check in the config file located at config/config.yaml. A sample config looks like:
cerberus:
distribution: openshift # Distribution can be kubernetes or openshift
kubeconfig_path: ~/.kube/config # Path to kubeconfig
port: 8080 # http server port where cerberus status is published
watch_nodes: True # Set to True for the cerberus to monitor the cluster nodes
watch_cluster_operators: True # Set to True for cerberus to monitor cluster operators. Parameter is optional, will set to True if not specified
watch_url_routes: # Route url's you want to monitor
- - https://...
- Bearer **** # This parameter is optional, specify authorization need for get call to route
- - http://...
watch_master_schedulable: # When enabled checks for the schedulable master nodes with given label.
enabled: True
label: node-role.kubernetes.io/master
watch_namespaces: # List of namespaces to be monitored
- openshift-etcd
- openshift-apiserver
- openshift-kube-apiserver
- openshift-monitoring
- openshift-kube-controller-manager
- openshift-machine-api
- openshift-kube-scheduler
- openshift-ingress
- openshift-sdn
cerberus_publish_status: True # When enabled, cerberus starts a light weight http server and publishes the status
inspect_components: False # Enable it only when OpenShift client is supported to run.
# When enabled, cerberus collects logs, events and metrics of failed components
prometheus_url: # The prometheus url/route is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes.
prometheus_bearer_token: # The bearer token is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes. This is needed to authenticate with prometheus.
# This enables Cerberus to query prometheus and alert on observing high Kube API Server latencies.
slack_integration: False # When enabled, cerberus reports status of failed iterations in the slack channel
# The following env vars need to be set: SLACK_API_TOKEN ( Bot User OAuth Access Token ) and SLACK_CHANNEL ( channel to send notifications in case of failures )
# When slack_integration is enabled, a watcher can be assigned for each day. The watcher of the day is tagged while reporting failures in the slack channel. Values are slack member ID's.
watcher_slack_ID: # (NOTE: Defining the watcher id's is optional and when the watcher slack id's are not defined, the slack_team_alias tag is used if it is set else no tag is used while reporting failures in the slack channel.)
Monday:
Tuesday:
Wednesday:
Thursday:
Friday:
Saturday:
Sunday:
slack_team_alias: # The slack team alias to be tagged while reporting failures in the slack channel when no watcher is assigned
custom_checks: # Relative paths of files containing additional user defined checks
- custom_checks/custom_check_sample.py
- custom_check.py
tunings:
iterations: 5 # Iterations to loop before stopping the watch, it will be replaced with infinity when the daemon mode is enabled
sleep_time: 60 # Sleep duration between each iteration
kube_api_request_chunk_size: 250 # Large requests will be broken into the specified chunk size to reduce the load on API server and improve responsiveness.
daemon_mode: True # Iterations are set to infinity which means that the cerberus will monitor the resources forever
cores_usage_percentage: 0.5 # Set the fraction of cores to be used for multiprocessing
database:
database_path: /tmp/cerberus.db # Path where cerberus database needs to be stored
reuse_database: False # When enabled, the database is reused to store the failures
Note
watch_namespaces support regex patterns. Any valid regex pattern can be used to watch all the namespaces matching the regex pattern. For example, ^openshift-.*$ can be used to watch all namespaces that start with openshift-, or openshift can be used to watch all namespaces that have openshift in it.
Note
The current implementation can monitor only one cluster from one host. It can be used to monitor multiple clusters provided multiple instances of Cerberus are launched on different hosts.
Note
The components, especially the namespaces, need to be changed depending on the distribution, i.e. Kubernetes or OpenShift. The default specified in the config assumes that the distribution is OpenShift. A config file for Kubernetes is located at config/kubernetes_config.yaml
5.5 - Alerts
Cerberus consumes the metrics from Prometheus deployed on the cluster to report the alerts.
When provided the prometheus url and bearer token in the config, Cerberus reports the following alerts:
KubeAPILatencyHigh: alerts at the end of each iteration and warns if 99th percentile latency for given requests to the kube-apiserver is above 1 second. It is the official SLI/SLO defined for Kubernetes.
High number of etcd leader changes: alerts the user when an increase in etcd leader changes is observed on the cluster. Frequent elections may be a sign of insufficient resources, high network latency, or disruptions by other components and should be investigated.
NOTE: The prometheus url and bearer token are automatically picked from the cluster if the distribution is OpenShift since it’s the default metrics solution. In case of Kubernetes, they need to be provided in the config if prometheus is deployed.
5.6 - Node Problem Detector
node-problem-detector aims to make various node problems visible to the upstream layers in cluster management stack.
Installation
Please follow the instructions in the installation section to setup Node Problem Detector on Kubernetes. The following instructions are setting it up on OpenShift:
- Create the openshift-node-problem-detector namespace ns.yaml with oc create -f ns.yaml
- Add the cluster role with oc adm policy add-cluster-role-to-user system:node-problem-detector -z default -n openshift-node-problem-detector
- Add the security context constraints with oc adm policy add-scc-to-user privileged system:serviceaccount:openshift-node-problem-detector:default
- Edit node-problem-detector.yaml to fit your environment.
- Edit node-problem-detector-config.yaml to configure node-problem-detector.
- Create the ConfigMap with oc create -f node-problem-detector-config.yaml
- Create the DaemonSet with oc create -f node-problem-detector.yaml
Once installed you will see node-problem-detector pods in openshift-node-problem-detector namespace.
Now enable openshift-node-problem-detector in the config.yaml.
Cerberus just monitors the KernelDeadlock condition provided by the node problem detector as it is system critical and can hinder node performance.
5.7 - Slack Integration
The user has the option to enable/disable the slack integration ( disabled by default ). To use the slack integration, the user has to first create an app and add a bot to it on slack. The SLACK_API_TOKEN and SLACK_CHANNEL environment variables have to be set ( an example is shown after the list below ). SLACK_API_TOKEN refers to the Bot User OAuth Access Token and SLACK_CHANNEL refers to the slack channel ID in which the user wishes to receive the notifications. Make sure the Slack Bot Token Scopes contain these permissions: calls:read, channels:read, chat:write, groups:read, im:read, mpim:read.
- Reports when cerberus starts monitoring a cluster in the specified slack channel.
- Reports the component failures in the slack channel.
- A watcher can be assigned for each day of the week. The watcher of the day is tagged while reporting failures in the slack channel instead of everyone. (NOTE: Defining the watcher id’s is optional and when the watcher slack id’s are not defined, the slack_team_alias tag is used if it is set else no tag is used while reporting failures in the slack channel.)
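For example, before starting Cerberus with the slack integration enabled, export the two variables (the values below are placeholders):
export SLACK_API_TOKEN=<bot_user_oauth_access_token>
export SLACK_CHANNEL=<slack_channel_id>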
Go or no-go signal
When Cerberus is configured to run in daemon mode, it will continuously monitor the specified components, run a simple http server at http://0.0.0.0:8080 and publish the signal, i.e. True or False, depending on the component status. Tools can consume the signal and act accordingly.
Failures in a time window
- The failures in the past 1 hour can be retrieved in the json format by visiting http://0.0.0.0:8080/history.
- The failures in a specific time window can be retrieved in the json format by visiting http://0.0.0.0:8080/history?loopback=.
- The failures between two timestamps, the failures of specific issue types and the failures related to specific components can be retrieved in the json format by visiting the http://0.0.0.0:8080/analyze url. The filters have to be applied to scrape the failures accordingly.
Sample Slack Config
This is a snippet of how your slack config could look within your cerberus_config.yaml.
watcher_slack_ID:
Monday: U1234ABCD # replace with your Slack ID from Profile-> More -> Copy Member ID
Tuesday: # Same or different ID can be used for remaining days depending on who you want to tag
Wednesday:
Thursday:
Friday:
Saturday:
Sunday:
slack_team_alias: @group_or_team_id
5.8 - Contribute
How to contribute
Contributions are always appreciated.
How to:
Pull request
In order to submit a change or a PR, please fork the project and follow instructions:
$ git clone http://github.com/<me>/cerberus
$ cd cerberus
$ git checkout -b <branch_name>
$ <make change>
$ git add <changes>
$ git commit -a
$ <insert good message>
$ git push
Fix Formatting
Cerberus uses the pre-commit framework to maintain code linting and python code styling. The CI runs the pre-commit check on each pull request. We encourage our contributors to follow the same pattern while contributing to the code.
The pre-commit configuration file, .pre-commit-config.yaml, is present in the repository. It contains the code styling and linting guides which we use for the application.
The following command can be used to run pre-commit:
pre-commit run --all-files
If pre-commit is not installed on your system, it can be installed with: pip install pre-commit
Squash Commits
If there are multiple commits, please rebase/squash them into a single commit before creating the PR by following:
$ git checkout <my-working-branch>
$ git rebase -i HEAD~<num_of_commits_to_merge>
-OR-
$ git rebase -i <commit_id_of_first_change_commit>
In the interactive rebase screen, set the first commit to pick
and all others to squash
(or whatever else you may need to do).
Push your rebased commits (you may need to force), then issue your PR.
$ git push origin <my-working-branch> --force
6 - Chaos Recommendation Tool
This tool, designed for Redhat Kraken, operates through the command line and offers recommendations for chaos testing. It suggests probable chaos test cases that can disrupt application services by analyzing their behavior and assessing their susceptibility to specific fault types.
This tool profiles an application and gathers telemetry data such as CPU, Memory, and Network usage, analyzing it to suggest probable chaos scenarios. For optimal results, it is recommended to activate the utility while the application is under load.
Pre-requisites
- Openshift Or Kubernetes Environment where the application is hosted
- Access to the metrics via the exposed Prometheus endpoint
- Python3.9
Usage
To run
$ python3.9 -m venv chaos
$ source chaos/bin/activate
$ git clone https://github.com/krkn-chaos/krkn.git
$ cd krkn
$ pip3 install -r requirements.txt
Edit the configuration file:
$ vi config/recommender_config.yaml
$ python3.9 utils/chaos_recommender/chaos_recommender.py -c utils/chaos_recommender/recommender_config.yaml
Follow the prompts to provide the required information.
Configuration
To run the recommender with a config file, specify the config file path with the -c argument.
You can customize the default values by editing the recommender_config.yaml file. The configuration file contains the following options (a sample sketch follows the list):
application
: Specify the application name.
namespaces
: Specify the namespace names (separated by comma or space). If you want to profile
labels
: Specify the labels (not used).
kubeconfig
: Specify the location of the kubeconfig file (not used).
prometheus_endpoint
: Specify the Prometheus endpoint (required).
auth_token
: Auth token to connect to the Prometheus endpoint (required).
scrape_duration
: How long data should be fetched for, e.g. '1m' (required).
chaos_library
: "kraken" (currently only kraken is supported).
json_output_file
: True or False (False by default).
json_output_folder_path
: Specify the folder path where output should be saved. If empty, the default path is used.
chaos_tests
: (for output purposes only; do not change if not needed)
GENERAL
: list of general purpose tests available in Krkn
MEM
: list of memory related tests available in Krkn
NETWORK
: list of network related tests available in Krkn
CPU
: list of CPU related tests available in Krkn
threshold
: Specify the threshold to use for comparison and identifying outliers
cpu_threshold
: Specify the CPU threshold to compare with the CPU limits set on the pods and identify outliers
mem_threshold
: Specify the memory threshold to compare with the memory limits set on the pods and identify outliers
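As a rough sketch of how these options map onto YAML keys (all values below are placeholders and the actual recommender_config.yaml shipped in the repository may differ; the chaos_tests lists are omitted since they are for output purposes only):
application: <application name>
namespaces: <namespace1> <namespace2>
labels: ""                              # not used
kubeconfig: ~/.kube/config              # not used
prometheus_endpoint: <prometheus endpoint url>
auth_token: <token>
scrape_duration: 1m
chaos_library: kraken
json_output_file: False
json_output_folder_path: ""
threshold: <z-score threshold>
cpu_threshold: <cpu threshold>
mem_threshold: <memory threshold>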
TIP: to collect the Prometheus endpoint and token from your OpenShift cluster, you can run the following commands:
prometheus_url=$(kubectl get routes -n openshift-monitoring prometheus-k8s --no-headers | awk '{print $2}')
# TO USE YOUR CURRENT SESSION TOKEN
token=$(oc whoami -t)
# TO CREATE A NEW TOKEN
token=$(kubectl create token -n openshift-monitoring prometheus-k8s --duration=6h || oc sa new-token -n openshift-monitoring prometheus-k8s)
You can also provide the input values through command-line arguments by launching the recommender with the -o option:
-o, --options Evaluate command line options
-a APPLICATION, --application APPLICATION
Kubernetes application name
-n NAMESPACES, --namespaces NAMESPACE
Kubernetes application namespaces separated by space
-l LABELS, --labels LABELS
Kubernetes application labels
-p PROMETHEUS_ENDPOINT, --prometheus-endpoint PROMETHEUS_ENDPOINT
Prometheus endpoint URI
-k KUBECONFIG, --kubeconfig KUBECONFIG
Kubeconfig path
-t TOKEN, --token TOKEN
Kubernetes authentication token
-s SCRAPE_DURATION, --scrape-duration SCRAPE_DURATION
Prometheus scrape duration
-i LIBRARY, --library LIBRARY
Chaos library
-L LOG_LEVEL, --log-level LOG_LEVEL
log level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
-J [FOLDER_PATH], --json-output-file [FOLDER_PATH]
Create output file, the path to the folder can be specified, if not specified the default folder is used.
-M MEM [MEM ...], --MEM MEM [MEM ...]
Memory related chaos tests (space separated list)
-C CPU [CPU ...], --CPU CPU [CPU ...]
CPU related chaos tests (space separated list)
-N NETWORK [NETWORK ...], --NETWORK NETWORK [NETWORK ...]
Network related chaos tests (space separated list)
-G GENERIC [GENERIC ...], --GENERIC GENERIC [GENERIC ...]
Generic chaos tests (space separated list)
--threshold THRESHOLD
Threshold
--cpu_threshold CPU_THRESHOLD
CPU threshold to compare with the cpu limits
--mem_threshold MEM_THRESHOLD
Memory threshold to compare with the memory limits
If you provide the input values through command-line arguments, the corresponding config file inputs will be ignored.
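For example, a run that supplies the inputs entirely from the command line might look like the following (all values are placeholders):
$ python3.9 utils/chaos_recommender/chaos_recommender.py -o \
    -a <application> -n <namespace> \
    -p <prometheus endpoint url> -t <token> \
    -s 1m --threshold <threshold>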
Podman & Docker image
To run the recommender image, please visit krkn-hub for further information.
How it works
After obtaining telemetry data, sourced either locally or from Prometheus, the tool conducts a comprehensive data analysis to detect anomalies. Employing the Z-score method and heatmaps, it identifies outliers by evaluating CPU, memory, and network usage against established limits. Services with Z-scores surpassing a specified threshold are categorized as outliers. This categorization classifies services as network, CPU, or memory-sensitive, consequently leading to the recommendation of relevant test cases.
Customizing Thresholds and Options
You can customize the thresholds and options used for data analysis and identifying the outliers by setting the threshold, cpu_threshold and mem_threshold parameters in the config.
Additional Files
recommender_config.yaml
: The configuration file containing default values for application, namespace, labels, and kubeconfig.
Happy Chaos!
7 - Contribution Guidelines
Adding New Scenarios/Testing Changes
Refer to the two docs below to test your own images with any changes and to contribute them to the repository.
7.1 - Testing your changes
How to Test Your Changes/Additions
Install Podman/Docker Compose
You can use either podman-compose or docker-compose for this step
NOTE: Podman might not work on Macs
pip3 install docker-compose
OR
To get the latest podman-compose features we need, use this installation command:
pip3 install https://github.com/containers/podman-compose/archive/devel.tar.gz
Current list of Scenario Types
Scenario Types:
- pod-scenarios
- node-scenarios
- zone-outages
- time-scenarios
- cerberus
- cluster-shutdown
- container-scenarios
- node-cpu-hog
- node-io-hog
- node-memory-hog
- application-outages
Adding a New Scenario
Create a folder with the scenario name
Create a generic scenario template with environment variables
a. See scenario.yaml for example
b. Almost all parameters should be set using a variable (these will be set in the env.sh file or through the command line environment variables)
Add defaults for any environment variables in an “env.sh” file
a. See env.sh for example
Create a run.sh script to run the chaos scenario
a. See run.sh for example
b. edit line 16 with your scenario yaml template
c. edit lines 17 and 23 with your yaml config location
Create Dockerfile template
a. See dockerfile for example
b. Lines to edit
i. 15: replace "application-outages" with your folder name
ii. 17: replace "application-outages" with your folder name
iii. 19: replace "application-outages" with your folder name and config file name
Add your service/scenario to the docker-compose.yaml file, following the syntax of the other services (see the sketch after this list)
Point the dockerfile parameter of your docker-compose service to the Dockerfile in your new folder
Update this doc and main README with new scenario type
Add CI test for new scenario
a. See test_application_outages.sh for example
b. Lines to change
i. 14 and 31: Give a new function name
ii. 19: Give it a meaningful container name
iii. Edit line 20 to give the scenario type defined in the docker-compose file
c. Add test name to all_tests file
NOTE:
- If you added any variables or new sections be sure to update config.yaml.template
- Similar to above, also add the default parameter values to env.sh
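A rough sketch of what the new service entry might look like in docker-compose.yaml (the service name, image, and paths are placeholders; copy the exact fields from the existing services in the file):
  <scenario_type>:
    build:
      context: ./<scenario_folder>
      dockerfile: Dockerfile
    image: quay.io/<username>/kraken-hub:<scenario_type>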
Build Your Changes
- Run build.sh to get Dockerfiles for each scenario
- Edit the docker-compose.yaml file to point to your quay.io repository (optional)
- Build your image(s) from base kraken-hub directory
Builds all images in docker-compose file
docker-compose build
Builds single image defined by service/scenario name
docker-compose build <scenario_type>
OR
Builds all images in podman-compose file
podman-compose build
Builds single image defined by service/scenario name
podman-compose build <scenario_type>
Push Images to your quay.io
All Images
docker image push --all-tags quay.io/<username>/kraken-hub
Single image
docker image push quay.io/<username>/kraken-hub:<scenario_type>
OR
Single image (images have to be pushed one by one with podman)
podman image push quay.io/<username>/kraken-hub:<scenario_type>
Run your scenario
docker run -d -v <kube_config_path>:/root/.kube/config:Z quay.io/<username>/kraken-hub:<scenario_type>
OR
podman run -d -v <kube_config_path>:/root/.kube/config:Z quay.io/<username>/kraken-hub:<scenario_type>
Follow Contribute guide
Once all you’re happy with your changes, follow the contribution guide on how to create your own branch and squash your commits
7.2 - Contributions
How to contribute
Contributions are always appreciated.
How to:
Pull request
In order to submit a change or a PR, please fork the project and follow these instructions:
$ git clone https://github.com/<me>/kraken-hub
$ cd kraken-hub
$ git checkout -b <branch_name>
$ <make change>
$ git add <changes>
$ git commit -a
$ <insert good message>
$ git push
Squash Commits
If there are multiple commits, please rebase/squash them before creating the PR as follows:
$ git checkout <my-working-branch>
$ git rebase -i HEAD~<num_of_commits_to_merge>
-OR-
$ git rebase -i <commit_id_of_first_change_commit>
In the interactive rebase screen, set the first commit to pick
and all others to squash
(or whatever else you may need to do).
Push your rebased commits (you may need to force), then issue your PR.
$ git push origin <my-working-branch> --force
8 - Krkn Roadmap
The following is a list of enhancements that we are planning to add support for in Krkn. Of course, any help/contributions are greatly appreciated.
- Ability to run multiple chaos scenarios in parallel under load to mimic real world outages
- Centralized storage for chaos experiments artifacts
- Support for causing DNS outages
- Chaos recommender to suggest scenarios having probability of impacting the service under test using profiling results
- Chaos AI integration to improve test coverage while reducing fault space to save costs and execution time
- Support for pod level network traffic shaping
- Ability to visualize the metrics that are being captured by Kraken and stored in Elasticsearch
- Support for running all the scenarios of Kraken on Kubernetes distribution - see https://github.com/krkn-chaos/krkn/issues/185, https://github.com/redhat-chaos/krkn/issues/186
- Continue to improve the Chaos Testing Guide in terms of adding best practices, test environment recommendations and scenarios to make sure the OpenShift platform, as well as the applications running on top of it, are resilient and performant under chaotic conditions.
- Switch documentation references to Kubernetes
- OCP and Kubernetes functionalities segregation
- Krknctl - client for running Krkn scenarios with ease
9 - Config
Set the scenarios to inject and the tunings, like the duration to wait between each scenario, in the config file located at config/config.yaml.
NOTE: config can be used if leveraging the automated way to install the infrastructure pieces.
Config components:
Kraken
This section defines the scenarios and data specific to the chaos run
Distribution
Either openshift or kubernetes, depending on the type of cluster you want to run chaos on. The Prometheus url/route and bearer token are automatically obtained in the case of OpenShift; please set them when the distribution is Kubernetes (see the sketch below).
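A minimal sketch of the relevant keys, assuming the Prometheus settings live in the performance_monitoring section of config.yaml (verify key names and placement against the shipped config; all values are placeholders):
kraken:
  distribution: kubernetes                    # or openshift
performance_monitoring:
  prometheus_url: <prometheus url/route>      # only needed when distribution is kubernetes
  prometheus_bearer_token: <token>            # only needed when distribution is kubernetes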
Exit on failure
exit_on_failure: Exit when a post action check or cerberus run fails
Publish kraken status
publish_kraken_status: Can be accessed at http://0.0.0.0:8081 (or whatever signal_address and port you set in the signal address section)
signal_state: State you want kraken to start at; when set to PAUSE before running the scenarios, kraken will wait for the RUN signal before starting a chaos iteration. Refer to signal.md for more details.
Signal Address
signal_address: Address to listen/post the signal state to
port: Port to listen/post the signal state to
Chaos Scenarios
chaos_scenarios: List of the different types of chaos scenarios you want to run, with paths to their specific yaml file configurations (a sketch follows the list of types below)
If a scenario has a post action check script, it will be run before and after each scenario to validate that the component under test starts and ends in the same state
Currently the scenarios are run one after another (in sequence) and the run will exit if one of the scenarios fails, without moving on to the next one
Chaos scenario types:
- container_scenarios
- plugin_scenarios
- node_scenarios
- time_scenarios
- cluster_shut_down_scenarios
- namespace_scenarios
- zone_outages
- application_outages
- pvc_scenarios
- network_chaos
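A rough sketch of how this section might look, with placeholder scenario file paths (copy the exact layout from the shipped config.yaml):
kraken:
  chaos_scenarios:
    - plugin_scenarios:
        - <path to plugin scenario yaml>
    - node_scenarios:
        - <path to node scenario yaml>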
Cerberus
Parameters to set for enabling cerberus checks at the end of each executed scenario. The given url will be pinged after the scenario and post action check have been completed for each scenario and iteration.
cerberus_enabled: Enable it when cerberus is previously installed
cerberus_url: When cerberus_enabled is set to True, provide the url where cerberus publishes the go/no-go signal
check_applicaton_routes: When enabled, will look for application unavailability using the routes specified in the cerberus config and fail the run
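A minimal sketch of these keys as they might appear in the config (the url is a placeholder):
cerberus:
  cerberus_enabled: True
  cerberus_url: <url where cerberus publishes the go/no-go signal>
  check_applicaton_routes: False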
Performance Monitoring
There are 2 main sections defined in this part of the config, metrics and alerts; read more about each of these configurations in their respective docs
Tunings
wait_duration: Duration to wait between each chaos scenario
iterations: Number of times to execute the scenarios
daemon_mode: True or False; if True, iterations are set to infinity, which means that kraken will cause chaos forever and the number of iterations is ignored
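A minimal sketch of the tunings section with illustrative values:
tunings:
  wait_duration: 60       # seconds to wait between scenarios (illustrative value)
  iterations: 1
  daemon_mode: False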
10 - Getting Started Running Chaos Scenarios
Adding New Scenarios
Adding a new scenario is as simple as adding a new config file under the scenarios directory and defining it in the main kraken config. You can either copy an existing yaml file and make it your own, or fill in one of the templates below to suit your needs.
Templates
Pod Scenario Yaml Template
For example, for adding a pod level scenario for a new application, refer to the sample scenario below to know what fields are necessary and what to add in each location:
# yaml-language-server: $schema=../plugin.schema.json
- id: kill-pods
config:
namespace_pattern: ^<namespace>$
label_selector: <pod label>
kill: <number of pods to kill>
krkn_pod_recovery_time: <expected time for the pod to become ready>
Node Scenario Yaml Template
node_scenarios:
- actions: # Node chaos scenarios to be injected.
- <chaos scenario>
- <chaos scenario>
node_name: <node name> # Can be left blank.
label_selector: <node label>
instance_kill_count: <number of nodes on which to perform action>
timeout: <duration to wait for completion>
cloud_type: <cloud provider>
Time Chaos Scenario Template
time_scenarios:
- action: 'skew_time' or 'skew_date'
object_type: 'pod' or 'node'
label_selector: <label of pod or node>
Common Scenario Edits
If you just want to make small changes to pre-existing scenarios, feel free to edit the scenario file itself.
Example of Quick Pod Scenario Edit:
If you want to kill 2 pods instead of 1 in any of the pre-existing scenarios, you can either edit the number located at filters -> randomSample -> size or the runs under the config -> runStrategy section (see the excerpts below).
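Hypothetical excerpts showing only the two paths named above; they are not a complete scenario file and the surrounding structure may differ in your scenario yaml:
# config -> runStrategy -> runs controls how many times the scenario runs
config:
  runStrategy:
    runs: 1
# filters -> randomSample -> size controls how many pods are killed
filters:
  - randomSample:
      size: 2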
Example of Quick Nodes Scenario Edit:
If your cluster is built on GCP instead of AWS, just change the cloud type in the node_scenarios_example.yml file.
11 - Signaling to Kraken
This functionality allows a user to be able to pause or stop the kraken run at any time no matter the number of iterations or daemon_mode set in the config.
If publish_kraken_status is set to True in the config, kraken will start up a connection to a url at a certain port to decide if it should continue running.
By default, it will get posted to http://0.0.0.0:8081/
An example use case for this feature would be coordinating kraken runs based on the status of the service installation or load on the cluster.
States
There are 3 states in the kraken status:
PAUSE
: When the Kraken signal is ‘PAUSE’, this will pause the kraken test and wait for the wait_duration until the signal returns to RUN.
STOP
: When the Kraken signal is ‘STOP’, end the kraken run and print out report.
RUN
: When the Kraken signal is ‘RUN’, continue kraken run based on iterations.
Configuration
In the config you need to set these parameters to tell kraken which port to post the kraken run status to, as well as whether you want to publish and stop running based on the kraken status or not.
The signal is set to RUN by default, meaning it will continue to run the scenarios. It can be set to PAUSE for Kraken to act as a listener and wait until it is set to RUN before injecting chaos.
port: 8081
publish_kraken_status: True
signal_state: RUN
Setting Signal
You can reset the kraken status during kraken execution with a set_stop_signal.py
script with the following contents:
import http.client as cli

# Connect to the signal address and port configured in the kraken config
conn = cli.HTTPConnection("0.0.0.0", "<port>")

# Post the desired signal state: STOP, PAUSE, or RUN
conn.request("POST", "/STOP")
# conn.request("POST", "/PAUSE")
# conn.request("POST", "/RUN")

response = conn.getresponse()
print(response.read().decode())
Make sure to set the correct port number in your set_stop_signal script.
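For example, assuming the script above is saved as set_stop_signal.py, it can be run with:
$ python3 set_stop_signal.py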
Url Examples
To stop the run:
curl -X POST http://0.0.0.0:8081/STOP
To pause the run:
curl -X POST http://0.0.0.0:8081/PAUSE
To start running again:
curl -X POST http://0.0.0.0:8081/RUN