
Scenarios

Supported chaos scenarios

Scenario | Description
Pod failures | Injects pod failures
Container failures | Injects container failures based on the provided kill signal
Node failures | Injects node failures through OpenShift/Kubernetes, cloud APIs
Zone outages | Creates a zone outage to observe the impact on the cluster and applications
Time skew | Skews the time and date
Node CPU hog | Hogs CPU on the targeted nodes
Node memory hog | Hogs memory on the targeted nodes
Node IO hog | Hogs IO on the targeted nodes
Service disruption | Deletes all objects within a namespace
Application outages | Isolates application Ingress/Egress traffic to observe the impact on dependent applications and recovery/initialization timing
Power outages | Shuts down the cluster for the specified duration and turns it back on to check the cluster health
PVC disk fill | Fills up a given PersistentVolumeClaim by creating a temp file on the PVC from a pod associated with it
Network chaos | Introduces network latency, packet loss, and bandwidth restriction in the egress traffic of a Node's interface using tc and netem
Pod network chaos | Introduces network chaos at the pod level
Service hijacking | Hijacks a service's HTTP traffic to simulate custom HTTP responses

1 - Application Outage Scenarios

Application outages

Scenario to block the traffic (Ingress/Egress) of an application matching the labels for the specified duration, in order to understand the behavior of the service and of other services that depend on it during downtime. This helps with planning the requirements accordingly, be it improving the timeouts or tweaking the alerts.

1.1 - Application Outage Scenarios using Krkn

Sample scenario config
application_outage:                                  # Scenario to create an outage of an application by blocking traffic
  duration: 600                                      # Duration in seconds after which the routes will be accessible
  namespace: <namespace-with-application>            # Namespace to target - all application routes will go inaccessible if pod selector is empty
  pod_selector: {app: foo}                            # Pods to target
  block: [Ingress, Egress]                           # It can be Ingress or Egress or Ingress, Egress
Debugging steps in case of failures

Kraken creates a network policy that blocks the ingress/egress traffic to create an outage. If Kraken fails before reverting the network policy, you can delete it manually by executing the following command to stop the outage:

$ oc delete networkpolicy/kraken-deny -n <targeted-namespace>

1.2 - Application outage Scenario using Krkn-hub

This scenario disrupts the traffic to the specified application to help understand the impact of the outage on the dependent service/user experience. Refer to the docs for more details.

Run

If you are enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos, and set the CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.

$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:application-outages
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:application-outages
OR 
$ docker run -e <VARIABLE>=<value> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:application-outages

$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario

Supported parameters

The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:

example: export <parameter_name>=<value>

See the list of variables that apply to all scenarios here; these can be used/set in addition to the scenario-specific variables below.

Parameter | Description | Default
DURATION | Duration in seconds after which the routes will be accessible | 600
NAMESPACE | Namespace to target - all application routes will go inaccessible if pod selector is empty (Required) | No default
POD_SELECTOR | Pods to target. For example "{app: foo}" | No default
BLOCK_TRAFFIC_TYPE | It can be Ingress or Egress or Ingress, Egress (needs to be a list) | [Ingress, Egress]
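
For instance, to block only Ingress traffic to pods labeled app: foo in a target namespace for 10 minutes, the scenario-specific variables could be exported as follows (values are illustrative):

$ export DURATION=600
$ export NAMESPACE=<namespace-with-application>
$ export POD_SELECTOR="{app: foo}"
$ export BLOCK_TRAFFIC_TYPE="[Ingress]"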

For example:

$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:application-outages

Demo

You can find a link to a demo of the scenario here

2 - Arcaflow Scenarios

Arcaflow is a workflow engine in development which provides the ability to execute workflow steps in sequence, in parallel, repeatedly, etc. The main difference from competitors such as Netflix Conductor is the ability to run ad-hoc workflows without requiring an infrastructure setup.

The engine uses containers to execute plugins and runs them either locally in Docker/Podman or remotely on a Kubernetes cluster. The workflow system is strongly typed and allows for generating JSON schema and OpenAPI documents for all data formats involved.

Available Scenarios

Hog scenarios: CPU Hog, IO Hog, Memory Hog

Prerequisites

Arcaflow supports three deployment technologies:

  • Docker
  • Podman
  • Kubernetes

Docker

In order to run Arcaflow Scenarios with the Docker deployer, be sure that:

  • Docker is correctly installed on your operating system (for instructions on how to install Docker, refer to the Docker Documentation)
  • The Docker daemon is running

Podman

The podman deployer is built around the podman CLI and does not necessarily need the podman daemon to be running. To run Arcaflow Scenarios on your operating system, be sure that:

  • podman is correctly installed on your operating system (for instructions on how to install podman, refer to the Podman Documentation)
  • the podman CLI is in your shell PATH

Kubernetes

The kubernetes deployer integrates directly with the Kubernetes API client and only needs a valid kubeconfig file and a reachable Kubernetes/OpenShift cluster.

2.1 - Arcaflow Scenarios using Krkn

Usage

To enable arcaflow scenarios, edit the kraken config file: go to the kraken -> chaos_scenarios section of the yaml structure, add a new element named arcaflow_scenarios to the list, and then add the desired scenario pointing to its input.yaml file.

kraken:
    ...
    chaos_scenarios:
        - arcaflow_scenarios:
            - scenarios/arcaflow/cpu-hog/input.yaml

input.yaml

The implemented scenarios can be found in the scenarios/arcaflow/<scenario_name> folder. The entrypoint of each scenario is the input.yaml file, which contains all the options needed to set up the scenario according to the desired target.

config.yaml

The arcaflow config file. Here you can set the arcaflow deployer and the arcaflow log level. The supported deployers are:

  • Docker
  • Podman (podman daemon not needed, suggested option)
  • Kubernetes

The supported log levels are:

  • debug
  • info
  • warning
  • error

workflow.yaml

This file contains the steps that will be executed to perform the scenario against the target. Each step is represented by a container that will be run by the deployer, along with its options. Note that we provide the scenarios as templates, but they can be modified to define more complex workflows. For more details on the arcaflow workflow architecture and syntax, refer to the Arcaflow Documentation.

Note: This change is no longer in the quay image. A fix is being worked on in ticket https://issues.redhat.com/browse/CHAOS-494. This will affect all versions 4.12 and higher of OpenShift.

3 - Container Scenarios

Kraken uses the oc exec command to kill specific containers in a pod. This can be based on the pod's namespace or labels. If you know the exact object you want to kill, you can also specify the specific container name or pod name in the scenario yaml file. These scenarios are in a simple yaml format that you can manipulate to run your specific tests, or you can use the pre-existing scenarios to see how they work.

3.1 - Container Scenarios using Krkn

Example Config

The following is the basic config format for a container chaos scenario:

scenarios:
- name: "<name of scenario>"
  namespace: "<specific namespace>" # can specify "*" if you want to find in all namespaces
  label_selector: "<label of pod(s)>"
  container_name: "<specific container name>"  # This is optional, can take out and will kill all containers in all pods found under namespace and label
  pod_names:  # This is optional, can take out and will select all pods with given namespace and label
  - <pod_name>
  count: <number of containers to disrupt, default=1>
  action: <kill signal to run. For example 1 ( hang up ) or 9. Default is set to 1>
  expected_recovery_time: <number of seconds to wait for container to be running again> (defaults to 120 seconds)

Post Action

In all scenarios we do a post chaos check to wait and verify the specific component.

Here there are two options:

  1. Pass a custom script in the main config scenario list; it runs before the chaos, and its output is verified to match after the chaos scenario.

See scenarios/post_action_etcd_container.py for an example.

-   container_scenarios:                                 # List of chaos container scenarios to load.
    - -    scenarios/container_etcd.yml
      -    scenarios/post_action_etcd_container.py
  2. Allow kraken to wait and check the killed containers until they become ready again. Kraken keeps a list of the specific containers that were killed, as well as the namespaces and pods, to verify that all affected containers recover properly.
expected_recovery_time: <seconds to wait for container to recover>

3.2 - Container Scenarios using Krkn-hub

This scenario disrupts the containers matching the label in the specified namespace on a Kubernetes/OpenShift cluster.

Run

If you are enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos, and set the CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.

$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:container-scenarios
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:container-scenarios
OR 
$ docker run -e <VARIABLE>=<value> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:container-scenarios

$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario

Supported parameters

The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:

example: export <parameter_name>=<value>

See the list of variables that apply to all scenarios here; these can be used/set in addition to the scenario-specific variables below.

Parameter | Description | Default
NAMESPACE | Targeted namespace in the cluster | openshift-etcd
LABEL_SELECTOR | Label of the container(s) to target | k8s-app=etcd
DISRUPTION_COUNT | Number of containers to disrupt | 1
CONTAINER_NAME | Name of the container to disrupt | etcd
ACTION | Kill signal to run. For example 1 (hang up) or 9 | 1
EXPECTED_RECOVERY_TIME | Time to wait before checking if all containers that were affected recover properly | 60
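
For instance, to kill the etcd container with signal 9 and allow 60 seconds for recovery, the variables could be exported as follows (values are illustrative and mirror the defaults above):

$ export NAMESPACE=openshift-etcd
$ export LABEL_SELECTOR=k8s-app=etcd
$ export CONTAINER_NAME=etcd
$ export ACTION=9
$ export EXPECTED_RECOVERY_TIME=60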

For example:

$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:container-scenarios

Demo

You can find a link to a demo of the scenario here

4 - CPU Hog Scenario

This scenario is based on the arcaflow arcaflow-plugin-stressng plugin. The purpose of this scenario is to create CPU pressure on a particular node of the Kubernetes/OpenShift cluster for a given time span.

4.1 - CPU Hog Scenarios using Krkn

To enable this plugin add the pointer to the scenario input file scenarios/arcaflow/cpu-hog/input.yaml as described in the Usage section. This scenario takes a list of objects named input_list with the following properties:

  • kubeconfig : string the kubeconfig needed by the deployer to deploy the sysbench plugin in the target cluster
  • namespace : string the namespace where the scenario container will be deployed Note: this parameter will be automatically filled by kraken if the kubeconfig_path property is correctly set
  • node_selector : key-value map the node label that will be used as nodeSelector by the pod to target a specific cluster node
  • duration : string stop stress test after N seconds. One can also specify the units of time in seconds, minutes, hours, days or years with the suffix s, m, h, d or y.
  • cpu_count : int the number of CPU cores to be used (0 means all)
  • cpu_method : string a fine-grained control of which cpu stressors to use (ackermann, cfloat etc. see manpage for all the cpu_method options)
  • cpu_load_percentage : int the CPU load by percentage

To perform several load tests simultaneously in the same run (e.g. stress two or more nodes in the same run), add another item to the input_list with the same properties (and possibly different values, e.g. different node_selectors to schedule the pods on different nodes). To reduce (or increase) the parallelism, change the parallelism value in the workload.yaml file.
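
As a rough sketch, a single input_list entry using the properties above might look like the following; the node selector and values are illustrative, and kubeconfig is left empty because kraken fills it in automatically when kubeconfig_path is set:

input_list:
  - kubeconfig: ""                       # filled automatically by kraken when kubeconfig_path is set
    namespace: default
    node_selector:
      kubernetes.io/hostname: worker-0   # illustrative node label
    duration: 60s
    cpu_count: 0                         # 0 means all cores
    cpu_method: all                      # see the stress-ng manpage for the available methods
    cpu_load_percentage: 80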

Usage

To enable arcaflow scenarios, edit the kraken config file: go to the kraken -> chaos_scenarios section of the yaml structure, add a new element named arcaflow_scenarios to the list, and then add the desired scenario pointing to its input.yaml file.

kraken:
    ...
    chaos_scenarios:
        - arcaflow_scenarios:
            - scenarios/arcaflow/cpu-hog/input.yaml

input.yaml

The implemented scenarios can be found in the scenarios/arcaflow/<scenario_name> folder. The entrypoint of each scenario is the input.yaml file, which contains all the options needed to set up the scenario according to the desired target.

config.yaml

The arcaflow config file. Here you can set the arcaflow deployer and the arcaflow log level. The supported deployers are:

  • Docker
  • Podman (podman daemon not needed, suggested option)
  • Kubernetes

The supported log levels are:

  • debug
  • info
  • warning
  • error

workflow.yaml

This file contains the steps that will be executed to perform the scenario against the target. Each step is represented by a container that will be run by the deployer, along with its options. Note that we provide the scenarios as templates, but they can be modified to define more complex workflows. For more details on the arcaflow workflow architecture and syntax, refer to the Arcaflow Documentation.

Note: This change is no longer in the quay image. A fix is being worked on in ticket https://issues.redhat.com/browse/CHAOS-494. This will affect all versions 4.12 and higher of OpenShift.

4.2 - CPU Hog Scenario using Krkn-Hub

This scenario hogs the CPU on the specified node on a Kubernetes/OpenShift cluster for a specified duration. For more information, refer to the following documentation.

Run

If you are enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos, and set the CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.

$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-cpu-hog
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-cpu-hog
OR 
$ docker run -e <VARIABLE>=<value> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-cpu-hog

$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario

Supported parameters

The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:

example: export <parameter_name>=<value>

See the list of variables that apply to all scenarios here; these can be used/set in addition to the scenario-specific variables below.

Parameter | Description | Default
TOTAL_CHAOS_DURATION | Set chaos duration (in sec) as desired | 60
NODE_CPU_CORE | Number of cores (workers) of node CPU to be consumed | 2
NODE_CPU_PERCENTAGE | Percentage of total CPU to be consumed | 50
NAMESPACE | Namespace where the scenario container will be deployed | default
NODE_SELECTORS | Node selectors where the scenario containers will be scheduled, in the format "<selector>=<value>". NOTE: This value can be specified as a list of node selectors separated by ";". A container will be instantiated per node selector, with the same scenario options. This option is meant to run one or more stress scenarios simultaneously on different nodes; Kubernetes will schedule the pods on the target nodes according to the specified selectors. Specifying the same selector multiple times will instantiate as many scenario containers as the number of times the selector is specified, on the same node | ""
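
For instance, to hog 80% of 4 cores for two minutes on two different worker nodes at the same time (node labels are illustrative):

$ export TOTAL_CHAOS_DURATION=120
$ export NODE_CPU_CORE=4
$ export NODE_CPU_PERCENTAGE=80
$ export NODE_SELECTORS="kubernetes.io/hostname=worker-0;kubernetes.io/hostname=worker-1"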

For example:

$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-cpu-hog

Demo

You can find a link to a demo of the scenario here

5 - IO Hog Scenario

This scenario is based on the arcaflow arcaflow-plugin-stressng plugin. The purpose of this scenario is to create disk pressure on a particular node of the Kubernetes/OpenShift cluster for a given time span. The scenario allows attaching a node path to the pod as a hostPath volume.

5.1 - IO Hog Scenarios using Krkn

To enable this plugin add the pointer to the scenario input file scenarios/arcaflow/io-hog/input.yaml as described in the Usage section. This scenario takes a list of objects named input_list with the following properties:

  • kubeconfig : string the kubeconfig needed by the deployer to deploy the sysbench plugin in the target cluster
  • namespace : string the namespace where the scenario container will be deployed Note: this parameter will be automatically filled by kraken if the kubeconfig_path property is correctly set
  • node_selector : key-value map the node label that will be used as nodeSelector by the pod to target a specific cluster node
  • duration : string stop stress test after N seconds. One can also specify the units of time in seconds, minutes, hours, days or years with the suffix s, m, h, d or y.
  • target_pod_folder : string the path in the pod where the volume is mounted
  • target_pod_volume : object the hostPath volume definition in the Kubernetes/OpenShift format, that will be attached to the pod as a volume
  • io_write_bytes : string writes N bytes for each hdd process. The size can be expressed as % of free space on the file system or in units of Bytes, KBytes, MBytes and GBytes using the suffix b, k, m or g
  • io_block_size : string size of each write in bytes. Size can be from 1 byte to 4m.

To perform several load tests simultaneously in the same run (e.g. stress two or more nodes in the same run), add another item to the input_list with the same properties (and possibly different values, e.g. different node_selectors to schedule the pods on different nodes). To reduce (or increase) the parallelism, change the parallelism value in the workload.yaml file.
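
As a rough sketch, a single input_list entry using the properties above might look like this; the target folder, volume definition and sizes are illustrative:

input_list:
  - kubeconfig: ""                       # filled automatically by kraken when kubeconfig_path is set
    namespace: default
    node_selector:
      kubernetes.io/hostname: worker-0   # illustrative node label
    duration: 120s
    target_pod_folder: /hog-data
    target_pod_volume:                   # hostPath volume in the standard Kubernetes format
      name: node-volume
      hostPath:
        path: /tmp/hog-data
    io_write_bytes: 10m
    io_block_size: 1m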

Usage

To enable arcaflow scenarios, edit the kraken config file: go to the kraken -> chaos_scenarios section of the yaml structure, add a new element named arcaflow_scenarios to the list, and then add the desired scenario pointing to its input.yaml file.

kraken:
    ...
    chaos_scenarios:
        - arcaflow_scenarios:
            - scenarios/arcaflow/io-hog/input.yaml

input.yaml

The implemented scenarios can be found in the scenarios/arcaflow/<scenario_name> folder. The entrypoint of each scenario is the input.yaml file, which contains all the options needed to set up the scenario according to the desired target.

config.yaml

The arcaflow config file. Here you can set the arcaflow deployer and the arcaflow log level. The supported deployers are:

  • Docker
  • Podman (podman daemon not needed, suggested option)
  • Kubernetes

The supported log levels are:

  • debug
  • info
  • warning
  • error

workflow.yaml

This file contains the steps that will be executed to perform the scenario against the target. Each step is represented by a container that will be run by the deployer, along with its options. Note that we provide the scenarios as templates, but they can be modified to define more complex workflows. For more details on the arcaflow workflow architecture and syntax, refer to the Arcaflow Documentation.

Note: This change is no longer in the quay image. A fix is being worked on in ticket https://issues.redhat.com/browse/CHAOS-494. This will affect all versions 4.12 and higher of OpenShift.

5.2 - IO Hog Scenario using Krkn-Hub

This scenario hogs the IO on the specified node on a Kubernetes/OpenShift cluster for a specified duration. For more information, refer to the following documentation.

Run

If you are enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos, and set the CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.

$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/root/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-io-hog
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/root/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-io-hog
OR 
$ docker run -e <VARIABLE>=<value> --net=host -v <path-to-kube-config>:/root/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-io-hog

$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario

Supported parameters

The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:

example: export <parameter_name>=<value>

See the list of variables that apply to all scenarios here; these can be used/set in addition to the scenario-specific variables below.

Parameter | Description | Default
TOTAL_CHAOS_DURATION | Set chaos duration (in sec) as desired | 180
IO_BLOCK_SIZE | Size of each write in bytes. Size can be from 1 byte to 4m | 1m
IO_WORKERS | Number of stressors | 5
IO_WRITE_BYTES | Writes N bytes for each hdd process. The size can be expressed as % of free space on the file system or in units of Bytes, KBytes, MBytes and GBytes using the suffix b, k, m or g | 10m
NAMESPACE | Namespace where the scenario container will be deployed | default
NODE_SELECTORS | Node selectors where the scenario containers will be scheduled, in the format "<selector>=<value>". NOTE: This value can be specified as a list of node selectors separated by ";". A container will be instantiated per node selector, with the same scenario options. This option is meant to run one or more stress scenarios simultaneously on different nodes; Kubernetes will schedule the pods on the target nodes according to the specified selectors. Specifying the same selector multiple times will instantiate as many scenario containers as the number of times the selector is specified, on the same node | ""
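
For instance, to write 1g per stressor in 4k blocks for five minutes on a specific worker node (values are illustrative):

$ export TOTAL_CHAOS_DURATION=300
$ export IO_WRITE_BYTES=1g
$ export IO_BLOCK_SIZE=4k
$ export IO_WORKERS=5
$ export NODE_SELECTORS="kubernetes.io/hostname=worker-0"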

For example:

$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/root/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/root/kraken/config/alerts -v <path-to-kube-config>:/root/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-io-hog

6 - ManagedCluster Scenarios

ManagedCluster scenarios provide a way to integrate kraken with Open Cluster Management (OCM) and Red Hat Advanced Cluster Management for Kubernetes (ACM).

ManagedCluster scenarios leverage ManifestWorks to inject faults into the ManagedClusters.

The following ManagedCluster chaos scenarios are supported:

  1. managedcluster_start_scenario: Scenario to start the ManagedCluster instance.
  2. managedcluster_stop_scenario: Scenario to stop the ManagedCluster instance.
  3. managedcluster_stop_start_scenario: Scenario to stop and then start the ManagedCluster instance.
  4. start_klusterlet_scenario: Scenario to start the klusterlet of the ManagedCluster instance.
  5. stop_klusterlet_scenario: Scenario to stop the klusterlet of the ManagedCluster instance.
  6. stop_start_klusterlet_scenario: Scenario to stop and start the klusterlet of the ManagedCluster instance.

ManagedCluster scenarios can be injected by placing the ManagedCluster scenarios config files under managedcluster_scenarios option in the Kraken config. Refer to managedcluster_scenarios_example config file.

managedcluster_scenarios:
  - actions:                                                        # ManagedCluster chaos scenarios to be injected
    - managedcluster_stop_start_scenario
    managedcluster_name: cluster1                                   # ManagedCluster on which scenario has to be injected; can set multiple names separated by comma
    # label_selector:                                               # When managedcluster_name is not specified, a ManagedCluster with matching label_selector is selected for ManagedCluster chaos scenario injection
    instance_count: 1                                               # Number of managedclusters to perform action on/select that match the label selector
    runs: 1                                                         # Number of times to inject each scenario under actions (will perform on same ManagedCluster each time)
    timeout: 420                                                    # Duration to wait for completion of ManagedCluster scenario injection
                                                                    # For OCM to detect a ManagedCluster as unavailable, have to wait 5*leaseDurationSeconds
                                                                    # (default leaseDurationSeconds = 60 sec)
  - actions:
    - stop_start_klusterlet_scenario
    managedcluster_name: cluster1
    # label_selector:
    instance_count: 1
    runs: 1
    timeout: 60

7 - Memory Hog Scenario

This scenario is based on the arcaflow arcaflow-plugin-stressng plugin. The purpose of this scenario is to create virtual memory pressure on a particular node of the Kubernetes/OpenShift cluster for a given time span.

7.1 - Memory Hog Scenarios using Krkn

To enable this plugin add the pointer to the scenario input file scenarios/arcaflow/memory-hog/input.yaml as described in the Usage section. This scenario takes a list of objects named input_list with the following properties:

  • kubeconfig : string the kubeconfig needed by the deployer to deploy the sysbench plugin in the target cluster
  • namespace : string the namespace where the scenario container will be deployed Note: this parameter will be automatically filled by kraken if the kubeconfig_path property is correctly set
  • node_selector : key-value map the node label that will be used as nodeSelector by the pod to target a specific cluster node
  • duration : string stop stress test after N seconds. One can also specify the units of time in seconds, minutes, hours, days or years with the suffix s, m, h, d or y.
  • vm_bytes : string N bytes per vm process or percentage of memory used (using the % symbol). The size can be expressed in units of Bytes, KBytes, MBytes and GBytes using the suffix b, k, m or g.
  • vm_workers : int Number of VM stressors to be run (0 means 1 stressor per CPU)

To perform several load tests simultaneously in the same run (e.g. stress two or more nodes in the same run), add another item to the input_list with the same properties (and possibly different values, e.g. different node_selectors to schedule the pods on different nodes). To reduce (or increase) the parallelism, change the parallelism value in the workload.yaml file.
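
As a rough sketch, a single input_list entry using the properties above might look like this (illustrative values):

input_list:
  - kubeconfig: ""                       # filled automatically by kraken when kubeconfig_path is set
    namespace: default
    node_selector:
      kubernetes.io/hostname: worker-0   # illustrative node label
    duration: 60s
    vm_bytes: 90%                        # 90% of the node's memory
    vm_workers: 2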

Usage

To enable arcaflow scenarios, edit the kraken config file: go to the kraken -> chaos_scenarios section of the yaml structure, add a new element named arcaflow_scenarios to the list, and then add the desired scenario pointing to its input.yaml file.

kraken:
    ...
    chaos_scenarios:
        - arcaflow_scenarios:
            - scenarios/arcaflow/memory-hog/input.yaml

input.yaml

The implemented scenarios can be found in the scenarios/arcaflow/<scenario_name> folder. The entrypoint of each scenario is the input.yaml file, which contains all the options needed to set up the scenario according to the desired target.

config.yaml

The arcaflow config file. Here you can set the arcaflow deployer and the arcaflow log level. The supported deployers are:

  • Docker
  • Podman (podman daemon not needed, suggested option)
  • Kubernetes

The supported log levels are:

  • debug
  • info
  • warning
  • error

workflow.yaml

This file contains the steps that will be executed to perform the scenario against the target. Each step is represented by a container that will be run by the deployer, along with its options. Note that we provide the scenarios as templates, but they can be modified to define more complex workflows. For more details on the arcaflow workflow architecture and syntax, refer to the Arcaflow Documentation.

Note: This change is no longer in the quay image. A fix is being worked on in ticket https://issues.redhat.com/browse/CHAOS-494. This will affect all versions 4.12 and higher of OpenShift.

7.2 - Memory Hog Scenario using Krkn-Hub

This scenario hogs the memory on the specified node on a Kubernetes/OpenShift cluster for a specified duration. For more information, refer to the following documentation.

Run

If you are enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos, and set the CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.

$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-memory-hog
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-memory-hog
OR 
$ docker run -e <VARIABLE>=<value> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-memory-hog

$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario

Supported parameters

The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:

example: export <parameter_name>=<value>

See the list of variables that apply to all scenarios here; these can be used/set in addition to the scenario-specific variables below.

Parameter | Description | Default
TOTAL_CHAOS_DURATION | Set chaos duration (in sec) as desired | 60
MEMORY_CONSUMPTION_PERCENTAGE | Percentage (expressed with the suffix %) or amount (expressed with the suffix b, k, m or g) of memory to be consumed by the scenario | 90%
NUMBER_OF_WORKERS | Total number of workers (stress-ng threads) | 1
NAMESPACE | Namespace where the scenario container will be deployed | default
NODE_SELECTORS | Node selectors where the scenario containers will be scheduled, in the format "<selector>=<value>". NOTE: This value can be specified as a list of node selectors separated by ";". A container will be instantiated per node selector, with the same scenario options. This option is meant to run one or more stress scenarios simultaneously on different nodes; Kubernetes will schedule the pods on the target nodes according to the specified selectors. Specifying the same selector multiple times will instantiate as many scenario containers as the number of times the selector is specified, on the same node | ""
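
For instance, to consume 80% of memory with two workers for three minutes on a specific worker node (values are illustrative):

$ export TOTAL_CHAOS_DURATION=180
$ export MEMORY_CONSUMPTION_PERCENTAGE=80%
$ export NUMBER_OF_WORKERS=2
$ export NODE_SELECTORS="kubernetes.io/hostname=worker-0"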

For example:

$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-memory-hog

Demo

You can find a link to a demo of the scenario here

8 - Network Chaos Scenario

Scenario to introduce network latency, packet loss, and bandwidth restriction in the Node’s host network interface. The purpose of this scenario is to observe faults caused by random variations in the network.

8.1 - Network Chaos Scenario using Krkn

Sample scenario config for egress traffic shaping
network_chaos:                                    # Scenario to create an outage by simulating random variations in the network.
  duration: 300                                   # In seconds - duration network chaos will be applied.
  node_name:                                      # Comma separated node names on which scenario has to be injected.
  label_selector: node-role.kubernetes.io/master  # When node_name is not specified, a node with matching label_selector is selected for running the scenario.
  instance_count: 1                               # Number of nodes in which to execute network chaos.
  interfaces:                                     # List of interface on which to apply the network restriction.
  - "ens5"                                        # Interface name would be the Kernel host network interface name.
  execution: serial|parallel                      # Execute each of the egress options as a single scenario(parallel) or as separate scenario(serial).
  egress:
    latency: 500ms
    loss: 50%                                    # percentage
    bandwidth: 10mbit
Sample scenario config for ingress traffic shaping (using a plugin)
- id: network_chaos
  config:
    node_interface_name:                            # Dictionary with key as node name(s) and value as a list of its interfaces to test
      ip-10-0-128-153.us-west-2.compute.internal:
        - ens5
        - genev_sys_6081
    label_selector: node-role.kubernetes.io/master  # When node_interface_name is not specified, nodes with a matching label_selector are selected for node chaos scenario injection
    instance_count: 1                               # Number of nodes to perform action/select that match the label selector
    kubeconfig_path: ~/.kube/config                 # Path to kubernetes config file. If not specified, it defaults to ~/.kube/config
    execution_type: parallel                        # Execute each of the ingress options as a single scenario(parallel) or as separate scenario(serial).
    network_params:
        latency: 500ms
        loss: '50%'
        bandwidth: 10mbit
    wait_duration: 120
    test_duration: 60

Note: For ingress traffic shaping, ensure that your node doesn't have any IFB (https://wiki.linuxfoundation.org/networking/ifb) interfaces already present. The scenario relies on creating IFBs to do the shaping, and they are deleted at the end of the scenario.

Steps
  • Pick the nodes to introduce the network anomaly, either from node_name or label_selector.
  • Verify the interface list on one of the nodes, or use the interface with a default route as the test interface if no interface is specified by the user.
  • Set the traffic shaping config on the node's interface using tc and netem.
  • Wait for the duration time.
  • Remove the traffic shaping config from the node's interface.
  • Remove the job that spawned the pod.

8.2 - Network Chaos Scenario using Krkn-Hub

This scenario introduces network latency, packet loss, and bandwidth restriction in the egress traffic of a Node's interface using tc and netem. For more information, refer to the following documentation.

Run

If you are enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos, and set the CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.

$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:network-chaos
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
$ docker run -e <VARIABLE>=<value> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:network-chaos

$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario

Supported parameters

The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:

example: export <parameter_name>=<value>

See the list of variables that apply to all scenarios here; these can be used/set in addition to the scenario-specific variables below.

Egress Scenarios

Parameter | Description | Default
DURATION | Duration in seconds during which network chaos will be applied | 300
NODE_NAME | Node name to inject faults in case of targeting a specific node; can set multiple node names separated by a comma | ""
LABEL_SELECTOR | When NODE_NAME is not specified, a node with matching label_selector is selected for running | node-role.kubernetes.io/master
INSTANCE_COUNT | Targeted instance count matching the label selector | 1
INTERFACES | List of interfaces on which to apply the network restriction | []
EXECUTION | Execute each of the egress options as a single scenario (parallel) or as separate scenarios (serial) | parallel
EGRESS | Dictionary of values to set network latency (latency: 50ms), packet loss (loss: 0.02), bandwidth restriction (bandwidth: 100mbit) | {bandwidth: 100mbit}

Ingress Scenarios

Parameter | Description | Default
DURATION | Duration in seconds during which network chaos will be applied | 300
TARGET_NODE_AND_INTERFACE | Dictionary with key as node name(s) and value as a list of its interfaces to test. For example: {ip-10-0-216-2.us-west-2.compute.internal: [ens5]} | ""
LABEL_SELECTOR | When NODE_NAME is not specified, a node with matching label_selector is selected for running | node-role.kubernetes.io/master
INSTANCE_COUNT | Targeted instance count matching the label selector | 1
EXECUTION | Used to specify whether you want to apply filters on interfaces one at a time or all at once | parallel
NETWORK_PARAMS | latency, loss and bandwidth are the three supported network parameters to alter for the chaos test. For example: {latency: 50ms, loss: '0.02'} | ""
WAIT_DURATION | Ensure that it is at least about twice test_duration | 300
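
For instance, to apply 50ms of latency and 0.02 packet loss to the egress traffic of one master node for five minutes, the variables could be set roughly as follows (values are illustrative; the exact quoting of the EGRESS dictionary may vary with your shell):

$ export DURATION=300
$ export LABEL_SELECTOR="node-role.kubernetes.io/master"
$ export INSTANCE_COUNT=1
$ export EGRESS="{latency: 50ms, loss: 0.02}"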

For example:

$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:network-chaos

9 - Node Scenarios

This scenario disrupts the node(s) matching the label on a Kubernetes/OpenShift cluster.

9.1 - Node Scenarios using Krkn

The following node chaos scenarios are supported:

  1. node_start_scenario: Scenario to start the node instance.
  2. node_stop_scenario: Scenario to stop the node instance.
  3. node_stop_start_scenario: Scenario to stop and then start the node instance. Not supported on VMware.
  4. node_termination_scenario: Scenario to terminate the node instance.
  5. node_reboot_scenario: Scenario to reboot the node instance.
  6. stop_kubelet_scenario: Scenario to stop the kubelet of the node instance.
  7. stop_start_kubelet_scenario: Scenario to stop and start the kubelet of the node instance.
  8. restart_kubelet_scenario: Scenario to restart the kubelet of the node instance.
  9. node_crash_scenario: Scenario to crash the node instance.
  10. stop_start_helper_node_scenario: Scenario to stop and start the helper node and check service status.

AWS

Cloud setup instructions can be found here. Sample scenario config can be found here.

Baremetal

Sample scenario config can be found here.

Docker

The Docker provider can be used to run node scenarios against kind clusters.

kind is a tool for running local Kubernetes clusters using Docker container “nodes”.

kind was primarily designed for testing Kubernetes itself, but may be used for local development or CI.

GCP

Cloud setup instructions can be found here. Sample scenario config can be found here.

Openstack

How to set up Openstack cli to run node scenarios is defined here.

The supported node level chaos scenarios on an OPENSTACK cloud are node_stop_start_scenario, stop_start_kubelet_scenario and node_reboot_scenario.

To execute the scenario, ensure the value for ssh_private_key in the node scenarios config file is set with the correct private key file path for ssh connection to the helper node. Ensure passwordless ssh is configured on the host running Kraken and the helper node to avoid connection errors.

Azure

Cloud setup instructions can be found here. Sample scenario config can be found here.

Alibaba

How to set up Alibaba cli to run node scenarios is defined here.

VMware

How to set up VMware vSphere to run node scenarios is defined here.

This cloud type uses a different configuration style; see the actions below and the example config file.

  • vmware-node-terminate
  • vmware-node-reboot
  • vmware-node-stop
  • vmware-node-start

IBMCloud

How to set up IBMCloud to run node scenarios is defined here.

This cloud type uses a different configuration style; see the actions below and the example config file.

  • ibmcloud-node-terminate
  • ibmcloud-node-reboot
  • ibmcloud-node-stop
  • ibmcloud-node-start

General

Use ‘generic’ or do not add the ‘cloud_type’ key to your scenario if your cluster is not set up using one of the currently supported cloud types.

9.2 - Node Scenarios using Krkn-Hub

This scenario disrupts the node(s) matching the label on a Kubernetes/OpenShift cluster. Actions/disruptions supported are listed here

Run

If you are enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos, and set the CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.

$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-scenarios
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-scenarios
OR 
$ docker run -e <VARIABLE>=<value> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-scenarios

$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario

Supported parameters

The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:

example: export <parameter_name>=<value>

See the list of variables that apply to all scenarios here; these can be used/set in addition to the scenario-specific variables below.

Parameter | Description | Default
ACTION | Action can be one of the following | node_stop_start_scenario for aws and vmware-node-reboot for vmware, ibmcloud-node-reboot for ibmcloud
LABEL_SELECTOR | Node label to target | node-role.kubernetes.io/worker
NODE_NAME | Node name to inject faults in case of targeting a specific node; can set multiple node names separated by a comma | ""
INSTANCE_COUNT | Targeted instance count matching the label selector | 1
RUNS | Iterations to perform action on a single node | 1
CLOUD_TYPE | Cloud platform on top of which the cluster is running; supported platforms - aws, vmware, ibmcloud, bm | aws
TIMEOUT | Duration to wait for completion of node scenario injection | 180
DURATION | Duration to stop the node before running the start action - not supported for vmware and ibm cloud type | 120
VERIFY_SESSION | Only needed for vmware - Set to True if you want to verify the vSphere client session using certificates | False
SKIP_OPENSHIFT_CHECKS | Only needed for vmware - Set to True if you don't want to wait for the status of the nodes to change on OpenShift before passing the scenario | False
BMC_USER | Only needed for Baremetal (bm) - IPMI/bmc username | ""
BMC_PASSWORD | Only needed for Baremetal (bm) - IPMI/bmc password | ""
BMC_ADDR | Only needed for Baremetal (bm) - IPMI/bmc address | ""
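
For instance, to stop and start a specific AWS node once (values are illustrative):

$ export CLOUD_TYPE=aws
$ export ACTION=node_stop_start_scenario
$ export NODE_NAME=<node-name>
$ export RUNS=1
$ export TIMEOUT=180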

For example:

$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-scenarios

The following environment variables need to be set for the scenarios that require interacting with the cloud platform APIs to perform the actions:

Amazon Web Services

$ export AWS_ACCESS_KEY_ID=<>
$ export AWS_SECRET_ACCESS_KEY=<>
$ export AWS_DEFAULT_REGION=<>

VMware vSphere

$ export VSPHERE_IP=<vSphere_client_IP_address>

$ export VSPHERE_USERNAME=<vSphere_client_username>

$ export VSPHERE_PASSWORD=<vSphere_client_password>

Ibmcloud

$ export IBMC_URL=https://<region>.iaas.cloud.ibm.com/v1

$ export IBMC_APIKEY=<ibmcloud_api_key>

Baremetal

$ export BMC_USER=<bmc/IPMI user>
$ export BMC_PASSWORD=<bmc/IPMI password>
$ export BMC_ADDR=<bmc address>

Google Cloud Platform

TBD

Azure

$ export AZURE_TENANT_ID=<>
$ export AZURE_CLIENT_SECRET=<>
$ export AZURE_CLIENT_ID=<>

OpenStack

TBD

Demo

You can find a link to a demo of the scenario here

10 - Pod Network Scenarios

Pod outage

Scenario to block the traffic (Ingress/Egress) of a pod matching the labels for the specified duration, in order to understand the behavior of the service and of other services that depend on it during downtime. This helps with planning the requirements accordingly, be it improving the timeouts or tweaking the alerts. With the current network policies, it is not possible to explicitly block ports which are enabled by an allowed network policy rule. This chaos scenario addresses this issue by using OVS flow rules to block ports related to the pod. It supports OpenShiftSDN and OVNKubernetes based networks.

10.1 - Pod Scenarios using Krkn

Sample scenario config (using a plugin)
- id: pod_network_outage
  config:
    namespace: openshift-console   # Required - Namespace of the pod to which filters need to be applied
    direction:                     # Optional - List of directions to apply filters
        - ingress                  # Blocks ingress traffic; default is both egress and ingress
    ingress_ports:                 # Optional - List of ports to block traffic on
        - 8443                     # Blocks 8443; default [], i.e. all ports
    label_selector: 'component=ui' # Blocks access to openshift console

Pod Network shaping

Scenario to introduce network latency, packet loss, and bandwidth restriction in the Pod’s network interface. The purpose of this scenario is to observe faults caused by random variations in the network.

Sample scenario config for egress traffic shaping (using plugin)
- id: pod_egress_shaping
  config:
    namespace: openshift-console   # Required - Namespace of the pod to which filters need to be applied.
    label_selector: 'component=ui' # Applies traffic shaping to access openshift console.
    network_params:
        latency: 500ms             # Add 500ms latency to egress traffic from the pod.
Sample scenario config for ingress traffic shaping (using plugin)
- id: pod_ingress_shaping
  config:
    namespace: openshift-console   # Required - Namespace of the pod to which filters need to be applied.
    label_selector: 'component=ui' # Applies traffic shaping to access openshift console.
    network_params:
        latency: 500ms             # Add 500ms latency to ingress traffic to the pod.
Steps
  • Pick the pods to introduce the network anomaly either from label_selector or pod_name.
  • Identify the pod interface name on the node.
  • Set traffic shaping config on pod’s interface using tc and netem.
  • Wait for the duration time.
  • Remove traffic shaping config on pod’s interface.
  • Remove the job that spawned the pod.

10.2 - Pod Network Chaos Scenarios using Krkn-hub

This scenario runs network chaos at the pod level on a Kubernetes/OpenShift cluster.

Run

If you are enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos, and set the CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.

$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pod-network-chaos
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pod-network-chaos
OR 
$ docker run -e <VARIABLE>=<value> --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pod-network-chaos

$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario

Supported parameters

The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:

example: export <parameter_name>=<value>

See the list of variables that apply to all scenarios here; these can be used/set in addition to the scenario-specific variables below.

Parameter | Description | Default
NAMESPACE | Required - Namespace of the pod to which filters need to be applied | ""
LABEL_SELECTOR | Label of the pod(s) to target | ""
POD_NAME | When label_selector is not specified, the pod matching the name will be selected for the chaos scenario | ""
INSTANCE_COUNT | Number of pods to perform action/select that match the label selector | 1
TRAFFIC_TYPE | List of directions to apply filters - egress/ingress (needs to be a list) | [ingress, egress]
INGRESS_PORTS | Ingress ports to block (needs to be a list) | [] i.e. all ports
EGRESS_PORTS | Egress ports to block (needs to be a list) | [] i.e. all ports
WAIT_DURATION | Ensure that it is at least about twice test_duration | 300
TEST_DURATION | Duration of the test run | 120
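
For instance, to block ingress traffic on port 8443 for the OpenShift console pods for two minutes (values are illustrative; the exact quoting of list values may vary with your shell):

$ export NAMESPACE=openshift-console
$ export LABEL_SELECTOR="component=ui"
$ export TRAFFIC_TYPE="[ingress]"
$ export INGRESS_PORTS="[8443]"
$ export TEST_DURATION=120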

For example:

$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pod-network-chaos

11 - Pod Scenarios

Krkn recently replaced PowerfulSeal with its own internal pod scenarios using a plugin system. This scenario disrupts the pods matching the label in the specified namespace on a Kubernetes/OpenShift cluster.

11.1 - Pod Scenarios using Krkn

Example Config

To enable this scenario, add a plugin_scenarios entry to the kraken config pointing at your scenario file:

kraken:
  chaos_scenarios:
    - plugin_scenarios:
      - path/to/scenario.yaml

You can then create the scenario file with the following contents:

# yaml-language-server: $schema=../plugin.schema.json
- id: kill-pods
  config:
    namespace_pattern: ^kube-system$
    label_selector: k8s-app=kube-scheduler
    krkn_pod_recovery_time: 120
    

Please adjust the schema reference to point to the schema file. This file will give you code completion and documentation for the available options in your IDE.

Pod Chaos Scenarios

The following are the components of Kubernetes/OpenShift for which a basic chaos scenario config exists today.

Component | Description | Working
Basic pod scenario | Kill a pod. | ✔
Etcd | Kills a single/multiple etcd replicas. | ✔
Kube ApiServer | Kills a single/multiple kube-apiserver replicas. | ✔
ApiServer | Kills a single/multiple apiserver replicas. | ✔
Prometheus | Kills a single/multiple prometheus replicas. | ✔
OpenShift System Pods | Kills random pods running in the OpenShift system namespaces. | ✔

11.2 - Pod Scenarios using Krkn-hub

This scenario disrupts the pods matching the label in the specified namespace on a Kubernetes/OpenShift cluster.

Run

If you are enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos, and set the CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.

$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pod-scenarios
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pod-scenarios
OR 
$ docker run -e <VARIABLE>=<value> --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pod-scenarios

$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario

Supported parameters

The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:

example: export <parameter_name>=<value>

See the list of variables that apply to all scenarios here; these can be used/set in addition to the scenario-specific variables below.

Parameter | Description | Default
NAMESPACE | Targeted namespace in the cluster ( supports regex ) | openshift-.*
POD_LABEL | Label of the pod(s) to target | ""
NAME_PATTERN | Regex pattern to match the pods in NAMESPACE when POD_LABEL is not specified | .*
DISRUPTION_COUNT | Number of pods to disrupt | 1
KILL_TIMEOUT | Timeout to wait for the target pod(s) to be removed, in seconds | 180
EXPECTED_RECOVERY_TIME | Fails if the disrupted pod(s) do not recover within this timeout | 120
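
For instance, to target the etcd pods one might set the following (illustrative values; adapt them to your cluster):

$ export NAMESPACE="openshift-etcd"
$ export POD_LABEL="k8s-app=etcd"
$ export DISRUPTION_COUNT=1
$ export EXPECTED_RECOVERY_TIME=120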

For example:

$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pod-scenarios

Demo

You can find a link to a demo of the scenario here

12 - Power Outage Scenarios

This scenario shuts down the Kubernetes/OpenShift cluster for the specified duration to simulate power outages, brings it back online and checks if it’s healthy.

12.1 - Power Outage Scenario using Krkn

The power outage/cluster shutdown scenario can be injected by placing the shut_down config file under the cluster_shut_down_scenario option in the kraken config. Refer to the cluster_shut_down_scenario config file.

Refer to cloud setup to configure your cli properly for the cloud provider of the cluster you want to shut down.

Current accepted cloud types:

cluster_shut_down_scenario:                          # Scenario to stop all the nodes for specified duration and restart the nodes.
  runs: 1                                            # Number of times to execute the cluster_shut_down scenario.
  shut_down_duration: 120                            # Duration in seconds to shut down the cluster.
  cloud_type: aws                                    # Cloud type on which Kubernetes/OpenShift runs.
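
To wire this into a run, the scenario file is referenced from the kraken config the same way as other scenario types. A minimal sketch follows; the scenario type key and the file path are assumptions and should be adjusted to match your kraken config and repository layout.

kraken:
  chaos_scenarios:
    - cluster_shut_down_scenarios:
      - scenarios/cluster_shut_down_scenario.yml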

12.2 - Power Outage Scenario using Krkn-Hub

This scenario shuts down the Kubernetes/OpenShift cluster for the specified duration to simulate power outages, brings it back online and checks if it’s healthy. More information can be found here

Right now, power outage and cluster shutdown are one and the same. We originally created this scenario to stop all the nodes and then start them back up, the way a customer would shut their cluster down.

In a real-life chaos scenario, though, we figured this was closest to the power going out on the AWS side, with all of the EC2 nodes stopped/powered off. We looked into whether the AWS CLI has a way to forcefully power off the nodes (not gracefully); it currently does not, so this scenario is as close as we can get to “pulling the plug”.

Run

If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer docs. Make sure to start it before injecting the chaos and set CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.

$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:power-outages
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:power-outages
OR 
$ docker run -e <VARIABLE>=<value> --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:power-outages

$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario

Supported parameters

The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:

example: export <parameter_name>=<value>

See the list of variables that apply to all scenarios here; they can be used/set in addition to these scenario-specific variables

Parameter | Description | Default
SHUTDOWN_DURATION | Duration in seconds to shut down the cluster | 1200
CLOUD_TYPE | Cloud platform on top of which cluster is running, supported cloud platforms | aws
TIMEOUT | Time in seconds to wait for each node to be stopped or running after the cluster comes back | 600

The following environment variables need to be set for the scenarios that require interacting with the cloud platform API to perform the actions:

Amazon Web Services

$ export AWS_ACCESS_KEY_ID=<>
$ export AWS_SECRET_ACCESS_KEY=<>
$ export AWS_DEFAULT_REGION=<>

Google Cloud Platform

TBD

Azure

TBD

OpenStack

TBD

Baremetal

TBD

For example:

$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:power-outages

Demo

You can find a link to a demo of the scenario here

13 - PVC Scenario

Scenario to fill up a given PersistentVolumeClaim by creating a temp file on the PVC from a pod associated with it. The purpose of this scenario is to fill up a volume to understand faults caused by the application using this volume.

13.1 - PVC Scenario using Krkn

Sample scenario config
pvc_scenario:
  pvc_name: <pvc_name>          # Name of the target PVC.
  pod_name: <pod_name>          # Name of the pod where the PVC is mounted. It will be ignored if the pvc_name is defined.
  namespace: <namespace_name>   # Namespace where the PVC is.
  fill_percentage: 50           # Target percentage to fill up the PVC. Value must be higher than current percentage. Valid values are between 0 and 99.
  duration: 60                  # Duration in seconds for the fault.
Steps
  • Get the pod name where the PVC is mounted.
  • Get the volume name mounted in the container pod.
  • Get the container name where the PVC is mounted.
  • Get the mount path where the PVC is mounted in the pod.
  • Get the PVC capacity and current used capacity.
  • Calculate file size to fill the PVC to the target fill_percentage (a worked example follows this list).
  • Connect to the pod.
  • Create a temp file kraken.tmp with random data on the mount path:
    • dd bs=1024 count=$file_size </dev/urandom > /mount_path/kraken.tmp
  • Wait for the duration time.
  • Remove the temp file created:
    • rm kraken.tmp
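
As a worked example of the file size calculation (the numbers are illustrative): for a 10 GiB PVC that is currently 20% full with fill_percentage: 50, the temp file has to cover the remaining 30% of the capacity, i.e. roughly 3 GiB, which maps to the dd invocation below.

# 3 GiB = 3 * 1024 * 1024 KiB, so 3145728 blocks of 1024 bytes
dd bs=1024 count=3145728 </dev/urandom > /mount_path/kraken.tmp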

13.2 - PVC Scenario using Krkn-Hub

This scenario fills up a given PersistentVolumeClaim by creating a temp file on the PVC from a pod associated with it. The purpose of this scenario is to fill up a volume to understand faults caused by the application using this volume. For more information refer to the following documentation.

Run

If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer docs. Make sure to start it before injecting the chaos and set CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.

$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pvc-scenarios
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pvc-scenarios
OR 
$ docker run -e <VARIABLE>=<value> --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pvc-scenarios

$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario

Supported parameters

The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:

example: export <parameter_name>=<value>

If both PVC_NAME and POD_NAME are defined, the POD_NAME value will be overridden with the Mounted By: value from the PVC definition.

See the list of variables that apply to all scenarios here; they can be used/set in addition to these scenario-specific variables

Parameter | Description | Default
PVC_NAME | Targeted PersistentVolumeClaim in the cluster (if null, POD_NAME is required)
POD_NAME | Targeted pod in the cluster (if null, PVC_NAME is required)
NAMESPACE | Targeted namespace in the cluster (required)
FILL_PERCENTAGE | Targeted percentage to be filled up in the PVC | 50
DURATION | Duration in seconds with the PVC filled up | 60
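
For instance (illustrative values; the PVC and namespace must exist in your cluster):

$ export PVC_NAME=my-app-data
$ export NAMESPACE=my-app
$ export FILL_PERCENTAGE=80
$ export DURATION=120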

For example:

$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pvc-scenarios

14 - Service Disruption Scenarios

Using this type of scenario configuration one is able to delete crucial objects in a specific namespace, or a namespace matching a certain regex string.

14.1 - Service Disruption Scenarios using Krkn

Configuration Options:

namespace: Specific namespace or regex style namespace of what you want to delete. Gets all namespaces if not specified; set to "" if you want to use the label_selector field.

Set to ‘^.*$’ and label_selector to "" to randomly select any namespace in your cluster.

label_selector: Label on the namespace you want to delete. Set to "" if you are using the namespace variable.

delete_count: Number of namespaces to kill in each run. Based on matching namespace and label specified, default is 1.

runs: Number of runs/iterations to kill namespaces, default is 1.

sleep: Number of seconds to wait between each iteration/count of killing namespaces. Defaults to 10 seconds if not set

Refer to namespace_scenarios_example config file.

scenarios:
- namespace: "^.*$"
  runs: 1
- namespace: "^.*ingress.*$"
  runs: 1
  sleep: 15
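
A label-based variant could look like the following sketch; the label is illustrative and must actually be present on the namespaces you want to target, with namespace set to "" so that label_selector is used, as described above.

scenarios:
- namespace: ""
  label_selector: "chaos-target=true"
  runs: 1
  delete_count: 1
  sleep: 30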

Steps

This scenario will select one or more namespaces depending on the configuration, kill all of the object types listed below in those namespaces, and wait for them to be Running again in the post action:

  1. Services
  2. Daemonsets
  3. Statefulsets
  4. Replicasets
  5. Deployments

Post Action

We do a post-chaos check to wait for and verify that the specific objects in each namespace are Ready.

There are two options here:

  1. Pass a custom script in the main config scenario list that will run before the chaos; its output is checked again after the chaos to verify it matches.

See scenarios/post_action_namespace.py for an example

- namespace_scenarios:
    - - scenarios/regex_namespace.yaml
      - scenarios/post_action_namespace.py
  2. Allow kraken to wait and check that all killed objects in the namespaces become ‘Running’ again. Kraken keeps a list of the specific objects in namespaces that were killed to verify all that were affected recover properly.
wait_time: <seconds to wait for namespace to recover>

14.2 - Service Disruption Scenario using Krkn-Hub

This scenario deletes main objects within a namespace in your Kubernetes/OpenShift cluster. More information can be found here.

Run

If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer docs. Make sure to start it before injecting the chaos and set CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.

$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:service-disruption-scenarios
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:service-disruption-scenarios
OR 
$ docker run -e <VARIABLE>=<value> --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:service-disruption-scenarios

$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario

Supported parameters

The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:

example: export <parameter_name>=<value>

See the list of variables that apply to all scenarios here; they can be used/set in addition to these scenario-specific variables

Parameter | Description | Default
LABEL_SELECTOR | Label of the namespace to target. Set this parameter only if NAMESPACE is not set | ""
NAMESPACE | Name of the namespace you want to target. Set this parameter only if LABEL_SELECTOR is not set | "openshift-etcd"
SLEEP | Number of seconds to wait before polling to see if namespace exists again | 15
DELETE_COUNT | Number of namespaces to kill in each run, based on matching namespace and label specified | 1
RUNS | Number of runs to execute the action | 1

For example:

$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:service-disruption-scenarios

Demo

You can find a link to a demo of the scenario here

15 - Service Hijacking Scenario

Service Hijacking Scenarios aim to simulate fake HTTP responses from a workload targeted by a Service already deployed in the cluster. This scenario is executed by deploying a custom-made web service and modifying the target Service selector to direct traffic to this web service for a specified duration.

15.1 - Service Hijacking Scenarios using Krkn

The web service’s source code is available here. It employs a time-based test plan from the scenario configuration file, which specifies the behavior of resources during the chaos scenario as follows:

service_target_port: http-web-svc # The port of the service to be hijacked (can be named or numeric, based on the workload and service configuration).
service_name: nginx-service # The name of the service that will be hijacked.
service_namespace: default # The namespace where the target service is located.
image: quay.io/krkn-chaos/krkn-service-hijacking:v0.1.3 # Image of the krkn web service to be deployed to receive traffic.
chaos_duration: 30 # Total duration of the chaos scenario in seconds.
plan:
  - resource: "/list/index.php" # Specifies the resource or path to respond to in the scenario. For paths, both the path and query parameters are captured but ignored. For resources, only query parameters are captured.

    steps:                      # A time-based plan consisting of steps can be defined for each resource.
      GET:                      # One or more HTTP methods can be specified for each step. Note: Non-standard methods are supported for fully custom web services (e.g., using NONEXISTENT instead of POST).

        - duration: 15          # Duration in seconds for this step before moving to the next one, if defined. Otherwise, this step will continue until the chaos scenario ends.

          status: 500           # HTTP status code to be returned in this step.
          mime_type: "application/json" # MIME type of the response for this step.
          payload: |            # The response payload for this step.
            {
              "status":"internal server error"
            }
        - duration: 15
          status: 201
          mime_type: "application/json"
          payload: |
            {
              "status":"resource created"
            }            
      POST:
        - duration: 15
          status: 401
          mime_type: "application/json"
          payload: |
            {
               "status": "unauthorized"
            }            
        - duration: 15
          status: 404
          mime_type: "text/plain"
          payload: "not found"

The scenario will focus on the service_name within the service_namespace, substituting the selector with a randomly generated one, which is added as a label in the mock service manifest. This allows multiple scenarios to be executed in the same namespace, each targeting different services without causing conflicts.

The newly deployed mock web service will expose a service_target_port, which can be either a named or numeric port based on the service configuration. This ensures that the Service correctly routes HTTP traffic to the mock web service during the chaos run.

Each step will last for duration seconds from the deployment of the mock web service in the cluster. For each HTTP resource, defined as a top-level YAML property of the plan (it could be a specific resource, e.g., /list/index.php, or a path-based resource typical in MVC frameworks), one or more HTTP request methods can be specified. Both standard and custom request methods are supported.

During this time frame, the web service will respond with:

  • status: The HTTP status code (can be standard or custom).
  • mime_type: The MIME type (can be standard or custom).
  • payload: The response body to be returned to the client.

At the end of the step duration, the web service will proceed to the next step (if available) until the global chaos_duration concludes. At this point, the original service will be restored, and the custom web service and its resources will be undeployed.

NOTE: Some clients (e.g., cURL, jQuery) may optimize queries using lightweight methods (like HEAD or OPTIONS) to probe API behavior. If these methods are not defined in the test plan, the web service may respond with a 405 or 404 status code. If you encounter unexpected behavior, consider this use case.
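
To observe the hijacked responses while the scenario is running, you can poll the service from inside the cluster. A sketch, assuming the nginx-service example above is reachable on port 80 in the default namespace:

$ kubectl run hijack-probe --rm -it --restart=Never --image=curlimages/curl -- \
    curl -s -o /dev/null -w "%{http_code}\n" http://nginx-service.default.svc/list/index.php
# expected: 500 during the first 15 seconds of the GET plan, then 201 for the next 15 seconds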

15.2 - Service Hijacking Scenario using Krkn-Hub

This scenario reroutes traffic intended for a target service to a custom web service that is automatically deployed by Krkn. This web service responds with user-defined HTTP statuses, MIME types, and bodies. For more details, please refer to the following documentation.

Run

Unlike other krkn-hub scenarios, this one requires a specific configuration due to its unique structure. You must set up the scenario in a local file following the scenario syntax, and then pass this file’s base64-encoded content to the container via the SCENARIO_BASE64 variable.

If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer docs. Make sure to start it before injecting the chaos and set CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.

$ podman run  --name=<container_name> \
              -e SCENARIO_BASE64="$(base64 -w0 <scenario_file>)" \
              -v <path_to_kubeconfig>:/home/krkn/.kube/config:Z quay.io/krkn-chaos/krkn-hub:service-hijacking
              
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
$ export SCENARIO_BASE64="$(base64 -w0 <scenario_file>)"
$ docker run $(./get_docker_params.sh) --name=<container_name> \
                                       --net=host \
                                       -v <path-to-kube-config>:/home/krkn/.kube/config:Z \
                                       -d quay.io/krkn-chaos/krkn-hub:service-hijacking
OR 
$ docker run --name=<container_name> -e SCENARIO_BASE64="$(base64 -w0 <scenario_file>)" \
                                     --net=host \
                                     -v <path-to-kube-config>:/home/krkn/.kube/config:Z \
                                     -d quay.io/krkn-chaos/krkn-hub:service-hijacking

$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario

Supported parameters

The following environment variables can be set on the host running the container to tweak the scenario/faults being injected: example: export <parameter_name>=<value>

See the list of variables that apply to all scenarios here; they can be used/set in addition to these scenario-specific variables

Parameter | Description
SCENARIO_BASE64 | Base64 encoded service-hijacking scenario file. Note that the -w0 option in the command substitution SCENARIO_BASE64="$(base64 -w0 <scenario_file>)" is mandatory in order to remove line breaks from the base64 command output

For example:

$ podman run -e SCENARIO_BASE64="$(base64 -w0 <scenario_file>)" \
             --name=<container_name> \
             --net=host \
             --env-host=true \
             -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml \
             -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts \
             -v <path-to-kube-config>:/home/krkn/.kube/config:Z \
             -d quay.io/krkn-chaos/krkn-hub:service-hijacking

16 - Time Scenarios

Using this type of scenario configuration, one is able to change the time and/or date of the system for pods or nodes.

16.1 - Time Scenarios using Krkn

Configuration Options:

action: skew_time or skew_date.

object_type: pod or node.

namespace: namespace of the pods you want to skew. Needs to be set if setting a specific pod name.

label_selector: Label on the nodes or pods you want to skew.

container_name: Container name in pod you want to reset time on. If left blank it will randomly select one.

object_name: List of the names of pods or nodes you want to skew.

Refer to time_scenarios_example config file.

time_scenarios:
  - action: skew_time
    object_type: pod
    object_name:
      - apiserver-868595fcbb-6qnsc
      - apiserver-868595fcbb-mb9j5
    namespace: openshift-apiserver
    container_name: openshift-apiserver
  - action: skew_date
    object_type: node
    label_selector: node-role.kubernetes.io/worker
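
After the scenario runs, the skew can be spot-checked directly. A sketch, reusing the illustrative pod name from the config above (substitute names from your own cluster):

$ oc exec -n openshift-apiserver apiserver-868595fcbb-6qnsc -c openshift-apiserver -- date
$ oc debug node/<worker-node-name> -- chroot /host date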

16.2 - Time Skew Scenarios using Krkn-Hub

This scenario skews the date and time of the nodes and pods matching the label on a Kubernetes/OpenShift cluster. More information can be found here.

Run

If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer docs. Make sure to start it before injecting the chaos and set CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.

$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:time-scenarios
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:time-scenarios
OR 
$ docker run -e <VARIABLE>=<value> --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:time-scenarios

$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario

Supported parameters

The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:

example: export <parameter_name>=<value>

See the list of variables that apply to all scenarios here; they can be used/set in addition to these scenario-specific variables

Parameter | Description | Default
OBJECT_TYPE | Object to target. Supported options: pod, node | pod
LABEL_SELECTOR | Label of the container(s) or nodes to target | k8s-app=etcd
ACTION | Action to run. Supported actions: skew_time, skew_date | skew_date
OBJECT_NAME | List of the names of pods or nodes you want to skew ( optional parameter ) | []
CONTAINER_NAME | Container in the specified pod to target in case the pod has multiple containers running. A random container is picked if empty | ""
NAMESPACE | Namespace of the pods you want to skew; needs to be set only if setting a specific pod name | ""

For example:

$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:time-scenarios

Demo

You can find a link to a demo of the scenario here

17 - Zone Outage Scenarios

Scenario to create an outage in a targeted zone in the public cloud to understand the impact on both the Kubernetes/OpenShift control plane and the applications running on the worker nodes in that zone. It tweaks the network ACL of the zone to simulate the failure, which stops both ingress and egress traffic from all the nodes in that zone for the specified duration, and then reverts it back to the previous state.

17.1 - Zone Outage Scenarios using Krkn

Zone outage can be injected by placing the zone_outage config file under zone_outages option in the kraken config. Refer to zone_outage_scenario config file for the parameters that need to be defined.

Refer to cloud setup to configure your cli properly for the cloud provider of the cluster you want to shut down.

Currently, aws is the only accepted cloud type for this scenario.
Sample scenario config
zone_outage:                                         # Scenario to create an outage of a zone by tweaking network ACL.
  cloud_type: aws                                    # Cloud type on which Kubernetes/OpenShift runs. aws is the only platform supported currently for this scenario.
  duration: 600                                      # Duration in seconds after which the zone will be back online.
  vpc_id:                                            # Cluster virtual private network to target.
  subnet_id: [subnet1, subnet2]                      # List of subnet-id's to deny both ingress and egress traffic.
Debugging steps in case of failures

In case of failures during the steps which revert back the network acl to allow traffic and bring back the cluster nodes in the zone, the nodes in the particular zone will be in NotReady condition. Here is how to fix it:

  • OpenShift by default deploys the nodes in different zones for fault tolerance, for example us-west-2a, us-west-2b, us-west-2c. The cluster is associated with a virtual private network and each zone has its own subnet with a network acl which defines the ingress and egress traffic rules at the zone level unlike security groups which are at an instance level.
  • From the cloud web console, select one of the instances in the zone which is down and go to the subnet_id specified in the config.
  • Look at the network ACL associated with the subnet; you will see both ingress and egress traffic being denied, which is expected since Kraken deliberately injects it (a CLI way to locate this ACL is sketched after this list).
  • Kraken just switches the network ACL while keeping the original or default network ACL around; switching back to the default network ACL from the drop-down menu will bring the nodes in the targeted zone back into Ready state.
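
As an alternative to the web console, the network ACL currently associated with a subnet can be located from the AWS CLI; a sketch, with the subnet id as a placeholder:

$ aws ec2 describe-network-acls --filters Name=association.subnet-id,Values=<subnet-id> --query 'NetworkAcls[].NetworkAclId'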

17.2 - Zone Outage Scenarios using Krkn-Hub

This scenario disrupts a targeted zone in the public cloud by blocking egress and ingress traffic to understand the impact on both the Kubernetes/OpenShift control plane and the applications running on the worker nodes in that zone. More information is documented here

Run

If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer docs. Make sure to start it before injecting the chaos and set CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.

$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:zone-outages
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:zone-outages
OR 
$ docker run -e <VARIABLE>=<value> --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:zone-outages

$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario

Supported parameters

The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:

example: export <parameter_name>=<value>

See the list of variables that apply to all scenarios here; they can be used/set in addition to these scenario-specific variables

Parameter | Description | Default
CLOUD_TYPE | Cloud platform on top of which cluster is running, supported cloud platforms | aws
DURATION | Duration in seconds after which the zone will be back online | 600
VPC_ID | Cluster virtual private network to target ( REQUIRED ) | ""
SUBNET_ID | Subnet-id(s) to deny both ingress and egress traffic ( REQUIRED ). Format: [subnet1, subnet2] | ""
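
For instance (the VPC and subnet ids are placeholders; use the ones backing your cluster):

$ export CLOUD_TYPE=aws
$ export DURATION=600
$ export VPC_ID=<vpc-id>
$ export SUBNET_ID="[<subnet-id-1>, <subnet-id-2>]"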

The following environment variables need to be set for the scenarios that require interacting with the cloud platform API to perform the actions:

Amazon Web Services

$ export AWS_ACCESS_KEY_ID=<>
$ export AWS_SECRET_ACCESS_KEY=<>
$ export AWS_DEFAULT_REGION=<>

Google Cloud Platform

TBD

Azure

TBD

OpenStack

TBD

Baremetal

TBD

For example:

$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:zone-outages

Demo

You can find a link to a demo of the scenario here

18 - All Scenarios Variables

These variables are to be used in the top-level configuration template and are shared by all the scenarios

See the description and default values below

Supported parameters

The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:

example: export <parameter_name>=<value>

Parameter | Description | Default
CERBERUS_ENABLED | Set this to true if cerberus is running and monitoring the cluster | False
CERBERUS_URL | URL to poll for the go/no-go signal | http://0.0.0.0:8080
WAIT_DURATION | Duration in seconds to wait between each chaos scenario | 60
ITERATIONS | Number of times to execute the scenarios | 1
DAEMON_MODE | Iterations are set to infinity, which means that kraken will cause chaos forever | False
PUBLISH_KRAKEN_STATUS | Publishes kraken status on SIGNAL_ADDRESS:PORT when enabled | True
SIGNAL_ADDRESS | Address to print kraken status to | 0.0.0.0
PORT | Port to print kraken status to | 8081
SIGNAL_STATE | Waits for the RUN signal when set to PAUSE before running the scenarios, refer docs for more details | RUN
DEPLOY_DASHBOARDS | Deploys mutable grafana loaded with dashboards visualizing performance metrics pulled from in-cluster prometheus. The dashboard will be exposed as a route. | False
CAPTURE_METRICS | Captures metrics as specified in the profile from in-cluster prometheus. Default metrics captured are listed here | False
ENABLE_ALERTS | Evaluates expressions from in-cluster prometheus and exits 0 or 1 based on the severity set. Default profile. More details can be found here | False
ALERTS_PATH | Path to the alerts file to use when ENABLE_ALERTS is set | config/alerts
CHECK_CRITICAL_ALERTS | When enabled, will check prometheus for critical alerts firing post chaos | False
TELEMETRY_ENABLED | Enables/disables the telemetry collection feature | False
TELEMETRY_API_URL | Telemetry service endpoint | https://ulnmf9xv7j.execute-api.us-west-2.amazonaws.com/production
TELEMETRY_USERNAME | Telemetry service username | redhat-chaos
TELEMETRY_PASSWORD | Telemetry service password | No default
TELEMETRY_PROMETHEUS_BACKUP | Enables/disables prometheus data collection | True
TELEMTRY_FULL_PROMETHEUS_BACKUP | If set to False, only the /prometheus/wal folder will be downloaded | False
TELEMETRY_BACKUP_THREADS | Number of telemetry download/upload threads | 5
TELEMETRY_ARCHIVE_PATH | Local path where the archive files will be temporarily stored | /tmp
TELEMETRY_MAX_RETRIES | Maximum number of upload retries (if 0, will retry forever) | 0
TELEMETRY_RUN_TAG | If set, this will be appended to the run folder in the bucket (useful to group the runs) | chaos
TELEMETRY_GROUP | If set, will archive the telemetry in the S3 bucket in a folder named after the value | default
TELEMETRY_ARCHIVE_SIZE | Size of each prometheus data archive file in KB; the lower the archive size, the higher the number of archive files produced and uploaded | 1000
TELEMETRY_LOGS_BACKUP | Logs backup to S3 | False
TELEMETRY_FILTER_PATTER | Filter logs based on certain time stamp patterns | ["(\w{3}\s\d{1,2}\s\d{2}:\d{2}:\d{2}\.\d+).+","kinit (\d+/\d+/\d+\s\d{2}:\d{2}:\d{2})\s+","(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z).+"]
TELEMETRY_CLI_PATH | OC CLI path; if not specified it will be searched in $PATH | blank
ELASTIC_SERVER | URL of the elasticsearch data storage used to track telemetry data | blank
ELASTIC_INDEX | Elasticsearch index pattern to post results to | blank

19 - Supported Cloud Providers

AWS

NOTE: For clusters with AWS make sure AWS CLI is installed and properly configured using an AWS account

GCP

NOTE: For clusters with GCP make sure GCP CLI is installed.

A google service account is required to give proper authentication to GCP for node actions. See here for how to create a service account.

NOTE: A user with ‘resourcemanager.projects.setIamPolicy’ permission is required to grant project-level permissions to the service account.

After creating the service account you will need to enable the account using the following: export GOOGLE_APPLICATION_CREDENTIALS="<serviceaccount.json>"

Openstack

NOTE: For clusters with Openstack Cloud, ensure to create and source the OPENSTACK RC file to set the OPENSTACK environment variables from the server where Kraken runs.

Azure

NOTE: You will need to create a service principal and give it the correct access, see here for creating the service principal and setting the proper permissions.

To run properly, the service principal requires the “Azure Active Directory Graph/Application.ReadWrite.OwnedBy” API permission and the “User Access Administrator” role.

Before running you will need to set the following:

  1. export AZURE_SUBSCRIPTION_ID=<subscription_id>

  2. export AZURE_TENANT_ID=<tenant_id>

  3. export AZURE_CLIENT_SECRET=<client secret>

  4. export AZURE_CLIENT_ID=<client id>

Alibaba

See the Installation guide to install alicloud cli.

  1. export ALIBABA_ID=<access_key_id>

  2. export ALIBABA_SECRET=<access key secret>

  3. export ALIBABA_REGION_ID=<region id>

Refer to region and zone page to get the region id for the region you are running on.

Set cloud_type to either alibaba or alicloud in your node scenario yaml file.

VMware

Set the following environment variables

  1. export VSPHERE_IP=<vSphere_client_IP_address>

  2. export VSPHERE_USERNAME=<vSphere_client_username>

  3. export VSPHERE_PASSWORD=<vSphere_client_password>

These are the credentials that you would normally use to access the vSphere client.

IBMCloud

If no api key is set up with proper VPC resource permissions, use the following to create:

  • Access group
  • Service id with the following access
    • With policy VPC Infrastructure Services
    • Resources = All
    • Roles:
      • Editor
      • Administrator
      • Operator
      • Viewer
  • API Key

Set the following environment variables

  1. export IBMC_URL=https://<region>.iaas.cloud.ibm.com/v1

  2. export IBMC_APIKEY=<ibmcloud_api_key>