Scenarios
Krkn scenario list
Supported chaos scenarios
Scenario | Plugin Type | Description |
---|---|---|
Pod failures | pod_disruption_scenarios | Injects pod failures |
Container failures | container_scenarios | Injects container failures based on the provided kill signal |
Node failures | node_scenarios | Injects node failures through OpenShift/Kubernetes or cloud APIs |
Zone outages | zone_outages_scenarios | Creates zone outage to observe the impact on the cluster, applications |
Time skew | time_scenarios | Skews the time and date |
Node cpu hog | hog_scenarios | Hogs CPU on the targeted nodes |
Node memory hog | hog_scenarios | Hogs memory on the targeted nodes |
Node IO hog | hog_scenarios | Hogs IO on the targeted nodes |
Service Disruption | service_disruption_scenarios | Deletes all objects within a namespace |
Application outages | application_outages_scenarios | Isolates application Ingress/Egress traffic to observe the impact on dependent applications and recovery/initialization timing |
Power Outages | cluster_shut_down_scenarios | Shuts down the cluster for the specified duration and turns it back on to check the cluster health |
PVC disk fill | pvc_scenarios | Fills up a given PersistentVolumeClaim by creating a temp file on the PVC from a pod associated with it |
Network Chaos | network_chaos_scenarios | Introduces network latency, packet loss, bandwidth restriction in the egress traffic of a Node’s interface using tc and Netem |
Network Chaos NG | network_chaos_ng_scenarios | Introduces a node network filtering scenario along with new infrastructure for refactoring and porting the Network Chaos scenarios |
Pod Network Chaos | pod_network_scenarios | Introduces network chaos at pod level |
Service Hijacking | service_hijacking_scenarios | Hijacks a service's HTTP traffic to simulate custom HTTP responses |
Syn Flood | syn_flood_scenarios | Generates a substantial amount of TCP traffic directed at one or more Kubernetes services |
How to Use Plugin Names
Use the plugin type in the second column above when creating your chaos_scenarios section in the config/config.yaml file
kraken:
kubeconfig_path: ~/.kube/config # Path to kubeconfig
..
chaos_scenarios:
- <plugin_type>:
- scenarios/<scenario_name>.yaml
1 - All Scenarios Variables
These variables are used in the top-level configuration template and are shared by all the scenarios in Krkn-hub
See the description and default values below
Supported parameters for all scenarios in Krkn-Hub
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
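For instance, to run the chosen scenarios three times with a two-minute pause between them and post-chaos alert checking (illustrative values; the parameters are described in the table below):
export ITERATIONS=3
export WAIT_DURATION=120
export CHECK_CRITICAL_ALERTS=True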
Parameter | Description | Default |
---|---|---|
CERBERUS_ENABLED | Set this to true if cerberus is running and monitoring the cluster | False |
CERBERUS_URL | URL to poll for the go/no-go signal | http://0.0.0.0:8080 |
WAIT_DURATION | Duration in seconds to wait between each chaos scenario | 60 |
ITERATIONS | Number of times to execute the scenarios | 1 |
DAEMON_MODE | When enabled, iterations are set to infinity, which means that Kraken will inject chaos forever | False |
PUBLISH_KRAKEN_STATUS | Set to true to publish the Kraken status (RUN/PAUSE) on the signal address and port below | True |
SIGNAL_ADDRESS | Address to print kraken status to | 0.0.0.0 |
PORT | Port to print kraken status to | 8081 |
SIGNAL_STATE | Waits for the RUN signal when set to PAUSE before running the scenarios, refer docs for more details | RUN |
DEPLOY_DASHBOARDS | Deploys a mutable Grafana instance loaded with dashboards visualizing performance metrics pulled from in-cluster Prometheus. The dashboard will be exposed as a route. | False |
CAPTURE_METRICS | Captures metrics as specified in the profile from in-cluster Prometheus. The default metrics captured are listed here | False |
ENABLE_ALERTS | Evaluates expressions from in-cluster prometheus and exits 0 or 1 based on the severity set. Default profile. | False |
ALERTS_PATH | Path to the alerts file to use when ENABLE_ALERTS is set | config/alerts |
CHECK_CRITICAL_ALERTS | When enabled will check prometheus for critical alerts firing post chaos | False |
TELEMETRY_ENABLED | Enable/disables the telemetry collection feature | False |
TELEMETRY_API_URL | telemetry service endpoint | https://ulnmf9xv7j.execute-api.us-west-2.amazonaws.com/production |
TELEMETRY_USERNAME | telemetry service username | redhat-chaos |
TELEMETRY_PASSWORD | Telemetry service password | No default |
TELEMETRY_PROMETHEUS_BACKUP | enables/disables prometheus data collection | True |
TELEMTRY_FULL_PROMETHEUS_BACKUP | If set to False, only the /prometheus/wal folder will be downloaded | False |
TELEMETRY_BACKUP_THREADS | number of telemetry download/upload threads | 5 |
TELEMETRY_ARCHIVE_PATH | local path where the archive files will be temporarily stored | /tmp |
TELEMETRY_MAX_RETRIES | maximum number of upload retries (if 0 will retry forever) | 0 |
TELEMETRY_RUN_TAG | if set, this will be appended to the run folder in the bucket (useful to group the runs) | chaos |
TELEMETRY_GROUP | if set will archive the telemetry in the S3 bucket on a folder named after the value | default |
TELEMETRY_ARCHIVE_SIZE | the size, in KB, of each prometheus data archive file; see the note below on the trade-off between archive size and upload parallelism | 1000 |
TELEMETRY_LOGS_BACKUP | Enables logs backup to S3 | False |
TELEMETRY_FILTER_PATTER | Filter logs based on certain time stamp patterns | ["(\w{3}\s\d{1,2}\s\d{2}:\d{2}:\d{2}\.\d+).+","kinit (\d+/\d+/\d+\s\d{2}:\d{2}:\d{2})\s+","(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z).+"] |
TELEMETRY_CLI_PATH | OC CLI path; if not specified it will be searched for in $PATH | blank |
ELASTIC_SERVER | URL of the Elasticsearch instance used to store telemetry data | blank |
ELASTIC_INDEX | Elasticsearch index pattern to post results to | blank |
Note
The lower the TELEMETRY_ARCHIVE_SIZE, the higher the number of archive files that will be produced and uploaded (and processed simultaneously by the backup threads). For unstable or slow connections it is better to keep this value low and increase the number of backup threads; that way, on upload failure, only the failed chunk is retried instead of the whole upload.
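For example, on a slow or unstable connection you might lower the archive size and raise the thread count (illustrative values):
export TELEMETRY_ARCHIVE_SIZE=500
export TELEMETRY_BACKUP_THREADS=10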
2 - Supported Cloud Providers
AWS
NOTE: For clusters with AWS make sure the AWS CLI is installed and properly configured using an AWS account
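For example, the AWS CLI can be configured interactively; the prompts look roughly like this (values are placeholders):
$ aws configure
AWS Access Key ID [None]: <access_key_id>
AWS Secret Access Key [None]: <secret_access_key>
Default region name [None]: <region>
Default output format [None]: json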
GCP
NOTE: For clusters with GCP make sure GCP CLI is installed.
A google service account is required to give proper authentication to GCP for node actions. See here for how to create a service account.
NOTE: A user with ‘resourcemanager.projects.setIamPolicy’ permission is required to grant project-level permissions to the service account.
After creating the service account you will need to enable the account using the following: export GOOGLE_APPLICATION_CREDENTIALS="<serviceaccount.json>"
In krkn-hub, you'll need to both set the environment variable and also copy the file to the local container
-e GOOGLE_APPLICATION_CREDENTIALS=<container_creds_file>
Needs to match the container file path above
-v <local_gcp_creds_file>:<container_creds_file>:Z
Example:
podman run -e GOOGLE_APPLICATION_CREDENTIALS=/home/krkn/GCP_app.json -e DURATION=10 --net=host -v <kubeconfig>:/home/krkn/.kube/config:Z -v <local_gcp_creds_file>:/home/krkn/GCP_app.json:Z -d quay.io/krkn-chaos/krkn-hub:...
Openstack
NOTE: For clusters with OpenStack Cloud, ensure you create and source the OpenStack RC file to set the OpenStack environment variables on the server where Kraken runs.
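For example (the RC file name is a placeholder; the openstack CLI is only needed for the optional check):
$ source <openstack_rc_file>
$ openstack server list # quick check that the credentials are picked up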
Azure
NOTE: You will need to create a service principal and give it the correct access, see here for creating the service principal and setting the proper permissions.
To run properly, the service principal requires the "Azure Active Directory Graph/Application.ReadWrite.OwnedBy" API permission granted and the "User Access Administrator" role.
Before running you will need to set the following:
export AZURE_SUBSCRIPTION_ID=<subscription_id>
export AZURE_TENANT_ID=<tenant_id>
export AZURE_CLIENT_SECRET=<client secret>
export AZURE_CLIENT_ID=<client id>
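If you still need to create the service principal, a minimal sketch using the Azure CLI is shown below; the role and scope are illustrative and should be adjusted to match the permissions described above. The command prints appId, password and tenant values, which map to AZURE_CLIENT_ID, AZURE_CLIENT_SECRET and AZURE_TENANT_ID respectively.
$ az ad sp create-for-rbac --name <service_principal_name> --role "User Access Administrator" --scopes /subscriptions/<subscription_id>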
Alibaba
See the Installation guide to install alicloud cli.
export ALIBABA_ID=<access_key_id>
export ALIBABA_SECRET=<access key secret>
export ALIBABA_REGION_ID=<region id>
Refer to region and zone page to get the region id for the region you are running on.
Set cloud_type to either alibaba or alicloud in your node scenario yaml file.
VMware
Set the following environment variables
export VSPHERE_IP=<vSphere_client_IP_address>
export VSPHERE_USERNAME=<vSphere_client_username>
export VSPHERE_PASSWORD=<vSphere_client_password>
These are the credentials that you would normally use to access the vSphere client.
IBMCloud
If you do not have an API key set up with the proper VPC resource permissions, create the following:
- Access group
- Service id with the following access
- With policy VPC Infrastructure Services
- Resources = All
- Roles:
- Editor
- Administrator
- Operator
- Viewer
- API Key
Set the following environment variables
export IBMC_URL=https://<region>.iaas.cloud.ibm.com/v1
export IBMC_APIKEY=<ibmcloud_api_key>
3 - Network Chaos NG Scenario
This scenario introduces a new infrastructure to refactor and port the current implementation of the network chaos plugins. It also introduces a new scenario
for node IP traffic filtering.
3.1 - Network Chaos API
AbstractNetworkChaosModule
abstract module class
All the plugins must implement the AbstractNetworkChaosModule
abstract class in order to be instantiated and run by the Network Chaos NG plugin.
This abstract class implements two main abstract methods:
run(self, target: str, kubecli: KrknTelemetryOpenshift, error_queue: queue.Queue = None)
is the entrypoint for each Network Chaos module. If the module is configured to be run in parallel, error_queue must not be None.
- target: the name of the resource (Pod, Node etc.) that will be targeted by the scenario
- kubecli: the KrknTelemetryOpenshift instance needed by the scenario to access the krkn-lib methods
- error_queue: a queue that will be used by the plugin to push the errors raised during the execution of parallel modules
get_config(self) -> (NetworkChaosScenarioType, BaseNetworkChaosConfig)
returns the common subset of settings shared by all the scenarios (BaseNetworkChaosConfig) and the type of Network Chaos Scenario that is running (Pod Scenario or Node Scenario)
BaseNetworkChaosConfig
base module configuration
This is the base class that contains the common parameters shared by all the Network Chaos NG modules.
- id: the string name of the Network Chaos NG module
- wait_duration: if there is more than one network module config in the same config file, the plugin will wait wait_duration seconds before running the following one
- test_duration: the duration in seconds of the scenario
- label_selector: the selector used to target the resource
- instance_count: if greater than 0, picks instance_count elements randomly from the targets selected by the filters
- execution: if more than one target is selected by the selector, the scenario can target the resources either in serial or in parallel
- namespace: the namespace where the scenario workloads will be deployed
3.2 - Node Network Filter
Overview
Creates iptables rules on one or more nodes to block incoming and outgoing traffic on a port in the node network interface. Can be used to block network based services connected to the node or to block inter-node communication.
Configuration
- id: node_network_filter
wait_duration: 300
test_duration: 100
label_selector: "kubernetes.io/hostname=ip-10-0-39-182.us-east-2.compute.internal"
instance_count: 1
execution: parallel
namespace: 'default'
# scenario specific settings
ingress: false
egress: true
target: node
interfaces: []
ports:
- 2049
For the common module settings please refer to the documentation.
- ingress: filters the incoming traffic on one or more ports. If set, one or more network interfaces must be specified
- egress: filters the outgoing traffic on one or more ports
- target: sets the type of resource to be targeted; values can be node or pod
- interfaces: a list of network interfaces where the incoming traffic will be filtered
- ports: the list of ports that will be filtered
Examples
AWS EFS (Elastic File System) disruption
- id: node_network_filter
wait_duration: 300
test_duration: 100
label_selector: "node-role.kubernetes.io/worker="
instance_count: 0
execution: parallel
namespace: 'default'
# scenario specific settings
ingress: false
egress: true
target: node
interfaces: []
ports:
- 2049
This configuration will disrupt all the PVCs provided by the AWS EFS service to an OCP/K8s cluster. The service is essentially an elastic NFS service, so blocking the outgoing traffic on port 2049
on the worker nodes will cause all the pods mounting such PVCs to be unable to read and write in the mounted folder.
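A quick, hedged way to confirm the disruption from a pod that mounts an EFS-backed PVC (pod name, namespace and mount path are placeholders; the container image must provide the timeout utility):
$ kubectl exec -n <namespace> <pod-with-efs-pvc> -- timeout 10 sh -c 'echo probe > <efs-mount-path>/krkn-probe && echo write-ok || echo write-blocked'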
Etcd Split Brain
- id: node_network_filter
wait_duration: 300
test_duration: 100
label_selector: "node-role.kubernetes.io/master="
instance_count: 1
execution: parallel
namespace: 'default'
# scenario specific settings
ingress: false
egress: true
target: node
interfaces: []
ports:
- 2379
- 2380
This configuration disrupts the etcd traffic on one of the master nodes. The targeted master is isolated from the other nodes, causing the election of two etcd leader nodes: one is the isolated node, the other is elected between the two remaining nodes.
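While the filter is active, you can observe the member isolation with a hedged check like the one below (it assumes an OpenShift cluster where the etcd pods carry the app=etcd label and include an etcdctl container; the pod name is a placeholder):
$ oc get pods -n openshift-etcd -l app=etcd -o wide
$ oc rsh -n openshift-etcd -c etcdctl <etcd-pod> etcdctl endpoint health --cluster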
4 - Application Outage Scenarios
Application outages
Scenario to block the traffic (Ingress/Egress) of an application matching the labels for the specified duration of time, to understand the behavior of the service and of other services which depend on it during downtime. This helps with planning the requirements accordingly, be it improving the timeouts or tweaking the alerts etc.
You can add your application's URL to the health checks section of the config to track the downtime of your application during this scenario
4.1 - Application Outage Scenarios using Krkn
Sample scenario config
application_outage: # Scenario to create an outage of an application by blocking traffic
duration: 600 # Duration in seconds after which the routes will be accessible
namespace: <namespace-with-application> # Namespace to target - all application routes will go inaccessible if pod selector is empty
pod_selector: {app: foo} # Pods to target
block: [Ingress, Egress] # It can be Ingress or Egress or Ingress, Egress
Debugging steps in case of failures
Kraken creates a network policy blocking the ingress/egress traffic to create an outage. If Kraken fails before reverting the network policy, you can delete it manually by executing the following command to stop the outage:
$ oc delete networkpolicy/kraken-deny -n <targeted-namespace>
4.2 - Application outage Scenario using Krkn-hub
This scenario disrupts the traffic to the specified application to be able to understand the impact of the outage on the dependent service/user experience. Refer docs for more details.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer docs. Make sure to start it before injecting the chaos and set CERBERUS_ENABLED
environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:application-outages
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Note
--env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines.
Without the --env-host option you'll have to set each environment variable on the podman command line like -e <VARIABLE>=<value>
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:application-outages
OR
$ docker run -e <VARIABLE>=<value> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:application-outages
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
Example if --env-host is used:
export <parameter_name>=<value>
OR on the command line like example:
-e <VARIABLE>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
Parameter | Description | Default |
---|---|---|
DURATION | Duration in seconds after which the routes will be accessible | 600 |
NAMESPACE | Namespace to target - all application routes will go inaccessible if pod selector is empty ( Required ) | No default |
POD_SELECTOR | Pods to target. For example “{app: foo}” | No default |
BLOCK_TRAFFIC_TYPE | It can be Ingress or Egress or Ingress, Egress ( needs to be a list ) | [Ingress, Egress] |
Note
Defining the NAMESPACE
parameter is required for running this scenario while the pod_selector is optional. In case of using a pod selector to target a particular application, make sure to define it using the following format with a space between key and value: "{key: value}".
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS
or ENABLE_ALERTS
is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml
and /home/krkn/kraken/config/alerts. For example:
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:application-outages
Demo
You can find a link to a demo of the scenario here
5 - Container Scenarios
Kraken uses the oc exec
command to kill
specific containers in a pod.
This can be based on the pods namespace or labels. If you know the exact object you want to kill, you can also specify the specific container name or pod name in the scenario yaml file.
These scenarios are in a simple yaml format that you can manipulate to run your specific tests or use the pre-existing scenarios to see how it works.
5.1 - Container Scenarios using Krkn
Example Config
The following are the components of Kubernetes for which a basic chaos scenario config exists today.
scenarios:
- name: "<name of scenario>"
namespace: "<specific namespace>" # can specify "*" if you want to find in all namespaces
label_selector: "<label of pod(s)>"
container_name: "<specific container name>" # This is optional, can take out and will kill all containers in all pods found under namespace and label
pod_names: # This is optional, can take out and will select all pods with given namespace and label
- <pod_name>
count: <number of containers to disrupt, default=1>
action: <kill signal to run. For example 1 ( hang up ) or 9. Default is set to 1>
expected_recovery_time: <number of seconds to wait for container to be running again> (defaults to 120 seconds)
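For example, a minimal scenario file targeting the etcd containers might look like the sketch below; the file name is hypothetical and the values mirror the krkn-hub defaults listed later in this section, so adjust them to your cluster:
cat <<EOF > scenarios/my_container_scenario.yml
scenarios:
- name: "kill etcd container"
  namespace: "openshift-etcd"
  label_selector: "k8s-app=etcd"
  container_name: "etcd"
  count: 1
  action: 1
  expected_recovery_time: 120
EOF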
Post Action
In all scenarios we do a post chaos check to wait and verify the specific component.
Here there are two options:
- Pass a custom script in the main config scenario list that will run before the chaos; its output is then checked to match after the chaos scenario completes.
See scenarios/post_action_etcd_container.py for an example.
- container_scenarios: # List of chaos pod scenarios to load.
- - scenarios/container_etcd.yml
- scenarios/post_action_etcd_container.py
- Allow kraken to wait and check the killed containers until they become ready again. Kraken keeps a list of the specific
containers that were killed as well as the namespaces and pods to verify all containers that were affected recover properly.
expected_recovery_time: <seconds to wait for container to recover>
5.2 - Container Scenarios using Krkn-hub
This scenario disrupts the containers matching the label in the specified namespace on a Kubernetes/OpenShift cluster.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer docs. Make sure to start it before injecting the chaos and set CERBERUS_ENABLED
environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:container-scenarios
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Note
--env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines.
Without the --env-host option you'll have to set each environment variable on the podman command line like -e <VARIABLE>=<value>
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:container-scenarios
OR
$ docker run -e <VARIABLE>=<value> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:container-scenarios
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
Example if --env-host is used:
export <parameter_name>=<value>
OR on the command line like example:
-e <VARIABLE>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
Parameter | Description | Default |
---|---|---|
NAMESPACE | Targeted namespace in the cluster | openshift-etcd |
LABEL_SELECTOR | Label of the container(s) to target | k8s-app=etcd |
DISRUPTION_COUNT | Number of containers to disrupt | 1 |
CONTAINER_NAME | Name of the container to disrupt | etcd |
ACTION | Kill signal to send, for example 1 (hang up) or 9 (kill) | 1 |
EXPECTED_RECOVERY_TIME | Time to wait before checking if all containers that were affected recover properly | 60 |
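For example, to send SIGKILL (signal 9) to a single etcd container using the parameters above (the other values match the defaults in the table):
$ docker run -e NAMESPACE=openshift-etcd -e LABEL_SELECTOR=k8s-app=etcd -e DISRUPTION_COUNT=1 -e ACTION=9 --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:container-scenarios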
Note
Set NAMESPACE environment variable to openshift-.*
to pick and disrupt pods randomly in openshift system namespaces. The DAEMON_MODE can also be enabled to disrupt the pods every x seconds in the background to check the reliability.
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS
or ENABLE_ALERTS
is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml
and /home/krkn/kraken/config/alerts. For example:
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:container-scenarios
Demo
You can find a link to a demo of the scenario here
6 - Hog Scenarios
Hog Scenarios background
Hog Scenarios are designed to push the limits of memory, CPU, or I/O on one or more nodes in your cluster. They also serve to evaluate whether your cluster can withstand rogue pods that excessively consume resources without any limits.
These scenarios involve deploying one or more workloads in the cluster. Based on the specific configuration, these workloads will use a predetermined amount of resources for a specified duration.
Config Options
Common options
Option | Type | Description |
---|---|---|
duration | number | the duration of the stress test in seconds |
workers | number (Optional) | the number of threads instantiated by stress-ng, if left empty the number of workers will match the number of available cores in the node. |
hog-type | string (Enum) | can be cpu, memory or io. |
image | string | the container image of the stress workload |
namespace | string | the namespace where the stress workload will be deployed |
node-selector | string (Optional) | defines the node selector for choosing target nodes. If not specified, one schedulable node in the cluster will be chosen at random. If multiple nodes match the selector, all of them will be subjected to stress. If number-of-nodes is specified, that many nodes will be randomly selected from those identified by the selector. |
number-of-nodes | number (Optional) | restricts the number of selected nodes by the selector |
Available Scenarios
Hog scenarios:
6.1 - Hog Scenarios using Krkn
Usage
To enable hog scenarios edit the kraken config file, go to the section kraken -> chaos_scenarios
of the yaml structure
and add a new element to the list named hog_scenarios
then add the desired scenario
pointing to the input.yaml
file.
kraken:
...
chaos_scenarios:
- hog_scenarios:
- scenarios/kube/cpu-hog/input.yaml
The implemented scenarios can be found in the scenarios/hog/<scenario_name> folder.
The entrypoint of each scenario is the input.yaml file.
This file contains all the options to set up the scenario according to the desired target.
config.yaml
The hog config file. Here you can set the hog deployer and the hog log level.
The supported deployers are:
- Docker
- Podman (podman daemon not needed, suggested option)
- Kubernetes
The supported log levels are:
workflow.yaml
This file contains the steps that will be executed to perform the scenario against the target.
Each step is represented by a container that will be executed from the deployer and its options.
Note that we provide the scenarios as a template, but they can be manipulated to define more complex workflows.
For more details regarding the hog workflow architecture and syntax, refer to the hog documentation.
This edit is no longer in the quay image.
A fix is being worked on in ticket: https://issues.redhat.com/browse/CHAOS-494
This will affect all versions 4.12 and higher of OpenShift
6.2 - CPU Hog Scenario
The purpose of this scenario is to create cpu pressure on a particular node of the Kubernetes/OpenShift cluster for a time span.
6.2.1 - CPU Hog Scenarios using Krkn
To enable this plugin add the pointer to the scenario input file scenarios/kube/cpu-hog.yml
as described in the
Usage section.
cpu-hog
options
In addition to the common hog scenario options, you can set the options below in your scenario configuration to specify the amount of CPU to hog on a certain worker node
Option | Type | Description |
---|---|---|
cpu-load-percentage | number | the amount of cpu that will be consumed by the hog |
cpu-method | string | reflects the cpu load strategy adopted by stress-ng, please refer to the stress-ng documentation for all the available options |
Usage
To enable hog scenarios edit the kraken config file, go to the section kraken -> chaos_scenarios
of the yaml structure
and add a new element to the list named hog_scenarios
then add the desired scenario
pointing to the hog.yaml
file.
kraken:
...
chaos_scenarios:
- hog_scenarios:
- scenarios/kube/cpu-hog.yml
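A sketch of what such an input file could look like, assuming the field names from the common and cpu-hog option tables above (the values and target label are illustrative; adapt them to your cluster):
cat <<EOF > scenarios/kube/cpu-hog.yml
duration: 60
workers: 2
hog-type: cpu
image: quay.io/krkn-chaos/krkn-hog
namespace: default
node-selector: "node-role.kubernetes.io/worker="
number-of-nodes: 1
cpu-load-percentage: 80
cpu-method: all
EOF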
6.2.2 - CPU Hog Scenario using Krkn-Hub
This scenario hogs the cpu on the specified node on a Kubernetes/OpenShift cluster for a specified duration. For more information refer the following documentation.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer docs. Make sure to start it before injecting the chaos and set CERBERUS_ENABLED
environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-cpu-hog
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Note
--env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines.
Without the --env-host option you'll have to set each environment variable on the podman command line like -e <VARIABLE>=<value>
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-cpu-hog
OR
$ docker run -e <VARIABLE>=<value> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-cpu-hog
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
Example if --env-host is used:
export <parameter_name>=<value>
OR on the command line like example:
-e <VARIABLE>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
Parameter | Description | Default |
---|---|---|
TOTAL_CHAOS_DURATION | Set chaos duration (in sec) as desired | 60 |
NODE_CPU_CORE | Number of cores (workers) of node CPU to be consumed | 2 |
NODE_CPU_PERCENTAGE | Percentage of total cpu to be consumed | 50 |
NAMESPACE | Namespace where the scenario container will be deployed | default |
NODE_SELECTOR | defines the node selector for choosing target nodes. If not specified, one schedulable node in the cluster will be chosen at random. If multiple nodes match the selector, all of them will be subjected to stress. If number-of-nodes is specified, that many nodes will be randomly selected from those identified by the selector. | "" |
NUMBER_OF_NODES | restricts the number of selected nodes by the selector | "" |
IMAGE | the container image of the stress workload | quay.io/krkn-chaos/krkn-hog |
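For example, when --env-host is used the scenario parameters can simply be exported before the run (illustrative values):
export TOTAL_CHAOS_DURATION=120
export NODE_CPU_PERCENTAGE=80
export NODE_SELECTOR="node-role.kubernetes.io/worker="
$ podman run --name=node-cpu-hog --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-cpu-hog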
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS
or ENABLE_ALERTS
is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml
and /home/krkn/kraken/config/alerts. For example:
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-cpu-hog
Demo
You can find a link to a demo of the scenario here
6.3 - IO Hog Scenario
The purpose of this scenario is to create disk pressure on a particular node of the Kubernetes/OpenShift cluster for a time span.
The scenario allows attaching a node path to the pod as a hostPath
volume.
6.3.1 - IO Hog Scenarios using Krkn
To enable this plugin add the pointer to the scenario input file scenarios/kube/io-hog.yaml
as described in the
Usage section.
io-hog
options
In addition to the common hog scenario options, you can specify the below options in your scenario configuration to target specific pod IO
Option | Type | Description |
---|---|---|
io-block-size | string | the block size written by the stressor |
io-write-bytes | string | the total amount of data that will be written by the stressor. The size can be specified as % of free space on the file system or in units of Bytes, KBytes, MBytes and GBytes using the suffix b, k, m or g |
io-target-pod-folder | string | the folder where the volume will be mounted in the pod |
io-target-pod-volume | dictionary | the pod volume definition that will be stressed by the scenario. |
[!CAUTION]
Modifying the structure of io-target-pod-volume
might alter how the hog operates, potentially rendering it ineffective.
Usage
To enable hog scenarios edit the kraken config file, go to the section kraken -> chaos_scenarios
of the yaml structure
and add a new element to the list named hog_scenarios
then add the desired scenario
pointing to the hog.yaml
file.
kraken:
...
chaos_scenarios:
- hog_scenarios:
- scenarios/kube/io-hog.yml
6.3.2 - IO Hog Scenario using Krkn-Hub
This scenario hogs the IO on the specified node on a Kubernetes/OpenShift cluster for a specified duration. For more information refer the following documentation.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer docs. Make sure to start it before injecting the chaos and set CERBERUS_ENABLED
environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/root/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-io-hog
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Note
--env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines.
Without the --env-host option you'll have to set each environment variable on the podman command line like -e <VARIABLE>=<value>
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/root/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-io-hog
OR
$ docker run -e <VARIABLE>=<value> --net=host -v <path-to-kube-config>:/root/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-io-hog
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
Example if --env-host is used:
export <parameter_name>=<value>
OR on the command line like example:
-e <VARIABLE>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
Parameter | Description | Default |
---|---|---|
TOTAL_CHAOS_DURATION | Set chaos duration (in sec) as desired | 180 |
IO_BLOCK_SIZE | string size of each write in bytes. Size can be from 1 byte to 4m | 1m |
IO_WORKERS | Number of stressors | 5 |
IO_WRITE_BYTES | string writes N bytes for each hdd process. The size can be expressed as % of free space on the file system or in units of Bytes, KBytes, MBytes and GBytes using the suffix b, k, m or g | 10m |
NAMESPACE | Namespace where the scenario container will be deployed | default |
NODE_SELECTOR | defines the node selector for choosing target nodes. If not specified, one schedulable node in the cluster will be chosen at random. If multiple nodes match the selector, all of them will be subjected to stress. If number-of-nodes is specified, that many nodes will be randomly selected from those identified by the selector. | "" |
NODE_MOUNT_PATH | the local path in the node that will be mounted in the pod and that will be filled by the scenario | "" |
NUMBER_OF_NODES | restricts the number of selected nodes by the selector | "" |
IMAGE | the container image of the stress workload | quay.io/krkn-chaos/krkn-hog |
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS
or ENABLE_ALERTS
is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml
and /home/krkn/kraken/config/alerts. For example:
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/root/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/root/kraken/config/alerts -v <path-to-kube-config>:/root/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-io-hog
6.4 - Memory Hog Scenario
The purpose of this scenario is to create Virtual Memory pressure on a particular node of the Kubernetes/OpenShift cluster for a time span.
6.4.1 - Memory Hog Scenarios using Krkn
To enable this plugin add the pointer to the scenario input file scenarios/kube/memory-hog.yml
as described in the
Usage section.
memory-hog
options
In addition to the common hog scenario options, you can set the options below in your scenario configuration to specify the amount of memory to hog on a certain worker node
Option | Type | Description |
---|---|---|
memory-vm-bytes | string | the amount of memory that the scenario will try to hog. The size can be specified as % of free space on the file system or in units of Bytes, KBytes, MBytes and GBytes using the suffix b, k, m or g |
Usage
To enable hog scenarios edit the kraken config file, go to the section kraken -> chaos_scenarios
of the yaml structure
and add a new element to the list named hog_scenarios
then add the desired scenario
pointing to the hog.yaml
file.
kraken:
...
chaos_scenarios:
- hog_scenarios:
- scenarios/kube/memory-hog.yml
6.4.2 - Memory Hog Scenario using Krkn-Hub
This scenario hogs the memory on the specified node on a Kubernetes/OpenShift cluster for a specified duration. For more information refer the following documentation.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer docs. Make sure to start it before injecting the chaos and set CERBERUS_ENABLED
environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-memory-hog
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Note
--env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines.
Without the --env-host option you'll have to set each environment variable on the podman command line like -e <VARIABLE>=<value>
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-memory-hog
OR
$ docker run -e <VARIABLE>=<value> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-memory-hog
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
Example if --env-host is used:
export <parameter_name>=<value>
OR on the command line like example:
-e <VARIABLE>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
Parameter | Description | Default |
---|---|---|
TOTAL_CHAOS_DURATION | Set chaos duration (in sec) as desired | 60 |
MEMORY_CONSUMPTION_PERCENTAGE | percentage (expressed with the suffix %) or amount (expressed with the suffix b, k, m or g) of memory to be consumed by the scenario | 90% |
NUMBER_OF_WORKERS | Total number of workers (stress-ng threads) | 1 |
NAMESPACE | Namespace where the scenario container will be deployed | default |
NODE_SELECTOR | defines the node selector for choosing target nodes. If not specified, one schedulable node in the cluster will be chosen at random. If multiple nodes match the selector, all of them will be subjected to stress. If number-of-nodes is specified, that many nodes will be randomly selected from those identified by the selector. | "" |
NUMBER_OF_NODES | restricts the number of selected nodes by the selector | "" |
IMAGE | the container image of the stress workload | quay.io/krkn-chaos/krkn-hog |
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS
or ENABLE_ALERTS
is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml
and /home/krkn/kraken/config/alerts. For example:
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-memory-hog
Demo
You can find a link to a demo of the scenario here
7 - ManagedCluster Scenarios
ManagedCluster scenarios provide a way to integrate kraken with Open Cluster Management (OCM) and Red Hat Advanced Cluster Management for Kubernetes (ACM).
ManagedCluster scenarios leverage ManifestWorks to inject faults into the ManagedClusters.
The following ManagedCluster chaos scenarios are supported:
- managedcluster_start_scenario: Scenario to start the ManagedCluster instance.
- managedcluster_stop_scenario: Scenario to stop the ManagedCluster instance.
- managedcluster_stop_start_scenario: Scenario to stop and then start the ManagedCluster instance.
- start_klusterlet_scenario: Scenario to start the klusterlet of the ManagedCluster instance.
- stop_klusterlet_scenario: Scenario to stop the klusterlet of the ManagedCluster instance.
- stop_start_klusterlet_scenario: Scenario to stop and start the klusterlet of the ManagedCluster instance.
ManagedCluster scenarios can be injected by placing the ManagedCluster scenarios config files under managedcluster_scenarios
option in the Kraken config. Refer to managedcluster_scenarios_example config file.
managedcluster_scenarios:
- actions: # ManagedCluster chaos scenarios to be injected
- managedcluster_stop_start_scenario
managedcluster_name: cluster1 # ManagedCluster on which scenario has to be injected; can set multiple names separated by comma
# label_selector: # When managedcluster_name is not specified, a ManagedCluster with matching label_selector is selected for ManagedCluster chaos scenario injection
instance_count: 1 # Number of managedcluster to perform action/select that match the label selector
runs: 1 # Number of times to inject each scenario under actions (will perform on same ManagedCluster each time)
timeout: 420 # Duration to wait for completion of ManagedCluster scenario injection
# For OCM to detect a ManagedCluster as unavailable, have to wait 5*leaseDurationSeconds
# (default leaseDurationSeconds = 60 sec)
- actions:
- stop_start_klusterlet_scenario
managedcluster_name: cluster1
# label_selector:
instance_count: 1
runs: 1
timeout: 60
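To observe the effect of a ManagedCluster scenario from the hub cluster, you can watch the ManagedCluster availability before and after the injection; this is a hedged check and the condition type may vary across OCM/ACM versions:
$ oc get managedclusters
$ oc get managedcluster cluster1 -o jsonpath='{.status.conditions[?(@.type=="ManagedClusterConditionAvailable")].status}'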
8 - Network Chaos Scenario
Scenario to introduce network latency, packet loss, and bandwidth restriction in the Node’s host network interface. The purpose of this scenario is to observe faults caused by random variations in the network.
8.1 - Network Chaos Scenario using Krkn
Sample scenario config for egress traffic shaping
network_chaos: # Scenario to create an outage by simulating random variations in the network.
duration: 300 # In seconds - duration network chaos will be applied.
node_name: # Comma separated node names on which scenario has to be injected.
label_selector: node-role.kubernetes.io/master # When node_name is not specified, a node with matching label_selector is selected for running the scenario.
instance_count: 1 # Number of nodes in which to execute network chaos.
interfaces: # List of interface on which to apply the network restriction.
- "ens5" # Interface name would be the Kernel host network interface name.
execution: serial|parallel # Execute each of the egress options as a single scenario(parallel) or as separate scenario(serial).
egress:
latency: 500ms
loss: 50% # percentage
bandwidth: 10mbit
Sample scenario config for ingress traffic shaping (using a plugin)
- id: network_chaos
config:
node_interface_name: # Dictionary with key as node name(s) and value as a list of its interfaces to test
ip-10-0-128-153.us-west-2.compute.internal:
- ens5
- genev_sys_6081
label_selector: node-role.kubernetes.io/master # When node_interface_name is not specified, nodes with matching label_selector is selected for node chaos scenario injection
instance_count: 1 # Number of nodes to perform action/select that match the label selector
kubeconfig_path: ~/.kube/config # Path to kubernetes config file. If not specified, it defaults to ~/.kube/config
execution_type: parallel # Execute each of the ingress options as a single scenario(parallel) or as separate scenario(serial).
network_params:
latency: 500ms
loss: '50%'
bandwidth: 10mbit
wait_duration: 120
test_duration: 60
Note: For ingress traffic shaping, ensure that your node doesn't have any [IFB](https://wiki.linuxfoundation.org/networking/ifb) interfaces already present. The scenario relies on creating IFBs to do the shaping, and they are deleted at the end of the scenario.
Steps
- Pick the nodes to introduce the network anomaly either from node_name or label_selector.
- Verify interface list in one of the nodes or use the interface with a default route, as test interface, if no interface is specified by the user.
- Set traffic shaping config on node's interface using tc and netem.
- Wait for the duration time.
- Remove traffic shaping config on node's interface.
- Remove the job that spawned the pod.
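For reference, the traffic shaping applied on the node's interface is roughly equivalent to the following tc/netem commands (illustrative only; Kraken applies and removes the rules automatically through the pod it spawns):
$ tc qdisc add dev ens5 root netem delay 500ms loss 50% rate 10mbit
$ tc qdisc del dev ens5 root # cleanup performed after the test duration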
8.2 - Network Chaos Scenario using Krkn-Hub
This scenario introduces network latency, packet loss, and bandwidth restriction in the egress traffic of a Node's interface using tc and netem. For more information refer to the following documentation.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer docs. Make sure to start it before injecting the chaos and set CERBERUS_ENABLED
environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:network-chaos
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Note
--env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines.
Without the --env-host option you'll have to set each environment variable on the podman command line like -e <VARIABLE>=<value>
$ docker run -e <VARIABLE>=<value> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:network-chaos
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
Example if --env-host is used:
export <parameter_name>=<value>
OR on the command line like example:
-e <VARIABLE>=<value>
Note
export TRAFFIC_TYPE=egress
for Egress scenarios and export TRAFFIC_TYPE=ingress
for Ingress scenarios.
See the list of variables that apply to all scenarios here; they can be used/set in addition to these scenario-specific variables
Egress Scenarios
Parameter | Description | Default |
---|---|---|
DURATION | Duration in seconds during which network chaos will be applied | 300 |
NODE_NAME | Node name to inject faults in case of targeting a specific node; Can set multiple node names separated by a comma | "" |
LABEL_SELECTOR | When NODE_NAME is not specified, a node with matching label_selector is selected for running. | node-role.kubernetes.io/master |
INSTANCE_COUNT | Targeted instance count matching the label selector | 1 |
INTERFACES | List of interface on which to apply the network restriction. | [] |
EXECUTION | Execute each of the egress option as a single scenario(parallel) or as separate scenario(serial). | parallel |
EGRESS | Dictionary of values to set network latency (latency: 50ms), packet loss (loss: 0.02), bandwidth restriction (bandwidth: 100mbit) | {bandwidth: 100mbit} |
Ingress Scenarios
Parameter | Description | Default |
---|---|---|
DURATION | Duration in seconds during which network chaos will be applied | 300 |
TARGET_NODE_AND_INTERFACE | # Dictionary with key as node name(s) and value as a list of its interfaces to test. For example: {ip-10-0-216-2.us-west-2.compute.internal: [ens5]} | "" |
LABEL_SELECTOR | When NODE_NAME is not specified, a node with matching label_selector is selected for running. | node-role.kubernetes.io/master |
INSTANCE_COUNT | Targeted instance count matching the label selector | 1 |
EXECUTION | Used to specify whether you want to apply filters on interfaces one at a time or all at once. | parallel |
NETWORK_PARAMS | latency, loss and bandwidth are the three supported network parameters to alter for the chaos test. For example: {latency: 50ms, loss: '0.02'} | "" |
WAIT_DURATION | Ensure that it is at least about twice the test_duration | 300 |
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS
or ENABLE_ALERTS
is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml
and /home/krkn/kraken/config/alerts. For example:
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:container-scenarios
9 - Node Scenarios
This scenario disrupts the node(s) matching the label or node name(s) on a Kubernetes/OpenShift cluster.
Recovery Times
In each node scenario, the end telemetry details of the run will show the time it took for each node to stop and recover, depending on the scenario.
The details printed in telemetry:
- node_name: Node name
- node_id: Node id
- not_ready_time: Amount of time the node took to get to a not-ready state after the cloud provider stopped the node
- ready_time: Amount of time the node took to get to a ready state after the cloud provider reported the node as started
- stopped_time: Amount of time the cloud provider took to stop a node
- running_time: Amount of time the cloud provider took to get a node running
- terminating_time: Amount of time the cloud provider took for node to become terminated
Example:
"affected_nodes": [
{
"node_name": "cluster-name-**.438115.internal",
"node_id": "cluster-name-**",
"not_ready_time": 0.18194103240966797,
"ready_time": 0.0,
"stopped_time": 140.74104499816895,
"running_time": 0.0,
"terminating_time": 0.0
},
{
"node_name": "cluster-name-**-master-0.438115.internal",
"node_id": "cluster-name-**-master-0",
"not_ready_time": 0.1611928939819336,
"ready_time": 0.0,
"stopped_time": 146.72056317329407,
"running_time": 0.0,
"terminating_time": 0.0
},
{
"node_name": "cluster-name-**.438115.internal",
"node_id": "cluster-name-**",
"not_ready_time": 0.0,
"ready_time": 43.521320104599,
"stopped_time": 0.0,
"running_time": 12.305592775344849,
"terminating_time": 0.0
},
{
"node_name": "cluster-name-**-master-0.438115.internal",
"node_id": "cluster-name-**-master-0",
"not_ready_time": 0.0,
"ready_time": 48.33336925506592,
"stopped_time": 0.0,
"running_time": 12.052034854888916,
"terminating_time": 0.0
}
]
9.1 - Node Scenarios using Krkn
The following node chaos scenarios are supported:
- node_start_scenario: Scenario to start the node instance.
- node_stop_scenario: Scenario to stop the node instance.
- node_stop_start_scenario: Scenario to stop and then start the node instance. Not supported on VMware.
- node_termination_scenario: Scenario to terminate the node instance.
- node_reboot_scenario: Scenario to reboot the node instance.
- stop_kubelet_scenario: Scenario to stop the kubelet of the node instance.
- stop_start_kubelet_scenario: Scenario to stop and start the kubelet of the node instance.
- restart_kubelet_scenario: Scenario to restart the kubelet of the node instance.
- node_crash_scenario: Scenario to crash the node instance.
- stop_start_helper_node_scenario: Scenario to stop and start the helper node and check service status.
Supported cloud providers:
Note
If the node does not recover from the node_crash_scenario injection, reboot the node to get it back to Ready state.
Note
node_start_scenario, node_stop_scenario, node_stop_start_scenario, node_termination_scenario, node_reboot_scenario and stop_start_kubelet_scenario are supported on
- AWS
- Azure
- OpenStack
- BareMetal
- GCP
- VMware
- Alibaba
- IbmCloud
General
Note
The node_crash_scenario
and stop_kubelet_scenario
scenarios are supported independent of the cloud platform. Use 'generic' or do not add the 'cloud_type' key to your scenario if your cluster is not set up using one of the currently supported cloud types.
AWS
Cloud setup instructions can be found here.
Sample scenario config can be found here.
Baremetal
Sample scenario config can be found here.
Note
Baremetal requires setting the IPMI user and password to power on, off, and reboot nodes, using the config options bm_user
and bm_password
. It can either be set in the root of the entry in the scenarios config, or it can be set per machine.
If no per-machine addresses are specified, kraken attempts to use the BMC value in the BareMetalHost object. To list them, you can do 'oc get bmh -o wide --all-namespaces'. If the BMC values are blank, you must specify them per-machine using the config option 'bmc_addr' as specified below.
For per-machine settings, add a "bmc_info" section to the entry in the scenarios config. Inside there, add a configuration section using the node name. In that, add per-machine settings. Valid settings are 'bmc_user', 'bmc_password', and 'bmc_addr'.
See the example node scenario or the example below.
Note
Baremetal requires oc (openshift client) to be installed on the machine running Kraken.
Note
Baremetal machines are fragile. Some node actions can occasionally corrupt the filesystem if the node does not shut down properly, and sometimes the kubelet does not start properly.
Docker
The Docker provider can be used to run node scenarios against kind clusters.
kind is a tool for running local Kubernetes clusters using Docker container “nodes”.
kind was primarily designed for testing Kubernetes itself, but may be used for local development or CI.
GCP
Cloud setup instructions can be found here. Sample scenario config can be found here.
Openstack
How to set up Openstack cli to run node scenarios is defined here.
The supported node level chaos scenarios on an OPENSTACK cloud are node_stop_start_scenario, stop_start_kubelet_scenario and node_reboot_scenario.
Note
For stop_start_helper_node_scenario, visit here to learn more about the helper node and its usage.
To execute the scenario, ensure the value for ssh_private_key in the node scenarios config file is set with the correct private key file path for ssh connection to the helper node. Ensure passwordless ssh is configured on the host running Kraken and the helper node to avoid connection errors.
Azure
Cloud setup instructions can be found here. Sample scenario config can be found here.
Alibaba
How to set up Alibaba cli to run node scenarios is defined here.
Note
There is no “terminating” concept in Alibaba, so any scenario involving termination will “release” the node instead. Releasing a node is a two-step process: the node is stopped first and then released.
VMware
How to set up VMware vSphere to run node scenarios is defined here
IBMCloud
How to set up IBMCloud to run node scenarios is defined here
See a sample IBM Cloud node scenarios example config file.
9.2 - Node Scenarios using Krkn-Hub
This scenario disrupts the node(s) matching the label on a Kubernetes/OpenShift cluster. Actions/disruptions supported are listed here
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer docs. Make sure to start it before injecting the chaos and set CERBERUS_ENABLED
environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-scenarios
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Note
--env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines.
Without the --env-host option, you’ll have to set each environment variable on the podman command line like -e <VARIABLE>=<value>
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-scenarios
OR
$ docker run -e <VARIABLE>=<value> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-scenarios
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
Example if –env-host is used:
export <parameter_name>=<value>
OR on the command line like example:
-e <VARIABLE>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
Parameter | Description | Default |
---|
ACTION | Node chaos action to inject; can be one of the supported actions listed above | node_stop_start_scenario for aws, vmware-node-reboot for vmware, ibmcloud-node-reboot for ibmcloud |
LABEL_SELECTOR | Node label to target | node-role.kubernetes.io/worker |
NODE_NAME | Node name to inject faults in case of targeting a specific node; Can set multiple node names separated by a comma | "" |
INSTANCE_COUNT | Targeted instance count matching the label selector | 1 |
RUNS | Iterations to perform action on a single node | 1 |
CLOUD_TYPE | Cloud platform on top of which cluster is running, supported platforms - aws, vmware, ibmcloud, bm | aws |
TIMEOUT | Duration to wait for completion of node scenario injection | 180 |
DURATION | Duration to stop the node before running the start action - not supported for vmware and ibm cloud type | 120 |
VERIFY_SESSION | Only needed for vmware - Set to True if you want to verify the vSphere client session using certificates | False |
SKIP_OPENSHIFT_CHECKS | Only needed for vmware - Set to True if you don’t want to wait for the status of the nodes to change on OpenShift before passing the scenario | False |
BMC_USER | Only needed for Baremetal ( bm ) - IPMI/bmc username | "" |
BMC_PASSWORD | Only needed for Baremetal ( bm ) - IPMI/bmc password | "" |
BMC_ADDR | Only needed for Baremetal ( bm ) - IPMI/bmc address | "" |
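As a concrete example, stopping and then starting a single AWS worker node could be configured by exporting the table’s variables before launching the container (values shown are hypothetical):
$ export CLOUD_TYPE=aws
$ export ACTION=node_stop_start_scenario
$ export LABEL_SELECTOR=node-role.kubernetes.io/worker
$ export INSTANCE_COUNT=1
$ export DURATION=120
$ export AWS_ACCESS_KEY_ID=<>
$ export AWS_SECRET_ACCESS_KEY=<>
$ export AWS_DEFAULT_REGION=<>
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-scenarios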
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts. For example:
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-scenarios
The following environment variables need to be set for the scenarios that require interacting with the cloud platform API to perform the actions:
Amazon Web Services
$ export AWS_ACCESS_KEY_ID=<>
$ export AWS_SECRET_ACCESS_KEY=<>
$ export AWS_DEFAULT_REGION=<>
VMware Vsphere
$ export VSPHERE_IP=<vSphere_client_IP_address>
$ export VSPHERE_USERNAME=<vSphere_client_username>
$ export VSPHERE_PASSWORD=<vSphere_client_password>
Ibmcloud
$ export IBMC_URL=https://<region>.iaas.cloud.ibm.com/v1
$ export IBMC_APIKEY=<ibmcloud_api_key>
Baremetal
$ export BMC_USER=<bmc/IPMI user>
$ export BMC_PASSWORD=<bmc/IPMI password>
$ export BMC_ADDR=<bmc address>
Google Cloud Platform
$ export GOOGLE_APPLICATION_CREDENTIALS=<GCP Json>
Azure
$ export AZURE_TENANT_ID=<>
$ export AZURE_CLIENT_SECRET=<>
$ export AZURE_CLIENT_ID=<>
OpenStack
export OS_USERNAME=username
export OS_PASSWORD=password
export OS_TENANT_NAME=projectName
export OS_AUTH_URL=https://identityHost:portNumber/v2.0
export OS_TENANT_ID=tenantIDString
export OS_REGION_NAME=regionName
export OS_CACERT=/path/to/cacertFile
Demo
You can find a link to a demo of the scenario here
10 - Pod Network Scenarios
Pod outage
Scenario to block the traffic ( Ingress/Egress ) of a pod matching the labels for the specified duration of time to understand the behavior of the service/other services which depend on it during downtime. This helps with planning the requirements accordingly, be it improving the timeouts or tweaking the alerts etc.
With the current network policies, it is not possible to explicitly block ports which are enabled by allowed network policy rule. This chaos scenario addresses this issue by using OVS flow rules to block ports related to the pod. It supports OpenShiftSDN and OVNKubernetes based networks.
10.1 - Pod Scenarios using Krkn
Sample scenario config (using a plugin)
- id: pod_network_outage
config:
namespace: openshift-console # Required - Namespace of the pod to which filter need to be applied
direction: # Optional - List of directions to apply filters
- ingress # Blocks ingress traffic, Default both egress and ingress
ingress_ports: # Optional - List of ports to block traffic on
- 8443 # Blocks 8443, Default [], i.e. all ports.
label_selector: 'component=ui' # Blocks access to openshift console
Pod Network shaping
Scenario to introduce network latency, packet loss, and bandwidth restriction in the Pod’s network interface. The purpose of this scenario is to observe faults caused by random variations in the network.
Sample scenario config for egress traffic shaping (using plugin)
- id: pod_egress_shaping
config:
namespace: openshift-console # Required - Namespace of the pod to which filter need to be applied.
label_selector: 'component=ui' # Applies traffic shaping to access openshift console.
network_params:
latency: 500ms # Add 500ms latency to egress traffic from the pod.
Sample scenario config for ingress traffic shaping (using plugin)
- id: pod_ingress_shaping
config:
namespace: openshift-console # Required - Namespace of the pod to which filter need to be applied.
label_selector: 'component=ui' # Applies traffic shaping to access openshift console.
network_params:
latency: 500ms # Add 500ms latency to ingress traffic to the pod.
Steps
- Pick the pods to introduce the network anomaly either from label_selector or pod_name.
- Identify the pod interface name on the node.
- Set traffic shaping config on pod’s interface using tc and netem (see the sketch after this list).
- Wait for the duration time.
- Remove traffic shaping config on pod’s interface.
- Remove the job that spawned the pod.
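For reference, the shaping applied in step 3 is conceptually equivalent to running tc/netem against the pod’s interface from a helper job on the node, roughly as sketched below; the exact commands Krkn issues may differ, and <pod_interface> is a placeholder:
# Add 500ms of latency to traffic leaving the pod's interface (illustrative)
$ tc qdisc add dev <pod_interface> root netem delay 500ms
# Remove the shaping once the scenario duration has elapsed
$ tc qdisc del dev <pod_interface> root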
10.2 - Pod Network Chaos Scenarios using Krkn-hub
This scenario runs network chaos at the pod level on a Kubernetes/OpenShift cluster.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer docs. Make sure to start it before injecting the chaos and set CERBERUS_ENABLED
environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pod-network-chaos
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Note
--env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines.
Without the --env-host option, you’ll have to set each environment variable on the podman command line like -e <VARIABLE>=<value>
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pod-network-chaos
OR
$ docker run -e <VARIABLE>=<value> --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pod-network-chaos
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
Example if –env-host is used:
export <parameter_name>=<value>
OR on the command line like example:
-e <VARIABLE>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
Parameter | Description | Default |
---|
NAMESPACE | Required - Namespace of the pod to which filter need to be applied | "" |
LABEL_SELECTOR | Label of the pod(s) to target | "" |
POD_NAME | When label_selector is not specified, pod matching the name will be selected for the chaos scenario | "" |
INSTANCE_COUNT | Number of pods to perform action/select that match the label selector | 1 |
TRAFFIC_TYPE | List of directions to apply filters - egress/ingress ( needs to be a list ) | [ingress, egress] |
INGRESS_PORTS | Ingress ports to block ( needs to be a list ) | [] i.e all ports |
EGRESS_PORTS | Egress ports to block ( needs to be a list ) | [] i.e all ports |
WAIT_DURATION | Ensure that it is at least about twice the TEST_DURATION | 300 |
TEST_DURATION | Duration of the test run | 120 |
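For example, blocking only ingress traffic on port 8443 for the pods labelled component=ui in the openshift-console namespace could be configured as follows (values are hypothetical and mirror the Krkn example above):
$ export NAMESPACE=openshift-console
$ export LABEL_SELECTOR=component=ui
$ export TRAFFIC_TYPE="[ingress]"
$ export INGRESS_PORTS="[8443]"
$ export TEST_DURATION=120
$ export WAIT_DURATION=300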
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts. For example:
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pod-network-chaos
11 - Pod Scenarios
Krkn recently replaced PowerfulSeal with its own internal pod scenarios using a plugin system. This scenario disrupts the pods matching the label in the specified namespace on a Kubernetes/OpenShift cluster.
11.1 - Pod Scenarios using Krkn
Example Config
To use pod disruption scenarios, reference the scenario file under pod_disruption_scenarios in the chaos_scenarios section of the Krkn config:
kraken:
chaos_scenarios:
- pod_disruption_scenarios:
- path/to/scenario.yaml
You can then create the scenario file with the following contents:
# yaml-language-server: $schema=../plugin.schema.json
- id: kill-pods
config:
namespace_pattern: ^kube-system$
label_selector: k8s-app=kube-scheduler
krkn_pod_recovery_time: 120
Please adjust the schema reference to point to the schema file. This file will give you code completion and documentation for the available options in your IDE.
Pod Chaos Scenarios
The following are the components of Kubernetes/OpenShift for which a basic chaos scenario config exists today.
Component | Description | Working |
---|
Basic pod scenario | Kill a pod. | :heavy_check_mark: |
Etcd | Kills a single/multiple etcd replicas. | :heavy_check_mark: |
Kube ApiServer | Kills a single/multiple kube-apiserver replicas. | :heavy_check_mark: |
ApiServer | Kills a single/multiple apiserver replicas. | :heavy_check_mark: |
Prometheus | Kills a single/multiple prometheus replicas. | :heavy_check_mark: |
OpenShift System Pods | Kills random pods running in the OpenShift system namespaces. | :heavy_check_mark: |
11.2 - Pod Scenarios using Krkn-hub
This scenario disrupts the pods matching the label in the specified namespace on a Kubernetes/OpenShift cluster.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer docs. Make sure to start it before injecting the chaos and set CERBERUS_ENABLED
environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pod-scenarios
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Note
--env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines.
Without the --env-host option, you’ll have to set each environment variable on the podman command line like -e <VARIABLE>=<value>
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pod-scenarios
OR
$ docker run -e <VARIABLE>=<value> --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pod-scenarios
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
Example if –env-host is used:
export <parameter_name>=<value>
OR on the command line like example:
-e <VARIABLE>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
Parameter | Description | Default |
---|
NAMESPACE | Targeted namespace in the cluster ( supports regex ) | openshift-.* |
POD_LABEL | Label of the pod(s) to target | "" |
NAME_PATTERN | Regex pattern to match the pods in NAMESPACE when POD_LABEL is not specified | .* |
DISRUPTION_COUNT | Number of pods to disrupt | 1 |
KILL_TIMEOUT | Timeout to wait for the target pod(s) to be removed in seconds | 180 |
EXPECTED_RECOVERY_TIME | Fails if the disrupted pod(s) do not recover within the timeout set | 120 |
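For instance, to disrupt a single etcd pod in the openshift-etcd namespace and require it to recover within two minutes, the variables could be set as follows (hypothetical values):
$ export NAMESPACE=openshift-etcd
$ export POD_LABEL=k8s-app=etcd
$ export DISRUPTION_COUNT=1
$ export EXPECTED_RECOVERY_TIME=120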
Note
Set the NAMESPACE environment variable to openshift-.* to pick and disrupt pods randomly in OpenShift system namespaces. DAEMON_MODE can also be enabled to disrupt the pods every x seconds in the background to check reliability.
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts. For example:
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pod-scenarios
Demo
You can find a link to a demo of the scenario here
12 - Power Outage Scenarios
This scenario shuts down the Kubernetes/OpenShift cluster for the specified duration to simulate a power outage, brings it back online, and checks if it’s healthy.
12.1 - Power Outage Scenario using Krkn
A power outage / cluster shutdown scenario can be injected by placing the shut_down config file under the cluster_shut_down_scenario option in the kraken config. Refer to the cluster_shut_down_scenario config file.
Refer to cloud setup to configure your cli properly for the cloud provider of the cluster you want to shut down.
Current accepted cloud types:
cluster_shut_down_scenario: # Scenario to stop all the nodes for specified duration and restart the nodes.
runs: 1 # Number of times to execute the cluster_shut_down scenario.
shut_down_duration: 120 # Duration in seconds to shut down the cluster.
cloud_type: aws # Cloud type on which Kubernetes/OpenShift runs.
12.2 - Power Outage Scenario using Krkn-Hub
This scenario shuts down the Kubernetes/OpenShift cluster for the specified duration to simulate a power outage, brings it back online, and checks if it’s healthy. More information can be found here
Right now, power outage and cluster shutdown are one and the same. We originally created this scenario to stop all the nodes and then start them back up, the way a customer would shut their cluster down.
In a real-life chaos situation, though, this is closest to the power going out on the AWS side, with all of the EC2 nodes being stopped/powered off.
We looked into whether the AWS CLI has a way to forcefully power off the nodes (not gracefully); it currently does not, so this scenario is as close as we can get to “pulling the plug”.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer docs. Make sure to start it before injecting the chaos and set CERBERUS_ENABLED
environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:power-outages
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Note
--env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines.
Without the --env-host option, you’ll have to set each environment variable on the podman command line like -e <VARIABLE>=<value>
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:power-outages
OR
$ docker run -e <VARIABLE>=<value> --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:power-outages
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
Parameter | Description | Default |
---|
SHUTDOWN_DURATION | Duration in seconds to shut down the cluster | 1200 |
CLOUD_TYPE | Cloud platform on top of which cluster is running, supported cloud platforms | aws |
TIMEOUT | Time in seconds to wait for each node to be stopped or running after the cluster comes back | 600 |
The following environment variables need to be set for the scenarios that require interacting with the cloud platform API to perform the actions:
Amazon Web Services
$ export AWS_ACCESS_KEY_ID=<>
$ export AWS_SECRET_ACCESS_KEY=<>
$ export AWS_DEFAULT_REGION=<>
Google Cloud Platform
Azure
OpenStack
Baremetal
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts. For example:
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:power-outages
Demo
You can find a link to a demo of the scenario here
13 - PVC Scenario
Scenario to fill up a given PersistentVolumeClaim by creating a temp file on the PVC from a pod associated with it. The purpose of this scenario is to fill up a volume to understand faults caused by the application using this volume.
13.1 - PVC Scenario using Krkn
Sample scenario config
pvc_scenario:
pvc_name: <pvc_name> # Name of the target PVC.
pod_name: <pod_name> # Name of the pod where the PVC is mounted. It will be ignored if the pvc_name is defined.
namespace: <namespace_name> # Namespace where the PVC is.
fill_percentage: 50 # Target percentage to fill up the PVC. Value must be higher than the current usage percentage. Valid values are between 0 and 99.
duration: 60 # Duration in seconds for the fault.
Steps
- Get the pod name where the PVC is mounted.
- Get the volume name mounted in the container pod.
- Get the container name where the PVC is mounted.
- Get the mount path where the PVC is mounted in the pod.
- Get the PVC capacity and current used capacity.
- Calculate the file size to fill the PVC to the target fill_percentage (see the sketch after this list).
- Connect to the pod.
- Create a temp file kraken.tmp with random data on the mount path: dd bs=1024 count=$file_size </dev/urandom > /mount_path/kraken.tmp
- Wait for the duration time.
- Remove the temp file created on the mount path.
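The sizing step boils down to simple arithmetic: the scenario must write enough 1 KB blocks so that the used space plus the new file reaches fill_percentage of the PVC capacity. A minimal sketch of that calculation, with illustrative example numbers:
capacity_kb=1048576                                              # Example: 1Gi PVC capacity reported inside the pod
used_kb=104858                                                   # Example: roughly 10% currently used
fill_percentage=50                                               # Target fill level from the scenario config
file_size=$(( capacity_kb * fill_percentage / 100 - used_kb ))   # Number of 1KB blocks to write into kraken.tmp
dd bs=1024 count=$file_size </dev/urandom > /mount_path/kraken.tmp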
13.2 - PVC Scenario using Krkn-Hub
This scenario fills up a given PersistentVolumeClaim by creating a temp file on the PVC from a pod associated with it. The purpose of this scenario is to fill up a volume to understand faults caused by the application using this volume. For more information refer to the following documentation.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer docs. Make sure to start it before injecting the chaos and set CERBERUS_ENABLED
environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pvc-scenarios
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Note
--env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines.
Without the --env-host option, you’ll have to set each environment variable on the podman command line like -e <VARIABLE>=<value>
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pvc-scenarios
OR
$ docker run -e <VARIABLE>=<value> --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pvc-scenarios
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
Example if –env-host is used:
export <parameter_name>=<value>
OR on the command line like example:
-e <VARIABLE>=<value>
If both PVC_NAME and POD_NAME are defined, the POD_NAME value will be overridden with the Mounted By: value from the PVC definition.
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
Parameter | Description | Default |
---|
PVC_NAME | Targeted PersistentVolumeClaim in the cluster (if null, POD_NAME is required) | |
POD_NAME | Targeted pod in the cluster (if null, PVC_NAME is required) | |
NAMESPACE | Targeted namespace in the cluster (required) | |
FILL_PERCENTAGE | Targeted percentage to be filled up in the PVC | 50 |
DURATION | Duration in seconds with the PVC filled up | 60 |
Note
Set the NAMESPACE environment variable to openshift-.* to pick and disrupt pods randomly in OpenShift system namespaces. DAEMON_MODE can also be enabled to disrupt the pods every x seconds in the background to check reliability.
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts. For example:
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pvc-scenarios
14 - Service Disruption Scenarios
Using this type of scenario configuration one is able to delete crucial objects in a specific namespace, or a namespace matching a certain regex string.
14.1 - Service Disruption Scenarios using Krkn
Configuration Options:
namespace: Specific namespace or regex style namespace of what you want to delete. Gets all namespaces if not specified; set to "" if you want to use the label_selector field.
Set to ‘^.*$’ and label_selector to "" to randomly select any namespace in your cluster.
label_selector: Label on the namespace you want to delete. Set to "" if you are using the namespace variable.
delete_count: Number of namespaces to kill in each run. Based on matching namespace and label specified, default is 1.
runs: Number of runs/iterations to kill namespaces, default is 1.
sleep: Number of seconds to wait between each iteration/count of killing namespaces. Defaults to 10 seconds if not set
Refer to namespace_scenarios_example config file.
scenarios:
- namespace: "^.*$"
runs: 1
- namespace: "^.*ingress.*$"
runs: 1
sleep: 15
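Namespaces can also be targeted by label instead of name, combining the label_selector, delete_count and sleep options described above. A hypothetical example (the label value is illustrative):
scenarios:
- namespace: ""
  label_selector: "chaos-target=true"   # hypothetical namespace label
  delete_count: 2
  runs: 1
  sleep: 30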
Steps
Depending on the configuration, this scenario selects one or more namespaces, kills all of the object types listed below in each selected namespace, and waits for them to be Running again in the post action:
- Services
- Daemonsets
- Statefulsets
- Replicasets
- Deployments
Post Action
We do a post-chaos check to wait for and verify that the specific objects in each namespace are Ready.
Here there are two options:
- Pass a custom script in the main config scenario list that will run before the chaos and verify that its output matches after the chaos scenario.
See scenarios/post_action_namespace.py for an example
- namespace_scenarios:
- - scenarios/regex_namespace.yaml
- scenarios/post_action_namespace.py
- Allow kraken to wait and check that all killed objects in the namespaces become ‘Running’ again. Kraken keeps a list of the specific objects in namespaces that were killed to verify that everything affected recovers properly.
wait_time: <seconds to wait for namespace to recover>
14.2 - Service Disruption Scenario using Krkn-Hub
This scenario deletes main objects within a namespace in your Kubernetes/OpenShift cluster. More information can be found here.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer docs. Make sure to start it before injecting the chaos and set CERBERUS_ENABLED
environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:service-disruption-scenarios
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Note
--env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines.
Without the --env-host option, you’ll have to set each environment variable on the podman command line like -e <VARIABLE>=<value>
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:service-disruption-scenarios
OR
$ docker run -e <VARIABLE>=<value> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:service-disruption-scenarios
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
Example if –env-host is used:
export <parameter_name>=<value>
OR on the command line like example:
-e <VARIABLE>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
Parameter | Description | Default |
---|
LABEL_SELECTOR | Label of the namespace to target. Set this parameter only if NAMESPACE is not set | "" |
NAMESPACE | Name of the namespace you want to target. Set this parameter only if LABEL_SELECTOR is not set | “openshift-etcd” |
SLEEP | Number of seconds to wait before polling to see if namespace exists again | 15 |
DELETE_COUNT | Number of namespaces to kill in each run, based on matching namespace and label specified | 1 |
RUNS | Number of runs to execute the action | 1 |
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts. For example:
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:service-disruption-scenarios
Demo
You can find a link to a demo of the scenario here
15 - Service Hijacking Scenario
Service Hijacking Scenarios aim to simulate fake HTTP responses from a workload targeted by a Service already deployed in the cluster. This scenario is executed by deploying a custom-made web service and modifying the target Service selector to direct traffic to this web service for a specified duration.
15.1 - Service Hijacking Scenarios using Krkn
The web service’s source code is available here.
It employs a time-based test plan from the scenario configuration file, which specifies the behavior of resources during the chaos scenario as follows:
service_target_port: http-web-svc # The port of the service to be hijacked (can be named or numeric, based on the workload and service configuration).
service_name: nginx-service # The name of the service that will be hijacked.
service_namespace: default # The namespace where the target service is located.
image: quay.io/krkn-chaos/krkn-service-hijacking:v0.1.3 # Image of the krkn web service to be deployed to receive traffic.
chaos_duration: 30 # Total duration of the chaos scenario in seconds.
plan:
- resource: "/list/index.php" # Specifies the resource or path to respond to in the scenario. For paths, both the path and query parameters are captured but ignored. For resources, only query parameters are captured.
steps: # A time-based plan consisting of steps can be defined for each resource.
GET: # One or more HTTP methods can be specified for each step. Note: Non-standard methods are supported for fully custom web services (e.g., using NONEXISTENT instead of POST).
- duration: 15 # Duration in seconds for this step before moving to the next one, if defined. Otherwise, this step will continue until the chaos scenario ends.
status: 500 # HTTP status code to be returned in this step.
mime_type: "application/json" # MIME type of the response for this step.
payload: | # The response payload for this step.
{
"status":"internal server error"
}
- duration: 15
status: 201
mime_type: "application/json"
payload: |
{
"status":"resource created"
}
POST:
- duration: 15
status: 401
mime_type: "application/json"
payload: |
{
"status": "unauthorized"
}
- duration: 15
status: 404
mime_type: "text/plain"
payload: "not found"
The scenario will focus on the service_name within the service_namespace, substituting the selector with a randomly generated one, which is added as a label in the mock service manifest. This allows multiple scenarios to be executed in the same namespace, each targeting different services without causing conflicts.
The newly deployed mock web service will expose a service_target_port, which can be either a named or numeric port based on the service configuration. This ensures that the Service correctly routes HTTP traffic to the mock web service during the chaos run.
Each step will last for duration seconds from the deployment of the mock web service in the cluster.
For each HTTP resource, defined as a top-level YAML property of the plan (it could be a specific resource, e.g., /list/index.php, or a path-based resource typical in MVC frameworks), one or more HTTP request methods can be specified. Both standard and custom request methods are supported.
During this time frame, the web service will respond with:
- status: The HTTP status code (can be standard or custom).
- mime_type: The MIME type (can be standard or custom).
- payload: The response body to be returned to the client.
At the end of the step duration, the web service will proceed to the next step (if available) until the global chaos_duration concludes. At this point, the original service will be restored, and the custom web service and its resources will be undeployed.
NOTE: Some clients (e.g., cURL, jQuery) may optimize queries using lightweight methods (like HEAD or OPTIONS) to probe API behavior. If these methods are not defined in the test plan, the web service may respond with a 405 or 404 status code. If you encounter unexpected behavior, consider this use case.
15.2 - Service Hijacking Scenario using Krkn-Hub
This scenario reroutes traffic intended for a target service to a custom web service that is automatically deployed by Krkn.
This web service responds with user-defined HTTP statuses, MIME types, and bodies.
For more details, please refer to the following documentation.
Run
Unlike other krkn-hub scenarios, this one requires a specific configuration due to its unique structure.
You must set up the scenario in a local file following the scenario syntax,
and then pass this file’s base64-encoded content to the container via the SCENARIO_BASE64 variable.
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer docs.
Make sure to start it before injecting the chaos and set CERBERUS_ENABLED
environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> \
-e SCENARIO_BASE64="$(base64 -w0 <scenario_file>)" \
-v <path_to_kubeconfig>:/home/krkn/.kube/config:Z quay.io/krkn-chaos/krkn-hub:service-hijacking
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Note
--env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines.
Without the --env-host option, you’ll have to set each environment variable on the podman command line like -e <VARIABLE>=<value>
$ export SCENARIO_BASE64="$(base64 -w0 <scenario_file>)"
$ docker run $(./get_docker_params.sh) --name=<container_name> \
--net=host \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d quay.io/krkn-chaos/krkn-hub:service-hijacking
OR
$ docker run --name=<container_name> -e SCENARIO_BASE64="$(base64 -w0 <scenario_file>)" \
--net=host \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d quay.io/krkn-chaos/krkn-hub:service-hijacking
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
Parameter | Description |
---|
SCENARIO_BASE64 | Base64 encoded service-hijacking scenario file. Note that the -w0 option in the command substitution SCENARIO_BASE64="$(base64 -w0 <scenario_file>)" is mandatory in order to remove line breaks from the base64 command output |
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts. For example:
$ podman run -e SCENARIO_BASE64="$(base64 -w0 <scenario_file>)" \
--name=<container_name> \
--net=host \
--env-host=true \
-v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml \
-v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d quay.io/krkn-chaos/krkn-hub:service-hijacking
16 - Syn Flood Scenarios
Syn Flood Scenarios
This scenario generates a substantial amount of TCP traffic directed at one or more Kubernetes services within
the cluster to test the server’s resiliency under extreme traffic conditions.
It can also target hosts outside the cluster by specifying a reachable IP address or hostname.
This scenario leverages the distributed nature of Kubernetes clusters to instantiate multiple instances
of the same pod against a single host, significantly increasing the effectiveness of the attack.
The configuration also allows for the specification of multiple node selectors, enabling Kubernetes to schedule
the attacker pods on a user-defined subset of nodes to make the test more realistic.
The attacker container source code is available here.
16.1 - Syn Flood Scenario using Krkn
Sample scenario config
packet-size: 120 # hping3 packet size
window-size: 64 # hping3 TCP window size
duration: 10 # chaos scenario duration
namespace: default # namespace where the target service(s) are deployed
target-service: target-svc # target service name (if set target-service-label must be empty)
target-port: 80 # target service TCP port
target-service-label : "" # target service label, can be used to target multiple services at the same time
# if they have the same label set (if set target-service must be empty)
number-of-pods: 2 # number of attacker pods instantiated per target
image: quay.io/krkn-chaos/krkn-syn-flood # syn flood attacker container image
attacker-nodes: # Sets the node affinity used to schedule the attacker pods. For each node label selector,
                # multiple values can be specified, so the kube scheduler can place the attacker pods
                # as well as possible based on the provided labels. Multiple labels can be specified.
kubernetes.io/hostname:
- host_1
- host_2
kubernetes.io/os:
- linux
16.2 - Syn Flood Scenario using Krkn-Hub
Syn Flood scenario
This scenario simulates a user-defined surge of TCP SYN requests directed at one or more services deployed within the cluster or an external target reachable by the cluster.
For more details, please refer to the following documentation.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer docs. Make sure to start it before injecting the chaos and set CERBERUS_ENABLED
environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-e TARGET_PORT=<target_port> \
-e NAMESPACE=<target_namespace> \
-e TOTAL_CHAOS_DURATION=<duration> \
-e TARGET_SERVICE=<target_service> \
-e NUMBER_OF_PODS=10 \
-e NODE_SELECTORS=<key>=<value>;<key>=<othervalue> \
-d quay.io/krkn-chaos/krkn-hub:syn-flood
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Note
--env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines.
Without the --env-host option, you’ll have to set each environment variable on the podman command line like -e <VARIABLE>=<value>
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-e TARGET_PORT=<target_port> \
-e NAMESPACE=<target_namespace> \
-e TOTAL_CHAOS_DURATION=<duration> \
-e TARGET_SERVICE=<target_service> \
-e NUMBER_OF_PODS=10 \
-e NODE_SELECTORS=<key>=<value>;<key>=<othervalue> \
-d quay.io/krkn-chaos/krkn-hub:syn-flood
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
Example if –env-host is used:
export <parameter_name>=<value>
OR on the command line like example:
-e <VARIABLE>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
Parameter | Description | Default |
---|
PACKET_SIZE | The size in bytes of the SYN packet | 120 |
WINDOW_SIZE | The TCP window size between packets in bytes | 64 |
TOTAL_CHAOS_DURATION | The number of seconds the chaos will last | 120 |
NAMESPACE | The namespace containing the target service and where the attacker pods will be deployed | default |
TARGET_SERVICE | The service name (or the hostname/IP address in case an external target will be hit) that will be affected by the attack. Must be empty if TARGET_SERVICE_LABEL will be set | |
TARGET_PORT | The TCP port that will be targeted by the attack | |
TARGET_SERVICE_LABEL | The label that will be used to select one or more services. Must be left empty if TARGET_SERVICE variable is set | |
NUMBER_OF_PODS | The number of attacker pods that will be deployed | 2 |
IMAGE | The container image that will be used to perform the scenario | quay.io/krkn-chaos/krkn-syn-flood:latest |
NODE_SELECTORS | The node selectors are used to guide the cluster on where to deploy attacker pods. You can specify one or more labels in the format key=value;key=value2 (even using the same key) to choose one or more node categories. If left empty, the pods will be scheduled on any available node, depending on the cluster’s capacity. | |
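For example, to attack a single in-cluster service on port 80 with four attacker pods pinned to Linux nodes, the variables could be set as follows (hypothetical values; the node selector format is key=value pairs separated by semicolons):
$ export NAMESPACE=default
$ export TARGET_SERVICE=target-svc
$ export TARGET_PORT=80
$ export NUMBER_OF_PODS=4
$ export TOTAL_CHAOS_DURATION=120
$ export NODE_SELECTORS="kubernetes.io/os=linux;kubernetes.io/hostname=host_1"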
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts. For example:
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:syn-flood
17 - Time Scenarios
Using this type of scenario configuration, one is able to change the time and/or date of the system for pods or nodes.
17.1 - Time Scenarios using Krkn
Configuration Options:
action: skew_time or skew_date.
object_type: pod or node.
namespace: namespace of the pods you want to skew. Needs to be set if setting a specific pod name.
label_selector: Label on the nodes or pods you want to skew.
container_name: Container name in pod you want to reset time on. If left blank it will randomly select one.
object_name: List of the names of pods or nodes you want to skew.
Refer to time_scenarios_example config file.
time_scenarios:
- action: skew_time
object_type: pod
object_name:
- apiserver-868595fcbb-6qnsc
- apiserver-868595fcbb-mb9j5
namespace: openshift-apiserver
container_name: openshift-apiserver
- action: skew_date
object_type: node
label_selector: node-role.kubernetes.io/worker
17.2 - Time Skew Scenarios using Krkn-Hub
This scenario skews the date and time of the nodes and pods matching the label on a Kubernetes/OpenShift cluster. More information can be found here.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer docs. Make sure to start it before injecting the chaos and set CERBERUS_ENABLED
environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:time-scenarios
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Note
--env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines.
Without the --env-host option, you’ll have to set each environment variable on the podman command line like -e <VARIABLE>=<value>
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:time-scenarios
OR
$ docker run -e <VARIABLE>=<value> --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:time-scenarios
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
Parameter | Description | Default |
---|
OBJECT_TYPE | Object to target. Supported options: pod, node | pod |
LABEL_SELECTOR | Label of the container(s) or nodes to target | k8s-app=etcd |
ACTION | Action to run. Supported actions: skew_time, skew_date | skew_date |
OBJECT_NAME | List of the names of pods or nodes you want to skew ( optional parameter ) | [] |
CONTAINER_NAME | Container in the specified pod to target in case the pod has multiple containers running. Random container is picked if empty | "" |
NAMESPACE | Namespace of the pods you want to skew; needs to be set only if setting a specific pod name | "" |
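For example, to skew the date on all worker nodes rather than the default etcd pods, the variables could be set as follows (hypothetical values):
$ export OBJECT_TYPE=node
$ export ACTION=skew_date
$ export LABEL_SELECTOR=node-role.kubernetes.io/worker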
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts. For example:
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:time-scenarios
Demo
You can find a link to a demo of the scenario here
18 - Zone Outage Scenarios
Scenario to create outage in a targeted zone in the public cloud to understand the impact on both Kubernetes/OpenShift control plane as well as applications running on the worker nodes in that zone.
There are 2 ways these scenarios run:
For AWS, it tweaks the network acl of the zone to simulate the failure and that in turn will stop both ingress and egress traffic from all the nodes in a particular zone for the specified duration and reverts it back to the previous state.
For GCP, you specify the zone you want to target; Krkn finds the nodes (master, worker, and infra) in that zone, stops them for the set duration, and then starts them back up. The reason we do it this way is that any edits to the nodes require you to first stop the node before performing updates, so editing the network the way the AWS scenario does would still require stopping the nodes first.
18.1 - Zone Outage Scenarios using Krkn
Zone outage can be injected by placing the zone_outage config file under zone_outages option in the kraken config. Refer to zone_outage_scenario config file for the parameters that need to be defined.
Refer to cloud setup to configure your cli properly for the cloud provider of the cluster you want to run zone outages on
Current accepted cloud types:
Sample scenario config
zone_outage: # Scenario to create an outage of a zone by tweaking network ACL.
cloud_type: aws # Cloud type on which Kubernetes/OpenShift runs. aws is the only platform supported currently for this scenario.
duration: 600 # Duration in seconds after which the zone will be back online.
vpc_id: # Cluster virtual private network to target.
subnet_id: [subnet1, subnet2] # List of subnet-id's to deny both ingress and egress traffic.
Note
vpc_id and subnet_id can be obtained from the cloud web console by selecting one of the instances in the targeted zone ( us-west-2a for example ).
zone_outage: # Scenario to create an outage of a zone by stopping its nodes.
cloud_type: gcp # Cloud type on which Kubernetes/OpenShift runs.
duration: 600 # Duration in seconds after which the zone will be back online.
zone: <zone> # Zone of nodes to stop and then restart after the duration ends.
Note
Multiple zones will experience downtime in case of targeting multiple subnets, which might have an impact on the cluster health, especially if the zones have control plane components deployed.
AWS - Debugging steps in case of failures
In case of failures during the steps which revert the network acl to allow traffic and bring back the cluster nodes in the zone, the nodes in the particular zone will be in NotReady condition. Here is how to fix it:
- OpenShift by default deploys the nodes in different zones for fault tolerance, for example us-west-2a, us-west-2b, us-west-2c. The cluster is associated with a virtual private network and each zone has its own subnet with a network ACL, which defines the ingress and egress traffic rules at the zone level, unlike security groups which apply at the instance level.
- From the cloud web console, select one of the instances in the zone which is down and go to the subnet_id specified in the config.
- Look at the network ACL associated with the subnet; you will see both ingress and egress traffic being denied, which is expected since Kraken deliberately injects it.
- Kraken just switches the network ACL while keeping the original or default network ACL around; switching back to the default network ACL from the drop-down menu will bring the nodes in the targeted zone back into Ready state. See the CLI sketch below for an equivalent fix from the command line.
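The same recovery can be done from the command line. A sketch, assuming the aws CLI is configured and <subnet-id>/<vpc-id> come from the scenario config; the association and ACL IDs are placeholders taken from the output of the first two commands:
$ aws ec2 describe-network-acls --filters Name=association.subnet-id,Values=<subnet-id> --query "NetworkAcls[].Associations"    # Find the current ACL association for the subnet
$ aws ec2 describe-network-acls --filters Name=vpc-id,Values=<vpc-id> Name=default,Values=true --query "NetworkAcls[].NetworkAclId"    # Find the VPC's default network ACL
$ aws ec2 replace-network-acl-association --association-id <association-id> --network-acl-id <default-acl-id>    # Point the subnet back at the default ACL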
GCP - Debugging steps in case of failures
If the steps that bring back the cluster nodes in the zone fail, the nodes in that zone will be left in NotReady condition. Here is how to fix it:
- From the GCP web console, select one of the instances in the zone which is down.
- Kraken just stops the nodes, so you only have to select the stopped nodes and start them. This will bring the nodes in the targeted zone back into Ready state. See the gcloud sketch below for the command-line equivalent.
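An equivalent fix from the command line, assuming the gcloud CLI is authenticated against the cluster's project and <zone> is the zone from the scenario config:
$ gcloud compute instances list --filter="zone:( <zone> ) AND status=TERMINATED"    # Lists the instances Kraken left stopped
$ gcloud compute instances start <instance-name> --zone=<zone>    # Starts a stopped instance; repeat for each node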
18.2 - Zone Outage Scenarios using Krkn-Hub
This scenario disrupts a targeted zone in the public cloud by blocking egress and ingress traffic to understand the impact on both the Kubernetes/OpenShift control plane as well as applications running on the worker nodes in that zone. More information is documented here
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos and set the CERBERUS_ENABLED
environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:zone-outages
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Note
--env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines.
Without the --env-host option you'll have to set each environment variable on the podman command line like -e <VARIABLE>=<value>
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:zone-outages
OR
$ docker run -e <VARIABLE>=<value> --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:zone-outages
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
Example if --env-host is used:
export <parameter_name>=<value>
OR on the command line like example:
-e <VARIABLE>=<value>
See the list of variables that apply to all scenarios here; they can be used/set in addition to these scenario-specific variables
Parameter | Description | Default |
---|
CLOUD_TYPE | Cloud platform on top of which the cluster is running; supported platforms for this scenario are aws and gcp | aws |
DURATION | Duration in seconds after which the zone will be back online | 600 |
VPC_ID | Cluster virtual private network to target ( REQUIRED for AWS ) | "" |
SUBNET_ID | Subnet-id to deny both ingress and egress traffic ( REQUIRED for AWS ). Format: [subnet1, subnet2] | "" |
ZONE | Zone you want to target ( REQUIRED for GCP ) | "" |
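For example, to take two AWS subnets offline for ten minutes (the IDs below are placeholders), the variables could be set like this before starting the container:
export CLOUD_TYPE=aws
export DURATION=600
export VPC_ID=<vpc-id>
export SUBNET_ID="[<subnet-id-1>, <subnet-id-2>]"    # Quoted so the bracketed list is passed as a single value
For GCP, set CLOUD_TYPE=gcp and ZONE=<zone> instead of VPC_ID and SUBNET_ID.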
The following environment variables need to be set for the scenarios that require interacting with the cloud platform API to perform the actions:
Amazon Web Services
$ export AWS_ACCESS_KEY_ID=<>
$ export AWS_SECRET_ACCESS_KEY=<>
$ export AWS_DEFAULT_REGION=<>
Google Cloud Platform
export GOOGLE_APPLICATION_CREDENTIALS="<serviceaccount.json>"
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS
or ENABLE_ALERTS
is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml
and /home/krkn/kraken/config/alerts
.
For example:
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:zone-outages
Demo
You can find a link to a demo of the scenario here