Node Scenarios
This scenario disrupts the node(s) matching the label or node name(s) on a Kubernetes/OpenShift cluster. These scenarios are performed in two different ways: either through the cluster's cloud CLI, or through common/generic commands that can be run on any cluster.
Actions
The following node chaos scenarios are supported:
- node_start_scenario: Scenario to start the node instance. Requires access to the cloud provider
- node_stop_scenario: Scenario to stop the node instance. Requires access to the cloud provider
- node_stop_start_scenario: Scenario to stop and then start the node instance. Not supported on VMware. Requires access to the cloud provider
- node_termination_scenario: Scenario to terminate the node instance. Requires access to the cloud provider
- node_reboot_scenario: Scenario to reboot the node instance. Requires access to the cloud provider
- stop_kubelet_scenario: Scenario to stop the kubelet on the node instance. Requires access to the cloud provider
- stop_start_kubelet_scenario: Scenario to stop and start the kubelet on the node instance. Requires access to the cloud provider
- restart_kubelet_scenario: Scenario to restart the kubelet on the node instance. Can be used with the generic cloud type or when you don't have access to the cloud provider
- node_crash_scenario: Scenario to crash the node instance. Can be used with the generic cloud type or when you don't have access to the cloud provider
- stop_start_helper_node_scenario: Scenario to stop and start the helper node and check service status. Requires access to the cloud provider
- node_block_scenario: Scenario to block inbound and outbound traffic from other nodes to a specific node for a set duration (Azure only). Requires access to the cloud provider
Clouds
Supported clouds:
Note
If the node does not recover from the node_crash_scenario injection, reboot the node to get it back to a Ready state.
Note
node_start_scenario, node_stop_scenario, node_stop_start_scenario, node_termination_scenario, node_reboot_scenario and stop_start_kubelet_scenario are supported on:
- AWS
- Azure
- OpenStack
- BareMetal
- GCP
- VMware
- Alibaba
- IbmCloud
Recovery Times
In each node scenario, the end-of-run telemetry details show the time it took for each node to stop and recover, depending on the scenario.
The following details are printed in the telemetry:
- node_name: Node name
- node_id: Node id
- not_ready_time: Amount of time the node took to reach a NotReady state after the cloud provider stopped the node
- ready_time: Amount of time the node took to reach a Ready state after the cloud provider reported the node as started
- stopped_time: Amount of time the cloud provider took to stop the node
- running_time: Amount of time the cloud provider took to get the node running
- terminating_time: Amount of time the cloud provider took for the node to become terminated
Example:
"affected_nodes": [
{
"node_name": "cluster-name-**.438115.internal",
"node_id": "cluster-name-**",
"not_ready_time": 0.18194103240966797,
"ready_time": 0.0,
"stopped_time": 140.74104499816895,
"running_time": 0.0,
"terminating_time": 0.0
},
{
"node_name": "cluster-name-**-master-0.438115.internal",
"node_id": "cluster-name-**-master-0",
"not_ready_time": 0.1611928939819336,
"ready_time": 0.0,
"stopped_time": 146.72056317329407,
"running_time": 0.0,
"terminating_time": 0.0
},
{
"node_name": "cluster-name-**.438115.internal",
"node_id": "cluster-name-**",
"not_ready_time": 0.0,
"ready_time": 43.521320104599,
"stopped_time": 0.0,
"running_time": 12.305592775344849,
"terminating_time": 0.0
},
{
"node_name": "cluster-name-**-master-0.438115.internal",
"node_id": "cluster-name-**-master-0",
"not_ready_time": 0.0,
"ready_time": 48.33336925506592,
"stopped_time": 0.0,
"running_time": 12.052034854888916,
"terminating_time": 0.0
}
]
1 - Node Scenarios using Krkn
For any of the node scenarios, you'll specify node_scenarios as the scenario type.
See example config here:
chaos_scenarios:
- node_scenarios: # List of chaos node scenarios to load
- scenarios/***.yml
- scenarios/***.yml # Can specify multiple files here
Sample scenario file; you can specify multiple list items under node_scenarios, which will be run serially:
node_scenarios:
- actions: # node chaos scenarios to be injected
- <action> # Can specify multiple actions here
node_name: <node_name> # node on which scenario has to be injected; can set multiple names separated by comma
label_selector: <label> # when node_name is not specified, a node with matching label_selector is selected for node chaos scenario injection; can specify multiple by a comma separated list
instance_count: <instance_number> # Number of nodes to perform action/select that match the label selector
runs: <run_int> # number of times to inject each scenario under actions (will perform on same node each time)
timeout: <timeout> # duration to wait for completion of node scenario injection
duration: <duration> # duration to stop the node before running the start action
cloud_type: <cloud> # cloud type on which Kubernetes/OpenShift runs
parallel: <true_or_false> # Run the action on the label or node name in parallel or sequentially; defaults to sequential
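For example, a minimal filled-in sketch (the values shown are illustrative, not defaults) that stops and restarts two worker nodes on AWS:
node_scenarios:
  - actions:
      - node_stop_start_scenario
    label_selector: node-role.kubernetes.io/worker  # select nodes by label
    instance_count: 2                               # act on two matching nodes
    runs: 1                                         # inject the scenario once per node
    timeout: 360                                    # wait up to 360s for the injection to complete
    duration: 120                                   # keep each node stopped for 120s before starting it
    cloud_type: aws
    parallel: true                                  # stop/start both nodes at the same time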
AWS
Cloud setup instructions can be found here.
Sample scenario config can be found here.
The cloud type in the scenario yaml file needs to be aws
Baremetal
Sample scenario config can be found here.
The cloud type in the scenario yaml file needs to be bm
Note
Baremetal requires setting the IPMI user and password to power on, off, and reboot nodes, using the config options bm_user and bm_password. They can either be set at the root of the entry in the scenarios config, or set per machine.
If no per-machine addresses are specified, kraken attempts to use the BMC value in the BareMetalHost object. To list them, you can run 'oc get bmh -o wide --all-namespaces'. If the BMC values are blank, you must specify them per machine using the config option 'bmc_addr' as specified below.
For per-machine settings, add a 'bmc_info' section to the entry in the scenarios config. Inside it, add a configuration section keyed by the node name, and place the per-machine settings there. Valid settings are 'bmc_user', 'bmc_password', and 'bmc_addr'.
See the example node scenario or the example below.
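A sketch of such an entry, following the structure described above (node names, credentials, and BMC addresses are illustrative placeholders):
node_scenarios:
  - actions:
      - node_reboot_scenario
    node_name: node1,node2                  # nodes to target
    cloud_type: bm
    bm_user: admin                          # root-level IPMI credentials, used when no per-machine values are set
    bm_password: password
    bmc_info:                               # per-machine settings
      node1:
        bmc_addr: mgmt-node1.example.com
      node2:
        bmc_addr: mgmt-node2.example.com
        bmc_user: node2-admin               # per-machine credentials override the root-level ones
        bmc_password: node2-password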
Note
Baremetal requires oc (the OpenShift client) to be installed on the machine running Kraken.
Note
Baremetal machines are fragile. Some node actions can occasionally corrupt the filesystem if the node does not shut down properly, and sometimes the kubelet does not start properly.
Docker
The Docker provider can be used to run node scenarios against kind clusters.
kind is a tool for running local Kubernetes clusters using Docker container “nodes”.
kind was primarily designed for testing Kubernetes itself, but may be used for local development or CI.
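A minimal sketch for a kind cluster, assuming the Docker provider uses cloud_type docker and the default kind worker container name (verify both against your Krkn version):
node_scenarios:
  - actions:
      - node_stop_start_scenario
    node_name: kind-worker    # a kind "node" is a Docker container
    cloud_type: docker        # assumed provider value for kind clusters
    timeout: 120
    duration: 20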
GCP
Cloud setup instructions can be found here. Sample scenario config can be found here.
The cloud type in the scenario yaml file needs to be gcp
Openstack
How to set up the OpenStack CLI to run node scenarios is defined here.
The cloud type in the scenario yaml file needs to be openstack
The only supported node-level chaos scenarios on an OpenStack cloud are node_stop_start_scenario, stop_start_kubelet_scenario and node_reboot_scenario.
Note
For stop_start_helper_node_scenario, visit here to learn more about the helper node and its usage.
To execute the scenario, ensure the value for ssh_private_key in the node scenarios config file is set to the correct private key file path for the ssh connection to the helper node. Ensure passwordless ssh is configured between the host running Kraken and the helper node to avoid connection errors.
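A minimal sketch of such a config entry, using only the fields documented above (the key path is an illustrative placeholder):
node_scenarios:
  - actions:
      - stop_start_helper_node_scenario
    cloud_type: openstack
    ssh_private_key: /home/user/.ssh/id_rsa   # private key for the ssh connection to the helper node
    timeout: 300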
Azure
Cloud setup instructions can be found here. Sample scenario config can be found here.
The cloud type in the scenario yaml file needs to be azure
Alibaba
How to set up the Alibaba CLI to run node scenarios is defined here.
Note
There is no "terminating" concept in Alibaba, so any scenario with a terminating action will "release" the node. Releasing a node is a two-step process: stopping the node and then releasing it.
The cloud type in the scenario yaml file needs to be alibaba
VMware
How to set up VMware vSphere to run node scenarios is defined here.
The cloud type in the scenario yaml file needs to be vmware
IBMCloud
How to set up IBM Cloud to run node scenarios is defined here.
A sample IBM Cloud node scenario config file can be found here.
The cloud type in the scenario yaml file needs to be ibm
General
Note
The node_crash_scenario and stop_kubelet_scenario scenarios are supported regardless of the cloud platform.
Use 'generic', or do not add the 'cloud_type' key to your scenario, if your cluster is not set up using one of the currently supported cloud types.
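For example, a minimal sketch for a cluster without a supported cloud provider (label and timeout are illustrative):
node_scenarios:
  - actions:
      - node_crash_scenario
    label_selector: node-role.kubernetes.io/worker
    instance_count: 1
    cloud_type: generic   # or omit the cloud_type key entirely
    timeout: 180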
2 - Node Scenarios using Krknctl
krknctl run node-scenarios (optional: --<parameter>:<value> )
You can also set any global variable listed here.
Scenario-specific parameters:
Parameter | Description | Type | Default | Possible Values
---|---|---|---|---
--action | Action performed on the node; visit https://github.com/krkn-chaos/krkn/blob/main/docs/node_scenarios.md for more info | enum | | node_start_scenario, node_stop_scenario, node_stop_start_scenario, node_termination_scenario, node_reboot_scenario, stop_kubelet_scenario, stop_start_kubelet_scenario, restart_kubelet_scenario, node_crash_scenario, stop_start_helper_node_scenario
--label-selector | Node label to target | string | node-role.kubernetes.io/worker |
--node-name | Node name to inject faults on in case of targeting a specific node; can set multiple node names separated by a comma | string | |
--instance-count | Targeted instance count matching the label selector | number | 1 |
--runs | Iterations to perform action on a single node | number | 1 |
--cloud-type | Cloud platform on top of which the cluster is running; supported platforms: aws, azure, gcp, vmware, ibmcloud, bm | enum | aws |
--timeout | Duration to wait for completion of node scenario injection | number | 180 |
--duration | Duration to stop the node before running the start action | number | 120 |
--vsphere-ip | vSphere IP address | string | |
--vsphere-username | vSphere username | string (secret) | |
--vsphere-password | vSphere password | string (secret) | |
--aws-access-key-id | AWS Access Key ID | string (secret) | |
--aws-secret-access-key | AWS Secret Access Key | string (secret) | |
--aws-default-region | AWS default region | string | |
--bmc-user | Only needed for Baremetal (bm) - IPMI/BMC username | string (secret) | |
--bmc-password | Only needed for Baremetal (bm) - IPMI/BMC password | string (secret) | |
--bmc-address | Only needed for Baremetal (bm) - IPMI/BMC address | string | |
--ibmc-address | IBM Cloud URL | string | |
--ibmc-api-key | IBM Cloud API Key | string (secret) | |
--azure-tenant | Azure Tenant | string | |
--azure-client-secret | Azure Client Secret | string (secret) | |
--azure-client-id | Azure Client ID | string (secret) | |
--azure-subscription-id | Azure Subscription ID | string (secret) | |
--gcp-application-credentials | GCP application credentials file location | file | |
NOTE: The secret string types will be masked when the scenario is run
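For example, a sketch of an invocation that reboots one worker node on AWS (flag values are illustrative):
krknctl run node-scenarios \
  --action node_reboot_scenario \
  --cloud-type aws \
  --label-selector node-role.kubernetes.io/worker \
  --instance-count 1 \
  --aws-access-key-id <key_id> \
  --aws-secret-access-key <secret_key> \
  --aws-default-region us-east-1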
To see all available scenario options:
krknctl run node-scenarios --help
3 - Node Scenarios using Krkn-Hub
This scenario disrupts the node(s) matching the label on a Kubernetes/OpenShift cluster. Actions/disruptions supported are listed here
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos, and set the CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:node-scenarios
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs the exit code, which can be considered as pass/fail for the scenario
Note
--env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines.
Without the --env-host option you'll have to set each environment variable on the podman command line like -e <VARIABLE>=<value>
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:node-scenarios
OR
$ docker run -e <VARIABLE>=<value> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:node-scenarios
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs the exit code, which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
Example if --env-host is used:
export <parameter_name>=<value>
Or on the command line, for example:
-e <VARIABLE>=<value>
See the list of variables that apply to all scenarios here; they can be used/set in addition to these scenario-specific variables.
Parameter | Description | Default
---|---|---
ACTION | Action can be any of the node chaos scenarios listed above | node_stop_start_scenario
LABEL_SELECTOR | Node label to target | node-role.kubernetes.io/worker
NODE_NAME | Node name to inject faults on in case of targeting a specific node; can set multiple node names separated by a comma | ""
INSTANCE_COUNT | Targeted instance count matching the label selector | 1
RUNS | Iterations to perform action on a single node | 1
CLOUD_TYPE | Cloud platform on top of which the cluster is running; supported platforms: aws, vmware, ibmcloud, bm | aws
TIMEOUT | Duration to wait for completion of node scenario injection | 180
DURATION | Duration to stop the node before running the start action - not supported for the vmware and ibm cloud types | 120
VERIFY_SESSION | Only needed for vmware - Set to True if you want to verify the vSphere client session using certificates | False
SKIP_OPENSHIFT_CHECKS | Only needed for vmware - Set to True if you don't want to wait for the status of the nodes to change on OpenShift before passing the scenario | False
BMC_USER | Only needed for Baremetal (bm) - IPMI/BMC username | ""
BMC_PASSWORD | Only needed for Baremetal (bm) - IPMI/BMC password | ""
BMC_ADDR | Only needed for Baremetal (bm) - IPMI/BMC address | ""
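For example, with --env-host, a node stop/start run on AWS might look like the following (container name and values are illustrative; the AWS credential variables listed below must also be exported):
$ export ACTION=node_stop_start_scenario
$ export CLOUD_TYPE=aws
$ export LABEL_SELECTOR=node-role.kubernetes.io/worker
$ export INSTANCE_COUNT=1
$ podman run --name=node-chaos --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:node-scenarios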
Note
If using a custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the profiles from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
For example:
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:container-scenarios
The following environment variables need to be set for the scenarios that require interacting with the cloud platform API to perform the actions:
Amazon Web Services
$ export AWS_ACCESS_KEY_ID=<>
$ export AWS_SECRET_ACCESS_KEY=<>
$ export AWS_DEFAULT_REGION=<>
VMware vSphere
$ export VSPHERE_IP=<vSphere_client_IP_address>
$ export VSPHERE_USERNAME=<vSphere_client_username>
$ export VSPHERE_PASSWORD=<vSphere_client_password>
IBM Cloud
$ export IBMC_URL=https://<region>.iaas.cloud.ibm.com/v1
$ export IBMC_APIKEY=<ibmcloud_api_key>
Baremetal
$ export BMC_USER=<bmc/IPMI user>
$ export BMC_PASSWORD=<bmc/IPMI password>
$ export BMC_ADDR=<bmc address>
Google Cloud Platform
$ export GOOGLE_APPLICATION_CREDENTIALS=<GCP Json>
Azure
$ export AZURE_TENANT_ID=<>
$ export AZURE_CLIENT_SECRET=<>
$ export AZURE_CLIENT_ID=<>
OpenStack
$ export OS_USERNAME=username
$ export OS_PASSWORD=password
$ export OS_TENANT_NAME=projectName
$ export OS_AUTH_URL=https://identityHost:portNumber/v2.0
$ export OS_TENANT_ID=tenantIDString
$ export OS_REGION_NAME=regionName
$ export OS_CACERT=/path/to/cacertFile
Demo
You can find a link to a demo of the scenario here