Chaos and resiliency testing tool for Kubernetes with a focus on improving performance under failure conditions. A CNCF sandbox project. GitHub
krkn-chaos
- 1: krkn
- 2: Installation
- 3: Scenarios
- 3.1: Application Outage Scenarios
- 3.2: Arcaflow Scenarios
- 3.3: Container Scenarios
- 3.4: CPU Hog Scenario
- 3.5: IO Hog Scenario
- 3.6: ManagedCluster Scenarios
- 3.7: Memory Hog Scenario
- 3.8: Network Chaos Scenario
- 3.9: Node Scenarios
- 3.9.1: Node Scenarios using Krkn
- 3.9.2: Node Scenarios using Krkn-Hub
- 3.10: Pod Network Scenarios
- 3.11: Pod Scenarios
- 3.11.1: Pod Scenarios using Krkn
- 3.11.2: Pod Scenarios using Krkn-hub
- 3.12: Power Outage Scenarios
- 3.13: PVC Scenario
- 3.13.1: PVC Scenario using Krkn
- 3.13.2: PVC Scenario using Krkn-Hub
- 3.14: Service Disruption Scenarios
- 3.15: Service Hijacking Scenario
- 3.16: Time Scenarios
- 3.16.1: Time Scenarios using Krkn
- 3.16.2: Time Skew Scenarios using Krkn-Hub
- 3.17: Zone Outage Scenarios
- 3.18: All Scenarios Variables
- 3.19: Supported Cloud Providers
- 4: Chaos Testing Guide
- 5: Cerberus
- 5.1: Installation
- 5.2: Config
- 5.3: Example Report
- 5.4: Usage
- 5.5: Alerts
- 5.6: Node Problem Detector
- 5.7: Slack Integration
- 5.8: Contribute
- 6: Chaos Recommendation Tool
- 7: Contribution Guidelines
- 7.1: Testing your changes
- 7.2: Contributions
- 8: Krkn Roadmap
1 - krkn
krkn is a chaos and resiliency testing tool for Kubernetes. Kraken injects deliberate failures into Kubernetes clusters to check whether they are resilient to turbulent conditions.
Why do I want it?
There are a number of false assumptions that users might have when operating and running their applications in distributed systems:
- The network is reliable
- There is zero latency
- Bandwidth is infinite
- The network is secure
- Topology never changes
- The network is homogeneous
- Consistent resource usage with no spikes
- All shared resources are available from all places
These assumptions have led to a number of outages in production environments in the past. The services suffered from poor performance or were inaccessible to customers, leading to missed Service Level Agreement uptime promises, revenue loss, and a degradation in the perceived reliability of said services.
How can we best avoid this from happening? This is where chaos testing can add value.
Workflow
How to Get Started
Instructions on how to setup, configure and run Kraken can be found at Installation.
You may consider utilizing the chaos recommendation tool prior to initiating the chaos runs to profile the application service(s) under test. This tool discovers a list of Krkn scenarios with a high probability of causing failures or disruptions to your application service(s). The tool can be accessed at Chaos-Recommender.
See the getting started doc for details on how to get started with your own custom scenario or on editing current scenarios for your specific usage.
After installation, refer back to the below sections for supported scenarios and how to tweak the kraken config to load them on your cluster.
Running Kraken with minimal configuration tweaks
For cases where you want to run Kraken with minimal configuration changes, refer to krkn-hub. One use case is CI integration where you do not want to carry around different configuration files for the scenarios.
Config
Instructions on how to setup the config and the options supported can be found at Config.
Kraken scenario pass/fail criteria and report
It is important to make sure to check if the targeted component recovered from the chaos injection and also if the Kubernetes cluster is healthy as failures in one component can have an adverse impact on other components. Kraken does this by:
- Having built-in checks for pod and node based scenarios to ensure the expected number of replicas and nodes are up. It also supports running custom scripts with the checks.
- Leveraging Cerberus to monitor the cluster under test and consuming the aggregated go/no-go signal to determine pass/fail post chaos. It is highly recommended to turn on the Cerberus health check feature available in Kraken. Instructions on installing and setting up Cerberus can be found here, or it can be installed from Kraken using the instructions. Once Cerberus is up and running, set cerberus_enabled to True and cerberus_url to the URL where Cerberus publishes the go/no-go signal in the Kraken config file. Cerberus can monitor application routes during the chaos and fails the run if it encounters downtime, as that would be a potential downtime in a customer's or user's environment as well. This is especially important during control plane chaos scenarios involving the API server, Etcd, Ingress etc. It can be enabled by setting
check_applicaton_routes: True
in the Kraken config, provided application routes are being monitored in the cerberus config (see the sketch after this list).
- Leveraging the built-in alert collection feature to fail the runs in case of critical alerts.
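As an illustrative sketch of the Cerberus-related settings mentioned above (the key names follow the text, but the exact section layout of the shipped Kraken config may differ, so treat this as an assumption):
cerberus:
    cerberus_enabled: True                          # Consume the Cerberus go/no-go signal post chaos
    cerberus_url: http://0.0.0.0:8080               # URL where Cerberus publishes the go/no-go signal (illustrative value)
    check_applicaton_routes: True                   # Fail the run if monitored application routes see downtime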
Signaling
In CI runs or any external job, it is useful to stop Kraken once a certain test or state is reached. We created a way to signal Kraken to pause the chaos or stop it completely using a signal posted to a port of your choice.
For example, if we have a test run loading the cluster and Kraken running separately, we want to be able to start/stop the Kraken run based on when the test run completes or reaches a certain loaded state.
More detailed information on enabling and leveraging this feature can be found here.
Performance monitoring
Monitoring the Kubernetes/OpenShift cluster to observe the impact of Kraken chaos scenarios on various components is key to finding out the bottlenecks, as it is important to make sure the cluster is healthy in terms of both recovery and performance during/after the failure has been injected. Instructions on enabling it can be found here.
SLOs validation during and post chaos
- In addition to checking the recovery and health of the cluster and components under test, Kraken takes in a profile with Prometheus expressions to validate and alert on, and exits with a non-zero return code depending on the severity set. This feature can be used to determine pass/fail or to alert on abnormalities observed in the cluster based on the metrics (a sketch of such a profile entry follows below).
- Kraken also provides the ability to check if any critical alerts are firing in the cluster post chaos and passes/fails the run accordingly.
Information on enabling and leveraging this feature can be found here.
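As a sketch of what such a profile entry can contain (the expression and threshold below are illustrative, and the field names are assumptions based on the alert profiles shipped with Kraken):
- expr: increase(kube_pod_container_status_restarts_total{namespace=~"openshift-.*"}[10m]) > 5   # Prometheus expression to evaluate
  description: Pods are restarting frequently in system namespaces                               # summary reported when the expression fires
  severity: warning                                                                              # severity level; higher severities (e.g. critical) can fail the run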
OCM / ACM integration
Kraken supports injecting faults into Open Cluster Management (OCM) and Red Hat Advanced Cluster Management for Kubernetes (ACM) managed clusters through ManagedCluster Scenarios.
Where should I go next?
- Installation: Get started using krkn!
- Scenarios: Check out the scenarios we offer!
2 - Installation
The following ways are supported to run Kraken:
- Standalone python program through Git.
- Containerized version using either Podman or Docker as the runtime via Krkn-hub
- Kubernetes or OpenShift deployment ( unsupported )
Note
It is recommended to run Kraken external to the cluster ( Standalone or Containerized ) hitting the Kubernetes/OpenShift API, as running it internal to the cluster might be disruptive to itself and also might not report back the results if the chaos leads to the cluster's API server instability.
Note
To run Kraken on Power (ppc64le) architecture, build and run a containerized version by following the instructions given here.
Note
Helper functions for interactions in Krkn are part of krkn-lib. Please feel free to reuse and expand them as you see fit when adding a new scenario or expanding the capabilities of the current supported scenarios.
2.1 - Krkn
Installation
Git
Clone the repository
$ git clone https://github.com/krkn-chaos/krkn.git --branch <release version>
$ cd krkn
Install the dependencies
$ python3.9 -m venv chaos
$ source chaos/bin/activate
$ pip3.9 install -r requirements.txt
Note
Make sure python3-devel and the latest pip versions are installed on the system. The dependency installation has been tested with pip >= 21.1.3.
Running Krkn
$ python3.9 run_kraken.py --config <config_file_location>
Run containerized version
Krkn-hub is a wrapper that allows running Krkn chaos scenarios via podman or docker runtime with scenario parameters/configuration defined as environment variables.
2.2 - krkn-hub
Hosts container images and wrappers for running scenarios supported by Krkn, a chaos testing tool for Kubernetes clusters to ensure they are resilient to failures. All you need to do is run the containers with the respective environment variables defined as supported by the scenarios, without having to maintain and tweak files!
Set Up
You can use docker or podman to run krkn-hub.
Install Podman on your operating system based on these instructions
or
Install Docker on your system.
Docker is also supported, but all variables you want to set (separate from the defaults) need to be set at the command line in the form -e <VARIABLE>=<value>.
You can take advantage of the get_docker_params.sh script to create your parameters string. This will take all environment variables and put them in the form "-e <VARIABLE>=<value>" to make a long string that can get passed to the command.
For example: docker run $(./get_docker_params.sh) --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/redhat-chaos/krkn-hub:power-outages
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
3 - Scenarios
Supported chaos scenarios
Scenario | Description |
---|---|
Pod failures | Injects pod failures |
Container failures | Injects container failures based on the provided kill signal |
Node failures | Injects node failure through OpenShift/Kubernetes, cloud APIs |
Zone outages | Creates zone outage to observe the impact on the cluster, applications |
Time skew | Skews the time and date |
Node cpu hog | Hogs CPU on the targeted nodes |
Node memory hog | Hogs memory on the targeted nodes |
Node IO hog | Hogs io on the targeted nodes |
Service Disruption | Deleting all objects within a namespace |
Application outages | Isolates application Ingress/Egress traffic to observe the impact on dependent applications and recovery/initialization timing |
Power Outages | Shuts down the cluster for the specified duration and turns it back on to check the cluster health |
PVC disk fill | Fills up a given PersistentVolumeClaim by creating a temp file on the PVC from a pod associated with it |
Network Chaos | Introduces network latency, packet loss, bandwidth restriction in the egress traffic of a Node’s interface using tc and Netem |
Pod Network Chaos | Introduces network chaos at pod level |
Service Hijacking | Hijacks a service's HTTP traffic to simulate custom HTTP responses |
3.1 - Application Outage Scenarios
Application outages
Scenario to block the traffic ( Ingress/Egress ) of an application matching the labels for the specified duration of time, to understand the behavior of the service and of other services which depend on it during downtime. This helps with planning the requirements accordingly, be it improving the timeouts or tweaking the alerts etc.
3.1.1 - Application Outage Scenarios using Krkn
Sample scenario config
application_outage: # Scenario to create an outage of an application by blocking traffic
duration: 600 # Duration in seconds after which the routes will be accessible
namespace: <namespace-with-application> # Namespace to target - all application routes will go inaccessible if pod selector is empty
pod_selector: {app: foo} # Pods to target
block: [Ingress, Egress] # It can be Ingress or Egress or Ingress, Egress
Debugging steps in case of failures
Kraken creates a network policy blocking the ingress/egress traffic to create an outage. In case of failures before reverting back the network policy, you can delete it manually by executing the following command to stop the outage:
$ oc delete networkpolicy/kraken-deny -n <targeted-namespace>
3.1.2 - Application outage Scenario using Krkn-hub
This scenario disrupts the traffic to the specified application to help understand the impact of the outage on the dependent services and user experience. Refer to the docs for more details.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos and set the CERBERUS_ENABLED
environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:application-outages
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs the exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:application-outages
OR
$ docker run -e <VARIABLE>=<value> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:application-outages
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs the exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
See the list of variables that apply to all scenarios here; they can be used/set in addition to these scenario-specific variables.
Parameter | Description | Default |
---|---|---|
DURATION | Duration in seconds after which the routes will be accessible | 600 |
NAMESPACE | Namespace to target - all application routes will go inaccessible if pod selector is empty ( Required ) | No default |
POD_SELECTOR | Pods to target. For example “{app: foo}” | No default |
BLOCK_TRAFFIC_TYPE | It can be Ingress or Egress or Ingress, Egress ( needs to be a list ) | [Ingress, Egress] |
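For example, a hypothetical set of exports for this scenario (the values shown are illustrative) could be:
$ export NAMESPACE=<namespace-with-application>     # Required: namespace to target
$ export POD_SELECTOR="{app: foo}"                   # Optional: pods to target
$ export BLOCK_TRAFFIC_TYPE="[Ingress, Egress]"      # Traffic direction(s) to block
$ export DURATION=600                                # Outage duration in seconds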
Note
Defining the NAMESPACE parameter is required for running this scenario, while the pod_selector is optional. In case of using a pod selector to target a particular application, make sure to define it using the following format with a space between key and value: "{key: value}".
Note
In case of using a custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:application-outages
Demo
You can find a link to a demo of the scenario here
3.2 - Arcaflow Scenarios
Arcaflow is a workflow engine in development which provides the ability to execute workflow steps in sequence, in parallel, repeatedly, etc. The main difference from competitors such as Netflix Conductor is the ability to run ad-hoc workflows without requiring an infrastructure setup.
The engine uses containers to execute plugins and runs them either locally in Docker/Podman or remotely on a Kubernetes cluster. The workflow system is strongly typed and allows for generating JSON schema and OpenAPI documents for all data formats involved.
Available Scenarios
Hog scenarios:
Prerequisites
Arcaflow supports three deployment technologies:
- Docker
- Podman
- Kubernetes
Docker
In order to run Arcaflow Scenarios with the Docker deployer, be sure that:
- Docker is correctly installed in your Operating System (to find instructions on how to install docker please refer to Docker Documentation)
- The Docker daemon is running
Podman
The podman deployer is built around the podman CLI and does not necessarily need to be run along with the podman daemon. To run Arcaflow Scenarios in your Operating System, be sure that:
- podman is correctly installed in your Operating System (to find instructions on how to install podman refer to Podman Documentation)
- the podman CLI is in your shell PATH
Kubernetes
The kubernetes deployer integrates directly with the Kubernetes API client and needs only a valid kubeconfig file and a reachable Kubernetes/OpenShift cluster.
3.2.1 - Arcaflow Scenarios using Krkn
Usage
To enable arcaflow scenarios, edit the kraken config file: go to the section kraken -> chaos_scenarios of the yaml structure, add a new element to the list named arcaflow_scenarios, and then add the desired scenario pointing to the input.yaml file.
kraken:
...
chaos_scenarios:
- arcaflow_scenarios:
- scenarios/arcaflow/cpu-hog/input.yaml
input.yaml
The implemented scenarios can be found in the scenarios/arcaflow/<scenario_name> folder. The entrypoint of each scenario is the input.yaml file, which contains all the options to set up the scenario according to the desired target.
config.yaml
The arcaflow config file. Here you can set the arcaflow deployer and the arcaflow log level. The supported deployers are:
- Docker
- Podman (podman daemon not needed, suggested option)
- Kubernetes
The supported log levels are:
- debug
- info
- warning
- error
workflow.yaml
This file contains the steps that will be executed to perform the scenario against the target. Each step is represented by a container that will be executed by the deployer, along with its options. Note that we provide the scenarios as a template, but they can be manipulated to define more complex workflows. For more details regarding the arcaflow workflows architecture and syntax, it is suggested to refer to the Arcaflow Documentation.
This edit is no longer in the quay image. A fix is being worked on in ticket https://issues.redhat.com/browse/CHAOS-494. This will affect all versions 4.12 and higher of OpenShift.
3.3 - Container Scenarios
Kraken uses the oc exec
command to kill
specific containers in a pod.
This can be based on the pods namespace or labels. If you know the exact object you want to kill, you can also specify the specific container name or pod name in the scenario yaml file.
These scenarios are in a simple yaml format that you can manipulate to run your specific tests or use the pre-existing scenarios to see how it works.
3.3.1 - Container Scenarios using Krkn
Example Config
The following are the components of Kubernetes for which a basic chaos scenario config exists today.
scenarios:
- name: "<name of scenario>"
namespace: "<specific namespace>" # can specify "*" if you want to find in all namespaces
label_selector: "<label of pod(s)>"
container_name: "<specific container name>" # This is optional; if omitted, all containers in the pods found under the namespace and label will be killed
pod_names: # This is optional; if omitted, all pods with the given namespace and label will be selected
- <pod_name>
count: <number of containers to disrupt, default=1>
action: <kill signal to run. For example 1 ( hang up ) or 9. Default is set to 1>
expected_recovery_time: <number of seconds to wait for container to be running again> (defaults to 120 seconds)
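For illustration, a filled-in scenario along these lines would target the etcd containers; the namespace, label, and container name below are assumptions borrowed from the defaults listed in the Krkn-hub section that follows:
scenarios:
- name: "kill etcd container"
  namespace: "openshift-etcd"
  label_selector: "k8s-app=etcd"
  container_name: "etcd"
  count: 1
  action: 9                                # SIGKILL
  expected_recovery_time: 120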
Post Action
In all scenarios we do a post chaos check to wait and verify the specific component.
Here there are two options:
- Pass a custom script in the main config scenario list that will run before the chaos and verify that the output matches the post chaos output.
See scenarios/post_action_etcd_container.py for an example.
- container_scenarios: # List of chaos pod scenarios to load.
    - - scenarios/container_etcd.yml
      - scenarios/post_action_etcd_container.py
- Allow kraken to wait and check the killed containers until they become ready again. Kraken keeps a list of the specific containers that were killed as well as the namespaces and pods to verify all containers that were affected recover properly.
expected_recovery_time: <seconds to wait for container to recover>
3.3.2 - Container Scenarios using Krkn-hub
This scenario disrupts the containers matching the label in the specified namespace on a Kubernetes/OpenShift cluster.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos and set the CERBERUS_ENABLED
environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:container-scenarios
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs the exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:container-scenarios
OR
$ docker run -e <VARIABLE>=<value> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:container-scenarios
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs the exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
See the list of variables that apply to all scenarios here; they can be used/set in addition to these scenario-specific variables.
Parameter | Description | Default |
---|---|---|
NAMESPACE | Targeted namespace in the cluster | openshift-etcd |
LABEL_SELECTOR | Label of the container(s) to target | k8s-app=etcd |
DISRUPTION_COUNT | Number of containers to disrupt | 1 |
CONTAINER_NAME | Name of the container to disrupt | etcd |
ACTION | kill signal to run. For example 1 ( hang up ) or 9 | 1 |
EXPECTED_RECOVERY_TIME | Time to wait before checking if all containers that were affected recover properly | 60 |
Note
Set the NAMESPACE environment variable to openshift-.* to pick and disrupt pods randomly in openshift system namespaces; the DAEMON_MODE can also be enabled to disrupt the pods every x seconds in the background to check the reliability.
Note
In case of using a custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:container-scenarios
Demo
You can find a link to a demo of the scenario here
3.4 - CPU Hog Scenario
This scenario is based on the arcaflow arcaflow-plugin-stressng plugin. The purpose of this scenario is to create cpu pressure on a particular node of the Kubernetes/OpenShift cluster for a time span.
3.4.1 - CPU Hog Scenarios using Krkn
To enable this plugin add the pointer to the scenario input file scenarios/arcaflow/cpu-hog/input.yaml
as described in the
Usage section.
This scenario takes a list of objects named input_list
with the following properties:
- kubeconfig : string the kubeconfig needed by the deployer to deploy the sysbench plugin in the target cluster
- namespace : string the namespace where the scenario container will be deployed
Note: this parameter will be automatically filled by kraken if the kubeconfig_path property is correctly set
- node_selector : key-value map the node label that will be used as nodeSelector by the pod to target a specific cluster node
- duration : string stop stress test after N seconds. One can also specify the units of time in seconds, minutes, hours, days or years with the suffix s, m, h, d or y.
- cpu_count : int the number of CPU cores to be used (0 means all)
- cpu_method : string a fine-grained control of which cpu stressors to use (ackermann, cfloat etc. see manpage for all the cpu_method options)
- cpu_load_percentage : int the CPU load by percentage
To perform several load tests in the same run simultaneously (e.g. stress two or more nodes in the same run), add another item to the input_list with the same properties (and possibly different values, e.g. different node_selectors to schedule the pod on different nodes). To reduce (or increase) the parallelism, change the value of parallelism in the workload.yaml file.
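Putting these properties together, a hypothetical input_list entry in input.yaml might look like the sketch below; treat the layout and values as illustrative rather than the exact shipped format:
input_list:
  - kubeconfig: ""                          # filled automatically by kraken when kubeconfig_path is set
    namespace: default                      # namespace where the stress container is deployed
    node_selector:
      node-role.kubernetes.io/worker: ""    # label used as nodeSelector for the target node
    duration: 60s                           # stop the stress test after 60 seconds
    cpu_count: 0                            # 0 means use all CPU cores
    cpu_method: all                         # cpu stressor selection (see the stress-ng manpage)
    cpu_load_percentage: 80                 # target CPU load in percent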
Usage
To enable arcaflow scenarios, edit the kraken config file: go to the section kraken -> chaos_scenarios of the yaml structure, add a new element to the list named arcaflow_scenarios, and then add the desired scenario pointing to the input.yaml file.
kraken:
...
chaos_scenarios:
- arcaflow_scenarios:
- scenarios/arcaflow/cpu-hog/input.yaml
input.yaml
The implemented scenarios can be found in the scenarios/arcaflow/<scenario_name> folder. The entrypoint of each scenario is the input.yaml file, which contains all the options to set up the scenario according to the desired target.
config.yaml
The arcaflow config file. Here you can set the arcaflow deployer and the arcaflow log level. The supported deployers are:
- Docker
- Podman (podman daemon not needed, suggested option)
- Kubernetes
The supported log levels are:
- debug
- info
- warning
- error
workflow.yaml
This file contains the steps that will be executed to perform the scenario against the target. Each step is represented by a container that will be executed by the deployer, along with its options. Note that we provide the scenarios as a template, but they can be manipulated to define more complex workflows. For more details regarding the arcaflow workflows architecture and syntax, it is suggested to refer to the Arcaflow Documentation.
This edit is no longer in the quay image. A fix is being worked on in ticket https://issues.redhat.com/browse/CHAOS-494. This will affect all versions 4.12 and higher of OpenShift.
3.4.2 - CPU Hog Scenario using Krkn-Hub
This scenario hogs the cpu on the specified node on a Kubernetes/OpenShift cluster for a specified duration. For more information, refer to the following documentation.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos and set the CERBERUS_ENABLED
environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-cpu-hog
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs the exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-cpu-hog
OR
$ docker run -e <VARIABLE>=<value> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-cpu-hog
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs the exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
See the list of variables that apply to all scenarios here; they can be used/set in addition to these scenario-specific variables.
Parameter | Description | Default |
---|---|---|
TOTAL_CHAOS_DURATION | Set chaos duration (in sec) as desired | 60 |
NODE_CPU_CORE | Number of cores (workers) of node CPU to be consumed | 2 |
NODE_CPU_PERCENTAGE | Percentage of total cpu to be consumed | 50 |
NAMESPACE | Namespace where the scenario container will be deployed | default |
NODE_SELECTORS | Node selectors where the scenario containers will be scheduled, in the format "<selector>=<value>". NOTE: This value can be specified as a list of node selectors separated by ";". A container will be instantiated per node selector with the same scenario options. This option is meant to run one or more stress scenarios simultaneously on different nodes; Kubernetes will schedule the pods on the target nodes according to the specified selectors. Specifying the same selector multiple times will instantiate as many scenario containers as the number of times the selector is specified, on the same node | "" |
Note
In case of using a custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-cpu-hog
Demo
You can find a link to a demo of the scenario here
3.5 - IO Hog Scenario
This scenario is based on the arcaflow arcaflow-plugin-stressng plugin.
The purpose of this scenario is to create disk pressure on a particular node of the Kubernetes/OpenShift cluster for a time span.
The scenario allows attaching a node path to the pod as a hostPath volume.
3.5.1 - IO Hog Scenarios using Krkn
To enable this plugin add the pointer to the scenario input file scenarios/arcaflow/io-hog/input.yaml
as described in the
Usage section.
This scenario takes a list of objects named input_list
with the following properties:
- kubeconfig : string the kubeconfig needed by the deployer to deploy the sysbench plugin in the target cluster
- namespace : string the namespace where the scenario container will be deployed
Note: this parameter will be automatically filled by kraken if the kubeconfig_path property is correctly set
- node_selector : key-value map the node label that will be used as nodeSelector by the pod to target a specific cluster node
- duration : string stop stress test after N seconds. One can also specify the units of time in seconds, minutes, hours, days or years with the suffix s, m, h, d or y.
- target_pod_folder : string the path in the pod where the volume is mounted
- target_pod_volume : object the hostPath volume definition in the Kubernetes/OpenShift format, that will be attached to the pod as a volume
- io_write_bytes : string writes N bytes for each hdd process. The size can be expressed as % of free space on the file system or in units of Bytes, KBytes, MBytes and GBytes using the suffix b, k, m or g
- io_block_size : string size of each write in bytes. Size can be from 1 byte to 4m.
To perform several load tests in the same run simultaneously (e.g. stress two or more nodes in the same run), add another item to the input_list with the same properties (and possibly different values, e.g. different node_selectors to schedule the pod on different nodes). To reduce (or increase) the parallelism, change the value of parallelism in the workload.yaml file.
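A hypothetical input_list entry combining these properties might look like the sketch below; the hostPath definition, paths, and sizes are illustrative only:
input_list:
  - kubeconfig: ""                          # filled automatically by kraken when kubeconfig_path is set
    namespace: default
    node_selector:
      node-role.kubernetes.io/worker: ""
    duration: 180s
    target_pod_folder: /hog-data            # path inside the pod where the volume is mounted
    target_pod_volume:                      # hostPath volume attached to the pod (Kubernetes format)
      name: node-volume
      hostPath:
        path: /tmp
    io_write_bytes: 10m                     # bytes written by each hdd stressor
    io_block_size: 1m                       # size of each write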
Usage
To enable arcaflow scenarios, edit the kraken config file: go to the section kraken -> chaos_scenarios of the yaml structure, add a new element to the list named arcaflow_scenarios, and then add the desired scenario pointing to the input.yaml file.
kraken:
...
chaos_scenarios:
- arcaflow_scenarios:
- scenarios/arcaflow/cpu-hog/input.yaml
input.yaml
The implemented scenarios can be found in the scenarios/arcaflow/<scenario_name> folder. The entrypoint of each scenario is the input.yaml file, which contains all the options to set up the scenario according to the desired target.
config.yaml
The arcaflow config file. Here you can set the arcaflow deployer and the arcaflow log level. The supported deployers are:
- Docker
- Podman (podman daemon not needed, suggested option)
- Kubernetes
The supported log levels are:
- debug
- info
- warning
- error
workflow.yaml
This file contains the steps that will be executed to perform the scenario against the target. Each step is represented by a container that will be executed by the deployer, along with its options. Note that we provide the scenarios as a template, but they can be manipulated to define more complex workflows. For more details regarding the arcaflow workflows architecture and syntax, it is suggested to refer to the Arcaflow Documentation.
This edit is no longer in the quay image. A fix is being worked on in ticket https://issues.redhat.com/browse/CHAOS-494. This will affect all versions 4.12 and higher of OpenShift.
3.5.2 - IO Hog Scenario using Krkn-Hub
This scenario hogs the IO on the specified node on a Kubernetes/OpenShift cluster for a specified duration. For more information, refer to the following documentation.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos and set the CERBERUS_ENABLED
environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/root/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-io-hog
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs the exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/root/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-io-hog
OR
$ docker run -e <VARIABLE>=<value> --net=host -v <path-to-kube-config>:/root/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-io-hog
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs the exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
See the list of variables that apply to all scenarios here; they can be used/set in addition to these scenario-specific variables.
Parameter | Description | Default |
---|---|---|
TOTAL_CHAOS_DURATION | Set chaos duration (in sec) as desired | 180 |
IO_BLOCK_SIZE | string size of each write in bytes. Size can be from 1 byte to 4m | 1m |
IO_WORKERS | Number of stressors | 5 |
IO_WRITE_BYTES | string writes N bytes for each hdd process. The size can be expressed as % of free space on the file system or in units of Bytes, KBytes, MBytes and GBytes using the suffix b, k, m or g | 10m |
NAMESPACE | Namespace where the scenario container will be deployed | default |
NODE_SELECTORS | Node selectors where the scenario containers will be scheduled, in the format "<selector>=<value>". NOTE: This value can be specified as a list of node selectors separated by ";". A container will be instantiated per node selector with the same scenario options. This option is meant to run one or more stress scenarios simultaneously on different nodes; Kubernetes will schedule the pods on the target nodes according to the specified selectors. Specifying the same selector multiple times will instantiate as many scenario containers as the number of times the selector is specified, on the same node | "" |
Note
In case of using a custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/root/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/root/kraken/config/alerts -v <path-to-kube-config>:/root/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-io-hog
3.6 - ManagedCluster Scenarios
ManagedCluster scenarios provide a way to integrate kraken with Open Cluster Management (OCM) and Red Hat Advanced Cluster Management for Kubernetes (ACM).
ManagedCluster scenarios leverage ManifestWorks to inject faults into the ManagedClusters.
The following ManagedCluster chaos scenarios are supported:
- managedcluster_start_scenario: Scenario to start the ManagedCluster instance.
- managedcluster_stop_scenario: Scenario to stop the ManagedCluster instance.
- managedcluster_stop_start_scenario: Scenario to stop and then start the ManagedCluster instance.
- start_klusterlet_scenario: Scenario to start the klusterlet of the ManagedCluster instance.
- stop_klusterlet_scenario: Scenario to stop the klusterlet of the ManagedCluster instance.
- stop_start_klusterlet_scenario: Scenario to stop and start the klusterlet of the ManagedCluster instance.
ManagedCluster scenarios can be injected by placing the ManagedCluster scenarios config files under managedcluster_scenarios
option in the Kraken config. Refer to managedcluster_scenarios_example config file.
managedcluster_scenarios:
- actions: # ManagedCluster chaos scenarios to be injected
- managedcluster_stop_start_scenario
managedcluster_name: cluster1 # ManagedCluster on which scenario has to be injected; can set multiple names separated by comma
# label_selector: # When managedcluster_name is not specified, a ManagedCluster with matching label_selector is selected for ManagedCluster chaos scenario injection
instance_count: 1 # Number of managedcluster to perform action/select that match the label selector
runs: 1 # Number of times to inject each scenario under actions (will perform on same ManagedCluster each time)
timeout: 420 # Duration to wait for completion of ManagedCluster scenario injection
# For OCM to detect a ManagedCluster as unavailable, have to wait 5*leaseDurationSeconds
# (default leaseDurationSeconds = 60 sec)
- actions:
- stop_start_klusterlet_scenario
managedcluster_name: cluster1
# label_selector:
instance_count: 1
runs: 1
timeout: 60
3.7 - Memory Hog Scenario
This scenario is based on the arcaflow arcaflow-plugin-stressng plugin. The purpose of this scenario is to create Virtual Memory pressure on a particular node of the Kubernetes/OpenShift cluster for a time span.
3.7.1 - Memory Hog Scenarios using Krkn
To enable this plugin add the pointer to the scenario input file scenarios/arcaflow/memory-hog/input.yaml
as described in the
Usage section.
This scenario takes a list of objects named input_list
with the following properties:
- kubeconfig : string the kubeconfig needed by the deployer to deploy the sysbench plugin in the target cluster
- namespace : string the namespace where the scenario container will be deployed
Note: this parameter will be automatically filled by kraken if the kubeconfig_path property is correctly set
- node_selector : key-value map the node label that will be used as nodeSelector by the pod to target a specific cluster node
- duration : string stop stress test after N seconds. One can also specify the units of time in seconds, minutes, hours, days or years with the suffix s, m, h, d or y.
- vm_bytes : string N bytes per vm process or percentage of memory used (using the % symbol). The size can be expressed in units of Bytes, KBytes, MBytes and GBytes using the suffix b, k, m or g.
- vm_workers : int Number of VM stressors to be run (0 means 1 stressor per CPU)
To perform several load tests in the same run simultaneously (e.g. stress two or more nodes in the same run), add another item to the input_list with the same properties (and possibly different values, e.g. different node_selectors to schedule the pod on different nodes). To reduce (or increase) the parallelism, change the value of parallelism in the workload.yaml file.
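A hypothetical input_list entry for this scenario might look like the sketch below; treat the layout and values as illustrative:
input_list:
  - kubeconfig: ""                          # filled automatically by kraken when kubeconfig_path is set
    namespace: default
    node_selector:
      node-role.kubernetes.io/worker: ""
    duration: 60s
    vm_bytes: 90%                           # memory to consume per vm stressor (percentage or absolute size)
    vm_workers: 1                           # number of vm stressors (0 means one per CPU)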
Usage
To enable arcaflow scenarios, edit the kraken config file: go to the section kraken -> chaos_scenarios of the yaml structure, add a new element to the list named arcaflow_scenarios, and then add the desired scenario pointing to the input.yaml file.
kraken:
...
chaos_scenarios:
- arcaflow_scenarios:
- scenarios/arcaflow/cpu-hog/input.yaml
input.yaml
The implemented scenarios can be found in the scenarios/arcaflow/<scenario_name> folder. The entrypoint of each scenario is the input.yaml file, which contains all the options to set up the scenario according to the desired target.
config.yaml
The arcaflow config file. Here you can set the arcaflow deployer and the arcaflow log level. The supported deployers are:
- Docker
- Podman (podman daemon not needed, suggested option)
- Kubernetes
The supported log levels are:
- debug
- info
- warning
- error
workflow.yaml
This file contains the steps that will be executed to perform the scenario against the target. Each step is represented by a container that will be executed by the deployer, along with its options. Note that we provide the scenarios as a template, but they can be manipulated to define more complex workflows. For more details regarding the arcaflow workflows architecture and syntax, it is suggested to refer to the Arcaflow Documentation.
This edit is no longer in the quay image. A fix is being worked on in ticket https://issues.redhat.com/browse/CHAOS-494. This will affect all versions 4.12 and higher of OpenShift.
3.7.2 - Memory Hog Scenario using Krkn-Hub
This scenario hogs the memory on the specified node on a Kubernetes/OpenShift cluster for a specified duration. For more information, refer to the following documentation.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos and set the CERBERUS_ENABLED
environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-memory-hog
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs the exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-memory-hog
OR
$ docker run -e <VARIABLE>=<value> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-memory-hog
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs the exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
See the list of variables that apply to all scenarios here; they can be used/set in addition to these scenario-specific variables.
Parameter | Description | Default |
---|---|---|
TOTAL_CHAOS_DURATION | Set chaos duration (in sec) as desired | 60 |
MEMORY_CONSUMPTION_PERCENTAGE | percentage (expressed with the suffix %) or amount (expressed with the suffix b, k, m or g) of memory to be consumed by the scenario | 90% |
NUMBER_OF_WORKERS | Total number of workers (stress-ng threads) | 1 |
NAMESPACE | Namespace where the scenario container will be deployed | default |
NODE_SELECTORS | Node selectors where the scenario containers will be scheduled, in the format "<selector>=<value>". NOTE: This value can be specified as a list of node selectors separated by ";". A container will be instantiated per node selector with the same scenario options. This option is meant to run one or more stress scenarios simultaneously on different nodes; Kubernetes will schedule the pods on the target nodes according to the specified selectors. Specifying the same selector multiple times will instantiate as many scenario containers as the number of times the selector is specified, on the same node | "" |
Note
In case of using a custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-memory-hog
Demo
You can find a link to a demo of the scenario here
3.8 - Network Chaos Scenario
Scenario to introduce network latency, packet loss, and bandwidth restriction in the Node’s host network interface. The purpose of this scenario is to observe faults caused by random variations in the network.
3.8.1 - Network Chaos Scenario using Krkn
Sample scenario config for egress traffic shaping
network_chaos: # Scenario to create an outage by simulating random variations in the network.
duration: 300 # In seconds - duration network chaos will be applied.
node_name: # Comma separated node names on which scenario has to be injected.
label_selector: node-role.kubernetes.io/master # When node_name is not specified, a node with matching label_selector is selected for running the scenario.
instance_count: 1 # Number of nodes in which to execute network chaos.
interfaces: # List of interface on which to apply the network restriction.
- "ens5" # Interface name would be the Kernel host network interface name.
execution: serial|parallel # Execute each of the egress options as a single scenario(parallel) or as separate scenario(serial).
egress:
latency: 500ms
loss: 50% # percentage
bandwidth: 10mbit
Sample scenario config for ingress traffic shaping (using a plugin)
- id: network_chaos
config:
node_interface_name: # Dictionary with key as node name(s) and value as a list of its interfaces to test
ip-10-0-128-153.us-west-2.compute.internal:
- ens5
- genev_sys_6081
label_selector: node-role.kubernetes.io/master # When node_interface_name is not specified, nodes with matching label_selector is selected for node chaos scenario injection
instance_count: 1 # Number of nodes to perform action/select that match the label selector
kubeconfig_path: ~/.kube/config # Path to kubernetes config file. If not specified, it defaults to ~/.kube/config
execution_type: parallel # Execute each of the ingress options as a single scenario(parallel) or as separate scenario(serial).
network_params:
latency: 500ms
loss: '50%'
bandwidth: 10mbit
wait_duration: 120
test_duration: 60
Note: For ingress traffic shaping, ensure that your node doesn't have any IFB (https://wiki.linuxfoundation.org/networking/ifb) interfaces already present. The scenario relies on creating IFBs to do the shaping, and they are deleted at the end of the scenario.
Steps
- Pick the nodes to introduce the network anomaly either from node_name or label_selector.
- Verify the interface list in one of the nodes, or use the interface with a default route as the test interface if no interface is specified by the user.
- Set the traffic shaping config on the node's interface using tc and netem (see the illustrative command after this list).
- Wait for the duration time.
- Remove the traffic shaping config on the node's interface.
- Remove the job that spawned the pod.
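For reference, the egress shaping applied on the node's interface is roughly equivalent to a tc/netem invocation like the one below; the interface name and values mirror the sample egress config above, and the exact command issued by the scenario may differ:
$ tc qdisc add dev ens5 root netem delay 500ms loss 50% rate 10mbit   # add latency, packet loss and a bandwidth cap
$ tc qdisc del dev ens5 root                                           # revert the shaping after the test duration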
3.8.2 - Network Chaos Scenario using Krkn-Hub
This scenario introduces network latency, packet loss, and bandwidth restriction in the egress traffic of a Node's interface using tc and netem. For more information, refer to the following documentation.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos and set the CERBERUS_ENABLED
environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:network-chaos
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs the exit code which can be considered as pass/fail for the scenario
$ docker run -e <VARIABLE>=<value> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:network-chaos
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs the exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
Note
Set export TRAFFIC_TYPE=egress for Egress scenarios and export TRAFFIC_TYPE=ingress for Ingress scenarios.
See the list of variables that apply to all scenarios here; they can be used/set in addition to these scenario-specific variables.
Egress Scenarios
Parameter | Description | Default |
---|---|---|
DURATION | Duration in seconds during which network chaos will be applied. | 300 |
NODE_NAME | Node name to inject faults in case of targeting a specific node; Can set multiple node names separated by a comma | "" |
LABEL_SELECTOR | When NODE_NAME is not specified, a node with matching label_selector is selected for running. | node-role.kubernetes.io/master |
INSTANCE_COUNT | Targeted instance count matching the label selector | 1 |
INTERFACES | List of interface on which to apply the network restriction. | [] |
EXECUTION | Execute each of the egress options as a single scenario (parallel) or as separate scenarios (serial). | parallel |
EGRESS | Dictionary of values to set network latency (latency: 50ms), packet loss (loss: 0.02), bandwidth restriction (bandwidth: 100mbit) | {bandwidth: 100mbit} |
Ingress Scenarios
Parameter | Description | Default |
---|---|---|
DURATION | Duration in seconds during which network chaos will be applied. | 300 |
TARGET_NODE_AND_INTERFACE | # Dictionary with key as node name(s) and value as a list of its interfaces to test. For example: {ip-10-0-216-2.us-west-2.compute.internal: [ens5]} | "" |
LABEL_SELECTOR | When NODE_NAME is not specified, a node with matching label_selector is selected for running. | node-role.kubernetes.io/master |
INSTANCE_COUNT | Targeted instance count matching the label selector | 1 |
EXECUTION | Used to specify whether you want to apply filters on interfaces one at a time or all at once. | parallel |
NETWORK_PARAMS | latency, loss and bandwidth are the three supported network parameters to alter for the chaos test. For example: {latency: 50ms, loss: ‘0.02’} | "" |
WAIT_DURATION | Ensure that it is at least about twice the test_duration | 300 |
Note
In case of using a custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:network-chaos
3.9 - Node Scenarios
This scenario disrupts the node(s) matching the label on a Kubernetes/OpenShift cluster.
3.9.1 - Node Scenarios using Krkn
The following node chaos scenarios are supported:
- node_start_scenario: Scenario to start the node instance.
- node_stop_scenario: Scenario to stop the node instance.
- node_stop_start_scenario: Scenario to stop and then start the node instance. Not supported on VMware.
- node_termination_scenario: Scenario to terminate the node instance.
- node_reboot_scenario: Scenario to reboot the node instance.
- stop_kubelet_scenario: Scenario to stop the kubelet of the node instance.
- stop_start_kubelet_scenario: Scenario to stop and start the kubelet of the node instance.
- restart_kubelet_scenario: Scenario to restart the kubelet of the node instance.
- node_crash_scenario: Scenario to crash the node instance.
- stop_start_helper_node_scenario: Scenario to stop and start the helper node and check service status.
Note
If the node does not recover from the node_crash_scenario injection, reboot the node to get it back to Ready state.
Note
node_start_scenario, node_stop_scenario, node_stop_start_scenario, node_termination_scenario, node_reboot_scenario and stop_start_kubelet_scenario are supported on AWS, Azure, OpenStack, BareMetal, GCP, VMware and Alibaba.
AWS
Cloud setup instructions can be found here. Sample scenario config can be found here.
Baremetal
Sample scenario config can be found here.
Note
Baremetal requires setting the IPMI user and password to power on, off, and reboot nodes, using the config options bm_user and bm_password. They can either be set in the root of the entry in the scenarios config, or they can be set per machine.
If no per-machine addresses are specified, kraken attempts to use the BMC value in the BareMetalHost object. To list them, you can do 'oc get bmh -o wide --all-namespaces'. If the BMC values are blank, you must specify them per-machine using the config option 'bmc_addr' as specified below.
For per-machine settings, add a “bmc_info” section to the entry in the scenarios config. Inside there, add a configuration section using the node name. In that, add per-machine settings. Valid settings are ‘bmc_user’, ‘bmc_password’, and ‘bmc_addr’. See the example node scenario or the example below.
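A rough sketch of how those settings might be laid out inside a baremetal node scenario entry (node name, address, and credentials are placeholders; verify the exact nesting against the example node scenario):
bm_user: admin                        # default IPMI/BMC user for all machines
bm_password: password                 # default IPMI/BMC password for all machines
bmc_info:
  node-1.example.internal:            # per-machine overrides keyed by node name
    bmc_addr: 10.0.0.5
    bmc_user: node1-admin
    bmc_password: node1-password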
Note
Baremetal requires oc (the OpenShift client) to be installed on the machine running Kraken.
Note
Baremetal machines are fragile. Some node actions can occasionally corrupt the filesystem if the node does not shut down properly, and sometimes the kubelet does not start properly.
Docker
The Docker provider can be used to run node scenarios against kind clusters.
kind is a tool for running local Kubernetes clusters using Docker container “nodes”.
kind was primarily designed for testing Kubernetes itself, but may be used for local development or CI.
GCP
Cloud setup instructions can be found here. Sample scenario config can be found here.
Openstack
How to set up Openstack cli to run node scenarios is defined here.
The supported node-level chaos scenarios on an OpenStack cloud are node_stop_start_scenario, stop_start_kubelet_scenario and node_reboot_scenario.
Note
For stop_start_helper_node_scenario, visit here to learn more about the helper node and its usage.
To execute the scenario, ensure the value for ssh_private_key in the node scenarios config file is set to the correct private key file path for the ssh connection to the helper node. Ensure passwordless ssh is configured between the host running Kraken and the helper node to avoid connection errors.
Azure
Cloud setup instructions can be found here. Sample scenario config can be found here.
Alibaba
How to set up Alibaba cli to run node scenarios is defined here.
Note
There is no "terminating" concept in Alibaba, so any scenario with terminate will "release" the node. Releasing a node is a 2-step process: stopping the node and then releasing it.
VMware
How to set up VMware vSphere to run node scenarios is defined here.
This cloud type uses a different configuration style; see the actions below and the example config file.
- vmware-node-terminate
- vmware-node-reboot
- vmware-node-stop
- vmware-node-start
IBMCloud
How to set up IBMCloud to run node scenarios is defined here.
This cloud type uses a different configuration style; see the actions below and the example config file.
- ibmcloud-node-terminate
- ibmcloud-node-reboot
- ibmcloud-node-stop
- ibmcloud-node-start
General
Note
The node_crash_scenario and stop_kubelet_scenario scenarios are supported independent of the cloud platform.
Use 'generic' or do not add the 'cloud_type' key to your scenario if your cluster is not set up using one of the currently supported cloud types.
3.9.2 - Node Scenarios using Krkn-Hub
This scenario disrupts the node(s) matching the label on a Kubernetes/OpenShift cluster. Actions/disruptions supported are listed here
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos and set the CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-scenarios
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-scenarios
OR
$ docker run -e <VARIABLE>=<value> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-scenarios
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
Parameter | Description | Default |
---|---|---|
ACTION | Action/disruption to run; can be any of the node actions listed above | node_stop_start_scenario for aws, vmware-node-reboot for vmware, ibmcloud-node-reboot for ibmcloud |
LABEL_SELECTOR | Node label to target | node-role.kubernetes.io/worker |
NODE_NAME | Node name to inject faults in case of targeting a specific node; Can set multiple node names separated by a comma | "" |
INSTANCE_COUNT | Targeted instance count matching the label selector | 1 |
RUNS | Iterations to perform action on a single node | 1 |
CLOUD_TYPE | Cloud platform on top of which cluster is running, supported platforms - aws, vmware, ibmcloud, bm | aws |
TIMEOUT | Duration to wait for completion of node scenario injection | 180 |
DURATION | Duration to stop the node before running the start action - not supported for vmware and ibm cloud type | 120 |
VERIFY_SESSION | Only needed for vmware - Set to True if you want to verify the vSphere client session using certificates | False |
SKIP_OPENSHIFT_CHECKS | Only needed for vmware - Set to True if you don’t want to wait for the status of the nodes to change on OpenShift before passing the scenario | False |
BMC_USER | Only needed for Baremetal ( bm ) - IPMI/bmc username | "" |
BMC_PASSWORD | Only needed for Baremetal ( bm ) - IPMI/bmc password | "" |
BMC_ADDR | Only needed for Baremetal ( bm ) - IPMI/bmc address | "" |
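For example, to stop and restart a single worker node on AWS (values are illustrative), export the variables before starting the node-scenarios container as shown in the Run section above:
$ export CLOUD_TYPE=aws
$ export ACTION=node_stop_start_scenario
$ export LABEL_SELECTOR=node-role.kubernetes.io/worker
$ export INSTANCE_COUNT=1
$ export DURATION=120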
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:container-scenarios
The following environment variables need to be set for the scenarios that require interacting with the cloud platform API to perform the actions:
Amazon Web Services
$ export AWS_ACCESS_KEY_ID=<>
$ export AWS_SECRET_ACCESS_KEY=<>
$ export AWS_DEFAULT_REGION=<>
VMware Vsphere
$ export VSPHERE_IP=<vSphere_client_IP_address>
$ export VSPHERE_USERNAME=<vSphere_client_username>
$ export VSPHERE_PASSWORD=<vSphere_client_password>
Ibmcloud
$ export IBMC_URL=https://<region>.iaas.cloud.ibm.com/v1
$ export IBMC_APIKEY=<ibmcloud_api_key>
Baremetal
$ export BMC_USER=<bmc/IPMI user>
$ export BMC_PASSWORD=<bmc/IPMI password>
$ export BMC_ADDR=<bmc address>
Google Cloud Platform
TBD
Azure
$ export AZURE_TENANT_ID=<>
$ export AZURE_CLIENT_SECRET=<>
$ export AZURE_CLIENT_ID=<>
OpenStack
TBD
Demo
You can find a link to a demo of the scenario here
3.10 - Pod Network Scenarios
Pod outage
Scenario to block the traffic (Ingress/Egress) of a pod matching the labels for the specified duration of time, to understand the behavior of the service and of the other services that depend on it during downtime. This helps with planning the requirements accordingly, be it improving the timeouts or tweaking the alerts. With the current network policies, it is not possible to explicitly block ports which are enabled by an allowed network policy rule. This chaos scenario addresses this issue by using OVS flow rules to block ports related to the pod. It supports OpenShiftSDN and OVNKubernetes based networks.
3.10.1 - Pod Network Scenarios using Krkn
Sample scenario config (using a plugin)
- id: pod_network_outage
config:
namespace: openshift-console # Required - Namespace of the pod to which filters need to be applied
direction: # Optional - List of directions to apply filters
- ingress # Blocks ingress traffic. Default: both egress and ingress
ingress_ports: # Optional - List of ports to block traffic on
- 8443 # Blocks 8443, Default [], i.e. all ports.
label_selector: 'component=ui' # Blocks access to openshift console
Pod Network shaping
Scenario to introduce network latency, packet loss, and bandwidth restriction in the Pod’s network interface. The purpose of this scenario is to observe faults caused by random variations in the network.
Sample scenario config for egress traffic shaping (using plugin)
- id: pod_egress_shaping
config:
namespace: openshift-console # Required - Namespace of the pod to which filters need to be applied.
label_selector: 'component=ui' # Applies traffic shaping to access openshift console.
network_params:
latency: 500ms # Add 500ms latency to egress traffic from the pod.
Sample scenario config for ingress traffic shaping (using plugin)
- id: pod_ingress_shaping
config:
namespace: openshift-console # Required - Namespace of the pod to which filters need to be applied.
label_selector: 'component=ui' # Applies traffic shaping to access openshift console.
network_params:
latency: 500ms # Add 500ms latency to ingress traffic to the pod.
Steps
- Pick the pods to introduce the network anomaly either from label_selector or pod_name.
- Identify the pod interface name on the node.
- Set traffic shaping config on pod’s interface using tc and netem (a rough sketch of the underlying commands follows this list).
- Wait for the duration time.
- Remove traffic shaping config on pod’s interface.
- Remove the job that spawned the pod.
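As a rough sketch of the kind of tc/netem commands applied in the shaping step (the interface name eth0 and the values below are assumptions for illustration; Krkn derives the actual interface and parameters from the scenario config):
# apply 500ms latency and 0.02% packet loss on the pod's interface
tc qdisc add dev eth0 root netem delay 500ms loss 0.02%
# remove the shaping once the test duration elapses
tc qdisc del dev eth0 root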
3.10.2 - Pod Network Chaos Scenarios using Krkn-hub
This scenario runs network chaos at the pod level on a Kubernetes/OpenShift cluster.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos and set the CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pod-network-chaos
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pod-network-chaos
OR
$ docker run -e <VARIABLE>=<value> --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pod-network-chaos
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
Parameter | Description | Default |
---|---|---|
NAMESPACE | Required - Namespace of the pod to which filters need to be applied | "" |
LABEL_SELECTOR | Label of the pod(s) to target | "" |
POD_NAME | When label_selector is not specified, pod matching the name will be selected for the chaos scenario | "" |
INSTANCE_COUNT | Number of pods to perform action/select that match the label selector | 1 |
TRAFFIC_TYPE | List of directions to apply filters - egress/ingress ( needs to be a list ) | [ingress, egress] |
INGRESS_PORTS | Ingress ports to block ( needs to be a list ) | [] i.e all ports |
EGRESS_PORTS | Egress ports to block ( needs to be a list ) | [] i.e all ports |
WAIT_DURATION | Ensure that it is at least about twice the test_duration | 300 |
TEST_DURATION | Duration of the test run | 120 |
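For example, to block ingress traffic on port 8443 for pods labelled component=ui in the openshift-console namespace (values are illustrative, reusing the Krkn sample config above):
$ export NAMESPACE=openshift-console
$ export LABEL_SELECTOR='component=ui'
$ export TRAFFIC_TYPE='[ingress]'
$ export INGRESS_PORTS='[8443]'
$ export TEST_DURATION=120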
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pod-network-chaos
3.11 - Pod Scenarios
Krkn recently replaced PowerfulSeal with its own internal pod scenarios using a plugin system. This scenario disrupts the pods matching the label in the specified namespace on a Kubernetes/OpenShift cluster.
3.11.1 - Pod Scenarios using Krkn
Example Config
To run plugin-based pod scenarios, list the scenario file(s) under plugin_scenarios in the Kraken config:
kraken:
chaos_scenarios:
- plugin_scenarios:
- path/to/scenario.yaml
You can then create the scenario file with the following contents:
# yaml-language-server: $schema=../plugin.schema.json
- id: kill-pods
config:
namespace_pattern: ^kube-system$
label_selector: k8s-app=kube-scheduler
krkn_pod_recovery_time: 120
Please adjust the schema reference to point to the schema file. This file will give you code completion and documentation for the available options in your IDE.
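For instance, a hypothetical scenario file targeting Prometheus pods (the namespace pattern and label below are assumptions for illustration; adjust them to your cluster):
# yaml-language-server: $schema=../plugin.schema.json
- id: kill-pods
  config:
    namespace_pattern: ^openshift-monitoring$            # illustrative namespace pattern
    label_selector: app.kubernetes.io/name=prometheus    # illustrative pod label
    krkn_pod_recovery_time: 120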
Pod Chaos Scenarios
The following are the components of Kubernetes/OpenShift for which a basic chaos scenario config exists today.
Component | Description | Working |
---|---|---|
Basic pod scenario | Kill a pod. | :heavy_check_mark: |
Etcd | Kills a single/multiple etcd replicas. | :heavy_check_mark: |
Kube ApiServer | Kills a single/multiple kube-apiserver replicas. | :heavy_check_mark: |
ApiServer | Kills a single/multiple apiserver replicas. | :heavy_check_mark: |
Prometheus | Kills a single/multiple prometheus replicas. | :heavy_check_mark: |
OpenShift System Pods | Kills random pods running in the OpenShift system namespaces. | :heavy_check_mark: |
3.11.2 - Pod Scenarios using Krkn-hub
This scenario disrupts the pods matching the label in the specified namespace on a Kubernetes/OpenShift cluster.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos and set the CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pod-scenarios
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pod-scenarios
OR
$ docker run -e <VARIABLE>=<value> --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pod-scenarios
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
Parameter | Description | Default |
---|---|---|
NAMESPACE | Targeted namespace in the cluster ( supports regex ) | openshift-.* |
POD_LABEL | Label of the pod(s) to target | "" |
NAME_PATTERN | Regex pattern to match the pods in NAMESPACE when POD_LABEL is not specified | .* |
DISRUPTION_COUNT | Number of pods to disrupt | 1 |
KILL_TIMEOUT | Timeout to wait for the target pod(s) to be removed in seconds | 180 |
EXPECTED_RECOVERY_TIME | Fails if the disrupted pod(s) do not recover within the timeout set | 120 |
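For example, to disrupt a single etcd pod (the namespace and label values are illustrative):
$ export NAMESPACE=openshift-etcd
$ export POD_LABEL='k8s-app=etcd'
$ export DISRUPTION_COUNT=1
$ export EXPECTED_RECOVERY_TIME=120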
Note
Set the NAMESPACE environment variable to openshift-.* to pick and disrupt pods randomly in openshift system namespaces; DAEMON_MODE can also be enabled to disrupt the pods every x seconds in the background to check the reliability.
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:container-scenarios
Demo
You can find a link to a demo of the scenario here
3.12 - Power Outage Scenarios
This scenario shuts down the Kubernetes/OpenShift cluster for the specified duration to simulate power outages, brings it back online, and checks if it’s healthy.
3.12.1 - Power Outage Scenario using Krkn
The power outage / cluster shutdown scenario can be injected by placing the shut_down config file under the cluster_shut_down_scenario option in the kraken config. Refer to the cluster_shut_down_scenario config file.
Refer to cloud setup to configure your cli properly for the cloud provider of the cluster you want to shut down.
Current accepted cloud types:
cluster_shut_down_scenario: # Scenario to stop all the nodes for specified duration and restart the nodes.
runs: 1 # Number of times to execute the cluster_shut_down scenario.
shut_down_duration: 120 # Duration in seconds to shut down the cluster.
cloud_type: aws # Cloud type on which Kubernetes/OpenShift runs.
3.12.2 - Power Outage Scenario using Krkn-Hub
This scenario shuts down the Kubernetes/OpenShift cluster for the specified duration to simulate power outages, brings it back online, and checks if it’s healthy. More information can be found here
Right now, power outage and cluster shutdown are one and the same. We originally created this scenario to stop all the nodes and then start them back up, the way a customer would shut their cluster down and bring it back.
In a real-life chaos scenario, though, this is close to what would happen if the power went out on the AWS side and all of the EC2 nodes were stopped/powered off. We looked into whether the AWS CLI had a way to forcefully power off the nodes (not gracefully), but it does not currently support that, so this scenario is as close as we can get to “pulling the plug”.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos and set the CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:power-outages
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:power-outages
OR
$ docker run -e <VARIABLE>=<value> --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:power-outages
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
Parameter | Description | Default |
---|---|---|
SHUTDOWN_DURATION | Duration in seconds to shut down the cluster | 1200 |
CLOUD_TYPE | Cloud platform on top of which cluster is running, supported cloud platforms | aws |
TIMEOUT | Time in seconds to wait for each node to be stopped or running after the cluster comes back | 600 |
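For example, to simulate a 20-minute outage on an AWS cluster (values are illustrative):
$ export CLOUD_TYPE=aws
$ export SHUTDOWN_DURATION=1200
$ export TIMEOUT=600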
The following environment variables need to be set for the scenarios that require interacting with the cloud platform API to perform the actions:
Amazon Web Services
$ export AWS_ACCESS_KEY_ID=<>
$ export AWS_SECRET_ACCESS_KEY=<>
$ export AWS_DEFAULT_REGION=<>
Google Cloud Platform
TBD
Azure
TBD
OpenStack
TBD
Baremetal
TBD
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:container-scenarios
Demo
You can find a link to a demo of the scenario here
3.13 - PVC Scenario
Scenario to fill up a given PersistentVolumeClaim by creating a temp file on the PVC from a pod associated with it. The purpose of this scenario is to fill up a volume to understand faults caused by the application using this volume.
3.13.1 - PVC Scenario using Krkn
Sample scenario config
pvc_scenario:
pvc_name: <pvc_name> # Name of the target PVC.
pod_name: <pod_name> # Name of the pod where the PVC is mounted. It will be ignored if the pvc_name is defined.
namespace: <namespace_name> # Namespace where the PVC is.
fill_percentage: 50 # Target percentage to fill up the PVC. Value must be higher than the current used percentage. Valid values are between 0 and 99.
duration: 60 # Duration in seconds for the fault.
Steps
- Get the pod name where the PVC is mounted.
- Get the volume name mounted in the container pod.
- Get the container name where the PVC is mounted.
- Get the mount path where the PVC is mounted in the pod.
- Get the PVC capacity and current used capacity.
- Calculate the file size needed to fill the PVC to the target fill_percentage (a worked example follows this list).
- Connect to the pod.
- Create a temp file kraken.tmp with random data on the mount path:
dd bs=1024 count=$file_size </dev/urandom > /mount_path/kraken.tmp
- Wait for the duration time.
- Remove the temp file created:
rm kraken.tmp
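A worked example of the size calculation under illustrative numbers (a 10 GiB PVC with 2 GiB already used and fill_percentage: 50); the exact formula Kraken uses internally may differ slightly:
# capacity and usage in KiB (illustrative values)
capacity_kb=10485760   # 10 GiB PVC
used_kb=2097152        # 2 GiB already used
fill_percentage=50
file_size=$(( capacity_kb * fill_percentage / 100 - used_kb ))   # 3145728 KiB (~3 GiB)
dd bs=1024 count=$file_size </dev/urandom > /mount_path/kraken.tmp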
3.13.2 - PVC Scenario using Krkn-Hub
This scenario fills up a given PersistentVolumeClaim by creating a temp file on the PVC from a pod associated with it. The purpose of this scenario is to fill up a volume to understand faults caused by the application using this volume. For more information refer to the following documentation.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos and set the CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pvc-scenarios
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pvc-scenarios
OR
$ docker run -e <VARIABLE>=<value> --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pvc-scenarios
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
If both PVC_NAME and POD_NAME are defined, the POD_NAME value will be overridden by the Mounted By: value from the PVC definition.
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
Parameter | Description | Default |
---|---|---|
PVC_NAME | Targeted PersistentVolumeClaim in the cluster (if null, POD_NAME is required) | |
POD_NAME | Targeted pod in the cluster (if null, PVC_NAME is required) | |
NAMESPACE | Targeted namespace in the cluster (required) | |
FILL_PERCENTAGE | Targeted percentage to be filled up in the PVC | 50 |
DURATION | Duration in seconds with the PVC filled up | 60 |
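For example, to fill a specific PVC to 50% for one minute (names are placeholders):
$ export PVC_NAME=<pvc_name>
$ export NAMESPACE=<namespace>
$ export FILL_PERCENTAGE=50
$ export DURATION=60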
Note
Set the NAMESPACE environment variable to openshift-.* to pick and disrupt pods randomly in openshift system namespaces; DAEMON_MODE can also be enabled to disrupt the pods every x seconds in the background to check the reliability.
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:pvc-scenarios
3.14 - Service Disruption Scenarios
Using this type of scenario configuration one is able to delete crucial objects in a specific namespace, or a namespace matching a certain regex string.
3.14.1 - Service Disruption Scenarios using Krkn
Configuration Options:
namespace: Specific namespace or regex style namespace of what you want to delete. Gets all namespaces if not specified; set to "" if you want to use the label_selector field.
Set to ‘^.*$’ and label_selector to "" to randomly select any namespace in your cluster.
label_selector: Label on the namespace you want to delete. Set to "" if you are using the namespace variable.
delete_count: Number of namespaces to kill in each run. Based on matching namespace and label specified, default is 1.
runs: Number of runs/iterations to kill namespaces, default is 1.
sleep: Number of seconds to wait between each iteration/count of killing namespaces. Defaults to 10 seconds if not set
Refer to namespace_scenarios_example config file.
scenarios:
- namespace: "^.*$"
runs: 1
- namespace: "^.*ingress.*$"
runs: 1
sleep: 15
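A hypothetical variant that selects the namespace by label instead of by name (the label value is an assumption for illustration; leave namespace empty when using label_selector, per the configuration options above):
scenarios:
- namespace: ""
  label_selector: "chaos-target=true"
  runs: 1
  sleep: 15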
Steps
This scenario will select one or more namespaces, depending on the configuration, kill all of the below object types in those namespaces, and wait for them to be Running again in the post action:
- Services
- Daemonsets
- Statefulsets
- Replicasets
- Deployments
Post Action
We do a post-chaos check to wait and verify that the specific objects in each namespace are Ready.
There are two options here:
- Pass a custom script in the main config scenario list that will run before the chaos and verify that its output matches post chaos.
See scenarios/post_action_namespace.py for an example
- namespace_scenarios:
- - scenarios/regex_namespace.yaml
- scenarios/post_action_namespace.py
- Allow kraken to wait and check all killed objects in the namespaces become ‘Running’ again. Kraken keeps a list of the specific objects in namespaces that were killed to verify all that were affected recover properly.
wait_time: <seconds to wait for namespace to recover>
3.14.2 - Service Disruption Scenario using Krkn-Hub
This scenario deletes main objects within a namespace in your Kubernetes/OpenShift cluster. More information can be found here.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos and set the CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:service-disruption-scenarios
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:service-disruption-scenarios
OR
$ docker run -e <VARIABLE>=<value> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:service-disruption-scenarios
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
Parameter | Description | Default |
---|---|---|
LABEL_SELECTOR | Label of the namespace to target. Set this parameter only if NAMESPACE is not set | "" |
NAMESPACE | Name of the namespace you want to target. Set this parameter only if LABEL_SELECTOR is not set | “openshift-etcd” |
SLEEP | Number of seconds to wait before polling to see if namespace exists again | 15 |
DELETE_COUNT | Number of namespaces to kill in each run, based on matching namespace and label specified | 1 |
RUNS | Number of runs to execute the action | 1 |
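For example, to disrupt the openshift-etcd namespace once (values are illustrative):
$ export NAMESPACE=openshift-etcd
$ export DELETE_COUNT=1
$ export RUNS=1
$ export SLEEP=15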
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:service-disruption-scenarios
Demo
You can find a link to a demo of the scenario here
3.15 - Service Hijacking Scenario
Service Hijacking Scenarios aim to simulate fake HTTP responses from a workload targeted by a Service already deployed in the cluster. This scenario is executed by deploying a custom-made web service and modifying the target Service selector to direct traffic to this web service for a specified duration.
3.15.1 - Service Hijacking Scenarios using Krkn
The web service’s source code is available here. It employs a time-based test plan from the scenario configuration file, which specifies the behavior of resources during the chaos scenario as follows:
service_target_port: http-web-svc # The port of the service to be hijacked (can be named or numeric, based on the workload and service configuration).
service_name: nginx-service # The name of the service that will be hijacked.
service_namespace: default # The namespace where the target service is located.
image: quay.io/krkn-chaos/krkn-service-hijacking:v0.1.3 # Image of the krkn web service to be deployed to receive traffic.
chaos_duration: 30 # Total duration of the chaos scenario in seconds.
plan:
- resource: "/list/index.php" # Specifies the resource or path to respond to in the scenario. For paths, both the path and query parameters are captured but ignored. For resources, only query parameters are captured.
steps: # A time-based plan consisting of steps can be defined for each resource.
GET: # One or more HTTP methods can be specified for each step. Note: Non-standard methods are supported for fully custom web services (e.g., using NONEXISTENT instead of POST).
- duration: 15 # Duration in seconds for this step before moving to the next one, if defined. Otherwise, this step will continue until the chaos scenario ends.
status: 500 # HTTP status code to be returned in this step.
mime_type: "application/json" # MIME type of the response for this step.
payload: | # The response payload for this step.
{
"status":"internal server error"
}
- duration: 15
status: 201
mime_type: "application/json"
payload: |
{
"status":"resource created"
}
POST:
- duration: 15
status: 401
mime_type: "application/json"
payload: |
{
"status": "unauthorized"
}
- duration: 15
status: 404
mime_type: "text/plain"
payload: "not found"
The scenario will focus on the service_name within the service_namespace, substituting the selector with a randomly generated one, which is added as a label in the mock service manifest. This allows multiple scenarios to be executed in the same namespace, each targeting different services without causing conflicts.
The newly deployed mock web service will expose a service_target_port, which can be either a named or numeric port based on the service configuration. This ensures that the Service correctly routes HTTP traffic to the mock web service during the chaos run.
Each step will last for duration seconds from the deployment of the mock web service in the cluster. For each HTTP resource, defined as a top-level YAML property of the plan (it could be a specific resource, e.g., /list/index.php, or a path-based resource typical in MVC frameworks), one or more HTTP request methods can be specified. Both standard and custom request methods are supported.
During this time frame, the web service will respond with:
- status: The HTTP status code (can be standard or custom).
- mime_type: The MIME type (can be standard or custom).
- payload: The response body to be returned to the client.
At the end of the step duration, the web service will proceed to the next step (if available) until the global chaos_duration concludes. At this point, the original service will be restored, and the custom web service and its resources will be undeployed.
NOTE: Some clients (e.g., cURL, jQuery) may optimize queries using lightweight methods (like HEAD or OPTIONS) to probe API behavior. If these methods are not defined in the test plan, the web service may respond with a 405 or 404 status code. If you encounter unexpected behavior, consider this use case.
3.15.2 - Service Hijacking Scenario using Krkn-Hub
This scenario reroutes traffic intended for a target service to a custom web service that is automatically deployed by Krkn. This web service responds with user-defined HTTP statuses, MIME types, and bodies. For more details, please refer to the following documentation.
Run
Unlike other krkn-hub scenarios, this one requires a specific configuration due to its unique structure. You must set up the scenario in a local file following the scenario syntax, and then pass this file’s base64-encoded content to the container via the SCENARIO_BASE64 variable.
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos and set the CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> \
-e SCENARIO_BASE64="$(base64 -w0 <scenario_file>)" \
-v <path_to_kubeconfig>:/home/krkn/.kube/config:Z quay.io/krkn-chaos/krkn-hub:service-hijacking
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
$ export SCENARIO_BASE64="$(base64 -w0 <scenario_file>)"
$ docker run $(./get_docker_params.sh) --name=<container_name> \
--net=host \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d quay.io/krkn-chaos/krkn-hub:service-hijacking
OR
$ docker run --name=<container_name> -e SCENARIO_BASE64="$(base64 -w0 <scenario_file>)" \
--net=host \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d quay.io/krkn-chaos/krkn-hub:service-hijacking
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
Parameter | Description |
---|---|
SCENARIO_BASE64 | Base64 encoded service-hijacking scenario file. Note that the -w0 option in the command substitution SCENARIO_BASE64="$(base64 -w0 <scenario_file>)" is mandatory in order to remove line breaks from the base64 command output |
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
$ podman run -e SCENARIO_BASE64="$(base64 -w0 <scenario_file>)" \
--name=<container_name> \
--net=host \
--env-host=true \
-v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml \
-v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d quay.io/krkn-chaos/krkn-hub:service-hijacking
3.16 - Time Scenarios
Using this type of scenario configuration, one is able to change the time and/or date of the system for pods or nodes.
3.16.1 - Time Scenarios using Krkn
Configuration Options:
action: skew_time or skew_date.
object_type: pod or node.
namespace: namespace of the pods you want to skew. Needs to be set if setting a specific pod name.
label_selector: Label on the nodes or pods you want to skew.
container_name: Container name in pod you want to reset time on. If left blank it will randomly select one.
object_name: List of the names of pods or nodes you want to skew.
Refer to time_scenarios_example config file.
time_scenarios:
- action: skew_time
object_type: pod
object_name:
- apiserver-868595fcbb-6qnsc
- apiserver-868595fcbb-mb9j5
namespace: openshift-apiserver
container_name: openshift-apiserver
- action: skew_date
object_type: node
label_selector: node-role.kubernetes.io/worker
3.16.2 - Time Skew Scenarios using Krkn-Hub
This scenario skews the date and time of the nodes and pods matching the label on a Kubernetes/OpenShift cluster. More information can be found here.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos and set the CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:time-scenarios
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:time-scenarios
OR
$ docker run -e <VARIABLE>=<value> --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:time-scenarios
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
Parameter | Description | Default |
---|---|---|
OBJECT_TYPE | Object to target. Supported options: pod, node | pod |
LABEL_SELECTOR | Label of the container(s) or nodes to target | k8s-app=etcd |
ACTION | Action to run. Supported actions: skew_time, skew_date | skew_date |
OBJECT_NAME | List of the names of pods or nodes you want to skew ( optional parameter ) | [] |
CONTAINER_NAME | Container in the specified pod to target in case the pod has multiple containers running. Random container is picked if empty | "" |
NAMESPACE | Namespace of the pods you want to skew; needs to be set only if setting a specific pod name | "" |
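For example, to skew the time on worker nodes (values are illustrative):
$ export OBJECT_TYPE=node
$ export ACTION=skew_time
$ export LABEL_SELECTOR=node-role.kubernetes.io/worker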
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:container-scenarios
Demo
You can find a link to a demo of the scenario here
3.17 - Zone Outage Scenarios
Scenario to create an outage in a targeted zone in the public cloud to understand the impact on both the Kubernetes/OpenShift control plane and the applications running on the worker nodes in that zone. It tweaks the network ACL of the zone to simulate the failure, which in turn stops both ingress and egress traffic from all the nodes in that zone for the specified duration, and then reverts the ACL back to its previous state.
3.17.1 - Zone Outage Scenarios using Krkn
Zone outage can be injected by placing the zone_outage config file under zone_outages option in the kraken config. Refer to zone_outage_scenario config file for the parameters that need to be defined.
Refer to cloud setup to configure your cli properly for the cloud provider of the cluster you want to shut down.
Current accepted cloud types:
Sample scenario config
zone_outage: # Scenario to create an outage of a zone by tweaking network ACL.
cloud_type: aws # Cloud type on which Kubernetes/OpenShift runs. aws is the only platform supported currently for this scenario.
duration: 600 # Duration in seconds after which the zone will be back online.
vpc_id: # Cluster virtual private network to target.
subnet_id: [subnet1, subnet2] # List of subnet-id's to deny both ingress and egress traffic.
Note
vpc_id and subnet_id can be obtained from the cloud web console by selecting one of the instances in the targeted zone (us-west-2a for example).
Note
Multiple zones will experience downtime when targeting multiple subnets, which might have an impact on cluster health, especially if the zones have control plane components deployed.
Debugging steps in case of failures
In case of failures during the steps which revert the network ACL to allow traffic and bring back the cluster nodes in the zone, the nodes in that zone will be in the NotReady condition. Here is how to fix it:
- OpenShift by default deploys the nodes in different zones for fault tolerance, for example us-west-2a, us-west-2b, us-west-2c. The cluster is associated with a virtual private network and each zone has its own subnet with a network acl which defines the ingress and egress traffic rules at the zone level unlike security groups which are at an instance level.
- From the cloud web console, select one of the instances in the zone which is down and go to the subnet_id specified in the config.
- Look at the network acl associated with the subnet and you will see both ingress and egress traffic being denied which is expected as Kraken deliberately injects it.
- Kraken just switches the network acl while still keeping the original or default network acl around; switching to the default network acl from the drop-down menu will bring the nodes in the targeted zone back into the Ready state.
3.17.2 - Zone Outage Scenarios using Krkn-Hub
This scenario disrupts a targeted zone in the public cloud by blocking egress and ingress traffic to understand the impact on both the Kubernetes/OpenShift control plane as well as the applications running on the worker nodes in that zone. More information is documented here
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos and set the CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:zone-outages
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:zone-outages
OR
$ docker run -e <VARIABLE>=<value> --name=<container_name> --net=host -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:zone-outages
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host -v ~/kubeconfig:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
Parameter | Description | Default |
---|---|---|
CLOUD_TYPE | Cloud platform on top of which cluster is running, supported cloud platforms | aws |
DURATION | Duration in seconds after which the zone will be back online | 600 |
VPC_ID | cluster virtual private network to target ( REQUIRED ) | "" |
SUBNET_ID | subnet-id to deny both ingress and egress traffic ( REQUIRED ). Format: [subnet1, subnet2] | "" |
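For example, to take a zone's subnets offline for 10 minutes on AWS (the IDs are placeholders):
$ export CLOUD_TYPE=aws
$ export DURATION=600
$ export VPC_ID=<vpc_id>
$ export SUBNET_ID='[subnet1, subnet2]'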
The following environment variables need to be set for the scenarios that require interacting with the cloud platform API to perform the actions:
Amazon Web Services
$ export AWS_ACCESS_KEY_ID=<>
$ export AWS_SECRET_ACCESS_KEY=<>
$ export AWS_DEFAULT_REGION=<>
Google Cloud Platform
TBD
Azure
TBD
OpenStack
TBD
Baremetal
TBD
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
$ podman run --name=<container_name> --net=host --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:container-scenarios
Demo
You can find a link to a demo of the scenario here
3.18 - All Scenarios Variables
These variables are to be used for the top-level configuration template that is shared by all the scenarios.
See the description and default values below
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
Parameter | Description | Default |
---|---|---|
CERBERUS_ENABLED | Set this to true if cerberus is running and monitoring the cluster | False |
CERBERUS_URL | URL to poll for the go/no-go signal | http://0.0.0.0:8080 |
WAIT_DURATION | Duration in seconds to wait between each chaos scenario | 60 |
ITERATIONS | Number of times to execute the scenarios | 1 |
DAEMON_MODE | Iterations are set to infinity which means that the kraken will cause chaos forever | False |
PUBLISH_KRAKEN_STATUS | Whether to publish the kraken status at the signal address/port | True |
SIGNAL_ADDRESS | Address to print kraken status to | 0.0.0.0 |
PORT | Port to print kraken status to | 8081 |
SIGNAL_STATE | Waits for the RUN signal when set to PAUSE before running the scenarios, refer docs for more details | RUN |
DEPLOY_DASHBOARDS | Deploys mutable grafana loaded with dashboards visualizing performance metrics pulled from in-cluster prometheus. The dashboard will be exposed as a route. | False |
CAPTURE_METRICS | Captures metrics as specified in the profile from in-cluster prometheus. Default metrics captures are listed here | False |
ENABLE_ALERTS | Evaluates expressions from in-cluster prometheus and exits 0 or 1 based on the severity set. Default profile. More details can be found here | False |
ALERTS_PATH | Path to the alerts file to use when ENABLE_ALERTS is set | config/alerts |
CHECK_CRITICAL_ALERTS | When enabled will check prometheus for critical alerts firing post chaos | False |
TELEMETRY_ENABLED | Enable/disables the telemetry collection feature | False |
TELEMETRY_API_URL | telemetry service endpoint | https://ulnmf9xv7j.execute-api.us-west-2.amazonaws.com/production |
TELEMETRY_USERNAME | telemetry service username | redhat-chaos |
TELEMETRY_PASSWORD | telemetry service password | No default |
TELEMETRY_PROMETHEUS_BACKUP | enables/disables prometheus data collection | True |
TELEMTRY_FULL_PROMETHEUS_BACKUP | if set to False, only the /prometheus/wal folder will be downloaded | False |
TELEMETRY_BACKUP_THREADS | number of telemetry download/upload threads | 5 |
TELEMETRY_ARCHIVE_PATH | local path where the archive files will be temporarily stored | /tmp |
TELEMETRY_MAX_RETRIES | maximum number of upload retries (if 0 will retry forever) | 0 |
TELEMETRY_RUN_TAG | if set, this will be appended to the run folder in the bucket (useful to group the runs) | chaos |
TELEMETRY_GROUP | if set will archive the telemetry in the S3 bucket on a folder named after the value | default |
TELEMETRY_ARCHIVE_SIZE | the size of each prometheus data archive chunk in KB; the lower the size, the higher the number of archive files produced (see the note below) | 1000 |
TELEMETRY_LOGS_BACKUP | Logs backup to s3 | False |
TELEMETRY_FILTER_PATTER | Filter logs based on certain time stamp patterns | ["(\w{3}\s\d{1,2}\s\d{2}:\d{2}:\d{2}\.\d+).+",“kinit (\d+/\d+/\d+\s\d{2}:\d{2}:\d{2})\s+”,"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z).+"] |
TELEMETRY_CLI_PATH | OC CLI path; if not specified it will be searched for in $PATH | blank |
ELASTIC_SERVER | Enables tracking telemetry data in Elasticsearch; this is the URL of the Elasticsearch data storage | blank |
ELASTIC_INDEX | Elastic search index pattern to post results to | blank |
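For example (values are illustrative), the common settings shared by all scenarios can be exported the same way before starting any scenario container:
$ export CERBERUS_ENABLED=True
$ export CERBERUS_URL=http://0.0.0.0:8080
$ export WAIT_DURATION=60
$ export ITERATIONS=1
$ export ENABLE_ALERTS=True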
Note
The lower the TELEMETRY_ARCHIVE_SIZE, the higher the number of archive files that will be produced and uploaded (and processed by the backup threads simultaneously). For unstable/slow connections it is better to keep this value low and increase the number of backup_threads; this way, on upload failure, the retry will happen only on the failed chunk without affecting the whole upload.
3.19 - Supported Cloud Providers
AWS
NOTE: For clusters with AWS make sure AWS CLI is installed and properly configured using an AWS account
GCP
NOTE: For clusters with GCP make sure GCP CLI is installed.
A google service account is required to give proper authentication to GCP for node actions. See here for how to create a service account.
NOTE: A user with ‘resourcemanager.projects.setIamPolicy’ permission is required to grant project-level permissions to the service account.
After creating the service account you will need to enable the account using the following: export GOOGLE_APPLICATION_CREDENTIALS="<serviceaccount.json>"
Openstack
NOTE: For clusters with Openstack Cloud, ensure to create and source the OPENSTACK RC file to set the OPENSTACK environment variables from the server where Kraken runs.
Azure
NOTE: You will need to create a service principal and give it the correct access; see here for creating the service principal and setting the proper permissions.
To run properly, the service principal requires the “Azure Active Directory Graph/Application.ReadWrite.OwnedBy” API permission and the “User Access Administrator” role.
Before running you will need to set the following:
export AZURE_SUBSCRIPTION_ID=<subscription_id>
export AZURE_TENANT_ID=<tenant_id>
export AZURE_CLIENT_SECRET=<client secret>
export AZURE_CLIENT_ID=<client id>
Alibaba
See the Installation guide to install alicloud cli.
export ALIBABA_ID=<access_key_id>
export ALIBABA_SECRET=<access key secret>
export ALIBABA_REGION_ID=<region id>
Refer to the region and zone page to get the region id for the region you are running on.
Set cloud_type to either alibaba or alicloud in your node scenario yaml file.
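As an illustration, a node scenario entry with the cloud type set for Alibaba might look like the following minimal sketch; the exact actions and fields supported depend on your Krkn version, so check the node scenario documentation before using it:
node_scenarios:
  - actions:                                        # node chaos actions to inject
      - node_stop_start_scenario
    label_selector: node-role.kubernetes.io/worker  # pick target nodes by label
    instance_count: 1                               # number of nodes to act on
    timeout: 360                                    # seconds to wait for the node to recover
    cloud_type: alibaba                             # or alicloud, as noted above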
VMware
Set the following environment variables
export VSPHERE_IP=<vSphere_client_IP_address>
export VSPHERE_USERNAME=<vSphere_client_username>
export VSPHERE_PASSWORD=<vSphere_client_password>
These are the credentials that you would normally use to access the vSphere client.
IBMCloud
If you do not have an API key set up with the proper VPC resource permissions, create the following:
- Access group
- Service id with the following access
- With policy VPC Infrastructure Services
- Resources = All
- Roles:
- Editor
- Administrator
- Operator
- Viewer
- API Key
Set the following environment variables
export IBMC_URL=https://<region>.iaas.cloud.ibm.com/v1
export IBMC_APIKEY=<ibmcloud_api_key>
4 - Chaos Testing Guide
Table of Contents
- Test Strategies and Methodology
- Best Practices
- Tooling
- Scenarios
- Test Environment Recommendations - how and where to run chaos tests
- Chaos testing in Practice
Test Strategies and Methodology
Failures in production are costly. To help mitigate risk to service health, consider the following strategies and approaches to service testing:
Be proactive vs reactive. We have different types of test suites in place - unit, integration and end-to-end - that help expose bugs in code in a controlled environment. Through implementation of a chaos engineering strategy, we can discover potential causes of service degradation. We need to understand the systems’ behavior under unpredictable conditions in order to find the areas to harden, and use performance data points to size the clusters to handle failures in order to keep downtime to a minimum.
Test the resiliency of a system under turbulent conditions by running tests that are designed to disrupt while monitoring the system's adaptability and performance:
- Establish and define your steady state and metrics - understand the behavior and performance under stable conditions and define the metrics that will be used to evaluate the system’s behavior. Then decide on acceptable outcomes before injecting chaos.
- Analyze the statuses and metrics of all components during the chaos test runs.
- Improve the areas that are not resilient and performant by comparing the key metrics and Service Level Objectives (SLOs) to the stable conditions before the chaos. For example: evaluating the API server latency or application uptime to see if the key performance indicators and service level indicators are still within acceptable limits.
Best Practices
Now that we understand the test methodology, let us take a look at the best practices for a Kubernetes cluster. On that platform there are user applications and cluster workloads that need to be designed for stability and to provide the best user experience possible:
Alerts with appropriate severity should get fired.
- Alerts are key to identify when a component starts degrading, and can help focus the investigation effort on affected system components.
- Alerts should have proper severity, description, notification policy, escalation policy, and SOP in order to reduce MTTR for responding SRE or Ops resources.
- Detailed information on the alerts consistency can be found here.
Minimal performance impact - Network, CPU, Memory, Disk, Throughput etc.
- The system, as well as the applications, should be designed to have minimal performance impact during disruptions to ensure stability and also to avoid hogging resources that other applications can use. We want to look at this in terms of CPU, Memory, Disk, Throughput, Network etc.
Appropriate CPU/Memory limits set to avoid performance throttling and OOM kills.
- There might be rogue applications hogging resources ( CPU/Memory ) on the nodes which might lead to applications underperforming or worse getting OOM killed. It is important to ensure that applications and system components have reserved resources for the kube-scheduler to take into consideration in order to keep them performing at the expected levels.
Services dependent on the system under test need to handle the failure gracefully to avoid performance degradation and downtime - appropriate timeouts.
- In a distributed system, services deployed coordinate with each other and might have external dependencies. Each of the services deployed as a deployment, pod, or container, need to handle the downtime of other dependent services gracefully instead of crashing due to not having appropriate timeouts, fallback logic etc.
Proper node sizing to avoid cascading failures and ensure cluster stability especially when the cluster is large and dense
- The platform needs to be sized taking into account the resource usage spikes that might occur during chaotic events. For example, if one of the main nodes goes down, the other two main nodes need to have enough resources to handle the load. The resource usage depends on the load, i.e. the number of objects running on the cluster and being managed by the Control Plane ( API Server, Etcd, Controller and Scheduler ). As such, it's critical to test such conditions, understand the behavior, and leverage the data to size the platform appropriately. This can help keep the applications stable during unplanned events without the control plane undergoing cascading failures which can potentially bring down the entire cluster.
Proper node sizing to avoid application failures and maintain stability.
- An application pod might use more resources during reinitialization after a crash, so it is important to take that into account for sizing the nodes in the cluster to accommodate it. For example, monitoring solutions like Prometheus need high amounts of memory to replay the write ahead log ( WAL ) when it restarts. As such, it’s critical to test such conditions, understand the behavior, and leverage the data to size the platform appropriately. This can help keep the application stable during unplanned events without undergoing degradation in performance or even worse hog the resources on the node which can impact other applications and system pods.
Minimal initialization time and fast recovery logic.
- The controller watching the component should recognize a failure as soon as possible. The component needs to have minimal initialization time to avoid extended downtime or overloading the replicas if it is a highly available configuration. The cause of failure can be because of issues with the infrastructure on top of which it is running, application failures, or because of service failures that it depends on.
High Availability deployment strategy.
- There should be multiple replicas ( both Kubernetes and application control planes ) running preferably in different availability zones to survive outages while still serving the user/system requests. Avoid single points of failure.
Backed by persistent storage
- It is important to have the system/application backed by persistent storage. This is especially important in cases where the application is a database or a stateful application given that a node, pod, or container failure will wipe off the data.
There should be fallback routes to the backend in case of using CDN, for example, Akamai in case of console.redhat.com - a managed service deployed on top of Kubernetes dedicated:
- Content delivery networks (CDNs) are commonly used to host resources such as images, JavaScript files, and CSS. The average web page is nearly 2 MB in size, and offloading heavy resources to third-parties is extremely effective for reducing backend server traffic and latency. However, this makes each CDN an additional point of failure for every site that relies on it. If the CDN fails, its customers could also fail.
- To test how the application reacts to failures, drop all network traffic between the system and CDN. The application should still serve the content to the user irrespective of the failure.
Appropriate caching and Content Delivery Network should be enabled to be performant and usable when there is a latency on the client side.
- Not every user or machine has access to unlimited bandwidth, there might be a delay on the user side ( client ) to access the API’s due to limited bandwidth, throttling or latency depending on the geographic location. It is important to inject latency between the client and API calls to understand the behavior and optimize things including caching wherever possible, using CDN’s or opting for different protocols like HTTP/2 or HTTP/3 vs HTTP.
Tooling
Now that we have looked at the best practices, this section goes through how Kraken - a chaos testing framework - can help test the resilience of Kubernetes and make sure the applications and services follow them.
Cluster recovery checks, metrics evaluation and pass/fail criteria
Most of the scenarios have built-in checks to verify whether the targeted component recovered from the failure within the specified duration, but a failure in one component can also impact other components, so it is extremely important to make sure that the system/application is healthy as a whole post chaos. This is exactly where Cerberus comes to the rescue. When the Cerberus monitoring tool is enabled, its health signal is consumed to decide whether to continue running chaos.
Apart from checking the recovery and cluster health status, it's equally important to evaluate performance metrics like latency, resource usage spikes, throughput, and etcd health such as disk fsync and leader elections. To help with this, Kraken has a way to evaluate PromQL expressions from the in-cluster prometheus and set the exit status to 0 or 1 based on the severity set for each of the queries. Details on how to use this feature can be found here.
The overall pass or fail of kraken is based on the recovery of the specific component (within a certain amount of time), the cerberus health signal which tracks the health of the entire cluster, and the metrics evaluation from the in-cluster prometheus.
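For illustration, an entry in the alerts file pairs a PromQL expression with a description and a severity. This is a minimal sketch assuming the expr/description/severity layout used by recent Krkn releases; check the linked documentation for the exact schema and which severities fail the run:
- expr: avg_over_time(histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))[5m:]) > 0.01  # PromQL evaluated against the in-cluster prometheus
  description: 5 minutes avg. etcd fsync latency on {{$labels.pod}} higher than 10ms {{$value}}
  severity: warning  # the severity determines how a firing expression affects the exit status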
Scenarios
Let us take a look at how to run the chaos scenarios on your Kubernetes clusters using Kraken-hub - a lightweight wrapper around Kraken that eases the runs by letting you launch them as container images using podman, with parameters set as environment variables. This eliminates the need to carry around and edit configuration files and makes it easy for any CI framework integration. Here are the scenarios supported:
Pod Scenarios (Documentation)
- Disrupts Kubernetes and applications deployed as pods:
- Helps understand the availability of the application, the initialization timing and recovery status.
- Demo
Container Scenarios (Documentation)
- Disrupts Kubernetes and applications deployed as containers running as part of a pod(s) using a specified kill signal to mimic failures:
- Helps understand the impact and recovery timing when the programs/processes running in the containers are disrupted - hung, paused, killed etc. - using various kill signals, i.e. SIGHUP, SIGTERM, SIGKILL etc.
- Demo
Node Scenarios (Documentation)
- Disrupts nodes as part of the cluster infrastructure by talking to the cloud API. AWS, Azure, GCP, OpenStack and Baremetal are the supported platforms as of now. Possible disruptions include:
- Terminate nodes
- Fork bomb inside the node
- Stop the node
- Crash the kubelet running on the node
- etc.
- Demo
Zone Outages (Documentation)
- Creates an outage of availability zone(s) in a targeted region of the public cloud where the Kubernetes cluster is running by tweaking the network ACL of the zone to simulate the failure; this stops both ingress and egress traffic from all nodes in that zone for the specified duration and then reverts the ACL back to its previous state.
- Helps understand the impact on both the Kubernetes control plane as well as applications and services running on the worker nodes in that zone.
- Currently, only set up for the AWS cloud platform: 1 VPC and multiple subnets within the VPC can be specified.
- Demo
Application Outages (Documentation)
- Scenario to block the traffic ( Ingress/Egress ) of an application matching the labels for the specified duration of time to understand the behavior of the service/other services which depend on it during the downtime.
- Helps understand how the dependent services react to the unavailability.
- Demo
Power Outages (Documentation)
- This scenario imitates a power outage by shutting down the entire cluster for a specified duration of time, then restarts all the nodes after the specified time and checks the health of the cluster.
- There are various use cases in the customer environments. For example, when some of the clusters are shutdown in cases where the applications are not needed to run in a particular time/season in order to save costs.
- The nodes are stopped in parallel to mimic a power outage i.e., pulling off the plug
- Demo
Resource Hog (Documentation)
- Hogs CPU, Memory and IO on the targeted nodes
- Helps understand if the application/system components have reserved resources to not get disrupted because of rogue applications, or get performance throttled.
- CPU Hog (Documentation, Demo)
- Memory Hog (Documentation, Demo)
Time Skewing (Documentation)
- Manipulate the system time and/or date of specific pods/nodes.
- Verify scheduling of objects so they continue to work.
- Verify time gets reset properly.
Namespace Failures (Documentation)
- Delete namespaces for the specified duration.
- Helps understand the impact on other components and tests/improves recovery time of the components in the targeted namespace.
Persistent Volume Fill (Documentation)
- Fills up the persistent volumes, up to a given percentage, used by the pod for the specified duration.
- Helps understand how an application deals when it is no longer able to write data to the disk. For example, kafka’s behavior when it is not able to commit data to the disk.
Network Chaos (Documentation)
- Supported scenarios include:
- Network latency
- Packet loss
- Interface flapping
- DNS errors
- Packet corruption
- Bandwidth limitation
Pod Network Scenario (Documentation)
- Scenario to block the traffic ( Ingress/Egress ) of a pod matching the labels for the specified duration of time to understand the behavior of the service/other services which depend on it during downtime. This helps with planning the requirements accordingly, be it improving the timeouts or tweaking the alerts etc.
- With the current network policies, it is not possible to explicitly block ports which are enabled by allowed network policy rule. This chaos scenario addresses this issue by using OVS flow rules to block ports related to the pod. It supports OpenShiftSDN and OVNKubernetes based networks.
Service Disruption Scenarios (Documentation)
- Using this type of scenario configuration one is able to delete crucial objects in a specific namespace, or a namespace matching a certain regex string.
Service Hijacking Scenarios (Documentation)
- Service Hijacking Scenarios aim to simulate fake HTTP responses from a workload targeted by a Service already deployed in the cluster. This scenario is executed by deploying a custom-made web service and modifying the target Service selector to direct traffic to this web service for a specified duration.
Test Environment Recommendations - how and where to run chaos tests
Let us take a look at a few recommendations on how and where to run the chaos tests:
Run the chaos tests continuously in your test pipelines:
- Software, systems, and infrastructure do change, and the condition/health of each can change pretty rapidly. A good place to run tests is in your CI/CD pipeline running on a regular cadence.
Run the chaos tests manually to learn from the system:
- When running a Chaos scenario or Fault tests, it is more important to understand how the system responds and reacts, rather than mark the execution as pass or fail.
- It is important to define the scope of the test before the execution to avoid some issues from masking others.
Run the chaos tests in production environments or mimic the load in staging environments:
- As scary as the thought of testing in production is, production is the environment that users are in and traffic spikes/load are real. To fully test the robustness/resilience of a production system, running Chaos Engineering experiments in a production environment will provide needed insights. A couple of things to keep in mind:
- Minimize blast radius and have a backup plan in place to make sure the users and customers do not undergo downtime.
- Mimic the load in a staging environment in case Service Level Agreements are too tight to cover any downtime.
Enable Observability:
- Chaos Engineering Without Observability … Is Just Chaos.
- Make sure to have logging and monitoring installed on the cluster to help with understanding the behaviour and why it is happening. When running the tests in CI, where it is not humanly possible to monitor the cluster all the time, it is recommended to leverage Cerberus to capture the state during the runs and the metrics collection in Kraken to store metrics long term even after the cluster is gone.
- Kraken ships with dashboards that will help understand API, Etcd and Kubernetes cluster level stats and performance metrics.
- Pay attention to Prometheus alerts. Check if they are firing as expected.
Run multiple chaos tests at once to mimic the production outages:
- For example, hogging both IO and Network at the same time instead of running them separately to observe the impact.
- You might have existing test cases, be it related to Performance, Scalability or QE. Run the chaos in the background during the test runs to observe the impact. Signaling feature in Kraken can help with coordinating the chaos runs i.e., start, stop, pause the scenarios based on the state of the other test jobs.
Chaos testing in Practice
OpenShift organization
Within the OpenShift organization we use kraken to perform chaos testing throughout a release before the code is available to customers.
1. We execute kraken during our regression test suite.
i. We cover each of the chaos scenarios across different clouds.
a. Our testing is predominantly done on AWS, Azure and GCP.
2. We run the chaos scenarios during a long running reliability test.
i. During this test we perform different types of tasks by different users on the cluster.
ii. We have added the execution of kraken to perform at certain times throughout the long running test and monitor the health of the cluster.
iii. This test can be seen here: https://github.com/openshift/svt/tree/master/reliability-v2
3. We are starting to add in test cases that perform chaos testing during an upgrade (not many iterations of this have been completed).
startx-lab
NOTE: Requests for enhancements and any issues need to be filed at the mentioned links given that they are not natively supported in Kraken.
The following content covers the implementation details around how Startx is leveraging Kraken:
- Using kraken as part of a tekton pipeline
You can find the kraken-scenario tekton-task on artifacthub.io, which can be used to start a kraken chaos scenario as part of a chaos pipeline.
To use this task, you must have:
- OpenShift Pipelines enabled (or the Tekton CRDs loaded for Kubernetes clusters)
- 1 Secret named kraken-aws-creds for scenarios using aws
- 1 ConfigMap named kraken-kubeconfig with credentials to the targeted cluster
- 1 ConfigMap named kraken-config-example with the kraken configuration file (config.yaml)
- 1 ConfigMap named kraken-common-example with all kraken related files
- The pipeline SA authorized to run with the privileged SCC
You can create these resources using the following sequence:
oc project default
oc adm policy add-scc-to-user privileged -z pipeline
oc apply -f https://github.com/startxfr/tekton-catalog/raw/stable/task/kraken-scenario/0.1/samples/common.yaml
Then you must change the content of the kraken-aws-creds secret, and of the kraken-kubeconfig and kraken-config-example ConfigMaps, to reflect your cluster configuration. Refer to the kraken configuration and configuration examples for details on how to configure these resources.
- Start as a single taskrun
oc apply -f https://github.com/startxfr/tekton-catalog/raw/stable/task/kraken-scenario/0.1/samples/taskrun.yaml
- Start as a pipelinerun
oc apply -f https://github.com/startxfr/tekton-catalog/raw/stable/task/kraken-scenario/0.1/samples/pipelinerun.yaml
- Deploying kraken using a helm-chart
You can find the chaos-kraken helm-chart on artifacthub.io, which can be used to deploy kraken chaos scenarios.
The default configuration creates the following resources:
- 1 project named chaos-kraken
- 1 scc with privileged context for kraken deployment
- 1 configmap with 21 generic kraken scenarios, various scripts and configuration
- 1 configmap with kubeconfig of the targeted cluster
- 1 job named kraken-test-xxx
- 1 service to the kraken pods
- 1 route to the kraken service
# Install the startx helm repository
helm repo add startx https://startxfr.github.io/helm-repository/packages/
# Install the kraken project
helm install --set project.enabled=true chaos-kraken-project startx/chaos-kraken
# Deploy the kraken instance
helm install \
--set kraken.enabled=true \
--set kraken.aws.credentials.region="eu-west-3" \
--set kraken.aws.credentials.key_id="AKIAXXXXXXXXXXXXXXXX" \
--set kraken.aws.credentials.secret="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" \
--set kraken.kubeconfig.token.server="https://api.mycluster:6443" \
--set kraken.kubeconfig.token.token="sha256~XXXXXXXXXX_PUT_YOUR_TOKEN_HERE_XXXXXXXXXXXX" \
-n chaos-kraken \
chaos-kraken-instance startx/chaos-kraken
5 - Cerberus
Cerberus
Guardian of Kubernetes and OpenShift Clusters
Cerberus watches the Kubernetes/OpenShift clusters for dead nodes and system component failures/health and exposes a go or no-go signal which can be consumed by other workload generators or applications in the cluster so they can act accordingly.
Workflow
Installation
Instructions on how to setup, configure and run Cerberus can be found at Installation.
What Kubernetes/OpenShift components can Cerberus monitor?
The following are the components of Kubernetes/OpenShift that Cerberus can monitor today; we will be adding more soon.
Component | Description | Working |
---|---|---|
Nodes | Watches all the nodes including masters, workers as well as nodes created using custom MachineSets | :heavy_check_mark: |
Namespaces | Watches all the pods including containers running inside the pods in the namespaces specified in the config | :heavy_check_mark: |
Cluster Operators | Watches all Cluster Operators | :heavy_check_mark: |
Masters Schedulability | Watches and warns if master nodes are marked as schedulable | :heavy_check_mark: |
Routes | Watches specified routes | :heavy_check_mark: |
CSRs | Warns if any CSRs are not approved | :heavy_check_mark: |
Critical Alerts | Warns the user on observing abnormal behavior which might affect the health of the cluster | :heavy_check_mark: |
Bring your own checks | Users can bring their own checks and Cerberus runs and includes them in the reporting as well as the go/no-go signal | :heavy_check_mark: |
An explanation of all the components that Cerberus can monitor can be found here
How does Cerberus report cluster health?
Cerberus exposes the cluster health and failures through a go/no-go signal, report and metrics API.
Go or no-go signal
When Cerberus is configured to run in daemon mode, it will continuously monitor the specified components, run a lightweight http server at http://0.0.0.0:8080 and publish the signal, i.e. True or False, depending on the component status. Tools can consume the signal and act accordingly.
Report
The report is generated in the run directory and it contains the information about each check/monitored component status per iteration with timestamps. It also displays information about the components in case of failure. Refer report for example.
You can use the “-o <file_path_name>” option to change the location of the created report
Metrics API
Cerberus exposes the metrics including the failures observed during the run through an API. Tools consuming Cerberus can query the API to get a blob of json with the observed failures to scrape and act accordingly. For example, we can query for etcd failures within a start and end time and take actions to determine pass/fail for test cases or report whether the cluster is healthy or unhealthy for that duration.
- The failures in the past 1 hour can be retrieved in the json format by visiting http://0.0.0.0:8080/history.
- The failures in a specific time window can be retrieved in the json format by visiting http://0.0.0.0:8080/history?loopback=.
- The failures between two timestamps, the failures of specific issue types and the failures related to specific components can be retrieved in the json format by visiting the http://0.0.0.0:8080/analyze url. The filters have to be applied to scrape the failures accordingly.
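As a rough sketch of how an external tool could consume these endpoints (assuming Cerberus is reachable at http://0.0.0.0:8080, that the root URL returns the plain True/False signal described above, and that the requests library is available):
import requests

CERBERUS = "http://0.0.0.0:8080"

# The root endpoint publishes the go/no-go signal as the string "True" or "False"
signal = requests.get(CERBERUS, timeout=10).text.strip()
if signal == "False":
    print("Cluster reported unhealthy - pausing the chaos/workload run")

# /history returns the failures observed in the past hour as JSON
history = requests.get(CERBERUS + "/history", timeout=10).json()
print(history)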
Slack integration
Cerberus supports reporting failures in slack. Refer slack integration for information on how to set it up.
Node Problem Detector
Cerberus also consumes node-problem-detector to detect various failures in Kubernetes/OpenShift nodes. More information on setting it up can be found at node-problem-detector
Bring your own checks
Users can add additional checks to monitor components that are not being monitored by Cerberus and consume them as part of the go/no-go signal. This can be accomplished by placing the relative paths of files containing the additional checks under custom_checks in the config file. All the checks should be placed within the main function of the file. If the additional checks need to be considered in determining the go/no-go signal of Cerberus, the main function can return a boolean value for the same. Returning a dict of the format {'status': status, 'message': message} will send the signal to Cerberus along with a message to be displayed in the slack notification. However, it's optional to return a value. Refer to example_check for an example custom check file.
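As an example, a minimal custom check file following the contract described above could look like the sketch below; the endpoint being probed is purely hypothetical and only the main() return convention matters:
import requests

def main():
    # Probe a hypothetical application endpoint that Cerberus does not monitor natively
    try:
        response = requests.get("http://my-app.my-namespace.svc:8080/healthz", timeout=5)
        healthy = response.status_code == 200
    except requests.RequestException:
        healthy = False
    # Returning a plain boolean also works; the dict form additionally carries a slack message
    message = "my-app is healthy" if healthy else "my-app health check failed"
    return {"status": healthy, "message": message}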
Alerts
Monitoring metrics and alerting on abnormal behavior is critical as they are the indicators for clusters health. Information on supported alerts can be found at alerts.
Use cases
There can be a number of use cases; here are some of them:
We run tools to push the limits of Kubernetes/OpenShift to look at the performance and scalability. There are a number of instances where system components or nodes start to degrade, which invalidates the results and the workload generator continues to push the cluster until it is unrecoverable.
When running chaos experiments on a kubernetes/OpenShift cluster, they can potentially break the components unrelated to the targeted components which means that the chaos experiment won’t be able to find it. The go/no-go signal can be used here to decide whether the cluster recovered from the failure injection as well as to decide whether to continue with the next chaos scenario.
Tools consuming Cerberus
Benchmark Operator: The intent of this Operator is to deploy common workloads to establish a performance baseline of Kubernetes cluster on your provider. Benchmark Operator consumes Cerberus to determine if the cluster was healthy during the benchmark run. More information can be found at cerberus-integration.
Kraken: Tool to inject deliberate failures into Kubernetes/OpenShift clusters to check if it is resilient. Kraken consumes Cerberus to determine if the cluster is healthy as a whole in addition to the targeted component during chaos testing. More information can be found at cerberus-integration.
Blogs and other useful resources
- https://www.openshift.com/blog/openshift-scale-ci-part-4-introduction-to-cerberus-guardian-of-kubernetes/openshift-clouds
- https://www.openshift.com/blog/reinforcing-cerberus-guardian-of-openshift/kubernetes-clusters
Contributions
We are always looking for more enhancements, fixes to make it better, any contributions are most welcome. Feel free to report or work on the issues filed on github.
More information on how to Contribute
Community
Key Members(slack_usernames): paige, rook, mffiedler, mohit, dry923, rsevilla, ravi
Credits
Thanks to Mary Shakshober ( https://github.com/maryshak1996 ) for designing the logo.
5.1 - Installation
The following ways are supported to run Cerberus:
- Standalone python program through Git or python package
- Containerized version using either Podman or Docker as the runtime
- Kubernetes or OpenShift deployment
Note
Only OpenShift 4.x versions are tested.
Git
Pick the latest stable release to install here.
$ git clone https://github.com/redhat-chaos/cerberus.git --branch <release>
Install the dependencies
NOTE: It is recommended to use a virtual environment (pyenv, venv) to prevent conflicts with already installed packages.
$ pip3 install -r requirements.txt
Configure and Run
Setup the config according to your requirements. Information on the available options can be found at usage.
Run
$ python3 start_cerberus.py --config <config_file_location>
NOTE: When config file location is not passed, default config is used.
Python Package
Cerberus is also available as a python package to ease the installation and setup.
To install the latest release:
$ pip3 install cerberus-client
Configure and Run
Setup the config according to your requirements. Information on the available options can be found at usage.
Run
$ cerberus_client -c <config_file_location>
Note
When config_file_location is not passed, the default config is used.
Note
It's recommended to run Cerberus using either the containerized or the github version to be able to use the latest enhancements and fixes.
Containerized version
Assuming docker ( 17.05 or greater with multi-build support ) is installed on the host, run:
$ docker pull quay.io/redhat-chaos/cerberus
# Setup the [config](https://github.com/redhat-chaos/cerberus/tree/master/config) according to your requirements. Information on the available options can be found at [usage](usage.md).
$ docker run --name=cerberus --net=host -v <path_to_kubeconfig>:/root/.kube/config -v <path_to_cerberus_config>:/root/cerberus/config/config.yaml -d quay.io/redhat-chaos/cerberus:latest
$ docker logs -f cerberus
Similarly, podman can be used to achieve the same:
$ podman pull quay.io/redhat-chaos/cerberus
# Setup the [config](https://github.com/redhat-chaos/cerberus/tree/master/config) according to your requirements. Information on the available options can be found at [usage](usage.md).
$ podman run --name=cerberus --net=host -v <path_to_kubeconfig>:/root/.kube/config:Z -v <path_to_cerberus_config>:/root/cerberus/config/config.yaml:Z -d quay.io/redhat-chaos/cerberus:latest
$ podman logs -f cerberus
The go/no-go signal ( True or False ) gets published at http://<hostname>:8080. Note that cerberus only supports ipv4 for the time being.
Note
The report is generated at /root/cerberus/cerberus.report inside the container; it can be mounted to a directory on the host in case we want to capture it. If you want to build your own Cerberus image, see here. To run Cerberus on Power (ppc64le) architecture, build and run a containerized version by following the instructions given here.
Run containerized Cerberus as a Kubernetes/OpenShift deployment
Refer to the instructions for information on how to run cerberus as a Kubernetes or OpenShift application.
5.2 - Config
Cerberus Config Components Explained
- Sample Config
- Watch Nodes
- Watch Operators
- Watch Routes
- Watch Master Schedulable Status
- Watch Namespaces
- Watch Terminating Namespaces
- Publish Status
- Inspect Components
- Custom Checks
Config
Set the components to monitor and the tunings like duration to wait between each check in the config file located at config/config.yaml. A sample config looks like:
cerberus:
distribution: openshift # Distribution can be kubernetes or openshift
kubeconfig_path: /root/.kube/config # Path to kubeconfig
port: 8081 # http server port where cerberus status is published
watch_nodes: True # Set to True for the cerberus to monitor the cluster nodes
watch_cluster_operators: True # Set to True for cerberus to monitor cluster operators
watch_terminating_namespaces: True # Set to True to monitor if any namespaces (set below under 'watch_namespaces') start terminating
watch_url_routes:
# Route url's you want to monitor, this is a double array with the url and optional authorization parameter
watch_master_schedulable: # When enabled checks for the schedulable master nodes with given label.
enabled: True
label: node-role.kubernetes.io/master
watch_namespaces: # List of namespaces to be monitored
- openshift-etcd
- openshift-apiserver
- openshift-kube-apiserver
- openshift-monitoring
- openshift-kube-controller-manager
- openshift-machine-api
- openshift-kube-scheduler
- openshift-ingress
- openshift-sdn # When enabled, it will check for the cluster sdn and monitor that namespace
watch_namespaces_ignore_pattern: [] # Ignores pods matching the regex pattern in the namespaces specified under watch_namespaces
cerberus_publish_status: True # When enabled, cerberus starts a light weight http server and publishes the status
inspect_components: False # Enable it only when OpenShift client is supported to run
# When enabled, cerberus collects logs, events and metrics of failed components
prometheus_url: # The prometheus url/route is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes.
prometheus_bearer_token: # The bearer token is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes. This is needed to authenticate with prometheus.
# This enables Cerberus to query prometheus and alert on observing high Kube API Server latencies.
slack_integration: False # When enabled, cerberus reports the failed iterations in the slack channel
# The following env vars needs to be set: SLACK_API_TOKEN ( Bot User OAuth Access Token ) and SLACK_CHANNEL ( channel to send notifications in case of failures )
# When slack_integration is enabled, a watcher can be assigned for each day. The watcher of the day is tagged while reporting failures in the slack channel. Values are slack member ID's.
watcher_slack_ID: # (NOTE: Defining the watcher id's is optional and when the watcher slack id's are not defined, the slack_team_alias tag is used if it is set else no tag is used while reporting failures in the slack channel.)
Monday:
Tuesday:
Wednesday:
Thursday:
Friday:
Saturday:
Sunday:
slack_team_alias: # The slack team alias to be tagged while reporting failures in the slack channel when no watcher is assigned
custom_checks:
- custom_checks/custom_check_sample.py # Relative paths of files containing additional user defined checks
tunings:
timeout: 20 # Number of seconds before requests fail
iterations: 1 # Iterations to loop before stopping the watch, it will be replaced with infinity when the daemon mode is enabled
sleep_time: 3 # Sleep duration between each iteration
kube_api_request_chunk_size: 250 # Large requests will be broken into the specified chunk size to reduce the load on API server and improve responsiveness.
daemon_mode: True # Iterations are set to infinity which means that the cerberus will monitor the resources forever
cores_usage_percentage: 0.5 # Set the fraction of cores to be used for multiprocessing
database:
database_path: /tmp/cerberus.db # Path where cerberus database needs to be stored
reuse_database: False # When enabled, the database is reused to store the failures
Watch Nodes
This flag returns any nodes where the KernelDeadlock is not set to False and does not have a Ready status
Watch Cluster Operators
When watch_cluster_operators is set to True, this will monitor the degraded status of all the cluster operators and report a failure if any are degraded. If set to False, it will not query or report the status of the cluster operators.
Watch Routes
This parameter expects a double array with each item having the url and an optional bearer token or authorization needed to properly connect to each of the urls.
For example:
watch_url_routes:
- - <url>
- <authorization> (optional)
- - https://prometheus-k8s-openshift-monitoring.apps.****.devcluster.openshift.com
- Bearer ****
- - http://nodejs-mongodb-example-default.apps.****.devcluster.openshift.com
Watch Master Schedulable Status
When this check is enabled, cerberus queries each of the nodes for the given label and verifies the taint effect does not equal “NoSchedule”
watch_master_schedulable: # When enabled checks for the schedulable master nodes with given label.
enabled: True
label: <label of master nodes>
Watch Namespaces
It supports monitoring pods in any namespaces specified in the config, the watch is enabled for system components mentioned in the config by default as they are critical for running the operations on Kubernetes/OpenShift clusters.
watch_namespaces support regex patterns. Any valid regex pattern can be used to watch all the namespaces matching the regex pattern. For example, ^openshift-.*$ can be used to watch all namespaces that start with openshift-, or openshift can be used to watch all namespaces that have openshift in it. Or you can use ^.*$ to watch all namespaces in your cluster.
Watch Terminating Namespaces
When watch_terminating_namespaces is set to True, this will monitor the status of all the namespaces defined under watch_namespaces and report a failure if any are terminating. If set to False, it will not query or report the status of the terminating namespaces.
Publish Status
Parameter to set if you want to publish the go/no-go signal to the http server
Inspect Components
inspect_components, if set to True, will perform an oc adm inspect namespace <namespace> when any namespace has any failing pods.
Custom Checks
Users can add additional checks to monitor components that are not being monitored by Cerberus and consume them as part of the go/no-go signal. This can be accomplished by placing the relative paths of files containing the additional checks under custom_checks in the config file. All the checks should be placed within the main function of the file. If the additional checks need to be considered in determining the go/no-go signal of Cerberus, the main function can return a boolean value for the same. Returning a dict of the format {'status': status, 'message': message} will send the signal to Cerberus along with a message to be displayed in the slack notification. However, it's optional to return a value.
Refer to example_check for an example custom check file.
5.3 - Example Report
2020-03-26 22:05:06,393 [INFO] Starting ceberus
2020-03-26 22:05:06,401 [INFO] Initializing client to talk to the Kubernetes cluster
2020-03-26 22:05:06,434 [INFO] Fetching cluster info
2020-03-26 22:05:06,739 [INFO] Publishing cerberus status at http://0.0.0.0:8080
2020-03-26 22:05:06,753 [INFO] Starting http server at http://0.0.0.0:8080
2020-03-26 22:05:06,753 [INFO] Daemon mode enabled, cerberus will monitor forever
2020-03-26 22:05:06,753 [INFO] Ignoring the iterations set
2020-03-26 22:05:25,104 [INFO] Iteration 4: Node status: True
2020-03-26 22:05:25,133 [INFO] Iteration 4: Etcd member pods status: True
2020-03-26 22:05:25,161 [INFO] Iteration 4: OpenShift apiserver status: True
2020-03-26 22:05:25,546 [INFO] Iteration 4: Kube ApiServer status: True
2020-03-26 22:05:25,717 [INFO] Iteration 4: Monitoring stack status: True
2020-03-26 22:05:25,720 [INFO] Iteration 4: Kube controller status: True
2020-03-26 22:05:25,746 [INFO] Iteration 4: Machine API components status: True
2020-03-26 22:05:25,945 [INFO] Iteration 4: Kube scheduler status: True
2020-03-26 22:05:25,963 [INFO] Iteration 4: OpenShift ingress status: True
2020-03-26 22:05:26,077 [INFO] Iteration 4: OpenShift SDN status: True
2020-03-26 22:05:26,077 [INFO] HTTP requests served: 0
2020-03-26 22:05:26,077 [INFO] Sleeping for the specified duration: 5
2020-03-26 22:05:31,134 [INFO] Iteration 5: Node status: True
2020-03-26 22:05:31,162 [INFO] Iteration 5: Etcd member pods status: True
2020-03-26 22:05:31,190 [INFO] Iteration 5: OpenShift apiserver status: True
127.0.0.1 - - [26/Mar/2020 22:05:31] "GET / HTTP/1.1" 200 -
2020-03-26 22:05:31,588 [INFO] Iteration 5: Kube ApiServer status: True
2020-03-26 22:05:31,759 [INFO] Iteration 5: Monitoring stack status: True
2020-03-26 22:05:31,763 [INFO] Iteration 5: Kube controller status: True
2020-03-26 22:05:31,788 [INFO] Iteration 5: Machine API components status: True
2020-03-26 22:05:31,989 [INFO] Iteration 5: Kube scheduler status: True
2020-03-26 22:05:32,007 [INFO] Iteration 5: OpenShift ingress status: True
2020-03-26 22:05:32,118 [INFO] Iteration 5: OpenShift SDN status: False
2020-03-26 22:05:32,118 [INFO] HTTP requests served: 1
2020-03-26 22:05:32,118 [INFO] Sleeping for the specified duration: 5
+--------------------------------------------------Failed Components--------------------------------------------------+
2020-03-26 22:05:37,123 [INFO] Failed openshfit sdn components: ['sdn-xmqhd']
2020-05-23 23:26:43,041 [INFO] ------------------------- Iteration Stats ---------------------------------------------
2020-05-23 23:26:43,041 [INFO] Time taken to run watch_nodes in iteration 1: 0.0996248722076416 seconds
2020-05-23 23:26:43,041 [INFO] Time taken to run watch_cluster_operators in iteration 1: 0.3672499656677246 seconds
2020-05-23 23:26:43,041 [INFO] Time taken to run watch_namespaces in iteration 1: 1.085144281387329 seconds
2020-05-23 23:26:43,041 [INFO] Time taken to run entire_iteration in iteration 1: 4.107403039932251 seconds
2020-05-23 23:26:43,041 [INFO] ---------------------------------------------------------------------------------------
5.4 - Usage
Config
Set the supported components to monitor and the tunings like number of iterations to monitor and duration to wait between each check in the config file located at config/config.yaml. A sample config looks like:
cerberus:
distribution: openshift # Distribution can be kubernetes or openshift
kubeconfig_path: ~/.kube/config # Path to kubeconfig
port: 8080 # http server port where cerberus status is published
watch_nodes: True # Set to True for the cerberus to monitor the cluster nodes
watch_cluster_operators: True # Set to True for cerberus to monitor cluster operators. Parameter is optional, will set to True if not specified
watch_url_routes: # Route url's you want to monitor
- - https://...
- Bearer **** # This parameter is optional, specify authorization need for get call to route
- - http://...
watch_master_schedulable: # When enabled checks for the schedulable master nodes with given label.
enabled: True
label: node-role.kubernetes.io/master
watch_namespaces: # List of namespaces to be monitored
- openshift-etcd
- openshift-apiserver
- openshift-kube-apiserver
- openshift-monitoring
- openshift-kube-controller-manager
- openshift-machine-api
- openshift-kube-scheduler
- openshift-ingress
- openshift-sdn
cerberus_publish_status: True # When enabled, cerberus starts a light weight http server and publishes the status
inspect_components: False # Enable it only when OpenShift client is supported to run.
# When enabled, cerberus collects logs, events and metrics of failed components
prometheus_url: # The prometheus url/route is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes.
prometheus_bearer_token: # The bearer token is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes. This is needed to authenticate with prometheus.
# This enables Cerberus to query prometheus and alert on observing high Kube API Server latencies.
slack_integration: False # When enabled, cerberus reports status of failed iterations in the slack channel
# The following env vars need to be set: SLACK_API_TOKEN ( Bot User OAuth Access Token ) and SLACK_CHANNEL ( channel to send notifications in case of failures )
# When slack_integration is enabled, a watcher can be assigned for each day. The watcher of the day is tagged while reporting failures in the slack channel. Values are slack member ID's.
watcher_slack_ID: # (NOTE: Defining the watcher id's is optional and when the watcher slack id's are not defined, the slack_team_alias tag is used if it is set else no tag is used while reporting failures in the slack channel.)
Monday:
Tuesday:
Wednesday:
Thursday:
Friday:
Saturday:
Sunday:
slack_team_alias: # The slack team alias to be tagged while reporting failures in the slack channel when no watcher is assigned
custom_checks: # Relative paths of files containing additional user defined checks
- custom_checks/custom_check_sample.py
- custom_check.py
tunings:
iterations: 5 # Iterations to loop before stopping the watch, it will be replaced with infinity when the daemon mode is enabled
sleep_time: 60 # Sleep duration between each iteration
kube_api_request_chunk_size: 250 # Large requests will be broken into the specified chunk size to reduce the load on API server and improve responsiveness.
daemon_mode: True # Iterations are set to infinity which means that the cerberus will monitor the resources forever
cores_usage_percentage: 0.5 # Set the fraction of cores to be used for multiprocessing
database:
database_path: /tmp/cerberus.db # Path where cerberus database needs to be stored
reuse_database: False # When enabled, the database is reused to store the failures
Note
watch_namespaces support regex patterns. Any valid regex pattern can be used to watch all the namespaces matching the regex pattern. For example, ^openshift-.*$ can be used to watch all namespaces that start with openshift-, or openshift can be used to watch all namespaces that have openshift in it.
Note
The current implementation can monitor only one cluster from one host. It can be used to monitor multiple clusters provided multiple instances of Cerberus are launched on different hosts.
Note
The components, especially the namespaces, need to be changed depending on the distribution, i.e. Kubernetes or OpenShift. The default specified in the config assumes that the distribution is OpenShift. A config file for Kubernetes is located at config/kubernetes_config.yaml
5.5 - Alerts
Cerberus consumes the metrics from Prometheus deployed on the cluster to report the alerts.
When provided the prometheus url and bearer token in the config, Cerberus reports the following alerts:
KubeAPILatencyHigh: alerts at the end of each iteration and warns if 99th percentile latency for given requests to the kube-apiserver is above 1 second. It is the official SLI/SLO defined for Kubernetes.
High number of etcd leader changes: alerts the user when an increase in etcd leader changes is observed on the cluster. Frequent elections may be a sign of insufficient resources, high network latency, or disruptions by other components and should be investigated.
NOTE: The prometheus url and bearer token are automatically picked from the cluster if the distribution is OpenShift since it’s the default metrics solution. In case of Kubernetes, they need to be provided in the config if prometheus is deployed.
5.6 - Node Problem Detector
node-problem-detector aims to make various node problems visible to the upstream layers in cluster management stack.
Installation
Please follow the instructions in the installation section to setup Node Problem Detector on Kubernetes. The following instructions are setting it up on OpenShift:
- Create the openshift-node-problem-detector namespace ns.yaml with oc create -f ns.yaml
- Add the cluster role with oc adm policy add-cluster-role-to-user system:node-problem-detector -z default -n openshift-node-problem-detector
- Add the security context constraints with oc adm policy add-scc-to-user privileged system:serviceaccount:openshift-node-problem-detector:default
- Edit node-problem-detector.yaml to fit your environment.
- Edit node-problem-detector-config.yaml to configure node-problem-detector.
- Create the ConfigMap with oc create -f node-problem-detector-config.yaml
- Create the DaemonSet with oc create -f node-problem-detector.yaml
Once installed you will see node-problem-detector pods in openshift-node-problem-detector namespace.
Now enable openshift-node-problem-detector in the config.yaml.
Cerberus just monitors the KernelDeadlock condition provided by the node problem detector as it is system critical and can hinder node performance.
5.7 - Slack Integration
The user has the option to enable/disable the slack integration ( disabled by default ). To use the slack integration, the user has to first create an app and add a bot to it on slack. The SLACK_API_TOKEN and SLACK_CHANNEL environment variables have to be set ( an example is shown after the list below ). SLACK_API_TOKEN refers to the Bot User OAuth Access Token and SLACK_CHANNEL refers to the slack channel ID in which the user wishes to receive the notifications. Make sure the Slack Bot Token Scopes contain these permissions: calls:read, channels:read, chat:write, groups:read, im:read, mpim:read.
- Reports when cerberus starts monitoring a cluster in the specified slack channel.
- Reports the component failures in the slack channel.
- A watcher can be assigned for each day of the week. The watcher of the day is tagged while reporting failures in the slack channel instead of everyone. (NOTE: Defining the watcher id’s is optional and when the watcher slack id’s are not defined, the slack_team_alias tag is used if it is set else no tag is used while reporting failures in the slack channel.)
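For example, before starting Cerberus with the slack integration enabled, export the two variables (the values below are placeholders):
export SLACK_API_TOKEN=<bot_user_oauth_access_token>
export SLACK_CHANNEL=<slack_channel_id>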
Go or no-go signal
When Cerberus is configured to run in daemon mode, it will continuously monitor the specified components, run a simple http server at http://0.0.0.0:8080 and publish the signal, i.e. True or False, depending on the component status. Tools can consume the signal and act accordingly.
Failures in a time window
- The failures in the past 1 hour can be retrieved in the json format by visiting http://0.0.0.0:8080/history.
- The failures in a specific time window can be retrieved in the json format by visiting http://0.0.0.0:8080/history?loopback=.
- The failures between two timestamps, the failures of specific issue types and the failures related to specific components can be retrieved in the json format by visiting the http://0.0.0.0:8080/analyze url. The filters have to be applied to scrape the failures accordingly.
Sample Slack Config
This is a snippet of how your slack config could look within your cerberus_config.yaml.
watcher_slack_ID:
Monday: U1234ABCD # replace with your Slack ID from Profile-> More -> Copy Member ID
Tuesday: # Same or different ID can be used for remaining days depending on who you want to tag
Wednesday:
Thursday:
Friday:
Saturday:
Sunday:
slack_team_alias: @group_or_team_id
5.8 - Contribute
How to contribute
Contributions are always appreciated.
How to:
Pull request
In order to submit a change or a PR, please fork the project and follow instructions:
$ git clone http://github.com/<me>/cerberus
$ cd cerberus
$ git checkout -b <branch_name>
$ <make change>
$ git add <changes>
$ git commit -a
$ <insert good message>
$ git push
Fix Formatting
Cerberus uses the pre-commit framework to maintain code linting and python code styling. The CI runs the pre-commit check on each pull request. We encourage our contributors to follow the same pattern while contributing to the code.
The pre-commit configuration file, .pre-commit-config.yaml, is present in the repository. It contains the code styling and linting guides which we use for the application.
The following command can be used to run pre-commit:
pre-commit run --all-files
If pre-commit is not installed on your system, it can be installed with: pip install pre-commit
Squash Commits
If there are multiple commits, please rebase/squash them into a single commit before creating the PR by following:
$ git checkout <my-working-branch>
$ git rebase -i HEAD~<num_of_commits_to_merge>
-OR-
$ git rebase -i <commit_id_of_first_change_commit>
In the interactive rebase screen, set the first commit to pick
and all others to squash
(or whatever else you may need to do).
Push your rebased commits (you may need to force), then issue your PR.
$ git push origin <my-working-branch> --force
6 - Chaos Recommendation Tool
This tool, designed for Redhat Kraken, operates through the command line and offers recommendations for chaos testing. It suggests probable chaos test cases that can disrupt application services by analyzing their behavior and assessing their susceptibility to specific fault types.
This tool profiles an application and gathers telemetry data such as CPU, Memory, and Network usage, analyzing it to suggest probable chaos scenarios. For optimal results, it is recommended to activate the utility while the application is under load.
Pre-requisites
- Openshift Or Kubernetes Environment where the application is hosted
- Access to the metrics via the exposed Prometheus endpoint
- Python3.9
Usage
To run
$ python3.9 -m venv chaos
$ source chaos/bin/activate
$ git clone https://github.com/krkn-chaos/krkn.git
$ cd krkn
$ pip3 install -r requirements.txt
Edit the configuration file:
$ vi config/recommender_config.yaml
$ python3.9 utils/chaos_recommender/chaos_recommender.py -c utils/chaos_recommender/recommender_config.yaml
Follow the prompts to provide the required information.
Configuration
To run the recommender with a config file, specify the config file path with the -c argument.
You can customize the default values by editing the recommender_config.yaml file. The configuration file contains the following options (a sample sketch follows the list):
application
: Specify the application name.
namespaces
: Specify the namespace names (separated by comma or space). If you want to profile
labels
: Specify the labels (not used).
kubeconfig
: Specify the location of the kubeconfig file (not used).
prometheus_endpoint
: Specify the Prometheus endpoint (required).
auth_token
: Auth token to connect to the Prometheus endpoint (required).
scrape_duration
: How long data should be fetched for, e.g. '1m' (required).
chaos_library
: "kraken" (currently only kraken is supported).
json_output_file
: True or False (False by default).
json_output_folder_path
: Specify the folder path where output should be saved. If empty, the default path is used.
chaos_tests
: (for output purposes only; do not change if not needed)
GENERAL
: list of general purpose tests available in Krkn
MEM
: list of memory related tests available in Krkn
NETWORK
: list of network related tests available in Krkn
CPU
: list of CPU related tests available in Krkn
threshold
: Specify the threshold to use for comparison and identifying outliers
cpu_threshold
: Specify the CPU threshold to compare with the CPU limits set on the pods and identify outliers
mem_threshold
: Specify the memory threshold to compare with the memory limits set on the pods and identify outliers
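As a rough sketch of how these options map onto YAML keys (all values below are placeholders and the actual recommender_config.yaml shipped in the repository may differ; the chaos_tests lists are omitted since they are for output purposes only):
application: <application name>
namespaces: <namespace1> <namespace2>
labels: ""                              # not used
kubeconfig: ~/.kube/config              # not used
prometheus_endpoint: <prometheus endpoint url>
auth_token: <token>
scrape_duration: 1m
chaos_library: kraken
json_output_file: False
json_output_folder_path: ""
threshold: <z-score threshold>
cpu_threshold: <cpu threshold>
mem_threshold: <memory threshold>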
TIP: to collect the Prometheus endpoint and token from your OpenShift cluster, you can run the following commands:
prometheus_url=$(kubectl get routes -n openshift-monitoring prometheus-k8s --no-headers | awk '{print $2}')
# TO USE YOUR CURRENT SESSION TOKEN
token=$(oc whoami -t)
# TO CREATE A NEW TOKEN
token=$(kubectl create token -n openshift-monitoring prometheus-k8s --duration=6h || oc sa new-token -n openshift-monitoring prometheus-k8s)
You can also provide the input values through command-line arguments by launching the recommender with the -o option:
-o, --options Evaluate command line options
-a APPLICATION, --application APPLICATION
Kubernetes application name
-n NAMESPACES, --namespaces NAMESPACE
Kubernetes application namespaces separated by space
-l LABELS, --labels LABELS
Kubernetes application labels
-p PROMETHEUS_ENDPOINT, --prometheus-endpoint PROMETHEUS_ENDPOINT
Prometheus endpoint URI
-k KUBECONFIG, --kubeconfig KUBECONFIG
Kubeconfig path
-t TOKEN, --token TOKEN
Kubernetes authentication token
-s SCRAPE_DURATION, --scrape-duration SCRAPE_DURATION
Prometheus scrape duration
-i LIBRARY, --library LIBRARY
Chaos library
-L LOG_LEVEL, --log-level LOG_LEVEL
log level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
-J [FOLDER_PATH], --json-output-file [FOLDER_PATH]
Create output file, the path to the folder can be specified, if not specified the default folder is used.
-M MEM [MEM ...], --MEM MEM [MEM ...]
Memory related chaos tests (space separated list)
-C CPU [CPU ...], --CPU CPU [CPU ...]
CPU related chaos tests (space separated list)
-N NETWORK [NETWORK ...], --NETWORK NETWORK [NETWORK ...]
Network related chaos tests (space separated list)
-G GENERIC [GENERIC ...], --GENERIC GENERIC [GENERIC ...]
Generic chaos tests (space separated list)
--threshold THRESHOLD
Threshold
--cpu_threshold CPU_THRESHOLD
CPU threshold to compare with the cpu limits
--mem_threshold MEM_THRESHOLD
Memory threshold to compare with the memory limits
If you provide the input values through command-line arguments, the corresponding config file inputs will be ignored.
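For example, a run that supplies the inputs entirely from the command line might look like the following (all values are placeholders):
$ python3.9 utils/chaos_recommender/chaos_recommender.py -o \
    -a <application> -n <namespace> \
    -p <prometheus endpoint url> -t <token> \
    -s 1m --threshold <threshold>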
Podman & Docker image
To run the recommender image, please visit krkn-hub for further information.
How it works
After obtaining telemetry data, sourced either locally or from Prometheus, the tool conducts a comprehensive data analysis to detect anomalies. Employing the Z-score method and heatmaps, it identifies outliers by evaluating CPU, memory, and network usage against established limits. Services with Z-scores surpassing a specified threshold are categorized as outliers. This categorization classifies services as network, CPU, or memory-sensitive, consequently leading to the recommendation of relevant test cases.
Customizing Thresholds and Options
You can customize the thresholds and options used for data analysis and identifying the outliers by setting the threshold, cpu_threshold and mem_threshold parameters in the config.
Additional Files
recommender_config.yaml
: The configuration file containing default values for application, namespace, labels, and kubeconfig.
Happy Chaos!
7 - Contribution Guidelines
Adding New Scenarios/Testing Changes
Refer to the two docs below to test your own images with any changes and to contribute them to the repository.
7.1 - Testing your changes
How to Test Your Changes/Additions
Install Podman/Docker Compose
You can use either podman-compose or docker-compose for this step
NOTE: Podman might not work on Macs
pip3 install docker-compose
OR
To get the latest podman-compose features we need, use this installation command:
pip3 install https://github.com/containers/podman-compose/archive/devel.tar.gz
Current list of Scenario Types
Scenario Types:
- pod-scenarios
- node-scenarios
- zone-outages
- time-scenarios
- cerberus
- cluster-shutdown
- container-scenarios
- node-cpu-hog
- node-io-hog
- node-memory-hog
- application-outages
Adding a New Scenario
Create a folder with the scenario name
Create a generic scenario template with environment variables
a. See scenario.yaml for example
b. Almost all parameters should be set using a variable (these will be set in the env.sh file or through the command line environment variables)
Add defaults for any environment variables in an “env.sh” file
a. See env.sh for example
Create a run.sh script to run the chaos scenario
a. See run.sh for example
b. edit line 16 with your scenario yaml template
c. edit lines 17 and 23 with your yaml config location
Create Dockerfile template
a. See dockerfile for example
b. Lines to edit
i. 15: replace "application-outages" with your folder name
ii. 17: replace "application-outages" with your folder name
iii. 19: replace "application-outages" with your folder name and config file name
Add your service/scenario to the docker-compose.yaml file, following the syntax of the other services (see the sketch after this list)
Point the dockerfile parameter of your docker-compose service to the Dockerfile in your new folder
Update this doc and main README with new scenario type
Add CI test for new scenario
a. See test_application_outages.sh for example
b. Lines to change
i. 14 and 31: Give a new function name
ii. 19: Give it a meaningful container name
iii. Edit line 20 to give the scenario type defined in the docker-compose file
c. Add test name to all_tests file
NOTE:
- If you added any variables or new sections be sure to update config.yaml.template
- Similar to above, also add the default parameter values to env.sh
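A rough sketch of what the new service entry might look like in docker-compose.yaml (the service name, image, and paths are placeholders; copy the exact fields from the existing services in the file):
  <scenario_type>:
    build:
      context: ./<scenario_folder>
      dockerfile: Dockerfile
    image: quay.io/<username>/kraken-hub:<scenario_type>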
Build Your Changes
- Run build.sh to get Dockerfiles for each scenario
- Edit the docker-compose.yaml file to point to your quay.io repository (optional)
- Build your image(s) from base kraken-hub directory
Builds all images in docker-compose file
docker-compose build
Builds single image defined by service/scenario name
docker-compose build <scenario_type>
OR
Builds all images in podman-compose file
podman-compose build
Builds single image defined by service/scenario name
podman-compose build <scenario_type>
Push Images to your quay.io
All Images
docker image push --all-tags quay.io/<username>/kraken-hub
Single image
docker image push quay.io/<username>/kraken-hub:<scenario_type>
OR
Single image (images have to be pushed one by one with podman)
podman image push quay.io/<username>/kraken-hub:<scenario_type>
Run your scenario
docker run -d -v <kube_config_path>:/root/.kube/config:Z quay.io/<username>/kraken-hub:<scenario_type>
OR
podman run -d -v <kube_config_path>:/root/.kube/config:Z quay.io/<username>/kraken-hub:<scenario_type>
Follow Contribute guide
Once all you’re happy with your changes, follow the contribution guide on how to create your own branch and squash your commits
7.2 - Contributions
How to contribute
Contributions are always appreciated.
How to:
Pull request
In order to submit a change or a PR, please fork the project and follow these instructions:
$ git clone https://github.com/<me>/kraken-hub
$ cd kraken-hub
$ git checkout -b <branch_name>
$ <make change>
$ git add <changes>
$ git commit -a
$ <insert good message>
$ git push
Squash Commits
If there are multiple commits, please rebase/squash them before creating the PR as follows:
$ git checkout <my-working-branch>
$ git rebase -i HEAD~<num_of_commits_to_merge>
-OR-
$ git rebase -i <commit_id_of_first_change_commit>
In the interactive rebase screen, set the first commit to pick
and all others to squash
(or whatever else you may need to do).
Push your rebased commits (you may need to force), then issue your PR.
$ git push origin <my-working-branch> --force
8 - Krkn Roadmap
The following is a list of enhancements that we are planning to add support for in Krkn. Of course, any help/contributions are greatly appreciated.
- Ability to run multiple chaos scenarios in parallel under load to mimic real world outages
- Centralized storage for chaos experiments artifacts
- Support for causing DNS outages
- Chaos recommender to suggest scenarios having probability of impacting the service under test using profiling results
- Chaos AI integration to improve test coverage while reducing fault space to save costs and execution time
- Support for pod level network traffic shaping
- Ability to visualize the metrics that are being captured by Kraken and stored in Elasticsearch
- Support for running all the scenarios of Kraken on Kubernetes distribution - see https://github.com/krkn-chaos/krkn/issues/185, https://github.com/redhat-chaos/krkn/issues/186
- Continue to improve the Chaos Testing Guide in terms of adding best practices, test environment recommendations and scenarios to make sure the OpenShift platform, as well as the applications running on top of it, are resilient and performant under chaotic conditions.
- Switch documentation references to Kubernetes
- OCP and Kubernetes functionalities segregation
- Krknctl - client for running Krkn scenarios with ease
9 - Config
Set the scenarios to inject and the tunings, like the duration to wait between each scenario, in the config file located at config/config.yaml.
NOTE: config can be used if leveraging the automated way to install the infrastructure pieces.
Config components:
Kraken
This section defines the scenarios and data specific to the chaos run
Distribution
Either openshift or kubernetes, depending on the type of cluster you want to run chaos on. The Prometheus url/route and bearer token are automatically obtained in the case of OpenShift; please set them when the distribution is Kubernetes (see the sketch below).
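A minimal sketch of the relevant keys, assuming the Prometheus settings live in the performance_monitoring section of config.yaml (verify key names and placement against the shipped config; all values are placeholders):
kraken:
  distribution: kubernetes                    # or openshift
performance_monitoring:
  prometheus_url: <prometheus url/route>      # only needed when distribution is kubernetes
  prometheus_bearer_token: <token>            # only needed when distribution is kubernetes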
Exit on failure
exit_on_failure: Exit when a post action check or cerberus run fails
Publish kraken status
publish_kraken_status: Can be accessed at http://0.0.0.0:8081 (or whatever signal_address and port you set in the signal address section)
signal_state: State you want kraken to start at; when set to PAUSE before running the scenarios, kraken will wait for the RUN signal before starting a chaos iteration. Refer to signal.md for more details.
Signal Address
signal_address: Address to listen/post the signal state to
port: Port to listen/post the signal state to
Chaos Scenarios
chaos_scenarios: List of the different types of chaos scenarios you want to run, with paths to their specific yaml file configurations (a sketch follows the list of types below)
If a scenario has a post action check script, it will be run before and after each scenario to validate that the component under test starts and ends in the same state
Currently the scenarios are run one after another (in sequence) and the run will exit if one of the scenarios fails, without moving on to the next one
Chaos scenario types:
- container_scenarios
- plugin_scenarios
- node_scenarios
- time_scenarios
- cluster_shut_down_scenarios
- namespace_scenarios
- zone_outages
- application_outages
- pvc_scenarios
- network_chaos
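A rough sketch of how this section might look, with placeholder scenario file paths (copy the exact layout from the shipped config.yaml):
kraken:
  chaos_scenarios:
    - plugin_scenarios:
        - <path to plugin scenario yaml>
    - node_scenarios:
        - <path to node scenario yaml>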
Cerberus
Parameters to set for enabling cerberus checks at the end of each executed scenario. The given url will be pinged after the scenario and post action check have been completed for each scenario and iteration.
cerberus_enabled: Enable it when cerberus is previously installed
cerberus_url: When cerberus_enabled is set to True, provide the url where cerberus publishes the go/no-go signal
check_applicaton_routes: When enabled, will look for application unavailability using the routes specified in the cerberus config and fail the run
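A minimal sketch of these keys as they might appear in the config (the url is a placeholder):
cerberus:
  cerberus_enabled: True
  cerberus_url: <url where cerberus publishes the go/no-go signal>
  check_applicaton_routes: False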
Performance Monitoring
There are 2 main sections defined in this part of the config, metrics and alerts; read more about each of these configurations in their respective docs
Tunings
wait_duration: Duration to wait between each chaos scenario
iterations: Number of times to execute the scenarios
daemon_mode: True or False; if True, iterations are set to infinity, which means that kraken will cause chaos forever and the number of iterations is ignored
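A minimal sketch of the tunings section with illustrative values:
tunings:
  wait_duration: 60       # seconds to wait between scenarios (illustrative value)
  iterations: 1
  daemon_mode: False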
10 - Getting Started Running Chaos Scenarios
Adding New Scenarios
Adding a new scenario is as simple as adding a new config file under the scenarios directory and defining it in the main kraken config. You can either copy an existing yaml file and make it your own, or fill in one of the templates below to suit your needs.
Templates
Pod Scenario Yaml Template
For example, for adding a pod level scenario for a new application, refer to the sample scenario below to know what fields are necessary and what to add in each location:
# yaml-language-server: $schema=../plugin.schema.json
- id: kill-pods
config:
namespace_pattern: ^<namespace>$
label_selector: <pod label>
kill: <number of pods to kill>
krkn_pod_recovery_time: <expected time for the pod to become ready>
Node Scenario Yaml Template
node_scenarios:
- actions: # Node chaos scenarios to be injected.
- <chaos scenario>
- <chaos scenario>
node_name: <node name> # Can be left blank.
label_selector: <node label>
instance_kill_count: <number of nodes on which to perform action>
timeout: <duration to wait for completion>
cloud_type: <cloud provider>
Time Chaos Scenario Template
time_scenarios:
- action: 'skew_time' or 'skew_date'
object_type: 'pod' or 'node'
label_selector: <label of pod or node>
Common Scenario Edits
If you just want to make small changes to pre-existing scenarios, feel free to edit the scenario file itself.
Example of Quick Pod Scenario Edit:
If you want to kill 2 pods instead of 1 in any of the pre-existing scenarios, you can either edit the number located at filters -> randomSample -> size or the runs under the config -> runStrategy section (see the excerpts below).
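Hypothetical excerpts showing only the two paths named above; they are not a complete scenario file and the surrounding structure may differ in your scenario yaml:
# config -> runStrategy -> runs controls how many times the scenario runs
config:
  runStrategy:
    runs: 1
# filters -> randomSample -> size controls how many pods are killed
filters:
  - randomSample:
      size: 2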
Example of Quick Nodes Scenario Edit:
If your cluster is built on GCP instead of AWS, just change the cloud type in the node_scenarios_example.yml file.
11 - Signaling to Kraken
This functionality allows a user to be able to pause or stop the kraken run at any time no matter the number of iterations or daemon_mode set in the config.
If publish_kraken_status is set to True in the config, kraken will start up a connection to a url at a certain port to decide if it should continue running.
By default, it will get posted to http://0.0.0.0:8081/
An example use case for this feature would be coordinating kraken runs based on the status of the service installation or load on the cluster.
States
There are 3 states in the kraken status:
PAUSE
: When the Kraken signal is ‘PAUSE’, this will pause the kraken test and wait for the wait_duration until the signal returns to RUN.
STOP
: When the Kraken signal is ‘STOP’, end the kraken run and print out report.
RUN
: When the Kraken signal is ‘RUN’, continue kraken run based on iterations.
Configuration
In the config you need to set these parameters to tell kraken which port to post the kraken run status to, as well as whether you want to publish and stop running based on the kraken status or not.
The signal is set to RUN by default, meaning it will continue to run the scenarios. It can be set to PAUSE for Kraken to act as a listener and wait until it is set to RUN before injecting chaos.
port: 8081
publish_kraken_status: True
signal_state: RUN
Setting Signal
You can reset the kraken status during kraken execution with a set_stop_signal.py
script with the following contents:
import http.client as cli

# Connect to the signal address and port configured in the kraken config
conn = cli.HTTPConnection("0.0.0.0", "<port>")

# Post the desired signal state: STOP, PAUSE, or RUN
conn.request("POST", "/STOP")
# conn.request("POST", "/PAUSE")
# conn.request("POST", "/RUN")

response = conn.getresponse()
print(response.read().decode())
Make sure to set the correct port number in your set_stop_signal script.
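For example, assuming the script above is saved as set_stop_signal.py, it can be run with:
$ python3 set_stop_signal.py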
Url Examples
To stop the run:
curl -X POST http://0.0.0.0:8081/STOP
To pause the run:
curl -X POST http://0.0.0.0:8081/PAUSE
To start running again:
curl -X POST http://0.0.0.0:8081/RUN