Container Scenarios
Kraken uses the `oc exec` command to `kill` specific containers in a pod.
This can be based on the pod's namespace or labels. If you know the exact object you want to kill, you can also specify the specific container name or pod name in the scenario YAML file.
These scenarios are in a simple YAML format that you can edit to run your specific tests, or you can use the pre-existing scenarios to see how they work.

Recovery Time Metrics in Krkn Telemetry
Krkn tracks three key recovery time metrics for each affected container:
pod_rescheduling_time - The time (in seconds) the Kubernetes cluster took to reschedule the pod after it was killed. This measures the cluster's scheduling efficiency and covers the time from pod deletion until the replacement pod is scheduled on a node. In some cases, when only the container is killed, the pod is not fully rescheduled, so the rescheduling time may be 0.0 seconds.
pod_readiness_time - The time (in seconds) the pod took to become ready after being scheduled. This measures application startup time, including container image pulls, initialization, and readiness probe success.
total_recovery_time - The total amount of time (in seconds) from pod deletion until the replacement pod became fully ready and available to serve traffic. This is the sum of rescheduling time and readiness time.
These metrics appear in the telemetry output under PodsStatus.recovered for successfully recovered pods. Pods that fail to recover within the timeout period appear under PodsStatus.unrecovered without timing data.
Example telemetry output:
{
  "recovered": [
    {
      "pod_name": "backend-7d8f9c-xyz",
      "namespace": "production",
      "pod_rescheduling_time": 43.62235879898071,
      "pod_readiness_time": 0.0,
      "total_recovery_time": 43.62235879898071
    }
  ],
  "unrecovered": []
}
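The relationship between the three metrics can be checked directly from the telemetry JSON. A minimal sketch, assuming telemetry shaped like the example above (the field names come from the output; the tolerance value is my own choice):

```python
import json

# Telemetry fragment shaped like the example output above.
telemetry = json.loads("""
{
  "recovered": [
    {
      "pod_name": "backend-7d8f9c-xyz",
      "namespace": "production",
      "pod_rescheduling_time": 43.62235879898071,
      "pod_readiness_time": 0.0,
      "total_recovery_time": 43.62235879898071
    }
  ],
  "unrecovered": []
}
""")

for pod in telemetry["recovered"]:
    # total_recovery_time is the sum of rescheduling and readiness time.
    expected = pod["pod_rescheduling_time"] + pod["pod_readiness_time"]
    assert abs(pod["total_recovery_time"] - expected) < 1e-6
    print(f"{pod['pod_name']}: recovered in {pod['total_recovery_time']:.1f}s")
    # prints: backend-7d8f9c-xyz: recovered in 43.6s
```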
1 - Container Scenarios using Krkn
Example Config
The following shows the structure of a container scenario configuration and the fields you can set.
scenarios:
- name: "<name of scenario>"
namespace: "<specific namespace>" # can specify "*" if you want to find in all namespaces
label_selector: "<label of pod(s)>"
container_name: "<specific container name>" # Optional; if omitted, containers in all pods found under the namespace and label will be killed
pod_names: # Optional; if omitted, all pods matching the namespace and label will be selected
- <pod_name>
exclude_label: "<label to exclude pods from chaos>" # Optional: pods matching this label will be excluded from disruption
count: <number of containers to disrupt, default=1>
action: <kill signal to send, e.g. 1 (SIGHUP) or 9 (SIGKILL); defaults to 1>
expected_recovery_time: <number of seconds to wait for the container to be running again; defaults to 120 seconds>
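For instance, a filled-in scenario file might look like the following (the scenario name, namespace, and label here are illustrative placeholders):

```yaml
scenarios:
  - name: "kill etcd container"
    namespace: "openshift-etcd"
    label_selector: "k8s-app=etcd"
    container_name: "etcd"       # optional
    count: 1
    action: 9                    # SIGKILL
    expected_recovery_time: 120
```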
How to Use Plugin Name
Add the scenario to the chaos_scenarios list in the config/config.yaml file:
kraken:
kubeconfig_path: ~/.kube/config # Path to kubeconfig
..
chaos_scenarios:
- container_scenarios:
- scenarios/<scenario_name>.yaml
2 - Container Scenarios using Krkn-hub
This scenario disrupts the containers matching the label in the specified namespace on a Kubernetes/OpenShift cluster.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post-chaos, refer to the docs. Make sure to start it before injecting the chaos, and set the CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.
$ podman run --name=<container_name> --net=host --pull=always --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:container-scenarios
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
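The exit code can be turned into an explicit pass/fail in a wrapper script. A minimal sketch; the exit code is hard-coded here as a placeholder, in a real run it would come from podman inspect:

```shell
# Placeholder value; in practice:
#   EXIT_CODE=$(podman inspect <container_name> --format "{{.State.ExitCode}}")
EXIT_CODE=0

# A zero exit code means the scenario passed.
if [ "$EXIT_CODE" -eq 0 ]; then
  echo "scenario PASS"
else
  echo "scenario FAIL (exit code $EXIT_CODE)"
fi
```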
Note
--env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines.
Without the --env-host option you'll have to set each environment variable on the podman command line, for example -e <VARIABLE>=<value>.
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host --pull=always -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:container-scenarios
OR
$ docker run -e <VARIABLE>=<value> --net=host --pull=always -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:container-scenarios
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host --pull=always -v ~/kubeconfig:/home/krkn/.kube/config:Z -d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:<scenario>

Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
Example if --env-host is used:
export <parameter_name>=<value>
Or set each variable on the command line, for example:
-e <VARIABLE>=<value>
See the list of variables that apply to all scenarios here; these can be used/set in addition to the scenario-specific variables below.
| Parameter | Description | Default |
|---|---|---|
| NAMESPACE | Targeted namespace in the cluster | openshift-etcd |
| LABEL_SELECTOR | Label of the container(s) to target | k8s-app=etcd |
| EXCLUDE_LABEL | Pods to exclude, applied after getting the list of pods from LABEL_SELECTOR. For example "app=foo" | No default |
| DISRUPTION_COUNT | Number of containers to disrupt | 1 |
| CONTAINER_NAME | Name of the container to disrupt | etcd |
| ACTION | Kill signal to send, e.g. 1 (SIGHUP) or 9 (SIGKILL) | 1 |
| EXPECTED_RECOVERY_TIME | Time to wait before checking whether all affected containers recovered properly | 60 |
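When --env-host is not available, the parameters above can be assembled into -e flags for the podman/docker command line. A minimal sketch with hypothetical values; substitute your own namespace and label:

```shell
# Hypothetical target values; replace with your own.
NAMESPACE="payments"
LABEL_SELECTOR="app=api"
DISRUPTION_COUNT=2

# Build the -e flags to append to the podman/docker run command.
ENV_FLAGS="-e NAMESPACE=$NAMESPACE -e LABEL_SELECTOR=$LABEL_SELECTOR -e DISRUPTION_COUNT=$DISRUPTION_COUNT"
echo "$ENV_FLAGS"
```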
Note
Set the NAMESPACE environment variable to openshift-.* to pick and disrupt pods randomly in OpenShift system namespaces. DAEMON_MODE can also be enabled to disrupt the pods every x seconds in the background to check reliability.

Note

In case of using a custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the profiles from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts. For example:
$ podman run --name=<container_name> --net=host --pull=always --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:container-scenarios
Demo
See a demo of this scenario:
3 - Container Scenarios using Krknctl
krknctl run container-scenarios (optional: --<parameter>:<value> )
You can also set any global variable listed here.
Scenario specific parameters:
| Parameter | Description | Type | Default |
|---|---|---|---|
| --namespace | Targeted namespace in the cluster | string | openshift-etcd |
| --label-selector | Label of the container(s) to target | string | k8s-app=etcd |
| --exclude-label | Label of pods/containers to exclude, applied after label-selector. For example "app=foo" | string | False |
| --disruption-count | Number of containers to disrupt | number | 1 |
| --container-name | Name of the container to disrupt | string | etcd |
| --action | Kill signal to send, e.g. 1 (SIGHUP) or 9 (SIGKILL) | string | 1 |
| --expected-recovery-time | Time to wait before checking whether all affected containers recovered properly | number | 60 |
To see all available scenario options:
krknctl run container-scenarios --help