This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

Zone Outage Scenarios

1: Zone Outage Scenarios using Krkn
2: Zone Outage Scenarios using Krkn-Hub
3: Zone Outage Scenario using Krknctl

Scenario to create outage in a targeted zone in the public cloud to understand the impact on both Kubernetes/OpenShift control plane as well as applications running on the worker nodes in that zone.

There are 2 ways these scenarios run: For AWS, it tweaks the network acl of the zone to simulate the failure and that in turn will stop both ingress and egress traffic from all the nodes in a particular zone for the specified duration and reverts it back to the previous state.

For GCP, it in a specific zone you want to target and finds the nodes (master, worker, and infra) and stops the nodes for the set duration and then starts them back up. The reason we do it this way is because any edits to the nodes require you to first stop the node before performing any updates. So, editing the network as the AWS way would still require you to stop the nodes first.

1 - Zone Outage Scenarios using Krkn

Zone outage can be injected by placing the zone_outage config file under zone_outages option in the kraken config. Refer to zone_outage_scenario config file for the parameters that need to be defined.

Refer to cloud setup to configure your cli properly for the cloud provider of the cluster you want to run zone outages on

Current accepted cloud types:

Sample scenario config

zone_outage:                                         # Scenario to create an outage of a zone by tweaking network ACL.
  cloud_type: aws                                    # Cloud type on which Kubernetes/OpenShift runs. aws is the only platform supported currently for this scenario.
  duration: 600                                      # Duration in seconds after which the zone will be back online.
  vpc_id:                                            # Cluster virtual private network to target.
  subnet_id: [subnet1, subnet2]                      # List of subnet-id's to deny both ingress and egress traffic.

Note

vpc_id and subnet_id can be obtained from the cloud web console by selecting one of the instances in the targeted zone ( us-west-2a for example ).

zone_outage:                                         # Scenario to create an outage of a zone by tweaking network ACL
  cloud_type: gcp                                    # cloud type on which Kubernetes/OpenShift runs. aws is only platform supported currently for this scenario.
  duration: 600                                      # duration in seconds after which the zone will be back online
  zone: <zone>                                       # Zone of nodes to stop and then restart after the duration ends
  kube_check: True                                   # Run kubernetes api calls to see if the node gets to a certain state during the scenario

Note

Multiple zones will experience downtime in case of targeting multiple subnets which might have an impact on the cluster health especially if the zones have control plane components deployed.

AWS- Debugging steps in case of failures

In case of failures during the steps which revert back the network acl to allow traffic and bring back the cluster nodes in the zone, the nodes in the particular zone will be in NotReady condition. Here is how to fix it:

OpenShift by default deploys the nodes in different zones for fault tolerance, for example us-west-2a, us-west-2b, us-west-2c. The cluster is associated with a virtual private network and each zone has its own subnet with a network acl which defines the ingress and egress traffic rules at the zone level unlike security groups which are at an instance level.
From the cloud web console, select one of the instances in the zone which is down and go to the subnet_id specified in the config.
Look at the network acl associated with the subnet and you will see both ingress and egress traffic being denied which is expected as Kraken deliberately injects it.
Kraken just switches the network acl while still keeping the original or default network acl around, switching to the default network acl from the drop-down menu will get back the nodes in the targeted zone into Ready state.

GCP - Debugging steps in case of failures

In case of failures during the steps which bring back the cluster nodes in the zone, the nodes in the particular zone will be in NotReady condition. Here is how to fix it:

From the gcp web console, select one of the instances in the zone which is down
Kraken just stops the node, so you’ll just have to select the stopped nodes and START them. This will get back the nodes in the targeted zone into Ready state

How to Use Plugin Name

Add the plugin name to the list of chaos_scenarios section in the config/config.yaml file

kraken:
    kubeconfig_path: ~/.kube/config                     # Path to kubeconfig
    .. 
    chaos_scenarios:
        - zone_outages_scenarios:
            - scenarios/<scenario_name>.yaml

2 - Zone Outage Scenarios using Krkn-Hub

This scenario disrupts a targeted zone in the public cloud by blocking egress and ingress traffic to understand the impact on both Kubernetes/OpenShift platforms control plane as well as applications running on the worker nodes in that zone. More information is documented here

Run

If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer docs. Make sure to start it before injecting the chaos and set CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.

$ podman run --name=<container_name> --net=host --pull=always --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:zone-outages
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can considered as pass/fail for the scenario

Note

–env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines. Without the –env-host option you’ll have to set each enviornment variable on the podman command line like -e <VARIABLE>=<value>

$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host --pull=always -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:zone-outages
OR 
$ docker run -e <VARIABLE>=<value> --name=<container_name> --net=host --pull=always -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:zone-outages

$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can considered as pass/fail for the scenario

Tip

Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:

kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host --pull=always -v ~kubeconfig:/home/krkn/.kube/config:Z -d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:<scenario>

Supported parameters

The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:

Example if –env-host is used:

export <parameter_name>=<value>

OR on the command line like example:

-e <VARIABLE>=<value>

See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables

Parameter	Description	Default
CLOUD_TYPE	Cloud platform on top of which cluster is running, supported cloud platforms	aws or gcp
DURATION	Duration in seconds after which the zone will be back online	600
VPC_ID	cluster virtual private network to target ( REQUIRED for AWS )	""
SUBNET_ID	subnet-id to deny both ingress and egress traffic ( REQUIRED for AWS ). Format: [subenet1, subnet2]	""
ZONE	zone you want to target ( REQUIRED for GCP )	""
KUBE_CHECK	Connect to the kubernetes api to see if the node gets to a certain state during the scenario	True
The following environment variables need to be set for the scenarios that requires intereacting with the cloud platform API to perform the actions:

Amazon Web Services

$ export AWS_ACCESS_KEY_ID=<>
$ export AWS_SECRET_ACCESS_KEY=<>
$ export AWS_DEFAULT_REGION=<>

Google Cloud Platform

export GOOGLE_APPLICATION_CREDENTIALS="<serviceaccount.json>"




Note

    In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.



 For example:
```bash
$ podman run --name=<container_name> --net=host --pull=always --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:container-scenarios

Demo

You can find a link to a demo of the scenario here

3 - Zone Outage Scenario using Krknctl

krknctl run zone-outages (optional: --<parameter>:<value> )

Can also set any global variable listed here

Scenario specific parameters:

Parameter	Description	Type	Default
--cloud-type	Cloud platform on top of which cluster is running, supported platforms - aws, gcp	enum	aws
--duration	Duration in seconds after which the zone will be back online	number	600
--vpc-id	cluster virtual private network to target	string
--subnet-id	subnet-id to deny both ingress and egress traffic ( REQUIRED ). Format: [subnet1, subnet2]	string
--zone	cluster zone to target (only for gcp cloud type )	string
--kube-check	Connecting to the kubernetes api to check the node status, set to False for SNO	enum
--aws-access-key-id	AWS Access Key Id	string (secret)
--aws-secret-access-key	AWS Secret Access Key	string (secret)
--aws-default-region	AWS default region	string
--gcp-application-credentials	GCP application credentials file location	file

NOTE: The secret string types will be masked when scenario is ran

To see all available scenario options

krknctl run zone-outages --help