KubeVirt VM Outage Scenario
Simulating VM-level disruptions in KubeVirt/OpenShift CNV environments
This scenario enables the simulation of VM-level disruptions in clusters where KubeVirt or OpenShift Container-native Virtualization (CNV) is installed. It allows users to delete a Virtual Machine Instance (VMI) to simulate a VM crash and test recovery capabilities.
Purpose
The kubevirt_vm_outage scenario deletes a specific KubeVirt Virtual Machine Instance (VMI) to simulate a VM crash or outage. This helps users:
- Test the resilience of applications running inside VMs
- Verify that VM monitoring and recovery mechanisms work as expected
- Validate high availability configurations for VM workloads
- Understand the impact of sudden VM failures on workloads and the overall system
Prerequisites
Before using this scenario, ensure the following:
- KubeVirt or OpenShift CNV is installed in your cluster
- The target VMI exists and is running in the specified namespace
- Your cluster credentials have sufficient permissions to delete and create VMIs
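If your credentials need these rights granted explicitly, a minimal RBAC sketch along the following lines should be sufficient; the Role name and namespace are illustrative placeholders, not something the scenario itself requires:

```yaml
# Hypothetical Role granting the VMI permissions this scenario needs.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kubevirt-vmi-chaos      # placeholder name
  namespace: vm-workloads       # namespace containing the target VMI
rules:
  - apiGroups: ["kubevirt.io"]
    resources: ["virtualmachineinstances", "virtualmachines"]
    verbs: ["get", "list", "watch", "create", "delete"]
```

Bind this Role to the user or service account running the scenario with a matching RoleBinding.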
Parameters
The scenario supports the following parameters:
| Parameter | Description | Required | Default |
|---|---|---|---|
| vm_name | The name of the VMI to delete | Yes | N/A |
| namespace | The namespace where the VMI is located | No | "default" |
| timeout | How long to wait (in seconds) for the VMI to start running again before attempting recovery | No | 60 |
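For a quick illustration, a scenario configuration using these parameters could look like the sketch below; the VM name and namespace are placeholders, and full usage details are covered in the Kraken section later on this page:

```yaml
scenarios:
  - name: "kubevirt vm outage"
    scenario: kubevirt_vm_outage
    parameters:
      vm_name: my-application-vm   # name of the running VMI to delete
      namespace: vm-workloads      # namespace containing the VMI
      timeout: 60                  # seconds to wait for the VMI to run again
```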
Expected Behavior
When executed, the scenario will:
- Validate that KubeVirt is installed and the target VMI exists
- Save the initial state of the VMI
- Delete the VMI
- Wait for the VMI to become running or hit the timeout
- Attempt to recover the VMI:
  - If the VMI is managed by a VirtualMachine resource with runStrategy: Always, it will automatically recover
  - If automatic recovery doesn't occur, the plugin will manually recreate the VMI using the saved state
- Validate that the VMI is running again
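To observe this sequence from outside the tool, you can watch the target VMI while the scenario runs; `vmi` is the short name KubeVirt registers for `virtualmachineinstances`, and the name and namespace below are placeholders:

```bash
# Watch the VMI get deleted and (re)created while the scenario runs
kubectl get vmi my-application-vm -n vm-workloads -w
```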
Note
If the VM is managed by a VirtualMachine resource with runStrategy: Always, KubeVirt will automatically try to recreate the VMI after deletion. In this case, the scenario will wait for this automatic recovery to complete.
Advanced Use Cases
Testing High Availability VM Configurations
This scenario is particularly useful for testing high availability configurations, such as:
- Clustered applications running across multiple VMs
- VMs with automatic restart policies
- Applications with cross-VM resilience mechanisms
Recovery Strategies
The plugin implements two recovery strategies:
Automated Recovery: If the VM is managed by a VirtualMachine resource with runStrategy: Always, the plugin will wait for KubeVirt's controller to automatically recreate the VMI.
Manual Recovery: If automatic recovery doesn’t occur within the timeout period, the plugin will attempt to manually recreate the VMI using the saved state from before the deletion.
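For reference, automated recovery relies on the owning VirtualMachine resource declaring runStrategy: Always, roughly as in this trimmed sketch (the names and the rest of the spec are placeholders):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: my-application-vm
  namespace: vm-workloads
spec:
  runStrategy: Always          # KubeVirt recreates the VMI whenever it is deleted
  template:
    spec:
      domain:
        devices: {}
        resources:
          requests:
            memory: 1Gi
```

Without such a run strategy, recovery falls back to the plugin recreating the VMI from the saved state.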
Limitations
- The scenario currently supports deleting a single VMI at a time
- If the VM spec changes during the outage window, the manual recovery may not reflect those changes
- The scenario doesn't simulate partial VM failures (e.g., VM freezing), only a complete VM outage
Troubleshooting
If the scenario fails, check the following:
- Ensure KubeVirt/CNV is properly installed in your cluster
- Verify that the target VMI exists and is running
- Check that your credentials have sufficient permissions to delete and create VMIs
- Examine the logs for specific error messages
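A few standard kubectl checks can help narrow down the cause; the namespace and VMI name below are placeholders:

```bash
# Is the KubeVirt/CNV control plane deployed and healthy?
kubectl get kubevirt -A

# Does the target VMI exist and is it Running?
kubectl get vmi my-application-vm -n vm-workloads

# Do the current credentials allow deleting and creating VMIs?
kubectl auth can-i delete virtualmachineinstances.kubevirt.io -n vm-workloads
kubectl auth can-i create virtualmachineinstances.kubevirt.io -n vm-workloads
```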
1 - KubeVirt VM Outage Scenario - Kraken
Detailed implementation of the KubeVirt VM Outage Scenario in Kraken
KubeVirt VM Outage Scenario in Kraken
The kubevirt_vm_outage scenario in Kraken enables users to simulate VM-level disruptions by deleting a Virtual Machine Instance (VMI) to test resilience and recovery capabilities.
Implementation
This scenario is implemented in Kraken’s core repository, with the following key functionality:
- Finding and validating the target VMI
- Deleting the VMI using the KubeVirt API
- Monitoring the recovery process
- Implementing fallback recovery if needed
Usage
You can use this scenario in your Kraken configuration file as follows:
```yaml
scenarios:
  - name: "kubevirt vm outage"
    scenario: kubevirt_vm_outage
    parameters:
      vm_name: <my-application-vm>
      namespace: <vm-workloads>
      timeout: 60
```
Detailed Parameters
| Parameter | Description | Required | Default | Example Values |
|---|---|---|---|---|
| vm_name | The name of the VMI to delete | Yes | N/A | "database-vm", "web-server-vm" |
| namespace | The namespace where the VMI is located | No | "default" | "openshift-cnv", "vm-workloads" |
| timeout | How long to wait (in seconds) for the VMI to become running before attempting recovery | No | 60 | 30, 120, 300 |
Execution Flow
When executed, the scenario follows this process:
1. Initialization: Validates that KubeVirt is installed and configures the KubeVirt client
2. VMI Validation: Checks that the target VMI exists and is in the Running state
3. State Preservation: Saves the initial state of the VMI
4. Chaos Injection: Deletes the VMI using the KubeVirt API
5. Wait for Running: Waits for the VMI to become running again, up to the specified timeout
6. Recovery Monitoring: Checks if the VMI is automatically restored
7. Manual Recovery: If automatic recovery doesn't occur, manually recreates the VMI
8. Validation: Confirms the VMI is running correctly
Sample Configuration
Here's an example configuration for using the kubevirt_vm_outage scenario:
```yaml
scenarios:
  - name: "kubevirt outage test"
    scenario: kubevirt_vm_outage
    parameters:
      vm_name: my-vm
      namespace: kubevirt
      duration: 60
```
For multiple VMs in different namespaces:
```yaml
scenarios:
  - name: "kubevirt outage test - app VM"
    scenario: kubevirt_vm_outage
    parameters:
      vm_name: app-vm
      namespace: application
      duration: 120
  - name: "kubevirt outage test - database VM"
    scenario: kubevirt_vm_outage
    parameters:
      vm_name: db-vm
      namespace: database
      duration: 180
```
Combining with Other Scenarios
For more comprehensive testing, you can combine this scenario with other Kraken scenarios in the list of chaos_scenarios in the config file:
```yaml
kraken:
  kubeconfig_path: ~/.kube/config    # Path to kubeconfig
  ...
  chaos_scenarios:
    - hog_scenarios:
        - scenarios/kube/cpu-hog.yml
    - kubevirt_vm_outage:
        - scenarios/kubevirt/kubevirt-vm-outage.yaml
```
2 - KubeVirt Outage Scenarios using Krkn-Hub
This scenario deletes a VMI matching the namespace and name on a Kubernetes/OpenShift cluster.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos and set the CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.
```bash
$ podman run --name=<container_name> --net=host --pull=always --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:kubevirt-outage
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
```
Note
--env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines.
Without the --env-host option, you'll have to set each environment variable on the podman command line, e.g. -e <VARIABLE>=<value>
```bash
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host --pull=always -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:kubevirt-outage
```
OR
```bash
$ docker run -e <VARIABLE>=<value> --net=host --pull=always -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:kubevirt-outage
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
```
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
```bash
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host --pull=always -v ~/kubeconfig:/home/krkn/.kube/config:Z -d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:<scenario>
```
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
Example if --env-host is used:
export <parameter_name>=<value>
Or set each variable on the command line, for example:
-e <VARIABLE>=<value>
See the list of variables that apply to all scenarios here; they can be used/set in addition to these scenario-specific variables.
| Parameter | Description | Default |
|---|---|---|
| NAMESPACE | VMI namespace to target | "" |
| VMI_NAME | VMI name to delete, supports regex | "" |
| TIMEOUT | Timeout to wait for the VMI to start running again; the scenario will fail if the timeout is hit | 120 |
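For example, when running with --env-host you might export values like the following before starting the container (the values are placeholders):

```bash
export NAMESPACE=vm-workloads       # namespace of the VMI to target
export VMI_NAME=my-application-vm   # name of the VMI to delete
export TIMEOUT=120                  # seconds to wait for the VMI to run again
```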
Note: In case of using a custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the profiles from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
For example:
```bash
$ podman run --name=<container_name> --net=host --pull=always --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:kubevirt-outage
```
3 - Kubevirt Outage Scenarios using Krknctl
krknctl run kubevirt-outage (optional: --<parameter>:<value> )
You can also set any global variable listed here.
Scenario-specific parameters (be sure to scroll right):
| Parameter | Description | Type | Default | Possible Values |
|---|---|---|---|---|
| --namespace | VMI namespace to target | string | node-role.kubernetes.io/worker | |
| --vmi-name | Name of the VMI to delete | string | | |
| --timeout | Duration to wait for the VMI to start running again | number | 180 | |
To see all available scenario options:
```bash
krknctl run kubevirt-outage --help
```