KubeVirt VM Outage Scenario
Simulating VM-level disruptions in KubeVirt/OpenShift CNV environments
This scenario enables the simulation of VM-level disruptions in clusters where KubeVirt or OpenShift Container-native Virtualization (CNV) is installed. It allows users to delete a Virtual Machine Instance (VMI) to simulate a VM crash and test recovery capabilities.
Purpose
The kubevirt_vm_outage scenario deletes a specific KubeVirt Virtual Machine Instance (VMI) to simulate a VM crash or outage. This helps users:
- Test the resilience of applications running inside VMs
- Verify that VM monitoring and recovery mechanisms work as expected
- Validate high availability configurations for VM workloads
- Understand the impact of sudden VM failures on workloads and the overall system
Prerequisites
Before using this scenario, ensure the following:
- KubeVirt or OpenShift CNV is installed in your cluster
- The target VMI exists and is running in the specified namespace
- Your cluster credentials have sufficient permissions to delete and create VMIs
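If your credentials need these rights granted explicitly, a minimal RBAC sketch along the following lines should be sufficient; the Role name and namespace are illustrative placeholders, not something the scenario itself requires:

```yaml
# Hypothetical Role granting the VMI permissions this scenario needs.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kubevirt-vmi-chaos      # placeholder name
  namespace: vm-workloads       # namespace containing the target VMI
rules:
  - apiGroups: ["kubevirt.io"]
    resources: ["virtualmachineinstances", "virtualmachines"]
    verbs: ["get", "list", "watch", "create", "delete"]
```

Bind this Role to the user or service account running the scenario with a matching RoleBinding.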
Parameters
The scenario supports the following parameters:
| Parameter | Description | Required | Default |
|---|---|---|---|
| vm_name | The name of the VMI to delete | Yes | N/A |
| namespace | The namespace where the VMI is located | No | "default" |
| timeout | How long to wait (in seconds) for the VMI to start running again before attempting recovery | No | 60 |
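For a quick illustration, a scenario configuration using these parameters could look like the sketch below; the VM name and namespace are placeholders, and full usage details are covered in the Kraken section later on this page:

```yaml
scenarios:
  - name: "kubevirt vm outage"
    scenario: kubevirt_vm_outage
    parameters:
      vm_name: my-application-vm   # name of the running VMI to delete
      namespace: vm-workloads      # namespace containing the VMI
      timeout: 60                  # seconds to wait for the VMI to run again
```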
Expected Behavior
When executed, the scenario will:
- Validate that KubeVirt is installed and the target VMI exists
- Save the initial state of the VMI
- Delete the VMI
- Wait for the VMI to become running or hit the timeout
- Attempt to recover the VMI:
  - If the VMI is managed by a VirtualMachine resource with runStrategy: Always, it will automatically recover
  - If automatic recovery doesn't occur, the plugin will manually recreate the VMI using the saved state
- Validate that the VMI is running again
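To observe this sequence from outside the tool, you can watch the target VMI while the scenario runs; `vmi` is the short name KubeVirt registers for `virtualmachineinstances`, and the name and namespace below are placeholders:

```bash
# Watch the VMI get deleted and (re)created while the scenario runs
kubectl get vmi my-application-vm -n vm-workloads -w
```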
Note
If the VM is managed by a VirtualMachine resource with runStrategy: Always, KubeVirt will automatically try to recreate the VMI after deletion. In this case, the scenario will wait for this automatic recovery to complete.
Advanced Use Cases
Testing High Availability VM Configurations
This scenario is particularly useful for testing high availability configurations, such as:
- Clustered applications running across multiple VMs
- VMs with automatic restart policies
- Applications with cross-VM resilience mechanisms
Recovery Strategies
The plugin implements two recovery strategies:
Automated Recovery: If the VM is managed by a VirtualMachine resource with runStrategy: Always, the plugin will wait for KubeVirt's controller to automatically recreate the VMI.
Manual Recovery: If automatic recovery doesn’t occur within the timeout period, the plugin will attempt to manually recreate the VMI using the saved state from before the deletion.
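For reference, automated recovery relies on the owning VirtualMachine resource declaring runStrategy: Always, roughly as in this trimmed sketch (the names and the rest of the spec are placeholders):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: my-application-vm
  namespace: vm-workloads
spec:
  runStrategy: Always          # KubeVirt recreates the VMI whenever it is deleted
  template:
    spec:
      domain:
        devices: {}
        resources:
          requests:
            memory: 1Gi
```

Without such a run strategy, recovery falls back to the plugin recreating the VMI from the saved state.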
Limitations
- The scenario currently supports deleting a single VMI at a time
- If the VM spec changes during the outage window, the manual recovery may not reflect those changes
- The scenario doesn't simulate partial VM failures (e.g., VM freezing), only a complete VM outage
Troubleshooting
If the scenario fails, check the following:
- Ensure KubeVirt/CNV is properly installed in your cluster
- Verify that the target VMI exists and is running
- Check that your credentials have sufficient permissions to delete and create VMIs
- Examine the logs for specific error messages
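A few standard kubectl checks can help narrow down the cause; the namespace and VMI name below are placeholders:

```bash
# Is the KubeVirt/CNV control plane deployed and healthy?
kubectl get kubevirt -A

# Does the target VMI exist and is it Running?
kubectl get vmi my-application-vm -n vm-workloads

# Do the current credentials allow deleting and creating VMIs?
kubectl auth can-i delete virtualmachineinstances.kubevirt.io -n vm-workloads
kubectl auth can-i create virtualmachineinstances.kubevirt.io -n vm-workloads
```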
1 - KubeVirt VM Outage Scenario - Kraken
Detailed implementation of the KubeVirt VM Outage Scenario in Kraken
KubeVirt VM Outage Scenario in Kraken
The kubevirt_vm_outage scenario in Kraken enables users to simulate VM-level disruptions by deleting a Virtual Machine Instance (VMI) to test resilience and recovery capabilities.
Implementation
This scenario is implemented in Kraken’s core repository, with the following key functionality:
- Finding and validating the target VMI
- Deleting the VMI using the KubeVirt API
- Monitoring the recovery process
- Implementing fallback recovery if needed
Usage
You can use this scenario in your Kraken configuration file as follows:
```yaml
scenarios:
  - name: "kubevirt vm outage"
    scenario: kubevirt_vm_outage
    parameters:
      vm_name: <my-application-vm>
      namespace: <vm-workloads>
      timeout: 60
```
Detailed Parameters
| Parameter | Description | Required | Default | Example Values |
|---|---|---|---|---|
| vm_name | The name of the VMI to delete | Yes | N/A | "database-vm", "web-server-vm" |
| namespace | The namespace where the VMI is located | No | "default" | "openshift-cnv", "vm-workloads" |
| timeout | How long to wait (in seconds) for the VMI to become running before attempting recovery | No | 60 | 30, 120, 300 |
Execution Flow
When executed, the scenario follows this process:
1. Initialization: Validates that KubeVirt is installed and configures the KubeVirt client
2. VMI Validation: Checks that the target VMI exists and is in the Running state
3. State Preservation: Saves the initial state of the VMI
4. Chaos Injection: Deletes the VMI using the KubeVirt API
5. Wait for Running: Waits for the VMI to become running again, up to the specified timeout
6. Recovery Monitoring: Checks if the VMI is automatically restored
7. Manual Recovery: If automatic recovery doesn't occur, manually recreates the VMI
8. Validation: Confirms the VMI is running correctly
Sample Configuration
Here's an example configuration for using the kubevirt_vm_outage scenario:
```yaml
scenarios:
  - name: "kubevirt outage test"
    scenario: kubevirt_vm_outage
    parameters:
      vm_name: my-vm
      namespace: kubevirt
      duration: 60
```
For multiple VMs in different namespaces:
```yaml
scenarios:
  - name: "kubevirt outage test - app VM"
    scenario: kubevirt_vm_outage
    parameters:
      vm_name: app-vm
      namespace: application
      duration: 120
  - name: "kubevirt outage test - database VM"
    scenario: kubevirt_vm_outage
    parameters:
      vm_name: db-vm
      namespace: database
      duration: 180
```
Combining with Other Scenarios
For more comprehensive testing, you can combine this scenario with other Kraken scenarios in the list of chaos_scenarios in the config file:
```yaml
kraken:
  kubeconfig_path: ~/.kube/config    # Path to kubeconfig
  ...
  chaos_scenarios:
    - hog_scenarios:
        - scenarios/kube/cpu-hog.yml
    - kubevirt_vm_outage:
        - scenarios/kubevirt/kubevirt-vm-outage.yaml
```
2 - KubeVirt Outage Scenarios using Krkn-Hub
This scenario deletes a VMI matching the namespace and name on a Kubernetes/OpenShift cluster.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos and set the CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.
```bash
$ podman run --name=<container_name> --net=host --pull=always --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:kubevirt-outage
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
```
Note
--env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines.
Without the --env-host option, you'll have to set each environment variable on the podman command line, e.g. -e <VARIABLE>=<value>
```bash
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host --pull=always -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:kubevirt-outage
```
OR
```bash
$ docker run -e <VARIABLE>=<value> --net=host --pull=always -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:kubevirt-outage
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
```
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
```bash
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host --pull=always -v ~/kubeconfig:/home/krkn/.kube/config:Z -d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:<scenario>
```
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
Example if --env-host is used:
export <parameter_name>=<value>
Or set each variable on the command line, for example:
-e <VARIABLE>=<value>
See the list of variables that apply to all scenarios here; they can be used/set in addition to these scenario-specific variables.
| Parameter | Description | Default |
|---|---|---|
| NAMESPACE | VMI namespace to target | "" |
| VMI_NAME | VMI name to delete, supports regex | "" |
| TIMEOUT | Timeout to wait for the VMI to start running again; the scenario will fail if the timeout is hit | 120 |
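For example, when running with --env-host you might export values like the following before starting the container (the values are placeholders):

```bash
export NAMESPACE=vm-workloads       # namespace of the VMI to target
export VMI_NAME=my-application-vm   # name of the VMI to delete
export TIMEOUT=120                  # seconds to wait for the VMI to run again
```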
Note: In case of using a custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the profiles from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
For example:
```bash
$ podman run --name=<container_name> --net=host --pull=always --env-host=true -v <path-to-custom-metrics-profile>:/home/krkn/kraken/config/metrics-aggregated.yaml -v <path-to-custom-alerts-profile>:/home/krkn/kraken/config/alerts -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:kubevirt-outage
```
3 - Kubevirt Outage Scenarios using Krknctl
krknctl run kubevirt-outage (optional: --<parameter>:<value> )
You can also set any global variable listed here.
Scenario-specific parameters (be sure to scroll right):
| Parameter | Description | Type | Default | Possible Values |
|---|---|---|---|---|
| --namespace | VMI namespace to target | string | node-role.kubernetes.io/worker | |
| --vmi-name | Name of the VMI to delete | string | | |
| --timeout | Duration to wait for the VMI to start running again | number | 180 | |
To see all available scenario options:
```bash
krknctl run kubevirt-outage --help
```