Pod Scenarios
This scenario disrupts pods in the specified namespace of a Kubernetes/OpenShift cluster. Pods are targeted by label or by pod name, and pods carrying an excluded label can be skipped.
Why pod scenarios are important:
Modern applications demand high availability, low downtime, and resilient infrastructure. Kubernetes provides building blocks like Deployments, ReplicaSets, and Services to support fault tolerance, but understanding how these interact during disruptions is critical for ensuring reliability. Pod disruption scenarios test this reliability under various conditions, validating that the application and infrastructure respond as expected.
Use cases of pod scenarios
- Deleting multiple pods simultaneously
  - Use Case: Simulates a larger failure event, such as a node crash or AZ outage.
  - Why It’s Important: Tests whether the system has enough resources and policies to recover gracefully.
  - Customer Impact: If all pods of a service fail, user experience is directly impacted.
  - HA Indicator: Application can continue functioning from other replicas across zones/nodes.
- Pod Eviction (Soft Disruption)
  - Use Case: Triggered by Kubernetes itself during node upgrades or scaling down.
  - Why It’s Important: Ensures graceful termination and restart elsewhere without user impact.
  - Customer Impact: Should be zero if readiness/liveness probes and PDBs are correctly configured.
  - HA Indicator: Rolling disruption does not take down the whole application.
How to know if it is highly available
- Multiple Replicas Exist: Confirmed by checking kubectl get deploy -n <namespace> and seeing more than one replica.
- Pods Distributed Across Nodes/Availability Zones: Using topologySpreadConstraints or observing pod distribution in kubectl get pods -o wide. See Health Checks for real-time visibility into the impact of chaos scenarios on application availability and performance.
- Service Uptime Remains Unaffected: During the chaos test, verify app availability (synthetic probes, Prometheus alerts, etc.).
- Recovery Is Automatic: No manual intervention needed to restore service.
- Krkn Telemetry Indicators: End-of-run data includes recovery times, pod reschedule latency, and service downtime, all of which are vital metrics for assessing HA.
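The first two checks above can be automated against the JSON that kubectl prints with -o json. The following is a minimal sketch, not part of Krkn: it assumes a single Deployment object (as returned by kubectl get deploy <name> -n <namespace> -o json) and a PodList (as returned by kubectl get pods -n <namespace> -o json), and reads the standard status.readyReplicas and spec.nodeName fields. The function names, the thresholds of 2 replicas and 2 nodes, and the sample data are hypothetical and chosen only for illustration.

```python
def ready_replicas(deployment):
    """Read status.readyReplicas from a Deployment object (0 if absent)."""
    return deployment.get("status", {}).get("readyReplicas", 0)


def nodes_in_use(pod_list):
    """Collect the distinct spec.nodeName values from a PodList object."""
    return {
        item["spec"]["nodeName"]
        for item in pod_list.get("items", [])
        if item.get("spec", {}).get("nodeName")
    }


def looks_highly_available(deployment, pod_list, min_replicas=2, min_nodes=2):
    """Hypothetical check: enough ready replicas, spread over enough nodes."""
    enough_replicas = ready_replicas(deployment) >= min_replicas
    enough_nodes = len(nodes_in_use(pod_list)) >= min_nodes
    return enough_replicas and enough_nodes


# Hand-written sample data standing in for real kubectl output.
deployment = {"spec": {"replicas": 3}, "status": {"readyReplicas": 3}}
pod_list = {
    "items": [
        {"metadata": {"name": "backend-a"}, "spec": {"nodeName": "node-1"}},
        {"metadata": {"name": "backend-b"}, "spec": {"nodeName": "node-2"}},
        {"metadata": {"name": "backend-c"}, "spec": {"nodeName": "node-2"}},
    ]
}

print(sorted(nodes_in_use(pod_list)))
print(looks_highly_available(deployment, pod_list))
```

A check like this only covers replica count and node spread. Service uptime and automatic recovery still need to be observed while the chaos run is in progress.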
Excluding Pods from Disruption
Use exclude_label to mark pods that must stay untouched while the other matching pods in the namespace are subjected to chaos. Common use cases are:
- Disrupt the backend pods while leaving highly available database replicas untouched.
- Inject faults in the application layer without stopping infrastructure or monitoring pods.
- Run a rolling disruption experiment while leaving control-plane or system-critical components unaffected.
Format:
exclude_label: "key=value"
Mechanism:
- Pods are selected based on namespace_pattern + label_selector or name_pattern.
- Before deletion, the pods that match exclude_label are removed from the list.
- The rest of the pods are subjected to chaos.
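The three steps above can be sketched in a few lines of Python. This is an illustration of the described behavior, not Krkn's actual implementation: the select_targets and has_label helpers and the dictionary layout used for pods are hypothetical, and label matching is simplified to a single key=value pair.

```python
import re


def has_label(pod, selector):
    """Return True if the pod carries the 'key=value' label in selector."""
    key, _, value = selector.partition("=")
    return pod["labels"].get(key.strip()) == value.strip()


def select_targets(pods, namespace_pattern, label_selector=None,
                   name_pattern=None, exclude_label=None):
    """Pick candidate pods, then drop the ones matching exclude_label."""
    candidates = []
    for pod in pods:
        # Step 1a: the namespace must match the regular expression.
        if not re.search(namespace_pattern, pod["namespace"]):
            continue
        # Step 1b: match by label if given, otherwise by pod name.
        if label_selector is not None:
            matched = has_label(pod, label_selector)
        elif name_pattern is not None:
            matched = re.search(name_pattern, pod["name"]) is not None
        else:
            matched = False
        if matched:
            candidates.append(pod)
    # Step 2: remove protected pods before anything is deleted.
    if exclude_label:
        candidates = [p for p in candidates if not has_label(p, exclude_label)]
    # Step 3: whatever is left is eligible for disruption.
    return candidates


# Hypothetical pod inventory for illustration.
pods = [
    {"name": "etcd-0", "namespace": "openshift-etcd",
     "labels": {"k8s-app": "etcd", "role": "etcd-leader"}},
    {"name": "etcd-1", "namespace": "openshift-etcd",
     "labels": {"k8s-app": "etcd"}},
    {"name": "etcd-2", "namespace": "openshift-etcd",
     "labels": {"k8s-app": "etcd"}},
    {"name": "web-0", "namespace": "production",
     "labels": {"app": "backend"}},
]

targets = select_targets(pods, r"^openshift-etcd$",
                         label_selector="k8s-app=etcd",
                         exclude_label="role=etcd-leader")
print([p["name"] for p in targets])  # prints ['etcd-1', 'etcd-2']
```

The exclusion is applied to the candidate list rather than to the whole namespace, so a pod that never matched the label or name selector is unaffected whether or not it carries the excluded label.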
Example: Protect the Leader While Other etcd Replicas Are Killed
- id: kill_pods
  config:
    namespace_pattern: ^openshift-etcd$
    label_selector: k8s-app=etcd
    exclude_label: role=etcd-leader
    krkn_pod_recovery_time: 120
    kill: 1
Example: Disrupt Backend, Skip Monitoring
- id: kill_pods
  config:
    namespace_pattern: ^production$
    label_selector: app=backend
    exclude_label: component=monitoring
    krkn_pod_recovery_time: 120
    kill: 2
Recovery Time Metrics in Krkn Telemetry
Krkn tracks three key recovery time metrics for each affected pod:
pod_rescheduling_time - The time (in seconds) that the Kubernetes cluster took to reschedule the pod after it was killed. This measures the cluster’s scheduling efficiency and includes the time from pod deletion until the replacement pod is scheduled on a node.
pod_readiness_time - The time (in seconds) the pod took to become ready after being scheduled. This measures application startup time, including container image pulls, initialization, and readiness probe success.
total_recovery_time - The total amount of time (in seconds) from pod deletion until the replacement pod became fully ready and available to serve traffic. This is the sum of rescheduling time and readiness time.
These metrics appear in the telemetry output under PodsStatus.recovered for successfully recovered pods. Pods that fail to recover within the timeout period appear under PodsStatus.unrecovered without timing data.
Example telemetry output:
{
  "recovered": [
    {
      "pod_name": "backend-7d8f9c-xyz",
      "namespace": "production",
      "pod_rescheduling_time": 2.3,
      "pod_readiness_time": 5.7,
      "total_recovery_time": 8.0
    }
  ],
  "unrecovered": []
}
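A post-run check can consume this structure directly. The sketch below is an illustration rather than part of Krkn: it parses the example output shown above, confirms that total_recovery_time equals pod_rescheduling_time plus pod_readiness_time, and flags pods that did not recover or that recovered too slowly. The check_recovery function and the 30-second threshold are hypothetical, and the code assumes that entries under unrecovered carry pod_name and namespace fields.

```python
import json
import math

# The example telemetry from the documentation, embedded as a string.
TELEMETRY = """
{
  "recovered": [
    {
      "pod_name": "backend-7d8f9c-xyz",
      "namespace": "production",
      "pod_rescheduling_time": 2.3,
      "pod_readiness_time": 5.7,
      "total_recovery_time": 8.0
    }
  ],
  "unrecovered": []
}
"""


def check_recovery(pods_status, max_total_seconds):
    """Return a list of human-readable problems; empty means all good."""
    problems = []
    # Any unrecovered pod is a failure (field names here are an assumption).
    for pod in pods_status.get("unrecovered", []):
        name = f"{pod.get('namespace', '?')}/{pod.get('pod_name', '?')}"
        problems.append(f"{name}: did not recover")
    for pod in pods_status.get("recovered", []):
        name = f"{pod['namespace']}/{pod['pod_name']}"
        expected = pod["pod_rescheduling_time"] + pod["pod_readiness_time"]
        # The total should be the sum of the two phases.
        if not math.isclose(pod["total_recovery_time"], expected, abs_tol=1e-6):
            problems.append(f"{name}: total does not match the sum of phases")
        # Hypothetical threshold supplied by the caller.
        if pod["total_recovery_time"] > max_total_seconds:
            problems.append(f"{name}: recovery slower than {max_total_seconds}s")
    return problems


status = json.loads(TELEMETRY)
print(check_recovery(status, max_total_seconds=30))  # prints []
```

Splitting the total into its two phases helps locate the bottleneck: a high rescheduling time points at the cluster scheduler or capacity, while a high readiness time points at image pulls, initialization, or readiness probes.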
See Krkn config examples and Krknctl parameters for full details.