krkn-visualize

Deployable Grafana to help analyze cluster performance during chaos runs

The krkn-chaos/visualize repository deploys a Grafana instance to your cluster pre-loaded with dashboards for monitoring chaos engineering runs. Dashboards pull from two datasources:

  • Prometheus — cluster-level metrics (always available once Prometheus is installed)
  • Elasticsearch — per-run chaos data indexed by run UUID (requires Elasticsearch and krkn elastic enabled)
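
Both datasources end up in Grafana's provisioning configuration. As a minimal sketch of what gets wired up, with illustrative names and in-cluster URLs (the deploy script generates the real file from the Elasticsearch URL you pass with -e):

```yaml
# Grafana datasource provisioning: illustrative sketch, not the generated file
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-k8s.monitoring.svc:9090   # assumed in-cluster address
  - name: Elasticsearch
    type: elasticsearch
    access: proxy
    url: http://elasticsearch:9200                   # matches the -e flag you pass
```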

To deploy manually, clone the repository and run the deploy script:

git clone https://github.com/krkn-chaos/visualize
cd visualize/krkn-visualize

# Kubernetes
./deploy.sh -e <elasticsearch_url> -p <grafana_password>

# OpenShift
./deploy.sh -e <elasticsearch_url> -p <grafana_password> -k oc

Note: Prometheus must be installed before deploying the dashboards. OpenShift includes it by default; on plain Kubernetes, install it yourself first. The repository provides install commands for both Prometheus and Elasticsearch if you need them.

Deploy with krknctl

If you have krknctl installed, you can deploy the dashboards with a single command without cloning the repo — krknctl pulls and runs the quay.io/krkn-chaos/krkn-visualize:latest container image and wires up the datasources automatically:

# Kubernetes
krknctl visualize --grafana-password <secret> --es-url http://elasticsearch:9200

# OpenShift
krknctl visualize --grafana-password <secret> --es-url http://elasticsearch:9200 --kubectl oc

# With optional Prometheus datasource
krknctl visualize --grafana-password <secret> --es-url http://elasticsearch:9200 --prometheus-url http://prometheus:9090

# Tear down
krknctl visualize --delete

See the krknctl visualize command reference for the full list of flags.

Additional dashboards can be imported after deployment:

cd visualize/krkn-visualize
./import.sh -i ../rendered/<folder>/<dashboard_name>.json

Dashboards by Category

There are 23 dashboards organized into three categories. Use the Chaos dashboards to analyze a specific run by UUID (requires the Elasticsearch connection); use the General and K8s dashboards to monitor overall cluster health before, during, and after a scenario (via the Prometheus connection).


Chaos Dashboards

These dashboards filter by run UUID (from Elasticsearch) to show metrics specific to a single chaos run. Each includes a scenario details panel, UUID details, active alerts, and scenario-specific recovery or impact metrics.
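
Behind each panel, the dashboard queries Elasticsearch for documents tagged with the run UUID. As a rough sketch of such a query, assuming the documents carry a run_uuid field (the actual field name in krkn's indices may differ; check your index mapping):

```shell
# Compose an Elasticsearch match query for one chaos run.
# "run_uuid" is an assumed field name, not confirmed against krkn's index mapping.
uuid='1a2b3c4d-5e6f-4a8b-9c0d-1e2f3a4b5c6d'
query=$(printf '{"query":{"match":{"run_uuid":"%s"}}}' "$uuid")
echo "$query"
# POST this body to http://<elasticsearch>/<index>/_search to list the run's documents
```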

| Dashboard | File | Key Panels | Use When Running |
|---|---|---|---|
| Pod Scenarios | Chaos/pod-scenarios.json | Pod recovery time, console health, etcd WAL latency, alerts | pod-scenarios, application-outages |
| Node Scenarios | Chaos/node-scenarios.json | Node ready/not-ready time, node running/stopped time | node-scenarios |
| Container Scenarios | Chaos/container-scenarios.json | Container recovery time, console health, etcd recovery | container-scenarios |
| Hog Scenarios | Chaos/hog-scenarios.json | CPU hog duration, memory hog duration, IO hog | node-cpu-hog, node-memory-hog, node-io-hog |
| Network Chaos Scenarios | Chaos/network-chaos-scenarios.json | Network latency introduced, packet loss rate | network-chaos-ng |
| Pod Network Scenarios | Chaos/pod-network-scenarios.json | Pod network latency, pod packet loss | pod-network-chaos |
| Zone Outage Scenarios | Chaos/zone-outage-scenarios.json | Zone recovery time, affected node count | zone-outages |
| Cluster Shut Down Scenarios | Chaos/cluster-shut-down-scenarios.json | Node running time, node stopped time | cluster-shut-down |
| Service Hijacking Scenarios | Chaos/service-hijacking-scenarios.json | Service hijacking metrics, service response time | service-hijacking |
| PVC Scenarios | Chaos/pvc-scenarios.json | PVC recovery time, attach/detach duration | pvc-scenarios |
| Time Scenarios | Chaos/time-scenarios.json | Clock skew duration, NTP recovery time | time-scenarios |
| SYN Flood Scenarios | Chaos/syn-flood-scenarios.json | Active connection count during flood, service recovery time | syn-flood |
| KubeVirt Disruption | Chaos/kubevirt-disruption.json | VM recovery time, OVN disruption impact, console health | kubevirt-vm-outage |
| Application Outage Scenarios | Chaos/app-scenarios.json | Console health/downtime duration, etcd latency, OVN master CPU | application-outages |

General / OpenShift Dashboards

These dashboards show cluster-wide health and performance metrics from Prometheus. They are not filtered by run UUID — use them to see the broader cluster impact of any chaos scenario.

| Dashboard | File | Key Panels | Best Used For |
|---|---|---|---|
| API Performance | General/api-performance-overview.json | Request duration (p99) by instance/resource, request rate, read vs write latency | Any scenario that may impact API server responsiveness |
| Etcd | General/etcd-on-cluster-dashboard.json | WAL fsync duration, backend commit duration, compact/defrag, network usage | Pod, node, cluster-shutdown scenarios; anything stressing etcd |
| Node Overview | General/node-overview.json | Total/ready nodes, master vs worker breakdown | Node scenarios, zone outages, cluster shutdowns |
| OCP Performance | General/ocp-performance.json | Cluster-at-a-glance, OVN stack, monitoring stack, kubelet | General health baseline; useful across all scenarios |
| OVN Monitoring | General/ovn-dashboard.json | OVN resource usage, latency, workqueue depth | Network chaos, pod network, zone outage, service hijacking |
| OpenShift Service Health | General/service-health.json | Services up/down, pods ready/not-ready | Any scenario affecting workload availability |
| KubeVirt Performance | General/kubevirt-perf.json | VMI phase status, CPU/memory/network metrics per VM | KubeVirt disruption scenarios |

K8s Dashboards

These dashboards are for generic Kubernetes clusters (non-OpenShift). They provide performance and networking baselines.

| Dashboard | File | Key Panels | Best Used For |
|---|---|---|---|
| K8s Performance | k8s/k8s-perf.json | Cluster details, per-node resource usage | General health baseline on vanilla Kubernetes |
| Networking | k8s/networking-dashboard.json | Received/transmit packets, bandwidth, dropped packets | Network chaos, SYN flood, pod network scenarios |

Viewing Dashboards Per Scenario

Step 1 — Identify your scenario type

Use the table above to find the matching Chaos dashboard for your scenario. For example, if you ran node-cpu-hog, open the Hog Scenarios dashboard.

Step 2 — Filter by UUID

Each Chaos dashboard has a UUID variable at the top. Paste your run UUID (printed in krkn logs, or visible in the Krkn Dashboard Metrics page) to filter all panels to that specific run.
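
Since the run UUID is a standard 36-character identifier, you can also pull it out of a captured log line with a regular expression. The log wording below is a made-up example, not krkn's exact output format:

```shell
# Extract a run UUID from a captured krkn log line.
# The log line is illustrative; the UUID pattern is what matters.
line='2026-04-10 12:00:01 INFO run UUID: 1a2b3c4d-5e6f-4a8b-9c0d-1e2f3a4b5c6d'
uuid=$(printf '%s\n' "$line" | grep -oE '[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}')
echo "$uuid"
```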

Step 3 — Cross-reference with cluster dashboards

While viewing your scenario-specific results, open a second tab with a General or K8s dashboard to correlate:

  • Etcd — check if etcd latency spiked during your run window
  • API Performance — check if API request duration increased
  • KubeVirt Performance — watch VMIs on your cluster
  • Node Overview / OCP Performance — check cluster-wide health impact
  • OVN Monitoring — check networking stack for latency increases

Step 4 — Time range alignment

Set the Grafana time range to match your run’s start/end time. The Chaos dashboards show per-UUID events; the General dashboards show time-series metrics for the same window.
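
Grafana also accepts the time range directly in the dashboard URL as epoch milliseconds, so you can script a link pinned to your run window. The host, dashboard UID, and timestamps below are placeholders:

```shell
# Build a Grafana URL pinned to a run window (from/to in epoch milliseconds).
# Host, dashboard UID, and timestamps are illustrative placeholders.
start_ms=$(( $(date -u -d '2026-04-10T12:00:00Z' +%s) * 1000 ))
end_ms=$(( $(date -u -d '2026-04-10T12:30:00Z' +%s) * 1000 ))
echo "http://<grafana-host>/d/<dashboard-uid>?from=${start_ms}&to=${end_ms}"
```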


Editing and Adding Dashboards

Dashboards can be edited in the Grafana UI (log in as the admin user). Source dashboards are Jsonnet templates in the assets/ directory of the visualize repo and can be rebuilt with make.

To add a new dashboard, see the Adding a new dashboard guide.

Last modified April 10, 2026: adding more details for visualize (bff76b2)