Telemetry
Telemetry run details of the cluster and scenario
Telemetry Details
We wanted more insight into our Krkn runs, in a form that can be post-processed (e.g. by an ML model) to better understand how clusters behave when hit by Krkn. Telemetry is therefore included as an opt-in feature that, depending on the platform (Kubernetes/OCP), gathers different types of data and metadata over the time frame of each chaos run.
The telemetry service currently gathers the following scenario and cluster metadata:
- A JSON file named telemetry.json containing:
    - Chaos run metadata:
        - the duration of the chaos run
        - the config parameters with which the scenario was set up
        - any recovery time details (applicable to pod scenarios and node scenarios only)
        - the exit status of the chaos run
    - Cluster metadata:
        - Node metadata (architecture, cloud instance type, etc.)
        - Node counts
        - Number and type of objects deployed in the cluster
        - Network plugins
        - Cluster version
- A partial/full backup of the Prometheus binary logs (currently available on OCP only)
- Any firing critical alerts on the cluster
This telemetry JSON is printed at the end of every krkn run and can optionally be stored long-term in Elasticsearch. See Elasticsearch Storage for details on storing and querying telemetry, metrics, and alerts.
Deploy your own telemetry AWS service
The krkn-telemetry project aims to provide a basic but fully working example of how to deploy your own Krkn telemetry collection API. We do not currently offer telemetry collection as a service for community users, and we discourage handing over your infrastructure telemetry metadata to third parties, since it may contain confidential information.
The guide below explains how to deploy the service automatically as an AWS Lambda function, but you can just as easily deploy it as a Flask application in a VM or in any Python runtime environment. You can then use it to store data for use in chaos-ai.
https://github.com/krkn-chaos/krkn-telemetry
Sample telemetry config
telemetry:
    enabled: False # enables/disables the telemetry collection feature
    api_url: https://ulnmf9xv7j.execute-api.us-west-2.amazonaws.com/production # telemetry service endpoint
    username: username # telemetry service username
    password: password # telemetry service password
    prometheus_backup: True # enables/disables prometheus data collection
    full_prometheus_backup: False # if set to False, only the /prometheus/wal folder will be downloaded
    backup_threads: 5 # number of telemetry download/upload threads
    archive_path: /tmp # local path where the archive files will be temporarily stored
    max_retries: 0 # maximum number of upload retries (if 0 will retry forever)
    run_tag: '' # if set, this will be appended to the run folder in the bucket (useful to group the runs)
    archive_size: 500000 # size of each prometheus data archive in KB. The smaller the archive size,
                         # the more archive files are produced and uploaded (and processed by backup_threads
                         # simultaneously).
                         # For unstable/slow connections it is better to keep this value low and
                         # increase the number of backup_threads; this way, on upload failure, the retry happens only on the
                         # failed chunk without affecting the whole upload.
    logs_backup: True
    logs_filter_patterns:
        - "(\\w{3}\\s\\d{1,2}\\s\\d{2}:\\d{2}:\\d{2}\\.\\d+).+" # Sep 9 11:20:36.123425532
        - "kinit (\\d+/\\d+/\\d+\\s\\d{2}:\\d{2}:\\d{2})\\s+" # kinit 2023/09/15 11:20:36 log
        - "(\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}\\.\\d+Z).+" # 2023-09-15T11:20:36.123425532Z log
    oc_cli_path: /usr/bin/oc # optional, if not specified it will be searched for in $PATH
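The three `logs_filter_patterns` entries can be tried against the example lines given in their comments. The sketch below is not krkn's implementation: `extract_timestamp` is a hypothetical helper, and it assumes each pattern is matched against a single log line with the first capture group holding the timestamp. Note that YAML double-quoted strings turn `\\w` into `\w`, so the Python raw strings use single backslashes.

```python
import re

# Patterns copied from the sample config, with the YAML escaping removed.
LOGS_FILTER_PATTERNS = [
    r"(\w{3}\s\d{1,2}\s\d{2}:\d{2}:\d{2}\.\d+).+",
    r"kinit (\d+/\d+/\d+\s\d{2}:\d{2}:\d{2})\s+",
    r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z).+",
]


def extract_timestamp(line, patterns=LOGS_FILTER_PATTERNS):
    """Return the first capture group of the first matching pattern, or None.

    Hypothetical helper: how krkn applies the patterns internally is an
    assumption, not something stated in this document.
    """
    for pattern in patterns:
        match = re.search(pattern, line)
        if match:
            return match.group(1)
    return None


if __name__ == "__main__":
    samples = [
        "Sep 9 11:20:36.123425532 some log",
        "kinit 2023/09/15 11:20:36 log",
        "2023-09-15T11:20:36.123425532Z log",
        "a line without any timestamp",
    ]
    for sample in samples:
        print(repr(sample), "->", extract_timestamp(sample))
```

Running a new pattern through a check like this before adding it to the config is a cheap way to catch escaping mistakes.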
Sample output of telemetry
{
  "telemetry": {
    "scenarios": [
      {
        "start_timestamp": 1745343338,
        "end_timestamp": 1745343683,
        "scenario": "scenarios/network_chaos.yaml",
        "scenario_type": "pod_disruption_scenarios",
        "exit_status": 0,
        "parameters_base64": "",
        "parameters": [
          {
            "config": {
              "execution_type": "parallel",
              "instance_count": 1,
              "kubeconfig_path": "/root/.kube/config",
              "label_selector": "node-role.kubernetes.io/master",
              "network_params": {
                "bandwidth": "10mbit",
                "latency": "500ms",
                "loss": "50%"
              },
              "node_interface_name": null,
              "test_duration": 300,
              "wait_duration": 60
            },
            "id": "network_chaos"
          }
        ],
        "affected_pods": {
          "recovered": [],
          "unrecovered": [],
          "error": null
        },
        "affected_nodes": [],
        "cluster_events": []
      }
    ],
    "node_summary_infos": [
      {
        "count": 3,
        "architecture": "amd64",
        "instance_type": "n2-standard-4",
        "nodes_type": "master",
        "kernel_version": "5.14.0-427.60.1.el9_4.x86_64",
        "kubelet_version": "v1.31.6",
        "os_version": "Red Hat Enterprise Linux CoreOS 418.94.202503121207-0"
      },
      {
        "count": 3,
        "architecture": "amd64",
        "instance_type": "n2-standard-4",
        "nodes_type": "worker",
        "kernel_version": "5.14.0-427.60.1.el9_4.x86_64",
        "kubelet_version": "v1.31.6",
        "os_version": "Red Hat Enterprise Linux CoreOS 418.94.202503121207-0"
      }
    ],
    "node_taints": [
      {
        "node_name": "prubenda-g-qdcvv-master-0.c.chaos-438115.internal",
        "effect": "NoSchedule",
        "key": "node-role.kubernetes.io/master",
        "value": null
      },
      {
        "node_name": "prubenda-g-qdcvv-master-1.c.chaos-438115.internal",
        "effect": "NoSchedule",
        "key": "node-role.kubernetes.io/master",
        "value": null
      },
      {
        "node_name": "prubenda-g-qdcvv-master-2.c.chaos-438115.internal",
        "effect": "NoSchedule",
        "key": "node-role.kubernetes.io/master",
        "value": null
      }
    ],
    "kubernetes_objects_count": {
      "ConfigMap": 530,
      "Pod": 294,
      "Deployment": 69,
      "Route": 8,
      "Build": 1
    },
    "network_plugins": [
      "OVNKubernetes"
    ],
    "timestamp": "2025-04-22T17:35:37Z",
    "health_checks": null,
    "total_node_count": 6,
    "cloud_infrastructure": "GCP",
    "cloud_type": "self-managed",
    "cluster_version": "4.18.0-0.nightly-2025-03-13-035622",
    "major_version": "4.18",
    "run_uuid": "96348571-0b06-459e-b654-a1bb6fd66239",
    "job_status": true
  },
  "critical_alerts": null
}
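As a rough illustration of post-processing this output, the sketch below parses a trimmed copy of the sample and derives each scenario's duration from its start and end timestamps. `summarize_run` is a hypothetical helper, not part of krkn, and the trimmed `SAMPLE` keeps only the fields it reads.

```python
import json

# Trimmed copy of the sample telemetry output; only the fields used here are kept.
SAMPLE = """
{
  "telemetry": {
    "scenarios": [
      {
        "start_timestamp": 1745343338,
        "end_timestamp": 1745343683,
        "scenario": "scenarios/network_chaos.yaml",
        "exit_status": 0
      }
    ],
    "run_uuid": "96348571-0b06-459e-b654-a1bb6fd66239",
    "job_status": true
  },
  "critical_alerts": null
}
"""


def summarize_run(document):
    """Reduce a telemetry document to a few per-run and per-scenario facts."""
    telemetry = document["telemetry"]
    scenarios = []
    for scenario in telemetry["scenarios"]:
        scenarios.append({
            "scenario": scenario["scenario"],
            # Both timestamps are Unix epochs, so the difference is in seconds.
            "duration_seconds": scenario["end_timestamp"] - scenario["start_timestamp"],
            "passed": scenario["exit_status"] == 0,
        })
    return {
        "run_uuid": telemetry["run_uuid"],
        "job_status": telemetry["job_status"],
        # critical_alerts is null when nothing is firing; normalise to a list.
        "critical_alerts": document.get("critical_alerts") or [],
        "scenarios": scenarios,
    }


if __name__ == "__main__":
    print(json.dumps(summarize_run(json.loads(SAMPLE)), indent=2))
```

For the sample run this gives a duration of 345 seconds for the single scenario, somewhat longer than the configured `test_duration` of 300.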
Telemetry Output Field Reference
Top-level fields
| Field | Type | Description |
|---|---|---|
| run_uuid | string | Unique identifier for this krkn run, used to correlate records across indices in Elasticsearch |
| timestamp | string (ISO 8601) | Time when the telemetry was captured at the end of the run |
| job_status | bool | Overall pass/fail of the run (true = passed) |
| cloud_infrastructure | string | Cloud provider detected (e.g. GCP, AWS, Azure, baremetal) |
| cloud_type | string | Whether the cluster is self-managed or a managed cloud service |
| cluster_version | string | Full cluster version string (OCP nightly build or Kubernetes version) |
| major_version | string | Major.minor version extracted from cluster_version (e.g. 4.18) |
| total_node_count | int | Total number of nodes in the cluster at the time of the run |
| network_plugins | list | CNI plugins active on the cluster (e.g. OVNKubernetes) |
| kubernetes_objects_count | map | Count of each Kubernetes resource type present cluster-wide (e.g. Pod, Deployment, ConfigMap) |
| critical_alerts | list or null | Any critical Prometheus alerts firing at the end of the run; null if none |
| health_checks | list or null | Results of URL health checks if configured; see Health Checks |
scenarios[] — one entry per scenario executed
| Field | Type | Description |
|---|---|---|
| scenario | string | Path to the scenario config file that was run |
| scenario_type | string | Category of scenario (e.g. pod_disruption_scenarios, node_scenarios) |
| start_timestamp | float | Unix epoch when the scenario started |
| end_timestamp | float | Unix epoch when the scenario finished |
| exit_status | int | 0 = scenario passed; non-zero = scenario failed |
| parameters | list | Scenario configuration parameters as parsed from the config file |
| parameters_base64 | string | Base64-encoded copy of raw parameters (for lossless round-tripping) |
| affected_pods | object | Pods disrupted during the scenario, split into recovered and unrecovered lists with per-pod timing |
| affected_nodes | list | Nodes disrupted during the scenario with timing details (not_ready, ready, stopped, running, terminating durations) |
| cluster_events | list | Kubernetes events captured during the scenario window |
affected_pods.recovered[] / affected_pods.unrecovered[]
| Field | Type | Description |
|---|---|---|
| pod_name | string | Name of the affected pod |
| namespace | string | Namespace the pod belongs to |
| total_recovery_time | float | Seconds from disruption to pod being ready again |
| pod_readiness_time | float | Seconds for the pod to pass its readiness probe after rescheduling |
| pod_rescheduling_time | float | Seconds for the pod to be rescheduled onto a node |
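The per-pod timing fields lend themselves to simple aggregation. The following is a minimal sketch, assuming `affected_pods` has the shape described in this reference; `summarize_pod_recovery` is a hypothetical helper and the pod entries in `EXAMPLE` are made-up illustration data, not output from a real run.

```python
from statistics import mean


def summarize_pod_recovery(affected_pods):
    """Count recovered/unrecovered pods and aggregate total_recovery_time."""
    recovered = affected_pods.get("recovered") or []
    unrecovered = affected_pods.get("unrecovered") or []
    times = [pod["total_recovery_time"] for pod in recovered]
    return {
        "recovered": len(recovered),
        "unrecovered": len(unrecovered),
        # No recovered pods means there is nothing to average.
        "mean_recovery_seconds": mean(times) if times else None,
        "max_recovery_seconds": max(times) if times else None,
    }


# Made-up entries for illustration only.
EXAMPLE = {
    "recovered": [
        {"pod_name": "example-a", "namespace": "demo",
         "total_recovery_time": 12.0, "pod_readiness_time": 4.0,
         "pod_rescheduling_time": 8.0},
        {"pod_name": "example-b", "namespace": "demo",
         "total_recovery_time": 20.0, "pod_readiness_time": 5.0,
         "pod_rescheduling_time": 15.0},
    ],
    "unrecovered": [],
    "error": None,
}

if __name__ == "__main__":
    print(summarize_pod_recovery(EXAMPLE))
```

A non-empty `unrecovered` list is usually the first thing worth alerting on, before looking at the timing aggregates.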
affected_nodes[]
| Field | Type | Description |
|---|---|---|
| node_name | string | Kubernetes node name |
| node_id | string | Cloud provider instance ID |
| not_ready_time | float | Seconds the node spent in NotReady state |
| ready_time | float | Seconds the node spent in Ready state after recovery |
| stopped_time | float | Seconds the node was stopped (cloud stop/start scenarios) |
| running_time | float | Seconds the node was running after restart |
| terminating_time | float | Seconds the node spent terminating |
node_summary_infos[] — one entry per node role/type group
| Field | Type | Description |
|---|---|---|
| count | int | Number of nodes in this group |
| nodes_type | string | Node role (e.g. master, worker) |
| architecture | string | CPU architecture (e.g. amd64, arm64) |
| instance_type | string | Cloud instance type (e.g. n2-standard-4, m5.xlarge) |
| kernel_version | string | Linux kernel version running on these nodes |
| kubelet_version | string | Kubelet version (e.g. v1.31.6) |
| os_version | string | Full OS version string (e.g. RHCOS build identifier) |
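One sanity check on ingested telemetry is to compare the per-group `count` values with `total_node_count`. The sketch below assumes the groups partition the cluster's nodes, which holds for the sample output (3 + 3 = 6) but is not guaranteed by this document; `check_node_counts` is a hypothetical helper.

```python
def check_node_counts(telemetry):
    """Return (sum of group counts, total_node_count, whether they agree)."""
    grouped = sum(group["count"] for group in telemetry.get("node_summary_infos", []))
    total = telemetry["total_node_count"]
    return grouped, total, grouped == total


# Values taken from the sample output; unrelated fields are omitted.
SAMPLE_TELEMETRY = {
    "node_summary_infos": [
        {"count": 3, "nodes_type": "master", "architecture": "amd64"},
        {"count": 3, "nodes_type": "worker", "architecture": "amd64"},
    ],
    "total_node_count": 6,
}

if __name__ == "__main__":
    print(check_node_counts(SAMPLE_TELEMETRY))
```

A mismatch is not necessarily an error, so a check like this is better treated as a prompt to inspect the record than as a hard validation rule.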
node_taints[] — one entry per node taint
| Field | Type | Description |
|---|---|---|
| node_name | string | Full hostname of the tainted node |
| key | string | Taint key (e.g. node-role.kubernetes.io/master) |
| value | string or null | Taint value; null if the taint has no value |
| effect | string | Taint effect: NoSchedule, NoExecute, or PreferNoSchedule |
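Because `node_taints` is a flat list with one entry per taint, grouping it by node is often the first step when inspecting it. This is a minimal sketch with a hypothetical helper, `taints_by_node`, which renders each taint in the `key=value:effect` form used by `kubectl taint` (or `key:effect` when the value is null).

```python
from collections import defaultdict


def taints_by_node(node_taints):
    """Group taint entries by node_name, rendered as key[=value]:effect strings."""
    grouped = defaultdict(list)
    for taint in node_taints:
        value = taint.get("value")
        spec = taint["key"] if value is None else f'{taint["key"]}={value}'
        grouped[taint["node_name"]].append(f'{spec}:{taint["effect"]}')
    return dict(grouped)


# Two entries taken from the sample output.
SAMPLE_TAINTS = [
    {"node_name": "prubenda-g-qdcvv-master-0.c.chaos-438115.internal",
     "effect": "NoSchedule", "key": "node-role.kubernetes.io/master", "value": None},
    {"node_name": "prubenda-g-qdcvv-master-1.c.chaos-438115.internal",
     "effect": "NoSchedule", "key": "node-role.kubernetes.io/master", "value": None},
]

if __name__ == "__main__":
    for node, taints in taints_by_node(SAMPLE_TAINTS).items():
        print(node, taints)
```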