Telemetry

Telemetry run details of the cluster and scenario

Telemetry Details

We wanted to gather more insight into our Krkn runs in a form that can be post-processed (e.g. by an ML model) to better understand the behavior of the clusters hit by krkn. Telemetry is therefore included as an opt-in feature that, depending on the platform (Kubernetes/OCP), gathers different types of data and metadata within the time frame of each chaos run. The telemetry service can currently gather the following data, with the run and cluster metadata collected in a JSON document named telemetry.json:

  • Chaos run metadata:
    • the duration of the chaos run
    • the config parameters with which the scenario has been setup
    • any recovery time details (applicable to pod scenarios and node scenarios only)
    • the exit status of the chaos run
  • Cluster metadata:
    • Node metadata (architecture, cloud instance type etc.)
    • Node counts
    • Number and type of objects deployed in the cluster
    • Network plugins
    • Cluster version
  • A partial/full backup of the prometheus binary logs (currently available on OCP only)
  • Any firing critical alerts on the cluster

This telemetry JSON is printed at the end of every krkn run and can optionally be stored long-term in Elasticsearch. See Elasticsearch Storage for details on storing and querying telemetry, metrics, and alerts.

Deploy your own telemetry AWS service

The krkn-telemetry project aims to provide a basic but fully working example of how to deploy your own Krkn telemetry collection API. We currently do not offer telemetry collection as a service for community users, and we discourage handing over your infrastructure telemetry metadata to third parties, since it may contain confidential information.

The guide below explains how to deploy the service automatically as an AWS Lambda function, but you can just as easily deploy it as a Flask application in a VM or in any Python runtime environment. You can then use it to store data for use in chaos-ai.

https://github.com/krkn-chaos/krkn-telemetry

Sample telemetry config

telemetry:
    enabled: False                                          # enables/disables the telemetry collection feature
    api_url: https://ulnmf9xv7j.execute-api.us-west-2.amazonaws.com/production # telemetry service endpoint
    username: username                                      # telemetry service username
    password: password                                      # telemetry service password
    prometheus_backup: True                                 # enables/disables prometheus data collection
    full_prometheus_backup: False                           # if set to False, only the /prometheus/wal folder will be downloaded
    backup_threads: 5                                       # number of telemetry download/upload threads
    archive_path: /tmp                                      # local path where the archive files will be temporarily stored
    max_retries: 0                                          # maximum number of upload retries (if 0, will retry forever)
    run_tag: ''                                             # if set, this will be appended to the run folder in the bucket (useful to group the runs)
    archive_size: 500000                                    # size of each prometheus data archive in KB. The smaller the archive size,
                                                            # the more archive files will be produced and uploaded (and processed by backup_threads
                                                            # simultaneously).
                                                            # For unstable/slow connections it is better to keep this value low and
                                                            # increase the number of backup_threads; this way, on upload failure, the retry will happen only on the
                                                            # failed chunk without affecting the whole upload.
    logs_backup: True
    logs_filter_patterns:
     - "(\\w{3}\\s\\d{1,2}\\s\\d{2}:\\d{2}:\\d{2}\\.\\d+).+"         # Sep 9 11:20:36.123425532
     - "kinit (\\d+/\\d+/\\d+\\s\\d{2}:\\d{2}:\\d{2})\\s+"          # kinit 2023/09/15 11:20:36 log
     - "(\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}\\.\\d+Z).+"      # 2023-09-15T11:20:36.123425532Z log
    oc_cli_path: /usr/bin/oc                                # optional, if not specified it will be searched for in $PATH
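
To see what the three logs_filter_patterns entries capture, the following minimal Python sketch applies them to log lines modelled on the examples in the config comments. The function name extract_timestamp and the sample lines are hypothetical, and how krkn itself applies these patterns is not described here; the sketch only shows which part of a line each capture group picks out.

```python
import re

# The three patterns from logs_filter_patterns. In the YAML they sit inside
# double-quoted strings, so each backslash is doubled; raw strings avoid that.
LOG_FILTER_PATTERNS = [
    r"(\w{3}\s\d{1,2}\s\d{2}:\d{2}:\d{2}\.\d+).+",
    r"kinit (\d+/\d+/\d+\s\d{2}:\d{2}:\d{2})\s+",
    r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z).+",
]


def extract_timestamp(line):
    """Return the first capture group of the first pattern that matches, else None."""
    for pattern in LOG_FILTER_PATTERNS:
        match = re.search(pattern, line)
        if match:
            return match.group(1)
    return None


# Hypothetical log lines built from the examples in the config comments.
SAMPLE_LINES = [
    "Sep 9 11:20:36.123425532 kubelet started",
    "kinit 2023/09/15 11:20:36 log",
    "2023-09-15T11:20:36.123425532Z log",
    "a line without any timestamp",
]

if __name__ == "__main__":
    for line in SAMPLE_LINES:
        print(repr(extract_timestamp(line)))
```

Each pattern has exactly one capture group, which isolates the timestamp portion of the line; a line that matches none of the patterns yields None in this sketch.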

Sample output of telemetry

{
    "telemetry": {
        "scenarios": [
            {
                "start_timestamp": 1745343338,
                "end_timestamp": 1745343683,
                "scenario": "scenarios/network_chaos.yaml",
                "scenario_type": "pod_disruption_scenarios",
                "exit_status": 0,
                "parameters_base64": "",
                "parameters": [
                    {
                        "config": {
                            "execution_type": "parallel",
                            "instance_count": 1,
                            "kubeconfig_path": "/root/.kube/config",
                            "label_selector": "node-role.kubernetes.io/master",
                            "network_params": {
                                "bandwidth": "10mbit",
                                "latency": "500ms",
                                "loss": "50%"
                            },
                            "node_interface_name": null,
                            "test_duration": 300,
                            "wait_duration": 60
                        },
                        "id": "network_chaos"
                    }
                ],
                "affected_pods": {
                    "recovered": [],
                    "unrecovered": [],
                    "error": null
                },
                "affected_nodes": [],
                "cluster_events": []
            }
        ],
        "node_summary_infos": [
            {
                "count": 3,
                "architecture": "amd64",
                "instance_type": "n2-standard-4",
                "nodes_type": "master",
                "kernel_version": "5.14.0-427.60.1.el9_4.x86_64",
                "kubelet_version": "v1.31.6",
                "os_version": "Red Hat Enterprise Linux CoreOS 418.94.202503121207-0"
            },
            {
                "count": 3,
                "architecture": "amd64",
                "instance_type": "n2-standard-4",
                "nodes_type": "worker",
                "kernel_version": "5.14.0-427.60.1.el9_4.x86_64",
                "kubelet_version": "v1.31.6",
                "os_version": "Red Hat Enterprise Linux CoreOS 418.94.202503121207-0"
            }
        ],
        "node_taints": [
            {
                "node_name": "prubenda-g-qdcvv-master-0.c.chaos-438115.internal",
                "effect": "NoSchedule",
                "key": "node-role.kubernetes.io/master",
                "value": null
            },
            {
                "node_name": "prubenda-g-qdcvv-master-1.c.chaos-438115.internal",
                "effect": "NoSchedule",
                "key": "node-role.kubernetes.io/master",
                "value": null
            },
            {
                "node_name": "prubenda-g-qdcvv-master-2.c.chaos-438115.internal",
                "effect": "NoSchedule",
                "key": "node-role.kubernetes.io/master",
                "value": null
            }
        ],
        "kubernetes_objects_count": {
            "ConfigMap": 530,
            "Pod": 294,
            "Deployment": 69,
            "Route": 8,
            "Build": 1
        },
        "network_plugins": [
            "OVNKubernetes"
        ],
        "timestamp": "2025-04-22T17:35:37Z",
        "health_checks": null,
        "total_node_count": 6,
        "cloud_infrastructure": "GCP",
        "cloud_type": "self-managed",
        "cluster_version": "4.18.0-0.nightly-2025-03-13-035622",
        "major_version": "4.18",
        "run_uuid": "96348571-0b06-459e-b654-a1bb6fd66239",
        "job_status": true
    },
    "critical_alerts": null
}
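
As an illustration of post-processing this output, the Python sketch below parses a trimmed copy of the sample and derives a per-scenario duration and pass/fail flag. The function name summarize_run and the shape of the summary it returns are assumptions made for this example, not part of krkn.

```python
import json

# A trimmed copy of the sample output above; only the fields used below are kept.
SAMPLE = """
{
  "telemetry": {
    "scenarios": [
      {
        "start_timestamp": 1745343338,
        "end_timestamp": 1745343683,
        "scenario": "scenarios/network_chaos.yaml",
        "scenario_type": "pod_disruption_scenarios",
        "exit_status": 0
      }
    ],
    "run_uuid": "96348571-0b06-459e-b654-a1bb6fd66239",
    "job_status": true
  },
  "critical_alerts": null
}
"""


def summarize_run(document):
    """Build a compact summary dict from a parsed telemetry document."""
    telemetry = document["telemetry"]
    scenarios = []
    for entry in telemetry["scenarios"]:
        scenarios.append({
            "scenario": entry["scenario"],
            "duration_seconds": entry["end_timestamp"] - entry["start_timestamp"],
            "passed": entry["exit_status"] == 0,
        })
    return {
        "run_uuid": telemetry["run_uuid"],
        "job_passed": telemetry["job_status"],
        # critical_alerts is null when nothing is firing, so treat it as an empty list.
        "critical_alert_count": len(document.get("critical_alerts") or []),
        "scenarios": scenarios,
    }


if __name__ == "__main__":
    print(json.dumps(summarize_run(json.loads(SAMPLE)), indent=2))
```

For the sample run, the single scenario lasted 345 seconds (end_timestamp minus start_timestamp) and passed with exit status 0.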

Telemetry Output Field Reference

Top-level fields

| Field | Type | Description |
| --- | --- | --- |
| run_uuid | string | Unique identifier for this krkn run, used to correlate records across indices in Elasticsearch |
| timestamp | string (ISO 8601) | Time when the telemetry was captured at the end of the run |
| job_status | bool | Overall pass/fail of the run (true = passed) |
| cloud_infrastructure | string | Cloud provider detected (e.g. GCP, AWS, Azure, baremetal) |
| cloud_type | string | Whether the cluster is self-managed or a managed cloud service |
| cluster_version | string | Full cluster version string (OCP nightly build or Kubernetes version) |
| major_version | string | Major.minor version extracted from cluster_version (e.g. 4.18) |
| total_node_count | int | Total number of nodes in the cluster at the time of the run |
| network_plugins | list | CNI plugins active on the cluster (e.g. OVNKubernetes) |
| kubernetes_objects_count | map | Count of each Kubernetes resource type present cluster-wide (e.g. Pod, Deployment, ConfigMap) |
| critical_alerts | list or null | Any critical Prometheus alerts firing at the end of the run; null if none |
| health_checks | list or null | Results of URL health checks if configured; see Health Checks |
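
The relationship between cluster_version and major_version can be sketched as follows. This is a minimal Python illustration that assumes the major.minor part is simply the first two dot-separated components; the helper name major_minor is hypothetical, and krkn's actual extraction logic may differ.

```python
def major_minor(cluster_version):
    """Return the first two dot-separated components of a version string."""
    return ".".join(cluster_version.split(".")[:2])


if __name__ == "__main__":
    # cluster_version taken from the sample output.
    print(major_minor("4.18.0-0.nightly-2025-03-13-035622"))
```

Applied to the cluster_version in the sample output, this yields 4.18, which matches the major_version reported there.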

scenarios[] — one entry per scenario executed

| Field | Type | Description |
| --- | --- | --- |
| scenario | string | Path to the scenario config file that was run |
| scenario_type | string | Category of scenario (e.g. pod_disruption_scenarios, node_scenarios) |
| start_timestamp | float | Unix epoch when the scenario started |
| end_timestamp | float | Unix epoch when the scenario finished |
| exit_status | int | 0 = scenario passed; non-zero = scenario failed |
| parameters | list | Scenario configuration parameters as parsed from the config file |
| parameters_base64 | string | Base64-encoded copy of raw parameters (for lossless round-tripping) |
| affected_pods | object | Pods disrupted during the scenario, split into recovered and unrecovered lists with per-pod timing |
| affected_nodes | list | Nodes disrupted during the scenario with timing details (not_ready, ready, stopped, running, terminating durations) |
| cluster_events | list | Kubernetes events captured during the scenario window |

affected_pods.recovered[] / affected_pods.unrecovered[]

| Field | Type | Description |
| --- | --- | --- |
| pod_name | string | Name of the affected pod |
| namespace | string | Namespace the pod belongs to |
| total_recovery_time | float | Seconds from disruption to pod being ready again |
| pod_readiness_time | float | Seconds for the pod to pass its readiness probe after rescheduling |
| pod_rescheduling_time | float | Seconds for the pod to be rescheduled onto a node |

affected_nodes[]

| Field | Type | Description |
| --- | --- | --- |
| node_name | string | Kubernetes node name |
| node_id | string | Cloud provider instance ID |
| not_ready_time | float | Seconds the node spent in NotReady state |
| ready_time | float | Seconds the node spent in Ready state after recovery |
| stopped_time | float | Seconds the node was stopped (cloud stop/start scenarios) |
| running_time | float | Seconds the node was running after restart |
| terminating_time | float | Seconds the node spent terminating |
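
The affected_pods fields described above lend themselves to simple recovery statistics. The Python sketch below aggregates total_recovery_time over the recovered list and names the pods left in unrecovered. The pod names, namespaces and timings are invented for illustration (the sample output has empty lists), and recovery_summary is a hypothetical helper, not a krkn function.

```python
from statistics import mean

# Hypothetical affected_pods object; pod names and timings are invented for illustration.
AFFECTED_PODS = {
    "recovered": [
        {"pod_name": "web-0", "namespace": "demo", "total_recovery_time": 12.5,
         "pod_readiness_time": 4.5, "pod_rescheduling_time": 8.0},
        {"pod_name": "web-1", "namespace": "demo", "total_recovery_time": 7.5,
         "pod_readiness_time": 2.5, "pod_rescheduling_time": 5.0},
    ],
    "unrecovered": [
        {"pod_name": "web-2", "namespace": "demo"},
    ],
    "error": None,
}


def recovery_summary(affected_pods):
    """Summarize recovery timings and list the pods that never recovered."""
    recovered = affected_pods.get("recovered") or []
    unrecovered = affected_pods.get("unrecovered") or []
    times = [pod["total_recovery_time"] for pod in recovered]
    return {
        "recovered_count": len(recovered),
        "unrecovered_count": len(unrecovered),
        # Both statistics are None when no pod recovered.
        "mean_recovery_seconds": mean(times) if times else None,
        "max_recovery_seconds": max(times) if times else None,
        "unrecovered_pods": [f'{pod["namespace"]}/{pod["pod_name"]}' for pod in unrecovered],
    }


if __name__ == "__main__":
    print(recovery_summary(AFFECTED_PODS))
```

The same approach would apply to the timing fields of affected_nodes[], for example taking the maximum not_ready_time across nodes.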

node_summary_infos[] — one entry per node role/type group

| Field | Type | Description |
| --- | --- | --- |
| count | int | Number of nodes in this group |
| nodes_type | string | Node role (e.g. master, worker) |
| architecture | string | CPU architecture (e.g. amd64, arm64) |
| instance_type | string | Cloud instance type (e.g. n2-standard-4, m5.xlarge) |
| kernel_version | string | Linux kernel version running on these nodes |
| kubelet_version | string | Kubelet version (e.g. v1.31.6) |
| os_version | string | Full OS version string (e.g. RHCOS build identifier) |
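
One way to use node_summary_infos is to total the node counts per role and compare the sum with total_node_count. The Python sketch below does this for the values from the sample output. That the group counts always add up to total_node_count is an assumption that holds for the sample (3 + 3 = 6), not a documented guarantee, and nodes_by_role is a hypothetical helper.

```python
# Values taken from the sample output, reduced to the fields used here.
NODE_SUMMARY_INFOS = [
    {"count": 3, "nodes_type": "master", "instance_type": "n2-standard-4"},
    {"count": 3, "nodes_type": "worker", "instance_type": "n2-standard-4"},
]
TOTAL_NODE_COUNT = 6


def nodes_by_role(node_summary_infos):
    """Sum the count of every group that shares the same nodes_type."""
    totals = {}
    for group in node_summary_infos:
        role = group["nodes_type"]
        totals[role] = totals.get(role, 0) + group["count"]
    return totals


if __name__ == "__main__":
    per_role = nodes_by_role(NODE_SUMMARY_INFOS)
    print(per_role)
    # Assumption: group counts add up to total_node_count, as in the sample.
    print(sum(per_role.values()) == TOTAL_NODE_COUNT)
```

Summing per role rather than per entry means that a role split across several instance types or kernel versions is still counted once.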

node_taints[] — one entry per node taint

| Field | Type | Description |
| --- | --- | --- |
| node_name | string | Full hostname of the tainted node |
| key | string | Taint key (e.g. node-role.kubernetes.io/master) |
| value | string or null | Taint value; null if the taint has no value |
| effect | string | Taint effect: NoSchedule, NoExecute, or PreferNoSchedule |
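
Because node_taints[] holds one entry per taint rather than per node, grouping the entries by node_name is a common first step. The Python sketch below does that and renders each taint in the key[=value]:effect notation used by kubectl. The helper names format_taint and taints_by_node are hypothetical, and the second sample entry is invented to show a taint that carries a value.

```python
from collections import defaultdict

# First entry is from the sample output; the second is invented to show a non-null value.
NODE_TAINTS = [
    {"node_name": "prubenda-g-qdcvv-master-0.c.chaos-438115.internal",
     "effect": "NoSchedule", "key": "node-role.kubernetes.io/master", "value": None},
    {"node_name": "example-worker-0", "effect": "NoExecute",
     "key": "dedicated", "value": "chaos"},
]


def format_taint(taint):
    """Render a taint as key[=value]:effect."""
    key = taint["key"]
    if taint["value"] is not None:
        key = f'{key}={taint["value"]}'
    return f'{key}:{taint["effect"]}'


def taints_by_node(node_taints):
    """Group formatted taints under the node they belong to."""
    grouped = defaultdict(list)
    for taint in node_taints:
        grouped[taint["node_name"]].append(format_taint(taint))
    return dict(grouped)


if __name__ == "__main__":
    for node, taints in taints_by_node(NODE_TAINTS).items():
        print(node, taints)
```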