Elasticsearch Storage

Storing telemetry, metrics, and alerts in Elasticsearch or OpenSearch

Krkn can push the data it collects at the end of each run into Elasticsearch (or OpenSearch) so you can query, visualize, and compare results across many runs over time. Three separate indices are written: one for telemetry, one for Prometheus metrics snapshots, and one for firing alerts.

See equivalent config parameters: krknctl flags · krkn-hub variables · krkn config file


Configuration

elastic:
    enable_elastic: True           # master switch; metrics and alerts also need their performance_monitoring flags
    verify_certs: False
    elastic_url: "https://my-elasticsearch.example.com"
    elastic_port: 9200
    username: "elastic"
    password: "changeme"
    telemetry_index: "krkn-telemetry"   # index for telemetry data
    metrics_index: "krkn-metrics"       # index for Prometheus metric snapshots
    alerts_index: "krkn-alerts"         # index for firing alerts
  • enable_elastic: Set to True to activate Elasticsearch storage. Telemetry is always written when enabled. Metrics are written only when performance_monitoring.enable_metrics: True. Alerts are written only when performance_monitoring.enable_alerts: True.
  • verify_certs: Set to True to enforce SSL certificate verification.
  • elastic_url: Full URL of your Elasticsearch or OpenSearch server (include https:// if using TLS).
  • elastic_port: Port to connect on. Default is 9200 for most deployments; some managed services use 443.
  • metrics_index: Index where Prometheus metric snapshots are stored.
  • alerts_index: Index where Prometheus alerts are stored.
  • telemetry_index: Index where the full chaos run telemetry document is stored.
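
The gating rules above can be sketched in a few lines of Python. This is an illustration only, not krkn's own code: the function name `indices_to_write` is hypothetical, and the dictionary is assumed to be the config file after YAML parsing.

```python
def indices_to_write(config: dict) -> list:
    """Return the index names a run would write to, following the rules above."""
    elastic = config.get("elastic", {})
    perf = config.get("performance_monitoring", {})
    if not elastic.get("enable_elastic", False):
        return []  # storage disabled: nothing is written
    targets = [elastic["telemetry_index"]]  # telemetry is always written when enabled
    if perf.get("enable_metrics", False):
        targets.append(elastic["metrics_index"])
    if perf.get("enable_alerts", False):
        targets.append(elastic["alerts_index"])
    return targets


example_config = {
    "elastic": {
        "enable_elastic": True,
        "telemetry_index": "krkn-telemetry",
        "metrics_index": "krkn-metrics",
        "alerts_index": "krkn-alerts",
    },
    "performance_monitoring": {"enable_metrics": True, "enable_alerts": False},
}
print(indices_to_write(example_config))  # ['krkn-telemetry', 'krkn-metrics']
```

With both performance_monitoring flags off, only the telemetry index is returned, which matches the behaviour described for enable_elastic.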

Krkn auto-detects whether the backend is Elasticsearch or OpenSearch by querying the cluster info endpoint — no extra config is needed to switch between them.
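
How krkn performs this detection is not shown here. As a rough sketch of the idea, the response of the cluster info endpoint (`GET /`) can be classified as below. It assumes that OpenSearch reports `version.distribution` as `"opensearch"` while Elasticsearch omits that field; the function name `detect_backend` and the trimmed sample payloads are hypothetical.

```python
def detect_backend(info: dict) -> str:
    """Classify a parsed cluster info response as 'opensearch' or 'elasticsearch'."""
    version = info.get("version", {})
    # Assumption: OpenSearch sets version.distribution, Elasticsearch does not.
    if version.get("distribution") == "opensearch":
        return "opensearch"
    return "elasticsearch"


# Trimmed, hypothetical sample responses for illustration.
opensearch_info = {"version": {"distribution": "opensearch", "number": "2.11.0"}}
elasticsearch_info = {"version": {"number": "8.13.0"}, "tagline": "You Know, for Search"}

print(detect_backend(opensearch_info))     # opensearch
print(detect_backend(elasticsearch_info))  # elasticsearch
```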


Telemetry Index

One document per krkn run is written to telemetry_index. It contains all the fields printed in the telemetry output plus several additional fields that are only populated when running in a CI environment.

Top-level fields

| Field | Type | Description |
| --- | --- | --- |
| run_uuid | keyword | Unique run identifier; use this to join across all three indices |
| timestamp | text | ISO 8601 timestamp when telemetry was captured |
| job_status | bool | true = run passed overall |
| cloud_infrastructure | text | Cloud provider (e.g. GCP, AWS, Azure, baremetal) |
| cloud_type | text | self-managed or managed service type |
| cluster_version | text | Full version string of the cluster |
| major_version | text | Major.minor extracted from cluster_version |
| total_node_count | integer | Total node count at time of run |
| network_plugins | text (multi) | CNI plugins active on the cluster |
| kubernetes_objects_count | nested | Map of resource type → count (e.g. Pod: 294) |
| fips_enabled | bool | Whether FIPS mode was enabled on the cluster |
| etcd_encryption_enabled | bool | Whether etcd encryption at rest was enabled |
| ipsec_enabled | bool | Whether IPSec was enabled on the cluster |
| build_url | text | CI build URL (e.g. Prow job link); populated in CI runs |
| tag | text | Optional run tag for grouping related runs |

scenarios (nested)

One entry per scenario executed. See the scenario fields in the telemetry reference for the full list. Additional fields stored only in Elasticsearch:

| Field | Type | Description |
| --- | --- | --- |
| affected_vmis | nested | VirtualMachineInstances disrupted, with per-VMI recovery timing |
| affected_vmis[].vmi_name | text | Name of the affected VMI |
| affected_vmis[].namespace | text | Namespace the VMI belongs to |
| affected_vmis[].total_recovery_time | float | Seconds from disruption to VMI ready |
| affected_vmis[].vmi_readiness_time | float | Seconds for VMI to pass readiness after rescheduling |
| affected_vmis[].vmi_rescheduling_time | float | Seconds for VMI to be rescheduled |

node_summary_infos, node_taints (nested)

See node_summary_infos and node_taints in the telemetry reference; the fields stored here are identical.

health_checks (nested)

Populated when health checks are configured.

| Field | Type | Description |
| --- | --- | --- |
| url | text | URL that was checked |
| status | bool | true = URL was reachable and returned a success code |
| status_code | text | HTTP status code returned |
| start_timestamp | date | When the health check window started |
| end_timestamp | date | When the health check window ended |
| duration | float | Total duration of the health check window in seconds |
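
Once a telemetry document has been fetched, the health_checks entries can be reduced to a per-URL unavailability figure. The sketch below assumes each entry describes one window with the fields from the table above; the function name `unreachable_seconds` and the sample URLs are hypothetical.

```python
def unreachable_seconds(health_checks: list) -> dict:
    """Sum the duration of every window whose status is false, grouped by URL."""
    totals = {}
    for check in health_checks:
        if not check["status"]:
            totals[check["url"]] = totals.get(check["url"], 0.0) + check["duration"]
    return totals


# Hypothetical health_checks entries taken from one telemetry document.
sample_checks = [
    {"url": "https://app.example.com/healthz", "status": False, "duration": 12.5},
    {"url": "https://app.example.com/healthz", "status": True, "duration": 60.0},
    {"url": "https://app.example.com/healthz", "status": False, "duration": 7.5},
]
print(unreachable_seconds(sample_checks))  # {'https://app.example.com/healthz': 20.0}
```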

virt_checks / post_virt_checks (nested)

Populated when virtual machine checks are enabled. virt_checks captures pre/during chaos state; post_virt_checks captures the post-chaos state.

| Field | Type | Description |
| --- | --- | --- |
| vm_name | text | Name of the VirtualMachine |
| namespace | text | Namespace the VM belongs to |
| ip_address | text | IP address before the chaos event |
| new_ip_address | text | IP address after recovery (may differ if pod was rescheduled) |
| node_name | text | Node the VM was running on |
| ssh_status | bool | Whether SSH was reachable on the VM |
| vmi_ready | bool | Whether the VMI reached ready state |
| status | bool | Overall pass/fail for this VM check |
| check_type | keyword | Type of check performed |
| start_timestamp | date | When this check started |
| end_timestamp | date | When this check completed |
| duration | float | Duration of the check in seconds |

error_logs (nested)

Error-level log lines captured from the krkn run.

| Field | Type | Description |
| --- | --- | --- |
| timestamp | text | Timestamp from the log line |
| message | text | Log message content |

overall_resiliency_report (nested)

Populated when Resiliency Scoring is enabled.

| Field | Type | Description |
| --- | --- | --- |
| resiliency_score | integer | Score from 0–100 representing system stability |
| passed_slos | integer | Number of SLOs that passed during the run |
| total_slos | integer | Total number of SLOs evaluated |
| scenarios | nested | Per-scenario resiliency breakdown |

Metrics Index

One document per metric data point is written to metrics_index. Metrics are collected from Prometheus based on the queries in your metrics_profile file. Requires performance_monitoring.enable_metrics: True.

| Field | Type | Description |
| --- | --- | --- |
| run_uuid | keyword | Run identifier; join with the telemetry index to get cluster context |
| timestamp | date | Timestamp of the metric sample |
| (dynamic) | varies | Each field in the Prometheus query result is stored as an additional field on the document (e.g. metricName, value, labels) |

Example document shape (varies by metrics_profile queries):

{
  "run_uuid": "96348571-0b06-459e-b654-a1bb6fd66239",
  "timestamp": "2025-04-22T17:35:00Z",
  "metricName": "apiserverRequestRate",
  "value": 0.42
}
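
Because every data point is its own document, a per-metric summary has to be aggregated after retrieval (or with a server-side aggregation). The following client-side sketch assumes the documents follow the example shape above, with `metricName` and a numeric `value`; documents without those fields are skipped, since the shape varies by metrics_profile. The function name `summarize_metrics` is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def summarize_metrics(docs: list) -> dict:
    """Group metric documents by metricName and report count, mean, and max."""
    by_name = defaultdict(list)
    for doc in docs:
        value = doc.get("value")
        # Only documents with a name and a numeric value can be summarized.
        if "metricName" in doc and isinstance(value, (int, float)):
            by_name[doc["metricName"]].append(value)
    return {
        name: {"count": len(values), "mean": mean(values), "max": max(values)}
        for name, values in by_name.items()
    }


# Sample documents in the shape shown above (values are made up).
sample_docs = [
    {"run_uuid": "96348571-0b06-459e-b654-a1bb6fd66239", "metricName": "apiserverRequestRate", "value": 0.5},
    {"run_uuid": "96348571-0b06-459e-b654-a1bb6fd66239", "metricName": "apiserverRequestRate", "value": 1.5},
    {"run_uuid": "96348571-0b06-459e-b654-a1bb6fd66239", "labels": {"node": "worker-0"}},
]
print(summarize_metrics(sample_docs))
```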

Alerts Index

One document per firing alert is written to alerts_index. Alerts are collected from Prometheus based on your alert_profile file. Requires performance_monitoring.enable_alerts: True.

| Field | Type | Description |
| --- | --- | --- |
| run_uuid | keyword | Run identifier; join with the telemetry index to get cluster context |
| severity | text | Alert severity label (e.g. critical, warning) |
| alert | text | Alert name and labels as a string |
| created_at | date | Timestamp when the alert was captured (ISO 8601) |

Example document:

{
  "run_uuid": "96348571-0b06-459e-b654-a1bb6fd66239",
  "severity": "critical",
  "alert": "KubeNodeNotReady{node=\"worker-0\"}",
  "created_at": "2025-04-22T17:20:15Z"
}
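
A quick way to get an overview of a run is to tally its alert documents by severity. This sketch assumes the documents have the fields listed above; the function name `alerts_by_severity` is hypothetical, and the second sample alert is made up for illustration.

```python
from collections import Counter


def alerts_by_severity(docs: list) -> Counter:
    """Count alert documents per severity label."""
    return Counter(doc.get("severity", "unknown") for doc in docs)


sample_alerts = [
    {"severity": "critical", "alert": "KubeNodeNotReady{node=\"worker-0\"}"},
    {"severity": "warning", "alert": "ExampleWarningAlert{}"},  # made-up alert name
    {"severity": "critical", "alert": "KubeNodeNotReady{node=\"worker-1\"}"},
]
counts = alerts_by_severity(sample_alerts)
print(counts["critical"], counts["warning"])  # 2 1
```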

Querying across indices

All three indices share run_uuid as a common key, making it straightforward to correlate cluster state with metric behavior for a specific run:

GET krkn-metrics/_search
{
  "query": {
    "match": { "run_uuid": "96348571-0b06-459e-b654-a1bb6fd66239" }
  }
}
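
To fetch the matching documents from all three indices in one round trip, the same filter can be sent through the multi-search endpoint (`POST /_msearch`, content type `application/x-ndjson`). The sketch below only builds the request body. It uses a `term` query because run_uuid is listed as a keyword field; the function name `build_msearch_body` and the `size` default are assumptions for illustration.

```python
import json


def build_msearch_body(run_uuid: str, indices: list, size: int = 100) -> str:
    """Build an NDJSON _msearch body with one run_uuid search per index."""
    lines = []
    for index in indices:
        lines.append(json.dumps({"index": index}))  # header line: target index
        lines.append(json.dumps({"query": {"term": {"run_uuid": run_uuid}}, "size": size}))
    return "\n".join(lines) + "\n"  # the body must end with a newline


body = build_msearch_body(
    "96348571-0b06-459e-b654-a1bb6fd66239",
    ["krkn-telemetry", "krkn-metrics", "krkn-alerts"],
)
print(body)
```

The responses come back in the same order as the searches, so the first entry holds the telemetry document and the other two hold the metric and alert documents for that run.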

To visualize stored data with pre-built dashboards wired to both Elasticsearch and Prometheus, see krkn-visualize.