What is krkn-ai?
Krkn-AI lets you automatically run Chaos scenarios and discover the most effective experiments to evaluate your system’s resilience.
How does it work?
Krkn-AI leverages evolutionary algorithms to generate experiments based on Krkn scenarios. By using user-defined objectives such as SLOs and application health checks, it can identify the critical experiments that impact the cluster.
- Generate a Krkn-AI config file using the discover command. Running this command generates a YAML file that is pre-populated with cluster component information and a basic setup.
- The config file can be further customized to suit your requirements for Krkn-AI testing.
- Start Krkn-AI testing:
  - The evolutionary algorithm will use the cluster components specified in the config file as possible inputs required to run the Chaos scenarios.
  - User-defined SLOs and application health check feedback are taken into account to guide the algorithm.
- Analyze the results: evaluate the impact of different Chaos scenarios on application liveness and review their fitness scores.
Getting Started
Follow the installation steps to set up the Krkn-AI CLI.
1 - Getting Started
How to deploy sample microservice and run Krkn-AI test
Getting Started with Krkn-AI
This guide explains how to deploy a sample microservice application on a Kubernetes cluster and run a Krkn-AI test against it.
Prerequisites
- Follow this guide to install Krkn-AI CLI.
- Krkn-AI uses Thanos Querier to fetch SLO metrics using PromQL. You can install it by setting up prometheus-operator in your cluster.
Deploy Sample Microservice
For demonstration purposes, we will deploy a sample microservice application called robot-shop on the cluster:
# Change to Krkn-AI project directory
cd krkn-ai/
# Namespace where to deploy the microservice application
export DEMO_NAMESPACE=robot-shop
# Whether the K8s cluster is an OpenShift cluster
export IS_OPENSHIFT=true
./scripts/setup-demo-microservice.sh
# Set context to the demo namespace
oc config set-context --current --namespace=$DEMO_NAMESPACE
# If you are using kubectl:
# kubectl config set-context --current --namespace=$DEMO_NAMESPACE
# Check whether pods are running
oc get pods
We will deploy an NGINX reverse proxy and a LoadBalancer service in the cluster to expose the routes for some of the pods.
# Setup NGINX reverse proxy for external access
./scripts/setup-nginx.sh
# Check nginx pod
oc get pods -l app=nginx-proxy
# Test application endpoints
./scripts/test-nginx-routes.sh
export HOST="http://$(kubectl get service rs -o json | jq -r '.status.loadBalancer.ingress[0].hostname')"
Note
If your cluster uses Ingress or custom annotations to expose the services, make sure to follow those steps.
Generate Configuration
Krkn-AI uses YAML configuration files to define experiments. You can generate a sample config file dynamically by running the Krkn-AI discover command.
$ uv run krkn_ai discover --help
Usage: krkn_ai discover [OPTIONS]
Discover components for Krkn-AI tests
Options:
-k, --kubeconfig TEXT Path to cluster kubeconfig file.
-o, --output TEXT Path to save config file. [default: ./krkn-ai.yaml]
-n, --namespace TEXT Namespace(s) to discover components in. Supports
Regex and comma separated values. [default: .*]
-pl, --pod-label TEXT Pod Label Keys(s) to filter. Supports Regex and
comma separated values. [default: .*]
-nl, --node-label TEXT Node Label Keys(s) to filter. Supports Regex and
comma separated values. [default: .*]
-v, --verbose Increase verbosity of output. [default: 0]
--help Show this message and exit.
# Discover components in cluster to generate the config
$ uv run krkn_ai discover -k ./path/to/kubeconfig.yaml -n "robot-shop" -pl "service" -o ./krkn-ai.yaml
The discover command generates a YAML file as output that contains the initial boilerplate for testing. You can modify this file to include custom SLO definitions and cluster components, and to configure algorithm settings for your testing use case.
# Path to your kubeconfig file
kubeconfig_file_path: "./path/to/kubeconfig.yaml"
# Genetic algorithm parameters
generations: 5
population_size: 10
composition_rate: 0.3
population_injection_rate: 0.1
# Fitness function configuration for defining SLO
# In the below example, we use Total Restarts in "robot-shop" namespace as the SLO
fitness_function:
  query: 'sum(kube_pod_container_status_restarts_total{namespace="robot-shop"})'
  type: point
  # Whether to include non-zero exit code status in the fitness function scoring
  include_krkn_failure: true
# Health endpoints for synthetic monitoring of applications
health_checks:
  stop_watcher_on_failure: false
  applications:
  - name: cart
    url: "$HOST/cart/add/1/Watson/1"
  - name: catalogue
    url: "$HOST/catalogue/categories"
# Chaos scenarios to consider during testing
scenario:
  pod-scenarios:
    enable: true
  application-outages:
    enable: true
  container-scenarios:
    enable: false
  node-cpu-hog:
    enable: false
  node-memory-hog:
    enable: false
# Cluster components to consider for Krkn-AI testing
cluster_components:
  namespaces:
  - name: robot-shop
    pods:
    - containers:
      - name: cart
      labels:
        service: cart
      name: cart-7cd6c77dbf-j4gsv
    - containers:
      - name: catalogue
      labels:
        service: catalogue
      name: catalogue-94df6b9b-pjgsr
  nodes:
  - labels:
      kubernetes.io/hostname: node-1
    name: node-1
  - labels:
      kubernetes.io/hostname: node-2
    name: node-2
Running Krkn-AI
Once your test configuration is set, you can start Krkn-AI testing using the run command. This command initializes a random population sample containing Chaos experiments based on the Krkn-AI configuration, then starts the evolutionary algorithm to run the experiments, gather feedback, and continue evolving existing scenarios until the total number of generations defined in the config is reached.
$ uv run krkn_ai run --help
Usage: krkn_ai run [OPTIONS]
Run Krkn-AI tests
Options:
-c, --config TEXT Path to Krkn-AI config file.
-o, --output TEXT Directory to save results.
-f, --format [json|yaml] Format of the output file. [default: yaml]
-r, --runner-type [krknctl|krknhub] Type of chaos engine to use.
-p, --param TEXT Additional parameters for config file in key=value format.
-v, --verbose Increase verbosity of output. [default: 0]
--help Show this message and exit.
# Configure Prometheus
# (Optional) On an OpenShift cluster, the framework automatically looks for Thanos Querier in the openshift-monitoring namespace.
export PROMETHEUS_URL='https://Thanos-Querier-url'
export PROMETHEUS_TOKEN='enter-access-token'
# Start Krkn-AI test
uv run krkn_ai run -vv -c ./krkn-ai.yaml -o ./tmp/results/ -p HOST=$HOST
Understanding the Results
In the ./tmp/results directory, you will find the results from testing. The final results contain information about each scenario, their fitness evaluation scores, reports, and graphs, which you can use to investigate further.
.
└── results/
    ├── reports/
    │   ├── best_scenarios.yaml
    │   ├── health_check_report.csv
    │   └── graphs/
    │       ├── best_generation.png
    │       ├── scenario_1.png
    │       ├── scenario_2.png
    │       └── ...
    ├── yaml/
    │   ├── generation_0/
    │   │   ├── scenario_1.yaml
    │   │   ├── scenario_2.yaml
    │   │   └── ...
    │   └── generation_1/
    │       └── ...
    ├── log/
    │   ├── scenario_1.log
    │   ├── scenario_2.log
    │   └── ...
    └── krkn-ai.yaml
Reports Directory:
- health_check_report.csv: Summary of application health checks containing details about the scenario, component, failure status and latency.
- best_scenarios.yaml: YAML file containing information about the best scenario identified in each generation.
- best_generation.png: Visualization of the best fitness score found in each generation.
- scenario_<id>.png: Visualization of the response time line plot for health checks and a heatmap of successes and failures.
YAML:
- scenario_<id>.yaml: YAML file detailing the Chaos scenario that was executed, including the krknctl command, fitness scores, health check metrics, etc. These files are organised under each generation folder.
Log:
- scenario_<id>.log: Logs captured from the krknctl scenario.
2 - Cluster Discovery
Automatically discover cluster components for Krkn-AI testing.
Krkn-AI uses a genetic algorithm to generate Chaos scenarios. These scenarios require information about the components available in the cluster, which is obtained from the cluster_components YAML field of the Krkn-AI configuration.
CLI Usage
$ uv run krkn_ai discover --help
Usage: krkn_ai discover [OPTIONS]
Discover components for Krkn-AI tests
Options:
-k, --kubeconfig TEXT Path to cluster kubeconfig file.
-o, --output TEXT Path to save config file.
-n, --namespace TEXT Namespace(s) to discover components in. Supports
Regex and comma separated values.
-pl, --pod-label TEXT Pod Label Keys(s) to filter. Supports Regex and
comma separated values.
-nl, --node-label TEXT Node Label Keys(s) to filter. Supports Regex and
comma separated values.
-v, --verbose Increase verbosity of output.
--help Show this message and exit.
Example
The example below filters cluster components from namespaces that match the patterns robot-.* and etcd. In addition to namespaces, we also provide filters for pod labels and node labels. This allows us to narrow down the necessary components to consider when running a Krkn-AI test.
$ uv run krkn_ai discover -k ./path/to/kubeconfig.yaml -n "robot-.*,etcd" -pl "service,env" -nl "disktype" -o ./krkn-ai.yaml
The above command generates a config file that contains the basic setup to help you get started. You can customize the parameters as described in the configs documentation. If you want to exclude any cluster components (such as a pod, node, or namespace) from being considered for Krkn-AI testing, simply remove them from the cluster_components YAML field.
# Path to your kubeconfig file
kubeconfig_file_path: "./path/to/kubeconfig.yaml"
# Genetic algorithm parameters
generations: 5
population_size: 10
composition_rate: 0.3
population_injection_rate: 0.1
# Fitness function configuration for defining SLO
# In the below example, we use Total Restarts in "robot-shop" namespace as the SLO
fitness_function:
  query: 'sum(kube_pod_container_status_restarts_total{namespace="robot-shop"})'
  type: point
  include_krkn_failure: true
# Chaos scenarios to consider during testing
scenario:
  pod-scenarios:
    enable: true
  application-outages:
    enable: true
  container-scenarios:
    enable: false
  node-cpu-hog:
    enable: false
  node-memory-hog:
    enable: false
# Cluster components to consider for Krkn-AI testing
cluster_components:
  namespaces:
  - name: robot-shop
    pods:
    - containers:
      - name: cart
      labels:
        service: cart
        env: dev
      name: cart-7cd6c77dbf-j4gsv
    - containers:
      - name: catalogue
      labels:
        service: catalogue
        env: dev
      name: catalogue-94df6b9b-pjgsr
  - name: etcd
    pods:
    - containers:
      - name: etcd
      labels:
        service: etcd
      name: etcd-0
    - containers:
      - name: etcd
      labels:
        service: etcd
      name: etcd-1
  nodes:
  - labels:
      kubernetes.io/hostname: node-1
      disktype: SSD
    name: node-1
  - labels:
      kubernetes.io/hostname: node-2
      disktype: HDD
    name: node-2
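The filter syntax used by the discover example (regular expressions, comma separated) can be illustrated with a short Python sketch. The helper name matches_any and the choice of full-string matching are assumptions made for illustration; Krkn-AI's own matching rules may differ.

```python
import re


def matches_any(name: str, patterns: str) -> bool:
    """Return True if name matches any comma-separated regex in patterns.

    Assumption: a pattern must match the whole name (re.fullmatch).
    """
    return any(
        re.fullmatch(pattern.strip(), name) is not None
        for pattern in patterns.split(",")
        if pattern.strip()
    )


# Hypothetical namespace list; the filter string is the one from the example.
namespaces = ["robot-shop", "robot-demo", "etcd", "kube-system"]
selected = [ns for ns in namespaces if matches_any(ns, "robot-.*,etcd")]
print(selected)
```

Under these assumptions, kube-system is filtered out, while the default pattern .* would keep every namespace.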
3 - Configuration
Configuring Krkn-AI
Krkn-AI is configured using a simple declarative YAML file. This file can be automatically generated using Krkn-AI’s discover feature, which creates a config file from a boilerplate template. The generated config file will have the cluster components pre-populated based on your cluster.
3.1 - Evolutionary Algorithm
Configuring Evolutionary Algorithm
Krkn-AI uses an online learning approach by leveraging an evolutionary algorithm, where an agent runs tests on the actual cluster and gathers feedback by measuring various KPIs for your cluster and application. The algorithm begins by creating random population samples that contain Chaos scenarios. These scenarios are executed on the cluster, feedback is collected, and then the best samples (parents) are selected to undergo crossover and mutation operations to generate the next set of samples (offspring). The algorithm relies on heuristics to guide the exploration and exploitation of scenarios.

Terminologies
- Generation: A single iteration or cycle of the algorithm during which the population evolves. Each generation produces a new set of candidate solutions.
- Population: The complete set of candidate solutions (individuals) at a given generation.
- Sample (or Individual): A single candidate solution within the population, often represented as a chromosome or genome. In our case, this is equivalent to a Chaos experiment.
- Selection: The process of choosing individuals from the population (based on fitness) to serve as parents for producing the next generation.
- Crossover: The operation of combining two Chaos experiments to produce a new scenario, encouraging the exploration of new solutions.
- Mutation: A random alteration of parts of a Chaos experiment.
- Composition: The process of combining existing Chaos experiments into a grouped scenario to represent a single new scenario.
- Population Injection: The introduction of new individuals into the population to escape stagnation.
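To make the terms above concrete, here is a toy evolutionary loop in Python. It is only a sketch under simplifying assumptions: individuals are lists of integers standing in for scenario parameters, the fitness function is a plain sum, and the function name evolve is hypothetical. It does not reproduce Krkn-AI's actual selection, crossover, or mutation logic.

```python
import random


def evolve(population, fitness, generations, crossover_rate=0.6,
           mutation_rate=0.7, seed=0):
    """Toy evolutionary loop over lists of ints (stand-ins for scenarios)."""
    rng = random.Random(seed)
    size = len(population)
    for _ in range(generations):
        # Selection: the fitter half of the population becomes the parent pool.
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: max(2, size // 2)]
        offspring = []
        while len(offspring) < size:
            first, second = rng.sample(parents, 2)
            child = list(first)
            # Crossover: sometimes splice in the tail of the second parent.
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, len(child))
                child[cut:] = second[cut:]
            # Mutation: sometimes nudge one randomly chosen gene.
            if rng.random() < mutation_rate:
                index = rng.randrange(len(child))
                child[index] += rng.choice([-1, 1])
            offspring.append(child)
        population = offspring
    return max(population, key=fitness)


init_rng = random.Random(42)
start = [[init_rng.randint(0, 5) for _ in range(4)] for _ in range(10)]
best = evolve(start, fitness=sum, generations=5)
print(best, sum(best))
```

In Krkn-AI the fitness value comes from running the scenario on the cluster rather than from a pure function, which is why each generation is expensive and the population size matters.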
Configurations
The algorithm relies on specific configurations to guide its execution. These settings can be adjusted in the Krkn-AI config file, which you generate using the discover command.
generations
Total number of generation loops to run (Default: 20)
- The value for this field should be at least 1.
- Setting this to a higher value increases Krkn-AI testing coverage.
- Each scenario tested in the current generation retains some properties from the previous generation.
population_size
Minimum population size in each generation (Default: 10)
- The value for this field should be at least 2.
- Setting this to a higher value will increase the number of scenarios tested per generation, which is helpful for running diverse test samples.
- A higher value is also preferred when you have a large set of objects in cluster components and multiple scenarios enabled.
- If you have a limited set of components to be evaluated, you can set a smaller population size and fewer generations.
crossover_rate
How often crossover should occur for each scenario parameter (Default: 0.6 and Range: [0.0, 1.0])
- A higher crossover rate increases the likelihood that a crossover operation will create two new candidate solutions from two existing candidates.
- Setting the crossover rate to 1.0 ensures that crossover always occurs during the selection process.
mutation_rate
How often mutation should occur for each scenario parameter (Default: 0.7 and Range: [0.0, 1.0])
- This helps to control the diversification among the candidates. A higher value increases the likelihood that a mutation operation will be applied.
- Setting this to 1.0 ensures persistent mutation during the selection process.
composition_rate
How often a crossover would lead to composition (Default: 0.0 and Range: [0.0, 1.0])
- By default, this value is disabled, but you can set it to a higher rate to increase the likelihood of composition.
population_injection_rate
How often random samples are newly added to the population (Default: 0.0 and Range: [0.0, 1.0])
- A higher injection rate increases the likelihood of introducing new candidates into the existing generation.
population_injection_size
Number of random samples that are added to the new population (Default: 2)
- A higher injection size means that more diversified samples get added during the evolutionary algorithm loop.
- This is beneficial if you want to start with a smaller population test set and then increase the population size as you progress through the test.
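The constraints listed above (minimum values and the [0.0, 1.0] range for rates) can be checked before starting a long run. The Python sketch below is illustrative only: validate_ga_config is a hypothetical helper, not part of the Krkn-AI CLI, and it assumes the config has already been parsed into a dictionary.

```python
def validate_ga_config(cfg: dict) -> list:
    """Return a list of problems; an empty list means the values look valid."""
    problems = []
    # Defaults mirror the values documented above.
    if cfg.get("generations", 20) < 1:
        problems.append("generations must be at least 1")
    if cfg.get("population_size", 10) < 2:
        problems.append("population_size must be at least 2")
    rate_defaults = {
        "crossover_rate": 0.6,
        "mutation_rate": 0.7,
        "composition_rate": 0.0,
        "population_injection_rate": 0.0,
    }
    for key, default in rate_defaults.items():
        value = cfg.get(key, default)
        if not 0.0 <= value <= 1.0:
            problems.append(f"{key} must be within [0.0, 1.0], got {value}")
    return problems


ok = validate_ga_config({"generations": 5, "population_size": 10,
                         "composition_rate": 0.3})
bad = validate_ga_config({"generations": 0, "mutation_rate": 1.5})
print(ok)
print(bad)
```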
3.2 - Fitness Function
Configuring Fitness Function
The fitness function is a crucial element in the Krkn-AI algorithm. It evaluates each Chaos experiment and generates a score. These scores are then used during the selection phase of the algorithm to identify the best candidate solutions in each generation.
- The fitness function can be defined as an SLO or as cluster metrics using a Prometheus query.
- Fitness scores are calculated for the time range during which the Chaos scenario is executed.
Example
Let’s look at a simple fitness function that calculates the total number of restarts in a namespace:
fitness_function:
  query: 'sum(kube_pod_container_status_restarts_total{namespace="robot-shop"})'
  type: point
This fitness function calculates the number of restarts that occurred during the test in the specified namespace. The resulting value is referred to as the Fitness Function Score. These scores are computed for each scenario in every generation and can be found in the scenario YAML configuration within the results. Below is an example of a scenario YAML configuration:
generation_id: 0
scenario_id: 1
scenario:
  name: node-memory-hog(60, 89, 8, kubernetes.io/hostname=node1,
    [], 1, quay.io/krkn-chaos/krkn-hog)
cmd: 'krknctl run node-memory-hog --telemetry-prometheus-backup False --wait-duration
  0 --kubeconfig ./tmp/kubeconfig.yaml --chaos-duration "60" --memory-consumption
  "89%" --memory-workers "8" --node-selector "kubernetes.io/hostname=node1"
  --taints "[]" --number-of-nodes "1" --image "quay.io/krkn-chaos/krkn-hog" '
log: ./results/logs/scenario_1.log
returncode: 0
start_time: '2025-09-01T16:55:12.607656'
end_time: '2025-09-01T16:58:35.204787'
fitness_result:
  scores: []
  fitness_score: 2
job_id: 1
health_check_results: {}
In the above result, the fitness score of 2 indicates that two restarts were observed in the namespace while running the node-memory-hog scenario. The algorithm uses this score as feedback to prioritize this scenario for further testing.
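When reviewing many scenario files, it can help to rank them by score. The Python sketch below assumes the scenario YAML files have already been loaded into dictionaries with the generation_id, scenario_id, and fitness_result fields shown above; the helper name best_per_generation and the sample values are hypothetical.

```python
def best_per_generation(scenarios):
    """Map each generation_id to the scenario with the highest fitness_score."""
    best = {}
    for scenario in scenarios:
        generation = scenario["generation_id"]
        score = scenario["fitness_result"]["fitness_score"]
        current = best.get(generation)
        if current is None or score > current["fitness_result"]["fitness_score"]:
            best[generation] = scenario
    return best


# Hypothetical, already-parsed scenario results.
parsed = [
    {"generation_id": 0, "scenario_id": 1, "fitness_result": {"fitness_score": 2}},
    {"generation_id": 0, "scenario_id": 2, "fitness_result": {"fitness_score": 0}},
    {"generation_id": 1, "scenario_id": 3, "fitness_result": {"fitness_score": 5}},
]
winners = best_per_generation(parsed)
print({gen: s["scenario_id"] for gen, s in winners.items()})
```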
Types of Fitness Function
There are two types of fitness functions available in Krkn-AI: point and range.
Point-Based Fitness Function
In the point-based fitness function type, we calculate the difference in the fitness function value between the end and the beginning of the Chaos experiment. This difference signifies the change that occurred during the experiment phase, allowing us to capture the delta. This approach is especially useful for Prometheus metrics that are counters and only increase, as the difference helps us determine the actual change during the experiment.
Example SLO: pod restarts across the “robot-shop” namespace.
fitness_function:
  query: 'sum(kube_pod_container_status_restarts_total{namespace="robot-shop"})'
  type: point
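The point-based calculation amounts to a simple subtraction. The following Python sketch uses hypothetical counter readings to show the idea; point_fitness is an illustrative name, not a Krkn-AI function.

```python
def point_fitness(value_at_start: float, value_at_end: float) -> float:
    """Change in the metric over the experiment window (end minus start)."""
    return value_at_end - value_at_start


# Hypothetical counter readings: 7 restarts before the experiment, 9 after it.
delta = point_fitness(7, 9)
print(delta)  # 2
```

Because a counter such as a restart total only grows, its absolute value says little about a single experiment, while the delta isolates what happened during the Chaos window.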
Range-Based Fitness Function
Certain SLOs require us to consider changes that occur over a period of time by using aggregate values such as min, max, or average. For these types of value-based metrics in Prometheus, the range type of Fitness Function is useful.
Because the range type is calculated over a time interval, and the exact timing of each Chaos experiment may not be known in advance, we provide a $range$ parameter that must be used in the fitness function definition.
Example SLO: maximum CPU observed for a container.
fitness_function:
  query: 'max_over_time(container_cpu_usage_seconds_total{namespace="robot-shop", container="mysql"}[$range$])'
  type: range
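One way to picture the $range$ placeholder is as a string substitution performed once the experiment duration is known. The Python sketch below is an assumption about the mechanism, not Krkn-AI's implementation: render_range_query is a hypothetical helper, and formatting the duration as whole seconds is an illustrative choice. The timestamps are taken from the scenario result shown earlier.

```python
from datetime import datetime


def render_range_query(query: str, start: datetime, end: datetime) -> str:
    """Replace $range$ with the experiment duration as a PromQL duration."""
    seconds = max(1, int((end - start).total_seconds()))
    return query.replace("$range$", f"{seconds}s")


query = ('max_over_time(container_cpu_usage_seconds_total'
         '{namespace="robot-shop", container="mysql"}[$range$])')
start = datetime.fromisoformat("2025-09-01T16:55:12.607656")
end = datetime.fromisoformat("2025-09-01T16:58:35.204787")
rendered = render_range_query(query, start, end)
print(rendered)
```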
Defining Multiple Fitness Functions
Krkn-AI allows you to define multiple fitness function items in the YAML configuration, enabling you to track how individual fitness values vary for different scenarios in the final outcome.
You can assign a weight to each fitness function to specify how its value impacts the final score used during Genetic Algorithm selection. Each weight should be between 0 and 1. If no weight is specified, it defaults to 1.
fitness_function:
  items:
  - query: 'sum(kube_pod_container_status_restarts_total{namespace="robot-shop"})'
    type: point
    weight: 0.3
  - query: 'sum(kube_pod_container_status_restarts_total{namespace="etcd"})'
    type: point
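As a rough illustration of how weights could shape the final score, the Python sketch below combines per-item scores with a weighted sum. The weighted-sum formula, the combined_fitness name, and the sample scores are all assumptions; the source only states that weights influence the final score and default to 1.

```python
def combined_fitness(items) -> float:
    """Weighted sum of per-item scores; weight defaults to 1 when omitted."""
    return sum(item["score"] * item.get("weight", 1.0) for item in items)


# Hypothetical per-query results: 2 restarts in robot-shop, 1 in etcd.
items = [
    {"score": 2, "weight": 0.3},  # robot-shop restarts, down-weighted
    {"score": 1},                 # etcd restarts, default weight of 1
]
total = combined_fitness(items)
print(round(total, 2))
```

With these sample numbers a single etcd restart outweighs two robot-shop restarts, which is the kind of prioritization weights are meant to express.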
Krkn Failures
Krkn-AI uses krknctl under the hood to trigger Chaos testing experiments on the cluster. As part of the CLI, it captures various feedback and returns a non-zero status code when a failure occurs. By default, feedback from these failures is included in the Krkn-AI Fitness Score calculation.
You can disable this by setting include_krkn_failure to false.
fitness_function:
  include_krkn_failure: false
  query: 'sum(kube_pod_container_status_restarts_total{namespace="robot-shop"})'
  type: point
Health Check
Results from application health checks are also incorporated into the fitness score. You can learn more about health checks and how to configure them in more detail here.
How to Define a Good Fitness Function
Scoring: The higher the fitness score, the more priority will be given to that scenario for generating new sets of scenarios. This also means that scenarios with higher fitness scores are more likely to have an impact on the cluster and should be further investigated.
Normalization: Krkn-AI currently does not apply any normalization, except when a fitness function is assigned with weights. While this does not significantly impact the algorithm, from a user interpretation standpoint, it is beneficial to use normalized SLO queries in PromQL. For example, instead of using the maximum CPU for a pod as a fitness function, it may be more convenient to use the CPU percentage of a pod.
Use-Case Driven: The fitness function query should be defined based on your use case. If you want to optimize your cluster for maximum uptime, a good fitness function could be to capture restart counts or the number of unavailable pods. Similarly, if you are interested in optimizing your cluster to ensure no downtime due to resource constraints, a good fitness function would be to measure the maximum CPU or memory percentage.
3.3 - Application Health Checks
Configuring Application Health Checks
When defining the Chaos Config, you can provide details about your application endpoints. Krkn-AI can access these endpoints during the Chaos experiment to evaluate how the application’s uptime is impacted.
Note
Application endpoints must be accessible from the system where Krkn-AI is running in order to reach the service.
Configuration
The following configuration options are available when defining an application for health checks:
- name: Name of the service.
- url: Service endpoint; supports parameterization with “$”.
- status_code: Expected status code returned when accessing the service.
- timeout: Timeout period after which the request is canceled.
- interval: How often to check the endpoint.
- stop_watcher_on_failure: This setting allows you to stop the health check watcher for an endpoint after it encounters a failure.
Example
health_checks:
  stop_watcher_on_failure: false
  applications:
  - name: cart
    url: "$HOST/cart/add/1/Watson/1"
    status_code: 200
    timeout: 10
    interval: 2
  - name: catalogue
    url: "$HOST/catalogue/categories"
  - name: shipping
    url: "$HOST/shipping/codes"
  - name: payment
    url: "$HOST/payment/health"
  - name: user
    url: "$HOST/user/uniqueid"
  - name: ratings
    url: "$HOST/ratings/api/fetch/Watson"
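The options above describe a simple polling loop. The Python sketch below shows one possible shape for such a watcher; it is not Krkn-AI's implementation. The function names, the fallback values for status_code, timeout, and interval (borrowed from the cart entry), and the injectable fetch and sleep parameters are assumptions made so the example can run without a cluster.

```python
import time
import urllib.error
import urllib.request


def http_status(url: str, timeout: float) -> int:
    """Return the HTTP status code, or 0 if the request fails or times out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
    except OSError:  # includes URLError and socket timeouts
        return 0


def watch(app, checks, fetch=http_status, stop_watcher_on_failure=False,
          sleep=time.sleep):
    """Poll one endpoint `checks` times and record success and latency."""
    results = []
    for _ in range(checks):
        started = time.monotonic()
        status = fetch(app["url"], app.get("timeout", 10))
        elapsed = time.monotonic() - started
        success = status == app.get("status_code", 200)
        results.append({"name": app["name"], "success": success,
                        "response_time": elapsed})
        if not success and stop_watcher_on_failure:
            break
        sleep(app.get("interval", 2))
    return results


# Simulated responses instead of real HTTP calls.
fake_statuses = iter([200, 500, 200])
report = watch({"name": "cart", "url": "http://example.invalid/cart"},
               checks=3, fetch=lambda url, timeout: next(fake_statuses),
               stop_watcher_on_failure=True, sleep=lambda seconds: None)
print([entry["success"] for entry in report])
```

With stop_watcher_on_failure enabled, the simulated run stops after the second check, so the third response is never requested.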
URL Parameterization
When defining Krkn-AI config files, the URL entry for an application may vary depending on the cluster. To make the URL configuration more manageable, you can specify the values for these parameters at runtime using the --param flag.
In the previous example, the $HOST variable in the config can be dynamically replaced during the Krkn-AI experiment run, as shown below.
uv run krkn_ai run -c krkn-ai.yaml -o results/ -p HOST=http://example.cluster.url/nginx
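Conceptually, the substitution resembles Python's string.Template. The sketch below illustrates the idea with the $HOST value from the command above; it is an assumption about the mechanism rather than a description of Krkn-AI internals, and parse_params and render_url are hypothetical helper names.

```python
from string import Template


def parse_params(pairs):
    """Turn ["KEY=value", ...] into a dict, splitting on the first '=' only."""
    return dict(pair.split("=", 1) for pair in pairs)


def render_url(url: str, params: dict) -> str:
    """Substitute $NAME placeholders; unknown placeholders are left as-is."""
    return Template(url).safe_substitute(params)


params = parse_params(["HOST=http://example.cluster.url/nginx"])
cart_url = render_url("$HOST/cart/add/1/Watson/1", params)
print(cart_url)
```

Splitting on the first = only keeps values that themselves contain an equals sign, such as URLs with query strings, intact.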
By default, the results of health checks, including whether each check succeeded and the response times, are incorporated into the overall Fitness Function score. This allows Krkn-AI to use application health as part of its evaluation criteria.
If you want to exclude health check results from influencing the fitness score, you can set the include_health_check_failure and include_health_check_response_time fields to false in your configuration.
fitness_function:
  ...
  include_health_check_failure: false
  include_health_check_response_time: false
3.4 - Scenarios
Available Krkn-AI Scenarios
The following Krkn scenarios are currently supported by Krkn-AI.
At least one scenario must be enabled for the Krkn-AI experiment to run.
By default, scenarios are not enabled. Depending on your use case, you can enable or disable these scenarios in the krkn-ai.yaml config file by setting the enable field to true or false.
scenario:
  pod-scenarios:
    enable: true
  application-outages:
    enable: false
  container-scenarios:
    enable: false
  node-cpu-hog:
    enable: true
  node-memory-hog:
    enable: true
  time-scenarios:
    enable: true
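Since at least one scenario must be enabled, a quick pre-flight check can catch an empty selection before a run starts. The Python sketch below assumes the scenario section has been parsed into a dictionary; enabled_scenarios is a hypothetical helper and not part of the Krkn-AI CLI.

```python
def enabled_scenarios(scenario_cfg: dict) -> list:
    """Return the names of scenarios whose enable field is true.

    A missing enable field counts as disabled, matching the documented default.
    """
    return [name for name, options in scenario_cfg.items()
            if options.get("enable", False)]


# Parsed form of the scenario section shown above.
scenario_cfg = {
    "pod-scenarios": {"enable": True},
    "application-outages": {"enable": False},
    "container-scenarios": {"enable": False},
    "node-cpu-hog": {"enable": True},
    "node-memory-hog": {"enable": True},
    "time-scenarios": {"enable": True},
}
active = enabled_scenarios(scenario_cfg)
if not active:
    raise SystemExit("At least one scenario must be enabled")
print(active)
```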