Krkn-AI is configured using a simple declarative YAML file. This file can be automatically generated using Krkn-AI’s discover feature, which creates a config file from a boilerplate template. The generated config file will have the cluster components pre-populated based on your cluster.
Configuration
1 - Evolutionary Algorithm
Krkn-AI uses an online learning approach by leveraging an evolutionary algorithm, where an agent runs tests on the actual cluster and gathers feedback by measuring various KPIs for your cluster and application. The algorithm begins by creating random population samples that contain Chaos scenarios. These scenarios are executed on the cluster, feedback is collected, and then the best samples (parents) are selected to undergo crossover and mutation operations to generate the next set of samples (offspring). The algorithm relies on heuristics to guide the exploration and exploitation of scenarios.
Terminologies
- Generation: A single iteration or cycle of the algorithm during which the population evolves. Each generation produces a new set of candidate solutions.
- Population: The complete set of candidate solutions (individuals) at a given generation.
- Sample (or Individual): A single candidate solution within the population, often represented as a chromosome or genome. In our case, this is equivalent to a Chaos experiment.
- Selection: The process of choosing individuals from the population (based on fitness) to serve as parents for producing the next generation.
- Crossover: The operation of combining two Chaos experiments to produce a new scenario, encouraging the exploration of new solutions.
- Mutation: A random alteration of parts of a Chaos experiment.
- Composition: The process of combining existing Chaos experiments into a grouped scenario to represent a single new scenario.
- Population Injection: The introduction of new individuals into the population to escape stagnation.
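The terms above can be tied together in a toy loop. This is a minimal sketch of a generic evolutionary algorithm, not Krkn-AI's implementation: the two scenario parameters, the helper names, and `toy_fitness` (a stand-in for feedback measured on a real cluster) are all assumptions made for illustration.

```python
import random


def random_individual(rng):
    # Hypothetical scenario parameters; real Chaos experiments have many more.
    return {"duration": rng.randint(30, 120), "memory": rng.randint(10, 95)}


def crossover(a, b, rng):
    # Take each parameter from one parent or the other.
    return {key: rng.choice((a[key], b[key])) for key in a}


def mutate(individual, rng):
    # Re-draw one randomly chosen parameter.
    key = rng.choice(sorted(individual))
    return {**individual, key: random_individual(rng)[key]}


def toy_fitness(individual):
    # Stand-in for a score measured on a real cluster (assumption).
    return individual["duration"] * individual["memory"]


def evolve(generations=20, population_size=10, crossover_rate=0.6,
           mutation_rate=0.7, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(population_size)]
    best = max(population, key=toy_fitness)
    for _ in range(generations):
        # Selection: the fitter half of the population become parents.
        ranked = sorted(population, key=toy_fitness, reverse=True)
        parents = ranked[: max(2, population_size // 2)]
        offspring = []
        while len(offspring) < population_size:
            a, b = rng.sample(parents, 2)
            child = crossover(a, b, rng) if rng.random() < crossover_rate else dict(a)
            if rng.random() < mutation_rate:
                child = mutate(child, rng)
            offspring.append(child)
        population = offspring
        best = max(population + [best], key=toy_fitness)
    return best


best = evolve()
print(best, toy_fitness(best))
```

Composition and population injection are left out to keep the sketch short; both would add further ways of producing offspring inside the loop.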
Configurations
The algorithm relies on specific configurations to guide its execution. These settings can be adjusted in the Krkn-AI config file, which you generate using the discover command.
generations
Total number of generation loops to run (Default: 20)
- The value for this field should be at least 1.
- Setting this to a higher value increases Krkn-AI testing coverage.
- Each scenario tested in the current generation retains some properties from the previous generation.
population_size
Minimum population size in each generation (Default: 10)
- The value for this field should be at least 2.
- Setting this to a higher value will increase the number of scenarios tested per generation, which is helpful for running diverse test samples.
- A higher value is also preferred when you have a large set of objects in cluster components and multiple scenarios enabled.
- If you have a limited set of components to be evaluated, you can set a smaller population size and fewer generations.
crossover_rate
How often crossover should occur for each scenario parameter (Default: 0.6 and Range: [0.0, 1.0])
- A higher crossover rate increases the likelihood that a crossover operation will create two new candidate solutions from two existing candidates.
- Setting the crossover rate to 1.0 ensures that crossover always occurs during the selection process.
mutation_rate
How often mutation should occur for each scenario parameter (Default: 0.7 and Range: [0.0, 1.0])
- This helps to control the diversification among the candidates. A higher value increases the likelihood that a mutation operation will be applied.
- Setting this to 1.0 ensures that mutation is always applied during the selection process.
composition_rate
How often a crossover would lead to composition (Default: 0.0 and Range: [0.0, 1.0])
- By default, composition is disabled (the rate is 0.0), but you can set a higher rate to increase the likelihood of composition.
population_injection_rate
How often random samples are newly added to the population (Default: 0.0 and Range: [0.0, 1.0])
- A higher injection rate increases the likelihood of introducing new candidates into the existing generation.
population_injection_size
Number of random samples that are added to the new population (Default: 2)
- A higher injection size means that more diversified samples get added during the evolutionary algorithm loop.
- This is beneficial if you want to start with a smaller population test set and then increase the population size as you progress through the test.
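The defaults and limits listed above can be collected into a small check. The helper below is hypothetical and not part of Krkn-AI; it only encodes the documented defaults, minimums, and [0.0, 1.0] ranges, and assumes the settings have already been loaded into a dictionary.

```python
# Documented defaults for the evolutionary algorithm settings.
DEFAULTS = {
    "generations": 20,
    "population_size": 10,
    "crossover_rate": 0.6,
    "mutation_rate": 0.7,
    "composition_rate": 0.0,
    "population_injection_rate": 0.0,
    "population_injection_size": 2,
}

RATE_FIELDS = (
    "crossover_rate",
    "mutation_rate",
    "composition_rate",
    "population_injection_rate",
)


def check_algorithm_settings(settings):
    """Return a list of problems found in the given settings (hypothetical helper)."""
    merged = {**DEFAULTS, **settings}
    problems = []
    if merged["generations"] < 1:
        problems.append("generations should be at least 1")
    if merged["population_size"] < 2:
        problems.append("population_size should be at least 2")
    for field in RATE_FIELDS:
        if not 0.0 <= merged[field] <= 1.0:
            problems.append(f"{field} should be within [0.0, 1.0]")
    return problems


print(check_algorithm_settings({"generations": 0, "mutation_rate": 1.5}))
```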
2 - Fitness Function
The fitness function is a crucial element in the Krkn-AI algorithm. It evaluates each Chaos experiment and generates a score. These scores are then used during the selection phase of the algorithm to identify the best candidate solutions in each generation.
- The fitness function can be defined as an SLO or as cluster metrics using a Prometheus query.
- Fitness scores are calculated for the time range during which the Chaos scenario is executed.
Example
Let’s look at a simple fitness function that calculates the total number of restarts in a namespace:
fitness_function:
  query: 'sum(kube_pod_container_status_restarts_total{namespace="robot-shop"})'
  type: point
This fitness function calculates the number of restarts that occurred during the test in the specified namespace. The resulting value is referred to as the Fitness Function Score. These scores are computed for each scenario in every generation and can be found in the scenario YAML configuration within the results. Below is an example of a scenario YAML configuration:
generation_id: 0
scenario_id: 1
scenario:
  name: node-memory-hog(60, 89, 8, kubernetes.io/hostname=node1,
    [], 1, quay.io/krkn-chaos/krkn-hog)
cmd: 'krknctl run node-memory-hog --telemetry-prometheus-backup False --wait-duration
  0 --kubeconfig ./tmp/kubeconfig.yaml --chaos-duration "60" --memory-consumption
  "89%" --memory-workers "8" --node-selector "kubernetes.io/hostname=node1"
  --taints "[]" --number-of-nodes "1" --image "quay.io/krkn-chaos/krkn-hog" '
log: ./results/logs/scenario_1.log
returncode: 0
start_time: '2025-09-01T16:55:12.607656'
end_time: '2025-09-01T16:58:35.204787'
fitness_result:
  scores: []
  fitness_score: 2
job_id: 1
health_check_results: {}
In the above result, the fitness score of 2 indicates that two restarts were observed in the namespace while running the node-memory-hog scenario. The algorithm uses this score as feedback to prioritize this scenario for further testing.
Types of Fitness Function
There are two types of fitness functions available in Krkn-AI: point and range.
Point-Based Fitness Function
In the point-based fitness function type, we calculate the difference in the fitness function value between the end and the beginning of the Chaos experiment. This difference signifies the change that occurred during the experiment phase, allowing us to capture the delta. This approach is especially useful for Prometheus metrics that are counters and only increase, as the difference helps us determine the actual change during the experiment.
Example SLO: pod restarts across the “robot-shop” namespace.
fitness_function:
  query: 'sum(kube_pod_container_status_restarts_total{namespace="robot-shop"})'
  type: point
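The point-based calculation amounts to a subtraction. As a minimal sketch, assuming the query has already been evaluated once at the start and once at the end of the experiment (the function name and the sample counter values are hypothetical):

```python
def point_score(value_at_start, value_at_end):
    # Difference between the end and the beginning of the Chaos experiment.
    return value_at_end - value_at_start


# Hypothetical counter readings: 5 restarts before the experiment, 7 after it.
print(point_score(5, 7))  # 2
```

Because a counter such as a restart total only ever increases, the raw value at the end says little on its own; the delta isolates what happened during the experiment.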
Range-Based Fitness Function
Certain SLOs require us to consider changes that occur over a period of time by using aggregate values such as min, max, or average. For these types of value-based metrics in Prometheus, the range type of Fitness Function is useful.
Because the range type is calculated over a time interval, and the exact timing of each Chaos experiment may not be known in advance, we provide a $range$ parameter that must be used in the fitness function definition.
Example SLO: maximum CPU observed for a container.
fitness_function:
  query: 'max_over_time(container_cpu_usage_seconds_total{namespace="robot-shop", container="mysql"}[$range$])'
  type: range
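To illustrate what the $range$ placeholder stands for, the sketch below fills it with the duration of an experiment expressed in seconds. The function name and the exact formatting of the duration are assumptions; Krkn-AI may substitute the value differently.

```python
from datetime import datetime


def render_range_query(query, start_time, end_time):
    # Replace $range$ with the experiment duration as a PromQL duration in seconds.
    start = datetime.fromisoformat(start_time)
    end = datetime.fromisoformat(end_time)
    seconds = max(1, int((end - start).total_seconds()))
    return query.replace("$range$", f"{seconds}s")


query = ('max_over_time(container_cpu_usage_seconds_total'
         '{namespace="robot-shop", container="mysql"}[$range$])')

# Start and end times taken from the scenario result shown earlier.
rendered = render_range_query(query, "2025-09-01T16:55:12.607656",
                              "2025-09-01T16:58:35.204787")
print(rendered)
```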
Defining Multiple Fitness Functions
Krkn-AI allows you to define multiple fitness function items in the YAML configuration, enabling you to track how individual fitness values vary for different scenarios in the final outcome.
You can assign a weight to each fitness function to specify how its value impacts the final score used during Genetic Algorithm selection. Each weight should be between 0 and 1. By default, if no weight is specified, it will be considered as 1.
fitness_function:
  items:
  - query: 'sum(kube_pod_container_status_restarts_total{namespace="robot-shop"})'
    type: point
    weight: 0.3
  - query: 'sum(kube_pod_container_status_restarts_total{namespace="etcd"})'
    type: point
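One plausible way to picture the effect of weights is a weighted sum of the individual values. This is an assumption made for illustration; the exact formula Krkn-AI applies is not spelled out here. The function name and the sample values are hypothetical.

```python
def combined_score(items):
    # Each item carries the measured value and an optional weight (default 1).
    return sum(item.get("weight", 1.0) * item["value"] for item in items)


items = [
    {"name": "robot-shop restarts", "value": 2, "weight": 0.3},
    {"name": "etcd restarts", "value": 1},  # no weight given, treated as 1
]
print(combined_score(items))  # 0.3 * 2 + 1.0 * 1, about 1.6
```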
Krkn Failures
Krkn-AI uses krknctl under the hood to trigger Chaos testing experiments on the cluster. As part of the CLI, it captures various feedback and returns a non-zero status code when a failure occurs. By default, feedback from these failures is included in the Krkn-AI Fitness Score calculation.
You can disable this by setting include_krkn_failure to false.
fitness_function:
  include_krkn_failure: false
  query: 'sum(kube_pod_container_status_restarts_total{namespace="robot-shop"})'
  type: point
Health Check
Results from application health checks are also incorporated into the fitness score. You can learn more about health checks and how to configure them in the Application Health Checks section.
How to Define a Good Fitness Function
Scoring: The higher the fitness score, the more priority will be given to that scenario for generating new sets of scenarios. This also means that scenarios with higher fitness scores are more likely to have an impact on the cluster and should be further investigated.
Normalization: Krkn-AI currently does not apply any normalization, except when a fitness function is assigned with weights. While this does not significantly impact the algorithm, from a user interpretation standpoint, it is beneficial to use normalized SLO queries in PromQL. For example, instead of using the maximum CPU for a pod as a fitness function, it may be more convenient to use the CPU percentage of a pod.
Use-Case Driven: The fitness function query should be defined based on your use case. If you want to optimize your cluster for maximum uptime, a good fitness function could be to capture restart counts or the number of unavailable pods. Similarly, if you are interested in optimizing your cluster to ensure no downtime due to resource constraints, a good fitness function would be to measure the maximum CPU or memory percentage.
3 - Application Health Checks
When defining the Chaos Config, you can provide details about your application endpoints. Krkn-AI can access these endpoints during the Chaos experiment to evaluate how the application’s uptime is impacted.
Note
Application endpoints must be accessible from the system where Krkn-AI is running in order to reach the service.
Configuration
The following configuration options are available when defining an application for health checks:
- name: Name of the service.
- url: Service endpoint; supports parameterization with “$” (for example, $HOST).
- status_code: Expected status code returned when accessing the service.
- timeout: Timeout period after which the request is canceled.
- interval: How often to check the endpoint.
- stop_watcher_on_failure: This setting allows you to stop the health check watcher for an endpoint after it encounters a failure.
Example
health_checks:
  stop_watcher_on_failure: false
  applications:
  - name: cart
    url: "$HOST/cart/add/1/Watson/1"
    status_code: 200
    timeout: 10
    interval: 2
  - name: catalogue
    url: "$HOST/catalogue/categories"
  - name: shipping
    url: "$HOST/shipping/codes"
  - name: payment
    url: "$HOST/payment/health"
  - name: user
    url: "$HOST/user/uniqueid"
  - name: ratings
    url: "$HOST/ratings/api/fetch/Watson"
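The effect of stop_watcher_on_failure can be pictured with a small loop. This is a sketch under stated assumptions, not Krkn-AI's watcher: the function name is hypothetical, `probe` is a stand-in for the HTTP request, the default expected status of 200 is assumed, and the wait of `interval` between checks is omitted.

```python
def watch_endpoint(app, probe, checks, stop_watcher_on_failure=False):
    # Assumed default: expect HTTP 200 when status_code is not configured.
    expected = app.get("status_code", 200)
    results = []
    for _ in range(checks):
        ok = probe(app["url"]) == expected
        results.append(ok)
        if not ok and stop_watcher_on_failure:
            break  # stop checking this endpoint after its first failure
    return results


app = {"name": "cart", "url": "http://example.cluster.url/nginx/cart/add/1/Watson/1",
       "status_code": 200}

# Fake probe that returns a scripted sequence of status codes.
responses = iter([200, 500, 200])
print(watch_endpoint(app, lambda url: next(responses), checks=3))
```

With the flag left at false the watcher keeps checking after a failure, so a recovery later in the experiment is still recorded.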
URL Parameterization
When defining Krkn-AI config files, the URL entry for an application may vary depending on the cluster. To make the URL configuration more manageable, you can specify the values for these parameters at runtime using the --param flag.
In the previous example, the $HOST variable in the config can be dynamically replaced during the Krkn-AI experiment run, as shown below.
uv run krkn_ai run -c krkn-ai.yaml -o results/ -p HOST=http://example.cluster.url/nginx
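The substitution itself behaves like ordinary template expansion. The sketch below reproduces the idea with Python's standard `string.Template`; whether Krkn-AI uses this mechanism internally is an assumption, and the helper names are hypothetical.

```python
from string import Template


def parse_param(argument):
    # Split a NAME=value argument on the first "=" only, so URLs stay intact.
    name, value = argument.split("=", 1)
    return name, value


def render_url(url, params):
    # Replace $NAME placeholders in the URL with the supplied values.
    return Template(url).substitute(params)


params = dict([parse_param("HOST=http://example.cluster.url/nginx")])
print(render_url("$HOST/cart/add/1/Watson/1", params))
```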
Configure Health Check Score into Fitness Function
By default, the results of health checks—including whether each check succeeded and the response times—are incorporated into the overall Fitness Function score. This allows Krkn-AI to use application health as part of its evaluation criteria.
If you want to exclude health check results from influencing the fitness score, you can set the include_health_check_failure and include_health_check_response_time fields to false in your configuration.
fitness_function:
  ...
  include_health_check_failure: false
  include_health_check_response_time: false
4 - Scenarios
The following Krkn scenarios are currently supported by Krkn-AI.
At least one scenario must be enabled for the Krkn-AI experiment to run.
Scenario | Krkn-AI Config (YAML) |
---|---|
Pod Scenario | scenario.pod-scenarios |
Application Outages | scenario.application-outages |
Container Scenario | scenario.container-scenarios |
Node CPU Hog | scenario.node-cpu-hog |
Node Memory Hog | scenario.node-memory-hog |
Time Scenario | scenario.time-scenarios |
By default, scenarios are not enabled. Depending on your use case, you can enable or disable these scenarios in the krkn-ai.yaml config file by setting the enable field to true or false.
scenario:
  pod-scenarios:
    enable: true
  application-outages:
    enable: false
  container-scenarios:
    enable: false
  node-cpu-hog:
    enable: true
  node-memory-hog:
    enable: true
  time-scenarios:
    enable: true
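Since at least one scenario must be enabled, it can be useful to check a config before starting a run. The helper below is hypothetical and assumes the scenario section has already been loaded into a dictionary; a missing enable field is treated as false, matching the default described above.

```python
def enabled_scenarios(scenario_config):
    # Names of scenarios whose enable field is true (missing means disabled).
    return [name for name, options in scenario_config.items()
            if options.get("enable", False)]


scenario_config = {
    "pod-scenarios": {"enable": True},
    "application-outages": {"enable": False},
    "container-scenarios": {"enable": False},
    "node-cpu-hog": {"enable": True},
    "node-memory-hog": {"enable": True},
    "time-scenarios": {"enable": True},
}

enabled = enabled_scenarios(scenario_config)
if not enabled:
    raise SystemExit("At least one scenario must be enabled.")
print(enabled)
```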