Configuration

Configuring Krkn-AI

1: Evolutionary Algorithm
2: Fitness Function
3: Application Health Checks
4: Scenarios
5: Output

Krkn-AI is configured using a simple declarative YAML file. This file can be automatically generated using Krkn-AI’s discover feature, which creates a config file from a boilerplate template. The generated config file will have the cluster components pre-populated based on your cluster.

1 - Evolutionary Algorithm

Configuring Evolutionary Algorithm

Krkn-AI uses an online learning approach by leveraging an evolutionary algorithm, where an agent runs tests on the actual cluster and gathers feedback by measuring various KPIs for your cluster and application. The algorithm begins by creating random population samples that contain Chaos scenarios. These scenarios are executed on the cluster, feedback is collected, and then the best samples (parents) are selected to undergo crossover and mutation operations to generate the next set of samples (offspring). The algorithm relies on heuristics to guide the exploration and exploitation of scenarios.

Genetic Algorithm

Terminologies

Generation: A single iteration or cycle of the algorithm during which the population evolves. Each generation produces a new set of candidate solutions.
Population: The complete set of candidate solutions (individuals) at a given generation.
Sample (or Individual): A single candidate solution within the population, often represented as a chromosome or genome. In our case, this is equivalent to a Chaos experiment.
Selection: The process of choosing individuals from the population (based on fitness) to serve as parents for producing the next generation.
Crossover: The operation of combining two Chaos experiments to produce a new scenario, encouraging the exploration of new solutions.
Mutation: A random alteration of parts of a Chaos experiment.
Scenario Mutation: The scenario itself is changed to a different one, introducing greater diversity in scenario execution while retaining the existing run properties.
Composition: The process of combining existing Chaos experiments into a grouped scenario to represent a single new scenario.
Population Injection: The introduction of new individuals into the population to escape stagnation.
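
The loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Krkn-AI's implementation: the `Sample` type, the `duration` run property, and the stand-in `evaluate` function are hypothetical, and selection here simply keeps the fitter half of the population. The real tool measures fitness by running scenarios on a cluster.

```python
import random
from dataclasses import dataclass

# Scenario names used for illustration; `duration` is a made-up run property.
SCENARIOS = ["pod-scenarios", "node-cpu-hog", "node-memory-hog"]


@dataclass(frozen=True)
class Sample:
    scenario: str
    duration: int  # seconds


def random_sample(rng):
    return Sample(rng.choice(SCENARIOS), rng.randint(10, 120))


def evaluate(sample):
    # Stand-in fitness; the real tool executes the scenario and collects KPIs.
    return sample.duration + (50 if sample.scenario == "node-memory-hog" else 0)


def crossover(a, b):
    # Combine two parents: scenario from one, run property from the other.
    return Sample(a.scenario, b.duration)


def mutate(sample, rng, scenario_mutation_rate):
    if rng.random() < scenario_mutation_rate:
        # Scenario mutation: change the scenario, keep the run properties.
        return Sample(rng.choice(SCENARIOS), sample.duration)
    # Ordinary mutation: randomly alter a parameter.
    return Sample(sample.scenario, max(10, sample.duration + rng.randint(-15, 15)))


def evolve(generations=5, population_size=6, crossover_rate=0.6,
           mutation_rate=0.7, scenario_mutation_rate=0.6, seed=0):
    rng = random.Random(seed)
    population = [random_sample(rng) for _ in range(population_size)]
    for _ in range(generations):
        # Selection: keep the fitter half of the population as parents.
        ranked = sorted(population, key=evaluate, reverse=True)
        parents = ranked[: max(2, population_size // 2)]
        offspring = []
        while len(offspring) < population_size:
            a, b = rng.sample(parents, 2)
            child = crossover(a, b) if rng.random() < crossover_rate else a
            if rng.random() < mutation_rate:
                child = mutate(child, rng, scenario_mutation_rate)
            offspring.append(child)
        population = offspring
    return population


final_population = evolve()
best = max(final_population, key=evaluate)
```

Each rate acts as a probability checked against a random draw, which is one common way to read settings such as `crossover_rate` and `mutation_rate`.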

Configurations

The algorithm relies on specific configurations to guide its execution. These settings can be adjusted in the Krkn-AI config file, which you generate using the discover command.

`generations`

Total number of generation loops to run (Default: 20)

The value for this field should be at least 1.
Setting this to a higher value increases Krkn-AI testing coverage.
Each scenario tested in the current generation retains some properties from the previous generation.

`population_size`

Minimum population size in each generation (Default: 10)

The value for this field should be at least 2.
Setting this to a higher value will increase the number of scenarios tested per generation, which is helpful for running diverse test samples.
A higher value is also preferred when you have a large set of objects in cluster components and multiple scenarios enabled.
If you have a limited set of components to be evaluated, you can set a smaller population size and fewer generations.

`crossover_rate`

How often crossover should occur for each scenario parameter (Default: 0.6 and Range: [0.0, 1.0])

A higher crossover rate increases the likelihood that a crossover operation will create two new candidate solutions from two existing candidates.
Setting the crossover rate to 1.0 ensures that crossover always occurs during the selection process.

`mutation_rate`

How often mutation should occur for each scenario parameter (Default: 0.7 and Range: [0.0, 1.0])

This helps to control the diversification among the candidates. A higher value increases the likelihood that a mutation operation will be applied.
Setting this to 1.0 ensures that mutation is always applied during the selection process.

`scenario_mutation_rate`

How often a mutation should result in a change to the scenario (Default: 0.6; Range: [0.0, 1.0])

A higher rate increases diversity between scenarios in each generation.
A lower rate gives priority to retaining the existing scenario across generations.

`composition_rate`

How often a crossover would lead to composition (Default: 0.0 and Range: [0.0, 1.0])

Composition is disabled by default (rate 0.0); set a higher rate to increase the likelihood of composition.

`population_injection_rate`

How often random samples are newly added to the population (Default: 0.0 and Range: [0.0, 1.0])

A higher injection rate increases the likelihood of introducing new candidates into the existing generation.

`population_injection_size`

Number of random samples added to the new population (Default: 2)

A higher injection size means that more diversified samples get added during the evolutionary algorithm loop.
This is beneficial if you want to start with a smaller population test set and then increase the population size as you progress through the test.

`wait_duration`

Time to wait after scenario execution. Sets Krkn’s --wait-duration parameter. (Default: 120 seconds)
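
The constraints listed for these settings can be checked before a run. The following is a minimal sketch, assuming the settings have been loaded into a plain Python object; the `EvolutionConfig` class and its `validate` method are hypothetical helpers, not part of Krkn-AI. The defaults are the ones documented above.

```python
from dataclasses import dataclass, fields


@dataclass
class EvolutionConfig:
    # Defaults copied from the documentation above.
    generations: int = 20
    population_size: int = 10
    crossover_rate: float = 0.6
    mutation_rate: float = 0.7
    scenario_mutation_rate: float = 0.6
    composition_rate: float = 0.0
    population_injection_rate: float = 0.0
    population_injection_size: int = 2
    wait_duration: int = 120  # seconds

    def validate(self):
        """Return a list of problems; an empty list means the config is valid."""
        problems = []
        if self.generations < 1:
            problems.append("generations must be at least 1")
        if self.population_size < 2:
            problems.append("population_size must be at least 2")
        # Every *_rate setting is documented with the range [0.0, 1.0].
        for f in fields(self):
            if f.name.endswith("_rate"):
                value = getattr(self, f.name)
                if not 0.0 <= value <= 1.0:
                    problems.append(f"{f.name} must be in [0.0, 1.0], got {value}")
        return problems


ok = EvolutionConfig().validate()
bad = EvolutionConfig(generations=0, mutation_rate=1.5).validate()
```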

2 - Fitness Function

Configuring Fitness Function

The fitness function is a crucial element in the Krkn-AI algorithm. It evaluates each Chaos experiment and generates a score. These scores are then used during the selection phase of the algorithm to identify the best candidate solutions in each generation.

The fitness function can be defined as an SLO or as cluster metrics using a Prometheus query.
Fitness scores are calculated for the time range during which the Chaos scenario is executed.

Example

Let’s look at a simple fitness function that calculates the total number of restarts in a namespace:

fitness_function: 
  query: 'sum(kube_pod_container_status_restarts_total{namespace="robot-shop"})'
  type: point

This fitness function calculates the number of restarts that occurred during the test in the specified namespace. The resulting value is referred to as the Fitness Function Score. These scores are computed for each scenario in every generation and can be found in the scenario YAML configuration within the results. Below is an example of a scenario YAML configuration:

generation_id: 0
scenario_id: 1
scenario:
  name: node-memory-hog(60, 89, 8, kubernetes.io/hostname=node1,
    [], 1, quay.io/krkn-chaos/krkn-hog)
cmd: 'krknctl run node-memory-hog --telemetry-prometheus-backup False --wait-duration
  0 --kubeconfig ./tmp/kubeconfig.yaml --chaos-duration "60" --memory-consumption
  "89%" --memory-workers "8" --node-selector "kubernetes.io/hostname=node1"
  --taints "[]" --number-of-nodes "1" --image "quay.io/krkn-chaos/krkn-hog" '
log: ./results/logs/scenario_1.log
returncode: 0
start_time: '2025-09-01T16:55:12.607656'
end_time: '2025-09-01T16:58:35.204787'
fitness_result:
  scores: []
  fitness_score: 2
job_id: 1
health_check_results: {}

In the above result, the fitness score of 2 indicates that two restarts were observed in the namespace while running the node-memory-hog scenario. The algorithm uses this score as feedback to prioritize this scenario for further testing.

Types of Fitness Function

There are two types of fitness functions available in Krkn-AI: point and range.

Point-Based Fitness Function

In the point-based fitness function type, we calculate the difference in the fitness function value between the end and the beginning of the Chaos experiment. This difference signifies the change that occurred during the experiment phase, allowing us to capture the delta. This approach is especially useful for Prometheus metrics that are counters and only increase, as the difference helps us determine the actual change during the experiment.

E.g., SLO: Pod restarts across the “robot-shop” namespace.

fitness_function: 
  query: 'sum(kube_pod_container_status_restarts_total{namespace="robot-shop"})'
  type: point
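
As a rough sketch of the point-based calculation, the score is the metric value at the end of the experiment minus its value at the start. The `query_at` callable below is a hypothetical stand-in for an instant Prometheus query at a given timestamp, and the counter values are made up for illustration.

```python
def point_fitness(query_at, start_time, end_time):
    """Delta of a metric between the end and the start of an experiment.

    `query_at` is a hypothetical callable that returns the metric value
    at a given timestamp.
    """
    return query_at(end_time) - query_at(start_time)


# Stand-in data: a restart counter that read 5 before the experiment and 7 after.
restarts = {"2025-09-01T16:55:12": 5.0, "2025-09-01T16:58:35": 7.0}
score = point_fitness(restarts.__getitem__,
                      "2025-09-01T16:55:12", "2025-09-01T16:58:35")
```

Because a counter only increases, subtracting the starting value removes restarts that happened before the experiment began.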

Range-Based Fitness Function

Certain SLOs require us to consider changes that occur over a period of time by using aggregate values such as min, max, or average. For these types of value-based metrics in Prometheus, the range type of Fitness Function is useful.

Because the range type is calculated over a time interval—and the exact timing of each Chaos experiment may not be known in advance—we provide a $range$ parameter that must be used in the fitness function definition.

E.g., SLO: Max CPU observed for a container.

fitness_function: 
  query: 'max_over_time(container_cpu_usage_seconds_total{namespace="robot-shop", container="mysql"}[$range$])'
  type: range
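
One way to picture the `$range$` placeholder is as a text substitution performed once the experiment's start and end times are known. The sketch below assumes the placeholder is replaced with the experiment duration in whole seconds (for example `203s`); the exact value and format that Krkn-AI substitutes are not stated here, and `fill_range` is a hypothetical helper.

```python
from datetime import datetime
import math


def fill_range(query, start_time, end_time):
    """Replace $range$ with the experiment duration, rounded up to seconds."""
    start = datetime.fromisoformat(start_time)
    end = datetime.fromisoformat(end_time)
    seconds = math.ceil((end - start).total_seconds())
    return query.replace("$range$", f"{seconds}s")


query = ('max_over_time(container_cpu_usage_seconds_total'
         '{namespace="robot-shop", container="mysql"}[$range$])')
filled = fill_range(query,
                    "2025-09-01T16:55:12.607656",
                    "2025-09-01T16:58:35.204787")
```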

Defining Multiple Fitness Functions

Krkn-AI allows you to define multiple fitness function items in the YAML configuration, enabling you to track how individual fitness values vary for different scenarios in the final outcome.

You can assign a weight to each fitness function to specify how its value impacts the final score used during Genetic Algorithm selection. Each weight should be between 0 and 1. By default, if no weight is specified, it will be considered as 1.

fitness_function:
  items:
  - query: 'sum(kube_pod_container_status_restarts_total{namespace="robot-shop"})'
    type: point
    weight: 0.3
  - query: 'sum(kube_pod_container_status_restarts_total{namespace="etcd"})'
    type: point

Krkn Failures

Krkn-AI uses krknctl under the hood to trigger Chaos testing experiments on the cluster. As part of the CLI, it captures various feedback and returns a non-zero status code (exit status 2) when a failure occurs. By default, feedback from these failures is included in the Krkn-AI Fitness Score calculation.

You can disable this by setting include_krkn_failure to false.

fitness_function:
    include_krkn_failure: false
    query: 'sum(kube_pod_container_status_restarts_total{namespace="robot-shop"})'
    type: point

Note: If a Krkn scenario exits with a non-zero status code other than 2, Krkn-AI assigns a fitness score of -1 and stops the calculation of health scores. This typically indicates a misconfiguration or another issue with the scenario. For more details, please refer to the Krkn logs for the scenario.
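
The exit-status rules above can be summarized in a short sketch. Both functions are hypothetical helpers. Only the `-1` rule for unexpected exit codes is taken from the text; how a status-2 failure changes the score is not specified here, so the sketch leaves the measured score untouched in that case.

```python
def classify_returncode(returncode):
    """Map a krknctl exit status to how the run is treated, per the text above."""
    if returncode == 0:
        return "success"
    if returncode == 2:
        return "krkn_failure"  # feedback that can be included in the fitness score
    return "invalid"  # misconfiguration or another issue with the scenario


def fitness_for_run(returncode, measured_score):
    # Any non-zero status other than 2 yields a fitness score of -1.
    if classify_returncode(returncode) == "invalid":
        return -1
    return measured_score
```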

Health Check

Results from application health checks are also incorporated into the fitness score. You can learn more about health checks and how to configure them in the Application Health Checks section.

How to Define a Good Fitness Function

Scoring: The higher the fitness score, the more priority will be given to that scenario for generating new sets of scenarios. This also means that scenarios with higher fitness scores are more likely to have an impact on the cluster and should be further investigated.
Normalization: Krkn-AI currently does not apply any normalization, except when a fitness function is assigned a weight. While this does not significantly impact the algorithm, normalized SLO queries in PromQL are easier to interpret. For example, instead of using the maximum CPU for a pod as a fitness function, it may be more convenient to use the CPU percentage of a pod.
Use-Case Driven: The fitness function query should be defined based on your use case. If you want to optimize your cluster for maximum uptime, a good fitness function could be to capture restart counts or the number of unavailable pods. Similarly, if you are interested in optimizing your cluster to ensure no downtime due to resource constraints, a good fitness function would be to measure the maximum CPU or memory percentage.

3 - Application Health Checks

Configuring Application Health Checks

When defining the Chaos Config, you can provide details about your application endpoints. Krkn-AI can access these endpoints during the Chaos experiment to evaluate how the application’s uptime is impacted.

Note

Application endpoints must be accessible from the system where Krkn-AI is running in order to reach the service.

Configuration

The following configuration options are available when defining an application for health checks:

name: Name of the service.
url: Service endpoint; supports parameterization with “$”.
status_code: Expected status code returned when accessing the service.
timeout: Timeout period after which the request is canceled.
interval: How often to check the endpoint.
stop_watcher_on_failure: This setting allows you to stop the health check watcher for an endpoint after it encounters a failure.

Example

health_checks:
  stop_watcher_on_failure: false
  applications:
  - name: cart
    url: "$HOST/cart/add/1/Watson/1"
    status_code: 200
    timeout: 10
    interval: 2
  - name: catalogue
    url: "$HOST/catalogue/categories"
  - name: shipping
    url: "$HOST/shipping/codes"
  - name: payment
    url: "$HOST/payment/health"
  - name: user
    url: "$HOST/user/uniqueid"
  - name: ratings
    url: "$HOST/ratings/api/fetch/Watson"
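
The options above suggest a simple polling loop per endpoint. The following is a sketch under stated assumptions, not Krkn-AI's watcher: `probe` is a hypothetical callable that returns a status code and a response time, `checks` is an invented bound so the loop terminates, and the default values for `status_code` and `interval` are borrowed from the `cart` entry rather than from documented defaults.

```python
import time


def watch_endpoint(probe, url, status_code=200, interval=2, checks=3,
                   stop_watcher_on_failure=False, sleep=time.sleep):
    """Poll `url` and record whether each response matched `status_code`.

    `probe` is a hypothetical callable returning (status, response_time_seconds).
    """
    results = []
    for i in range(checks):
        status, elapsed = probe(url)
        ok = status == status_code
        results.append({"success": ok, "response_time": elapsed})
        if not ok and stop_watcher_on_failure:
            break  # stop watching this endpoint after its first failure
        if i < checks - 1:
            sleep(interval)
    return results


# Stand-in probe: the second request fails with a 503.
responses = iter([(200, 0.05), (503, 0.40), (200, 0.06)])
results = watch_endpoint(lambda url: next(responses), "http://example/cart",
                         stop_watcher_on_failure=True, sleep=lambda s: None)
```

With `stop_watcher_on_failure` enabled, the loop ends at the failing request, so the third response is never fetched.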

URL Parameterization

When defining Krkn-AI config files, the URL entry for an application may vary depending on the cluster. To make the URL configuration more manageable, you can specify the values for these parameters at runtime using the --param flag.

In the previous example, the $HOST variable in the config can be dynamically replaced during the Krkn-AI experiment run, as shown below.

uv run krkn_ai run -c krkn-ai.yaml -o results/ -p HOST=http://example.cluster.url/nginx
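
The substitution itself behaves like ordinary template expansion. As a minimal sketch, Python's standard `string.Template` resolves `$HOST` in the same way; whether Krkn-AI uses this class internally is an assumption, and `resolve_url` is a hypothetical helper.

```python
from string import Template


def resolve_url(url, params):
    """Substitute $NAME placeholders; raises KeyError if a parameter is missing."""
    return Template(url).substitute(params)


# Parse KEY=VALUE pairs as they would be passed on the command line.
params = dict(p.split("=", 1) for p in ["HOST=http://example.cluster.url/nginx"])
resolved = resolve_url("$HOST/cart/add/1/Watson/1", params)
```

Splitting on the first `=` only keeps any later `=` characters in the value intact, which matters for URLs that carry query strings.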

Configure Health Check Score into Fitness Function

By default, the results of health checks—including whether each check succeeded and the response times—are incorporated into the overall Fitness Function score. This allows Krkn-AI to use application health as part of its evaluation criteria.

If you want to exclude health check results from influencing the fitness score, you can set the include_health_check_failure and include_health_check_response_time fields to false in your configuration.

fitness_function:
    ...
    include_health_check_failure: false
    include_health_check_response_time: false

4 - Scenarios

Available Krkn-AI Scenarios

The following Krkn scenarios are currently supported by Krkn-AI.

At least one scenario must be enabled for the Krkn-AI experiment to run.

Scenario	Krkn-AI Config (YAML)
Pod Scenario	scenario.pod-scenarios
Application Outages	scenario.application-outages
Container Scenario	scenario.container-scenarios
Node CPU Hog	scenario.node-cpu-hog
Node Memory Hog	scenario.node-memory-hog
Node IO Hog	scenario.node-io-hog
Syn Flood	scenario.syn-flood
Time Scenario	scenario.time-scenarios
Network Scenarios	scenario.network-scenarios
DNS Outage	scenario.dns-outage
PVC Scenario	scenario.pvc-scenarios

By default, scenarios are not enabled. Depending on your use case, you can enable or disable these scenarios in the krkn-ai.yaml config file by setting the enable field to true or false.

scenario:
  pod-scenarios:
    enable: true

  application-outages:
    enable: false

  container-scenarios:
    enable: false

  node-cpu-hog:
    enable: true

  node-memory-hog:
    enable: true

  node-io-hog:
    enable: false

  syn-flood:
    enable: false

  time-scenarios:
    enable: true

  network-scenarios:
    enable: false

  dns-outage:
    enable: true

  pvc-scenarios:
    enable: false
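
Since at least one scenario must be enabled, it can be useful to check the config before starting a run. The sketch below assumes the YAML has already been loaded into a Python dict with the same structure; `enabled_scenarios` is a hypothetical helper, not a Krkn-AI function.

```python
def enabled_scenarios(config):
    """Return the names of scenarios whose `enable` field is true."""
    scenarios = config.get("scenario", {})
    return [name for name, opts in scenarios.items() if opts.get("enable", False)]


# Part of the YAML above, written as a Python dict.
config = {
    "scenario": {
        "pod-scenarios": {"enable": True},
        "application-outages": {"enable": False},
        "node-cpu-hog": {"enable": True},
    }
}
active = enabled_scenarios(config)
if not active:
    raise ValueError("at least one scenario must be enabled")
```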

5 - Output

Configuring output formatters

Krkn-AI generates various output files during the execution of chaos experiments, including scenario YAML files, graph visualizations, and log files. By default, these files follow a standard naming convention, but you can customize the file names using format strings in the configuration file.

Available Parameters

The output section in your krkn-ai.yaml configuration file allows you to customize the naming format for different output file types:

`result_name_fmt`

Specifies the naming format for scenario result YAML files. These files contain the complete scenario configuration and execution results for each generated scenario.

Default: "scenario_%s.yaml"

`graph_name_fmt`

Specifies the naming format for graph visualization files. These files contain visual representations of the health check latency and success information.

Default: "scenario_%s.png"

`log_name_fmt`

Specifies the naming format for log files. These files contain execution logs for each scenario run.

Default: "scenario_%s.log"

Format String Placeholders

The format strings support the following placeholders:

%g - Generation number
%s - Scenario ID
%c - Scenario name (e.g., pod_scenarios)

Example

Here’s an example configuration that customizes all output file names:

output:
  result_name_fmt: "gen_%g_scenario_%s_%c.yaml"
  graph_name_fmt: "gen_%g_scenario_%s_%c.png"
  log_name_fmt: "gen_%g_scenario_%s_%c.log"

With this configuration, files will be named like:

gen_0_scenario_1_pod_scenarios.yaml
gen_0_scenario_1_pod_scenarios.png
gen_0_scenario_1_pod_scenarios.log
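
The placeholder expansion can be reproduced with plain string replacement. This is a minimal sketch of the behavior described above, with `render_name` as a hypothetical helper; it is not Krkn-AI's own formatter.

```python
def render_name(fmt, generation_id, scenario_id, scenario_name):
    """Expand %g, %s and %c in an output file name format."""
    return (fmt.replace("%g", str(generation_id))
               .replace("%s", str(scenario_id))
               .replace("%c", scenario_name))


name = render_name("gen_%g_scenario_%s_%c.yaml", 0, 1, "pod_scenarios")
default_name = render_name("scenario_%s.log", 0, 1, "pod_scenarios")
```

Note that `%s` here stands for the scenario ID rather than a generic string, so these formats should not be passed to Python's `%` operator directly.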