Configuration

Configuring Krkn-AI

Krkn-AI is configured using a simple declarative YAML file. This file can be automatically generated using Krkn-AI’s discover feature, which creates a config file from a boilerplate template. The generated config file will have the cluster components pre-populated based on your cluster.

1 - Evolutionary Algorithm

Configuring Evolutionary Algorithm

Krkn-AI uses an online learning approach by leveraging an evolutionary algorithm, where an agent runs tests on the actual cluster and gathers feedback by measuring various KPIs for your cluster and application. The algorithm begins by creating random population samples that contain Chaos scenarios. These scenarios are executed on the cluster, feedback is collected, and then the best samples (parents) are selected to undergo crossover and mutation operations to generate the next set of samples (offspring). The algorithm relies on heuristics to guide the exploration and exploitation of scenarios.

Genetic Algorithm

Terminology

  • Generation: A single iteration or cycle of the algorithm during which the population evolves. Each generation produces a new set of candidate solutions.
  • Population: The complete set of candidate solutions (individuals) at a given generation.
  • Sample (or Individual): A single candidate solution within the population, often represented as a chromosome or genome. In our case, this is equivalent to a Chaos experiment.
  • Selection: The process of choosing individuals from the population (based on fitness) to serve as parents for producing the next generation.
  • Crossover: The operation of combining two Chaos experiments to produce a new scenario, encouraging the exploration of new solutions.
  • Mutation: A random alteration of parts of a Chaos experiment.
  • Composition: The process of combining existing Chaos experiments into a grouped scenario to represent a single new scenario.
  • Population Injection: The introduction of new individuals into the population to escape stagnation.
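
To make the terms above concrete, here is a minimal Python sketch of a generic evolutionary loop. It is not Krkn-AI's implementation: the parameter names in PARAM_RANGES, the "keep the fitter half" selection rule, and the toy fitness function passed in by the caller are all assumptions made for illustration. In Krkn-AI the fitness would come from running the scenario on a real cluster.

```python
import random

# Hypothetical parameter space for one scenario; the names and ranges are
# illustrative and are not taken from Krkn-AI.
PARAM_RANGES = {
    "chaos_duration": (30, 120),
    "memory_consumption": (10, 90),
    "workers": (1, 8),
}


def random_sample(rng):
    """Create one individual: a dict of randomly chosen scenario parameters."""
    return {name: rng.randint(lo, hi) for name, (lo, hi) in PARAM_RANGES.items()}


def crossover(a, b, rng, rate):
    """For each parameter, take parent b's value with probability `rate`."""
    return {k: (b[k] if rng.random() < rate else a[k]) for k in a}


def mutate(sample, rng, rate):
    """Re-draw each parameter at random with probability `rate`."""
    out = dict(sample)
    for name, (lo, hi) in PARAM_RANGES.items():
        if rng.random() < rate:
            out[name] = rng.randint(lo, hi)
    return out


def evolve(fitness, generations=20, population_size=10,
           crossover_rate=0.6, mutation_rate=0.7, seed=0):
    """Run the loop and return the best individual seen."""
    rng = random.Random(seed)
    population = [random_sample(rng) for _ in range(population_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # Selection: keep the fitter half of the population as parents.
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: max(2, population_size // 2)]
        # Offspring: crossover of two distinct parents, followed by mutation.
        population = []
        for _ in range(population_size):
            p1, p2 = rng.sample(parents, 2)
            child = crossover(p1, p2, rng, crossover_rate)
            population.append(mutate(child, rng, mutation_rate))
        best = max(population + [best], key=fitness)
    return best
```

Calling `evolve(lambda s: s["memory_consumption"])` would search for the sample with the highest memory consumption; the default arguments mirror the defaults documented in the Configurations section.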

Configurations

The algorithm relies on specific configurations to guide its execution. These settings can be adjusted in the Krkn-AI config file, which you generate using the discover command.

generations

Total number of generation loops to run (Default: 20)

  • The value for this field should be at least 1.
  • Setting this to a higher value increases Krkn-AI testing coverage.
  • Each scenario tested in the current generation retains some properties from the previous generation.

population_size

Minimum population size in each generation (Default: 10)

  • The value for this field should be at least 2.
  • Setting this to a higher value will increase the number of scenarios tested per generation, which is helpful for running diverse test samples.
  • A higher value is also preferred when you have a large set of objects in cluster components and multiple scenarios enabled.
  • If you have a limited set of components to be evaluated, you can set a smaller population size and fewer generations.

crossover_rate

How often crossover should occur for each scenario parameter (Default: 0.6 and Range: [0.0, 1.0])

  • A higher crossover rate increases the likelihood that a crossover operation will create two new candidate solutions from two existing candidates.
  • Setting the crossover rate to 1.0 ensures that crossover always occurs during the selection process.

mutation_rate

How often mutation should occur for each scenario parameter (Default: 0.7 and Range: [0.0, 1.0])

  • This helps to control the diversification among the candidates. A higher value increases the likelihood that a mutation operation will be applied.
  • Setting this to 1.0 ensures that mutation is always applied during the selection process.

composition_rate

How often a crossover leads to composition (Default: 0.0 and Range: [0.0, 1.0])

  • By default, composition is disabled (rate 0.0). Set a higher rate to increase the likelihood of composition.

population_injection_rate

How often random samples are newly added to the population (Default: 0.0 and Range: [0.0, 1.0])

  • A higher injection rate increases the likelihood of introducing new candidates into the existing generation.

population_injection_size

Number of random samples added to the new population on each injection (Default: 2)

  • A higher injection size means that more diversified samples get added during the evolutionary algorithm loop.
  • This is beneficial if you want to start with a smaller population test set and then increase the population size as you progress through the test.
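
The defaults and constraints listed above can be summarized in a small Python sketch. The class name EvolutionSettings is hypothetical and is not part of Krkn-AI; only the field names, defaults, and ranges come from this page.

```python
from dataclasses import dataclass


@dataclass
class EvolutionSettings:  # hypothetical name, not a Krkn-AI class
    generations: int = 20
    population_size: int = 10
    crossover_rate: float = 0.6
    mutation_rate: float = 0.7
    composition_rate: float = 0.0
    population_injection_rate: float = 0.0
    population_injection_size: int = 2

    def __post_init__(self):
        # Minimum values documented for the loop and population settings.
        if self.generations < 1:
            raise ValueError("generations should be at least 1")
        if self.population_size < 2:
            raise ValueError("population_size should be at least 2")
        # All rate settings share the documented range [0.0, 1.0].
        for name in ("crossover_rate", "mutation_rate",
                     "composition_rate", "population_injection_rate"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be within [0.0, 1.0], got {value}")
```

For a cluster with few components, `EvolutionSettings(generations=5, population_size=4)` would describe a smaller run, while `EvolutionSettings(population_size=1)` raises a ValueError.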

2 - Fitness Function

Configuring Fitness Function

The fitness function is a crucial element in the Krkn-AI algorithm. It evaluates each Chaos experiment and generates a score. These scores are then used during the selection phase of the algorithm to identify the best candidate solutions in each generation.

  • The fitness function can be defined as an SLO or as cluster metrics using a Prometheus query.
  • Fitness scores are calculated for the time range during which the Chaos scenario is executed.

Example

Let’s look at a simple fitness function that calculates the total number of restarts in a namespace:

fitness_function: 
  query: 'sum(kube_pod_container_status_restarts_total{namespace="robot-shop"})'
  type: point

This fitness function calculates the number of restarts that occurred during the test in the specified namespace. The resulting value is referred to as the Fitness Function Score. These scores are computed for each scenario in every generation and can be found in the scenario YAML configuration within the results. Below is an example of a scenario YAML configuration:

generation_id: 0
scenario_id: 1
scenario:
  name: node-memory-hog(60, 89, 8, kubernetes.io/hostname=node1,
    [], 1, quay.io/krkn-chaos/krkn-hog)
cmd: 'krknctl run node-memory-hog --telemetry-prometheus-backup False --wait-duration
  0 --kubeconfig ./tmp/kubeconfig.yaml --chaos-duration "60" --memory-consumption
  "89%" --memory-workers "8" --node-selector "kubernetes.io/hostname=node1"
  --taints "[]" --number-of-nodes "1" --image "quay.io/krkn-chaos/krkn-hog" '
log: ./results/logs/scenario_1.log
returncode: 0
start_time: '2025-09-01T16:55:12.607656'
end_time: '2025-09-01T16:58:35.204787'
fitness_result:
  scores: []
  fitness_score: 2
job_id: 1
health_check_results: {}

In the above result, the fitness score of 2 indicates that two restarts were observed in the namespace while running the node-memory-hog scenario. The algorithm uses this score as feedback to prioritize this scenario for further testing.

Types of Fitness Function

There are two types of fitness functions available in Krkn-AI: point and range.

Point-Based Fitness Function

In the point-based fitness function type, we calculate the difference in the fitness function value between the end and the beginning of the Chaos experiment. This difference signifies the change that occurred during the experiment phase, allowing us to capture the delta. This approach is especially useful for Prometheus metrics that are counters and only increase, as the difference helps us determine the actual change during the experiment.

E.g., SLO: Pod restarts across the “robot-shop” namespace.

fitness_function: 
  query: 'sum(kube_pod_container_status_restarts_total{namespace="robot-shop"})'
  type: point
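
The delta described above can be sketched in Python as follows. The function name and the sample counter readings are hypothetical, and the sketch does not show how Krkn-AI queries Prometheus.

```python
def point_score(value_at_start: float, value_at_end: float) -> float:
    """Change in the query result between the start and end of the experiment."""
    return value_at_end - value_at_start


# Hypothetical readings of the restart counter before and after a scenario.
restarts_before = 5.0
restarts_after = 7.0
score = point_score(restarts_before, restarts_after)
```

Because the restart counter only increases, the absolute reading (7 here) says little on its own; the difference isolates what happened during the experiment.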

Range-Based Fitness Function

Certain SLOs require us to consider changes that occur over a period of time by using aggregate values such as min, max, or average. For these types of value-based metrics in Prometheus, the range type of Fitness Function is useful.

Because the range type is calculated over a time interval—and the exact timing of each Chaos experiment may not be known in advance—we provide a $range$ parameter that must be used in the fitness function definition.

E.g., SLO: Max CPU observed for a container.

fitness_function: 
  query: 'max_over_time(container_cpu_usage_seconds_total{namespace="robot-shop", container="mysql"}[$range$])'
  type: range
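
As a rough illustration of the $range$ placeholder, the Python sketch below fills it with the duration of an experiment. Formatting the duration as whole seconds (for example 203s) is an assumption; this page does not state how Krkn-AI renders the value. The timestamps are taken from the scenario result shown earlier.

```python
import math
from datetime import datetime


def fill_range(query: str, start: datetime, end: datetime) -> str:
    """Replace the $range$ placeholder with the experiment duration in seconds."""
    # Assumption: round up to whole seconds and never go below 1s.
    seconds = max(1, math.ceil((end - start).total_seconds()))
    return query.replace("$range$", f"{seconds}s")


query = ('max_over_time(container_cpu_usage_seconds_total'
         '{namespace="robot-shop", container="mysql"}[$range$])')
start = datetime.fromisoformat("2025-09-01T16:55:12.607656")
end = datetime.fromisoformat("2025-09-01T16:58:35.204787")
resolved = fill_range(query, start, end)
```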

Defining Multiple Fitness Functions

Krkn-AI allows you to define multiple fitness function items in the YAML configuration, enabling you to track how individual fitness values vary for different scenarios in the final outcome.

You can assign a weight to each fitness function to specify how its value impacts the final score used during Genetic Algorithm selection. Each weight should be between 0 and 1. If no weight is specified, it defaults to 1.

fitness_function:
  items:
  - query: 'sum(kube_pod_container_status_restarts_total{namespace="robot-shop"})'
    type: point
    weight: 0.3
  - query: 'sum(kube_pod_container_status_restarts_total{namespace="etcd"})'
    type: point
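
One plausible way to combine weighted items is a weighted sum, sketched below in Python. This is an assumption for illustration; this page does not specify the exact formula Krkn-AI uses, and the per-query scores are made up.

```python
def combined_score(items):
    """Weighted sum of per-query scores; a missing weight counts as 1."""
    return sum(item.get("weight", 1.0) * item["score"] for item in items)


# Hypothetical per-query scores for the two items configured above.
items = [
    {"name": "robot-shop restarts", "score": 4.0, "weight": 0.3},
    {"name": "etcd restarts", "score": 1.0},  # no weight given, so 1 is used
]
total = combined_score(items)
```

Under this assumption, four restarts in robot-shop contribute about as much to the total as one restart in etcd, which is the kind of prioritization that weights are meant to express.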

Krkn Failures

Krkn-AI uses krknctl under the hood to trigger Chaos testing experiments on the cluster. As part of the CLI, it captures various feedback and returns a non-zero status code when a failure occurs. By default, feedback from these failures is included in the Krkn-AI Fitness Score calculation.

You can disable this by setting include_krkn_failure to false.

fitness_function:
    include_krkn_failure: false
    query: 'sum(kube_pod_container_status_restarts_total{namespace="robot-shop"})'
    type: point

Health Check

Results from application health checks are also incorporated into the fitness score. See the Application Health Checks section for details on what they measure and how to configure them.

How to Define a Good Fitness Function

  • Scoring: The higher the fitness score, the more priority will be given to that scenario for generating new sets of scenarios. This also means that scenarios with higher fitness scores are more likely to have an impact on the cluster and should be further investigated.

  • Normalization: Krkn-AI currently does not apply any normalization, except when weights are assigned to fitness functions. This does not significantly affect the algorithm, but normalized SLO queries in PromQL make the scores easier to interpret. For example, instead of using the maximum CPU for a pod as a fitness function, it may be more convenient to use the CPU percentage of a pod.

  • Use-Case Driven: The fitness function query should be defined based on your use case. If you want to optimize your cluster for maximum uptime, a good fitness function could be to capture restart counts or the number of unavailable pods. Similarly, if you are interested in optimizing your cluster to ensure no downtime due to resource constraints, a good fitness function would be to measure the maximum CPU or memory percentage.

3 - Application Health Checks

Configuring Application Health Checks

When defining the Chaos Config, you can provide details about your application endpoints. Krkn-AI can access these endpoints during the Chaos experiment to evaluate how the application’s uptime is impacted.

Configuration

The following configuration options are available when defining an application for health checks:

  • name: Name of the service.
  • url: Service endpoint; supports parameterization with “$”.
  • status_code: Expected status code returned when accessing the service.
  • timeout: Timeout period after which the request is canceled.
  • interval: How often to check the endpoint.
  • stop_watcher_on_failure: This setting allows you to stop the health check watcher for an endpoint after it encounters a failure.

Example

health_checks:
  stop_watcher_on_failure: false
  applications:
  - name: cart
    url: "$HOST/cart/add/1/Watson/1"
    status_code: 200
    timeout: 10
    interval: 2
  - name: catalogue
    url: "$HOST/catalogue/categories"
  - name: shipping
    url: "$HOST/shipping/codes"
  - name: payment
    url: "$HOST/payment/health"
  - name: user
    url: "$HOST/user/uniqueid"
  - name: ratings
    url: "$HOST/ratings/api/fetch/Watson"
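
The options above can be read as a simple polling loop. The Python sketch below is only an illustration of that reading, not Krkn-AI's watcher: the HealthCheck class, the watch function, the injected probe and sleep callables, and the interpretation of timeout and interval as seconds are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class HealthCheck:  # hypothetical container for one application entry
    name: str
    url: str
    status_code: int
    timeout: float   # assumed to be seconds before a request is canceled
    interval: float  # assumed to be seconds between two checks


def watch(check: HealthCheck,
          probe: Callable[[str, float], int],
          sleep: Callable[[float], None],
          max_checks: int,
          stop_watcher_on_failure: bool = False) -> List[bool]:
    """Poll one endpoint and record whether each response matched status_code.

    `probe(url, timeout)` returns the HTTP status code it observed. It is
    injected so the sketch does not depend on a particular HTTP client.
    """
    results = []
    for _ in range(max_checks):
        ok = probe(check.url, check.timeout) == check.status_code
        results.append(ok)
        if not ok and stop_watcher_on_failure:
            break  # stop watching this endpoint after its first failure
        sleep(check.interval)
    return results
```

With stop_watcher_on_failure left at false, the loop keeps polling after a failure, so the results also show whether the endpoint recovered.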

URL Parameterization

When defining Krkn-AI config files, the URL entry for an application may vary depending on the cluster. To make the URL configuration more manageable, you can specify the values for these parameters at runtime using the --param flag.

In the previous example, the $HOST variable in the config can be dynamically replaced during the Krkn-AI experiment run, as shown below.

uv run krkn_ai run -c krkn-ai.yaml -o results/ -p HOST=http://example.cluster.url/nginx
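
The substitution itself behaves like ordinary template expansion. As a minimal sketch, Python's standard string.Template produces the same kind of result; whether Krkn-AI uses this mechanism internally is an assumption, and resolve_url is a hypothetical helper.

```python
from string import Template


def resolve_url(url: str, params: dict) -> str:
    """Substitute $NAME placeholders in a URL with values given at runtime."""
    return Template(url).substitute(params)


resolved = resolve_url("$HOST/cart/add/1/Watson/1",
                       {"HOST": "http://example.cluster.url/nginx"})
```

In this sketch a placeholder with no matching parameter raises a KeyError, which surfaces a forgotten parameter early instead of probing a malformed URL.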

Configure Health Check Score into Fitness Function

By default, the results of health checks—including whether each check succeeded and the response times—are incorporated into the overall Fitness Function score. This allows Krkn-AI to use application health as part of its evaluation criteria.

If you want to exclude health check results from influencing the fitness score, you can set the include_health_check_failure and include_health_check_response_time fields to false in your configuration.

fitness_function:
    ...
    include_health_check_failure: false
    include_health_check_response_time: false

4 - Scenarios

Available Krkn-AI Scenarios

The following Krkn scenarios are currently supported by Krkn-AI.

At least one scenario must be enabled for the Krkn-AI experiment to run.

| Scenario | Krkn-AI Config (YAML) |
| Pod Scenario | scenario.pod-scenarios |
| Application Outages | scenario.application-outages |
| Container Scenario | scenario.container-scenarios |
| Node CPU Hog | scenario.node-cpu-hog |
| Node Memory Hog | scenario.node-memory-hog |
| Time Scenario | scenario.time-scenarios |

By default, scenarios are not enabled. Depending on your use case, you can enable or disable these scenarios in the krkn-ai.yaml config file by setting the enable field to true or false.

scenario:
  pod-scenarios:
    enable: true

  application-outages:
    enable: false

  container-scenarios:
    enable: false

  node-cpu-hog:
    enable: true

  node-memory-hog:
    enable: true

  time-scenarios:
    enable: true
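
To check the "at least one scenario enabled" rule before starting a run, a small Python sketch like the following could be applied to the parsed config. The enabled_scenarios helper is hypothetical, and the dictionary mirrors the YAML above as it would look after parsing.

```python
def enabled_scenarios(config: dict) -> list:
    """Return the names of scenarios whose `enable` field is true."""
    scenarios = config.get("scenario", {})
    # Scenarios are not enabled by default, so a missing field counts as false.
    return [name for name, opts in scenarios.items() if opts.get("enable", False)]


config = {
    "scenario": {
        "pod-scenarios": {"enable": True},
        "application-outages": {"enable": False},
        "container-scenarios": {"enable": False},
        "node-cpu-hog": {"enable": True},
        "node-memory-hog": {"enable": True},
        "time-scenarios": {"enable": True},
    }
}

active = enabled_scenarios(config)
if not active:
    raise ValueError("at least one scenario must be enabled")
```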