What is krkn-ai?

Krkn-AI lets you automatically run Chaos scenarios and discover the most effective experiments to evaluate your system’s resilience.

How does it work?

Krkn-AI leverages evolutionary algorithms to generate experiments based on Krkn scenarios. By using user-defined objectives such as SLOs and application health checks, it can identify the critical experiments that impact the cluster.

  1. Generate a Krkn-AI config file using the discover command. This produces a YAML file pre-populated with cluster component information and a basic setup.
  2. The config file can be further customized to suit your requirements for Krkn-AI testing.
  3. Start Krkn-AI testing:
    • The evolutionary algorithm will use the cluster components specified in the config file as possible inputs required to run the Chaos scenarios.
    • User-defined SLOs and application health check feedback are taken into account to guide the algorithm.
  4. Analyze the results to evaluate the impact of different Chaos scenarios on application liveness, along with their fitness scores.
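
At a high level, a typical session looks like the sketch below; the paths are placeholders, adjust them for your environment:

# 1. Discover cluster components and generate a starting config
uv run krkn_ai discover -k ./path/to/kubeconfig.yaml -o ./krkn-ai.yaml

# 2. Edit ./krkn-ai.yaml (SLOs, health checks, scenarios), then start testing
uv run krkn_ai run -c ./krkn-ai.yaml -o ./results/

# 3. Review fitness scores, reports, and graphs under ./results/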

Getting Started

Follow the installation steps to set up the Krkn-AI CLI.

1 - Getting Started

How to deploy a sample microservice and run a Krkn-AI test

Getting Started with Krkn-AI

This documentation details how to deploy a sample microservice application on a Kubernetes cluster and run a Krkn-AI test.

Prerequisites

  • Follow this guide to install the Krkn-AI CLI.
  • Krkn-AI uses the Thanos Querier to fetch SLO metrics via PromQL. You can install it by setting up the prometheus-operator in your cluster; one way to do this is sketched after this list.
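
If your cluster does not already ship a monitoring stack, one common way to set up the prometheus-operator is the kube-prometheus-stack Helm chart. The chart, release name, and namespace below are illustrative, not a Krkn-AI requirement:

# Install the Prometheus Operator stack via Helm (illustrative)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack -n monitoring --create-namespace

On OpenShift, the built-in monitoring stack already exposes a Thanos Querier in the openshift-monitoring namespace, which Krkn-AI can pick up automatically (see the run step later in this guide).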

Deploy Sample Microservice

For demonstration purposes, we will deploy a sample microservice application called robot-shop on the cluster:

# Change to Krkn-AI project directory
cd krkn-ai/

# Namespace in which to deploy the microservice application
export DEMO_NAMESPACE=robot-shop

# Whether the K8s cluster is an OpenShift cluster
export IS_OPENSHIFT=true
./scripts/setup-demo-microservice.sh

# Set context to the demo namespace
oc config set-context --current --namespace=$DEMO_NAMESPACE
# If you are using kubectl:
# kubectl config set-context --current --namespace=$DEMO_NAMESPACE

# Check whether pods are running
oc get pods

We will deploy an NGINX reverse proxy and a LoadBalancer service in the cluster to expose routes to some of the pods.

# Setup NGINX reverse proxy for external access
./scripts/setup-nginx.sh

# Check nginx pod
oc get pods -l app=nginx-proxy

# Test application endpoints
./scripts/test-nginx-routes.sh

export HOST="http://$(kubectl get service rs -o json | jq -r '.status.loadBalancer.ingress[0].hostname')"

📝 Generate Configuration

Krkn-AI uses YAML configuration files to define experiments. You can generate a sample config file dynamically by running the Krkn-AI discover command.

$ uv run krkn_ai discover --help
Usage: krkn_ai discover [OPTIONS]

  Discover components for Krkn-AI tests

Options:
  -k, --kubeconfig TEXT   Path to cluster kubeconfig file.
  -o, --output TEXT       Path to save config file.  [default: ./krkn-ai.yaml]
  -n, --namespace TEXT    Namespace(s) to discover components in. Supports
                          Regex and comma separated values.  [default: .*]
  -pl, --pod-label TEXT   Pod Label Keys(s) to filter. Supports Regex and
                          comma separated values.  [default: .*]
  -nl, --node-label TEXT  Node Label Keys(s) to filter. Supports Regex and
                          comma separated values.  [default: .*]
  -v, --verbose           Increase verbosity of output.  [default: 0]
  --help                  Show this message and exit.

# Discover components in cluster to generate the config
$ uv run krkn_ai discover -k ./path/to/kubeconfig.yaml -n "robot-shop" -pl "service" -o ./krkn-ai.yaml

The discover command generates a YAML file containing the initial boilerplate for testing. You can modify this file to include custom SLO definitions and cluster components, and to configure algorithm settings for your testing use case.

# Path to your kubeconfig file
kubeconfig_file_path: "./path/to/kubeconfig.yaml"

# Genetic algorithm parameters
generations: 5
population_size: 10
composition_rate: 0.3
population_injection_rate: 0.1

# Fitness function configuration for defining SLO
# In the below example, we use Total Restarts in "robot-shop" namespace as the SLO
fitness_function: 
  query: 'sum(kube_pod_container_status_restarts_total{namespace="robot-shop"})'
  type: point
  # Whether to include non-zero exit code status in the fitness function scoring
  include_krkn_failure: true

# Health endpoints for synthetic monitoring of applications
health_checks:
  stop_watcher_on_failure: false
  applications:
  - name: cart
    url: "$HOST/cart/add/1/Watson/1"
  - name: catalogue
    url: "$HOST/catalogue/categories"

# Chaos scenarios to consider during testing
scenario:
  pod-scenarios:
    enable: true
  application-outages:
    enable: true
  container-scenarios:
    enable: false
  node-cpu-hog:
    enable: false
  node-memory-hog:
    enable: false

# Cluster components to consider for Krkn-AI testing
cluster_components:
  namespaces:
  - name: robot-shop
    pods:
    - containers:
      - name: cart
      labels:
        service: cart
      name: cart-7cd6c77dbf-j4gsv
    - containers:
      - name: catalogue
      labels:
        service: catalogue
      name: catalogue-94df6b9b-pjgsr
  nodes:
  - labels:
      kubernetes.io/hostname: node-1
    name: node-1
  - labels:
      kubernetes.io/hostname: node-2
    name: node-2

Running Krkn-AI

Once your test configuration is set, you can start Krkn-AI testing with the run command. This command initializes a random population of Chaos experiments based on the Krkn-AI configuration and then starts the evolutionary algorithm, which runs the experiments, gathers feedback, and continues evolving the existing scenarios until the total number of generations defined in the config is reached.

$ uv run krkn_ai run --help
Usage: krkn_ai run [OPTIONS]

  Run Krkn-AI tests

Options:
  -c, --config TEXT                     Path to Krkn-AI config file.
  -o, --output TEXT                     Directory to save results.
  -f, --format [json|yaml]              Format of the output file.  [default: yaml]
  -r, --runner-type [krknctl|krknhub]   Type of chaos engine to use.
  -p, --param TEXT                      Additional parameters for config file in key=value format.
  -v, --verbose                         Increase verbosity of output.  [default: 0]
  --help                                Show this message and exit.


# Configure Prometheus
# (Optional) On an OpenShift cluster, the framework will automatically look for the Thanos Querier in the openshift-monitoring namespace.
export PROMETHEUS_URL='https://Thanos-Querier-url'
export PROMETHEUS_TOKEN='enter-access-token'

# Start Krkn-AI test
uv run krkn_ai run -vv -c ./krkn-ai.yaml -o ./tmp/results/ -p HOST=$HOST

Understanding the Results

In the ./tmp/results directory, you will find the results from testing. The final results contain information about each scenario, its fitness evaluation scores, reports, and graphs, which you can use for further investigation.

.
└── results/
    ├── reports/
    │   ├── best_scenarios.yaml
    │   ├── health_check_report.csv
    │   └── graphs/
    │       ├── best_generation.png
    │       ├── scenario_1.png
    │       ├── scenario_2.png
    │       └── ...
    ├── yaml/
    │   ├── generation_0/
    │   │   ├── scenario_1.yaml
    │   │   ├── scenario_2.yaml
    │   │   └── ...
    │   └── generation_1/
    │       └── ...
    ├── log/
    │   ├── scenario_1.log
    │   ├── scenario_2.log
    │   └── ...
    └── krkn-ai.yaml

Reports Directory:

  • health_check_report.csv: Summary of application health checks, containing details about the scenario, component, failure status, and latency.
  • best_scenarios.yaml: YAML file containing information about the best scenario identified in each generation.
  • best_generation.png: Visualization of the best fitness score found in each generation.
  • scenario_<id>.png: Per-scenario visualization with a line plot of health check response times and a heatmap of successes and failures.

YAML:

  • scenario_<id>.yaml: YAML file describing the executed Chaos scenario, including the krknctl command, fitness scores, health check metrics, etc. These files are organized under a folder for each generation.

Log:

  • scenario_<id>.log: Logs captured from the krknctl scenario run.

2 - Cluster Discovery

Automatically discover cluster components for Krkn-AI testing.

Krkn-AI uses a genetic algorithm to generate Chaos scenarios. These scenarios require information about the components available in the cluster, which is obtained from the cluster_components YAML field of the Krkn-AI configuration.

CLI Usage

$ uv run krkn_ai discover --help
Usage: krkn_ai discover [OPTIONS]

  Discover components for Krkn-AI tests

Options:
  -k, --kubeconfig TEXT   Path to cluster kubeconfig file.
  -o, --output TEXT       Path to save config file.
  -n, --namespace TEXT    Namespace(s) to discover components in. Supports
                          Regex and comma separated values.
  -pl, --pod-label TEXT   Pod Label Keys(s) to filter. Supports Regex and
                          comma separated values.
  -nl, --node-label TEXT  Node Label Keys(s) to filter. Supports Regex and
                          comma separated values.
  -v, --verbose           Increase verbosity of output.
  --help                  Show this message and exit.

Example

The example below filters cluster components from namespaces that match the patterns robot-.* and etcd. In addition to namespaces, we also provide filters for pod labels and node labels. This allows us to narrow down the necessary components to consider when running a Krkn-AI test.

$ uv run krkn_ai discover -k ./path/to/kubeconfig.yaml -n "robot-.*,etcd" -pl "service,env" -nl "disktype" -o ./krkn-ai.yaml

The above command generates a config file that contains the basic setup to help you get started. You can customize the parameters as described in the configs documentation. If you want to exclude any cluster components—such as a pod, node, or namespace—from being considered for Krkn-AI testing, simply remove them from the cluster_components YAML field.

# Path to your kubeconfig file
kubeconfig_file_path: "./path/to/kubeconfig.yaml"

# Genetic algorithm parameters
generations: 5
population_size: 10
composition_rate: 0.3
population_injection_rate: 0.1

# Fitness function configuration for defining SLO
# In the below example, we use Total Restarts in "robot-shop" namespace as the SLO
fitness_function: 
  query: 'sum(kube_pod_container_status_restarts_total{namespace="robot-shop"})'
  type: point
  include_krkn_failure: true

# Chaos scenarios to consider during testing
scenario:
  pod-scenarios:
    enable: true
  application-outages:
    enable: true
  container-scenarios:
    enable: false
  node-cpu-hog:
    enable: false
  node-memory-hog:
    enable: false

# Cluster components to consider for Krkn-AI testing
cluster_components:
  namespaces:
  - name: robot-shop
    pods:
    - containers:
      - name: cart
      labels:
        service: cart
        env: dev
      name: cart-7cd6c77dbf-j4gsv
    - containers:
      - name: catalogue
      labels:
        service: catalogue
        env: dev
      name: catalogue-94df6b9b-pjgsr
  - name: etcd
    pods:
    - containers:
      - name: etcd
      labels:
        service: etcd
      name: etcd-0
    - containers:
      - name: etcd
      labels:
        service: etcd
      name: etcd-1
  nodes:
  - labels:
      kubernetes.io/hostname: node-1
      disktype: SSD
    name: node-1
  - labels:
      kubernetes.io/hostname: node-2
      disktype: HDD
    name: node-2

3 - Configuration

Configuring Krkn-AI

Krkn-AI is configured using a simple declarative YAML file. This file can be automatically generated using Krkn-AI’s discover feature, which creates a config file from a boilerplate template. The generated config file will have the cluster components pre-populated based on your cluster.

3.1 - Evolutionary Algorithm

Configuring Evolutionary Algorithm

Krkn-AI uses an online learning approach by leveraging an evolutionary algorithm, where an agent runs tests on the actual cluster and gathers feedback by measuring various KPIs for your cluster and application. The algorithm begins by creating random population samples that contain Chaos scenarios. These scenarios are executed on the cluster, feedback is collected, and then the best samples (parents) are selected to undergo crossover and mutation operations to generate the next set of samples (offspring). The algorithm relies on heuristics to guide the exploration and exploitation of scenarios.

Genetic Algorithm

Terminologies

  • Generation: A single iteration or cycle of the algorithm during which the population evolves. Each generation produces a new set of candidate solutions.
  • Population: The complete set of candidate solutions (individuals) at a given generation.
  • Sample (or Individual): A single candidate solution within the population, often represented as a chromosome or genome. In our case, this is equivalent to a Chaos experiment.
  • Selection: The process of choosing individuals from the population (based on fitness) to serve as parents for producing the next generation.
  • Crossover: The operation of combining two Chaos experiments to produce a new scenario, encouraging the exploration of new solutions.
  • Mutation: A random alteration of parts of a Chaos experiment.
  • Composition: The process of combining existing Chaos experiments into a grouped scenario to represent a single new scenario.
  • Population Injection: The introduction of new individuals into the population to escape stagnation.

Configurations

The algorithm relies on specific configurations to guide its execution. These settings can be adjusted in the Krkn-AI config file, which you generate using the discover command.

generations

Total number of generation loops to run (Default: 20)

  • The value for this field should be at least 1.
  • Setting this to a higher value increases Krkn-AI testing coverage.
  • Each scenario tested in the current generation retains some properties from the previous generation.

population_size

Minimum Population size in each generation (Default: 10)

  • The value for this field should be at least 2.
  • Setting this to a higher value will increase the number of scenarios tested per generation, which is helpful for running diverse test samples.
  • A higher value is also preferred when you have a large set of objects in cluster components and multiple scenarios enabled.
  • If you have a limited set of components to be evaluated, you can set a smaller population size and fewer generations.

crossover_rate

How often crossover should occur for each scenario parameter (Default: 0.6 and Range: [0.0, 1.0])

  • A higher crossover rate increases the likelihood that a crossover operation will create two new candidate solutions from two existing candidates.
  • Setting the crossover rate to 1.0 ensures that crossover always occurs during the selection process.

mutation_rate

How often mutation should occur for each scenario parameter (Default: 0.7 and Range: [0.0, 1.0])

  • This helps to control the diversification among the candidates. A higher value increases the likelihood that a mutation operation will be applied.
  • Setting this to 1.0 ensures persistent mutation during the selection process.

composition_rate

How often a crossover would lead to composition (Default: 0.0 and Range: [0.0, 1.0])

  • By default, composition is disabled (rate of 0.0), but you can set it to a higher rate to increase the likelihood of composition.

population_injection_rate

How often random samples get newly added to the population (Default: 0.0 and Range: [0.0, 1.0])

  • A higher injection rate increases the likelihood of introducing new candidates into the existing generation.

population_injection_size

Number of random samples that get added to the new population (Default: 2)

  • A higher injection size means that more diversified samples get added during the evolutionary algorithm loop.
  • This is beneficial if you want to start with a smaller population test set and then increase the population size as you progress through the test.
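
Putting these parameters together, the algorithm section of a Krkn-AI config might look like the sketch below; the values shown are the documented defaults, tune them to your cluster size and time budget:

# Genetic algorithm parameters (defaults shown)
generations: 20
population_size: 10
crossover_rate: 0.6
mutation_rate: 0.7
composition_rate: 0.0
population_injection_rate: 0.0
population_injection_size: 2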

3.2 - Fitness Function

Configuring Fitness Function

The fitness function is a crucial element in the Krkn-AI algorithm. It evaluates each Chaos experiment and generates a score. These scores are then used during the selection phase of the algorithm to identify the best candidate solutions in each generation.

  • The fitness function can be defined as an SLO or as cluster metrics using a Prometheus query.
  • Fitness scores are calculated for the time range during which the Chaos scenario is executed.

Example

Let’s look at a simple fitness function that calculates the total number of restarts in a namespace:

fitness_function: 
  query: 'sum(kube_pod_container_status_restarts_total{namespace="robot-shop"})'
  type: point

This fitness function calculates the number of restarts that occurred during the test in the specified namespace. The resulting value is referred to as the Fitness Function Score. These scores are computed for each scenario in every generation and can be found in the scenario YAML configuration within the results. Below is an example of a scenario YAML configuration:

generation_id: 0
scenario_id: 1
scenario:
  name: node-memory-hog(60, 89, 8, kubernetes.io/hostname=node1,
    [], 1, quay.io/krkn-chaos/krkn-hog)
cmd: 'krknctl run node-memory-hog --telemetry-prometheus-backup False --wait-duration
  0 --kubeconfig ./tmp/kubeconfig.yaml --chaos-duration "60" --memory-consumption
  "89%" --memory-workers "8" --node-selector "kubernetes.io/hostname=node1"
  --taints "[]" --number-of-nodes "1" --image "quay.io/krkn-chaos/krkn-hog" '
log: ./results/logs/scenario_1.log
returncode: 0
start_time: '2025-09-01T16:55:12.607656'
end_time: '2025-09-01T16:58:35.204787'
fitness_result:
  scores: []
  fitness_score: 2
job_id: 1
health_check_results: {}

In the above result, the fitness score of 2 indicates that two restarts were observed in the namespace while running the node-memory-hog scenario. The algorithm uses this score as feedback to prioritize this scenario for further testing.

Types of Fitness Function

There are two types of fitness functions available in Krkn-AI: point and range.

Point-Based Fitness Function

In the point-based fitness function type, we calculate the difference in the fitness function value between the end and the beginning of the Chaos experiment. This difference captures the change that occurred during the experiment phase. The approach is especially useful for Prometheus metrics that are counters and only ever increase, since the delta reflects the actual change during the experiment. For example, if the query below returns 5 restarts before the scenario starts and 7 when it ends, the point-based score is 2.

E.g. SLO: Pod restarts across the “robot-shop” namespace.

fitness_function: 
  query: 'sum(kube_pod_container_status_restarts_total{namespace="robot-shop"})'
  type: point

Range-Based Fitness Function

Certain SLOs require us to consider changes that occur over a period of time by using aggregate values such as min, max, or average. For these types of value-based metrics in Prometheus, the range type of Fitness Function is useful.

Because the range type is calculated over a time interval—and the exact timing of each Chaos experiment may not be known in advance—we provide a $range$ parameter that must be used in the fitness function definition.

E.g. SLO: Max CPU observed for a container.

fitness_function: 
  query: 'max_over_time(container_cpu_usage_seconds_total{namespace="robot-shop", container="mysql"}[$range$])'
  type: range

Defining Multiple Fitness Functions

Krkn-AI allows you to define multiple fitness function items in the YAML configuration, enabling you to track how individual fitness values vary for different scenarios in the final outcome.

You can assign a weight to each fitness function to specify how its value impacts the final score used during Genetic Algorithm selection. Each weight should be between 0 and 1. By default, if no weight is specified, it will be considered as 1.

fitness_function:
  items:
  - query: 'sum(kube_pod_container_status_restarts_total{namespace="robot-shop"})'
    type: point
    weight: 0.3
  - query: 'sum(kube_pod_container_status_restarts_total{namespace="etcd"})'
    type: point

Krkn Failures

Krkn-AI uses krknctl under the hood to trigger Chaos testing experiments on the cluster. The CLI captures various feedback and returns a non-zero status code when a failure occurs. By default, feedback from these failures is included in the Krkn-AI Fitness Score calculation.

You can disable this by setting include_krkn_failure to false.

fitness_function:
  include_krkn_failure: false
  query: 'sum(kube_pod_container_status_restarts_total{namespace="robot-shop"})'
  type: point

Health Check

Results from application health checks are also incorporated into the fitness score. You can learn more about health checks and how to configure them here.

How to Define a Good Fitness Function

  • Scoring: The higher the fitness score, the more priority will be given to that scenario for generating new sets of scenarios. This also means that scenarios with higher fitness scores are more likely to have an impact on the cluster and should be further investigated.

  • Normalization: Krkn-AI currently does not apply any normalization, except when a fitness function is assigned weights. While this does not significantly impact the algorithm, normalized SLO queries in PromQL are easier to interpret. For example, instead of using the maximum CPU for a pod as a fitness function, it may be more convenient to use the CPU percentage of a pod (a sketch follows this list).

  • Use-Case Driven: The fitness function query should be defined based on your use case. If you want to optimize your cluster for maximum uptime, a good fitness function could be to capture restart counts or the number of unavailable pods. Similarly, if you are interested in optimizing your cluster to ensure no downtime due to resource constraints, a good fitness function would be to measure the maximum CPU or memory percentage.
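
As a rough illustration of the normalization point above, a range-type fitness function could track a container's CPU usage as a fraction of its CPU limit instead of raw core-seconds. The namespace, container name, and exact PromQL below are illustrative assumptions; validate the query against your own metrics before using it:

# Hypothetical normalized SLO: peak CPU usage of the "mysql" container as a
# fraction of its CPU limit during the Chaos experiment window
fitness_function:
  query: 'max_over_time((sum(rate(container_cpu_usage_seconds_total{namespace="robot-shop", container="mysql"}[2m])) / sum(kube_pod_container_resource_limits{namespace="robot-shop", container="mysql", resource="cpu"}))[$range$:])'
  type: range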

3.3 - Application Health Checks

Configuring Application Health Checks

When defining the Chaos Config, you can provide details about your application endpoints. Krkn-AI can access these endpoints during the Chaos experiment to evaluate how the application’s uptime is impacted.

Configuration

The following configuration options are available when defining an application for health checks:

  • name: Name of the service.
  • url: Service endpoint; supports parameterization with “$”.
  • status_code: Expected status code returned when accessing the service.
  • timeout: Timeout period after which the request is canceled.
  • interval: How often to check the endpoint.
  • stop_watcher_on_failure: This setting allows you to stop the health check watcher for an endpoint after it encounters a failure.

Example

health_checks:
  stop_watcher_on_failure: false
  applications:
  - name: cart
    url: "$HOST/cart/add/1/Watson/1"
    status_code: 200
    timeout: 10
    interval: 2
  - name: catalogue
    url: "$HOST/catalogue/categories"
  - name: shipping
    url: "$HOST/shipping/codes"
  - name: payment
    url: "$HOST/payment/health"
  - name: user
    url: "$HOST/user/uniqueid"
  - name: ratings
    url: "$HOST/ratings/api/fetch/Watson"

URL Parameterization

When defining Krkn-AI config files, the URL entry for an application may vary depending on the cluster. To make the URL configuration more manageable, you can specify the values for these parameters at runtime using the --param flag.

In the previous example, the $HOST variable in the config can be dynamically replaced during the Krkn-AI experiment run, as shown below.

uv run krkn_ai run -c krkn-ai.yaml -o results/ -p HOST=http://example.cluster.url/nginx

Configure Health Check Scores in the Fitness Function

By default, the results of health checks—including whether each check succeeded and the response times—are incorporated into the overall Fitness Function score. This allows Krkn-AI to use application health as part of its evaluation criteria.

If you want to exclude health check results from influencing the fitness score, you can set the include_health_check_failure and include_health_check_response_time fields to false in your configuration.

fitness_function:
  ...
  include_health_check_failure: false
  include_health_check_response_time: false

3.4 - Scenarios

Available Krkn-AI Scenarios

The following Krkn scenarios are currently supported by Krkn-AI.

At least one scenario must be enabled for the Krkn-AI experiment to run.

Scenario              Krkn-AI Config (YAML)
Pod Scenario          scenario.pod-scenarios
Application Outages   scenario.application-outages
Container Scenario    scenario.container-scenarios
Node CPU Hog          scenario.node-cpu-hog
Node Memory Hog       scenario.node-memory-hog
Time Scenario         scenario.time-scenarios

By default, scenarios are not enabled. Depending on your use case, you can enable or disable these scenarios in the krkn-ai.yaml config file by setting the enable field to true or false.

scenario:
  pod-scenarios:
    enable: true

  application-outages:
    enable: false

  container-scenarios:
    enable: false

  node-cpu-hog:
    enable: true

  node-memory-hog:
    enable: true

  time-scenarios:
    enable: true