Chaos engineering is the practice of deliberately introducing controlled failures into a system to uncover weaknesses before they affect end users. By continuously running chaos experiments, teams can build greater confidence in their systems and identify real performance bottlenecks. However, applying chaos engineering in practice can be challenging due to the complex, dynamic nature of modern applications and infrastructure, especially on platforms like Kubernetes.
In this article, we introduce Krkn-AI, a project that addresses these challenges by providing a framework for AI-assisted, objective-driven chaos testing.
The challenge of reliability in modern systems
Modern applications are no longer monolithic programs. They are distributed, cloud-native, and composed of dozens—sometimes hundreds—of microservices spanning clusters and regions. This architecture enables flexibility and scalability, but it also introduces new layers of complexity. A small glitch in a single service can ripple outward, causing system-wide disruptions. Even brief downtime can result in revenue loss and eroded user trust. Reliability has become more than an engineering goal—it is a business-critical requirement.
The challenge lies in the unpredictability of real-world conditions. Sudden traffic surges, misconfigured pods, dependency failures, or hardware degradation can interact in ways that are nearly impossible to anticipate. Traditional testing, focused on correctness under normal conditions, cannot expose these unpredictable issues. Chaos engineering helps by simulating failures, but defining meaningful experiments and keeping them relevant as systems evolve is difficult. Today’s systems demand resilience strategies that adapt dynamically, learn intelligently, and continuously strengthen the system’s ability to withstand failures.
Why traditional chaos engineering isn’t enough
Chaos engineering has proven its value by exposing weaknesses through controlled failures—like shutting down pods, adding latency, or simulating resource exhaustion. But most practices still rely on manually defined experiments, where engineers decide what to break and then spend hours sifting through logs and metrics.
Real-world systems, however, are dynamic. Kubernetes clusters scale, workloads shift, and dependencies evolve, making static experiments insufficient. To keep pace, chaos testing must be adaptive, automated, and capable of learning from the system itself. Instead of guessing scenarios, engineers need tools that intelligently target weak points, run experiments, and deliver actionable insights. That’s where Krkn-AI comes in.
Our vision: Krkn-AI
Krkn-AI isn’t just another chaos testing tool—it’s designed to make resilience testing smarter, more automated, and easier to adopt. By combining evolutionary algorithms with the Krkn project, it automates experiment discovery, execution, and analysis so that engineers can focus on insights instead of manual setup.
Key highlights:
- Auto-framework for chaos engineering eliminates the need to manually author chaos experiments.
- Cluster-aware discovery automatically scans your Kubernetes cluster to identify components and services.
- Enhanced test coverage detects complex system-disrupting paths that manual testing or human analysis might overlook due to the vast fault space.
- Objective-driven testing (service-level objective–aware) lets users align chaos testing directly with business and operational goals (for example, latency, error rates, availability).
- Built-in health checks quickly surface which failures actually degrade performance or availability of your application by incorporating real-time monitoring.
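To make the SLO-aware health-check idea concrete, here is a minimal sketch of what such a check could look like. This is not Krkn-AI's actual API or configuration schema; the `SLOS` thresholds, `probe`, and `slo_violated` names are all hypothetical, illustrating only the general pattern of sampling an endpoint and comparing the results against objectives like latency and error rate.

```python
import time
import urllib.request

# Hypothetical SLO thresholds (illustrative values, not Krkn-AI's schema).
SLOS = {"p99_latency_ms": 500.0, "max_error_rate": 0.01}

def probe(url: str, samples: int = 20) -> dict:
    """Hit an endpoint repeatedly and summarize latency and errors."""
    latencies, errors = [], 0
    for _ in range(samples):
        start = time.monotonic()
        try:
            urllib.request.urlopen(url, timeout=2).read()
        except Exception:
            errors += 1
        latencies.append((time.monotonic() - start) * 1000)
    latencies.sort()
    p99 = latencies[int(len(latencies) * 0.99) - 1]
    return {"p99_latency_ms": p99, "error_rate": errors / samples}

def slo_violated(metrics: dict) -> bool:
    """True if any observed metric breaches its objective."""
    return (metrics["p99_latency_ms"] > SLOS["p99_latency_ms"]
            or metrics["error_rate"] > SLOS["max_error_rate"])
```

In a chaos run, a check like this would execute during fault injection, so that a scenario is judged by whether it actually pushed the application past its objectives rather than by whether the fault merely fired.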
How Krkn-AI works
Krkn-AI applies a genetic algorithm—an evolutionary technique inspired by natural selection—to refine chaos experiments automatically. Instead of relying on static, pre-defined tests, it generates, evaluates, and evolves scenarios based on measurable system impact, guided by service-level objectives (SLOs) and real-time health signals.
The evolutionary loop:
- Generate scenarios: Krkn-AI can run a variety of experiments, from single-fault (killing pods, adding latency, exhausting resources) to multi-fault, with faults injected in parallel or in sequence.
- Evaluate impact: Each experiment is measured against SLOs (like response time thresholds) and application health checks (latency, error rates, availability).
- Score results: Experiments that expose more stress or performance degradation receive higher scores.
- Evolve tests: The algorithm carries forward the most impactful experiments, mutates them, and produces a new generation of scenarios.
- Refine continuously: This feedback loop iteratively converges on tests that uncover hidden bottlenecks and weak points, without engineers needing to guess what to break next.
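The loop above can be sketched in a few lines of Python. This is a toy illustration of the generate/evaluate/score/evolve cycle, not Krkn-AI's implementation: the `FAULTS` catalog, the scenario representation, and the `measure` callback (which stands in for real SLO and health-check feedback) are all assumptions made for the example.

```python
import random

# Hypothetical fault catalog; Krkn-AI discovers real targets from the cluster.
FAULTS = ["kill-pod", "network-latency", "cpu-hog", "memory-hog", "disk-fill"]

def random_scenario(max_faults: int = 3) -> list:
    """A scenario is a small sequence of faults to inject."""
    return [random.choice(FAULTS) for _ in range(random.randint(1, max_faults))]

def mutate(scenario: list) -> list:
    """Swap one fault for another to produce a variant scenario."""
    out = list(scenario)
    out[random.randrange(len(out))] = random.choice(FAULTS)
    return out

def evolve(measure, population_size: int = 10, generations: int = 5) -> list:
    """Evolve scenarios toward higher measured impact.

    `measure(scenario)` returns a fitness score, e.g. how much SLO
    degradation the scenario caused when executed (higher = more impactful).
    """
    population = [random_scenario() for _ in range(population_size)]
    for _ in range(generations):
        ranked = sorted(population, key=measure, reverse=True)
        survivors = ranked[: population_size // 2]      # keep the most impactful
        children = [mutate(random.choice(survivors)) for _ in survivors]
        population = survivors + children               # next generation
    return max(population, key=measure)
```

In the real system the fitness score comes from running each scenario against the cluster and observing SLOs and health checks, which is what lets the search converge on failure combinations a human would be unlikely to enumerate by hand.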
Try it out
The Krkn-AI getting started guide shows you how to set up a microservice on Kubernetes or Red Hat OpenShift and run your first test.
For installation details, configuration options, and advanced scenarios, check out the documentation.
You can also watch a short video demo.
To learn more about the project, or if you are interested in using it and contributing, visit the Krkn-AI GitHub repository.