Continuous performance testing (CPT) is a critical aspect of modern software development, especially considering the mission-critical applications and diverse infrastructure on which Red Hat OpenShift can run. In this article, we will discuss the importance of continuous performance testing, the challenges the OpenShift Performance and Scale Team encountered, and how shifting left has increased our team's velocity.
Why should your organization introduce CPT into its workflow? In a rapidly changing industry, evolving software and specialized hardware increase the likelihood of performance bottlenecks. Without continuous performance testing, engineers are left with complex, error-prone manual testing, risking undetected bottlenecks, a degraded user experience, and higher operational costs.
This article begins a series on integrating performance testing into your continuous integration/continuous delivery (CI/CD) pipelines, offering insights and best practices for ensuring optimal performance of your application or platform. Be sure to follow the rest of our CPT articles exploring the tooling, performance regression analysis, and recent case studies.
Challenges of performance testing OpenShift
While OpenShift provides a robust and scalable platform, its complexity introduces significant performance testing hurdles. The platform relies on various open source components (e.g., etcd for its database, Open vSwitch for networking, and Prometheus for monitoring), and each element brings its own performance considerations.
Successfully testing OpenShift at scale requires correctly sizing cluster instances to avoid artificial bottlenecks, especially when measuring performance under heavy loads in large environments of 250+ nodes. Furthermore, the dynamic nature of containerized workloads demands a flexible, automated, and cloud-native testing solution. To meet these demands and stay ahead of industry trends like the massive shift toward virtualization, we developed kube-burner, a CNCF Sandbox project that provides the modern tooling necessary to rigorously test OpenShift's performance and scalability. In this series, you will learn more about kube-burner and how we use it within our continuous performance testing pipelines.
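kube-burner itself is driven by declarative job configurations, but the core idea is easy to picture. The sketch below is only a rough, hand-rolled illustration (not kube-burner's implementation) of the kind of control-plane measurement it automates at much larger scale: create a burst of pods with the Kubernetes Python client and record how long each takes to become Ready. The namespace, image, and pod count are illustrative values.

```python
# Rough, hand-rolled sketch of the kind of control-plane measurement
# kube-burner automates at much larger scale: create a burst of pods and
# record how long each one takes to report Ready. The namespace, image,
# and pod count are illustrative values, not kube-burner defaults.
import time

from kubernetes import client, config, watch

NAMESPACE = "perf-sketch"   # assumed to exist already
POD_COUNT = 50              # tiny compared to a real large-scale run

config.load_kube_config()   # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

# Create the pods and remember when each creation call returned.
created_at = {}
for i in range(POD_COUNT):
    name = f"density-pod-{i}"
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name=name, labels={"app": "perf-sketch"}),
        spec=client.V1PodSpec(containers=[
            client.V1Container(name="pause", image="registry.k8s.io/pause:3.9"),
        ]),
    )
    v1.create_namespaced_pod(NAMESPACE, pod)
    created_at[name] = time.time()

# Watch for the Ready condition and record each pod's ready latency.
latencies = {}
w = watch.Watch()
for event in w.stream(v1.list_namespaced_pod, namespace=NAMESPACE,
                      label_selector="app=perf-sketch", timeout_seconds=300):
    pod = event["object"]
    ready = any(c.type == "Ready" and c.status == "True"
                for c in (pod.status.conditions or []))
    if ready and pod.metadata.name in created_at:
        latencies.setdefault(pod.metadata.name,
                             time.time() - created_at[pod.metadata.name])
        if len(latencies) == POD_COUNT:
            w.stop()

values = sorted(latencies.values())
if values:
    p99 = values[int(0.99 * (len(values) - 1))]  # rough percentile, for illustration
    print(f"Approximate P99 PodReadyLatency over {len(values)} pods: {p99:.2f}s")
```

A real run layers on much more: thousands of objects rendered from templates, QPS and burst control, metrics scraped from Prometheus, and indexed results, which is exactly the tedium kube-burner exists to take off your hands.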
In recent years, the OpenShift ecosystem has grown far beyond a simple container orchestrator. This evolution has presented a significant quality engineering challenge from a performance and scale perspective. Our testing framework must now validate a complex matrix that includes installations on bare metal, public clouds (AWS, Azure, GCP, and IBM Cloud), and managed offerings such as ROSA and ARO.
The software stack ships around 30 built-in operators, over 20 certified operators, and nearly 20 layered products like ACM, ACS, and OpenShift GitOps. Adding to all of this is the extremely fast pace of development: the openshift/release repository alone averages 300 pull requests each week. All of these combinations of software allow OpenShift to support many use cases, but each combination has different performance characteristics and considerations we need to identify.
From ad-hoc to shift-left
The Red Hat OpenShift Performance and Scale Team has undergone significant transitions in our continuous performance testing journey over the past 10+ years.
Initially, the OpenShift Performance and Scale Team ran performance and scale tests during big feature releases, during architectural changes, and at the end of each OpenShift release. The team automated many of the tests, but we ran our CI/CD system in a bespoke way rather than using the engineering team's infrastructure. That bespoke system (Jenkins and Airflow) left us with the burden of maintaining it ourselves, which was complicated and required tribal knowledge.
During this time we also couldn't run our tests before a nightly build, or what we call a payload. This meant a nightly build could contain hundreds of changes, any of which could contribute to a performance regression. Our team needed to shift left: enable our performance and scale tests to run earlier in the development cycle, at the payload level or even before changes merge into our downstream product.
The team achieved a major breakthrough by shifting left and integrating our performance testing suite directly into the OpenShift CI/CD Prow pipeline. Adopting Prow was a strategic decision that made our workloads available to the entire OpenShift community, as it aligned our process with the primary CI system our OpenShift developers were already using.
The migration to Prow was straightforward because we had already automated our performance pipelines. To pivot, we reused our benchmark scripting and created new Prow steps in the OpenShift CI-Operator Step Registry. Once the steps were in Prow, we began working with the OpenShift Technical Release Team (TRT) to run our workloads as informing jobs for payloads. Being part of the TRT payload testing gives us the visibility to determine which handful of changes introduced into a downstream payload could impact OpenShift as a product.
This transition has been pivotal in several ways. First, it dramatically increased development velocity by automatically running tests on every OpenShift payload, allowing us to catch performance regressions early in the development cycle when they are faster and cheaper to fix. Second, it has led to seamless integration with OpenShift engineering. The Performance and Scale Team is now a core part of the release process, which has enhanced collaboration across the organization. This allows engineers to independently run performance tests on their pull requests, gain immediate feedback on their changes, and proactively work with other teams to optimize the platform.
Best practices for implementing CPT
To effectively implement continuous performance testing, consider the following best practices:
- Use the same CI system as your development team:
- Ensures your development team can also take advantage of your performance and scale tests.
- Reduces the need to maintain your own bespoke CI/CD system.
- Provides faster feedback to developers, since your tests can be part of the tests that run against their PRs.
- Automate everything:
- Integrate performance tests into the same CI/CD pipeline your developers use, so performance tests run automatically with every payload or, even better, before every merge into the codebase.
- Keep your automation agnostic to any specific CI/CD framework so you can easily lift and shift it.
- Define data-plane and control-plane performance baselines:
- Establish acceptable control-plane performance metrics (e.g., P99 PodReadyLatency and etcd CPU utilization).
- Having a clear set of baselines allows teams to understand the expected performance of your product.
- Establish acceptable performance for the data path of your platform (e.g., network performance).
- Simulate realistic workloads:
- Use tools like kube-burner, k8s-netperf, or ingress-perf that simulate real-world user behavior and varying load conditions without being too domain-specific.
- Monitor key metrics:
- Collect and analyze performance data, including CPU usage, memory consumption, network I/O, and application-specific metrics (see the Prometheus sketch following this list for one way to pull such metrics).
- Detect regressions early:
- Use tools like Orion to detect performance regressions early in the development cycle; their regression analysis gives developers a clear signal about which metrics increased or decreased (a simplified version of this kind of baseline check follows this list).
- Isolate performance tests:
- Run performance tests in isolated environments that mimic production as closely as possible to avoid interference from other processes.
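To make the metrics-monitoring practice concrete, here is a minimal sketch of collecting a couple of control-plane metrics from the cluster's Prometheus over its HTTP API. The route, token, and PromQL expressions below are illustrative placeholders, not values from our pipelines; adapt them to your cluster and to whichever metrics your baselines are defined against.

```python
# Minimal sketch of collecting control-plane metrics from the in-cluster
# Prometheus over its HTTP API. The route, token, and PromQL expressions
# are illustrative placeholders; adapt them to your cluster and to the
# metrics your baselines are defined against.
import requests

PROM_URL = "https://prometheus-k8s-openshift-monitoring.apps.example.com"  # assumed route
TOKEN = "sha256~REPLACE_ME"  # e.g., the output of `oc whoami -t`

QUERIES = {
    # Aggregate etcd CPU usage across the control plane (illustrative PromQL).
    "etcd_cpu_cores":
        'sum(rate(container_cpu_usage_seconds_total{namespace="openshift-etcd"}[5m]))',
    # 99th percentile API server request latency (illustrative PromQL).
    "apiserver_p99_latency_s":
        'histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket'
        '{verb!="WATCH"}[5m])) by (le))',
}

def query_instant(promql: str) -> float:
    """Run a Prometheus instant query and return the first sample value."""
    resp = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": promql},
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else float("nan")

if __name__ == "__main__":
    for name, promql in QUERIES.items():
        print(f"{name}: {query_instant(promql):.3f}")
```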
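To illustrate the baseline and regression-detection practices, the toy check below compares a run's metrics against agreed baselines and flags anything outside a tolerance. A tool such as Orion does this with far richer statistics; the metric names, baseline values, and 10% tolerance here are invented for illustration only.

```python
# Toy stand-in for the kind of check a regression-detection tool such as
# Orion performs with far richer statistics: compare a run's key metrics
# against agreed baselines and flag anything outside the tolerance. The
# metric names, baseline values, and 10% tolerance are invented.
BASELINES = {
    "p99_pod_ready_latency_s": 3.0,
    "etcd_cpu_cores": 1.5,
}
TOLERANCE = 0.10  # allow 10% drift before flagging a regression

def check_run(results: dict[str, float]) -> list[str]:
    """Return human-readable findings for metrics that regressed or are missing."""
    findings = []
    for metric, baseline in BASELINES.items():
        value = results.get(metric)
        if value is None:
            findings.append(f"{metric}: missing from this run")
        elif value > baseline * (1 + TOLERANCE):
            pct = (value - baseline) / baseline * 100
            findings.append(
                f"{metric}: {value:.2f} vs. baseline {baseline:.2f} (+{pct:.0f}%)")
    return findings

if __name__ == "__main__":
    # In a real pipeline these values would come from your metrics collection step.
    run = {"p99_pod_ready_latency_s": 3.8, "etcd_cpu_cores": 1.4}
    for finding in check_run(run):
        print("REGRESSION:", finding)
```

The value of a check like this is less in the arithmetic than in where it runs: wired into the same CI pipeline as your developers, it turns a vague "the build feels slower" into a specific metric, a percentage, and a short list of candidate changes.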
What’s next
Continuous performance testing is not just a nice-to-have; it's a necessity for ensuring the success of the Red Hat OpenShift ecosystem. By integrating performance testing early and continuously into our development lifecycle, we can deliver a high-performing, scalable, and reliable platform that meets the demands of our diverse customers. The Red Hat Performance and Scale Team's journey demonstrates the significant benefits of shifting performance testing left, leading to increased velocity and deeper integration into the product release cycle.
Consider how your organization can implement continuous performance testing to give your development teams an early signal of the impact their changes have on the performance of your product. Stay tuned for more articles in this series, diving into the tooling, the regression analysis, and the case studies that have come out of our continuous performance testing framework.