How to use Kafka Cruise Control for cluster optimization

July 5, 2023
Donato Marrazzo
Related topics:
Kafka
Related products:
Red Hat OpenShift Container Platform, Streams for Apache Kafka


    AMQ streams has recently promoted Cruise Control to General Availability. Cruise Control optimizes how Apache Kafka distributes its workload across brokers to improve performance and cluster health. Kafka clusters are often deployed once and then grow over time, hosting more and more topics. Because Kafka has a reputation for robustness and elasticity, operations teams tend to give it little attention beyond monitoring the key health indicators, and they often don’t know how to tune it for a new workload.

    Cruise Control can become a fundamental ally in managing your Kafka clusters and getting the most out of your hardware resources. Plus, with the AMQ streams operator, enabling it is as simple as turning a key. This article explains the key principles and how to make practical use of this exciting new capability.

    Unbalanced workloads

    When a new topic is created in Kafka, its partitions and their replicas are distributed evenly among the available brokers in the cluster. This is sensible behavior if your cluster is empty and you have no idea of the actual workload of the different partitions. Figure 1 shows an example of how eight topics with three partitions each are distributed across a cluster of three brokers. Partitions belonging to the same topic have the same color.

    Figure 1: How 8 topics with 3 equal partitions are distributed across a cluster of 3 brokers.

    Over time, layering multiple topics from different applications can result in partitions with different sizes and workloads. Simply increasing the size of the cluster does not solve the problem. In fact, the new cluster member will be used along with the existing ones to accommodate newly created topic partitions, but it won’t change the assignment of the existing topics.

    In Figure 2, a cluster running out of resources expands with a new broker, then the user adds another topic with two partitions. New partitions are assigned in a round-robin fashion. So some will be assigned to the new broker, but the overloaded brokers are not relieved.

    Figure 2: The new broker hosts only new partition replicas.
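
    The effect in Figure 2 can be reproduced with a small simulation. The following is a minimal sketch, not Kafka's actual assignment code: it assumes a plain round-robin placement with a fixed starting broker and a replication factor of 3 (both assumptions for illustration; the real algorithm has more detail, such as a randomized starting broker). The function names are hypothetical.

```python
from collections import Counter


def assign_round_robin(brokers, num_partitions, replication_factor, start=0):
    """Place each partition's replicas on consecutive brokers, round-robin."""
    assignment = []
    for p in range(num_partitions):
        first = (start + p) % len(brokers)
        replicas = [brokers[(first + r) % len(brokers)]
                    for r in range(replication_factor)]
        assignment.append(replicas)
    return assignment


def replicas_per_broker(topics):
    """Count how many partition replicas each broker hosts."""
    counts = Counter()
    for assignment in topics:
        for replicas in assignment:
            counts.update(replicas)
    return counts


# Eight topics with three partitions each on a three-broker cluster.
topics = [assign_round_robin([0, 1, 2], 3, 3) for _ in range(8)]
before = replicas_per_broker(topics)

# Broker 3 joins, then one more topic with two partitions is created.
topics.append(assign_round_robin([0, 1, 2, 3], 2, 3))
after = replicas_per_broker(topics)

print(dict(sorted(before.items())))  # {0: 24, 1: 24, 2: 24}
print(dict(sorted(after.items())))   # {0: 25, 1: 26, 2: 26, 3: 1}
```

    In this toy model the original brokers end up with more replicas than before, while the new broker hosts a single replica: adding capacity alone does not move any existing partition.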

    A Kafka cluster can be unbalanced from these different points of view:

    • Network utilization
    • RAM and CPU utilization
    • Disk utilization

    There are tools that allow the administrator to selectively reassign partitions across the cluster (kafka-reassign-partitions.sh), but this approach works only if you have a clear idea of the root cause of the unbalanced workload. Moreover, there are other requirements that you need to address:

    • Replicas of the same partition must be in different racks.
    • No physical resource can be exhausted (maximum capacity for disk, network, CPU).
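
    To make these two requirements concrete, here is a minimal sketch of a checker for a candidate assignment. It is only an illustration under stated assumptions: the data layout, the function name, and the sample values are hypothetical, and only disk capacity is checked (network and CPU would follow the same pattern).

```python
def find_violations(assignment, broker_rack, partition_size_mb, disk_capacity_mb):
    """Return a list of hard-constraint violations for a candidate assignment.

    assignment: partition name -> list of broker ids hosting its replicas
    broker_rack: broker id -> rack name
    """
    problems = []
    disk_used = {broker: 0 for broker in broker_rack}

    for partition, replicas in assignment.items():
        # Rack awareness: no two replicas of a partition in the same rack.
        racks = [broker_rack[b] for b in replicas]
        if len(set(racks)) < len(racks):
            problems.append(f"{partition}: replicas share a rack")
        for b in replicas:
            disk_used[b] += partition_size_mb[partition]

    # Capacity: no broker may exceed its disk capacity.
    for broker, used in disk_used.items():
        if used > disk_capacity_mb[broker]:
            problems.append(
                f"broker {broker}: disk {used} MB exceeds {disk_capacity_mb[broker]} MB"
            )
    return problems


# Hypothetical sample data: brokers 0 and 1 share rack "a".
problems = find_violations(
    assignment={"orders-0": [0, 1], "orders-1": [1, 2]},
    broker_rack={0: "a", 1: "a", 2: "b"},
    partition_size_mb={"orders-0": 600, "orders-1": 500},
    disk_capacity_mb={0: 1000, 1: 1000, 2: 1000},
)
for problem in problems:
    print(problem)
```

    Checking a single candidate is cheap; the hard part, discussed next, is searching among all possible candidates for a good one.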

    Trying to improve the partition assignment manually is tedious and error prone, and as the number of options grows it also tends to produce poorly optimized solutions. These are well-known mathematical optimization problems: even with a limited number of variables, the search space can become so large that exhaustively searching it for the optimal solution is impractical, even on the most advanced computers. In fact, they belong to the class of NP-complete or NP-hard problems.

    Fortunately, mathematicians and AI experts have developed algorithms that find at least a good enough (near-optimal) solution to those problems.
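
    The following sketch illustrates both points with toy numbers. It first counts the candidate assignments when every partition independently picks its set of replica brokers, then applies a simple greedy heuristic (heaviest partition first, onto the least loaded broker). This heuristic is a textbook technique chosen for illustration; it is not the algorithm Cruise Control uses, and the loads are made up.

```python
import math


def search_space_size(num_brokers, num_partitions, replication_factor):
    """Number of ways to choose a replica set for every partition."""
    return math.comb(num_brokers, replication_factor) ** num_partitions


def greedy_balance(partition_load, num_brokers):
    """Assign partitions, heaviest first, to the currently least loaded broker."""
    broker_load = [0] * num_brokers
    placement = {}
    by_load = sorted(partition_load.items(), key=lambda kv: kv[1], reverse=True)
    for partition, load in by_load:
        target = min(range(num_brokers), key=lambda b: broker_load[b])
        placement[partition] = target
        broker_load[target] += load
    return placement, broker_load


# 10 brokers, 100 partitions, 3 replicas each: far too many to enumerate.
print(len(str(search_space_size(10, 100, 3))), "digits")

# Hypothetical per-partition loads (single replica for simplicity).
placement, broker_load = greedy_balance(
    {"a": 9, "b": 7, "c": 6, "d": 5, "e": 4, "f": 2}, num_brokers=3
)
print(broker_load)  # [11, 11, 11]
```

    The greedy pass happens to reach a perfect balance here, but in general it only guarantees a reasonable solution, which is exactly the trade-off the heuristics accept.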

    Cruise Control for Apache Kafka

    LinkedIn, which originally created Apache Kafka and operates it at large scale, developed Cruise Control to keep its clusters healthy and later made it open source.

    Here is a summary of the key features of Kafka Cruise Control:

    • Resource utilization tracking for brokers, topics, and partitions.
    • Multi-goal rebalance proposal generation (a subset of the supported goals):
      • Rack-awareness
      • Resource capacity violation checks (CPU, DISK, Network I/O)
      • Per-broker replica count violation check
      • Resource utilization balance (CPU, DISK, Network I/O)
      • Leader traffic distribution
    • Execution of the generated proposal:
      • Rebalance the current partition topology
      • Rebalance onto newly added brokers
      • Rebalance before removing brokers

    AMQ streams makes Cruise Control truly accessible, especially within the Red Hat OpenShift Container Platform. In fact, the operator provides an easy way to deploy Cruise Control and introduces a declarative way to trigger the analysis and apply rebalance proposals.

    Rebalance the cluster

    Cruise Control is a sophisticated tool, with many options that allow administrators to tailor it to their specific environment.

    Before applying it to your production environment, it’s recommended that you understand all of its features in detail; they are discussed at length in the official documentation. For this article, we’ll use the default configuration, which already provides an excellent experience with this new feature and a good understanding of the overall process.

    Install Cruise Control

    In OpenShift Container Platform, enabling Cruise Control is a matter of adding a line to your normal configuration:

    oc patch kafka my-cluster --patch '{"spec":{"cruiseControl": {}}}' --type=merge
    

    Behind the scenes, a new pod running Cruise Control is launched, and the Kafka cluster is instrumented to collect the required metrics, which are delivered through a set of new dedicated topics. It’s worth mentioning that, to inject the metrics reporter, the Kafka pods go through a rolling update, which preserves service continuity.

    Simulate an unbalanced workload

    One of the challenges with this feature is testing it in a reproducible manner, so you may be wondering how to achieve an unbalanced cluster in your test environment. A rather simple way, I found, is to develop a Kafka producer that intentionally generates load against a subset of partitions whose leader is hosted on a particular broker. In fact, the broker that is the leader for a given partition is much more stressed than those that act as partition followers. Monitoring the CPU and network, you should get something that resembles Figure 3.

    Figure 3: The CPU and Network chart in Grafana.

    In Figure 3, the broker named my-cluster-kafka-0 differs from the others in both graphs. It scores less in the Network Idle graph, while it scores more in the CPU Usage graph.
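
    The partition selection behind such a producer can be sketched as follows. This is an illustration only, not the code of the demo application: the leader map is hypothetical sample data (a real producer would read it from the cluster metadata), and the function names are made up. Most Kafka client libraries let you pass an explicit partition when sending a record, which is where these targets would be used.

```python
from itertools import cycle


def partitions_led_by(leaders, broker_id):
    """Return the partitions whose leader is the given broker."""
    return sorted(p for p, leader in leaders.items() if leader == broker_id)


def skewed_targets(leaders, hot_broker, num_messages):
    """Pick a target partition for each message, always led by the hot broker."""
    hot = partitions_led_by(leaders, hot_broker)
    if not hot:
        raise ValueError(f"broker {hot_broker} leads no partitions")
    rotation = cycle(hot)
    return [next(rotation) for _ in range(num_messages)]


# Hypothetical leader map: partition -> broker id of its leader.
leaders = {0: 0, 1: 1, 2: 2, 3: 0, 4: 1, 5: 2}

targets = skewed_targets(leaders, hot_broker=0, num_messages=5)
print(targets)  # [0, 3, 0, 3, 0]
```

    Sending every record to a partition in this list concentrates the leader-side work on broker 0, which is the kind of imbalance visible in Figure 3.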

    Start rebalancing

    The following is a simple procedure to engage Cruise Control and rebalance the cluster:

    1. Deploy a basic rebalance configuration:

      echo "
      apiVersion: kafka.strimzi.io/v1beta2
      kind: KafkaRebalance
      metadata:
        name: full-rebalance
        labels:
          strimzi.io/cluster: my-cluster
      spec: {}
      " | oc apply -f -
      
    2. When the KafkaRebalance resource is deployed, Cruise Control analyzes the Kafka metrics and prepares an optimization proposal. Once the proposal is ready, the following command shows True under the PROPOSALREADY column:

      oc get kafkarebalance
      NAME             CLUSTER      PENDINGPROPOSAL   PROPOSALREADY   REBALANCING   READY   NOTREADY
      full-rebalance   my-cluster                     True 
      
    3. To finally kick off the optimization, you need to approve the proposal by annotating the KafkaRebalance resource:

      oc annotate kafkarebalance full-rebalance strimzi.io/rebalance=approve
      

      If you want to trigger the rebalance directly without any further approval step, you can add the following annotation:

      oc annotate kafkarebalance full-rebalance strimzi.io/rebalance-auto-approval=true
      
    4. After a few minutes, once enough data has been collected and the workload has stabilized, you should be able to evaluate the results, as shown in Figure 4, where the lines of the different cluster members tend to equalize.

      Figure 4: The CPU and network chart in Grafana after rebalance.

    Scale up and rebalance

    Let’s consider the situation where you need to scale out the current cluster and take immediate advantage of the newly added broker.

    1. Scale out the Kafka cluster:

      oc patch kafka my-cluster --patch '{"spec":{"kafka": {"replicas": 4}}}' --type=merge
      
    2. The following configuration asks Cruise Control to redistribute the workload by explicitly taking the new cluster member into account:

      echo "
      apiVersion: kafka.strimzi.io/v1beta2
      kind: KafkaRebalance
      metadata:
        name: add-brokers-rebalance
        labels:
          strimzi.io/cluster: my-cluster
      spec:
        mode: add-brokers
        brokers: [3]
      " | oc apply -f -
      

      This KafkaRebalance introduces two new spec properties: mode: add-brokers and brokers: [3]. Their purpose is to make Cruise Control aware of the newly added broker and instruct it to move part of the workload to the newcomer.

    3. Before approving the rebalance proposal, let’s look at some further details by running oc describe kafkarebalance add-brokers-rebalance:

      (...)
      Data To Move MB:               17
      Excluded Brokers For Leadership:
      Excluded Brokers For Replica Move:
      Excluded Topics:
      Intra Broker Data To Move MB:         0
      Monitored Partitions Percentage:      100
      Num Intra Broker Replica Movements:   0
      Num Leader Movements:                 0
      Num Replica Movements:                89
      On Demand Balancedness Score After:   80.87946050436929
      On Demand Balancedness Score Before:  76.41773549482696
      (...)    
      

      The output indicates that Cruise Control is going to move 17 MB of data and 89 partition replicas. This should improve the balancedness score by about 4.5 points (from roughly 76.4 to 80.9).

      You might wonder about the score value. The optimization algorithms translate the different goals into mathematical functions that measure how good a solution is with respect to the given goals. So the number alone doesn’t mean much, but you should expect the score to increase after optimization. For the sake of accuracy, not all Cruise Control actions contribute to the score. Therefore, the rebalancing activity may improve the health of the cluster even if the score does not change.
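
      As a small worked example, the figures from the excerpt above can be extracted and compared programmatically. This is a minimal sketch that parses a copy of the text shown in this article; the parsing function is hypothetical and assumes the simple "Key: value" layout of that excerpt.

```python
# Lines copied from the oc describe excerpt shown above.
describe_excerpt = """\
Data To Move MB:               17
Excluded Topics:
Num Leader Movements:                 0
Num Replica Movements:                89
On Demand Balancedness Score After:   80.87946050436929
On Demand Balancedness Score Before:  76.41773549482696
"""


def parse_summary(text):
    """Turn 'Key: value' lines into a dict, skipping keys without a value."""
    summary = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            summary[key.strip()] = float(value)
    return summary


summary = parse_summary(describe_excerpt)
improvement = (
    summary["On Demand Balancedness Score After"]
    - summary["On Demand Balancedness Score Before"]
)
print(f"Replica movements: {summary['Num Replica Movements']:.0f}")
print(f"Score improvement: {improvement:.2f}")  # Score improvement: 4.46
```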

    4. Approve the optimization:

      oc annotate kafkarebalance add-brokers-rebalance strimzi.io/rebalance=approve
      

    Later on, if you want to repeat the analysis and then trigger a new optimization, all you have to do is add another annotation:

    oc annotate kafkarebalance add-brokers-rebalance strimzi.io/rebalance=refresh
    

    In a production environment, the rebalancing process might take some time, but the cluster remains operational. Your clients could experience a brief pause (a few seconds) as their partitions move elsewhere and they have to reestablish communication with a new broker, so you may prefer to rebalance when the load is lighter. If the rebalancing is taking too long and peak time is approaching, you can stop the ongoing optimization. The rebalancing effort is broken down into a series of consecutive batches, so when a stop is needed, the following annotation tells Cruise Control not to start the next batch, while the running one is allowed to complete:

    oc annotate kafkarebalance add-brokers-rebalance strimzi.io/rebalance=stop
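
    The stop semantics described above can be modeled with a few lines of code. This is a conceptual sketch only, not Cruise Control's implementation: the function, the callbacks, and the batch size are hypothetical, and each "movement" is just a number.

```python
def run_in_batches(movements, batch_size, apply, stop_requested):
    """Apply movements batch by batch; a stop only prevents the next batch."""
    completed = []
    for start in range(0, len(movements), batch_size):
        if stop_requested():
            break  # the next batch must not start
        for movement in movements[start:start + batch_size]:
            apply(movement)  # the running batch is always completed
            completed.append(movement)
    return completed


stop = {"requested": False}


def apply(movement):
    # Simulate a stop request arriving in the middle of the second batch.
    if movement == 5:
        stop["requested"] = True


done = run_in_batches(list(range(10)), 4, apply, lambda: stop["requested"])
print(done)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

    The stop arrives while movement 5 is in progress, yet movements 6 and 7 still run because they belong to the same batch; only the third batch (8 and 9) is skipped.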
    

    If you want to test this feature in your demo environment, you can follow the detailed instructions of my AMQ streams demo project on GitHub. It will guide you through a Kafka cluster deployment, using the Grafana dashboard to inspect the workload. Then you will run an application capable of generating an unbalanced load, and finally you can evaluate the benefits of the rebalancing.

    Explore Kafka Cruise Control benefits

    Cruise Control is a valuable companion to Kafka, especially as your environment matures and evolves. AMQ streams makes it easy to take advantage of its many benefits, so it’s definitely worth adding it to your cluster. Before deploying it into a production environment, I recommend reading the official documentation, understanding the available options (e.g., filtering out rebalancing goals), and monitoring the reassignment closely.

    Last updated: October 25, 2024
