Smarter multi-cluster scheduling with dynamic scoring framework

In multi-cluster management, deciding where to deploy workloads is just as important as deciding what to deploy. Open Cluster Management (OCM) and Red Hat Advanced Cluster Management for Kubernetes provide powerful primitives for this through the Placement API and AddOnPlacementScore resources. These APIs enable intelligent workload distribution across clusters based on various criteria, but what if you could make these decisions based on real-time metrics of your choice from your clusters?

The power of Placement-based scheduling

The Placement API lets you select managed clusters dynamically based on labels, cluster claims (properties reported by the cluster), taints and tolerations, and built-in scores. You can extend the built-in scoring by combining the Placement API with AddOnPlacementScores, which let you rank clusters numerically and have the placement engine automatically select the best candidates for your workloads. Together, these let you build a sophisticated scheduler for multi-cluster environments.
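Score-based selection works roughly like this. Each prioritizer contributes a score per cluster, the scores are combined using per-prioritizer weights, and the highest-ranked clusters are selected. The following Python sketch is a simplified, hypothetical model of that idea. The `rank_clusters` helper, the dictionary shapes, and the plain weighted sum are assumptions for illustration, not the OCM scheduler's actual code.

```python
from typing import Dict, List

def rank_clusters(
    scores: Dict[str, Dict[str, int]],  # prioritizer name -> {cluster name: score}
    weights: Dict[str, int],            # prioritizer name -> weight
    number_of_clusters: int,
) -> List[str]:
    """Return the top clusters by weighted score (simplified, hypothetical model)."""
    totals: Dict[str, int] = {}
    for prioritizer, weight in weights.items():
        for cluster, score in scores.get(prioritizer, {}).items():
            # A cluster missing from a prioritizer simply gets no contribution from it.
            totals[cluster] = totals.get(cluster, 0) + weight * score
    # Highest total first; ties are broken by cluster name so the result is deterministic.
    ranked = sorted(totals, key=lambda cluster: (-totals[cluster], cluster))
    return ranked[:number_of_clusters]


# Example: a dynamic CPU score weighted twice as heavily as a cost score.
scores = {
    "cpu-predictor": {"cluster1": 80, "cluster2": 40, "cluster3": 95},
    "cost": {"cluster1": 20, "cluster2": 90, "cluster3": 10},
}
print(rank_clusters(scores, {"cpu-predictor": 2, "cost": 1}, number_of_clusters=2))
# ['cluster3', 'cluster1']
```

Changing only the weights changes the outcome. With `{"cpu-predictor": 2, "cost": 3}`, the same data ranks cluster2 first.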

While static scores work for fixed criteria (like cluster tier or region), real-world scheduling decisions often depend on dynamic factors:

Cost efficiency: Deploy to clusters with lower resource costs
Resource availability: Avoid overloaded clusters with high CPU or memory usage
Network latency: Choose clusters closer to your users or data sources
Predictive metrics: Score based on forecasted resource trends, not just current state

The challenge? Manually collecting metrics, calculating scores, and updating PlacementScore resources across dozens or hundreds of clusters is impractical.

Introducing the Dynamic Scoring Framework

The Dynamic Scoring Framework helps solve this problem by providing a general-purpose framework that automates cluster scoring based on Prometheus metrics. It acts as a bridge between your monitoring system and the Placement API, continuously evaluating clusters and updating their scores. The Placement API then generates a corresponding PlacementDecision that lists the selected clusters in its status. As an end user, you can read the selected clusters from the PlacementDecision and operate on those target clusters. Alternatively, you can integrate a higher-level workload orchestrator with the PlacementDecision to leverage its scheduling capabilities.
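Reading the selected clusters comes down to collecting `status.decisions[].clusterName` from the PlacementDecisions that belong to a Placement. The Python sketch below assumes you already have them as parsed JSON, for instance from `kubectl get placementdecisions -n <namespace> -l cluster.open-cluster-management.io/placement=<placement-name> -o json`. The `selected_clusters` helper and the sample data are hypothetical.

```python
import json
from typing import Any, Dict, List

def selected_clusters(decision_list: Dict[str, Any]) -> List[str]:
    """Collect cluster names from a PlacementDecision list (parsed `kubectl ... -o json` output)."""
    clusters: List[str] = []
    for decision_obj in decision_list.get("items", []):
        # A Placement can own several PlacementDecisions; each lists clusters in status.decisions.
        for decision in decision_obj.get("status", {}).get("decisions") or []:
            clusters.append(decision["clusterName"])
    return clusters


# Hypothetical sample output for a Placement that selected two clusters.
sample = json.loads("""
{
  "apiVersion": "v1",
  "kind": "List",
  "items": [
    {
      "apiVersion": "cluster.open-cluster-management.io/v1beta1",
      "kind": "PlacementDecision",
      "metadata": {"name": "demo-placement-decision-1", "namespace": "default"},
      "status": {"decisions": [
        {"clusterName": "worker01", "reason": ""},
        {"clusterName": "worker02", "reason": ""}
      ]}
    }
  ]
}
""")
print(selected_clusters(sample))  # ['worker01', 'worker02']
```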

How it works

The framework consists of three key components, including two custom resource definitions (CRDs):

DynamicScorer (CRD): Register your custom scoring logic as a custom resource. Define what metrics to collect (for example, CPU usage, network latency) and which scoring API should process them.
DynamicScoringConfig (CRD): Aggregate multiple DynamicScorers and distribute the configuration to managed clusters with ConfigMaps.
DynamicScoringAgent: Deployed in each managed cluster, this agent watches for scoring configurations, fetches metrics from the local Prometheus instance, calls your scoring API, and publishes results as AddOnPlacementScores.
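Conceptually, each agent cycle is a three-step pipeline: fetch metrics, call the scoring API, and publish the result as an AddOnPlacementScore. The following Python sketch only illustrates that flow. `run_scoring_cycle`, `build_addon_placement_score`, and the injected callables are hypothetical names, not the agent's real implementation. The object shape follows the sample AddOnPlacementScore output later in this article.

```python
from typing import Any, Callable, Dict, List

Score = Dict[str, Any]  # one entry in status.scores: {"name": str, "value": int}

def build_addon_placement_score(name: str, cluster: str, scores: List[Score]) -> Dict[str, Any]:
    """Build the AddOnPlacementScore object stored in the cluster's namespace on the hub."""
    return {
        "apiVersion": "cluster.open-cluster-management.io/v1alpha1",
        "kind": "AddOnPlacementScore",
        "metadata": {"name": name, "namespace": cluster},
        "status": {"scores": scores},
    }

def run_scoring_cycle(
    cluster: str,
    score_name: str,
    fetch_metrics: Callable[[], Any],           # e.g. a Prometheus range query
    call_scorer: Callable[[Any], List[Score]],  # e.g. an HTTP POST to the scoring API
    publish: Callable[[Dict[str, Any]], None],  # e.g. a status update on the hub
) -> Dict[str, Any]:
    """Run one hypothetical agent cycle and return the published object."""
    metrics = fetch_metrics()
    scores = call_scorer(metrics)
    score_obj = build_addon_placement_score(score_name, cluster, scores)
    publish(score_obj)
    return score_obj


# Example with stand-in callables instead of real Prometheus and HTTP clients.
published: List[Dict[str, Any]] = []
run_scoring_cycle(
    cluster="worker01",
    score_name="sample-my-score",
    fetch_metrics=lambda: [0.10, 0.20],
    call_scorer=lambda samples: [{"name": "cpu", "value": round(100 * (1 - max(samples)))}],
    publish=published.append,
)
print(published[0]["status"])  # {'scores': [{'name': 'cpu', 'value': 80}]}
```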

Here's the elegant part: you write the scoring API and its logic and configure the framework CRDs described above. The framework handles everything else, including metric collection, API calls, score updates, and integration with the Placement API.

Build your own scorer

The framework's extensibility is its superpower. Developers can build custom scorers tailored to their specific needs. For example:

Cost-based scoring: Query cloud provider APIs or cost metrics to score clusters by their expense per workload.
Predictive CPU scoring: Use machine learning models to predict CPU load increase over the next hour and score clusters inversely to expected load.
Latency-based scoring: Measure network latency from clusters to specific endpoints and score for proximity.

Here's a simplified example of what a scorer configuration looks like:

apiVersion: dynamic-scoring.open-cluster-management.io/v1alpha1
kind: DynamicScorer
metadata:
  name: cpu-predictor
  namespace: open-cluster-management
spec:
  name: cpu-predictor
  source:
    type: Prometheus
    endpoint:
      host: http://prometheus.monitoring.svc:9090
      path: /api/v1/query_range
    query:
      query: "rate(container_cpu_usage_seconds_total[5m])"
      range: 3600
      step: 60
  scoring:
    endpoint:
      host: http://cpu-predictor-api.default.svc:8000
      path: /scoring
    params:
      model: "linear_regression"
      horizon: 3600

Your scoring API receives time-series data from Prometheus and returns scores. The framework handles the rest, so there is no need to write Kubernetes controllers, manage state, or update AddOnPlacementScore resources manually.
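To make the contract concrete, here is a minimal Python sketch of a scoring API built only on the standard library. The request and response shapes are assumptions, not the framework's documented contract. The sketch expects a Prometheus-style range-query body (`data.result[].metric` and `data.result[].values`) and averages each series. It returns a hypothetical `results` list of `{name, value}` pairs, where each name joins the `node`, `namespace`, and `pod` labels with `;` to mirror the sample AddOnPlacementScore output later in this article. Check the project's sample scorer for the actual request and response format.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Any, Dict, List

def series_name(metric: Dict[str, str]) -> str:
    """Hypothetical naming scheme: join selected labels with ';'."""
    return ";".join(metric.get(label, "") for label in ("node", "namespace", "pod"))

def score_series(values: List[List[Any]]) -> int:
    """Score one series: lower average CPU usage -> higher score, clamped to 0..100."""
    samples = [float(v) for _, v in values]  # Prometheus encodes sample values as strings
    if not samples:
        return 0
    average = sum(samples) / len(samples)
    return max(0, min(100, round(100 * (1 - average))))

def score_matrix(result: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Turn a Prometheus matrix result into a list of {name, value} scores."""
    return [{"name": series_name(s["metric"]), "value": score_series(s["values"])} for s in result]

class ScoringHandler(BaseHTTPRequestHandler):
    """Handles POST /scoring, the path used in the DynamicScorer example."""

    def do_POST(self) -> None:
        if self.path != "/scoring":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        result = body.get("data", {}).get("result", [])
        payload = json.dumps({"results": score_matrix(result)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

def serve(port: int = 8000) -> None:
    """Serve the scoring API on the port used in the DynamicScorer example (8000)."""
    HTTPServer(("0.0.0.0", port), ScoringHandler).serve_forever()
```

Call `serve()` from the container entrypoint. For a quick local check, `score_matrix([{"metric": {"node": "node-1", "namespace": "namespace-a", "pod": "pod-x"}, "values": [[0, "0.1"], [60, "0.2"]]}])` returns `[{'name': 'node-1;namespace-a;pod-x', 'value': 85}]`.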

This means you can focus on the algorithm (how to score) rather than the plumbing (how to integrate with the cluster management system).

Zero-effort integration for existing tools

It gets better. Tools that already integrate with OCM's Placement API immediately benefit from dynamic scoring without any code changes.

Kueue: The Kubernetes-native job queueing system can use dynamic scores to make smarter multi-cluster scheduling decisions, automatically routing batch jobs to clusters with available capacity or lower costs.
ArgoCD: When deploying applications across multiple clusters, ArgoCD's ApplicationSet can leverage Placement decisions influenced by dynamic scores, ensuring deployments land on the most suitable clusters based on current conditions.

This is the power of building on top of well-designed APIs: extensibility without modification.

Getting started

Ready to try dynamic scoring in your multi-cluster environment? Here's a quick walkthrough:

You need an OCM hub cluster and one or more managed clusters. If you're just getting started with OCM, check out the quickstart guide.

1. Install the framework

Deploy the Dynamic Scoring Framework using Helm:

$ helm repo add ocm https://open-cluster-management.io/helm-charts
$ helm repo update
$ helm upgrade --install dynamic-scoring-framework \
  ocm/dynamic-scoring-framework \
  --namespace open-cluster-management \
  --create-namespace

2. Set up Prometheus (optional)

If you want to score based on Prometheus metrics, ensure that Prometheus is running in your managed clusters. The framework supports both local metrics (scraped in each cluster) and centralized collection (metrics aggregated in the hub).

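$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
$ helm repo update
$ helm install prometheus prometheus-community/prometheus \
  --namespace monitoring \
  --create-namespace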
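Before wiring Prometheus into a scorer, it can help to run the same kind of range query by hand. The Python sketch below builds a request URL from fields like those in the DynamicScorer example's `source` section. It uses Prometheus's standard `query_range` parameters (`query`, `start`, `end`, `step`). How the framework itself constructs the request is not covered here, so treat this mapping as an assumption. The service name and port depend on your Prometheus installation. With the prometheus-community chart and a release named `prometheus`, the server service is typically `prometheus-server` on port 80, so check `kubectl get svc -n monitoring`.

```python
import time
from typing import Optional
from urllib.parse import urlencode

def range_query_url(
    host: str,
    path: str,
    query: str,
    range_seconds: int,
    step: int,
    now: Optional[float] = None,
) -> str:
    """Build a Prometheus range-query URL covering the last `range_seconds` seconds."""
    end = int(time.time() if now is None else now)
    params = {"query": query, "start": end - range_seconds, "end": end, "step": step}
    return f"{host.rstrip('/')}{path}?{urlencode(params)}"


# Values taken from the DynamicScorer example; `now` is fixed only to make the URL reproducible.
url = range_query_url(
    "http://prometheus.monitoring.svc:9090",
    "/api/v1/query_range",
    "rate(container_cpu_usage_seconds_total[5m])",
    range_seconds=3600,
    step=60,
    now=7200,
)
print(url)
```

From anywhere that can reach Prometheus, such as a pod in the cluster or a `kubectl port-forward` session, `urllib.request.urlopen(url)` returns the JSON range-query response.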

3. Deploy a sample scorer

The project includes a sample scorer that demonstrates the scoring API contract. First, build the sample scorer image:

$ export SAMPLE_SCORER_IMAGE=quay.io/dynamic-scoring/sample-scorer:latest
$ podman build -t $SAMPLE_SCORER_IMAGE samples/sample-scorer

Then load the image into the managed cluster (worker01) and deploy the sample scorer there with a ManifestWork from the hub:

$ podman save -o sample-scorer.tar $SAMPLE_SCORER_IMAGE
$ kind load image-archive sample-scorer.tar --name worker01
$ CLUSTER_NAME=worker01 envsubst < samples/sample-scorer/manifestwork.yaml | \
  kubectl apply -f - --context kind-hub01

4. Register the scorer

Create a DynamicScorer resource to register your scoring API:

$ kubectl apply -f samples/mydynamicscorer-sample.yaml \
  -n open-cluster-management \
  --context kind-hub01

5. Activate scoring

Finally, create a DynamicScoringConfig to activate scoring in your managed clusters:

$ kubectl apply -f samples/mydynamicscoringconfig.yaml \
  -n open-cluster-management \
  --context kind-hub01

6. Verify scores

After a few moments, check the AddOnPlacementScores created in the managed cluster namespaces on the hub:

$ kubectl get addonplacementscores \
  -n worker01 --context kind-hub01

You should see scores calculated by your scoring API, ready to influence Placement decisions:

apiVersion: cluster.open-cluster-management.io/v1alpha1
kind: AddOnPlacementScore
metadata:
  name: sample-my-score
  namespace: worker01
status:
  scores:
  - name: node-1;namespace-a;pod-x
    value: 85
  - name: node-2;namespace-b;pod-y
    value: 92
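To check the scores from a script, you can parse the JSON output of `kubectl get addonplacementscore sample-my-score -n worker01 --context kind-hub01 -o json`. The Python sketch below is a hypothetical helper that maps score names to values. It rejects any value outside the -100 to 100 range that OCM expects for AddOnPlacementScore values.

```python
from typing import Any, Dict

def score_values(addon_placement_score: Dict[str, Any]) -> Dict[str, int]:
    """Map score names to values, rejecting values outside -100..100."""
    values: Dict[str, int] = {}
    for entry in addon_placement_score.get("status", {}).get("scores") or []:
        value = int(entry["value"])
        if not -100 <= value <= 100:
            raise ValueError(f"score {entry['name']!r} is out of range: {value}")
        values[entry["name"]] = value
    return values


# Same content as the sample output above, as parsed JSON.
score_obj = {
    "kind": "AddOnPlacementScore",
    "metadata": {"name": "sample-my-score", "namespace": "worker01"},
    "status": {"scores": [
        {"name": "node-1;namespace-a;pod-x", "value": 85},
        {"name": "node-2;namespace-b;pod-y", "value": 92},
    ]},
}
print(score_values(score_obj))
```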

These scores can now be referenced in Placement resources to drive scheduling decisions.
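A Placement can rank clusters by one of these scores through `prioritizerPolicy`. It does this with a `scoreCoordinate` of type `AddOn` that points at the AddOnPlacementScore resource name and score name. The Python sketch below builds such a Placement and prints it as JSON, which `kubectl apply -f -` accepts. The Placement name, namespace, and weight are hypothetical. The Placement's namespace also needs a ManagedClusterSetBinding before the Placement can select any clusters.

```python
import json
from typing import Any, Dict

def placement_with_addon_score(
    name: str,
    namespace: str,
    resource_name: str,
    score_name: str,
    number_of_clusters: int,
    weight: int = 1,
) -> Dict[str, Any]:
    """Build a Placement that ranks clusters by a single AddOnPlacementScore entry."""
    return {
        "apiVersion": "cluster.open-cluster-management.io/v1beta1",
        "kind": "Placement",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "numberOfClusters": number_of_clusters,
            "prioritizerPolicy": {
                # Exact: use only the prioritizers listed below.
                "mode": "Exact",
                "configurations": [
                    {
                        "scoreCoordinate": {
                            "type": "AddOn",
                            "addOn": {"resourceName": resource_name, "scoreName": score_name},
                        },
                        "weight": weight,
                    }
                ],
            },
        },
    }


placement = placement_with_addon_score(
    name="demo-placement",  # hypothetical
    namespace="default",    # hypothetical
    resource_name="sample-my-score",
    score_name="node-2;namespace-b;pod-y",
    number_of_clusters=1,
)
print(json.dumps(placement, indent=2))
```

Piping the output into `kubectl apply -f - --context kind-hub01` creates the Placement. In `Exact` mode, only the listed prioritizers are used. `Additive` mode also keeps the default prioritizers.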

Use cases and real-world applications

The Dynamic Scoring Framework opens up exciting possibilities:

Cost-optimized deployments: Automatically shift workloads to cheaper clusters during off-peak hours.
Predictive auto-scaling: Pre-emptively move workloads away from clusters predicted to experience high load.
Geo-aware routing: Score clusters by proximity to user traffic sources for lower latency.
Custom business logic: Implement any scoring algorithm that matters to your organization.

What's next?

The Dynamic Scoring Framework is part of the addon-contrib repository and continues to evolve with community contributions. Whether you're managing a handful of clusters or hundreds, dynamic scoring can make your multi-cluster scheduling smarter and more responsive to real-world conditions. Integrate your high-level workload orchestrator with the PlacementDecision and place your workloads intelligently across the fleet.


Start building your own scorers and let your metrics drive your multi-cluster scheduling decisions!

Like what you see? Ready to dive even deeper into a real-world use case? Check out this blog post from SoftBank R&D, the creators of Dynamic Scoring. SoftBank's engineers demonstrate a key use case that motivated them to enhance Open Cluster Management with the Dynamic Scoring add-on. This deeply technical post shows the power of the add-on and the motivations behind it, directly from the team that created it.
