
There might be occasions where over-provisioning your cluster provides great benefit. If there is a sudden burst of pods being created in a short amount of time, a new node may still be spinning up and not yet ready. This causes issues, as pods will be stuck in the Pending phase for about ten minutes while waiting for the new node. A solution is to over-provision your cluster. This keeps a spare node available at all times, or pre-emptively creates a new node when the workload starts to pick up.

Use cases

With this solution, pods are created that are considered low priority. When more important workloads are created, the low-priority pods are evicted from the nodes and left in a Pending state. This forces Red Hat OpenShift to spawn a new node to accommodate the pending pods, creating capacity for future high-priority pods without waiting for a new node.

Spare node

In this use case, the aim is to have a spare node available at all times. Typically, workloads arrive faster than a new node can be spawned. An example might be a large number of developers logging in to Red Hat OpenShift Dev Spaces at the beginning of their shifts (Figure 1).

Figure 1: Spare node workflow diagram.

At the start, there are only two worker nodes. With the spare pod solution, there can be an additional node at all times. This is done by controlling the number of low-priority pod replicas (see Controlling the number of replicas below).

When the real, high-priority workload starts, it forces the low-priority pods off the nodes. The low-priority pods are then stuck in Pending, as they can't be scheduled. This forces the autoscaler to spin up a new node, giving room for even more workload on the additional node.

Figure 2 shows a comparison of the two solutions.

Figure 2: A comparison of different solutions.

Pick up workflow

This workflow is designed to scale up the cluster when workloads start to come in. The workloads might arrive more slowly, but you would rather have a spare node ready than wait for a new node.

For example, let's say you run a batch job that processes foreign payments in parallel, with a different currency every hour, and the duration of each batch job is more than an hour. Initially, the low-priority pods are evicted and replaced by the first batch. An hour later, another batch is spawned, but because you already created a spare node, it can be scheduled successfully without waiting for a new node to spawn (Figure 3).

Figure 3: Pick up workflow diagram.

There are two nodes when the cluster starts. The nodes are then filled with low-priority pods. When the real, high-priority workload starts, it forces the low-priority pods off the nodes. The low-priority pods are then stuck in Pending, as they can't be scheduled. This forces the autoscaler to spin up a new node, giving room for even more workload on the additional node.

Prerequisites

  • This article assumes a basic knowledge of the OpenShift command-line interface oc.
  • A running instance of OpenShift 4 on Red Hat OpenShift Service on AWS (ROSA) and admin access to the OpenShift cluster.
  • Access to the OpenShift web console and access to your cluster on console.redhat.com.

Cluster setup

To set up the cluster, log in to console.redhat.com and navigate to the cluster (Figure 4).

Figure 4: Navigating to the cluster.

Go to the machine pool tab (Figure 5).

Figure 5: Selecting the machine pool tab.

Create a new machine pool (Figure 6), or choose a pre-existing machine pool and select scale (Figure 7).

Figure 6: Editing machine pool.
Figure 7: Enabling autoscaling.

Enable autoscaling on your machine pool.

If you use multiple machine pools, consider labelling the pool so that the deployments can be pinned to a specific machine pool, as shown below.
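
For example, assuming the machine pool nodes carry the label scale=true (the same label used by the nodeSelector in the low-priority deployment later in this article), you can confirm the label has propagated to the nodes:

 oc get nodes -l scale=true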

Create project

Now, create a new namespace on OpenShift:

oc new-project over-provision

Create a priority class

The priority class dictates how important a pod is compared to other pods. A lower priority causes the scheduler to evict those pods in order to schedule higher-priority pods. By default, most pods have a priority of 0.

Creating a new priority class allows the pods to be evicted when more important workloads are created.

Create a new priority class on your OpenShift cluster:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10
globalDefault: false
description: "Priority class used by overprovisioning."

Apply the priority class using the following:

 oc apply -f priority.yaml
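
To confirm the priority class was created, you can list it:

 oc get priorityclass overprovisioning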

Deploy the low priority pods

Create the low-priority pods using this deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
  namespace: over-provision
spec:
  replicas: 15
  selector:
    matchLabels:
      run: overprovisioning
  template:
    metadata:
      labels:
        run: overprovisioning
    spec:
      priorityClassName: overprovisioning
      terminationGracePeriodSeconds: 0
      nodeSelector:
        scale: "true"
      containers:
        - name: reserve-resources
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "350m"

Apply the deployment using:

 oc apply -f low-priority-pods.yaml
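
Once the deployment is applied, you can check that the low-priority pods are running and see which nodes they were scheduled on:

 oc get pods -n over-provision -o wide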

There are two items that should be changed for your use case: the number of replicas and the CPU requests.

The CPU requests value should be set to the average of your workload's CPU requests. This helps provide an estimate for an anticipated sudden load.

For example, if your sudden load peak is 10 pods, each consuming ~`350m` CPU, setting the number of replicas and CPU requests accordingly gives your cluster room for the sudden load peak.
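
As a rough sizing sketch (using the example numbers above, which you would replace with your own), 15 replicas requesting 350m each reserve roughly 5.25 cores of headroom. You can compare that against the allocatable CPU of the nodes in the machine pool, again assuming the scale=true label:

 oc get nodes -l scale=true -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.cpu}{"\n"}{end}'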

Controlling the number of replicas

This is a very important step: the number of replicas is key. There are two options to control it: manually, or using an autoscaler.

Manual

The easiest method is to manually control the number of replicas. It can be set to a value that covers a known sudden load, or to a value that forces the cluster to create a new node. This can be done in the deployment, or by increasing the number of replicas using the UI or CLI:

 oc scale deployment overprovisioning -n over-provision --replicas=20

Autoscaler

The other option is to use the cluster-proportional-autoscaler. The autoscaler dynamically controls the number of replicas, using criteria set in a config map called overprovisioning-autoscaler and the values set in the default-params section. This means that, within the set parameters, it scales accordingly. The autoscaler works in proportion to your cluster, looking at the number of CPU cores, the number of nodes, and so on.

kind: ServiceAccount
apiVersion: v1
metadata:
  name: cluster-proportional-autoscaler-example
  namespace: over-provision
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: cluster-proportional-autoscaler-example
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["list", "watch"]
  - apiGroups: [""]
    resources: ["replicationcontrollers/scale"]
    verbs: ["get", "update"]
  - apiGroups: ["extensions","apps"]
    resources: ["deployments/scale", "replicasets/scale"]
    verbs: ["get", "update"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "create"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: cluster-proportional-autoscaler-example
subjects:
  - kind: ServiceAccount
    name: cluster-proportional-autoscaler-example
    namespace: over-provision
roleRef:
  kind: ClusterRole
  name: cluster-proportional-autoscaler-example
  apiGroup: rbac.authorization.k8s.io

Create the cluster role, cluster role binding, and service account using:

 oc apply -f service-account.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning-autoscaler
  namespace: over-provision
  labels:
    app: overprovisioning-autoscaler
spec:
  selector:
    matchLabels:
      app: overprovisioning-autoscaler
  replicas: 1
  template:
    metadata:
      labels:
        app: overprovisioning-autoscaler
    spec:
      containers:
        - image: registry.k8s.io/cluster-proportional-autoscaler-amd64:1.8.1
          name: autoscaler
          command:
            - /cluster-proportional-autoscaler
            - --namespace=over-provision
            - --configmap=overprovisioning-autoscaler
            - --default-params={"linear":{"coresPerReplica":1}}
            - --target=deployment/overprovisioning
            - --logtostderr=true
            - --v=2
      serviceAccountName: cluster-proportional-autoscaler-example

Create the deployment using:

 oc apply -f deployment.yaml
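
You can verify that the autoscaler pod started and is watching the cluster by checking its logs:

 oc logs deployment/overprovisioning-autoscaler -n over-provision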

The autoscaler creates a config map called overprovisioning-autoscaler with the values set in the default-params section. There are multiple params that can be set, including a minimum and maximum number of replicas.

The coresPerReplica is an important value. Leaving it at the default of 1 would not allow the cluster autoscaler to scale down. It's best to increase this value above 1; a good starting place is 1.2. Combining it with a minimum value allows a spare node to be running at all times, if required.
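
For illustration, a config map using the linear mode might look like the following. The numbers here (a coresPerReplica of 1.2, a minimum of 2 replicas, and a maximum of 50) are example values to tune for your own cluster, not recommendations:

apiVersion: v1
kind: ConfigMap
metadata:
  name: overprovisioning-autoscaler
  namespace: over-provision
data:
  linear: |-
    {
      "coresPerReplica": 1.2,
      "min": 2,
      "max": 50
    }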

You can find more information and configuration on GitHub.

Testing

To test the scaling, create a high-priority deployment using the manifest below and apply it:

 oc apply -f high-priority-pod.yaml

Then increase the number of replicas using the UI or by running:

 oc scale --replicas=15 deployment high-priority-pod

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: high-priority-pod
  name: high-priority-pod
  namespace: over-provision
spec:
  replicas: 3
  selector:
    matchLabels:
      app: high-priority-pod
  template:
    metadata:
      labels:
        app: high-priority-pod
    spec:
      containers:
        - image: image-registry.openshift-image-registry.svc:5000/openshift/httpd
          name: httpd
          ### Not needed, purely for demo purposes
          resources:
            requests:
              cpu: "350m"

Summary

There are pros and cons to this solution. There will be an increase in cost, because more nodes will be running at any given time in exchange for having spare capacity ready to use.

It might not be for everyone. When the solution was tested, the decreased wait time for Red Hat OpenShift Dev Spaces was beneficial. It allowed users to log in and start working faster, rather than being forced to wait ten minutes for a new node to spin up and become ready. If your workload is peaky, for example users logging in at 9:00 AM, this solution might benefit your use case. It will require tailoring for your environment and use case, but will hopefully be beneficial.

You can find more information on GitHub.