Nodes in Red Hat OpenShift can be updated automatically through OpenShift's Machine Config Operator (MCO). A machine config is a custom resource that helps a cluster manage the complete life cycle of its nodes. When a machine config resource is created or updated in a cluster, the MCO picks up the update, performs the necessary changes to the selected nodes, and restarts the nodes gracefully by cordoning, draining, and rebooting them. The MCO handles everything ranging from the kernel to the kubelet.

However, interactions between the MCO and the GitOps workflow can introduce major performance issues and other undesired behavior. This article shows how to make the MCO and the Argo CD GitOps orchestration tool work well together.

Machine configs and Argo CD: Performance challenges

When using machine configs as part of a GitOps workflow, the following sequence can produce suboptimal performance:

  1. Argo CD starts a sync job after a commit to the Git repository containing application resources.
  2. If the sync applies a new or changed machine config, the MCO picks up the change and starts rebooting nodes while the sync operation is still in progress.
  3. If any of the rebooting nodes hosts the Argo CD application controller, the application controller terminates and the application sync is aborted.

Because the MCO reboots the nodes sequentially, and the Argo CD workloads can be rescheduled on each reboot onto a node that has not yet rebooted, the sync can be interrupted repeatedly and take a long time to complete. The cluster can also behave unpredictably until the MCO has rebooted all nodes affected by the machine configs within the sync.

Extend the application's manifest in Git

To avoid the problem described in the previous section, extend the application's manifest in Git with PreSync and PostSync hooks. Argo CD provides these hooks so that you can run operations of your choice before and after each sync (Figure 1). As the name suggests, a PreSync hook is a job that Argo CD executes right before the sync starts. Similarly, a PostSync hook executes after the sync has completed successfully.

Figure 1. Sync hook workflow.

We will use kam-blog as the sample application for this demo. We have generated this application following directions in the article Bootstrap GitOps with Red Hat OpenShift Pipelines and kam CLI.

Add sync hooks to Argo CD

Our PreSync job pauses the Machine Config Pool (MCP) so it does not reboot the nodes in order to apply the machine config changes. We ensure this pause by setting the flag .spec.paused to true.

To insert the PreSync job, create a file named pre-sync-job.yaml and add it to the same directory as the application. The content of the file is:

apiVersion: batch/v1
kind: Job
metadata:
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
  name: mcp-worker-pause-job
  namespace: openshift-gitops
spec:
  template:
    spec:
      containers:
        - image: registry.redhat.io/openshift4/ose-cli:v4.4
          command:
            - /bin/bash
            - -c
            - |
              # Exit on the first error so a failed patch fails the hook.
              set -e
              echo "Pausing the MCP $MCP."
              oc patch --type=merge --patch='{"spec":{"paused":true}}' machineconfigpool/$MCP
              sleep $SLEEP
              echo "DONE"
          imagePullPolicy: IfNotPresent
          name: mcp-worker-pause-job
          env:
          - name: SLEEP
            value: "10"
          - name: MCP
            value: worker
      restartPolicy: Never
      serviceAccountName: sync-job-sa

The PostSync hook resumes the MCP so that it reboots the nodes, applying the queued or incoming machine config changes. Enable this behavior by setting the flag .spec.paused to false. To insert the PostSync job, create a file named post-sync-job.yaml and add it to the same directory as the application. The content of the file is:

apiVersion: batch/v1
kind: Job
metadata:
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
  name: mcp-worker-resume-job
  namespace: openshift-gitops
spec:
  template:
    spec:
      containers:
        - image: registry.redhat.io/openshift4/ose-cli:v4.4
          command:
            - /bin/bash
            - -c
            - |
              # Exit on the first error so a failed patch fails the hook.
              set -e
              echo "Waiting $SLEEP seconds before resuming the MCP $MCP."
              sleep $SLEEP
              oc patch --type=merge --patch='{"spec":{"paused":false}}' machineconfigpool/$MCP
              echo "DONE"
          imagePullPolicy: Always
          name: mcp-worker-resume-job
          env:
          - name: SLEEP
            value: "5"
          - name: MCP
            value: worker
      dnsPolicy: ClusterFirst
      restartPolicy: OnFailure
      serviceAccountName: sync-job-sa
      terminationGracePeriodSeconds: 30

Add permissions for sync hooks

For these jobs to execute successfully, they need permission to manipulate Machine Config Pool resources in the cluster. Grant these permissions through a ServiceAccount bound to an appropriate ClusterRole by a ClusterRoleBinding.

To add the ServiceAccount, ClusterRole, and ClusterRoleBinding resources, create a file named sync-job-cluster-rbac.yaml and add it to the same directory as the application. The content is:

apiVersion: v1
kind: ServiceAccount
metadata:
  annotations: {}
  name: sync-job-sa
  namespace: openshift-gitops
---

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: sync-job-sa-role
rules:
  - apiGroups:
      - apiextensions.k8s.io
      - machineconfiguration.openshift.io
    resources:
      - machineconfigpools
    verbs:
      - get
      - list
      - patch
---

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: sync-job-sa-rolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: sync-job-sa-role
subjects:
  - kind: ServiceAccount
    name: sync-job-sa
    namespace: openshift-gitops

You can now apply the configuration to the cluster using the following command:

$ kubectl apply -k config/argocd
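
Before syncing, it can be worth confirming that the service account really has the verbs granted by the ClusterRole. The following is a minimal sketch that relies on oc auth can-i with impersonation. The function names sa_can and check_sync_job_rbac are hypothetical helpers, not part of oc, and the default service account is the one defined in this article. Running it requires a user who is allowed to impersonate service accounts.

```shell
# Hypothetical helper: succeeds when the service account may perform
# the given verb on Machine Config Pools.
# $1 = verb, $2 = impersonated user (defaults to the article's service account).
sa_can() {
  [ "$(oc auth can-i "$1" machineconfigpools.machineconfiguration.openshift.io \
        --as="${2:-system:serviceaccount:openshift-gitops:sync-job-sa}" 2>/dev/null)" = "yes" ]
}

# Hypothetical helper: checks every verb listed in the ClusterRole
# and stops at the first one that is denied.
check_sync_job_rbac() {
  for verb in get list patch; do
    if sa_can "$verb"; then
      echo "$verb: allowed"
    else
      echo "$verb: DENIED"
      return 1
    fi
  done
}
```

Calling check_sync_job_rbac prints one line per verb and returns a non-zero status if any of them is denied, which usually points to a missing or misnamed ClusterRoleBinding.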

After you have applied the configuration, try manually syncing the application. You should see that the PreSync and PostSync jobs have paused and unpaused the MCP as shown in Figure 2.

Figure 2. The OpenShift user interface shows the actions of the PreSync and PostSync hooks.

You can also see that the MCP is paused by examining its details (Figure 3).

Figure 3. Machine Config Pool details show that it is paused.
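
The same information is available from the command line by reading the .spec.paused flag. The sketch below wraps that lookup in a small function. The name mcp_is_paused is a hypothetical helper, and the pool defaults to worker as in this article.

```shell
# Hypothetical helper: succeeds when the given Machine Config Pool is paused.
# $1 = pool name (defaults to "worker").
mcp_is_paused() {
  # .spec.paused can be unset on a pool that was never paused;
  # an empty result is treated as "not paused".
  [ "$(oc get machineconfigpool "${1:-worker}" -o jsonpath='{.spec.paused}')" = "true" ]
}

# Example usage:
#   mcp_is_paused worker && echo "worker pool is paused"
```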

Once the sync job finishes, the PostSync job unpauses the MCP and resumes all the updates to the nodes in the cluster. The MCP details show this change as well (Figure 4).

Figure 4. Machine Config Pool details show that it is unpaused.

If the sync fails for any reason, the MCP will stay paused and won't update the nodes. To resume MCP updates, you have to manually update the MCP and set the flag .spec.paused to false. You can set the flag using the following command:

$ oc patch --type=merge --patch='{"spec":{"paused":false}}' machineconfigpool/worker
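
If you resume the pool by hand, you may also want to wait until the nodes have finished updating before retrying the sync. The following sketch combines the patch above with oc wait on the pool's Updated condition. The function name resume_mcp and the 30-minute default timeout are assumptions chosen for illustration. Depending on timing, the condition might still reflect the previous state for a moment after the pool is unpaused, so treat the result as a hint rather than a guarantee.

```shell
# Hypothetical helper: unpauses a Machine Config Pool and waits for it to update.
# $1 = pool name (defaults to "worker"), $2 = timeout (defaults to "30m").
resume_mcp() {
  pool="${1:-worker}"
  timeout="${2:-30m}"
  # Same patch as the PostSync hook; stop if it fails.
  oc patch --type=merge --patch='{"spec":{"paused":false}}' "machineconfigpool/$pool" || return 1
  # Block until the pool reports the Updated condition or the timeout expires.
  oc wait "machineconfigpool/$pool" --for=condition=Updated --timeout="$timeout"
}

# Example usage:
#   resume_mcp worker 45m
```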

Conclusion

Updates to machine configs can lead to uncontrolled node reboots, termination of the sync process, and unanticipated issues in the application. The workaround in this article helps to prevent nodes from rebooting while the critical Argo CD sync operations are in progress.