This article discusses etcd backups for Red Hat OpenShift 4.x clusters in hybrid scenarios. Backing up etcd is a crucial activity for disaster recovery and node failure: etcd is the primary datastore of Kubernetes, so its backups are what allow you to recover the state of the master nodes and of the cluster as a whole. It is recommended to store the backup externally, so that it remains accessible for node restoration even if the nodes themselves, or access to them, become unavailable.
When to back up
Ideally, you should back up the cluster's etcd data regularly and store it in a secure location outside the OpenShift cluster. After a new OpenShift cluster is created, the first certificate rotation happens 24 hours after installation; do not take an etcd backup before this rotation completes, as the backup would contain expired certificates. It is also recommended to take etcd backups during non-peak hours, because an etcd snapshot has a high I/O cost. Finally, be sure to take an etcd backup before and after any cluster upgrade.
How to back up
In an OpenShift cluster, an automated backup script for the etcd database is already provided on the master nodes at /usr/local/bin/cluster-backup.sh. To access it, you need to start a debug session with the OpenShift CLI:
oc debug node/<master_node_name>
This logs you in to the master node, where running the script creates a backup in the specified folder. In the following sections, we explain how to automate this process using a CronJob. The CronJob runs on the OpenShift cluster itself and backs up this data from the master nodes on a schedule. Make sure the backups created on the master nodes are cleaned up daily so they do not fill the disk.
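For reference, a one-off manual backup might look like the following minimal sketch; the node name is a placeholder, and the /home/core/backup target directory simply follows the convention used later in this article:
# Open a debug shell on one master node (replace the placeholder with a real node name)
oc debug node/<master_node_name>
# Inside the debug shell, run the backup script against the host filesystem
chroot /host /usr/local/bin/cluster-backup.sh /home/core/backup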
Where to store the backup?
The backup can be stored in any storage outside the cluster, as long as it is reachable from the cluster. In this article we explore storing the etcd backup in Cloud Object Storage, such as S3. It can similarly be stored in the object stores of other clouds, or in NFS and other file shares available on those clouds.
Execution
The next section details the steps required to store the etcd backup on IBM Cloud Object Storage.
Prerequisites
- You have access to the cluster as a user with the cluster-admin role.
- You have created an S3 bucket that is accessible from the cluster.
We will create the following in the OpenShift cluster:
- Namespace.
- Service account.
- Cluster role.
- Cluster role binding.
- AWS S3 secret.
- CronJob.
You can create the namespace from the console or from the OpenShift CLI.
To schedule the etcd backup as a daily CronJob, it is important to create a dedicated namespace. Also, make sure that only cluster-admins have access to this namespace; other team members do not need access to it. See below:
oc new-project etcd-bkp --description "Openshift ETCD Backup" --display-name "ETCD Backup to S3"
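If you want to confirm that no extra users have been granted access to the new namespace, a quick (non-exhaustive) check is to list its role bindings:
oc get rolebindings -n etcd-bkp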
Service account
We will create a service account to run the etcd backup CronJob:
kind: ServiceAccount
apiVersion: v1
metadata:
  name: cronjob-etcd-bkp-sa
  namespace: etcd-bkp
  labels:
    app: cronjob-etcd-backup
oc apply -f service_account.yaml
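Optionally, verify that the service account was created before moving on:
oc get sa cronjob-etcd-bkp-sa -n etcd-bkp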
Cluster role
A cluster role is required to run the pod with the proper privileges. The following YAML creates it:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cronjob-etcd-bkp-cr
rules:
  - apiGroups: [""]
    resources:
      - "nodes"
    verbs: ["get","list"]
  - apiGroups: [""]
    resources:
      - "pods"
      - "pods/log"
    verbs: ["get","list","create","delete","watch"]
oc apply -f cluster_role.yaml
Cluster role binding
After creating the role, we need to bind it to the service account we just created. Here is the YAML file to create the cluster role binding:
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: cronjob-etcd-bkp-crb
  labels:
    app: cronjob-etcd-backup
subjects:
  - kind: ServiceAccount
    name: cronjob-etcd-bkp-sa
    namespace: etcd-bkp
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cronjob-etcd-bkp-cr
oc apply -f cluster_role_binding.yaml
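To verify that the binding grants the expected permissions, you can impersonate the service account with oc auth can-i (a quick sanity check, not a required step):
oc auth can-i list nodes --as=system:serviceaccount:etcd-bkp:cronjob-etcd-bkp-sa
oc auth can-i create pods --as=system:serviceaccount:etcd-bkp:cronjob-etcd-bkp-sa -n etcd-bkp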
AWS S3 secret
We need to store the AWS access key ID and AWS secret access key for the S3 bucket in a secret. This secret is referenced in the CronJob to access the S3 bucket. Below is a sample manifest for creating it:
apiVersion: v1
kind: Secret
metadata:
  name: aws-s3-etcd-key
  namespace: etcd-bkp
type: Opaque
data:
  aws_access_key_id: <key_id | base64>
  aws_secret_access_key: <access_key | base64>
  region: <bucket_region | base64>
oc apply -f s3_secret.yaml
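If you prefer not to base64-encode the values by hand, an equivalent approach is to let the CLI create the secret from literals; the placeholder values below are yours to fill in:
oc create secret generic aws-s3-etcd-key -n etcd-bkp \
  --from-literal=aws_access_key_id=<key_id> \
  --from-literal=aws_secret_access_key=<access_key> \
  --from-literal=region=<bucket_region>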
CronJob
We can run the CronJob in two ways.
The first way is to schedule the CronJob so that the job runs on a master node, takes the backup on the node itself, and then pushes it to the S3 bucket. Let's explore the CronJob provided below.
In this CronJob, the task runs on a master node because of the node selector:
spec:
  nodeSelector:
    node-role.kubernetes.io/master: ''
It uses the CLI image to initiate the backup:
image: registry.redhat.io/openshift4/ose-cli
It then invokes the backup script:
chroot /host /usr/local/bin/cluster-backup.sh
It creates the backup at /home/core/backup with the date appended to the name:
chroot /host /usr/local/bin/cluster-backup.sh /home/core/backup/$(date "+%F_%H%M%S")
It cleans up older backups:
chroot /host find /home/core/backup/ -mindepth 1 -type d -mmin +2 -exec rm -rf {} \;
Finally, it pushes the backup to the AWS S3 bucket using the AWS CLI image:
aws s3 cp /host/home/core/backup/ s3://ocp-etcd-sync --recursive
kind: CronJob
apiVersion: batch/v1
metadata:
  name: cronjob-etcd-backup
  namespace: etcd-bkp
  labels:
    app.kubernetes.io/name: cronjob-etcd-backup
spec:
  schedule: "* * * * *" # runs every minute; adjust to your desired backup window
  concurrencyPolicy: Forbid
  suspend: false
  jobTemplate:
    metadata:
      labels:
        app.kubernetes.io/name: cronjob-etcd-backup
    spec:
      backoffLimit: 0
      template:
        metadata:
          labels:
            app.kubernetes.io/name: cronjob-etcd-backup
        spec:
          nodeSelector:
            node-role.kubernetes.io/master: ''
          restartPolicy: Never
          activeDeadlineSeconds: 500
          serviceAccountName: cronjob-etcd-bkp-sa
          hostPID: true
          hostNetwork: true
          enableServiceLinks: true
          schedulerName: default-scheduler
          terminationGracePeriodSeconds: 30
          securityContext: {}
          containers:
            - name: cronjob-etcd-backup
              image: registry.redhat.io/openshift4/ose-cli
              terminationMessagePath: /dev/termination-log
              command:
                - /bin/bash
                - '-c'
                - >-
                  echo -e '\n\n---\nCreate etcd backup local to master\n' &&
                  chroot /host /usr/local/bin/cluster-backup.sh /home/core/backup/$(date "+%F_%H%M%S") &&
                  echo -e '\n\n---\nCleanup old local etcd backups\n' &&
                  chroot /host find /home/core/backup/ -mindepth 1 -type d -mmin +2 -exec rm -rf {} \;
              securityContext:
                privileged: true
                runAsUser: 0
                capabilities:
                  add:
                    - SYS_CHROOT
              imagePullPolicy: Always
              volumeMounts:
                - name: host
                  mountPath: /host
              terminationMessagePolicy: File
            - name: aws-cli
              image: amazon/aws-cli:latest
              command:
                - /bin/bash
                - '-c'
                - >-
                  while true; do if [[ -n $(find /host/home/core/backup/ -type d -cmin -1) ]]; then aws s3 cp /host/home/core/backup/ s3://ocp-etcd-sync --recursive; break; fi; done
              env:
                - name: AWS_ACCESS_KEY_ID
                  valueFrom:
                    secretKeyRef:
                      name: aws-s3-etcd-key
                      key: aws_access_key_id
                - name: AWS_SECRET_ACCESS_KEY
                  valueFrom:
                    secretKeyRef:
                      name: aws-s3-etcd-key
                      key: aws_secret_access_key
                - name: AWS_DEFAULT_REGION
                  valueFrom:
                    secretKeyRef:
                      name: aws-s3-etcd-key
                      key: region
              volumeMounts:
                - name: host
                  mountPath: /host
          volumes:
            - name: host
              hostPath:
                path: /
                type: Directory
          dnsPolicy: ClusterFirst
          tolerations:
            - key: node-role.kubernetes.io/master
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 5
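Save the manifest and apply it the same way as the earlier resources (the filename below is arbitrary), then confirm that the CronJob and its jobs are being created:
oc apply -f etcd_backup_cronjob.yaml
oc get cronjob,jobs -n etcd-bkp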
Another way to initiate the backup is to schedule the CronJob on a worker node; the job then reaches out to each master node to take the backup there. Below is an example that goes to every master node and writes the backup on that node. The backup can subsequently be moved to S3 or to other file storage such as NFS, or to volumes that are themselves backed up.
This backup is scheduled to run every 12 hours. During the backup process, the CronJob also deletes older backups that are no longer required, to avoid filling up storage. It uses the image registry.redhat.io/openshift4/ose-cli; alternatively, you can build your own image using the Red Hat Universal Base Image (UBI) as the base and installing the oc CLI on top of it.
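If you take the custom-image route, a minimal Containerfile could look like the sketch below. The UBI tag, the client mirror URL, and the package choices are assumptions to pin and review for your environment:
# Hypothetical Containerfile: UBI minimal base with the oc CLI installed
FROM registry.access.redhat.com/ubi9/ubi-minimal:latest
# tar and gzip are needed to unpack the client archive; curl ships with the base image
RUN microdnf install -y tar gzip && \
    curl -L https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable/openshift-client-linux.tar.gz \
      | tar -xzf - -C /usr/local/bin oc kubectl && \
    microdnf clean all
ENTRYPOINT ["/bin/bash"]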
The job below begins the backup at /home/core/backup/ on the master nodes:
---
kind: CronJob
apiVersion: batch/v1
metadata:
  name: cronjob-etcd-backup
  namespace: etcd-bkp
  labels:
    app: ocp-etcd-bkp
spec:
  concurrencyPolicy: Forbid
  schedule: "0 */12 * * *"
  failedJobsHistoryLimit: 5
  successfulJobsHistoryLimit: 5
  jobTemplate:
    metadata:
      labels:
        app: ocp-etcd-bkp
    spec:
      backoffLimit: 0
      template:
        metadata:
          labels:
            app: ocp-etcd-bkp
        spec:
          containers:
            - name: etcd-backup
              image: "registry.redhat.io/openshift4/ose-cli"
              command:
                - "/bin/bash"
                - "-c"
                - oc get no -l node-role.kubernetes.io/master --no-headers -o name | xargs -I {} -- oc debug {} --to-namespace=etcd-bkp -- bash -c 'chroot /host sudo -E /usr/local/bin/cluster-backup.sh /home/core/backup/ && chroot /host sudo -E find /home/core/backup/ -type f -mmin +"1" -delete'
          restartPolicy: Never # required for Job pods; the default "Always" is rejected by the API
          serviceAccountName: "cronjob-etcd-bkp-sa"
          serviceAccount: "cronjob-etcd-bkp-sa"
This job differs from the previous one: it runs on a worker node, lists all the master nodes, and uses oc debug to log in to each of them and start the backup one by one, using the following command:
oc get no -l node-role.kubernetes.io/master --no-headers -o name | xargs -I {} -- oc debug {}
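After either CronJob is applied, you do not have to wait for the schedule to confirm that it works. You can trigger a run manually and follow its pods and logs; the job name below is arbitrary:
oc create job --from=cronjob/cronjob-etcd-backup etcd-backup-manual -n etcd-bkp
oc get pods -n etcd-bkp -w
oc logs job/etcd-backup-manual -n etcd-bkp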