Operator SDK tips and tricks

Kubernetes Operators are all the rage this season, and the fame is well deserved. Operators are evolving from being used primarily by technical-infrastructure gurus to becoming more mainstream, Kubernetes-native tools for managing complex applications. Kubernetes Operators today are important for cluster administrators and ISV providers, and also for custom applications developed in house. They provide the base for a standardized operational model that is similar to what cloud providers offer. Operators also open the door to fully portable workloads and services on Kubernetes.

The new Kubernetes Operator Framework is an open source toolkit that lets you manage Kubernetes Operators in an effective, automated, and scalable way. The Operator Framework consists of three components: the Operator SDK, the Operator Lifecycle Manager, and OperatorHub. In this article, I introduce tips and tricks for working with the Operator SDK. The Operator SDK 1.0.0 release shipped in mid-August, so it's a great time to have a look at it.

Note: Started by CoreOS and pursued by Red Hat for the past year, the Operator Framework initiative entered incubation with the Cloud Native Computing Foundation in July 2020.

Exploring the Kubernetes Operator SDK

I took advantage of the summer holidays to explore the new Operator SDK 1.0.0 release. For my experimentation, I developed Operators using Helm, Ansible, and Go, and deployed them on both vanilla Kubernetes and Red Hat OpenShift. These are the three options that the Operator SDK supports, and they cover a range of capabilities, from simple to very sophisticated Operators. Of course, you can use other technologies to develop your Operator as well, like Python or Quarkus. I found good resources to guide me (namely, the 'Hello, World' tutorial with Kubernetes Operators, Operator best practices, and Kubernetes Operators best practices for Go), but I am not that familiar with Go or Ansible, so I scratched my head a lot. The tips I'm sharing are all things that I wish I had known before I started. I hope that they will also help you.

Note: All of the code examples and resources we'll use are available in the GitHub repository for this article.

Tip 1: Handling default CRD values

Every Kubernetes Operator comes with its own custom resource definition (CRD), which is the grammar used to describe high-level resource specifications in a Kubernetes cluster. From a first-time user perspective, a simpler CRD is better; however, experienced users will appreciate the advanced tweaking options. Handling default values for all of your custom resource instances is crucial for keeping things simple and configurable, but each tool does it a little differently.

As an example, let's say that we want to deploy an application made of two components: a web application and a database. First-time users would deploy it using a simple custom resource like the one below:

apiVersion: redhat.com/v1beta1
kind: FruitsCatalog
metadata:
  name: fruitscatalog-sample
spec:
  appName: my-fruits-catalog

We will also need advanced options for the number of replicas, persistent storage, ingress, and so on.

Custom resource default values with Helm

A Helm chart defines a values.yaml file for handling custom resource default values. Using the Helm-based Operator SDK, it's pretty easy to add consistent values to our example file:

# Default values for fruitscatalog.
appName: fruits-catalog-helm
webapp:
  replicaCount: 1
  image: quay.io/lbroudoux/fruits-catalog:latest
  [...]
mongodb:
  install: true
  image: centos/mongodb-34-centos7:latest
  persistent: true
  volumeSize: 2Gi
  [...]

Custom resource default values with Ansible

The Ansible-based Operator SDK does not provide an out-of-the-box way to handle custom resource default values. The trick I've found requires that you make three modifications to your Operator project.

First, create a roles/fruitscatalog/defaults/main.yml file for handling default values. Be aware of Ansible's usage of snake case, which is different from the camel case normally used for custom resource attributes. As an example, Ansible transforms replicaCount into replica_count, so you have to use this form in your Operator:

---
# defaults file for fruitscatalog
name: fruits-catalog-ansible
webapp:
  replica_count: 1
  image: quay.io/lbroudoux/fruits-catalog:latest
  [...]
mongodb:
  install: true
  image: centos/mongodb-34-centos7:latest
  persistent: true
  volume_size: 2Gi
  [...]
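
To make the naming rule concrete, here is a minimal Go sketch of a camelCase to snake_case conversion. It only illustrates the convention described above; it is not the SDK's own conversion code, and the toSnakeCase helper is a hypothetical name introduced for this example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// toSnakeCase converts a camelCase key such as "replicaCount" into
// "replica_count". Illustrative only: it inserts an underscore before
// every uppercase letter, so acronyms (for example "HTTPPort") are not
// handled the way a production converter would handle them.
func toSnakeCase(key string) string {
	var b strings.Builder
	for i, r := range key {
		if unicode.IsUpper(r) {
			if i > 0 {
				b.WriteByte('_')
			}
			b.WriteRune(unicode.ToLower(r))
			continue
		}
		b.WriteRune(r)
	}
	return b.String()
}

func main() {
	// Attribute names taken from the custom resource in this article.
	for _, key := range []string{"replicaCount", "volumeSize", "image"} {
		fmt.Printf("%s -> %s\n", key, toSnakeCase(key))
	}
}
```

Keys that are already lowercase, such as image, pass through unchanged, which is why only some attributes differ between the Helm and Ansible defaults files.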

Once the defaults file is present in your role, the Operator SDK uses it to initialize the parts that are missing from the user-supplied custom resource. The limitation of this approach is that the SDK performs only a first-level merge. If a user puts only webapp.replicaCount into the custom resource, the other default child attributes are not merged into the webapp variable. You therefore have to handle the merge explicitly, using Ansible's combine() filter.

So, at the very beginning of the role, we need to build a complete resource from what the user provided, merged with the defaults:

- name: Load default values from defaults/main.yml
  include_vars:
    file: ../defaults/main.yml
    name: default_cr

- name: Complete Custom Resource spec with default values
  set_fact:
    webapp_full: "{{ default_cr.webapp|combine(webapp, recursive=True) }}"
    mongodb_full: "{{ default_cr.mongodb|combine(mongodb, recursive=True) }}"

The trick here is that the webapp and mongodb variables initialized by the SDK cannot be overwritten; you have to create new variables such as webapp_full and mongodb_full and base your Ansible templates on the latter. What's nice is that this approach is fully functional when running your Kubernetes Operator locally using make run or ansible-operator run.
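
To see why a first-level merge is not enough, here is a minimal Go sketch of a recursive dictionary merge, similar in spirit to what combine() with recursive=True does in the role. It is an illustration only, not code from Ansible or the SDK; deepMerge and the sample values are hypothetical.

```go
package main

import "fmt"

// deepMerge returns a new map holding the defaults overridden by the
// user-supplied values. Nested maps are merged recursively; any other
// value in overrides simply replaces the default.
func deepMerge(defaults, overrides map[string]interface{}) map[string]interface{} {
	merged := make(map[string]interface{}, len(defaults))
	for k, v := range defaults {
		merged[k] = v
	}
	for k, v := range overrides {
		defMap, defOK := merged[k].(map[string]interface{})
		ovrMap, ovrOK := v.(map[string]interface{})
		if defOK && ovrOK {
			merged[k] = deepMerge(defMap, ovrMap)
			continue
		}
		merged[k] = v
	}
	return merged
}

func main() {
	defaults := map[string]interface{}{
		"webapp": map[string]interface{}{
			"replica_count": 1,
			"image":         "quay.io/lbroudoux/fruits-catalog:latest",
		},
	}
	// The user only sets the replica count in the custom resource.
	user := map[string]interface{}{
		"webapp": map[string]interface{}{"replica_count": 3},
	}

	webapp := deepMerge(defaults, user)["webapp"].(map[string]interface{})
	fmt.Println("replica_count:", webapp["replica_count"])
	fmt.Println("image:", webapp["image"])
}
```

With a first-level merge, the user's webapp block would replace the default one entirely and the image key would be lost; the recursive version keeps it.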

Custom resource default values with Go

The Go-based Operator SDK also requires its own approach. You can define an initialization method in the controller (as described in Kubernetes Operators best practices), but I believe there's a better way of handling it.
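
For reference, the initialization-method approach could look roughly like the following sketch. It is a simplified, self-contained illustration: the WebAppSpec struct is a trimmed-down stand-in for the real type, and applyDefaults is a hypothetical helper name, not something the SDK generates.

```go
package main

import "fmt"

// WebAppSpec is a trimmed-down, hypothetical version of the spec used
// in this article.
type WebAppSpec struct {
	ReplicaCount int32
	Image        string
}

// applyDefaults fills in unset fields and reports whether anything
// changed, so that a caller could decide to persist the updated custom
// resource. A zero ReplicaCount is treated as "unset" here.
func applyDefaults(spec *WebAppSpec) bool {
	changed := false
	if spec.ReplicaCount == 0 {
		spec.ReplicaCount = 1
		changed = true
	}
	if spec.Image == "" {
		spec.Image = "quay.io/lbroudoux/fruits-catalog:latest"
		changed = true
	}
	return changed
}

func main() {
	// The user only provided a replica count.
	spec := WebAppSpec{ReplicaCount: 3}
	changed := applyDefaults(&spec)
	fmt.Printf("changed=%t spec=%+v\n", changed, spec)
}
```

One drawback of this style is visible in the sketch: with plain value types, a field explicitly set to its zero value cannot be told apart from a field that was never set.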

Using the Kubernetes apiextensions.k8s.io/v1 API, it is now possible to define default values directly within the CRD. In Helm and Ansible, you can complete the OpenAPI part of the CRD manually. For a Go-based Operator, you can use the +kubebuilder comments in your Go code:

// WebAppSpec defines the desired state of WebApp
// +k8s:openapi-gen=true
type WebAppSpec struct {
    // +kubebuilder:default:=1
    ReplicaCount int32 `json:"replicaCount,omitempty"`
    // +kubebuilder:default:="quay.io/lbroudoux/fruits-catalog:latest"
    Image   string      `json:"image,omitempty"`
    [...]
}

To enable this option, you have to tweak the project's Makefile to force the SDK to generate apiextensions.k8s.io/v1 manifests:

CRD_OPTIONS ?= "crd:trivialVersions=true,crdVersions=v1"

Running the make manifests command in your project generates a full CRD with default values for future custom resource instances:

---
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  annotations:
    controller-gen.kubebuilder.io/version: v0.3.0
  creationTimestamp: null
  name: fruitscatalogs.redhat.com
spec:
  group: redhat.com
  names:
    kind: FruitsCatalog
    listKind: FruitsCatalogList
    plural: fruitscatalogs
    singular: fruitscatalog
  scope: Namespaced
  versions:
  - name: v1beta1
    schema:
      openAPIV3Schema:
        [...]
          [...]
          spec:
            description: FruitsCatalogSpec defines the desired state of FruitsCatalog
            properties:
              [...]
              webapp:
                description: WebAppSpec defines the desired state of WebApp
                properties:
                  image:
                    default: quay.io/lbroudoux/fruits-catalog:latest
                    type: string
                  replicaCount:
                    format: int32
                    default: 1
                    type: integer
                  [...]

That's pretty neat.

Tip 2: Preparing your Operator for OpenShift

One nice thing about the Operator SDK is that it scaffolds a huge part of your project from operator-sdk init or operator-sdk create api. That scaffold is much of what you need to deploy your Operator to OpenShift, but it's not everything. During my experiments, I found one missing piece, which is related to role-based access control (RBAC) permissions. Essentially, the Operator should be granted just enough permissions to do its job without having full access to the cluster.

When generating Kubernetes resources, an Operator should try to register itself as the owner of the resource. That makes it easier to watch the resource and implement finalizers. Typically, the Operator can include an ownerReferences field that references the custom resource it was created for:

ownerReferences:
  - apiVersion: redhat.com/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: FruitsCatalog
    name: fruitscatalog-sample
    uid: c5d7e996-013f-40ca-bd19-14ba73728eaf

The default scaffolding works well on vanilla Kubernetes. But on OpenShift, the Operator needs to be able to set finalizers on the custom resource after it's been created in order to set the ownerReference block. So now you have to add the extra permissions for your Operator as described below.

Adding RBAC permissions with Helm and Ansible

Using Helm and Ansible-based Operators, you can configure the RBAC permissions within the config/rbac/role.yaml file. You would typically add something like this:

- apiGroups:
  - redhat.com
  resources:
  - fruitscatalogs
  - fruitscatalogs/status
  - fruitscatalogs/finalizers  # Missing line that is not added by the SDK
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch

Adding RBAC permissions with Go

Using Go-based Operators, you can use a +kubebuilder:rbac comment to set the RBAC permissions directly in the controller source code. Just add something like this to your Reconcile function comments:

[...]
// +kubebuilder:rbac:groups=redhat.com,resources=fruitscatalogs/finalizers,verbs=get;create;update;patch;delete

// Reconcile the state for a FruitsCatalog object and makes changes based on the state read and what is in the FruitsCatalogSpec.
func (r *FruitsCatalogG1Reconciler) Reconcile(req ctrl.Request) (ctrl.Result, error) {
	[...]
}

Note: These permissions might be added by default in a future release. See Pull Request #3779: Helm Operator: add finalizers permission for created APIs for details and tracking.

Tip 3: Discovering the cluster you're running on

Operators are expected to be adaptable, which means that they have to be able to change their actions and the resources they manage depending on the environment. A straightforward illustration is using route capabilities instead of ingress when running on OpenShift. To make that change, a Kubernetes Operator should be able to discover the Kubernetes distribution that it is deployed on and any extensions that have been installed on the cluster. Currently, such advanced discovery can only be done using the Ansible- and Go-based Operators.

Advanced discovery with Ansible

In Ansible-based Operators, we use a k8s lookup requesting the api_groups present on the cluster. Then, we should be able to detect that we're running on OpenShift and create a Route only when appropriate:

- name: Get information about the cluster
  set_fact:
    api_groups: "{{ lookup('k8s', cluster_info='api_groups') }}"

[...]

- name: The Webapp Route is present if OpenShift
  when: "'route.openshift.io' in api_groups"
  k8s:
    state: present
    definition: "{{ lookup('template', 'webapp-route.yml') | from_yaml  }}"

Advanced discovery with Go

This type of discovery is a little more complex using Go-based Operators. In this case, we use a specific DiscoveryClient from the discovery package. Once you have retrieved it, you can request the API groups and detect that you are on OpenShift:

import (
    [...]
    "strings"

    "k8s.io/client-go/discovery"
    "k8s.io/client-go/rest"
)

// getDiscoveryClient returns a discovery client for the current reconciler
func getDiscoveryClient(config *rest.Config) (*discovery.DiscoveryClient, error) {
    return discovery.NewDiscoveryClientForConfig(config)
}

// Reconcile the state for a FruitsCatalog object and makes changes based on the state read and what is in the FruitsCatalogSpec.
func (r *FruitsCatalogG1Reconciler) Reconcile(req ctrl.Request) (ctrl.Result, error) {
	[...]
    // The discovery package is used to discover APIs supported by a Kubernetes API server.
    config, err := ctrl.GetConfig()
    if err == nil && config != nil {
        dclient, err := getDiscoveryClient(config)
        if err == nil && dclient != nil {
            apiGroupList, err := dclient.ServerGroups()
            if err != nil {
                reqLogger.Info("Error while querying ServerGroups, assuming we're on Vanilla Kubernetes")
            } else {
                for i := 0; i < len(apiGroupList.Groups); i++ {
                    if strings.HasSuffix(apiGroupList.Groups[i].Name, ".openshift.io") {
                        isOpenShift = true
                        reqLogger.Info("We detected being on OpenShift! Wouhou!")
                        break
                    }
                }
            }
        } else {
            reqLogger.Info("Cannot retrieve a DiscoveryClient, assuming we're on Vanilla Kubernetes")
        }
    }
    [...]
}

You could use this mechanism to detect other installed Operators, Ingress classes, storage capabilities, and so on.
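
As a rough sketch of how that generalizes, the check can be factored into small helpers that work on a plain list of API group names, which you would fill from the discovery result. The helper names hasGroup and hasGroupSuffix are hypothetical, and the group list below is sample data rather than output from a real cluster.

```go
package main

import (
	"fmt"
	"strings"
)

// hasGroupSuffix reports whether any API group name ends with suffix,
// for example ".openshift.io".
func hasGroupSuffix(groups []string, suffix string) bool {
	for _, g := range groups {
		if strings.HasSuffix(g, suffix) {
			return true
		}
	}
	return false
}

// hasGroup reports whether an exact API group name is present.
func hasGroup(groups []string, name string) bool {
	for _, g := range groups {
		if g == name {
			return true
		}
	}
	return false
}

func main() {
	// Sample group names, as they might be collected from ServerGroups().
	groups := []string{"apps", "networking.k8s.io", "route.openshift.io"}

	fmt.Println("OpenShift detected:", hasGroupSuffix(groups, ".openshift.io"))
	fmt.Println("Route API available:", hasGroup(groups, "route.openshift.io"))
	// cert-manager.io is used here only as an example of another extension.
	fmt.Println("cert-manager installed:", hasGroup(groups, "cert-manager.io"))
}
```

Keeping the detection logic free of client types also makes it easy to unit test without a cluster.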

Tip 4: Using extensions APIs in Go-based Operators

This tip is specific to Go-based Operators. Because Ansible and Helm treat everything as YAML, you can freely describe any kind of resource you need. Validation only occurs on the cluster API side, through OpenAPI v3 validation or an admission hook.

Go is a strongly typed language, which is clearly an advantage when you are dealing with complex Operators and data structures. With Go, you can rely on tools like an integrated development environment (IDE) to help you through code completion and inline documentation, so you can validate the Kubernetes resources you are using before actually submitting them to the cluster. However, when you want to build something with API extensions, you'll have to integrate them as Go dependencies and register them within your own client runtime. I'll show you how to do that.

Note: While the following discussion might seem obvious to developers familiar with Go and Kubernetes, my background is in Java, and it took me a moment to figure it out.

Integrating API extensions as Go dependencies

First, you have to include the new API extension dependencies within your go.mod file at the project root. For this, Go modules use either a Git tag or branch name (in the latter case, it appears to translate the branch name into the latest commit hash). Following my previous example, if I want to use an OpenShift-specific data structure for a Route resource, I have to add the following:

require (
   [...]
   github.com/openshift/api v3.9.0+incompatible   // v3.9.0 is the last tag. New releases are managed as branches
   // github.com/openshift/client-go release-4.5  // As an example of integrating the OpenShift-specific client lib
)

The next step is to register one or more packages into the supported runtime schemes. This allows you to use Route Go objects with the standard Kubernetes Go client. For that, you have to modify the main.go file that was generated at the project root. Add a new import and register the scheme in the init() function:

import (
    [...]
    routev1 "github.com/openshift/api/route/v1"
)

func init() {
    utilruntime.Must(clientgoscheme.AddToScheme(scheme))
    utilruntime.Must(redhatv1beta1.AddToScheme(scheme))
    utilruntime.Must(routev1.AddToScheme(scheme))
    // +kubebuilder:scaffold:scheme
}

Finally, within your Go Reconcile() function or another Operator package, you'll be able to manipulate the Route structure in a strongly typed fashion that helps keep you on track. You can then create this object using the standard client available in your controller:

return &routev1.Route{
    [...]
    Spec: routev1.RouteSpec{
        To: routev1.RouteTargetReference{
            Name:   spec.AppName + "-webapp",
            Kind:   "Service",
            Weight: &weight,
        },
        Port: &routev1.RoutePort{
            TargetPort: intstr.IntOrString{
                Type:   intstr.String,
                StrVal: "http",
            },
        },
        TLS: &routev1.TLSConfig{
            Termination:                   routev1.TLSTerminationEdge,
            InsecureEdgeTerminationPolicy: routev1.InsecureEdgeTerminationPolicyNone,
        },
        WildcardPolicy: routev1.WildcardPolicyNone,
    },
}

Tip 5: Adjusting Operator resource consumption

My final tip is to watch your resources—which is a sentence with a double meaning.

In the specific context of Kubernetes Operators, I mean that the Operator controller watches the custom resource, which describes what is often called the Operand. It is also important to configure your Operator to watch dependent resources. While there is some really good documentation on watching dependent resources (see the docs for dependent watches, resources watched by the controller, and using predicates for event filtering), there's no need to dive into these right now. What's important to know is that watching more software resources impacts your physical resources, namely, CPU and memory.

That is the second meaning of the sentence: Once your Operator starts growing—which can happen very quickly—you should think carefully about the resources that it consumes. The default requests and limits set low values that should be adapted to your needs. This is especially true for Helm- or Ansible-based Operators.

Before you start raising CPU and memory, make sure you pay attention to the concurrent reconciles that your Operator should manage. Simply put: How many custom resources should your Operator manage simultaneously? By default, the Operator SDK sets this value to the number of cores present on the node running the Operator. If you're watching many resources and you have big nodes, however, then this setting could act as a multiplying factor for the consumed resources. Moreover, if your Kubernetes Operator is scoped to a specific namespace, there is little chance that you'll need 16 concurrent reconciles for one or two custom resources in your namespace.
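
To illustrate what a cap on concurrent reconciles means in practice, here is a small, self-contained Go sketch that processes a list of items with a bounded number of goroutines. It is only a model of the idea: the Operator SDK and controller-runtime implement their workers differently, and reconcileAll and the item names are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// reconcileAll runs reconcile for every item with at most maxConcurrent
// calls in flight at once, and returns the highest number of calls that
// were observed running at the same time.
// It assumes maxConcurrent is at least 1.
func reconcileAll(items []string, maxConcurrent int, reconcile func(string)) int {
	sem := make(chan struct{}, maxConcurrent) // counting semaphore
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		inFlight int
		peak     int
	)
	for _, item := range items {
		wg.Add(1)
		sem <- struct{}{} // blocks while maxConcurrent reconciles are running
		go func(name string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot when done

			mu.Lock()
			inFlight++
			if inFlight > peak {
				peak = inFlight
			}
			mu.Unlock()

			reconcile(name)

			mu.Lock()
			inFlight--
			mu.Unlock()
		}(item)
	}
	wg.Wait()
	return peak
}

func main() {
	items := []string{"cr-1", "cr-2", "cr-3", "cr-4", "cr-5", "cr-6", "cr-7", "cr-8"}
	// Simulate some work per custom resource.
	work := func(string) { time.Sleep(10 * time.Millisecond) }

	peak := reconcileAll(items, 4, work)
	fmt.Println("peak concurrent reconciles:", peak)
}
```

Whatever the number of pending custom resources, the peak never exceeds the configured limit, which is why lowering that limit bounds the CPU and memory the Operator can consume at any moment.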

Managing concurrent reconciles

You can use the --max-concurrent-reconciles flag to set the maximum number of concurrent reconciles. The new Operator SDK project layout takes advantage of Kustomize, so you'll have to change the config/default/manager_auth_proxy_patch.yaml file like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: controller-manager
  namespace: system
spec:
  template:
    spec:
      containers:
      - name: kube-rbac-proxy
        [...]
      - name: manager
        args:
        - "--metrics-addr=127.0.0.1:8080"
        - "--enable-leader-election"
        - "--leader-election-id=fruits-catalog-operator"
        - "--max-concurrent-reconciles=4"

After that, you can set up the resource requests and limits in the usual Kubernetes way.

Wrap up

In this article, I've shared five tips that made my life easier while developing Operators with the newly released Kubernetes Operator SDK 1.0.0. I have a strong background in Java but not in Ansible or Go, so each of the issues that I discussed made me scratch my head for a few hours. The tips might be obvious to experienced developers, but I hope that they will save other developers some time.

What about you? What are your tricks for working with Kubernetes Operators?

Last updated: January 12, 2024