Kubernetes Operators are all the rage this season, and the fame is well deserved. Operators are evolving from being used primarily by technical-infrastructure gurus to becoming more mainstream, Kubernetes-native tools for managing complex applications. Kubernetes Operators today are important for cluster administrators and ISV providers, and also for custom applications developed in house. They provide the base for a standardized operational model that is similar to what cloud providers offer. Operators also open the door to fully portable workloads and services on Kubernetes.
The new Kubernetes Operator Framework is an open source toolkit that lets you manage Kubernetes Operators in an effective, automated, and scalable way. The Operator Framework consists of three components: the Operator SDK, the Operator Lifecycle Manager, and OperatorHub. In this article, I introduce tips and tricks for working with the Operator SDK. The Operator SDK 1.0.0 release shipped in mid-August, so it's a great time to have a look at it.
Note: Started by CoreOS and continued by Red Hat over the past year, the Operator Framework initiative entered incubation with the Cloud Native Computing Foundation in July 2020.
Exploring the Kubernetes Operator SDK
I took advantage of the summer holidays to explore the new Operator SDK 1.0.0 release. For my experimentation, I developed Operators using Helm, Ansible, and Go, and deployed them on both vanilla Kubernetes and Red Hat OpenShift. These languages are the ones proposed by Operator SDK and they offer a range of capabilities, from simple to very sophisticated Operators. Of course, you can use other technologies to develop your Operator as well, like Python or Quarkus. I found good resources to guide me—namely, the 'Hello, World' tutorial with Kubernetes Operators, Operator best practices, and Kubernetes Operators best practices for Go—but I am not that familiar with Go or Ansible, so I scratched my head a lot. The tips I'm sharing are all things that I wish I had known before I started. I hope that they will also help you.
Note: All of the code examples and resources we'll use are available in the GitHub repository for this article.
Tip 1: Handling default CRD values
Every Kubernetes Operator comes with its own custom resource definition (CRD), which is the grammar used to describe high-level resource specifications in a Kubernetes cluster. From a first-time user perspective, a simpler CRD is better; however, experienced users will appreciate the advanced tweaking options. Handling default values for all of your custom resource instances is crucial for keeping things simple and configurable, but each tool does it a little differently.
As an example, let's say that we want to deploy an application made of two components: a web application and a database. First-time users would deploy it using a simple custom resource like the one below:
```yaml
apiVersion: redhat.com/v1beta1
kind: FruitsCatalog
metadata:
  name: fruitscatalog-sample
spec:
  appName: my-fruits-catalog
```
We will also need advanced options for the number of replicas, persistent storage, ingress, and so on.
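To give an idea, a more advanced custom resource could override those options explicitly. The sketch below is illustrative only; the field names are taken from the webapp and mongodb default values used throughout this article, and the overridden values are arbitrary:

```yaml
apiVersion: redhat.com/v1beta1
kind: FruitsCatalog
metadata:
  name: fruitscatalog-sample
spec:
  appName: my-fruits-catalog
  webapp:
    replicaCount: 2
    image: quay.io/lbroudoux/fruits-catalog:latest
  mongodb:
    install: true
    persistent: true
    volumeSize: 5Gi
```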
Custom resource default values with Helm
A Helm chart defines a values.yaml file for handling custom resource default values. Using the Helm-based Operator SDK, it's pretty easy to add consistent values to our example file:
```yaml
# Default values for fruitscatalog.
appName: fruits-catalog-helm
webapp:
  replicaCount: 1
  image: quay.io/lbroudoux/fruits-catalog:latest
  [...]
mongodb:
  install: true
  image: centos/mongodb-34-centos7:latest
  persistent: true
  volumeSize: 2Gi
  [...]
```
Custom resource default values with Ansible
The Ansible-based Operator SDK does not provide an out-of-the-box way to handle custom resource default values. The trick I've found requires that you make three modifications to your Operator project.
First, create a roles/fruitscatalog/defaults/main.yml file for handling default values. Be aware of Ansible's usage of snake case, which differs from the camel case normally used for custom resource attributes. As an example, Ansible transforms replicaCount into replica_count, so you have to use this form in your Operator:
```yaml
---
# defaults file for fruitscatalog
name: fruits-catalog-ansible
webapp:
  replica_count: 1
  image: quay.io/lbroudoux/fruits-catalog:latest
  [...]
mongodb:
  install: true
  image: centos/mongodb-34-centos7:latest
  persistent: true
  volume_size: 2Gi
  [...]
```
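To make the naming convention concrete, here is a minimal Go sketch of the camelCase-to-snake_case transformation that the Ansible-based SDK applies to custom resource attribute names. This is purely an illustration of the naming rule, not the SDK's actual conversion code:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// toSnakeCase illustrates how a camelCase custom resource attribute
// maps to the snake_case variable name seen inside an Ansible role.
func toSnakeCase(s string) string {
	var b strings.Builder
	for i, r := range s {
		if unicode.IsUpper(r) {
			// Insert an underscore before each interior uppercase letter,
			// then lowercase it.
			if i > 0 {
				b.WriteRune('_')
			}
			b.WriteRune(unicode.ToLower(r))
		} else {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(toSnakeCase("replicaCount")) // replica_count
	fmt.Println(toSnakeCase("volumeSize"))   // volume_size
}
```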
Once this file is present in your role, the Operator SDK will use it to initialize the missing parts in the user-supplied custom resource. The limit of this approach is that the SDK only performs a first-level merge. If a user only puts webapp.replicaCount into the custom resource, the other default child attributes will not be merged into the webapp variable. Basically, you will have to handle the merge process explicitly, using Ansible's combine() filter.
So, at the very beginning of the role, we need to ensure that we have a complete resource, based on what the user provided and merged with the defaults:
```yaml
- name: Load default values from defaults/main.yml
  include_vars:
    file: ../defaults/main.yml
    name: default_cr

- name: Complete Custom Resource spec with default values
  set_fact:
    webapp_full: "{{ default_cr.webapp|combine(webapp, recursive=True) }}"
    mongodb_full: "{{ default_cr.mongodb|combine(mongodb, recursive=True) }}"
```
The trick here is that the webapp and mongodb variables initialized by the SDK cannot be written; you will have to create new variables like webapp_full and base your Ansible templates on the latter. What's nice is that this approach is fully functional when running your Kubernetes Operator locally using make run or ansible-operator run.
Custom resource default values with Go
The Go-based Operator SDK also requires its own approach. You can define an initialization method in the controller (as described in Kubernetes Operators best practices), but I believe there's a better way of handling it.
Using the Kubernetes apiextensions.k8s.io/v1 API, it is now possible to define default values directly within the CRD. With Helm and Ansible, you can complete the OpenAPI part of the CRD manually. For a Go-based Operator, you can use the +kubebuilder comments in your Go code:
```go
// WebAppSpec defines the desired state of WebApp
// +k8s:openapi-gen=true
type WebAppSpec struct {
	// +kubebuilder:default:=1
	ReplicaCount int32 `json:"replicaCount,omitempty"`
	// +kubebuilder:default:="quay.io/lbroudoux/fruits-catalog:latest"
	Image string `json:"image,omitempty"`
	[...]
}
```
To enable this option, you have to tweak the project's Makefile to force the SDK to generate apiextensions.k8s.io/v1 manifests:
```makefile
CRD_OPTIONS ?= "crd:trivialVersions=true,crdVersions=v1"
```
Running the make manifests command in your project generates a full CRD with default values for future custom resource instances:
```yaml
---
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  annotations:
    controller-gen.kubebuilder.io/version: v0.3.0
  creationTimestamp: null
  name: fruitscatalogs.redhat.com
spec:
  group: redhat.com
  names:
    kind: FruitsCatalog
    listKind: FruitsCatalogList
    plural: fruitscatalogs
    singular: fruitscatalog
  scope: Namespaced
  versions:
  - name: v1beta1
    schema:
      openAPIV3Schema:
        [...]
        spec:
          description: FruitsCatalogSpec defines the desired state of FruitsCatalog
          properties:
            [...]
            webapp:
              description: WebAppSpec defines the desired state of WebApp
              properties:
                image:
                  default: quay.io/lbroudoux/fruits-catalog:latest
                  type: string
                replicaCount:
                  default: 1
                  format: int32
                  type: integer
                [...]
```
That's pretty neat.
Tip 2: Preparing your Operator for OpenShift
One nice thing about the Operator SDK is that it scaffolds a huge part of your project from operator-sdk init or operator-sdk create api. That scaffold is much of what you need to deploy your Operator to OpenShift, but it's not everything. During my experiments, I found one missing piece, related to role-based access control (RBAC) permissions. Essentially, the Operator should be granted just enough permissions to do its job without having full access to the cluster.
When generating Kubernetes resources, an Operator should register its custom resource as the owner of those resources. That makes it easier to watch them and implement finalizers. Typically, each generated resource includes an ownerReference field that references the created CR:
```yaml
ownerReferences:
- apiVersion: redhat.com/v1beta1
  blockOwnerDeletion: true
  controller: true
  kind: FruitsCatalog
  name: fruitscatalog-sample
  uid: c5d7e996-013f-40ca-bd19-14ba73728eaf
```
The default scaffolding works well on vanilla Kubernetes. But on OpenShift, the Operator needs to be able to set finalizers on the custom resource after it's been created in order to set the ownerReference block. So you have to add the extra permissions for your Operator as described below.
Adding RBAC permissions with Helm and Ansible
With Helm- and Ansible-based Operators, you can configure the RBAC permissions within the config/rbac/role.yaml file. You would typically add something like this:
```yaml
- apiGroups:
  - redhat.com
  resources:
  - fruitscatalogs
  - fruitscatalogs/status
  - fruitscatalogs/finalizers  # Missing line that is not added by the SDK
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
```
Adding RBAC permissions with Go
With Go-based Operators, you can use a +kubebuilder:rbac comment to set the RBAC permissions directly in the controller source code. Just add something like this to your Reconcile function comments:
```go
[...]
// +kubebuilder:rbac:groups=redhat.com,resources=fruitscatalogs/finalizers,verbs=get;create;update;patch;delete

// Reconcile the state for a FruitsCatalog object and makes changes based on the state read and what is in the FruitsCatalogSpec.
func (r *FruitsCatalogG1Reconciler) Reconcile(req ctrl.Request) (ctrl.Result, error) {
	[...]
}
```
Note: These permissions might be added by default in a future release. See Pull Request #3779: Helm Operator: add finalizers permission for created APIs for details and tracking.
Tip 3: Discovering the cluster you're running on
Operators are expected to be adaptable, which means that they have to be able to change their actions and the resources they manage depending on the environment. A straightforward illustration is using route capabilities instead of ingress when running on OpenShift. To make that change, a Kubernetes Operator should be able to discover the Kubernetes distribution that it is deployed on and any extensions that have been installed on the cluster. Currently, such advanced discovery can only be done using the Ansible- and Go-based Operators.
Advanced discovery with Ansible
In Ansible-based Operators, we use a k8s lookup requesting the api_groups present on the cluster. Then, we can detect that we're running on OpenShift and create a Route only when appropriate:
```yaml
- name: Get information about the cluster
  set_fact:
    api_groups: "{{ lookup('k8s', cluster_info='api_groups') }}"

[...]

- name: The Webapp Route is present if OpenShift
  when: "'route.openshift.io' in api_groups"
  k8s:
    state: present
    definition: "{{ lookup('template', 'webapp-route.yml') | from_yaml }}"
```
Advanced discovery with Go
This type of discovery is a little more complex with Go-based Operators. In this case, we use a specific DiscoveryClient from the discovery package. Once retrieved, you can make a request to retrieve the API groups and detect that you are on OpenShift:
```go
import (
	[...]
	"k8s.io/client-go/discovery"
)

// getDiscoveryClient returns a discovery client for the current reconciler
func getDiscoveryClient(config *rest.Config) (*discovery.DiscoveryClient, error) {
	return discovery.NewDiscoveryClientForConfig(config)
}

// Reconcile the state for a FruitsCatalog object and makes changes based on the state read and what is in the FruitsCatalogSpec.
func (r *FruitsCatalogG1Reconciler) Reconcile(req ctrl.Request) (ctrl.Result, error) {
	[...]
	// The discovery package is used to discover APIs supported by a Kubernetes API server.
	config, err := ctrl.GetConfig()
	if err == nil && config != nil {
		dclient, err := getDiscoveryClient(config)
		if err == nil && dclient != nil {
			apiGroupList, err := dclient.ServerGroups()
			if err != nil {
				reqLogger.Info("Error while querying ServerGroups, assuming we're on Vanilla Kubernetes")
			} else {
				for i := 0; i < len(apiGroupList.Groups); i++ {
					if strings.HasSuffix(apiGroupList.Groups[i].Name, ".openshift.io") {
						isOpenShift = true
						reqLogger.Info("We detected being on OpenShift! Wouhou!")
						break
					}
				}
			}
		} else {
			reqLogger.Info("Cannot retrieve a DiscoveryClient, assuming we're on Vanilla Kubernetes")
		}
	}
	[...]
}
```
You could use this mechanism to detect other installed Operators, Ingress classes, storage capabilities, and so on.
Tip 4: Using extensions APIs in Go-based Operators
This tip is specific to Go-based Operators. Because Ansible and Helm treat everything as YAML, you can freely describe any kind of resource you need. Validation only occurs on the cluster API side, through OpenAPI v3 validation or an admission hook.
Go is a strongly typed language, which is a clear advantage when you are dealing with complex Operators and data structures. With Go, you can rely on your integrated development environment (IDE) for code completion and inline documentation, so you can validate the Kubernetes resources you are using before actually submitting them to the cluster. However, when you want to build something with API extensions, you'll have to integrate them as Go dependencies and register them within your own client runtime. I'll show you how to do that.
Note: While the following discussion might seem obvious to developers familiar with Go and Kubernetes, my background is in Java, and it took me a moment to figure it out.
Integrating API extensions as Go dependencies
First, you have to include the new API extension dependencies within your go.mod file at the project root. For this, Go modules use either a Git tag or a branch name (in the latter case, it appears to translate the branch name into the latest commit hash). Following my previous example, if I want to use an OpenShift-specific data structure for a Route resource, I have to add the following:
```go
require (
	[...]
	github.com/openshift/api v3.9.0+incompatible // v3.9.0 is the last tag. New releases are managed as branches
	// github.com/openshift/client-go release-4.5 // As an example of integrating the OpenShift-specific client lib
)
```
The next step is to register one or more packages into the supported runtime schemes. This allows you to use Route Go objects with the standard Kubernetes Go client. For that, you have to modify the main.go file that was generated at the project root. Add a new import and register the scheme in the init() function:
```go
import (
	[...]
	routev1 "github.com/openshift/api/route/v1"
)

func init() {
	utilruntime.Must(clientgoscheme.AddToScheme(scheme))
	utilruntime.Must(redhatv1beta1.AddToScheme(scheme))
	utilruntime.Must(routev1.AddToScheme(scheme))
	// +kubebuilder:scaffold:scheme
}
```
Finally, within your Go Reconcile() function or another Operator package, you'll be able to manipulate the Route structure in a strongly typed fashion that helps keep you on track. You can then create this object using the standard client present in your controller:
```go
return &routev1.Route{
	[...]
	Spec: routev1.RouteSpec{
		To: routev1.RouteTargetReference{
			Name:   spec.AppName + "-webapp",
			Kind:   "Service",
			Weight: &weight,
		},
		Port: &routev1.RoutePort{
			TargetPort: intstr.IntOrString{
				Type:   intstr.String,
				StrVal: "http",
			},
		},
		TLS: &routev1.TLSConfig{
			Termination:                   routev1.TLSTerminationEdge,
			InsecureEdgeTerminationPolicy: routev1.InsecureEdgeTerminationPolicyNone,
		},
		WildcardPolicy: routev1.WildcardPolicyNone,
	},
}
```
Tip 5: Adjusting Operator resource consumption
My final tip is to watch your resources—which is a sentence with a double meaning.
In the specific context of Kubernetes Operators, I mean that the Operator's controller watches the custom resource and its dependent resources, which together make up the managed application, often called the operand. It is important to configure your Operator to watch dependent resources. While there is some really good documentation on watching dependent resources (see the docs for dependent watches, resources watched by the controller, and using predicates for event filtering), there's no need to dive into these right now. What's important to know is that watching more software resources impacts your physical resources, namely, CPU and memory.
That is the second meaning of the sentence: Once your Operator starts growing—which can happen very quickly—you should think carefully about the resources that it consumes. The default requests and limits set low values that should be adapted to your needs. This is especially true for Helm- or Ansible-based Operators.
Before you start raising CPU and memory, make sure you pay attention to the number of concurrent reconciles that your Operator should manage. Simply put: How many custom resources should your Operator handle simultaneously? By default, the Operator SDK sets this value to the number of cores present on the node running the Operator. If you're watching many resources and you have big nodes, however, this setting can act as a multiplying factor on the consumed resources. Moreover, if your Kubernetes Operator is scoped to a specific namespace, there is little chance that you'll need 16 concurrent reconciles for one or two custom resources in that namespace.
Managing concurrent reconciles
You can easily use the --max-concurrent-reconciles flag to set the maximum number of concurrent reconciles. The new Operator SDK project layout takes advantage of Kustomize, so you'll have to change the config/default/manager_auth_proxy_patch.yaml file like this:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: controller-manager
  namespace: system
spec:
  template:
    spec:
      containers:
      - name: kube-rbac-proxy
        [...]
      - name: manager
        args:
        - "--metrics-addr=127.0.0.1:8080"
        - "--enable-leader-election"
        - "--leader-election-id=fruits-catalog-operator"
        - "--max-concurrent-reconciles=4"
```
After that, you can set the resource requests and limits in the usual Kubernetes way.
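For example, a patch on the manager container might look like the following sketch. The file name and the values are illustrative starting points only, not recommendations; tune them based on what you observe from your own Operator:

```yaml
# Hypothetical Kustomize patch, e.g. config/default/manager_resources_patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: controller-manager
  namespace: system
spec:
  template:
    spec:
      containers:
      - name: manager
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 256Mi
```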
Wrap up
In this article, I've shared five tips that made my life easier while developing Operators with the newly released Kubernetes Operator SDK 1.0.0. Having a strong background in Java, but not in Ansible or Go, I scratched my head for a few hours over each of the issues I discussed. The tips might be obvious to experienced developers, but I hope they will save other developers some time.
What about you? What are your tricks for working with Kubernetes Operators?
Last updated: January 12, 2024