As part of the Open Data Hub project, we see potential and value in the Kubeflow project, so we dedicated our efforts to enabling Kubeflow on Red Hat OpenShift. We decided to use Kubeflow 0.7, as that was the latest released version at the time this work began. The work included adding new installation scripts that provide all of the necessary changes, such as permissions for service accounts, required to run on OpenShift.
The installation of Kubeflow is limited to the following components:
- Central dashboard
- Jupyterhub
- Katib
- Pipelines
- PyTorch, tf-jobs (training)
- Seldon (serving)
- Istio
All of the new fixes and features will be proposed upstream to the Kubeflow project in the near future.
Prerequisites
To install Kubeflow on OpenShift, there are prerequisites regarding the platform and the tools.
Platform
To run this installation, OpenShift is needed as a platform. You can use either OpenShift 4.2 or Red Hat CodeReady Containers (CRC). If you choose OpenShift 4.2, all that you need is an available OpenShift 4.2 cluster. Or, you can try a cluster on try.openshift.com.
If you choose CodeReady Containers, you need a CRC-generated OpenShift cluster. Here are the recommended specifications:
- 16GB RAM
- 6 CPUs
- 45GB disk space
The minimum specifications are:
- 10GB RAM
- 6 CPUs
- 30GB disk space (the default for CRC)
Note: At the minimum specs, the CRC OpenShift cluster might be unresponsive for approximately 20 minutes while the Kubeflow components are being deployed.
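If you are creating the CRC cluster yourself, you can request these resources when starting it. The following is a minimal sketch; exact flag availability varies between CRC releases, and memory is specified in MiB:
$ crc setup
$ crc start --cpus 6 --memory 16384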
When installing Kubeflow on a CRC cluster, there is an extra overlay (named "crc") to enable the metadata component in kfctl_openshift.yaml. This overlay is commented out by default. Uncomment the overlay to enable it.
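For illustration only, the commented-out overlay sits inside the metadata application entry of kfctl_openshift.yaml, roughly like the sketch below; the field layout in the shipped file may differ:
# Illustrative sketch of the metadata application entry in kfdef/kfctl_openshift.yaml.
applications:
  - kustomizeConfig:
      overlays:
      # - crc          # uncomment to enable the metadata component on CRC
      repoRef:
        name: manifests
        path: metadata
    name: metadata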
Tools
The installation tool kfctl is needed to install/uninstall Kubeflow. Download the tool from GitHub. Version 0.7.0 is required for this installation.
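As an example, the following commands unpack and install the binary on Linux; the exact release archive name and its download URL on GitHub may differ, so treat this as a sketch:
# Download the kfctl v0.7.0 Linux archive from the Kubeflow GitHub releases page first;
# the asset is typically named something like kfctl_v0.7.0_linux.tar.gz.
$ tar -xvf kfctl_v0.7.0_linux.tar.gz
$ chmod +x kfctl
$ sudo mv kfctl /usr/local/bin/
$ kfctl version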
Installing Kubeflow with Istio enabled
As noted earlier, we added a KFDef file to specifically install Kubeflow on OpenShift and included fixes for different components. To install Kubeflow 0.7 on OpenShift 4.2, please follow the steps below. It is assumed that this installation will run on an OpenShift 4.2 cluster:
- Clone the opendatahub-io/manifests fork repo, which defaults to the v0.7.0-branch-openshift branch:
$ git clone https://github.com/opendatahub-io/manifests.git
$ cd manifests
- Install using the OpenShift configuration file and the locally downloaded manifests, since at the time of writing we ran into this Kubeflow bug that would not allow downloading the manifests during a build process:
$ sed -i 's#uri: .*#uri: '$PWD'#' ./kfdef/kfctl_openshift.yaml
$ kfctl build --file=kfdef/kfctl_openshift.yaml
$ kfctl apply --file=./kfdef/kfctl_openshift.yaml
- Verify your installation:
$ oc get pods
- Launch the Kubeflow portal:
$ oc get routes -n istio-system istio-ingressgateway -o jsonpath='http://{.spec.host}/'
http://<istio ingress route>/
Deleting a Kubeflow installation
To delete a Kubeflow installation, follow these steps:
$ kfctl delete --file=./kfdef/<kfctl file name>.yaml
$ rm -rf kfdef/kustomize/
$ oc delete mutatingwebhookconfigurations.admissionregistration.k8s.io --all
$ oc delete validatingwebhookconfigurations.admissionregistration.k8s.io --all
$ oc delete namespace istio-system
Kubeflow components
To enable the installation of Kubeflow 0.7 on OpenShift 4.2, we added features and fixes to alleviate the installation issues we encountered. The following is a list of components along with a description of the changes and usage examples.
OpenShift KFDef
KFDef is a specification designed to control the provisioning and management of Kubeflow deployment. This spec is generally distributed in YAML format and follows a pattern of custom resources popular in Kubernetes to extend the platform. With the upcoming addition of Kubeflow Operator, KFDef is becoming the custom resource used for Kubeflow deployment and lifecycle management.
KFDef is built on top of Kustomize, which is a Kubernetes-native configuration management system. To deploy Kubeflow to OpenShift, we had to create a new KFDef YAML file that customizes the deployment manifests of Kubeflow components for OpenShift. With Kustomize as a configuration management layer for every component, it was necessary to add OpenShift-specific Kustomize overlays (patches applied to the default set of resource manifests when an overlay is selected).
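As a rough illustration (not taken verbatim from the repository), an OpenShift overlay for a component is a small kustomization.yaml plus patch files layered on top of that component's base manifests:
# overlays/openshift/kustomization.yaml (illustrative sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
bases:
  - ../../base
patchesStrategicMerge:
  - deployment-patch.yaml   # e.g. drop settings that conflict with OpenShift SCCs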
Take a look at the OpenShift-specific KFDef file used in the deployment steps above in the opendatahub-io/manifests repository.
Central dashboard
The central dashboard works out of the box, provided that you access the Kubeflow web UI using the route for istio-ingressgateway in the istio-system namespace.
Upon first accessing the web UI, you will be prompted to create a Kubeflow user namespace. This is a one-time action for creating a single namespace. If you want to make additional namespaces accessible for Kubeflow deployment of notebook servers, Pipelines, etc., you can create a Kubeflow profile. By default, the central dashboard does not have authentication enabled.
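For reference, a Kubeflow profile is created through a Profile custom resource similar to the following sketch; the apiVersion and owner fields may differ slightly in Kubeflow 0.7, and the names below are placeholders:
apiVersion: kubeflow.org/v1beta1
kind: Profile
metadata:
  name: my-extra-namespace        # the namespace that will be created
spec:
  owner:
    kind: User
    name: user@example.com        # the user who owns the namespace
Applying this file with oc apply -f creates the namespace and the Kubeflow access objects for it.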
Jupyter controller
We are using three Jupyter controller customizations: a custom notebook controller, a custom profile controller, and a custom notebook image. Let's take a look at each.
Custom notebook controller
We are using a customized notebook controller to avoid the default behavior of setting fsGroup: 100 in the stateful set that is created when spawning a notebook. That value would require a special security context constraint (SCC) for the service account in OpenShift. To further complicate matters, that SCC would need to be granted to a service account that is created only when the profile is created, so it’s not something that can be done during installation.
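For context, granting such an SCC manually would look roughly like the following hypothetical command (the SCC, service account, and namespace names are placeholders), and it could only be run after the profile exists:
# Hypothetical example: grant an SCC to the service account used by notebooks
# in a profile namespace. This is the manual step the custom controller avoids.
$ oc adm policy add-scc-to-user <scc name> -z <notebook service account> -n <profile namespace>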
Custom profile controller
We are using a customized profile controller to avoid the default behavior of newly created profiles having the label istio-injection: enabled. That label causes the container to attempt to start an istio-init container that, in turn, tries to use iptables, which is not available in OpenShift 4.x. That init container will fail and cause the notebook start to fail.
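If you run into this with a namespace that already carries the label, you can check for it and remove it manually; the namespace name below is a placeholder:
# Show the labels on the profile namespace.
$ oc get namespace <profile namespace> --show-labels
# Remove the istio-injection label (the trailing dash deletes the label).
$ oc label namespace <profile namespace> istio-injection-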
Custom notebook image
We also added our own custom notebook image, which is prepopulated in the image selection dropdown. This image provides filesystem permissions in the /home/jovyan directory. It offers the functionality described here.
Katib
Katib suffered from two main problems. The first was not being able to run cleanly as an unprivileged user (#960, #962, #967). The second was that it damaged the generated security context in a pod when mutating the pod (#964). Both have been fixed in the upstream Katib repositories, and Katib now runs without issues on OpenShift.
The second issue, in particular, is a pattern common in applications relying on mutating webhooks where part of the mutation is adding a sidecar container to the pod that is being deployed. If the new container does not have an initialized security context, the pod admission policy controller will prevent its deployment. We have seen the same issue in the KFServing component.
Pipelines
To get Kubeflow Pipelines working on OpenShift, we had to specify the k8sapi executor for Argo because OpenShift 4.2 does not include a Docker daemon and CLI. Instead, it uses CRI-O as the container engine by default. We also had to add the finalizers to the workflow permissions for OpenShift to be able to set owner references.
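The OpenShift manifests already apply this setting; for illustration, the Argo executor is selected through the workflow controller's ConfigMap, roughly as in the following sketch:
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: kubeflow
data:
  config: |
    containerRuntimeExecutor: k8sapi   # use the Kubernetes API instead of the Docker executor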
This practice allows running YAML-based Pipelines that conform to Argo’s specification for k8sapi Pipelines execution, specifically the requirement to save parameters and artifacts in volumes (such as emptyDir) rather than in a path that is part of the base image layer (for example, /tmp). This requirement caused all of the example Kubeflow Python Pipelines to fail with errors. To test your Pipelines, use the fraud detection Pipelines provided in this article.
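As a minimal sketch of that condition (the image name, file names, and paths below are placeholders, not taken from the Kubeflow examples), a step should write its outputs to a mounted volume:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: k8sapi-friendly-
spec:
  entrypoint: train
  volumes:
    - name: outputs
      emptyDir: {}                  # outputs live on a volume, not in the image layer
  templates:
    - name: train
      container:
        image: <your training image>
        command: ["python", "train.py", "--output-dir", "/mnt/outputs"]
        volumeMounts:
          - name: outputs
            mountPath: /mnt/outputs
      outputs:
        artifacts:
          - name: model
            path: /mnt/outputs/model.pkl   # readable by the k8sapi executor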
For the minio installation, we also created a service account and gave that account permission to run as anyuid.
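The equivalent manual step would look roughly like this; the service account name is a placeholder:
$ oc adm policy add-scc-to-user anyuid -z <minio service account> -n kubeflow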
Training
For training, we had to make changes for two of the apps: PyTorch and TensorFlow jobs (tf-jobs).
PyTorch
For PyTorch, we did not have to make any changes to the component. However, we did have to make changes to the Dockerfile of one of the examples found here. To run the example MNIST test, we had to add the required folders and permissions to the Dockerfile, as follows:
- Change the Dockerfile to include the following:
FROM pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime
RUN pip install tensorboardX==1.6.0
RUN chmod 777 /var
WORKDIR /var
ADD mnist.py /var
RUN mkdir /data
RUN chmod 777 /data
ENTRYPOINT ["python", "/var/mnist.py"]
- Build and push the Dockerfile to your registry:
podman build -f Dockerfile -t <your registry name>/pytorch-dist-mnist-test:2.0 ./
podman push <your registry name>/pytorch-dist-mnist-test:2.0
- Add the registry image URL to the installation YAML file. We tested this setup without GPU and our file is the following:
apiVersion: "kubeflow.org/v1" kind: "PyTorchJob" metadata: name: "pytorch-dist-mnist-gloo" spec: pytorchReplicaSpecs: Master: replicas: 1 restartPolicy: OnFailure template: spec: containers: - name: pytorch image: <your registry name>/pytorch-dist-mnist-test:2.0< args: ["--backend", "gloo"] # Comment out the below resources to use the CPU. resources: {} Worker: replicas: 1 restartPolicy: OnFailure template: spec: containers: - name: pytorch image: <your registry name>/pytorch-dist-mnist-test:2.0 args: ["--backend", "gloo"] # Comment out the below resources to use the CPU. resources: {}
- Create a PyTorch job by running the following command:
$ oc create -f v1/<filename.yaml>
- Check that the worker and master PyTorch pods are running with no errors.
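For example (the exact pod names are generated by the PyTorch operator, so they may differ):
$ oc get pods | grep pytorch-dist-mnist-gloo
$ oc logs <master pod name>    # e.g. pytorch-dist-mnist-gloo-master-0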
Tf-jobs
To get TF-jobs training working on OpenShift, we had to add the tfjobs/finalizers resource to the tf-job-operator ClusterRole for OpenShift to be able to set owner references (a sketch of this ClusterRole change appears after the steps below). Follow these steps to run the example MNIST training job:
- Run:
$ git clone https://github.com/kubeflow/tf-operator
$ cd tf-operator/examples/v1/mnist_with_summaries
- Create the PersistentVolumeClaim shown below (we did have to change the accessModes to ReadWriteOnce for our cluster):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tfevent-volume
  namespace: kubeflow
  labels:
    type: local
    app: tfjob
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
- Run:
$ oc apply -f tfevent-volume/<new pvc filename>.yaml
$ oc apply -f tf_job_mnist.yaml
$ oc describe tfjob mnist
Events:
  Type    Reason                   Age   From         Message
  ----    ------                   ----  ----         -------
  Normal  SuccessfulCreatePod      12m   tf-operator  Created pod: mnist-worker-0
  Normal  SuccessfulCreateService  12m   tf-operator  Created service: mnist-worker-0
  Normal  ExitedWithCode           11m   tf-operator  Pod: kubeflow.mnist-worker-0 exited with code 0
  Normal  TFJobSucceeded           11m   tf-operator  TFJob mnist successfully completed.
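As referenced before the steps above, the tf-job-operator ClusterRole change is roughly the following sketch; the full rule set in the manifests contains additional resources and verbs:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: tf-job-operator
rules:
  - apiGroups:
      - kubeflow.org
    resources:
      - tfjobs
      - tfjobs/status
      - tfjobs/finalizers    # added so the operator can set owner references on OpenShift
    verbs:
      - "*"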
Serving
For serving, we had to make changes for one of the apps: Seldon.
Seldon
To get Seldon to work on OpenShift, we had to delete the "8888" UID value assigned to the engine container that is part of a served model pod. With that value in place, every time a model was served, its engine controller container was assigned UID 8888, which is not within the allowed range of UID values in OpenShift.
To try this out for yourself, here is an example fraud detection model:
- Create a Seldon deployment YAML file using the following example:
{ "apiVersion": "machinelearning.seldon.io/v1alpha2", "kind": "SeldonDeployment", "metadata": { "labels": { "app": "seldon" }, "name": "modelfull", "namespace": "kubeflow" }, "spec": { "annotations": { "project_name": "seldon", "deployment_version": "0.1" }, "name": "modelfull", "oauth_key": "oauth-key", "oauth_secret": "oauth-secret", "predictors": [ { "componentSpecs": [{ "spec": { "containers": [ { "image": "nakfour/modelfull", "imagePullPolicy": "Always", "name": "modelfull", "resources": { "requests": { "memory": "10Mi" } } } ], "terminationGracePeriodSeconds": 40 } }], "graph": { "children": [], "name": "modelfull", "endpoint": { "type" : "REST" }, "type": "MODEL" }, "name": "modelfull", "replicas": 1, "annotations": { "predictor_version" : "0.1" } } ] } }
- Install this configuration by running:
$ oc create -f <filename>.yaml
- Verify that there is a pod running that includes the name modelfull.
- Verify that there is a virtual service that includes the name modelfull.
- From a terminal, send a predict request to the model using this example curl command:
curl -X POST -H 'Content-Type: application/json' -d '{"strData": "0.365194527642578,0.819750231339882,-0.5927999453145171,-0.619484351930421,-2.84752569239798,1.48432160780265,0.499518887687186,72.98"}' http://"Insert istio ingress domain name"/seldon/kubeflow/modelfull/api/v0.1/predictions
Istio
Installing the default Istio provided with Kubeflow 0.7 required adding a route to the Istio ingress gateway service and adding the anyuid security context constraint (SCC) for the multiple service accounts used by Istio's components. These additions give Istio permission to run as a privileged user.
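The installation performs these steps for you; done by hand, they would look roughly like the following sketch (the service account name is a placeholder, and the SCC command is repeated for each Istio service account):
# Expose the Istio ingress gateway through an OpenShift route.
$ oc -n istio-system expose service istio-ingressgateway --port=http2
# Allow an Istio component's service account to run with an arbitrary UID.
$ oc adm policy add-scc-to-user anyuid -z <istio component service account> -n istio-system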
Next steps
The Open Data Hub team is currently focused on multiple next steps or tasks:
- Resolving component issues already discussed in this document, such as Pipelines and Katib.
- Integrating Kubeflow 0.7 with Red Hat Service Mesh on OpenShift 4.2.
- Proposing the changes discussed in this document back upstream to the Kubeflow community.
- Working with the Kubeflow community to add OpenShift to the Kubeflow website as an officially documented, supported platform.
- Architecting and designing a solution for tight integration between Open Data Hub and Kubeflow that includes Operator redesign.