openshift containers

The traditional CPU-centric system architecture is being replaced by designs where systems are aggregated from independently intelligent devices. These systems have their own compute capabilities and can natively run network functions with an accelerated data plane. The new model allows us to offload to accelerators not only the individual subroutines but whole software subsystems, such as networking or storage, with cloud-like security isolation and architectural compartmentalization.

One of the most prominent examples of this new architecture is the data processing unit (DPU). DPUs offer a complete compute system with an independent software stack, network identity, and provisioning capabilities. The DPU can host its own applications using either embedded or orchestrated deployment models.

The unique capabilities of the DPU allow for key infrastructure functions and their associated software stacks to be completely removed from the host node’s CPU cores and to be relocated onto the DPU. For instance, DPU could host the management plane of the network functions and part of the control plane, while the data plane could be accelerated by dedicated Arm cores, ASICs, GPUs, or FPGA IPs. Because DPUs can run independent software stacks locally, multiple network functions could run simultaneously on the same devices with service chaining and shared accelerators to provide generic in-line processing.

OVN/OVS offloading on NVIDIA BlueField-2 DPUs

Red Hat has collaborated with NVIDIA to extend the operational simplicity and hybrid cloud architecture of Red Hat OpenShift to NVIDIA BlueField-2 DPUs. Red Hat OpenShift 4.10 provides BlueField-2 OVN/OVS offload as a developer preview.

Installing OpenShift Container Platform on a DPU makes it possible to offload packet processing from the host CPUs to the DPU. Offloading resource-intensive tasks like packet processing from the server’s CPU to the DPU can free up cycles on the OpenShift worker nodes to run more user applications. OpenShift brings portability, scalability, and orchestration to DPU workloads, giving you the ability to use standard Kubernetes APIs along with consistent system management interfaces for both the host systems and DPUs.  

In short, utilizing OpenShift with DPUs lets you get the benefits of DPUs without sacrificing the hybrid cloud experience or adding unnecessary complexity to managing IT infrastructure.

To manage DPUs, Red Hat OpenShift replaces the native BlueField operating system (OS) on each DPU and is deployed using a two-cluster design that consists of:

  • Tenant cluster running on the host servers (x86)
  • Infrastructure cluster running on DPUs (Arm)

The architecture is illustrated in Figure 1.

Figure 1: The two-cluster design.

In this architecture, DPUs are provisioned as worker nodes of the Arm-based OpenShift infrastructure cluster. This is the blue cluster in Figure 1. The tenant OpenShift cluster, composed of the x86 servers, is where user applications typically run. This is the green cluster. In this deployment, each physical server runs both a tenant node on the x86 cores and an infrastructure node on the DPU Arm cores.

This architecture allows you to minimize the attack surface by decoupling the workload from the management cluster.

This architecture also streamlines operations by decoupling the application workload from the underlying infrastructure. That allows IT Ops to deploy and maintain the platform software and accelerated infrastructure while DevOps deploys and maintains application workloads independently from the infrastructure layer.

Red Hat OpenShift 4.10 provides capabilities for offloading Open Virtual Network (OVN) and Open Virtual Switch (OVS) services that typically run on servers, from the host CPU to the DPU. We are offering this functionality as a developer preview and enabling the following components to support OVN/OVS hardware offload to NVIDIA BlueField-2 DPUs:

  • DPU Network Operator: This component is used with the infrastructure cluster to facilitate OVN deployment.
  • DPU mode for OVN Kubernetes: This component is assigned by the cluster network operator for the tenant cluster.
  • SR-IOV network operator: This component discovers compatible network devices, such as the ConnectX-6 Dx NIC embedded inside the BlueField-2 DPU, and provisions them for SR-IOV access by pods on that server
  • ConnectX NIC Fast Data Path on Arm
  • Kernel flow offloading (TC Flower)
  • Experimental use of OpenShift Assisted Installer and BlueField-2 BMC

The combination of these components allows us to move ovn-kube-node services from the x86 host to the BlueField-2 DPU.

The network flows are offloaded in this manner (see Figure 2):

  • The ovn-k8s components are moved from the x86 host to the DPU (ovn-kube, vswitchd, ovsdb).
  • The Open vSwitch data path is offloaded from the BlueField Arm CPU to the ConnectX-6 Dx ASIC.
Figure 2: Illustrating the network flows.

The following Open vSwitch datapath flows managed by ovn-k8s are offloaded to a BlueField-2 DPU that is running OpenShift 4.10:

  • Pod to pod (east-west)
  • Pod to clusterIP service backed by a regular pod in diff node (east-west)
  • Pod to external (north-south)

Let’s take a more detailed look at the testbed setup.

Install and configure an accelerated infrastructure with OpenShift and NVIDIA BlueField-2 DPUs

In our sample testbed shown in Figure 3, we are using two x86 hosts with BlueField-2 DPU PCIe cards installed. Each DPU has eight Arm cores, two 25GB network ports, and a 1GbE management port. We’ve wired 25GB ports to a switch and also connected the BMC port of the DPU to a separate network to manage the device with IPMI.

Figure 3: The sample testbed.

Next, we’ve installed Red Hat OpenShift Container Platform 4.10 on a DPU to offload packet processing from the host x86 to the DPU. Offloading resource-intensive computational tasks such as packet processing from the server’s CPU to the DPU frees up cycles on the OpenShift Container Platform worker nodes to run more applications or to run the same number of applications more quickly.

OpenShift and DPU deployment architecture

In our setup, OpenShift replaces the native BlueField OS. We used the two-cluster architecture where DPU cards are provisioned as worker nodes in the Arm-based infrastructure cluster. The tenant cluster composed of x86 servers was used to run user applications.

We followed these steps to deploy tenant and infrastructure clusters:

  1. Install OpenShift assisted installer on the installer node using Podman.
  2. Install the infrastructure cluster.
  3. Install the tenant cluster.
  4. Install the DPU Network Operator on the infrastructure cluster.
  5. Configuring SR-IOV Operator for DPUs.

For details about how to install the OpenShift 4.10 on BlueField with OVN/OVS hardware offload, refer to the documentation.

When the cluster is deployed on the BlueField, the OVN configuration is automated with the "DPU Network Operator"; this operator can be installed in the OpenShift console or with the oc command.

As shown in Figure 4, the DPU operator is available for the Arm-based OpenShift clusters in the catalog and not visible in x86 OpenShift clusters.

Figure 4: The DPU Network Operator shown in the OperatorHub catalog.

Validate installation using the OpenShift console

When you have completed the last step, you have two OpenShift clusters running. We will make some console checks to show the configuration done and benchmark the OVS/OVN offloading.

Tenant cluster: We can list the nodes of the Tenant cluster nodes:

[egallen@bastion-tenant ~]$ oc get nodes
NAME                     STATUS   ROLES             AGE   VERSION
tenant-master-0          Ready    master            47d   v1.23.3+759c22b
tenant-master-1          Ready    master            47d   v1.23.3+759c22b
tenant-master-2          Ready    master            47d   v1.23.3+759c22b
tenant-worker-0          Ready    worker            47d   v1.23.3+759c22b
tenant-worker-1          Ready    worker            47d   v1.23.3+759c22b
x86-worker-advnetlab13   Ready    dpu-host,worker   35d   v1.23.3+759c22b
x86-worker-advnetlab14   Ready    dpu-host,worker   41d   v1.23.3+759c22b

Infrastructure cluster: We can list the nodes of the Infrastructure cluster, including the BlueField-2 nodes with the machine-config pool name dpu-worker:

[egallen@bastion-infrastructure ~]$ oc get nodes
NAME       STATUS   ROLES               AGE   VERSION
bf-13      Ready    dpu-worker,worker   71d   v1.22.1+6859754
bf-14      Ready    dpu-worker,worker   74d   v1.22.1+6859754
master-0   Ready    master              75d   v1.22.1+6859754
master-1   Ready    master              75d   v1.22.1+6859754
master-2   Ready    master              75d   v1.22.1+6859754
worker-0   Ready    worker              70d   v1.22.1+6859754
worker-1   Ready    worker              70d   v1.22.1+6859754

Tenant cluster: We can list the pods running in the openshift-ovn-kubernetes namespace in the tenant cluster:

[egallen@bastion-tenant ~]$ oc get pods -n openshift-ovn-kubernetes 
NAME                          READY   STATUS    RESTARTS       AGE
ovnkube-master-99x8j          4/6     Running   7              25d
ovnkube-master-qdfvv          6/6     Running   14 (15d ago)   25d
ovnkube-master-w28mh          6/6     Running   7 (15d ago)    25d
ovnkube-node-5xlxr            5/5     Running   5              34d
ovnkube-node-6nkm5            5/5     Running   5              34d
ovnkube-node-dpu-host-45crl   3/3     Running   50             34d
ovnkube-node-dpu-host-r8tlj   3/3     Running   30             28d
ovnkube-node-f2x2q            5/5     Running   0              34d
ovnkube-node-j6w6t            5/5     Running   5              34d
ovnkube-node-qtc6f            5/5     Running   0              34d

[egallen@bastion-tenant ~]$ oc get pods -n openshift-ovn-kubernetes -o wide
NAME                          READY   STATUS    RESTARTS       AGE   IP                NODE                     NOMINATED NODE   READINESS GATES
ovnkube-master-99x8j          4/6     Running   7              25d   192.168.111.122   tenant-master-2          <none>           <none>
ovnkube-master-qdfvv          6/6     Running   14 (15d ago)   25d   192.168.111.121   tenant-master-1          <none>           <none>
ovnkube-master-w28mh          6/6     Running   7 (15d ago)    25d   192.168.111.120   tenant-master-0          <none>           <none>
ovnkube-node-5xlxr            5/5     Running   5              34d   192.168.111.121   tenant-master-1          <none>           <none>
ovnkube-node-6nkm5            5/5     Running   5              34d   192.168.111.122   tenant-master-2          <none>           <none>
ovnkube-node-dpu-host-45crl   3/3     Running   50             34d   192.168.111.113   x86-worker-advnetlab13   <none>           <none>
ovnkube-node-dpu-host-r8tlj   3/3     Running   30             28d   192.168.111.114   x86-worker-advnetlab14   <none>           <none>
ovnkube-node-f2x2q            5/5     Running   0              34d   192.168.111.123   tenant-worker-0          <none>           <none>
ovnkube-node-j6w6t            5/5     Running   5              34d   192.168.111.120   tenant-master-0          <none>           <none>
ovnkube-node-qtc6f            5/5     Running   0              34d   192.168.111.124   tenant-worker-1          <none>           <none>

Infrastructure cluster: The DaemonSet and the nodes have the same dpu-worker label. The ovnkube-node host network pods will be scheduled on the BlueField-2:

[egallen@bastion-infrastructure ~]$ oc get ds -n default --show-labels
NAME           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                     AGE   LABELS
ovnkube-node   2         2         2       2            2           beta.kubernetes.io/os=linux,node-role.kubernetes.io/dpu-worker=   65d   <none>

[egallen@bastion-infrastructure ~]$ oc get nodes --show-labels | grep dpu-worker
bf-13      Ready    dpu-worker,worker   71d   v1.22.1+6859754   beta.kubernetes.io/arch=arm64,beta.kubernetes.io/os=linux,kubernetes.io/arch=arm64,kubernetes.io/hostname=bf-13,kubernetes.io/os=linux,network.operator.openshift.io/dpu=,node-role.kubernetes.io/dpu-worker=,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos
bf-14      Ready    dpu-worker,worker   74d   v1.22.1+6859754   beta.kubernetes.io/arch=arm64,beta.kubernetes.io/os=linux,kubernetes.io/arch=arm64,kubernetes.io/hostname=bf-14,kubernetes.io/os=linux,network.operator.openshift.io/dpu=,node-role.kubernetes.io/dpu-worker=,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos

Infrastructure cluster: We see the ovnkube-node DaemonSet running on the two BlueField-2 infrastructure cluster:

[egallen@bastion-infrastructure ~]$ oc get pods -n default 
NAME                 READY   STATUS    RESTARTS         AGE
ovnkube-node-hshxs   2/2     Running   13 (6d13h ago)   25d
ovnkube-node-mng24   2/2     Running   17 (42h ago)     25d


[egallen@bastion-infrastructure ~]$ oc get pods -n default -o wide
NAME                 READY   STATUS    RESTARTS         AGE   IP               NODE    NOMINATED NODE   READINESS GATES
ovnkube-node-hshxs   2/2     Running   13 (6d13h ago)   25d   192.168.111.28   bf-14   <none>           <none>
ovnkube-node-mng24   2/2     Running   17 (42h ago)     25d   192.168.111.29   bf-13   <none>           <none>

Infrastructure cluster: We get the IP address of the BlueField bf-14 to ssh on it (192.168.111.28):

[egallen@bastion-infrastructure ~]$ oc get nodes -o wide
NAME       STATUS   ROLES               AGE   VERSION           INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                                                        KERNEL-VERSION                       CONTAINER-RUNTIME
bf-13      Ready    dpu-worker,worker   71d   v1.22.1+6859754   192.168.111.29   <none>        Red Hat Enterprise Linux CoreOS 410.84.202201071203-0 (Ootpa)   4.18.0-305.40.1.el8_4.test.aarch64   cri-o://1.23.0-98.rhaos4.10.git9b7f5ae.el8
bf-14      Ready    dpu-worker,worker   74d   v1.22.1+6859754   192.168.111.28   <none>        Red Hat Enterprise Linux CoreOS 410.84.202201071203-0 (Ootpa)   4.18.0-305.40.1.el8_4.test.aarch64   cri-o://1.23.0-98.rhaos4.10.git9b7f5ae.el8
master-0   Ready    master              75d   v1.22.1+6859754   192.168.111.20   <none>        Red Hat Enterprise Linux CoreOS 410.84.202201071203-0 (Ootpa)   4.18.0-305.30.1.el8_4.aarch64        cri-o://1.23.0-98.rhaos4.10.git9b7f5ae.el8
master-1   Ready    master              75d   v1.22.1+6859754   192.168.111.21   <none>        Red Hat Enterprise Linux CoreOS 410.84.202201071203-0 (Ootpa)   4.18.0-305.30.1.el8_4.aarch64        cri-o://1.23.0-98.rhaos4.10.git9b7f5ae.el8
master-2   Ready    master              75d   v1.22.1+6859754   192.168.111.22   <none>        Red Hat Enterprise Linux CoreOS 410.84.202201071203-0 (Ootpa)   4.18.0-305.30.1.el8_4.aarch64        cri-o://1.23.0-98.rhaos4.10.git9b7f5ae.el8
worker-0   Ready    worker              70d   v1.22.1+6859754   192.168.111.23   <none>        Red Hat Enterprise Linux CoreOS 410.84.202201071203-0 (Ootpa)   4.18.0-305.30.1.el8_4.aarch64        cri-o://1.23.0-98.rhaos4.10.git9b7f5ae.el8
worker-1   Ready    worker              70d   v1.22.1+6859754   192.168.111.24   <none>        Red Hat Enterprise Linux CoreOS 410.84.202201071203-0 (Ootpa)   4.18.0-305.30.1.el8_4.aarch64        cri-o://1.23.0-98.rhaos4.10.git9b7f5ae.el8

Infrastructure cluster: We can ssh to one BlueField-2 of the cluster to check the configuration:

[egallen@bastion-infrastructure ~]$ ssh core@192.168.111.28 
Red Hat Enterprise Linux CoreOS 410.84.202201071203-0
  Part of OpenShift 4.10, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.10/architecture/architecture-rhcos.html

---
Last login: Wed Mar 30 07:50:50 2022 from 192.168.111.1
[core@bf-14 ~]$ 

On one BlueField: We are running OpenShift 4.10 on the Arm cores:

[core@bf-14 ~]$ cat /etc/redhat-release 
Red Hat Enterprise Linux CoreOS release 4.10

[core@bf-14 ~]$ uname -m
aarch64

On one BlueField: The BlueField-2 has 8 x Armv8 A72 cores (64-bit):

[core@bf-13 ~]$ lscpu
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  1
Core(s) per socket:  8
Socket(s):           1
NUMA node(s):        1
Vendor ID:           ARM
Model:               0
Model name:          Cortex-A72
Stepping:            r1p0
BogoMIPS:            400.00
L1d cache:           32K
L1i cache:           48K
L2 cache:            1024K
L3 cache:            6144K
NUMA node0 CPU(s):   0-7
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid

On one BlueField: dmidecode is a tool for dumping a computer's DMI (also called SMBIOS). The tool is exporting a table of contents. Running on the Jetson gives you an error: “No SMBIOS nor DMI entry point found, sorry.” The command is working on BlueField-2; you can get SMBIOS data from sysfs:

[core@bf-14 ~]$ sudo dmidecode | grep -A12 "BIOS Information"
BIOS Information
	Vendor: https://www.mellanox.com
	Version: BlueField:3.7.0-20-g98daf29
	Release Date: Jun 26 2021
	ROM Size: 4 MB
	Characteristics:
		PCI is supported
		BIOS is upgradeable
		Selectable boot is supported
		Serial services are supported (int 14h)
		ACPI is supported
		UEFI is supported
	BIOS Revision: 3.0

On one BlueField: We can get the SOC type with the lshw command:

[core@bf-14 ~]$ lshw -C system
bf-14                       
    description: Computer
    product: BlueField SoC (Unspecified System SKU)
    vendor: https://www.mellanox.com
    version: 1.0.0
    serial: Unspecified System Serial Number
    width: 64 bits
    capabilities: smbios-3.1.1 dmi-3.1.1 smp
    configuration: boot=normal family=BlueField sku=Unspecified System SKU uuid=888eecb3-cb1e-40c0-a925-562a7c62ed92
  *-pnp00:00
       product: PnP device PNP0c02
       physical id: 1
       capabilities: pnp
       configuration: driver=system

On one BlueField: We have 16GB of RAM on this BlueField-2:

[core@bf-14 ~]$ free -h
              total        used        free      shared  buff/cache   available
Mem:           15Gi       2.3Gi       6.9Gi       193Mi       6.7Gi        10Gi
Swap:            0B          0B          0B

Infrastructure cluster: I can get the OpenShift console URL (Figure 5) with an oc command:

[egallen@bastion-infrastructure ~]$ oc whoami --show-console
https://console-openshift-console.apps.bf2cluster.dev.metalkube.org

[egallen@bastion-infrastructure ~]$ host console-openshift-console.apps.bf2cluster.dev.metalkube.org
console-openshift-console.apps.bf2cluster.dev.metalkube.org has address 192.168.111.4
Figure 5: Getting information about nodes in the cluster.

Tenant cluster: In the testpod1 we are running the iperf server:

[root@testpod-1 /]# taskset -c 6 iperf3 -s                                      
-----------------------------------------------------------                     
Server listening on 5201                                                        
-----------------------------------------------------------                     
Accepted connection from 10.130.4.132, port 59326                               
[  5] local 10.129.5.163 port 5201 connected to 10.130.4.132 port 59328         
[ ID] Interval           Transfer     Bitrate                                   
[  5]   0.00-1.00   sec  2.35 GBytes  20.2 Gbits/sec                            
[  5]   1.00-2.00   sec  2.46 GBytes  21.2 Gbits/sec                            
[  5]   2.00-3.00   sec  2.42 GBytes  20.8 Gbits/sec                            
[  5]   3.00-4.00   sec  2.24 GBytes  19.2 Gbits/sec                            
[  5]   4.00-5.00   sec  2.39 GBytes  20.5 Gbits/sec                            
[  5]   5.00-5.00   sec   384 KBytes  21.2 Gbits/sec                            
- - - - - - - - - - - - - - - - - - - - - - - - -                               
[ ID] Interval           Transfer     Bitrate                                   
[  5]   0.00-5.00   sec  11.9 GBytes  20.4 Gbits/sec                  receiver
-----------------------------------------------------------                     
Server listening on 5201                                                        
-----------------------------------------------------------
                                                                                

Tenant cluster: In the testpod2, we can get 20-21.2Gbps of throughput with TC Flower OVS hardware offloading instead of 3Gbps of traffic throughput without offloading:

[root@testpod-2 /]# taskset -c 6 iperf3 -c 10.129.5.163 -t 5
Connecting to host 10.129.5.163, port 5201
[  5] local 10.130.4.132 port 59328 connected to 10.129.5.163 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  2.35 GBytes  20.2 Gbits/sec   17   1.14 MBytes
[  5]   1.00-2.00   sec  2.46 GBytes  21.2 Gbits/sec    0   1.41 MBytes
[  5]   2.00-3.00   sec  2.42 GBytes  20.8 Gbits/sec  637   1.01 MBytes
[  5]   3.00-4.00   sec  2.24 GBytes  19.2 Gbits/sec    0   1.80 MBytes
[  5]   4.00-5.00   sec  2.39 GBytes  20.6 Gbits/sec    2   1.80 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-5.00   sec  11.9 GBytes  20.4 Gbits/sec  656             sender
[  5]   0.00-5.00   sec  11.9 GBytes  20.4 Gbits/sec                  receiver

iperf Done.

Conclusion

This example demonstrates the benefits of composable compute architectures that include hardware-level security isolation and capabilities to offload the entire subsystem, such as networking, to the DPU hardware. This enables clean architectural compartmentalization along the same boundaries as the corresponding software services, as well as freeing up working node x83 cores so they can run more applications.

Unlike SmartNICs that have been niche products with proprietary software stacks, DPUs running Linux, Kubernetes, and open source software stacks offer network function portability. And with Red Hat OpenShift, we are offering long-term support for this stack.

As DPUs gain more compute power, additional capabilities, and increased popularity in datacenters around the world, Red Hat and NVIDIA plan to continue work on enabling the offloading of additional software functions to DPUs.

Last updated: May 5, 2022

Comments