With Red Hat Virtualization reaching the end of its maintenance support on August 31, 2024 (with EOL on August 31, 2026, at the EUS expiration date), many customers are currently researching possible alternatives. One Red Hat Virtualization feature that has surely been appreciated by many is its networking model, with the ability to isolate traffic using VLANs by leveraging trunk network interfaces and individual VLAN virtual interfaces created via the Red Hat Virtualization API or web UI. All of this builds on Red Hat Enterprise Linux (RHEL) virtualization technologies (libvirt, KVM) and the Linux networking stack.
In this article, we will analyze what Red Hat sees as the natural Red Hat Virtualization successor. We will walk through how to set up a cluster using hosted control planes for Red Hat OpenShift and install the Red Hat OpenShift Virtualization operator (formerly OpenShift container-native virtualization). From there, we'll dive deep into replicating a trunk network architecture in a stretched L2 domain with Border Gateway Protocol/Ethernet Virtual Private Network (BGP/EVPN), with the final objective of assigning routable IP addresses to the hosted virtual machines (VMs); that is, to replicate the general behavior of a VM running within a data center.
This approach is necessary because, by design, AWS doesn't allow unauthenticated ARP packets or spoofed MAC addresses; every MAC address generated within AWS is authenticated against a list of well-known MAC addresses. Generally, these MAC addresses belong to EC2 instances that have been created via the AWS API and/or web UI. This limitation (or rather, security measure) generally prevents any OpenShift Virtualization virtual machine from having an IP address that is routable to the customer's internal network, and forces the use of pod networking. While the architecture just described isn't necessary for container-ready workloads that use OpenShift ingress to access the cluster, it is still relevant for workloads that haven't yet been modernized.
Management cluster provisioning
Provisioning the management cluster can be handled in two different ways: via Red Hat Advanced Cluster Management for Kubernetes (RHACM) or manually using openshift-installer, with RHACM being the preferred way. We won't be focusing on how the management hub installation is actually performed, as there's plenty of documentation available.
Once the management hub has been deployed, the next step is to install the multicluster engine for Kubernetes (MCE) operator through the Operator Hub. The MCE operator comes with the HyperShift operator already bundled and ready for use. Make sure the OpenShift Container Platform and MCE releases are 4.16 and 2.6 respectively, as that's a requirement for self-managed hosted control planes, which became generally available in those versions.
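If you prefer to drive the MCE installation declaratively (for example, through Argo CD), the following is a minimal sketch of the OLM objects involved. The channel name and the empty MultiClusterEngine spec are assumptions on our side; check them against the official documentation for your target release.

```yaml
# Minimal, illustrative MCE installation via OLM; verify channel and API
# versions against the official documentation before applying.
apiVersion: v1
kind: Namespace
metadata:
  name: multicluster-engine
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: multicluster-engine
  namespace: multicluster-engine
spec:
  targetNamespaces:
    - multicluster-engine
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: multicluster-engine
  namespace: multicluster-engine
spec:
  channel: stable-2.6            # assumed channel for MCE 2.6
  name: multicluster-engine
  source: redhat-operators
  sourceNamespace: openshift-marketplace
---
apiVersion: multicluster.openshift.io/v1
kind: MultiClusterEngine
metadata:
  name: multiclusterengine
spec: {}
```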
Provision a hosted cluster (with a BYO VPC model)
This procedure comes with the following assumptions:
- You have an AWS account.
- You have connectivity back to your data center via either AWS Direct Connect, AWS Site-to-Site VPN, or AWS Transit Gateway (which is in turn configured to route traffic back to your data center/remote site).
- You have a private (routable within your data center/remote site) subnet range, which in our examples will be 172.16.0.0/12.
The only currently supported way to create a hosted control plane is from within the MCE web UI. The process is currently manual and involves downloading the hosted control plane binary to accomplish some of the steps:
- Create the following AWS resources:
  - 3 private subnets (one for each AWS AZ).
  - 3 public subnets (one for each AWS AZ).
  - 1 routing table per subnet (public subnets with 0.0.0.0/0 through the IGW; private subnets with 172.16.0.0/12 via the TGW/Site-to-Site VPN and 0.0.0.0/0 via the NAT GW).
  - 1 IGW.
  - 1 NAT GW.
  - 1 private Route 53 DNS zone.
  - 1 DHCP options set.
  - Make sure private and public subnets are tagged accordingly as per this documentation.
  - 1 inbound DNS resolver associated with the VPC, in case the DNS zone in use is being delegated from an authoritative name server hosted elsewhere.
- Create IAM roles and users via this documentation. Make sure to create the OIDC bucket first as described here; this will allow the use of STS within the hosted control plane cluster.
- Make sure the AWS_REGION and AWS_SHARED_CREDENTIALS_FILE environment variables are specified within the HyperShift operator deployment.
- Generate an SSH key and fetch an OpenShift pull secret that will be used during the hosted control plane installation.
- Based on the IDs of the previously generated resources, modify the hosted-cluster.yaml file from the examples repository and finally run oc create -f hosted-cluster.yaml. As a note, the HostedCluster resource has multiArch set to true, which will allow us to provision node pools with both x86_64 and aarch64 nodes. Please note that aarch64 support in hosted control planes is still a Technology Preview.
- Monitor the hosted control plane cluster creation via oc describe hostedcluster hostedclustername -n namespace, and also check on the hosted control plane namespace (as an example, for a cluster named example-cluster-aws-us-east-1, the hosted control plane will be created under the example-cluster-aws-us-east-1-example-cluster-aws-us-east-1 namespace). A short sketch of this apply-and-monitor flow follows this list.
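For reference, this is roughly what the apply-and-monitor flow looks like from the management cluster, reusing the example-cluster-aws-us-east-1 name from above (adapt the names and namespaces to your own environment):

```bash
# Create the hosted cluster from the customized manifest.
oc create -f hosted-cluster.yaml

# Follow the overall rollout from the management cluster.
oc describe hostedcluster example-cluster-aws-us-east-1 -n example-cluster-aws-us-east-1

# The hosted control plane pods land in the <namespace>-<cluster-name> namespace.
oc get pods -n example-cluster-aws-us-east-1-example-cluster-aws-us-east-1
```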
Day 1 operations
On top of adding the cluster to Argo CD so that your base roles are applied to the hosted control plane, also make sure to:
- Create 3 node pools: one with virtualized EC2 instances for containerized workloads, a metal x86_64 node pool for OpenShift Virtualization guests, and a metal ARM64 node pool for OpenShift Virtualization guests on ARM64. These resources are available in the examples repository (an abbreviated sketch follows this list).
- Install the OpenShift Virtualization operator via the Operator Hub.
- Make sure multi-architecture support is enabled in OpenShift Virtualization via oc annotate --overwrite -n openshift-cnv hco kubevirt-hyperconverged kubevirt.kubevirt.io/jsonpatch='[{"op": "add", "path": "/spec/configuration/developerConfiguration/featureGates/-", "value": "MultiArchitecture" }]'.
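As a reference for the node pool step above, here is an abbreviated sketch of a metal NodePool. The names, replica count, and instance type are placeholders, and fields such as the release image and subnet are omitted; the complete manifests are available in the examples repository.

```yaml
# Abbreviated, illustrative metal NodePool for OpenShift Virtualization guests.
apiVersion: hypershift.openshift.io/v1beta1
kind: NodePool
metadata:
  name: example-cluster-metal-x86
  namespace: example-cluster-aws-us-east-1
spec:
  clusterName: example-cluster-aws-us-east-1
  replicas: 1
  arch: amd64                    # the aarch64 pool would use arm64 here
  management:
    upgradeType: Replace
  platform:
    type: AWS
    aws:
      instanceType: c5n.metal    # a Graviton metal type (e.g., c6g.metal) for aarch64
```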
Day 2 operations: Networking—components
The architecture we discuss in this article has the following moving parts and will be built on top of VXLAN tunnels, BGP, and its EVPN extension:
- 2 bridges per VLAN, one for the L2VNI, one for the L3VNI.
- 1 VRF per VLAN.
- 1 VLAN virtual interface added on top of the machine's main physical network interface.
- The bridge device that contains the L2VNI (br-L2VNI) requires an IP address to be assigned. This IP address is going to be the default gateway for the VMs living on the hypervisor, and it will match across every hypervisor.
- A dhcrelay (a so-called DHCP helper) configured on the switches where the L2 domain terminates in the data center; the helper will relay DHCP traffic for both IPv4 and IPv6 to your DHCP server of choice.
- One NetworkAttachmentDefinition per VLAN; this object is created on a tenant namespace basis (see the sketch after this list).
- NAT to the internet via the AWS NAT GW, re-using the AWS bare metal nodes' primary VRF default route.
- Network physical and virtual interfaces will be configured through NMState objects.
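As an illustration of the NetworkAttachmentDefinition item above, here is a minimal per-VLAN example for a tenant namespace. The tenant-a namespace, the vlan100 name, and the br-vlan100-l2vni bridge name are hypothetical and must match the L2VNI bridge created on the hypervisors.

```yaml
# Minimal, illustrative per-VLAN NetworkAttachmentDefinition in a tenant namespace.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: vlan100
  namespace: tenant-a
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "vlan100",
      "type": "cnv-bridge",
      "bridge": "br-vlan100-l2vni"
    }
```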
Day 2 operations: Networking—architecture
Figure 1 shows an architecture diagram that demonstrates how to extend OpenShift Virtualization connectivity options in AWS.
In order to overcome the aforementioned limitations, we planned the following architecture:
- Presence of route reflectors. They remove the need to interconnect every hypervisor with every other one as a BGP neighbor, which improves scalability.
- A leaf/spine design where the hypervisors act as leaves.
- BGP configuration handled via FRR and the openshift-frr project (originally part of the MetalLB suite); a conceptual sketch of the route reflector configuration follows this list.
- Route leaks between the primary VRF and each VLAN VRF in order to be able to share the primary VRF default route and any return traffic.
- Be mindful that the proposed architecture requires you to have access to both the management hub and the spoke cluster. For cases where two different teams or customers are expected to manage these clusters, the architecture will differ, as the route reflectors will have to be configured within the hosted control plane cluster. The primary reason the route reflectors were configured on top of the management hub is that its control plane nodes are less likely to require rebuilds and other major changes over time, compared to the hosted control plane worker nodes, which are meant to be disposable.
- Every VLAN will have the same default gateway defined on each hypervisor; this prevents VM migrations from breaking connectivity.
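Regardless of whether it is delivered through FRRConfiguration resources or raw FRR configuration, the route reflector side conceptually boils down to an FRR BGP stanza along these lines. The AS number, peer-group name, and listen range are placeholders for illustration only.

```text
router bgp 64512
 ! Hypervisors (leaves) peer dynamically with the route reflectors.
 neighbor LEAVES peer-group
 neighbor LEAVES remote-as 64512
 bgp listen range 10.0.0.0/16 peer-group LEAVES
 !
 address-family l2vpn evpn
  neighbor LEAVES activate
  neighbor LEAVES route-reflector-client
 exit-address-family
```

The data center switches where the L2 domain terminates would be added as additional EVPN neighbors of the route reflectors in the same fashion.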
Here are step-by-step instructions to adopt this design:
- Install the FRR resources via oc create -f frr-k8s.yaml on both the hub and the hosted control plane clusters. The file references a toolbox image that will have to be built and pushed to an image registry (it currently references an image under my personal quay.io namespace).
- Utilizing the evpn-rr* YAML files as an example, customize them in order to provision:
  - Two route reflectors within the management hub control plane nodes.
  - The route reflectors should be paired with the data center switches where the L2 domain will terminate.
- Customize the evpn-* YAML files and apply them on the AWS bare metal x86_64 and aarch64 nodes; every hypervisor will be considered a leaf.
- Apply MachineConfigs to handle NAT for VM internet access through the AWS infrastructure. This step involves setting up specific iptables rules.
- Apply NMState objects to configure physical and virtual network interfaces on the bare metal x86_64 and aarch64 nodes (an abbreviated example follows this list).
- For virtual machines created through OpenShift Virtualization, make sure a DHCP helper is configured where the stretched L2 domains terminate in order for DHCP to assign a default gateway (which should match the IP address defined on each of the hypervisors' L2VNI bridges).
- Create a NetworkAttachmentDefinition attached to a tenant; this will allow an OpenShift Virtualization (CNV) virtual machine to attach its VNET interface to the L2VNI bridge.
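To give an idea of what the NMState step involves, here is an abbreviated NodeNetworkConfigurationPolicy covering just the L2VNI side of a single VLAN. The interface names, VNI, and addresses are placeholders, and the complete manifests (including the L3VNI bridge, the VRF, and the VLAN interface on the physical NIC) live in the examples repository.

```yaml
# Abbreviated, illustrative per-VLAN L2VNI plumbing; values are placeholders.
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: vlan100-l2vni
spec:
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  desiredState:
    interfaces:
      - name: vxlan100100              # L2VNI tunnel interface
        type: vxlan
        state: up
        vxlan:
          id: 100100
          local: 10.0.1.10             # hardcoded local VTEP address (see the pain points below)
          destination-port: 4789
      - name: br-vlan100-l2vni         # L2VNI bridge; same anycast gateway IP on every hypervisor
        type: linux-bridge
        state: up
        ipv4:
          enabled: true
          address:
            - ip: 172.16.100.1
              prefix-length: 24
        bridge:
          port:
            - name: vxlan100100
```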
Design pain points
As you may have noticed from the examples we provided, there are specific details we are currently looking to improve:
- Hardcoded IP addresses within the NMState resources. These are necessary for the VXLAN local tunnel endpoints; remote endpoints aren't needed, as they will be discovered through the EVPN BGP extension.
- Hardcoded IP addresses in the BGP router configuration.
These current limitations prevent an automated cluster scale-up, as every resource (NMState/FRRConfigurations) has to be manually added to Argo CD and directly depends on information that is not available until the AWS node is provisioned.
Having successfully tested this architecture in a lab environment, we are now working on eliminating the remaining rough edges to make the whole solution easier to adopt. You might see a future article that provides an update on this!