Deploying Red Hat OpenShift hosted clusters on bare metal (often referred to as Hypershift) is a game-changer for infrastructure efficiency. By decoupling control planes from worker nodes, you can dramatically reduce hardware costs and dedicate more physical resources to running the virtual machines that matter.
As with any powerful technology, the official documentation provides the “what,” but the “how” and “why” are often discovered in the field. What follows is a collection of practical knowledge: the non-obvious truths and critical procedures you only learn through hands-on deployment, troubleshooting, and long hours at the command line. This is my guide to the five most surprising and impactful lessons learned from the field; the things I wish someone had told me before I started.
1. The network is your foundation
While any OpenShift installation has its complexities, I've learned that the most delicate, high-impact, and critical component of a hosting cluster deployment is the network setup. If you get this wrong, nothing else matters.
A common scenario we encounter is migrating from VMware environments where the underlying physical switches are not configured for link aggregation protocols like LACP. This is because VMware has its own sophisticated load balancing that doesn't require it. Replicating this behavior in OpenShift without forcing a reconfiguration of the client's physical switches requires a very specific network configuration.
The solution is to use a balance-xor bond combined with a specific transmit hash policy. This policy analyzes L2 information (the source MAC address and the VLAN tag) to distribute traffic across interfaces, effectively mimicking VMware's default behavior without needing any switch-side changes.
This advanced configuration isn't straightforward to apply, and for virtualization workloads it's non-negotiable that you also set your MTU to 9000 for performance. By far the most efficient and reliable way to implement it is through the agent-based installation method, which allows you to inject these precise network settings into the discovery ISO, ensuring the nodes come online correctly from the very beginning.
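To make that concrete, here is a minimal sketch of what such a bond could look like in the networkConfig (NMState) section of a host entry in agent-config.yaml. The hostname, interface names, MAC addresses, VLAN ID, and the exact hash policy value are assumptions for illustration; adapt them to your hardware and verify the option names supported by your kernel and nmstate version.

```yaml
# Excerpt from agent-config.yaml -- a minimal sketch, not a complete file.
# Hostname, interface names, MACs, and VLAN ID are placeholders.
hosts:
  - hostname: worker-0
    interfaces:
      - name: eno1
        macAddress: 00:11:22:33:44:55
      - name: eno2
        macAddress: 00:11:22:33:44:56
    networkConfig:
      interfaces:
        - name: bond0
          type: bond
          state: up
          mtu: 9000                      # jumbo frames for virtualization traffic
          link-aggregation:
            mode: balance-xor            # no LACP required on the switch side
            options:
              # Hashes on source MAC address + VLAN tag, mimicking VMware's
              # default load balancing (verify the exact value for your kernel).
              xmit_hash_policy: vlan+srcmac
            port:
              - eno1
              - eno2
        - name: bond0.100                # example VLAN on top of the bond
          type: vlan
          state: up
          mtu: 9000
          vlan:
            base-iface: bond0
            id: 100
          ipv4:
            enabled: true
            dhcp: true
```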
Getting the network right from the start isn't just a task, it's the foundation for the entire platform's stability and performance. This is where you should focus the majority of your initial attention.
2. The Kube API endpoint is an IP address, not a name
In a standard OpenShift deployment, you interact with the cluster's API via a fully qualified domain name (FQDN), like api.mycluster.example.com. It's intuitive and standard practice. With a hosted cluster, this expectation can lead to a common roadblock.
One of the first issues you'll encounter is that the API server for a hosted cluster is exposed via a raw IP address, not an FQDN. This is the expected and designed behavior. A common pitfall is trying to force the use of an FQDN, which inevitably leads to certificate validation errors when you try to access the cluster.
If you try to reach the API using the cluster's domain name instead, the connection will fail certificate validation with an x509 error, and you simply won't be able to use it.
While it's technically possible to work around this, it requires a significant amount of extra effort, such as setting up a custom Certificate Authority and managing a complex chain of custom certificates. In almost every case, it's not worth the trouble. This isn't just an inconvenience, it's a fundamental shift in how you must think about cluster identity and access, especially when designing automation that needs to be portable between standard and hosted environments. My advice is simple: accept it, configure your tools to use the IP address, and move on. For now, that's how it is.
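In practice, that just means your kubeconfig (and any automation built on top of it) points straight at the IP. A minimal sketch, with illustrative names, address, and port (the port may differ depending on how the API is exposed in your environment):

```yaml
# Sketch of a kubeconfig for a hosted cluster; names, address, and
# credential placeholders are illustrative only.
apiVersion: v1
kind: Config
clusters:
  - name: my-hosted-cluster
    cluster:
      # Raw IP and port of the hosted cluster's API server -- no FQDN.
      server: https://192.0.2.10:6443
      certificate-authority-data: <base64-encoded CA bundle>
contexts:
  - name: admin@my-hosted-cluster
    context:
      cluster: my-hosted-cluster
      user: admin
current-context: admin@my-hosted-cluster
users:
  - name: admin
    user:
      client-certificate-data: <base64-encoded client certificate>
      client-key-data: <base64-encoded client key>
```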
3. It's not “batteries included”
In a typical OpenShift installation, Ingress just works. The necessary load balancing services are provisioned automatically. When deploying a hosted cluster on bare metal, this is not the case. It's a manual, multi-step process that is important to anticipate.
Here's what happens. After the agent nodes for your hosted cluster are provisioned and ready, the deployment will appear to stall. You'll see persistent errors for the console and ingress operators in the operator list. This isn't a failure. It's your cue to intervene.
The solution is to manually install and configure MetalLB (not on the hosting cluster where you've been working, but directly on the new, partially-deployed hosted cluster). This involves defining its entire operational context through a series of YAML manifests, sketched after this list:
- Creating a namespace for MetalLB.
- Creating an OperatorGroup.
- Creating a subscription to install the operator. This is a critical step where you must copy the startingCSV version from the MetalLB instance on your hosting cluster to ensure version alignment (version mismatches here are a classic problem).
- Creating the MetalLB instance.
- Defining an IPAddressPool with the IP address for your Ingress.
- Creating an L2Advertisement.
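A minimal sketch of those manifests, applied against the hosted cluster (not the hosting cluster), might look like the following. Names, channel, addresses, and the startingCSV value are illustrative; in particular, copy the exact startingCSV from the hosting cluster rather than the placeholder shown here.

```yaml
# Applied with the hosted cluster's kubeconfig. Illustrative values only.
apiVersion: v1
kind: Namespace
metadata:
  name: metallb-system
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: metallb-operator
  namespace: metallb-system
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: metallb-operator
  namespace: metallb-system
spec:
  channel: stable
  name: metallb-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  # Copy this value from the MetalLB subscription on the hosting cluster
  # to keep the operator versions aligned (placeholder shown).
  startingCSV: metallb-operator.v4.14.0
---
# Once the operator is installed, create the MetalLB instance and the pool.
apiVersion: metallb.io/v1beta1
kind: MetalLB
metadata:
  name: metallb
  namespace: metallb-system
---
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: ingress-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.0.2.50/32   # the Ingress IP for the hosted cluster (illustrative)
```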
The final step (configuring the L2Advertisement) is particularly delicate. It requires you to reference the br-ex bridge interface, a core component that you should never modify or reference under any other circumstances. An error here can take your entire cluster offline.
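For that last manifest, a sketch of what the L2Advertisement could look like follows; the pool name matches the one defined above, and you should double-check the interface name on your own nodes before applying anything that touches it.

```yaml
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: ingress-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - ingress-pool
  interfaces:
    # br-ex is the OVN-Kubernetes external bridge; reference it here,
    # but never modify it directly.
    - br-ex
```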
This manual step reveals a core design principle of bare metal Hypershift: the platform provides the control plane, but you own the network integration entirely. Understanding this division of responsibility is the key to successfully managing the platform long term.
4. The installation is not broken
The user experience during a hosted cluster creation has a specific flow that can be surprising at first. You'll watch as many components in the installation UI turn green, showing progress. Then, suddenly, everything will stop. The progress bar won't advance, and you'll see errors related to the console and Ingress.
Your first instinct will be to assume the installation has failed. It hasn't. The process is simply paused, waiting for you to perform the manual MetalLB setup on the hosted cluster, as described in the previous point. The system is waiting for you to provide the Ingress layer it can't create for itself.
Another important detail is a status check in the UI labeled external-dns-reachable. This check is most relevant for public cloud deployments, so for on-premise, bare metal deployments it's safe to disregard an indeterminate status here. Knowing to ignore it separates experienced engineers from those who will lose hours chasing a non-existent problem.
This phase of the process, where the official steps end and things start to get complicated, is a perfect example of where field knowledge bridges the gap and guides you on the next move.
Understanding the rhythm of the installation is key. You have to learn when the system has truly failed versus when it is simply waiting for your next move. This knowledge separates a smooth deployment from a challenging one.
5. The installer deletes its blueprints
This last point is a simple but critical piece of practical advice that can save you from a world of pain. The agent-based installer has a peculiar but important behavior: after it successfully generates the discovery ISO, it deletes the configuration files you used to create it.
The problem this creates is obvious. If the installation fails at a later stage, or if you simply need to make a small change and regenerate the ISO, your original configuration files are gone. You're forced to recreate them from scratch, which, given their complexity, is both time-consuming and prone to error.
The solution is incredibly simple, as captured in this piece of advice from the field.
Save your files with a .bkp extension, both agent-config and install-config. Why? Because these files can be deleted during the ISO generation process.
Before you run the command to generate the ISO, just make a backup of your configuration files. It's the small, practical tips like this one that are often the most valuable. They represent the distilled experience of engineers who have already navigated this process, and they can save you hours of rework.
Final thoughts
Successfully deploying OpenShift hosted clusters on bare metal is as much about navigating these unwritten rules and operational problems as it is about following the official steps. The technology offers unparalleled efficiency for managing OpenShift at scale, but it rewards those who are willing to dive deep into its practical realities.
The lessons learned from the field passed between engineers are what transform a complex process into a repeatable and reliable one. What piece of knowledge has been a lifesaver in your most complex infrastructure projects?