Consider the following commands:

```shell
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct
```

When someone runs these and starts serving a model on an AMD GPU, they expect everything to just work. But there's a real build engineering challenge behind these two commands that most people never see.
Think of multiaccelerator AI builds like an iceberg. At the surface, a user runs two commands and starts serving a model. But beneath that simple experience lies layer after layer of build engineering complexity.
| Layer | What you see |
|---|---|
| Surface | pip install vllm |
| Just below | "Works on AMD GPUs too" / Pre-built wheels on PyPI |
| Getting deeper | HIPification of CUDA kernels / ROCm version pinning / Separate torch builds per accelerator |
| Dark waters | AOTriton replacing cuDNN / xFormers custom ROCm compilation / FlashAttention ROCm forks |
| The abyss | Building the entire dependency tree from source / Package plug-in hooks / The version matrix / Why did aiter suddenly become amd-aiter? |
| Mariana Trench | An xFormers build that silently shipped without HIP kernels and only crashed on MI300X weeks later / git bisect across torch, triton, and aotriton simultaneously |

Table 1: The layers of a multiaccelerator AI build.
Red Hat AI supports multiple hardware platforms: NVIDIA GPUs (via CUDA), AMD Instinct GPUs (via ROCm), Intel Gaudi accelerators, Google TPU, IBM Spyre, and CPU-only environments. Each of these requires its own build of the entire AI/ML software stack. This blog post is about what it takes to make that build happen, and goes into detail on what's happening in all those layers described in Table 1.
The current landscape
The current ecosystem for open source AI/ML is generally CUDA-first. Packages like PyTorch, xFormers, FlashAttention, and Triton are written and tested primarily against NVIDIA hardware. CUDA support is mature and well-established as a result, but other accelerators require additional configuration, separate build paths, and careful version management.
For customers who want hardware choice, this creates a gap—and that's where our team comes in.
The challenge
Deploying on a different accelerator involves much more than just recompiling for a different target. Each accelerator has its own compiler toolchain, runtime libraries, and kernel implementations. Before it can execute on ROCm, CUDA code must be translated through the HIP API. For Gaudi, there's Habana's SynapseAI SDK. For TPU, there's XLA. These aren't just different flags passed to the same compiler; they're fundamentally different software stacks.
Coupled dependency tree
A single package like vLLM has a lot of dependencies. For example, it depends on PyTorch, which depends on Triton, which on ROCm depends on AOTriton (ahead-of-time compiled Triton kernels that replace NVIDIA's cuDNN and CUTLASS). Each has its own accelerator-specific build requirements, and often their versions must be precisely aligned.
Consider ROCm 6.4 with PyTorch 2.9.1, for example. This pairing alone has its own set of 25+ packages that must all be built from source, and must be version-compatible and ABI-compatible with each other.
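To make the coupling concrete, here is a minimal sketch of the kind of known-good lookup this implies. The structure is hypothetical and the pinned versions are illustrative placeholders, not an official support statement:

```python
# Hypothetical compatibility matrix: which package versions are known to
# work together for a given ROCm + PyTorch pairing. The version numbers
# here are illustrative only.
KNOWN_GOOD = {
    ("rocm-6.4", "torch-2.9.1"): {
        "triton": "3.5.0",
        "aotriton": "0.11.0",
        "xformers": "0.0.33",
    },
}


def resolve_stack(rocm: str, torch: str) -> dict[str, str]:
    """Return the pinned versions for a ROCm/PyTorch pairing, or raise."""
    try:
        return KNOWN_GOOD[(rocm, torch)]
    except KeyError:
        raise ValueError(f"No known-good stack for {rocm} + {torch}") from None
```

The important property is that the pairing is the key: you never resolve Triton or AOTriton independently, only as part of a whole stack.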
The race against upstream
When you are building a multitude of packages from source, and some of those packages have strict version constraints, any upstream change can cascade through the entire dependency graph. For example:
- A package might rename itself overnight.
- An upstream release might unexpectedly skip an anticipated package upgrade.
- Compilation flags or environment variables might be deprecated or renamed.
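The rename case is the most disruptive, because every existing pin silently stops resolving. A tooling sketch (hypothetical, not our actual resolver) of how a rename like aiter becoming amd-aiter might be absorbed:

```python
# Hypothetical rename map: when a project changes its distribution name
# on PyPI (e.g. aiter becoming amd-aiter), old pins must be translated
# before any lookup. Names are also normalized the way pip does
# (case-insensitive, underscores treated as hyphens).
RENAMED = {"aiter": "amd-aiter"}


def canonical_name(requirement_name: str) -> str:
    """Map a possibly-outdated distribution name to its current one."""
    name = requirement_name.lower().replace("_", "-")
    return RENAMED.get(name, name)
```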
Deep dive: Building for ROCm
ROCm is one of the most complex variants to build for, and it's the one I work on daily. Here's what's involved behind the scenes in making it work.
HIPification
CUDA kernel code must be translated to HIP to run on AMD GPUs. Some packages handle this upstream, while others require translation at build time. While the translation process is mostly about renaming things, subtle differences between the CUDA and HIP APIs can surface bugs, especially in performance-critical attention kernels.
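As a toy illustration of what "mostly about renaming things" means, the core of the translation is a systematic symbol substitution. Real tools (hipify-perl, hipify-clang) handle vastly more cases and the semantic differences, but the basic mechanism looks like this:

```python
import re

# A toy hipify: rewrite CUDA API symbols to their HIP equivalents.
# These four mappings are real; production tools cover thousands more
# plus the cases where a pure rename is not enough.
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaStream_t": "hipStream_t",
}

_PATTERN = re.compile("|".join(re.escape(k) for k in CUDA_TO_HIP))


def hipify(source: str) -> str:
    """Rewrite known CUDA API names in a source string to HIP names."""
    return _PATTERN.sub(lambda m: CUDA_TO_HIP[m.group(0)], source)
```

The hard bugs come precisely from the places where this one-to-one substitution model breaks down.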
AOTriton: The ROCm attention saga
On NVIDIA hardware, attention operations use cuDNN or CUTLASS. On ROCm, we use AOTriton, which is essentially a set of precompiled Triton kernels shipped as a library. AOTriton pins a specific Triton commit as a submodule. This creates a tight version coupling. If the Triton version that PyTorch wants doesn't match the version that AOTriton was built against, things can break in non-obvious ways.
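One way to defuse this coupling is to verify it before building anything. A hypothetical early check (reading the actual submodule pins is omitted; the commit hashes are simply inputs here):

```python
# Hypothetical guard for the Triton coupling described above: compare the
# Triton commit AOTriton was built against with the commit PyTorch pins,
# and fail the build up front instead of breaking in non-obvious ways later.
def check_triton_coupling(aotriton_pin: str, pytorch_pin: str) -> None:
    if aotriton_pin != pytorch_pin:
        raise RuntimeError(
            f"Triton pin mismatch: AOTriton built against {aotriton_pin[:12]}, "
            f"but PyTorch pins {pytorch_pin[:12]}"
        )
```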
Package-specific build hooks
Many packages need custom build logic for ROCm. We handle this through a plug-in system that can override how each package resolves its source, prepares for building, sets environment variables, and executes the actual build. For example, xFormers requires ROCm-specific compilation flags (such as setting PYTORCH_ROCM_ARCH="gfx942"), and PyTorch itself needs AMD-specific build steps injected into its build process.

We once inadvertently shipped an xFormers build that silently skipped HIP compilation entirely. The build succeeded and the package installed fine, but no ROCm kernels were compiled. It wasn't until vLLM hit a specific attention pattern on MI300X hardware that it crashed with a somewhat cryptic HIP error. The root cause was a missing environment variable that the build container didn't set.
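The lesson from that incident is that a build hook should fail loudly when a required variable is absent. A sketch of such a hook, loosely modeled on the plug-in idea above (this is not Fromager's actual API; the class and method names are made up for illustration):

```python
# Sketch of a per-package ROCm build hook (hypothetical interface, not
# Fromager's real API). It injects the ROCm-specific environment and
# refuses to proceed when a required variable is missing, rather than
# silently skipping HIP compilation.
class XformersRocmHook:
    REQUIRED_ENV = {"PYTORCH_ROCM_ARCH": "gfx942"}

    def build_environment(self, base_env: dict[str, str]) -> dict[str, str]:
        """Return a copy of the environment with ROCm settings injected."""
        env = dict(base_env)
        env.update(self.REQUIRED_ENV)
        return env

    def verify(self, env: dict[str, str]) -> None:
        """Fail the build loudly if any required variable is unset."""
        missing = [key for key in self.REQUIRED_ENV if not env.get(key)]
        if missing:
            raise RuntimeError(f"Refusing to build xFormers without: {missing}")
```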
The version matrix
Every ROCm release potentially changes which PyTorch version is compatible, which Triton commit is needed, and which packages need a rebuild. Managing this matrix and knowing when a new ROCm version requires a new build stack rather than updating an existing one is an ongoing challenge.
How we solve it
We address these build challenges by maintaining a specialized pipeline that provides granular control over the software stack and ensures consistency across hardware platforms.
Building everything from source
We use Fromager, an open source tool for rebuilding complete dependency trees of Python wheels from source. This gives us:
- Reproducibility: Every build is deterministic and auditable.
- License compliance: We know exactly what code goes into every wheel.
- Security: There's a full software bill of materials (SBOM) for every package.
- ABI compatibility: All packages in a stack are built against the same libraries.
Variant-aware build infrastructure
Our build system is designed around the idea that every accelerator is different. Each one gets its own build environment, its own dependency set, and its own version constraints.
Constraint solving at scale
When we pin PyTorch to 2.10.0 for ROCm 7.1, we have to manually ensure that every related package (including torchvision, torchaudio, xformers, triton, aotriton, and others) is pinned to a compatible version. We track all of this in a version constraint file, which is essentially a curated list of which versions of which packages are known to work together. Get one version wrong and the build fails—or worse, produces wheels that crash at runtime.
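A minimal sketch of checking installed versions against such a constraints file follows. Only exact (==) pins are handled here; real resolvers like pip and Fromager do far more:

```python
# Minimal constraints checker (illustrative only): parse exact pins from
# a constraints file and report any installed package whose version
# disagrees. Range constraints (>=, <) are skipped for brevity.
def parse_constraints(text: str) -> dict[str, str]:
    """Extract name -> version for every exact pin in a constraints file."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip().lower()] = version.strip()
    return pins


def find_conflicts(pins: dict[str, str], installed: dict[str, str]) -> list[str]:
    """Report every installed package that violates an exact pin."""
    return [
        f"{name}: pinned {want}, installed {installed[name]}"
        for name, want in pins.items()
        if name in installed and installed[name] != want
    ]
```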
Here's an example of a constraints file:

```
# constraints.txt
aotriton==0.11.2b0
amd-aiter==0.1.10.post2
triton==3.6.0
torch==2.10.0
torchaudio==2.10.0
torchvision==0.25.0
vllm>=0.17.0,<0.19.0
xformers==0.0.34
```

Why all this matters
From the user's perspective, the result is simple: pip install works the same regardless of whether they're using NVIDIA A100s, AMD MI300Xs, or Intel Gaudi 2s. They get the same Python API, the same model support, and optimized performance for the hardware that their code is running on.
What's next
The open source ecosystem is gradually moving toward better multiaccelerator support. PyTorch's back-end abstraction is improving, and projects like Triton aim to be a portable GPU programming model. But we are not there yet, and until we are, someone has to make pip install work the same on every GPU. That's what we do.
Fromager is open source and actively developed. If you're building Python wheels for AI/ML workloads across multiple hardware platforms, check out the project on GitHub. You can also learn more about Red Hat AI.