
pip install vllm: The iceberg under a single command

April 16, 2026
Percy Mattsson
Related topics:
AI inference
Related products:
Red Hat AI

    Consider the following commands:

    pip install vllm
    vllm serve meta-llama/Llama-3.1-8B-Instruct

    When someone runs this and starts serving a model on an AMD GPU, they expect everything to just work. But there's a real build engineering challenge behind this command that most people never see.

    Think of multiaccelerator AI builds like an iceberg. At the surface, a user runs two commands and starts serving a model. But beneath that simple experience lies layer after layer of build engineering complexity.

    Table 1: Layers of a multiaccelerator AI build.

    Layer          | What you see
    ---------------|-------------
    Surface        | pip install vllm
    Just below     | "Works on AMD GPUs too" / Pre-built wheels on PyPI
    Getting deeper | HIPification of CUDA kernels / ROCm version pinning / Separate torch builds per accelerator
    Dark waters    | AOTriton replacing cuDNN / xFormers custom ROCm compilation / FlashAttention ROCm forks
    The abyss      | Building the entire dependency tree from source / Package plug-in hooks / The version matrix / Why did aiter suddenly become amd-aiter?
    Mariana Trench | xFormers build that silently shipped without HIP kernels and only crashed on MI300X weeks later / git bisect across torch, triton, and aotriton simultaneously

    Red Hat AI supports multiple hardware platforms: NVIDIA GPUs (via CUDA), AMD Instinct GPUs (via ROCm), Intel Gaudi accelerators, Google TPU, IBM Spyre, and CPU-only environments. Each of these requires its own build of the entire AI/ML software stack. This blog post is about what it takes to make that build happen, and goes into detail on what's happening in all those layers described in Table 1.

    The current landscape

    The current ecosystem for open source AI/ML is generally CUDA-first. Packages like PyTorch, xFormers, FlashAttention, and Triton are written and tested primarily against NVIDIA hardware. CUDA support is mature and well-established as a result, but other accelerators require additional configuration, separate build paths, and careful version management.

    For customers who want hardware choice, this creates a gap—and that's where our team comes in.

    The challenge

    Deploying on a different accelerator involves much more than just recompiling for a different target. Each accelerator has its own compiler toolchain, runtime libraries, and kernel implementations. Before it can execute on ROCm, CUDA code must be translated through the HIP API. For Gaudi, there's Habana's SynapseAI SDK. For TPU, there's XLA. These aren't just different flags passed to the same compiler; they're fundamentally different software stacks.

    Coupled dependency tree

    A single package like vLLM has a lot of dependencies. For example, it depends on PyTorch, which depends on Triton, which on ROCm depends on AOTriton (ahead-of-time compiled Triton kernels that replace NVIDIA's cuDNN and CUTLASS). Each has its own accelerator-specific build requirements, and often their versions must be precisely aligned.

    Consider ROCm 6.4 with PyTorch 2.9.1, for example. This pairing alone has its own set of 25+ packages that must all be built from source, and must be version-compatible and ABI-compatible with each other.
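    The coupling above also dictates build order: a package cannot be built until everything it links against exists. As a sketch (an illustrative subset of the stack, not the full 25+-package set), the chain can be modeled as a graph and topologically sorted:

```python
# Toy model of the coupled dependency tree: each package maps to the
# packages that must be built before it. Subset for illustration only.
from graphlib import TopologicalSorter

deps = {
    "vllm": {"torch", "xformers", "triton"},
    "xformers": {"torch"},
    "torch": {"triton", "aotriton"},
    "aotriton": {"triton"},  # AOTriton pins a specific Triton commit
    "triton": set(),
}

# static_order() yields each package only after all of its predecessors.
order = list(TopologicalSorter(deps).static_order())
print(order)
assert order.index("triton") < order.index("aotriton") < order.index("torch")
```

    One bad pin anywhere in this graph invalidates every package built after it, which is why the whole tree has to be treated as a unit.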

    The race against upstream

    When you are building a multitude of packages from source, and some of those packages have strict version constraints, any upstream change can cascade through the entire dependency graph. For example:

    • A package might be renamed overnight (aiter becoming amd-aiter, for example).
    • An upstream release might skip an anticipated version, leaving downstream pins pointing at something that never shipped.
    • Compilation flags or environment variables might be deprecated or renamed.

    Deep dive: Building for ROCm

    ROCm is one of the most complex variants to build for, and it's the one I work on daily. Here's what's involved behind the scenes in making it work.

    HIPification

    CUDA kernel code must be translated to HIP to run on AMD GPUs. Some packages handle this upstream, while others require translation at build time. While the translation process is mostly about renaming things, subtle differences between the CUDA and HIP APIs can surface bugs, especially in performance-critical attention kernels.
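    At its core, the renaming step looks like a large symbol-substitution table. Here is a minimal sketch of the idea; the real hipify tooling (such as the scripts shipped with PyTorch) covers thousands of symbols plus header rewrites, and this five-entry table is illustrative only:

```python
import re

# Illustrative subset of the CUDA-to-HIP rename table.
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaStream_t": "hipStream_t",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

# Word boundaries keep us from rewriting substrings of longer identifiers.
_pattern = re.compile(r"\b(" + "|".join(CUDA_TO_HIP) + r")\b")

def hipify(source: str) -> str:
    """Translate known CUDA API names in `source` to their HIP equivalents."""
    return _pattern.sub(lambda m: CUDA_TO_HIP[m.group(1)], source)

print(hipify("cudaMalloc(&buf, n); cudaDeviceSynchronize();"))
# hipMalloc(&buf, n); hipDeviceSynchronize();
```

    The mechanical renames are the easy part; the hard part is the cases where a HIP function's semantics differ subtly from its CUDA namesake, which no substitution table can fix.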

    AOTriton: The ROCm attention saga

    On NVIDIA hardware, attention operations use cuDNN or CUTLASS. On ROCm, we use AOTriton, which is essentially a set of precompiled Triton kernels shipped as a library. AOTriton pins a specific Triton commit as a submodule. This creates a tight version coupling. If the Triton version that PyTorch wants doesn't match the version that AOTriton was built against, things can break in non-obvious ways.

    Package-specific build hooks

    Many packages need custom build logic for ROCm. We handle this through a plug-in system that can override how each package resolves its source, prepares for building, sets environment variables, and executes the actual build. For example, xFormers requires ROCm-specific compilation flags (such as setting PYTORCH_ROCM_ARCH="gfx942"), and PyTorch itself needs AMD-specific build steps injected into its build process.

    We once inadvertently shipped an xFormers build that silently skipped HIP compilation entirely. The build succeeded and the package installed fine, but no ROCm kernels were compiled. It wasn't until vLLM hit a specific attention pattern on MI300X hardware that it crashed with a somewhat cryptic HIP error. The root cause was a missing environment variable that the build container didn't set.
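    To make the hook idea concrete, here is a minimal sketch of a per-package registry of build-environment overrides. This is not Fromager's actual plug-in API, just the shape of the mechanism; the PYTORCH_ROCM_ARCH value is taken from the example above:

```python
import os
from typing import Callable

# Hypothetical registry mapping package names to build-environment hooks.
BUILD_HOOKS: dict[str, Callable[[dict], dict]] = {}

def build_hook(package: str):
    """Decorator registering a hook that mutates the build env for one package."""
    def register(fn):
        BUILD_HOOKS[package] = fn
        return fn
    return register

@build_hook("xformers")
def xformers_rocm_env(env: dict) -> dict:
    # Without this flag, HIP kernel compilation is silently skipped:
    # the wheel builds and installs, but ships no ROCm kernels.
    env["PYTORCH_ROCM_ARCH"] = "gfx942"
    return env

def build_env_for(package: str) -> dict:
    """Start from the ambient environment and apply the package's hook, if any."""
    env = dict(os.environ)
    hook = BUILD_HOOKS.get(package)
    return hook(env) if hook else env

print(build_env_for("xformers")["PYTORCH_ROCM_ARCH"])  # gfx942
```

    Centralizing overrides like this is what makes the MI300X incident above debuggable: there is exactly one place where a package's required build environment is supposed to be declared.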

    The version matrix

    Every ROCm release potentially changes which PyTorch version is compatible, which Triton commit is needed, and which packages need a rebuild. Managing this matrix and knowing when a new ROCm version requires a new build stack rather than updating an existing one is an ongoing challenge.
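    A tiny slice of that matrix, encoded as data, looks something like the sketch below. The ROCm 6.4/PyTorch 2.9.1 and ROCm 7.1/PyTorch 2.10.0 pairings come from this post; everything else about the structure is illustrative:

```python
# Hypothetical slice of the validated compatibility matrix.
COMPAT = {
    "rocm-6.4": {"torch": "2.9.1"},
    "rocm-7.1": {"torch": "2.10.0", "triton": "3.6.0"},
}

def stack_for(rocm: str) -> dict:
    """Return the validated pins for a ROCm release, or fail loudly."""
    try:
        return COMPAT[rocm]
    except KeyError:
        raise ValueError(f"no validated stack for {rocm}; a new build stack is needed")

print(stack_for("rocm-7.1"))  # {'torch': '2.10.0', 'triton': '3.6.0'}
```

    Failing loudly on an unknown ROCm release is the point: it forces the "new build stack or update the existing one?" decision to be made explicitly rather than discovered at runtime.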

    How we solve it

    We address these build challenges by maintaining a specialized pipeline that provides granular control over the software stack and ensures consistency across hardware platforms.

    Building everything from source

    We use Fromager, an open source tool for rebuilding complete dependency trees of Python wheels from source. This gives us:

    • Reproducibility: Every build is deterministic and auditable.
    • License compliance: We know exactly what code goes into every wheel.
    • Security: There's a full software bill of materials (SBOM) for every package.
    • ABI compatibility: All packages in a stack are built against the same libraries.

    Variant-aware build infrastructure

    Our build system is designed around the idea that every accelerator is different. Each one gets its own build environment, its own dependency set, and its own version constraints.

    Constraint solving at scale

    When we pin PyTorch to 2.10.0 for ROCm 7.1, we have to manually ensure that every related package (including torchvision, torchaudio, xformers, triton, aotriton, and others) is pinned to a compatible version. We track all of this in a version constraint file, which is essentially a curated list of which versions of which packages are known to work together. Get one version wrong and the build fails—or worse, produces wheels that crash at runtime.

    Here's an example of a constraints file:

    # constraints.txt
    aotriton==0.11.2b0
    amd-aiter==0.1.10.post2
    triton==3.6.0
    torch==2.10.0
    torchaudio==2.10.0
    torchvision==0.25.0
    vllm>=0.17.0,<0.19.0
    xformers==0.0.34
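    Verifying a built stack against exact pins like these can be sketched in a few lines; a real resolver (pip, or Fromager's own machinery) handles ranges, markers, and extras, so this only illustrates the `name==version` case:

```python
# Minimal check of an installed package set against exact "==" pins.
PINS_TEXT = """\
aotriton==0.11.2b0
torch==2.10.0
triton==3.6.0
"""

def parse_pins(text: str) -> dict:
    """Parse `name==version` lines, ignoring comments and anything non-exact."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#")[0].strip()
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins

def violations(installed: dict, pins: dict) -> list:
    """Return the pinned packages whose installed version disagrees with the pin."""
    return [name for name, v in pins.items()
            if name in installed and installed[name] != v]

pins = parse_pins(PINS_TEXT)
print(violations({"torch": "2.9.1", "triton": "3.6.0"}, pins))
# ['torch']
```

    In the actual pipeline, the same file is simply handed to the installer via `pip install -c constraints.txt <package>`, which applies the pins without requesting any package itself.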

    Why all this matters

    From the user's perspective, the result is simple: pip install works the same regardless of whether they're using NVIDIA A100s, AMD MI300Xs, or Intel Gaudi 2s. They get the same Python API, the same model support, and optimized performance for the hardware that their code is running on.

    What's next

    The open source ecosystem is gradually moving toward better multiaccelerator support. PyTorch's back-end abstraction is improving, and projects like Triton aim to be a portable GPU programming model. But we are not there yet, and until we are, someone has to make pip install work the same on every GPU. That's what we do.

    Fromager is open source and actively developed. If you're building Python wheels for AI/ML workloads across multiple hardware platforms, check out the project on GitHub. You can also learn more about Red Hat AI.
