From 200 lines to 15: How Helion is rewriting the rules of GPU programming

Helion: Simplifying GPU programming with PyTorch-like syntax

April 24, 2026
Sumantro Mukherjee, Parshant Sharma
Related topics:
Artificial intelligence, Compilers
Related products:
Developer Toolset, Red Hat Enterprise Linux

    The evolution of programming efficient GPU kernels has been a continuous push towards higher levels of abstraction, moving developer focus from hardware management to computational logic. CUDA provides maximum control: developers manually manage every detail, including thread blocks, memory access, synchronization, and index calculations. It's powerful, but also complex. Triton emerged as a new GPU language that simplifies the task with block-based programming, letting developers manage teams of threads rather than individual ones. However, Triton still demands manual effort, such as defining block sizes and calculating program IDs. The latest step is Helion, a Python-embedded domain-specific language that abstracts away low-level parallelism details so developers can write GPU operations in simple, intuitive PyTorch syntax.

    What if writing a GPU kernel felt like writing PyTorch?

    Helion automates almost every part of GPU kernel development. Instead of forcing you to manage low-level details of GPU execution, Helion lets you write code that describes the computation you want. A matrix multiplication (matmul) kernel might take over 200 lines in CUDA or around 80 lines in Triton due to manual indexing, masking, and stride handling. That's reduced to about 15 lines of PyTorch-like code in Helion.

    You write a simple loop like for tile_m, tile_n in hl.tile([m, n]): and use operations like torch.addmm(), while Helion handles indexing, tiling, masks, grid sizing, memory layouts, and all the hardware-level configuration. Helion searches through hundreds or thousands of possible implementations to select the fastest one for the specific hardware and problem size, giving developers performance without complexity.

    import torch
    import helion
    import helion.language as hl

    @helion.kernel()
    def matmul(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        m, k = x.size()
        k2, n = y.size()
        out = torch.empty([m, n], dtype=x.dtype, device=x.device)
        for tile_m, tile_n in hl.tile([m, n]):
            acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)
            for tile_k in hl.tile(k):
                acc = torch.addmm(acc, x[tile_m, tile_k], y[tile_k, tile_n])
            out[tile_m, tile_n] = acc
        return out

    One kernel, 1000 variants, zero manual tuning

    Helion's real advantage comes from its autotuning system. Instead of writing and tweaking GPU kernels manually, you create a single Helion kernel and the compiler automatically generates hundreds or even thousands of Triton variants, each with different choices, including block sizes, loop orders, indexing methods, program ID mappings, warp counts, pipeline depths, unrolling strategies, and cache optimizations. It uses an LFBO-based pattern search for autotuning, while also supporting evolutionary algorithms for completeness. Typical kernels tune within minutes, while more complex kernels may take longer.
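    Helion's actual autotuner is far more sophisticated, but the core idea of a pattern search over a discrete configuration space can be sketched in plain Python. Everything here is illustrative: the cost model and the knob values are made up for the example, and a real search would benchmark each generated Triton variant instead.

    ```python
    # Toy stand-in for timing a kernel variant. In a real autotuner the
    # "cost" would come from actually running the generated code; this
    # hypothetical analytic model favors 64x64 blocks with 8 warps.
    def benchmark(config):
        block_m, block_n, num_warps = config
        return abs(block_m - 64) + abs(block_n - 64) + 10 * abs(num_warps - 8)

    # Discrete search space, loosely mirroring the kinds of knobs the
    # article lists (block sizes, warp counts); values are illustrative.
    space = {
        "block_m": [16, 32, 64, 128],
        "block_n": [16, 32, 64, 128],
        "num_warps": [2, 4, 8, 16],
    }

    def pattern_search(space, start):
        """Greedy neighborhood search: move each knob one step up or down,
        accept any improvement, stop at a local optimum."""
        keys = list(space)
        current = dict(start)
        best = benchmark(tuple(current[k] for k in keys))
        improved = True
        while improved:
            improved = False
            for k in keys:
                idx = space[k].index(current[k])
                for j in (idx - 1, idx + 1):
                    if 0 <= j < len(space[k]):
                        cand = dict(current, **{k: space[k][j]})
                        cost = benchmark(tuple(cand[c] for c in keys))
                        if cost < best:
                            best, current, improved = cost, cand, True
        return current, best

    config, cost = pattern_search(space, {"block_m": 16, "block_n": 16, "num_warps": 2})
    print(config, cost)  # converges to block_m=64, block_n=64, num_warps=8
    ```

    The search never enumerates the full space; it only probes neighbors of the current best, which is why tuning finishes in minutes rather than hours even when the space holds thousands of variants.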

    After the optimal configuration is found, you can lock it in production so that there's no tuning cost at runtime. The result is performance portability. The same kernel adapts automatically to different GPU generations (Ampere, Hopper, Blackwell) without manual changes. This process is illustrated in Figure 1.

    Figure 1: An overview of how Helion processes your code, optimizes for target architecture, and provides a config.

    How Helion works

    When the Helion kernel is called for the first time, it parses your Python function into an abstract syntax tree (AST) and runs type propagation to determine tensor shapes, data types, and how different values depend on each other. It then separates what should run on the host (tensor allocations, shape calculations) and what should run on the GPU, which is identified through hl.tile loops. The GPU portion is captured through PyTorch's FX system and lowered through TorchInductor, which translates operators such as torch.addmm, torch.sum, torch.exp into Triton form. The many steps, most of which you don't manually perform, are shown in Figure 2.
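    The AST-parsing step can be illustrated with Python's standard ast module. This is a toy heuristic, not Helion's actual implementation: it parses a kernel-like function and locates the hl.tile loop that marks where the device region begins.

    ```python
    import ast

    # A kernel-like function as source text; names such as empty and hl
    # are only parsed here, never executed.
    src = '''\
    def matmul(x, y):
        m, k = x.size()                          # host: shape calculation
        out = empty([m, k])                      # host: tensor allocation
        for tile_m, tile_n in hl.tile([m, k]):   # device region starts here
            out[tile_m, tile_n] = 0
        return out
    '''

    import textwrap
    tree = ast.parse(textwrap.dedent(src))

    def is_hl_tile_loop(node):
        """Heuristic: a `for ... in hl.tile(...)` loop marks GPU code."""
        return (
            isinstance(node, ast.For)
            and isinstance(node.iter, ast.Call)
            and isinstance(node.iter.func, ast.Attribute)
            and node.iter.func.attr == "tile"
            and isinstance(node.iter.func.value, ast.Name)
            and node.iter.func.value.id == "hl"
        )

    func = tree.body[0]
    device_lines = [n.lineno for n in ast.walk(func) if is_hl_tile_loop(n)]
    print(device_lines)  # the hl.tile loop sits on line 4 of the source
    ```

    Statements before the matched loop would be treated as host code; the loop body is what would be captured and lowered to the GPU.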

    The compiler builds the full configuration space, and for each option converts the internal representation into Triton code for right indexing, masking, and memory access logic. Triton compiles into GPU machine code, which is cached so that repeated calls with the same tensor signature run instantly.
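    The signature-based caching described above can be sketched in plain Python. Here compile_kernel is a hypothetical stand-in for the Triton compilation step; the point is only that compilation happens once per unique (shape, dtype) signature.

    ```python
    compile_count = 0
    _cache = {}

    def compile_kernel(signature):
        # Stand-in for the slow compile step; returns a "compiled" callable.
        global compile_count
        compile_count += 1
        return lambda xs: [v * 2 for v in xs]

    def run(xs, dtype="float32"):
        sig = (len(xs), dtype)                 # shape + dtype as the cache key
        if sig not in _cache:
            _cache[sig] = compile_kernel(sig)  # slow path: compile and cache
        return _cache[sig](xs)                 # fast path on repeat calls

    run([1, 2, 3])
    run([4, 5, 6])   # same signature: no recompile
    run([1, 2])      # new signature: second compile
    print(compile_count)  # → 2
    ```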

    Figure 2: Steps toward Triton codegen and an optimized config. The config is applied only in the final step of the process.

    Real-world impact: Less code, faster kernels

    Helion provides a boost to both performance and productivity for machine learning (ML) engineers who need custom GPU kernels. The examples in the Helion Git repository show how flexible it is. Simple functions take only 5 to 10 lines, while fused kernels like GEGLU are implemented in 30 lines instead of hundreds. Even complex components like attention mechanisms and layer norms remain concise and easy to maintain.

    Debugging is also straightforward. You can print the generated Triton code with HELION_PRINT_OUTPUT_CODE=1, run kernels in an eager, Python-like mode with HELION_INTERPRET=1, or generate full repro scripts when filing bug reports. Although autotuning takes 10 to 15 minutes per kernel per shape, the time saved writing code is substantial. Helion code is easier to understand and maintain, and the resulting performance often matches or exceeds hand-written, hand-optimized kernels, while automatically adapting across GPU generations.

    Future of GPU programming

    Helion is changing the way we think about writing GPU kernels. Just as high-level languages freed programmers from writing assembly, and frameworks like PyTorch removed the burden of hand-written back-propagation, Helion removes the need to manually manage low-level GPU details while still delivering top performance.

    The evolution from CUDA (hundreds of lines and fully manual tuning) to Triton (dozens of lines with block-level abstractions) to Helion (10 to 30 lines of PyTorch-like code with hundreds of automatically tuned variants) shows that GPU programming is moving towards high-level tools that make expert-level results broadly accessible. And because Helion can explore optimization spaces far beyond what a human can test, developers can spend more time innovating and less time on thread layouts and memory management. Here are common workflows for a Helion developer, from idea to production.

    Phase 1: Write

    The code you write is often no more than 15 lines of code.

    1. Define kernel functions: @helion.kernel() decorator
    2. Write host code (CPU): Allocate tensors, compute shapes
    3. Write device code (GPU): hl.tile loops and PyTorch ops
    4. Debug in eager Python mode: Test with HELION_INTERPRET=1

    Phase 2: Tune

    This usually takes 10 minutes for each kernel, for each shape.

    1. First call triggers autotune: Automatic, no code changes
    2. LFBO explores configs: 1000+ Triton variants tested
    3. Best config printed: Copy into @helion.kernel(configs…)
    4. Inspect with PRINT_OUTPUT: See generated Triton code

    Phase 3: Deploy

    Your project is deployed with no runtime tuning overhead.

    1. Lock config in decorator: Zero tuning cost at runtime
    2. Deterministic compilation: Single optimized Triton kernel
    3. Binary cached: Instant on repeat calls
    4. Re-tune for new hardware: The same code is re-tuned to run on a different GPU

    See it in action: 3 kernels in 30 lines of code

    Here's a vector addition function in 7 lines of code:

    @helion.kernel()
    def add_kernel(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        size = x.size(0)
        out = torch.empty_like(x)
        for tile in hl.tile(size):
            out[tile] = x[tile] + y[tile]
        return out
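    For intuition, here is what the tiled loop above does, written as plain Python over lists: process the data in fixed-size tiles and mask the ragged final tile by hand. No GPU or Helion required, and the tile size here is an arbitrary choice rather than anything tuned.

    ```python
    def add_tiled(x, y, tile_size=4):
        # Pure-Python analogue of the hl.tile loop in the vector add above.
        size = len(x)
        out = [0] * size
        for start in range(0, size, tile_size):   # one iteration per tile
            end = min(start + tile_size, size)    # mask the last partial tile
            for i in range(start, end):           # elementwise work in the tile
                out[i] = x[i] + y[i]
        return out

    print(add_tiled([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))  # → [11, 22, 33, 44, 55]
    ```

    In Helion, the inner elementwise loop, the masking of the partial tile, and the choice of tile_size are all handled for you; on a GPU, each tile also runs in parallel rather than sequentially.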

    Softmax in 10 lines:

    @helion.kernel()
    def softmax_kernel(x: torch.Tensor) -> torch.Tensor:
        n, _m = x.size()
        out = torch.empty_like(x)
        for tile_n in hl.tile(n):
            values = x[tile_n, :]
            amax = torch.amax(values, dim=1, keepdim=True)
            exp = torch.exp(values - amax)
            sum_exp = torch.sum(exp, dim=1, keepdim=True)
            out[tile_n, :] = exp / sum_exp
        return out
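    The max-subtraction in the kernel above is what keeps the exponentials numerically safe. A plain-Python version of the same row-wise algorithm makes that easy to verify: without the shift, exp(1002.0) would overflow a float.

    ```python
    import math

    def softmax_row(row):
        amax = max(row)                          # torch.amax equivalent
        exp = [math.exp(v - amax) for v in row]  # shifted exponentials stay small
        total = sum(exp)                         # torch.sum equivalent
        return [e / total for e in exp]

    # These inputs would overflow math.exp() without the max shift.
    row = softmax_row([1000.0, 1001.0, 1002.0])
    print([round(v, 4) for v in row])  # → [0.09, 0.2447, 0.6652]
    ```

    Subtracting a constant from every element leaves the softmax result unchanged, since it cancels in the ratio; the shift only changes what exp() has to compute.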

    Debug like it's Python

    You can use the same techniques you use in Python to debug your code. To see the generated Triton code:

    HELION_PRINT_OUTPUT_CODE=1 python my_kernel.py

    To debug without GPU compilation:

    HELION_INTERPRET=1 python my_kernel.py

    Locking the config for production use

    Once autotuning completes, you can lock the optimal config for zero-overhead production use. For example:

    @helion.kernel(config=helion.Config(
        block_sizes=[64, 64, 64],
        loop_orders=[[0, 1]],
        num_warps=8,
        num_stages=6,
        indexing='block_ptr',
        pid_type='flat',
    ))
    def matmul(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        ...

    Getting started

    With Helion, you write minimal code, and you get automated block sizes, program IDs, grid dims, tensor indexing, masking, strides, and autotuning config lists. It's open source, and ready for use.

    To start using Helion for GPU kernels, the setup is just four commands:

    $ python3.12 -m venv helion_env && source helion_env/bin/activate
    $ pip install "torch>=2.9" --index-url https://download.pytorch.org/whl/cu128
    $ pip install helion packaging
    $ python -c "import helion; import torch; print('CUDA:', torch.cuda.is_available())"

    Our example code is in a Git repository, so feel free to clone and iterate!

    Last updated: April 27, 2026
