Estimate GPU memory for LLM fine-tuning with Red Hat AI

Stop guessing VRAM

March 4, 2026
Mohib Azam
Related topics:
Artificial intelligence, Data science, Python
Related products:
Red Hat AI Inference Server, Red Hat AI, Red Hat Enterprise Linux AI, Red Hat OpenShift AI

    Training Hub is an open source Python package that lets you fine-tune a pre-trained language model on a dataset of your choice. It is hosted on PyPI, and a downstream build is available on the Red Hat Python Index for use in Red Hat OpenShift AI workbenches as part of Red Hat AI 3.0. While language models are already powerful out of the box, fine-tuning a model on your dataset can improve its ability to handle specific tasks, making it more reliable for your business needs.

    However, fine-tuning a model requires significantly more memory than inference. Launching fine-tuning experiments without planning can waste GPU hours and team resources if a run hits an out-of-memory error. Using a GPU setup with more memory than necessary is also inefficient; smaller, more cost-effective GPUs might still support your fine-tuning process. A memory estimator calculates the amount of memory required for a given fine-tuning experiment, helping you quickly find the most efficient GPU setup, training method, and hyperparameters.

    Starting with Red Hat OpenShift AI 3.0, Training Hub includes memory_estimator.py, an API that estimates how much memory your fine-tuning setup is expected to use. With this information, you can train your model efficiently and avoid manual testing to determine whether your hardware can handle the workload.

    This blog covers:

    • How the estimator works and which training components to adjust.
    • Suggestions for adjusting your training setup to fit specific GPU specifications.
    • How to use the memory estimator in your code.
    • Next steps for streamlining model fine-tuning, including runtime estimates and automated hyperparameter suggestions.

    How to estimate memory usage

    Note

    Research into language model memory usage is still evolving. This information might change as we improve Training Hub's estimation tools and our understanding of memory allocation during fine-tuning.

    You can approximate memory usage by looking at several components. By default, Training Hub stores data as Float32, which occupies 4 bytes.

    Model memory is the amount of memory required to deploy the model on a GPU.

    Model Memory = Num Parameters * Data Type Bytes

    Gradient memory is the memory needed to store the neural network gradients during backpropagation. 

    Gradient Memory = Num Trainable Parameters * Data Type Bytes

    Optimizer memory is the memory needed to store optimizer states, and it scales with the size of the gradient memory. AdamW, the most common optimizer for LLMs, keeps two state values for each trainable parameter.

    Optimizer Memory = Num Trainable Parameters * Data Type Bytes * 2
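As a rough worked example of the model, gradient, and optimizer formulas above, the following sketch computes the three values for full fine-tuning in Float32. The function name and the 2.5-billion parameter count are illustrative assumptions, not values taken from Training Hub, and GB here means 10^9 bytes.

```python
FLOAT32_BYTES = 4  # Float32 occupies 4 bytes per value


def static_memory_bytes(num_params, num_trainable_params, dtype_bytes=FLOAT32_BYTES):
    """Return (model, gradient, optimizer) memory in bytes."""
    model = num_params * dtype_bytes
    gradient = num_trainable_params * dtype_bytes
    # AdamW keeps two state values per trainable parameter.
    optimizer = num_trainable_params * dtype_bytes * 2
    return model, gradient, optimizer


# Hypothetical model: 2.5 billion parameters, all trainable (full fine-tuning).
model, gradient, optimizer = static_memory_bytes(2_500_000_000, 2_500_000_000)
for name, value in [("model", model), ("gradient", gradient), ("optimizer", optimizer)]:
    print(f"{name}: {value / 1e9:.1f} GB")
```

These three components do not depend on batch size or sequence length, so they set a floor on memory use before any tokens are processed.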

    Activation memory is the memory needed for intermediate activations across each neural network layer. You can estimate it using this formula: 

    Activation Memory = Num Layers * Num Batch Tokens * Hidden Size * Data Type Bytes

    Note that some sources multiply this value by a constant K. For example, this article on calculating VRAM requirements suggests K should be 10-30. Other GPU memory estimation guides note only that activation memory is proportional to the right-hand side of this formula. In our testing, we found the best results by treating the two sides as equal rather than proportional.

    Output memory is the GPU memory required to store the network's final output. This is primarily constrained by the model’s vocabulary size.

    Output Memory = Num Batch Tokens * Vocab Size * Data Type Bytes * M

    The constant M varies depending on the training method. It is typically a float between 1 and 3. For SFT, this value is set to 8/3 and for OSFT, this value is set to 7/3.
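The two token-dependent components can be sketched the same way. This is a minimal illustration of the activation and output formulas above; the layer count, hidden size, vocabulary size, and token count are hypothetical values chosen for the example, not the configuration of any specific model.

```python
SFT_M = 8 / 3   # output constant for SFT, as stated above
OSFT_M = 7 / 3  # output constant for OSFT, as stated above


def activation_memory_bytes(num_layers, num_batch_tokens, hidden_size, dtype_bytes=4):
    """Activation Memory = Num Layers * Num Batch Tokens * Hidden Size * bytes."""
    return num_layers * num_batch_tokens * hidden_size * dtype_bytes


def output_memory_bytes(num_batch_tokens, vocab_size, m, dtype_bytes=4):
    """Output Memory = Num Batch Tokens * Vocab Size * bytes * M."""
    return num_batch_tokens * vocab_size * dtype_bytes * m


# Hypothetical configuration (assumed values for illustration only).
layers, hidden, vocab, tokens = 40, 2048, 50_000, 8192

act = activation_memory_bytes(layers, tokens, hidden)
out = output_memory_bytes(tokens, vocab, SFT_M)
print(f"activation: {act / 1e9:.2f} GB, output: {out / 1e9:.2f} GB")
```

Both values grow linearly with the number of tokens in the batch, which is why the token count is the main lever for these components.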

    To account for any discrepancies in the final total, you can apply an Overhead multiplier to the subtotal to provide a range of values. Of course, tight bounds are ideal. In Training Hub, we consider an overhead of up to 1.3x.

    Figure 1 shows the memory allocation during the first few steps of fine-tuning Granite 2B with SFT. The green area represents parameter memory. The yellow area is the optimizer memory, which is roughly twice the size of the parameter memory. The blue area is gradient memory, which is approximately the same size as the parameter memory. The slanted gray area is activation memory, and the brief gray spikes show output memory.

    Stacked area chart showing memory use over time: 5 GB for parameters, 10 GB for optimizer state, and 5 GB for gradients.
    Figure 1: Memory breakdown for the first training steps of a Granite 2B model using SFT.

    Supervised Fine-Tuning (SFT)

    Standard SFT uses a simple formula that sums these components:

    Total Memory = (Model + Optimizer + Gradient + Activation + Output) * Overhead
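To make the SFT total concrete, here is a small sketch that sums the five components and applies the overhead multiplier to produce a range. The component values are hypothetical, and treating the unmultiplied subtotal as the lower bound is an assumption of this sketch based on the "up to 1.3x" overhead described earlier.

```python
MAX_OVERHEAD = 1.3  # Training Hub considers an overhead of up to 1.3x


def sft_total_range(model, optimizer, gradient, activation, output,
                    max_overhead=MAX_OVERHEAD):
    """Return (low, high) total memory, in the same unit as the inputs."""
    subtotal = model + optimizer + gradient + activation + output
    # Assumption: no overhead for the low bound, full overhead for the high bound.
    return subtotal, subtotal * max_overhead


# Hypothetical component sizes in GB.
low, high = sft_total_range(model=10.0, optimizer=20.0, gradient=10.0,
                            activation=2.7, output=4.4)
print(f"Estimated range: {low:.1f} - {high:.1f} GB")
```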

    LoRA and QLoRA

    LoRA and QLoRA are alternative fine-tuning methods that aim to be more memory efficient. Unlike SFT and OSFT, LoRA mostly stores data in Float16, which uses 2 bytes per value. LoRA also takes a rank parameter, r. For each weight matrix of size W_in x W_out, LoRA forms two matrices of size W_in x r and r x W_out to approximate the weight’s update.

    ab_params = 0
    for weight in model_weights:
        # Matrix A is W_in x r, and matrix B is r x W_out
        lora_matrix_a_size = weight.dim[0] * r
        lora_matrix_b_size = weight.dim[1] * r
        ab_params += lora_matrix_a_size + lora_matrix_b_size
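A runnable version of this count, assuming the weight matrices are given as plain (W_in, W_out) shape tuples rather than model tensors, might look like the following. The two shapes are hypothetical and only serve to compare the adapter size with the size of the weights it approximates.

```python
def count_lora_params(weight_shapes, r):
    """Count the values in the A (W_in x r) and B (r x W_out) matrices."""
    ab_params = 0
    for w_in, w_out in weight_shapes:
        ab_params += w_in * r + r * w_out
    return ab_params


# Hypothetical weight shapes (assumed for illustration).
shapes = [(2048, 2048), (2048, 8192)]
full_params = sum(w_in * w_out for w_in, w_out in shapes)
ab_params = count_lora_params(shapes, r=16)

print(f"full weights: {full_params:,} values")
print(f"LoRA A and B: {ab_params:,} values")
```

Because the count grows with (W_in + W_out) * r instead of W_in * W_out, the adapter stays small as long as r is small relative to the matrix dimensions.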

    The rank r is a constant much lower than W_in and W_out (for example, 16), so the two LoRA matrices hold far fewer values than the weight matrix they approximate, which lowers total memory usage. This changes the formulas to:

    LoRA Model Memory = (Num Parameters + Num AB parameters) * Data Type Bytes
    LoRA Gradient Memory = Num AB parameters * Data Type Bytes
    LoRA Optimizer Memory = Num AB parameters * Data Type Bytes * 2

    In LoRA, gradient and output allocations typically do not coexist. The maximum allocated memory is determined by whichever is larger: the gradient or the output memory size. The final formula is:

    LoRA Total Memory = (Model + Optimizer + Activation + max(Gradient, Output)) * Overhead
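The LoRA formulas above can be combined into one function. This is a sketch under stated assumptions: the parameter counts and the activation and output sizes are hypothetical inputs, and the default of 2 bytes reflects the Float16 storage mentioned earlier.

```python
def lora_total_bytes(num_params, num_ab_params, activation_bytes, output_bytes,
                     dtype_bytes=2, overhead=1.0):
    """LoRA Total = (Model + Optimizer + Activation + max(Gradient, Output)) * Overhead."""
    model = (num_params + num_ab_params) * dtype_bytes
    gradient = num_ab_params * dtype_bytes
    optimizer = num_ab_params * dtype_bytes * 2
    # Gradient and output allocations typically do not coexist,
    # so only the larger of the two counts toward the peak.
    peak_transient = max(gradient, output_bytes)
    return (model + optimizer + activation_bytes + peak_transient) * overhead


# Hypothetical inputs (assumed for illustration).
total = lora_total_bytes(num_params=2_500_000_000, num_ab_params=20_000_000,
                         activation_bytes=1_300_000_000, output_bytes=2_000_000_000)
print(f"LoRA total without overhead: {total / 1e9:.2f} GB")
```

With these inputs the output term dominates the gradient term, so shrinking the adapter further would barely change the peak.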

    QLoRA is similar to LoRA, but the model is quantized to Float4 (4 bits, or 0.5 bytes, per parameter) instead of Float16 to save more memory. However, you must place the unquantized model on the GPU before quantization. This can create a memory bottleneck for small r values.

    QLoRA Model Memory = Num Parameters * 0.5 + Num AB parameters * Data Type Bytes
    QLoRA Required Memory = max(QLoRA Total Memory, Model Memory)
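The sketch below shows how the loading bottleneck can end up deciding the QLoRA requirement. It assumes the unquantized model is loaded in Float16 (2 bytes per parameter) before quantization; the text does not state which data type is used at that point, so that default, like the input sizes, is an assumption of this example.

```python
def qlora_required_bytes(num_params, num_ab_params, activation_bytes, output_bytes,
                         adapter_dtype_bytes=2, unquantized_dtype_bytes=2,
                         overhead=1.0):
    """QLoRA Required = max(QLoRA Total Memory, unquantized Model Memory)."""
    # Base weights at 0.5 bytes each, adapter matrices at full adapter precision.
    model = num_params * 0.5 + num_ab_params * adapter_dtype_bytes
    gradient = num_ab_params * adapter_dtype_bytes
    optimizer = num_ab_params * adapter_dtype_bytes * 2
    qlora_total = (model + optimizer + activation_bytes
                   + max(gradient, output_bytes)) * overhead
    # Assumption: the model is held unquantized at this precision before quantizing.
    unquantized_model = num_params * unquantized_dtype_bytes
    return max(qlora_total, unquantized_model)


# Hypothetical inputs (assumed for illustration).
required = qlora_required_bytes(num_params=2_500_000_000, num_ab_params=20_000_000,
                                activation_bytes=1_300_000_000,
                                output_bytes=2_000_000_000)
print(f"QLoRA required: {required / 1e9:.2f} GB")
```

In this example the training-time total is smaller than the unquantized model, so the initial load, not training, sets the requirement.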

    Orthogonal Subspace Fine-Tuning (OSFT)

    Note

    This information is not reflected in Training Hub 0.5.0 (part of Red Hat AI 3.3). The estimator will be updated in a future release.

    OSFT is a method that creates models capable of multiple tasks by freezing model weights during training. Similar to LoRA, OSFT uses additional matrices to approximate weight matrices during computation. Each weight matrix is replaced by three matrices: U, S, and V. For a weight matrix of size M x N, U is size M x min(M, N), S is size min(M, N), and V is size min(M, N) x N:

    OSFT_params = 0
    for weight in model_weights:
        low_dim = min(weight.dim[0], weight.dim[1])
        U_size = weight.dim[0] * low_dim
        S_size = low_dim
        V_size = weight.dim[1] * low_dim
        OSFT_params += (U_size + S_size + V_size)

    OSFT Model Memory = (Num Parameters + OSFT Parameters) * Data Type Bytes
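For comparison with LoRA, the OSFT count and model memory formula can be evaluated on plain (M, N) shape tuples. The shapes below are hypothetical, and the Float32 default follows the 4-byte default mentioned earlier; this is an illustration of the formula, not Training Hub's estimator code.

```python
def count_osft_params(weight_shapes):
    """Count the values in U (M x k), S (k), and V (k x N), where k = min(M, N)."""
    total = 0
    for m, n in weight_shapes:
        low_dim = min(m, n)
        total += m * low_dim + low_dim + low_dim * n
    return total


def osft_model_memory_bytes(num_params, osft_params, dtype_bytes=4):
    """OSFT Model Memory = (Num Parameters + OSFT Parameters) * Data Type Bytes."""
    return (num_params + osft_params) * dtype_bytes


# Hypothetical weight shapes (assumed for illustration).
shapes = [(2048, 2048), (2048, 8192)]
num_params = sum(m * n for m, n in shapes)
osft_params = count_osft_params(shapes)
print(f"original: {num_params:,} values, OSFT matrices: {osft_params:,} values")
print(f"model memory: {osft_model_memory_bytes(num_params, osft_params) / 1e6:.1f} MB")
```

Unlike LoRA's low-rank matrices, U, S, and V together are at least as large as the weight matrix they replace, so this formula increases model memory rather than reducing it.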

    Liger kernels

    When you use a method that relies on Liger kernels, such as OSFT, the expensive output tensors are not stored on the GPU, which gives you this formula instead:

    Total Memory = (Model + Optimizer + Gradient + Activation) * Overhead

    How to reduce memory usage

    If your training setup exceeds your available VRAM, there are several strategies to lower memory requirements.

    Change your model

    Choosing a model with fewer parameters reduces the memory requirements for most fine-tuning components. Note that the model's vocabulary size also affects memory usage, depending on whether you use Liger kernels.

    Change your training method

    Training Hub’s implementation of OSFT uses Liger kernels to reduce memory usage. These kernels prevent output tensors from being stored on the GPU, which lowers VRAM for larger models. LoRA also improves efficiency by using matrices that are smaller than the main model. To reduce LoRA’s memory usage further, decrease the rank size (lora-r) or use QLoRA. Your GPU must still be large enough to support the unquantized model during the initial setup.

    Adjust your hyperparameters

    Reducing the number of tokens placed on the GPU is an effective way to lower memory usage, especially for models with large output memory allocations. You can do this by lowering the batch_size or max_tokens_per_gpu parameters.
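One way to pick a token count is to invert the SFT formulas from earlier and solve for the largest number of tokens that fits a memory budget. The sketch below does that for a single GPU; it is derived from the formulas in this post rather than from any Training Hub feature, and every input value is hypothetical.

```python
import math


def estimated_bytes(tokens, static_bytes, num_layers, hidden_size, vocab_size, m,
                    dtype_bytes=4, overhead=1.3):
    """SFT estimate: (static + activation + output) * overhead."""
    activation = num_layers * tokens * hidden_size * dtype_bytes
    output = tokens * vocab_size * dtype_bytes * m
    return (static_bytes + activation + output) * overhead


def largest_token_count(budget_bytes, static_bytes, num_layers, hidden_size,
                        vocab_size, m, dtype_bytes=4, overhead=1.3):
    """Largest token count whose estimate stays within budget_bytes."""
    # Bytes of activation plus output memory added by each extra token.
    per_token = (num_layers * hidden_size + vocab_size * m) * dtype_bytes
    # Memory left for token-dependent tensors once overhead is accounted for.
    available = budget_bytes / overhead - static_bytes
    return max(0, math.floor(available / per_token))


# Hypothetical setup: 48 GB budget, 20 GB of model, gradient, and optimizer memory.
config = dict(static_bytes=20e9, num_layers=40, hidden_size=2048,
              vocab_size=50_000, m=8 / 3)
tokens = largest_token_count(48e9, **config)
print(f"largest token count that fits: {tokens}")
print(f"estimate at that count: {estimated_bytes(tokens, **config) / 1e9:.2f} GB")
```

A result of 0 means the static components alone exceed the budget, in which case changing the model or training method is the only option.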

    Why not quantize your model?

    Quantization is not recommended for SFT and OSFT. You will achieve better model performance by training without quantization and then quantizing the model for inference after fine-tuning. If you use LoRA, consider QLoRA, which quantizes the model during training to reduce memory requirements.

    How to use the memory estimator

    To get started with the memory estimator, use the editable example notebook, memory_estimator_example.ipynb (Figure 2). This notebook includes estimators for SFT and OSFT.

    A screenshot of the "Memory Estimator" notebook in Training Hub, showing an overview of how to use the memory_estimator API.
    Figure 2: The header for the memory_estimator_example.ipynb notebook file in Training Hub.

    Each fine-tuning method has an associated estimator class: BasicEstimator (for SFT), OSFTEstimator, LoRAEstimator, and QLoRAEstimator. Import the class you need into your notebook, create an instance, and provide inputs such as:

    • num_gpus: The number of GPUs in the training setup.
    • gpu_memory: The memory (in bytes) for each GPU.
    • model_path: The path to the model being fine-tuned. This should be a Hugging Face repository link.
    • The highest number of tokens on each GPU. For SFT and OSFT, use the max_tokens_per_gpu parameter. Alternatively, you can provide both batch_size and max_seq_len. The memory estimator then calculates the maximum token count as the product of these values.
    • verbose: Controls how much information the estimator prints.
    • OSFT and LoRA have method-specific hyperparameters that affect memory estimation, such as lora_r for LoRA and unfreeze_rank_ratio for OSFT.

    estimator = BasicEstimator(
        num_gpus=num_gpus,
        gpu_memory=gpu_memory,
        model_path=model_path,
        max_tokens_per_gpu=max_tokens_per_gpu,
        verbose=2,
    )

    After creating the object, call the estimate method to perform the calculation.

    low_bound, expected, high_bound = estimator.estimate()

    Alternatively, you can skip the object instantiation by using the estimate convenience function provided by memory_estimator.py. Provide your parameters and training method (as a string) to receive the estimated results.

    low, expected, high = estimate(
        training_method="sft",
        num_gpus=num_gpus,
        gpu_memory=gpu_memory,
        model_path=model_path,
        max_tokens_per_gpu=max_tokens_per_gpu,
        verbose=2,
    )

    When the estimation finishes, the output includes the lower bound, upper bound, and expected memory values. If you set the verbose value to 1 or 2, the printed output indicates whether the training setup works on your hardware (num_gpus and gpu_memory). For example, you receive the following output for verbose=2 when training Granite 3.3 2B on two 48 GB GPUs, such as L40s:

    Estimations for ibm-granite/granite-3.3-2b-instruct:
    Summary:
    The expected amount of memory needed to run this model is about 55.8 GB
    The lower and upper bounds are 50.8 - 66.0 GB
    If you have 2 GPUs, you will need about 27.9 GB, with bounds of 25.4 - 33.0 GB per GPU
    Component Breakdown:
    Each GPU will need 4.7 GB to store the model parameters
    Each GPU will need 9.4 GB to store the optimizer states
    Each GPU will need 4.7 GB to store the gradients
    Each GPU will need 2.5 GB to store the intermediate activations
    Each GPU will need 4.0 GB to store the outputs
    Up to 7.6 GB can be expected as overhead
    Decision:
    The proposed training setup should work for your hardware.
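If you want to act on the three returned values in your own code, you can compare the per-GPU bounds with your GPU memory. The helper below is a hypothetical sketch, not Training Hub's decision logic: it assumes memory is split evenly across GPUs, as in the sample output above, and its three-way classification is an assumption of this example.

```python
def check_fit(low, high, num_gpus, gpu_memory):
    """Classify a setup from total memory bounds (same unit as gpu_memory)."""
    per_gpu_low = low / num_gpus
    per_gpu_high = high / num_gpus
    if per_gpu_high <= gpu_memory:
        return "should work"      # even the upper bound fits on each GPU
    if per_gpu_low > gpu_memory:
        return "unlikely to fit"  # even the lower bound exceeds each GPU
    return "might fit"            # GPU memory falls between the two bounds


# Bounds from the sample output above: 50.8 - 66.0 GB on two 48 GB GPUs.
print(check_fit(low=50.8, high=66.0, num_gpus=2, gpu_memory=48))
# The same workload on a single 48 GB GPU.
print(check_fit(low=50.8, high=66.0, num_gpus=1, gpu_memory=48))
```

The estimator itself expects gpu_memory in bytes, so convert your values to a common unit before making a comparison like this.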

    What's next?

    We are planning and developing several new features to further simplify the fine-tuning workflow and help you plan your resources more effectively.

    Time estimations

    Estimating model runtime is harder than estimating memory. Many factors determine how long a program runs, and the results are not always predictable. However, knowing this information is still very helpful for planning the number of hours you’ll need to use your GPUs.

    We are exploring options for creating a fine-tuning time estimator. In the meantime, our wiki page provides guidelines on how long common fine-tuning processes should take.

    Hyperparameter suggestions and hardware discovery

    Because the estimator is a hard-coded algorithm and memory allocation depends on hyperparameter settings, we are developing a system that recommends a set of hyperparameters based on your hardware and model. We plan to include a hardware discovery feature that automatically detects your GPU configuration and determines if training is possible. We also plan to recommend model sizes and training methods based on your hardware and use case.

    Conclusion

    Fine-tuning is an effective way to create an inference model for a specific task, and Red Hat AI’s Training Hub provides a simple way to manage the process. If you have concerns about the memory required for fine-tuning, the memory estimator in Training Hub provides an estimate in advance so you can adjust your training or hardware setup.

    To get started with the memory estimator, install the latest version of Training Hub and run the provided example notebooks.

    Training Hub is versioned and supported on Red Hat AI through the Red Hat Python Index. It also integrates with Kubeflow Trainer for distributed training workloads.

    Acknowledgements

    Huge thanks to the Red Hat AI Innovation Team and other Training Hub members for their help developing this tool, including Mustafa Eyceoz, Oleg Silkin, and Aditi Saluja. I am also grateful to the reviewers who helped shape this blog.
