
From tuning to serving: How open source powers the LLM life cycle

March 26, 2025
Junpei Ishikawa
Related topics: Artificial intelligence
Related products: Red Hat AI, Red Hat OpenShift AI


    Since the rise of gen AI, many companies have been working to integrate large language models (LLMs) into their business processes to create value. One of the key challenges is providing domain-specific knowledge to LLMs. Many companies have chosen retrieval-augmented generation (RAG), storing internal documents in a vector database and querying the LLM while referencing stored knowledge. Another approach is fine-tuning, which slightly modifies the original model weights to incorporate new knowledge and skills.

    In the past, fine-tuning LLMs was not an easy task for many organizations. It required a specialized training cluster and a broad range of technical expertise. However, the open source ecosystem has lowered the barrier to entry. For example, Hugging Face offers a variety of popular tools for training and customizing models, while Kubeflow provides a cloud-native approach to running training jobs across distributed containers.

    In this article, we will demonstrate how the Red Hat OpenShift AI Kubeflow Training (KFT) Operator and open source tools enable us to fine-tune LLMs in a distributed environment.

    All the resources are stored in the accompanying GitHub repository (https://github.com/JPishikawa/ft-by-sft), and the trained model is published in a Hugging Face repository.

    Disclaimer

    Fine-tuning LLMs with the Kubeflow Training Operator and SFT Trainer is still a Limited Availability feature in the latest OpenShift AI v2.18. If you need support for this feature, please contact Red Hat to obtain approval.

    Prerequisites  

    We need to ensure that the following tools are available on the Red Hat OpenShift cluster:

    • OpenShift AI operator with the KFT Operator.
    • NVIDIA GPU Operator and Node Feature Discovery Operator.
    • StorageClass that supports the ReadWriteMany (RWX) access mode.

    The KFT Operator can be installed through the OpenShift AI Operator. Once the managementState of the trainingoperator component in the DataScienceCluster is set to Managed, the OpenShift AI controller installs the KFT Operator in the cluster as follows:

        trainingoperator:
          managementState: Managed
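
    For reference, this field sits under spec.components in the DataScienceCluster resource. The following is a minimal sketch of that context (the default-dsc resource name is a common default and only an assumption; adjust it to match your installation):

    apiVersion: datasciencecluster.opendatahub.io/v1
    kind: DataScienceCluster
    metadata:
      name: default-dsc
    spec:
      components:
        # Setting this component to Managed tells the OpenShift AI controller to install the KFT Operator
        trainingoperator:
          managementState: Managed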

    Our OpenShift cluster is built on Amazon Web Services (AWS) and includes two g6e.xlarge instances as GPU nodes, each equipped with an NVIDIA L40S GPU. We will install the NVIDIA GPU Operator and the Node Feature Discovery Operator to enable these resources on the cluster.

    When running training jobs on multiple nodes, the training dataset must be accessible from all of the nodes simultaneously, so we require the RWX access mode for persistent volumes (PVs). In this article, we will use Red Hat OpenShift Data Foundation CephFS for RWX storage.

    Prepare the dataset

    The fine-tuning dataset must either be stored in a PV before training or downloaded from Hugging Face when the training starts.

    We will use the GSM8K dataset, which we have pre-stored in an object storage bucket. We will demonstrate how to download it to a PV before starting the fine-tuning.

    Begin by creating Persistent Volume Claims (PVCs) as follows:

    $ oc new-project fine-tuning
    $ git clone https://github.com/JPishikawa/ft-by-sft/
    $ oc apply -f ft-by-sft/deploy/storage/pvc.yaml
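
    The pvc.yaml manifest creates the claims used throughout this article (dataset-volume, model-volume, and cache-volume). As a rough sketch, a claim with the RWX access mode looks like the following (the size and storage class shown here are illustrative placeholders; adjust them for your cluster):

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: dataset-volume
    spec:
      accessModes:
        - ReadWriteMany                              # RWX so the master and worker Pods can mount it at the same time
      resources:
        requests:
          storage: 100Gi                             # placeholder size
      storageClassName: ocs-storagecluster-cephfs    # ODF CephFS class; replace with your RWX-capable StorageClass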

    Create the storage-config Secret, whose my-storage key contains the credentials for the object storage. Modify the Secret for your environment:

    $ oc apply -f ft-by-sft/deploy/storage/secret.yaml
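
    The downloader Pod below reads the credentials from the my-storage key of the storage-config Secret, in the JSON format expected by the KServe storage initializer. A hedged sketch of such a Secret (field names follow common KServe S3 conventions; the values are placeholders to replace with your own):

    apiVersion: v1
    kind: Secret
    metadata:
      name: storage-config
    stringData:
      my-storage: |
        {
          "type": "s3",
          "access_key_id": "YOUR_ACCESS_KEY",
          "secret_access_key": "YOUR_SECRET_KEY",
          "endpoint_url": "https://s3.amazonaws.com",
          "region": "us-east-1",
          "bucket": "my-fine-tuning-trial"
        }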

    Create the Pod that downloads the dataset from object storage to the PV:

    apiVersion: v1
    kind: Pod
    metadata:
      name: download-dataset
      labels:
        name: download-dataset
    spec:
      volumes:
        - name: dataset-volume
          persistentVolumeClaim:
            claimName: dataset-volume
      restartPolicy: Never
      initContainers:
        - name: fix-volume-permissions
          image: quay.io/quay/busybox:latest
          command: ["sh"]
          args: ["-c", "chmod -R 777 /data/input"]
          volumeMounts:
            - mountPath: "/data/input/"
              name: dataset-volume
      containers:
        - name: download-data
          imagePullPolicy: IfNotPresent
          image: quay.io/modh/kserve-storage-initializer:rhoai-2.17
          args:
            - 's3://my-fine-tuning-trial/data/'
            - /data/input
          env:
            - name: STORAGE_CONFIG
              valueFrom:
                secretKeyRef:
                  name: storage-config
                  key: my-storage
          volumeMounts:
            - mountPath: "/data/input/"
              name: dataset-volume

    The object bucket name, my-fine-tuning-trial, and the path, /data/, are specified in the container arguments. Once the Pod has finished downloading, the dataset is stored in dataset-volume.
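
    One way to confirm that the download finished before moving on:

    $ oc get pod download-dataset                  # STATUS should read Completed
    $ oc logs download-dataset -c download-data    # shows the storage-initializer output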

    Configuration

    FMS HF Tuning is an open source Python library that wraps Hugging Face's SFT Trainer and PyTorch FSDP to run LLM fine-tuning jobs. We will use this library together with the KFT PyTorchJob to run distributed training jobs.

    The FMS HF Tuning parameters are configured via the following ConfigMap:

    kind: ConfigMap
    apiVersion: v1
    metadata:
      name: training-config
    data:
      config.json: |
        {
          "accelerate_launch_args": {
                "main_process_ip": "kfto-demo-master-0",
                "main_process_port": 29500,
                "num_processes": 2,
                "num_machines": 2,
                "machine_rank": 0,
                "mixed_precision": "bf16",
                "use_fsdp": "true",
                "fsdp_sharding_strategy": 4,
                "rdzv_backend": "c10d"
            },
          "model_name_or_path": "Qwen/Qwen2.5-7B-Instruct",
          "training_data_path": "/data/input/train-00000-of-00001.parquet",
          "output_dir": "/data/output/tuning/qwen2.5-tuning",
          "save_model_dir": "/data/output/model",
          "num_train_epochs": 3,
          "per_device_train_batch_size": 4,
          "per_device_eval_batch_size": 4,
          "gradient_accumulation_steps": 16,
          "packing": "True",
          "gradient_checkpointing": "True",
          "save_strategy": "epoch",
          "learning_rate": 2e-05,
          "lr_scheduler_type": "constant",
          "include_tokens_per_second": true,
          "data_formatter_template": "### Question:\n{{question}}\n\n### Answer:\n{{answer}}<|im_end|>",
          "response_template": "### Answer:\n",
          "logging_strategy": "steps",
          "logging_steps": 0.2,
          "neftune_noise_alpha": 5,
          "use_flash_attn": true,
          "use_liger_kernel": "True",
          "peft_method": "lora",
          "r": 16,
          "lora_alpha": 32,
          "lora_dropout": 0.05,
          "bias": "none",
          "target_modules": ["all-linear"],
          "lora_post_process_for_vllm": true,
          "trackers": ["aim"],
          "experiment": "my-first-experiment",
          "aim_remote_server_ip": "aim.aim.svc.cluster.local",
          "aim_remote_server_port": "53800"
        }

    The accelerate_launch_args block passes arguments to accelerate launch and configures FSDP:

    • main_process_ip: Headless Service name of the master Pod of PyTorchJob.
    • num_processes: The total number of GPUs.
    • num_machines: The total number of GPU nodes.
    • fsdp_sharding_strategy: FSDP sharding strategy. 4 is "HYBRID_SHARD".

    Other parameters mostly come from Hugging Face's TrainingArguments:

    • model_name_or_path: The base model name on Hugging Face Hub.
    • training_data_path: The path to the training dataset stored in the attached PV.
    • per_device_train_batch_size and gradient_accumulation_steps: The product of these values determines the effective batch size per device; choose values that fit in GPU memory while keeping sizes Tensor Core-friendly.
    • peft_method: Using the Low-Rank Adaptation (LoRA) method to fine-tune the model.
    • use_liger_kernel: Using the Liger Kernel to accelerate training and reduce video RAM (VRAM).
    • trackers: Using the Aim stack for experiment tracking.

    If you would like to try full-parameter fine-tuning, remove the LoRA-related parameters (peft_method, r, lora_alpha, lora_dropout, bias, target_modules, and lora_post_process_for_vllm) from the ConfigMap. Note that full-parameter tuning requires more VRAM to complete the training.

    Experiment tracking

    As described in the configuration section, we use the Aim stack for experiment tracking. Aim provides visualizations of key training metrics and is easy to integrate with FMS HF Tuning.

    To deploy Aim on OpenShift, run the following commands:

    $ cd ~/ft-by-sft/deploy/aim/
    $ /bin/bash deploy.sh

    This script creates resources in the aim namespace. Once the Aim Pod is running, you can access the Aim graphical user interface (GUI) via a route.
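
    The route name depends on the deploy script, but you can look it up with:

    $ oc get route -n aim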

    Running a training job

    PyTorchJob is one of the custom resources provided by the KFT Operator. The following PyTorchJob creates a master Pod and a worker Pod for distributed training:

    apiVersion: kubeflow.org/v1
    kind: PyTorchJob
    metadata:
      name: kfto-demo
    spec:
      pytorchReplicaSpecs:
        Master:
          replicas: 1
          restartPolicy: Never
          template:
            spec:
              containers:
                - env:
                    - name: SFT_TRAINER_CONFIG_JSON_PATH
                      value: /etc/config/config.json
                    - name: SET_NUM_PROCESSES_TO_NUM_GPUS
                      value: "false"
                    - name: TORCH_NCCL_ASYNC_ERROR_HANDLING
                      value: "1"
                    - name: PYTORCH_CUDA_ALLOC_CONF
                      value: "expandable_segments:True"
                  image: 'quay.io/jishikaw/fms-hf-tuning:latest'
                  imagePullPolicy: IfNotPresent
                  name: pytorch
                  ports:
                    - containerPort: 29500
                      name: pytorchjob-port
                  resources:
                    limits:
                      nvidia.com/gpu: 1
                  volumeMounts:
                    - mountPath: /etc/config
                      name: config-volume
                    - mountPath: /data/input
                      name: dataset-volume
                    - mountPath: /data/output
                      name: model-volume
                    - mountPath: /.cache
                      name: cache-volume
                    - mountPath: "/dev/shm"
                      name: dshm
              volumes:
                - configMap:
                    items:
                      - key: config.json
                        path: config.json
                    name: training-config
                  name: config-volume
                - persistentVolumeClaim:
                    claimName: dataset-volume
                  name: dataset-volume
                - name: model-volume
                  persistentVolumeClaim:
                    claimName: model-volume
                - name: cache-volume
                  persistentVolumeClaim:
                    claimName: cache-volume
                - name: dshm
                  emptyDir:
                    medium: Memory
        Worker:
          replicas: 1
          restartPolicy: Never
          template:
            spec:
              containers:
                - env:
                    - name: SFT_TRAINER_CONFIG_JSON_PATH
                      value: /etc/config/config.json
                    - name: SET_NUM_PROCESSES_TO_NUM_GPUS
                      value: "false"
                    - name: TORCH_NCCL_ASYNC_ERROR_HANDLING
                      value: "1"
                    - name: PYTORCH_CUDA_ALLOC_CONF
                      value: "expandable_segments:True"
                  image: 'quay.io/jishikaw/fms-hf-tuning:latest'
                  imagePullPolicy: IfNotPresent
                  name: pytorch
                  ports:
                    - containerPort: 29500
                      name: pytorchjob-port
                  resources:
                    limits:
                      nvidia.com/gpu: 1
                  volumeMounts:
                    - mountPath: /etc/config
                      name: config-volume
                    - mountPath: /data/input
                      name: dataset-volume
                    - mountPath: /data/output
                      name: model-volume
                    - mountPath: /.cache
                      name: cache-volume
                    - mountPath: "/dev/shm"
                      name: dshm
              volumes:
                - configMap:
                    items:
                      - key: config.json
                        path: config.json
                    name: training-config
                  name: config-volume
                - persistentVolumeClaim:
                    claimName: dataset-volume
                  name: dataset-volume
                - name: model-volume
                  persistentVolumeClaim:
                    claimName: model-volume
                - name: cache-volume
                  persistentVolumeClaim:
                    claimName: cache-volume
                - name: dshm
                  emptyDir:
                    medium: Memory
      runPolicy:
        suspend: false
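
    To submit the job and watch the master and worker Pods come up, you can run the following commands (assuming the manifest above is saved locally as pytorchjob.yaml; the label selector relies on the standard training-operator labels):

    $ oc apply -f pytorchjob.yaml
    $ oc get pytorchjob kfto-demo
    $ oc get pods -l training.kubeflow.org/job-name=kfto-demo -w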

    Once the training job starts running, the worker's init container tries to connect to the master Pod based on the parameters specified in the ConfigMap.

    An init container error may occur the first time you run the job because pulling the container image takes a while. In that case, deleting and recreating the PyTorchJob usually resolves the error.

    If the training fails due to a CUDA out-of-memory error, decrease the values of per_device_train_batch_size and gradient_accumulation_steps in the ConfigMap to reduce VRAM consumption.

    It takes about one and a half hours to complete the training job. You can monitor training metrics (e.g., loss) on the Aim GUI, as shown in Figure 1.

    Figure 1: Aim GUI for visualizing training metrics.

    Serve the fine-tuned model

    The trained model is stored in model-volume as a LoRA adapter, which can be served with the base model on OpenShift AI.

    Create a new connection

    1. Go to the OpenShift AI console and create a new connection.
    2. Select URI - v1 as the Connection type and set pvc://model-volume/model/ as the URI (Figure 2).
    Figure 2: Creating a URI-type data connection.

    Deploy the model

    1. Switch to the Models tab and deploy the model.
    2. Select vLLM ServingRuntime for KServe as the serving runtime.
    3. Select the created connection for the model source.
    4. Add the following arguments and the environment variable (Figure 3):
    --enable-lora
    --lora-modules=tuned-qwen=/mnt/models/
    --model=Qwen/Qwen2.5-7B-Instruct

    Set HF_HUB_OFFLINE to 0 as shown in Figure 3. This allows downloading the base model from Hugging Face Hub.

    Figure 3: Configuration parameters and environment variables.

    Once the model is deployed, the model API can be called (Figure 4).

    Figure 4: Deployed fine-tuned model and API endpoints.
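
    For example, because --lora-modules registers the adapter under the name tuned-qwen, you can query the OpenAI-compatible endpoint with a request like the one below (the inference route URL and token are placeholders taken from the deployment details in the OpenShift AI console):

    $ curl -sk https://<inference-route>/v1/chat/completions \
        -H "Authorization: Bearer <token>" \
        -H "Content-Type: application/json" \
        -d '{
              "model": "tuned-qwen",
              "messages": [
                {"role": "user", "content": "A farmer has 12 apples and gives away 5. How many are left?"}
              ]
            }'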

    Potential improvements

    In this article, we demonstrated how to fine-tune an LLM with the KFT Operator on OpenShift AI. Training jobs can be managed via PyTorchJob, and FMS HF Tuning helps run distributed training jobs in a simple way. Additionally, the trained model can be served through OpenShift AI.

    There are potential areas for improvement in real use cases:

    • Integration with Kueue: OpenShift AI includes Kueue for training job management. Allocating cluster resources fairly across training jobs is important, and Kueue supports this need.
    • GPUDirect RDMA: If available, connecting GPUs across nodes over a high-speed network is crucial for training efficiency. InfiniBand and RoCEv2 are popular options for this purpose and can help reduce the overall training time.

    Last updated: April 7, 2025

