
From tuning to serving: How open source powers the LLM life cycle

March 26, 2025
Junpei Ishikawa
Related topics: Artificial intelligence
Related products: Red Hat AI, Red Hat OpenShift AI

    Since the rise of generative AI (gen AI), many companies have been working to integrate large language models (LLMs) into their business processes to create value. One of the key challenges is providing domain-specific knowledge to LLMs. Many companies have chosen retrieval-augmented generation (RAG), which stores internal documents in a vector database and retrieves relevant content for the LLM to reference at query time. Another approach is fine-tuning, which adjusts the original model weights to incorporate new knowledge and skills.

    In the past, fine-tuning LLMs was not an easy task for many organizations. It required a specialized training cluster and a broad range of technical expertise. However, the open source ecosystem has lowered the barrier to entry. For example, Hugging Face offers a variety of popular tools for training and customizing models, while Kubeflow provides a cloud-native approach to running training jobs across distributed containers.

    In this article, we will demonstrate how the Red Hat OpenShift AI Kubeflow Training (KFT) Operator and open source tools enable us to fine-tune LLMs in a distributed environment.

    All the resources are stored in this GitHub repository (https://github.com/JPishikawa/ft-by-sft), and the trained model is stored in a Hugging Face repository.

    Disclaimer

    Fine-tuning LLMs with the Kubeflow Training Operator and SFT Trainer is still a Limited Availability feature in the latest OpenShift AI v2.18. If you need support for this feature, please contact Red Hat to obtain approval.

    Prerequisites  

    We need to ensure that the following tools are available on the Red Hat OpenShift cluster:

    • OpenShift AI operator with the KFT Operator.
    • NVIDIA GPU Operator and Node Feature Discovery Operator.
    • StorageClass that supports the ReadWriteMany (RWX) access mode.

    The KFT Operator can be installed through the OpenShift AI Operator. Once the managementState of the DataScienceCluster is set to Managed, the OpenShift AI Controller will install the KFT Operator in the cluster as follows:

        trainingoperator:
          managementState: Managed
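
    For context, this fragment lives under spec.components in the DataScienceCluster resource. A minimal sketch, assuming a cluster named default-dsc and omitting all other components, might look like this:

    apiVersion: datasciencecluster.opendatahub.io/v1
    kind: DataScienceCluster
    metadata:
      name: default-dsc
    spec:
      components:
        trainingoperator:
          managementState: Managed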

    Our OpenShift cluster is built on Amazon Web Services (AWS) and includes two g6e.xlarge instances as GPU nodes, each equipped with a single NVIDIA L40S GPU. We will install the NVIDIA GPU Operator and the Node Feature Discovery Operator to enable these resources on the cluster.

    When running training jobs on multiple nodes, the training dataset must be accessible from all nodes simultaneously, so we require the RWX access mode for persistent volumes (PVs). In this article, we will use Red Hat OpenShift Data Foundation's CephFS for RWX storage.
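
    The actual PVC definitions live in ft-by-sft/deploy/storage/pvc.yaml (applied in the next section). As a rough sketch, an RWX claim against the typical ODF CephFS StorageClass might look like the following; the StorageClass name and requested size here are illustrative assumptions:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: dataset-volume
    spec:
      accessModes:
        - ReadWriteMany
      resources:
        requests:
          storage: 20Gi
      storageClassName: ocs-storagecluster-cephfs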

    Prepare the dataset

    The fine-tuning dataset needs to be stored in a PV before training or downloaded from Hugging Face at the beginning of the training. 

    We will use the GSM8K dataset, which we have pre-stored in an object storage bucket. We will demonstrate how to download it to a PV before starting the fine-tuning.

    Begin by creating Persistent Volume Claims (PVCs) as follows:

    $ oc new-project fine-tuning
    $ git clone https://github.com/JPishikawa/ft-by-sft/
    $ oc apply -f ft-by-sft/deploy/storage/pvc.yaml

    Create the Secret that holds the object storage credentials. Note that the download Pod below reads them from a Secret named storage-config under the key my-storage. Modify the Secret for your environment:

    $ oc apply -f ft-by-sft/deploy/storage/secret.yaml
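
    The KServe storage initializer used below reads its credentials from the STORAGE_CONFIG environment variable as a JSON document. A sketch of what the Secret might contain, assuming S3-compatible storage (exact fields depend on your provider):

    apiVersion: v1
    kind: Secret
    metadata:
      name: storage-config
    stringData:
      my-storage: |
        {
          "type": "s3",
          "access_key_id": "<ACCESS_KEY>",
          "secret_access_key": "<SECRET_KEY>",
          "endpoint_url": "https://s3.example.com",
          "bucket": "my-fine-tuning-trial",
          "region": "us-east-1"
        }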

    Create the Pod that downloads the dataset from object storage to the PV:

    apiVersion: v1
    kind: Pod
    metadata:
      name: download-dataset
      labels:
        name: download-dataset
    spec:
      volumes:
        - name: dataset-volume
          persistentVolumeClaim:
            claimName: dataset-volume
      restartPolicy: Never
      initContainers:
        - name: fix-volume-permissions
          image: quay.io/quay/busybox:latest
          command: ["sh"]
          args: ["-c", "chmod -R 777 /data/input"]
          volumeMounts:
            - mountPath: "/data/input/"
              name: dataset-volume
      containers:
        - name: download-data
          imagePullPolicy: IfNotPresent
          image: quay.io/modh/kserve-storage-initializer:rhoai-2.17
          args:
            - 's3://my-fine-tuning-trial/data/'
            - /data/input
          env:
            - name: STORAGE_CONFIG
              valueFrom:
                secretKeyRef:
                  name: storage-config
                  key: my-storage
          volumeMounts:
            - mountPath: "/data/input/"
              name: dataset-volume

    The bucket name (my-fine-tuning-trial) and the path (/data/) are specified in the container arguments. Once the Pod completes the download, the dataset is stored in dataset-volume.
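
    To confirm the download succeeded, you can wait for the Pod to finish and inspect its logs. A minimal check, assuming the manifest above is saved as download-dataset.yaml (oc wait with a jsonpath condition requires a reasonably recent client):

    $ oc apply -f download-dataset.yaml
    $ oc wait pod/download-dataset --for=jsonpath='{.status.phase}'=Succeeded --timeout=15m
    $ oc logs download-dataset -c download-data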

    Configuration

    FMS HF Tuning is an open source Python library that wraps Hugging Face's SFT Trainer and PyTorch FSDP (Fully Sharded Data Parallel) to run LLM fine-tuning jobs. We will use this library together with the KFT PyTorchJob to run distributed training jobs.

    FMS HF Tuning is configured via the following ConfigMap:

    kind: ConfigMap
    apiVersion: v1
    metadata:
      name: training-config
    data:
      config.json: |
        {
          "accelerate_launch_args": {
                "main_process_ip": "kfto-demo-master-0",
                "main_process_port": 29500,
                "num_processes": 2,
                "num_machines": 2,
                "machine_rank": 0,
                "mixed_precision": "bf16",
                "use_fsdp": "true",
                "fsdp_sharding_strategy": 4,
                "rdzv_backend": "c10d"
            },
          "model_name_or_path": "Qwen/Qwen2.5-7B-Instruct",
          "training_data_path": "/data/input/train-00000-of-00001.parquet",
          "output_dir": "/data/output/tuning/qwen2.5-tuning",
          "save_model_dir": "/data/output/model",
          "num_train_epochs": 3,
          "per_device_train_batch_size": 4,
          "per_device_eval_batch_size": 4,
          "gradient_accumulation_steps": 16,
          "packing": "True",
          "gradient_checkpointing": "True",
          "save_strategy": "epoch",
          "learning_rate": 2e-05,
          "lr_scheduler_type": "constant",
          "include_tokens_per_second": true,
          "data_formatter_template": "### Question:\n{{question}}\n\n### Answer:\n{{answer}}<|im_end|>",
          "response_template": "### Answer:\n",
          "logging_strategy": "steps",
          "logging_steps": 0.2,
          "neftune_noise_alpha": 5,
          "use_flash_attn": true,
          "use_liger_kernel": "True",
          "peft_method": "lora",
          "r": 16,
          "lora_alpha": 32,
          "lora_dropout": 0.05,
          "bias": "none",
          "target_modules": ["all-linear"],
          "lora_post_process_for_vllm": true,
          "trackers": ["aim"],
          "experiment": "my-first-experiment",
          "aim_remote_server_ip": "aim.aim.svc.cluster.local",
          "aim_remote_server_port": "53800"
        }

    In accelerate_launch_args, we pass arguments for accelerate launch, including the FSDP configuration (a conceptual standalone equivalent is sketched after this list):

    • main_process_ip: Headless Service name of the master Pod of PyTorchJob.
    • num_processes: The total number of GPUs.
    • num_machines: The total number of GPU nodes.
    • fsdp_sharding_strategy: The FSDP sharding strategy; 4 corresponds to HYBRID_SHARD.
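
    Under the hood, FMS HF Tuning turns these settings into an accelerate launch invocation on each node. Conceptually (this is illustrative, not the exact command the library builds, and the training entrypoint is a placeholder), the master's launch resembles:

    $ accelerate launch \
        --num_processes 2 --num_machines 2 --machine_rank 0 \
        --main_process_ip kfto-demo-master-0 --main_process_port 29500 \
        --mixed_precision bf16 --use_fsdp --fsdp_sharding_strategy 4 \
        --rdzv_backend c10d \
        <training-entrypoint> --config /etc/config/config.json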

    Other parameters mostly come from Hugging Face's TrainingArguments:

    • model_name_or_path: The base model name on Hugging Face Hub.
    • training_data_path: The path to the training dataset stored in the attached PV.
    • per_device_train_batch_size and gradient_accumulation_steps: The product of these values determines the effective batch size per device; keeping batch dimensions at multiples of 8 helps the GPU's Tensor Cores run efficiently (see the worked example after this list).
    • peft_method: Using the Low-Rank Adaptation (LoRA) method to fine-tune the model.
    • use_liger_kernel: Using the Liger Kernel to accelerate training and reduce video RAM (VRAM) usage.
    • trackers: Using the Aim stack for experiment tracking.
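
    As a worked example with the values above: per_device_train_batch_size=4 and gradient_accumulation_steps=16 give 4 × 16 = 64 samples per device per optimizer step, or 64 × 2 = 128 globally across the two GPUs.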

    If you would like to try full-parameter fine-tuning, remove the LoRA-related configurations (peft_method, r, lora_alpha, lora_dropout, bias, target_modules, and lora_post_process_for_vllm). Note that full fine-tuning requires considerably more VRAM to complete the training.

    Experiment tracking

    As described in the configuration section, we use the Aim stack for experiment tracking. Aim provides a visualization of key training metrics and is easy to integrate into FMS HF Tuning.

    To deploy Aim on OpenShift, run the following commands:

    $ cd ~/ft-by-sft/deploy/aim/
    $ /bin/bash deploy.sh

    This script creates resources in the aim namespace. Once the Aim Pod is running, you can access the Aim graphical user interface (GUI) via a route.
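
    For example, assuming the script exposes the GUI with a Route in the aim namespace (the Route name may differ in your deployment), you can look up its hostname with:

    $ oc get route -n aim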

    Running a training job

    PyTorchJob is one of the custom resources in the KFT Operator. The following PyTorchJob creates a master Pod and a worker Pod for distributed training:

    apiVersion: kubeflow.org/v1
    kind: PyTorchJob
    metadata:
      name: kfto-demo
    spec:
      pytorchReplicaSpecs:
        Master:
          replicas: 1
          restartPolicy: Never
          template:
            spec:
              containers:
                - env:
                    - name: SFT_TRAINER_CONFIG_JSON_PATH
                      value: /etc/config/config.json
                    - name: SET_NUM_PROCESSES_TO_NUM_GPUS
                      value: "false"
                    - name: TORCH_NCCL_ASYNC_ERROR_HANDLING
                      value: "1"
                    - name: PYTORCH_CUDA_ALLOC_CONF
                      value: "expandable_segments:True"
                  image: 'quay.io/jishikaw/fms-hf-tuning:latest'
                  imagePullPolicy: IfNotPresent
                  name: pytorch
                  ports:
                    - containerPort: 29500
                      name: pytorchjob-port
                  resources:
                    limits:
                      nvidia.com/gpu: 1
                  volumeMounts:
                    - mountPath: /etc/config
                      name: config-volume
                    - mountPath: /data/input
                      name: dataset-volume
                    - mountPath: /data/output
                      name: model-volume
                    - mountPath: /.cache
                      name: cache-volume
                    - mountPath: "/dev/shm"
                      name: dshm
              volumes:
                - configMap:
                    items:
                      - key: config.json
                        path: config.json
                    name: training-config
                  name: config-volume
                - persistentVolumeClaim:
                    claimName: dataset-volume
                  name: dataset-volume
                - name: model-volume
                  persistentVolumeClaim:
                    claimName: model-volume
                - name: cache-volume
                  persistentVolumeClaim:
                    claimName: cache-volume
                - name: dshm
                  emptyDir:
                    medium: Memory
        Worker:
          replicas: 1
          restartPolicy: Never
          template:
            spec:
              containers:
                - env:
                    - name: SFT_TRAINER_CONFIG_JSON_PATH
                      value: /etc/config/config.json
                    - name: SET_NUM_PROCESSES_TO_NUM_GPUS
                      value: "false"
                    - name: TORCH_NCCL_ASYNC_ERROR_HANDLING
                      value: "1"
                    - name: PYTORCH_CUDA_ALLOC_CONF
                      value: "expandable_segments:True"
                  image: 'quay.io/jishikaw/fms-hf-tuning:latest'
                  imagePullPolicy: IfNotPresent
                  name: pytorch
                  ports:
                    - containerPort: 29500
                      name: pytorchjob-port
                  resources:
                    limits:
                      nvidia.com/gpu: 1
                  volumeMounts:
                    - mountPath: /etc/config
                      name: config-volume
                    - mountPath: /data/input
                      name: dataset-volume
                    - mountPath: /data/output
                      name: model-volume
                    - mountPath: /.cache
                      name: cache-volume
                    - mountPath: "/dev/shm"
                      name: dshm
              volumes:
                - configMap:
                    items:
                      - key: config.json
                        path: config.json
                    name: training-config
                  name: config-volume
                - persistentVolumeClaim:
                    claimName: dataset-volume
                  name: dataset-volume
                - name: model-volume
                  persistentVolumeClaim:
                    claimName: model-volume
                - name: cache-volume
                  persistentVolumeClaim:
                    claimName: cache-volume
                - name: dshm
                  emptyDir:
                    medium: Memory
      runPolicy:
        suspend: false

    Once the training job starts running, the worker's init container tries to connect to the master Pod based on the parameters specified in the ConfigMap.

    An init container error may occur the first time you run the job because pulling the container image takes a while. In that case, deleting and recreating the PyTorchJob usually resolves the error.

    If the training fails with a CUDA (Compute Unified Device Architecture) out-of-memory error, decrease the values of per_device_train_batch_size and gradient_accumulation_steps in the ConfigMap to reduce VRAM consumption.

    It takes about one and a half hours to complete the training job. You can monitor training metrics (e.g., loss) on the Aim GUI, as shown in Figure 1.

    Figure 1: Aim GUI for visualizing training metrics.

    Serve the fine-tuned model

    The trained model is stored in model-volume as a LoRA adapter, which can be served with the base model on OpenShift AI.

    Create a new connection

    1. Go to the OpenShift AI console and create a new connection.
    2. Select URI - v1 as the Connection type and set pvc://model-volume/model/ as the URI (Figure 2).
    Figure 2: Creating a URI type data connection.

    Deploy the model

    1. Switch to the Models tab and deploy the model.
    2. Select vLLM ServingRuntime for KServe as the serving runtime.
    3. Select the created connection for the model source.
    4. Add the following arguments and the environment variable (Figure 3):
    --enable-lora
    --lora-modules=tuned-qwen=/mnt/models/
    --model=Qwen/Qwen2.5-7B-Instruct

    Set the HF_HUB_OFFLINE environment variable to 0, as shown in Figure 3. This allows the base model to be downloaded from Hugging Face Hub.

    Figure 3: Configuration parameters and environment variables.

    Once the model is deployed, the model API can be called (Figure 4).

    Figure 4: Deployed fine-tuned model and API endpoints.
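
    Because vLLM exposes an OpenAI-compatible API and the adapter was registered as tuned-qwen, a quick smoke test might look like the following; the endpoint URL and token are placeholders for your deployment:

    $ curl -k https://<inference-endpoint>/v1/chat/completions \
        -H "Authorization: Bearer <token>" \
        -H "Content-Type: application/json" \
        -d '{"model": "tuned-qwen", "messages": [{"role": "user", "content": "A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?"}]}'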

    Potential improvements

    In this article, we demonstrated how to fine-tune an LLM with the KFT Operator on OpenShift AI. Training jobs can be managed via PyTorchJob, and FMS HF Tuning simplifies running distributed training jobs. Additionally, the trained model can be served through OpenShift AI.

    There are potential areas for improvement in real use cases:

    • Integration with Kueue: OpenShift AI includes Kueue for training job management. Allocating cluster resources fairly across training jobs is important, and Kueue supports this need.
    • GPUDirect RDMA: Where available, connecting GPUs across nodes over a high-speed network is crucial for training efficiency. InfiniBand and RoCEv2 are popular options for this purpose and can significantly reduce overall training time.
    Last updated: April 7, 2025
