Since the rise of gen AI, many companies have been working to integrate large language models (LLMs) into their business processes to create value. One of the key challenges is providing domain-specific knowledge to LLMs. Many companies have chosen retrieval-augmented generation (RAG), which stores internal documents in a vector database and references that knowledge when querying the LLM. Another approach is fine-tuning, which slightly modifies the original model weights to incorporate new knowledge and skills.
In the past, fine-tuning LLMs was not an easy task for many organizations. It required a specialized training cluster and a broad range of technical expertise. However, the open source ecosystem has lowered the barrier to entry. For example, Hugging Face offers a variety of popular tools for training and customizing models, while Kubeflow provides a cloud-native approach to running training jobs across distributed containers.
In this article, we will demonstrate how the Red Hat OpenShift AI Kubeflow Training (KFT) Operator and open source tools enable us to fine-tune LLMs in a distributed environment.
All the resources are stored in this GitHub repository, and the trained model is stored in the Hugging Face repository.
Disclaimer
Fine-tuning LLMs with the Kubeflow Training Operator and SFT Trainer is still a Limited Availability feature in the latest OpenShift AI v2.18. If you need support for this feature, please contact Red Hat to obtain approval.
Prerequisites
We need to ensure that the following tools are available on the Red Hat OpenShift cluster:
- OpenShift AI operator with the KFT Operator.
- NVIDIA GPU Operator and Node Feature Discovery Operator.
- StorageClass that supports the ReadWriteMany (RWX) access mode.
The KFT Operator can be installed through the OpenShift AI Operator. Once the managementState of the DataScienceCluster is set to Managed, the OpenShift AI Controller will install the KFT Operator in the cluster as follows:
trainingoperator:
  managementState: Managed

Our OpenShift cluster is built on Amazon Web Services (AWS) and includes two g6e.xlarge instances as GPU nodes, each equipped with an NVIDIA L40S device. We will install the NVIDIA GPU Operator and the Node Feature Discovery Operator to enable these resources on the cluster.
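Before moving on, it is worth confirming that the operator and the GPU nodes are available. A minimal check, assuming the default redhat-ods-applications namespace and the nvidia.com/gpu.present label that the GPU Operator and Node Feature Discovery apply to GPU nodes:

# Confirm the Kubeflow Training Operator Pod is running
$ oc get pods -n redhat-ods-applications | grep -i training

# Confirm the GPU nodes have been labeled
$ oc get nodes -l nvidia.com/gpu.present=true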
When running training jobs on multiple nodes, the training dataset must be accessible from all of them simultaneously, so we require persistent volumes (PVs) with the RWX access mode. In this article, we will use Red Hat OpenShift Data Foundation CephFS for RWX storage.
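You can verify that an RWX-capable storage class exists before creating volumes; with OpenShift Data Foundation installed, the CephFS-backed class is typically named ocs-storagecluster-cephfs:

# List available storage classes and pick one that supports RWX
$ oc get storageclass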
Prepare the dataset
The fine-tuning dataset needs to be stored in a PV before training or downloaded from Hugging Face at the beginning of the training.
We will use the GSM8K dataset, which we have pre-stored in an object storage bucket. We will demonstrate how to download it to a PV before starting the fine-tuning.
Begin by creating Persistent Volume Claims (PVCs) as follows:
$ oc new-project fine-tuning
$ git clone https://github.com/JPishikawa/ft-by-sft/
$ oc apply -f ft-by-sft/deploy/storage/pvc.yaml

Create the my-storage Secret, which includes the credentials for object storage. Modify the Secret for your environment:

$ oc apply -f ft-by-sft/deploy/storage/secret.yaml

Create the Pod that downloads the dataset from object storage to the PV:
apiVersion: v1
kind: Pod
metadata:
  name: download-dataset
  labels:
    name: download-dataset
spec:
  volumes:
    - name: dataset-volume
      persistentVolumeClaim:
        claimName: dataset-volume
  restartPolicy: Never
  initContainers:
    - name: fix-volume-permissions
      image: quay.io/quay/busybox:latest
      command: ["sh"]
      args: ["-c", "chmod -R 777 /data/input"]
      volumeMounts:
        - mountPath: "/data/input/"
          name: dataset-volume
  containers:
    - name: download-data
      imagePullPolicy: IfNotPresent
      image: quay.io/modh/kserve-storage-initializer:rhoai-2.17
      args:
        - 's3://my-fine-tuning-trial/data/'
        - /data/input
      env:
        - name: STORAGE_CONFIG
          valueFrom:
            secretKeyRef:
              name: storage-config
              key: my-storage
      volumeMounts:
        - mountPath: "/data/input/"
          name: dataset-volume

The object bucket name, my-fine-tuning-trial, and the path, /data/, are specified in the container arguments. Once the Pod has completed downloading, the dataset is stored in dataset-volume.
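If you saved the manifest above to a file (for example, download-dataset.yaml, a name assumed here), you can create the Pod and confirm the download finished by checking its status and logs:

# Create the download Pod and watch it until it completes
$ oc apply -f download-dataset.yaml
$ oc get pod download-dataset -w

# Inspect the logs of the download container
$ oc logs download-dataset -c download-data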
Configuration
FMS HF Tuning is an open source Python library that wraps Hugging Face's SFT Trainer and PyTorch FSDP to run LLM fine-tuning jobs. We will use this library together with the KFT PyTorchJob to run distributed training jobs.
The FMS HF Tuning parameters are provided via the following ConfigMap:
kind: ConfigMap
apiVersion: v1
metadata:
  name: training-config
data:
  config.json: |
    {
      "accelerate_launch_args": {
        "main_process_ip": "kfto-demo-master-0",
        "main_process_port": 29500,
        "num_processes": 2,
        "num_machines": 2,
        "machine_rank": 0,
        "mixed_precision": "bf16",
        "use_fsdp": "true",
        "fsdp_sharding_strategy": 4,
        "rdzv_backend": "c10d"
      },
      "model_name_or_path": "Qwen/Qwen2.5-7B-Instruct",
      "training_data_path": "/data/input/train-00000-of-00001.parquet",
      "output_dir": "/data/output/tuning/qwen2.5-tuning",
      "save_model_dir": "/data/output/model",
      "num_train_epochs": 3,
      "per_device_train_batch_size": 4,
      "per_device_eval_batch_size": 4,
      "gradient_accumulation_steps": 16,
      "packing": "True",
      "gradient_checkpointing": "True",
      "save_strategy": "epoch",
      "learning_rate": 2e-05,
      "lr_scheduler_type": "constant",
      "include_tokens_per_second": true,
      "data_formatter_template": "### Question:\n{{question}}\n\n### Answer:\n{{answer}}<|im_end|>",
      "response_template": "### Answer:\n",
      "logging_strategy": "steps",
      "logging_steps": 0.2,
      "neftune_noise_alpha": 5,
      "use_flash_attn": true,
      "use_liger_kernel": "True",
      "peft_method": "lora",
      "r": 16,
      "lora_alpha": 32,
      "lora_dropout": 0.05,
      "bias": "none",
      "target_modules": ["all-linear"],
      "lora_post_process_for_vllm": true,
      "trackers": ["aim"],
      "experiment": "my-first-experiment",
      "aim_remote_server_ip": "aim.aim.svc.cluster.local",
      "aim_remote_server_port": "53800"
    }

In accelerate_launch_args, arguments for accelerate launch and the FSDP configuration are passed:
- main_process_ip: Headless Service name of the master Pod of the PyTorchJob.
- num_processes: The total number of GPUs.
- num_machines: The total number of GPU nodes.
- fsdp_sharding_strategy: FSDP sharding strategy; 4 is "HYBRID_SHARD".
Other parameters mostly come from Hugging Face's TrainingArguments:
- model_name_or_path: The base model name on Hugging Face Hub.
- training_data_path: The path to the training dataset stored in the attached PV.
- per_device_train_batch_size and gradient_accumulation_steps: Together these control the effective batch size; the product of these values should match the Tensor Core requirements (see the calculation after this list).
- peft_method: Using the Low-Rank Adaptation (LoRA) method to fine-tune the model.
- use_liger_kernel: Using the Liger Kernel to accelerate training and reduce video RAM (VRAM) usage.
- trackers: Using the Aim stack for experiment tracking.
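As a quick sanity check on the batch settings, the effective global batch size can be worked out from these values; the sketch below also applies the ConfigMap (the training-config.yaml file name is an assumption, so adjust it to wherever you saved the manifest):

# Apply the training configuration shown above
$ oc apply -f training-config.yaml

# Effective global batch size for this run:
#   per_device_train_batch_size (4) x gradient_accumulation_steps (16) x num_processes (2) = 128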
If you would like to try full-parameter fine-tuning, remove the LoRA-related configurations. It requires more VRAM to complete the training.
Experiment tracking
As described in the configuration section, we use the Aim stack for experiment tracking. Aim provides a visualization of key training metrics and is easy to integrate into FMS HF Tuning.
To deploy Aim on OpenShift, run the following commands:
$ cd ~/ft-by-sft/deploy/aim/
$ /bin/bash deploy.sh

This script creates resources in the aim namespace. Once the Aim Pod is running, you can access the Aim graphical user interface (GUI) via a route.
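The route host can be found with a standard query against the aim namespace:

# List the routes created by the deploy script and open the host in a browser
$ oc get route -n aim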
Running a training job
PyTorchJob is one of the custom resources provided by the KFT Operator. The following PyTorchJob creates a master Pod and a worker Pod for distributed training:
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: kfto-demo
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - env:
                - name: SFT_TRAINER_CONFIG_JSON_PATH
                  value: /etc/config/config.json
                - name: SET_NUM_PROCESSES_TO_NUM_GPUS
                  value: "false"
                - name: TORCH_NCCL_ASYNC_ERROR_HANDLING
                  value: "1"
                - name: PYTORCH_CUDA_ALLOC_CONF
                  value: "expandable_segments:True"
              image: 'quay.io/jishikaw/fms-hf-tuning:latest'
              imagePullPolicy: IfNotPresent
              name: pytorch
              ports:
                - containerPort: 29500
                  name: pytorchjob-port
              resources:
                limits:
                  nvidia.com/gpu: 1
              volumeMounts:
                - mountPath: /etc/config
                  name: config-volume
                - mountPath: /data/input
                  name: dataset-volume
                - mountPath: /data/output
                  name: model-volume
                - mountPath: /.cache
                  name: cache-volume
                - mountPath: "/dev/shm"
                  name: dshm
          volumes:
            - configMap:
                items:
                  - key: config.json
                    path: config.json
                name: training-config
              name: config-volume
            - persistentVolumeClaim:
                claimName: dataset-volume
              name: dataset-volume
            - name: model-volume
              persistentVolumeClaim:
                claimName: model-volume
            - name: cache-volume
              persistentVolumeClaim:
                claimName: cache-volume
            - name: dshm
              emptyDir:
                medium: Memory
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - env:
                - name: SFT_TRAINER_CONFIG_JSON_PATH
                  value: /etc/config/config.json
                - name: SET_NUM_PROCESSES_TO_NUM_GPUS
                  value: "false"
                - name: TORCH_NCCL_ASYNC_ERROR_HANDLING
                  value: "1"
                - name: PYTORCH_CUDA_ALLOC_CONF
                  value: "expandable_segments:True"
              image: 'quay.io/jishikaw/fms-hf-tuning:latest'
              imagePullPolicy: IfNotPresent
              name: pytorch
              ports:
                - containerPort: 29500
                  name: pytorchjob-port
              resources:
                limits:
                  nvidia.com/gpu: 1
              volumeMounts:
                - mountPath: /etc/config
                  name: config-volume
                - mountPath: /data/input
                  name: dataset-volume
                - mountPath: /data/output
                  name: model-volume
                - mountPath: /.cache
                  name: cache-volume
                - mountPath: "/dev/shm"
                  name: dshm
          volumes:
            - configMap:
                items:
                  - key: config.json
                    path: config.json
                name: training-config
              name: config-volume
            - persistentVolumeClaim:
                claimName: dataset-volume
              name: dataset-volume
            - name: model-volume
              persistentVolumeClaim:
                claimName: model-volume
            - name: cache-volume
              persistentVolumeClaim:
                claimName: cache-volume
            - name: dshm
              emptyDir:
                medium: Memory
  runPolicy:
    suspend: false

Once the training job starts running, the worker's init container tries to connect to the master Pod based on the parameters specified in the ConfigMap.
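To submit and monitor the job, standard oc commands are sufficient. A minimal sketch, assuming the PyTorchJob manifest is saved as pytorchjob.yaml; the label selector shown is the one the Training Operator normally adds to its Pods:

# Submit the training job
$ oc apply -f pytorchjob.yaml

# Check the job and its Pods
$ oc get pytorchjob kfto-demo
$ oc get pods -l training.kubeflow.org/job-name=kfto-demo

# Follow the training logs on the master Pod
$ oc logs -f kfto-demo-master-0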
An init container error may occur when running the job for the first time because pulling the container image takes a while. In that case, deleting and recreating the PyTorchJob should resolve the error.
If the training fails with a CUDA (Compute Unified Device Architecture) out-of-memory error, decrease the values of per_device_train_batch_size and gradient_accumulation_steps in the ConfigMap to reduce VRAM consumption.
It takes about one and a half hours to complete the training job. You can monitor training metrics (e.g., loss) on the Aim GUI, as shown in Figure 1.

Serve the fine-tuned model
The trained model is stored in model-volume as a LoRA adapter, which can be served with the base model on OpenShift AI.
Create a new connection
- Go to the OpenShift AI console and create a new connection.
- Select URI - v1 as the Connection type and set pvc://model-volume/model/ as the URI (Figure 2).

Deploy the model
- Switch to the Models tab and deploy the model.
- Select vLLM ServingRuntime for KServe as the serving runtime.
- Select the created connection for the model source.
- Add the following arguments and the environment variable (Figure 3):
--enable-lora
--lora-modules=tuned-qwen=/mnt/models/
--model=Qwen/Qwen2.5-7B-Instruct

Set HF_HUB_OFFLINE to 0 as shown in Figure 3. This allows the base model to be downloaded from Hugging Face Hub.

Once the model is deployed, the model API can be called (Figure 4).
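As a rough illustration, the request below calls the vLLM OpenAI-compatible chat completions endpoint and selects the LoRA adapter by the name registered via --lora-modules. MODEL_ROUTE is a placeholder for your inference endpoint, and an Authorization header may be required depending on how the deployment is secured:

$ export MODEL_ROUTE=<your-inference-endpoint>   # placeholder
$ curl -sk https://$MODEL_ROUTE/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "tuned-qwen",
          "messages": [
            {"role": "user", "content": "A train travels 60 miles per hour for 2.5 hours. How far does it travel?"}
          ]
        }'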

Potential improvements
In this article, we demonstrated how to fine-tune an LLM with the KFT Operator on OpenShift AI. Training jobs can be managed via PyTorchJob, and FMS HF Tuning helps run distributed training jobs in a simple way. Additionally, the trained model can be served through OpenShift AI.
There are potential areas for improvement in real use cases:
- Integration with Kueue: OpenShift AI includes Kueue for training job management. It is important to allocate cluster resources fairly to each training job, and Kueue supports this need.
- GPUDirect RDMA: If available, connecting GPUs across different nodes with a high-speed network is crucial for training efficiency. InfiniBand and RoCEv2 are popular options for this purpose and can help reduce the overall training time.