
Post-training methods for language models

LLM post-training in 2025: An overview of model customization options today

November 4, 2025
Mustafa Eyceoz, Addie Stevens
Related topics: Artificial intelligence
Related products: Red Hat AI

    Post-training represents one of the most active areas in large language model (LLM) development today. While pre-training establishes a model’s general understanding of language and world knowledge, post-training transforms that general foundation into something useful, safe, and domain-specific. This overview explores the current landscape of post-training methods, from supervised fine-tuning and continual learning to parameter-efficient and reinforcement learning approaches. It concludes with a look at how to get started using these methods through the open source Training Hub library.

    The basics: Pre-training

    Every language model begins as a collection of randomly initialized parameters, a blank neural canvas. Pre-training is the process of filling that canvas with broad linguistic and semantic knowledge drawn from diverse text corpora.

    The model starts with a vocabulary, a set of interpretable subwords or tokens that define its input and output space. Training proceeds by predicting the next token in a sequence given all tokens that came before it, a process known as next-token prediction.

    In Figure 1, consider the simple sentence "The dog eats the apples." During pre-training, the model learns to predict each word in turn: given the, predict dog; given the dog, predict eats, and so forth.

    Figure 1: Example of tokenized sentence for next-token prediction.

    In practice, the tokens are rarely clean word boundaries. Figure 2 illustrates that subword tokenization often results in smaller, less readable fragments that still represent meaningful units to the model.

    Figure 2: Subword tokenization visualization.
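
    To make this concrete, here is a small illustrative snippet (not from the article's training stack; it simply assumes the Hugging Face transformers library and uses the GPT-2 tokenizer as an arbitrary example) showing how a sentence breaks into subword tokens and IDs:

    from transformers import AutoTokenizer

    # Any tokenizer works here; GPT-2 is used purely as an example.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    text = "The dog eats the apples."
    print(tokenizer.tokenize(text))   # subword pieces, e.g. ['The', 'Ġdog', ...]
    print(tokenizer.encode(text))     # the integer token IDs the model actually sees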

    Through this process, the model learns the statistical relationships between tokens across billions of examples, enabling it to form coherent patterns, semantic links, and syntactic rules. As shown in Figure 3, the goal is to build a probability distribution over sequences that predicts the most likely next token given the preceding context. Once this general linguistic capability is established, post-training techniques can specialize and align the model for real-world tasks.

    Figure 3: Next-token probability distribution illustration.
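
    The training objective itself is cross-entropy over shifted tokens. The following minimal PyTorch sketch (illustrative only, with made-up token IDs and random logits standing in for a real model) shows the shape of the computation:

    import torch
    import torch.nn.functional as F

    vocab_size = 50_000
    token_ids = torch.tensor([[464, 3290, 25365, 262, 22514]])  # placeholder IDs for "The dog eats the apples"

    # Stand-in for model output: one logit vector per position.
    logits = torch.randn(1, token_ids.shape[1], vocab_size)

    # Position t is trained to predict token t+1.
    shift_logits = logits[:, :-1, :].reshape(-1, vocab_size)
    shift_labels = token_ids[:, 1:].reshape(-1)
    loss = F.cross_entropy(shift_logits, shift_labels)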

    From language to utility: The role of post-training

    Pre-training yields a model that can form sentences and predict plausible continuations. Post-training answers the question: how can this model be made useful?

    Post-training encompasses methods that teach a model to:

    • Follow human instructions
    • Answer questions and reason through problems
    • Align with safety, domain, or brand requirements
    • Retain knowledge while learning new data

    Supervised Fine-Tuning (SFT)

    The most fundamental post-training method is supervised fine-tuning. The goal is simple: given an input sequence, train the model to produce a corresponding output sequence.

    In Figure 4, an example input "What is the capital of France?" is paired with the target output "The capital is Paris." The model learns this through the same core next-token prediction objective; this time, the input serves as a given condition, and only the output sequence is learned.

    Figure 4: Probabilistic next-token distribution illustration.
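
    A common way to implement this split in practice (a generic sketch, not the exact mechanism of any particular library) is to build the label tensor so that prompt tokens are ignored by the loss, for example using PyTorch's -100 ignore index:

    import torch

    IGNORE_INDEX = -100  # positions with this label are excluded from the cross-entropy loss

    prompt_ids = [101, 2054, 2003, 1996, 3007]   # illustrative IDs for "What is the capital of France?"
    answer_ids = [1996, 3007, 2003, 3000, 102]   # illustrative IDs for "The capital is Paris."

    input_ids = torch.tensor(prompt_ids + answer_ids)
    labels = torch.tensor([IGNORE_INDEX] * len(prompt_ids) + answer_ids)
    # The model sees the full sequence, but only the answer tokens contribute to the loss.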

    However, natural dialogue introduces a complication. The model must distinguish between the roles of participants, identifying what comes from the user versus what the model should generate. Without this distinction, the model cannot correctly structure, or even follow, multi-turn conversations.

    Messages-formatted data

    To solve this, SFT relies on messages-formatted data, often referred to as instruction tuning or multi-turn chat tuning. Each message in a conversation is tagged by role (for example, user, assistant, system), enabling the model to understand who said what.

    Under the hood, these roles are represented by special tokens that mark message boundaries. Figure 5 shows how user and assistant tokens are inserted to separate dialogue turns. Different model families define these tokens uniquely, leading to interoperability challenges illustrated in Figure 6.

    Figure 5: Message format with special tokens (user and assistant).
    Figure 6: Variations in model-specific message formats.

    To simplify user interaction, models implement a chat template, a translation layer that converts natural text into the appropriate tokenized format. Figure 7 demonstrates how a raw prompt like "What is the capital of France?" is transformed internally into user and assistant sequences.

    Figure 7: Chat template transformation example.
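
    For models shipped with a Hugging Face tokenizer, this translation is exposed through apply_chat_template. A brief sketch follows (the checkpoint name is just an example, and the rendered special tokens will differ by model family):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # example chat model

    messages = [{"role": "user", "content": "What is the capital of France?"}]

    # Render the conversation into the model-specific token format.
    prompt_text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    print(prompt_text)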

    When preparing SFT data, each training example becomes a JSON line containing a list of message dictionaries with role and content pairs. Here is a canonical example:

    {
      "Messages": [
        {
          "role": "user",
          "content": "What is the capital of France?"
        },
        {
          "role": "assistant",
          "content": "The capital is Paris."
        }
      ]
    }

    This structure allows models to learn consistent dialogue patterns and participate in multi-turn exchanges effectively.

    Continual learning

    Once a model has been fine-tuned, organizations often need to extend it with new knowledge or domains over time, a process known as continual learning. For example, you might update a model with annual financial data or incorporate evolving internal documentation.

    The challenge lies in catastrophic forgetting. During standard fine-tuning, all weights are updated, and new learning can overwrite old knowledge. Current research explores methods that preserve general performance while accommodating incremental updates.

    Orthogonal Subspace Fine-Tuning (OSFT)

    One promising solution is Orthogonal Subspace Fine-Tuning (OSFT), a continual-learning approach that targets the least important model weights for updating. By identifying which parameters contribute least to previous tasks and general model performance, OSFT minimizes forgetting while learning new information.

    OSFT uses adaptive singular value decomposition (SVD) to decompose each weight matrix into ranked components. High-rank (large singular value) directions capture critical information, while low-rank components correspond to redundant or less significant details.

    During OSFT, a threshold defined by the unfreeze rank ratio determines which parts of the model remain trainable. As shown in Figure 8, setting this ratio to 1 reverts to full fine-tuning, while 0 freezes all parameters. Typical use cases strike a balance (for example, 0.5), allowing the model to learn new tasks while retaining essential prior knowledge.

    Figure 8: OSFT unfreeze rank ratio trade-off diagram.

    This tunable trade-off provides fine-grained control. Increasing the ratio improves learning capacity, while decreasing it protects existing performance.
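
    As a rough conceptual sketch (this is not the Training Hub implementation, only an illustration of the idea), the SVD split and the role of the unfreeze rank ratio can be pictured like this:

    import torch

    def split_weight_by_rank(W, unfreeze_rank_ratio):
        # Illustrative only: separate a weight matrix into high- and low-importance parts.
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        k = int(len(S) * (1 - unfreeze_rank_ratio))           # top-k directions stay frozen
        frozen = U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]
        trainable = U[:, k:] @ torch.diag(S[k:]) @ Vh[k:, :]
        return frozen, trainable

    W = torch.randn(512, 512)
    frozen_part, trainable_part = split_weight_by_rank(W, unfreeze_rank_ratio=0.5)
    # frozen_part + trainable_part reconstructs W; gradients would flow only through trainable_part.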

    Parameter-efficient fine-tuning (PEFT)

    Parameter-efficient fine-tuning (PEFT) methods further reduce memory and compute costs by modifying only a small subset of parameters or adding lightweight trainable components. These methods are ideal when the base model already contains sufficient knowledge and only minor adaptations are needed.

    Prompt tuning

    The simplest form of PEFT is prompt tuning (Figure 9). Instead of adjusting the model’s core weights, prompt tuning learns a small set of soft prompt vectors that are prepended to every input. These vectors act as a trainable instruction, guiding the model toward task-specific behavior without altering its internal representations.

    Figure 9: Prompt tuning architecture overview.
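
    The mechanism can be sketched in a few lines of PyTorch (a simplified illustration rather than any particular library's implementation): a small matrix of soft prompt embeddings is the only trainable tensor, and it is concatenated in front of the input embeddings before they reach the frozen model.

    import torch
    import torch.nn as nn

    hidden_size = 768      # illustrative model dimension
    num_soft_tokens = 20   # length of the learned soft prompt

    # The only trainable parameters in prompt tuning.
    soft_prompt = nn.Parameter(torch.randn(num_soft_tokens, hidden_size) * 0.02)

    def prepend_soft_prompt(input_embeds):
        # input_embeds: (batch, seq_len, hidden) -> (batch, num_soft_tokens + seq_len, hidden)
        batch_size = input_embeds.shape[0]
        expanded = soft_prompt.unsqueeze(0).expand(batch_size, -1, -1)
        return torch.cat([expanded, input_embeds], dim=1)
    # The frozen base model then processes the extended embedding sequence as usual.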

    Low-Rank Adaptation (LoRA)

    A more expressive PEFT technique is Low-Rank Adaptation (LoRA) (Figure 10). LoRA injects small trainable matrices into existing model layers. These matrices approximate the weight updates that would occur during full fine-tuning but with a fraction of the parameters.

    Figure 10: LoRA injection across model layers.

    Under the hood (Figure 11), LoRA decomposes the full weight update ΔW into the product of two smaller matrices A and B, such that ΔW ≈ A × B. For an n × n weight matrix, this reduces the number of trainable parameters from n² to 2nr, where the rank r is much smaller than n.

    Figure 11: Low-rank matrix decomposition for weight updates.
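
    In code, a LoRA-style layer looks roughly like the following (a simplified sketch, not the implementation of any specific library): the frozen base weight is applied as usual, and the trainable low-rank product is added on top.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        # Simplified LoRA layer: frozen base weight plus a trainable low-rank update.
        def __init__(self, in_features, out_features, rank=8, alpha=16):
            super().__init__()
            self.base = nn.Linear(in_features, out_features, bias=False)
            self.base.weight.requires_grad_(False)                   # frozen pre-trained weight
            self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(out_features, rank))   # zero init so the update starts at 0
            self.scaling = alpha / rank

        def forward(self, x):
            return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

    layer = LoRALinear(768, 768, rank=8)
    out = layer(torch.randn(2, 10, 768))   # only A and B receive gradients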

    While the approximation limits capacity compared to full fine-tuning, LoRA is highly efficient, often achieving comparable performance on lightweight tasks with significantly less training data and hardware cost.

    Modern extensions include QLoRA, which introduces quantization to further shrink memory usage, and OLoRA, which improves initialization stability through QR-based orthonormalization.

    Reinforcement learning for language models

    Another key branch of post-training is reinforcement learning (RL), where models improve through iterative feedback rather than static labels. RL enables guided exploration, helping models refine reasoning, factual accuracy, and alignment with human preferences.

    Reinforcement learning basics

    In classical RL (Figure 12), an agent interacts with an environment by taking actions, receiving rewards, and observing new states. Rewards quantify how desirable an output is according to predefined metrics or evaluators.

    Figure 12: Reinforcement learning loop (agent–state–reward).

    For example, consider the game of chess. The state is the current board position. The agent (or model) takes an action by moving a piece. Based on this action, the agent receives a reward from the environment (or engine) reflecting how good the move was, and then observes an updated board state after the opponent's reply. The goal of an RL system is to use that reward to guide the agent toward better actions going forward. In the context of language models, actions correspond to generated token sequences, and the input prompt serves as the state.

    Rules-based verifiers

    A practical way to apply RL to models is through rules-based verifiers. These deterministic evaluators automatically check whether a model’s response satisfies objective criteria (for instance, numeric correctness). This allows for scalable and automatic reward assignment, accelerating learning cycles. For example, a math-solving model might receive a reward of 1 for a correct answer and 0 otherwise. Through repeated interactions, it learns which reasoning patterns lead to higher expected rewards.
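
    Such a verifier can be only a few lines of Python. Here is an illustrative sketch (real verifiers usually handle units, formatting variations, and answer extraction more robustly):

    import re

    def math_verifier(model_response, expected_answer):
        # Reward 1.0 if the last number in the response matches the expected answer, else 0.0.
        numbers = re.findall(r"-?\d+(?:\.\d+)?", model_response)
        if not numbers:
            return 0.0
        return 1.0 if abs(float(numbers[-1]) - expected_answer) < 1e-6 else 0.0

    print(math_verifier("The answer is 42.", expected_answer=42))   # 1.0
    print(math_verifier("I think it's 41.", expected_answer=42))    # 0.0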

    Group Relative Policy Optimization (GRPO)

    One of the most common RL approaches for language and reasoning models today is Group Relative Policy Optimization (GRPO). As illustrated in Figure 13, GRPO begins by sampling multiple rollouts, different solution paths the model might take for a single prompt. Using a higher temperature encourages diversity in exploration.

    Figure 13: GRPO rollout generation and reward assignment.

    A rules-based verifier evaluates each rollout, assigning rewards (binary in our example: 1 for correct, 0 for incorrect). These rewards are then normalized to compute advantages, which weight how strongly the model should prefer certain reasoning paths.
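
    The group-relative normalization step itself is small. Here is a sketch using the common mean/standard-deviation form (details vary between GRPO implementations):

    import torch

    # Binary rewards from the verifier for eight rollouts of the same prompt.
    rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0])

    # Group-relative advantages: how much better each rollout is than the group average.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    print(advantages)   # positive for correct rollouts, negative for incorrect ones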

    This strategy offers flexibility:

    • For difficult tasks with few correct rollouts, the model heavily reinforces the rare successful paths.
    • For easy tasks where many rollouts succeed, rewards are spread evenly to maintain creative diversity.

    GRPO therefore improves both reasoning stability and correctness while preserving the model’s expressive capacity.

    Emerging extensions such as Dr-GRPO and DAPO further refine training stability and reward normalization. Other reinforcement approaches, such as Direct Preference Optimization (DPO) for offline alignment and Reinforcement Learning from Human Feedback (RLHF) for interactive tuning, continue to advance the state of post-training research.

    Beyond reinforcement learning

    Post-training ties into a number of additional emerging areas such as:

    • Model merging, which blends weights from multiple fine-tuned models to combine capabilities.
    • Post-training compression, including quantization and sparsification, which reduce model size while maintaining accuracy.
    • Inference optimization, focusing on serving, batching, and latency improvements.

    Together, these form a complete lifecycle: pre-training for language mastery, post-training for specialization, and inference for efficient deployment.

    Getting started with post-training

    Developers can experiment with these algorithms today through the open source Training Hub, an abstraction layer providing unified access to modern post-training methods.

    Training Hub offers a consistent Python interface that supports multiple community back-end implementations for algorithms and training configurations. It is available on GitHub and installable from PyPI.

    Currently, the library supports SFT and includes the first official OSFT implementation out of the box, with forthcoming support for reinforcement learning and parameter-efficient fine-tuning methods such as LoRA expected by the end of 2025.

    Training Hub in action

    Using Training Hub is straightforward. Each algorithm is represented as a Python function. For example, to perform supervised fine-tuning:

    from training_hub import sft
    sft(
        model_path="/path/to/model", # or Hugging Face repo/name
        data_path="/path/to/data",
        ckpt_output_dir="/path/to/save/checkpoints",
        num_epochs=3,
        learning_rate=1e-5,
    )

    This single function handles data loading, training, and checkpointing, as shown in the preceding code snippet. For advanced users, numerous configuration parameters are available (see example code snippet below) including batch size, maximum tokens per GPU, checkpoint intervals, and distributed training options. Multi-GPU training can be enabled by adjusting nproc_per_node, and multi-node training simply extends those settings across nodes.

    # Prepare all training parameters
    training_params = {
        # Required parameters
        'model_path': model_path,
        'data_path': data_path,
        'ckpt_output_dir': ckpt_output_dir,
        
        # Core training parameters
        'num_epochs': num_epochs,
        'effective_batch_size': effective_batch_size,
        'learning_rate': learning_rate,
        'max_seq_len': max_seq_len,
        'max_tokens_per_gpu': max_tokens_per_gpu,
        
        # Data and processing parameters
        'data_output_dir': data_output_dir,
        'warmup_steps': warmup_steps,
        
        # Checkpointing parameters
        'checkpoint_at_epoch': checkpoint_at_epoch,
        'save_samples': save_samples,
        'accelerate_full_state_at_epoch': accelerate_full_state_at_epoch,
        
        # Distributed training parameters
        'nproc_per_node': nproc_per_node,
        'nnodes': nnodes,
        'node_rank': node_rank,
        'rdzv_id': rdzv_id,
        'rdzv_endpoint': rdzv_endpoint,
    }

    For continual learning, the OSFT function provides the same interface (see the following example code snippet), introducing two additional parameters:

    • unfreeze_rank_ratio controls the proportion of trainable weights, as described earlier.
    • osft_memory_efficient_init reduces overhead during SVD-based decomposition.

    from training_hub import osft
    osft(
        model_path="/path/to/model", # or Hugging Face repo/name
        data_path="/path/to/data",
        ckpt_output_dir="/path/to/save/checkpoints",
        unfreeze_rank_ratio=0.25,        # <--- OSFT-specific
        osft_memory_efficient_init=True, # <--- OSFT-specific
        num_epochs=3,
        learning_rate=1e-5,
        effective_batch_size=16,
        max_tokens_per_gpu=2048,
        max_seq_len=1024,
    )

    This modular design will extend to future algorithms such as LoRA or GRPO, each sharing a common configuration schema with minimal algorithm-specific additions.

    Conclusion

    Language model post-training has evolved into a diverse ecosystem of methods that adapt, align, and tune foundational models for practical use. From supervised fine-tuning to continual learning, parameter-efficient adaptation, and reinforcement learning, each approach addresses a different dimension of customization, such as accuracy, memory efficiency, safety, and adaptability.

    With open source tools like Training Hub, these advanced techniques are now accessible to a broad developer community, which accelerates innovation in model alignment, reasoning, and deployment across enterprise and research environments.

    Check out the Training Hub GitHub repository.
