TrustyAI Detoxify: Guardrailing LLMs during training

Detoxifying or preventing toxic content generation from large language models (LLMs) is challenging. Data used to train these models is usually scraped from the internet which often contains toxic content. Without proper guardrails, a model can learn undesirable properties and in turn, generate toxic text. Removing toxic samples from training data can be expensive as it usually requires data annotators to manually identify samples that align with human values. Inherent bias in the annotators themselves can also affect the data labeling process in a negative way. By using the open source project TrustyAI Detoxify in conjunction with Hugging Face's SFTTrainer (Supervised Fine-Tuning Trainer) on Red Hat OpenShift AI, we can help lower the costs of detoxifying LLMs during training.

In this article, we will provide step-by-step guidance on how you can use these open source technologies to detoxify a model.

What is TrustyAI Detoxify?

TrustyAI Detoxify is a library of algorithms and tools for detecting and rephrasing hate speech, abuse, and profanity in LLM generated text. It uses a pair of expert and anti-expert models to generate disagreement scores for next token predictions which approximate toxicity. It masks and rephrases tokens with the highest disagreement scores, reducing the toxicity of the text.

How we can leverage SFTTrainer to optimize LLM detoxification?

SFTTrainer simplifies the process of supervised fine-tuning for LLMs, which is a critical step for training LLMs to be useful assistants/chatbots. Typically, for supervised fine-tuning, one would need to manually prompt tune their dataset to be in instruction or conversation format before model training. SFTTrainer takes care of this step plus training in only a few lines of code.

Furthermore, SFTTrainer supports QLoRA (Quantization and Low-Rank Adapters) which is a Parameter-Efficient Fine-Tuning (PEFT) technique that optimizes memory usage. Training LLMs can be memory intensive given the number of parameters that are updated during the process. With QLoRA, one can compress the model’s weights down to 4 bits. We freeze all of the weights in the original model and add a relatively small amount of trainable parameters to it in the form of adapters. During fine tuning, only the adapter weights are updated.

In the demo below, we’ll use TrustyAI on a Red Hat OpenShift Kubernetes cluster with 2 NVIDIA GPUs within a Jupyter environment.

Demo set up

First, let’s set up your working environment as a project within OpenShift AI.

Log in to the OpenShift AI dashboard on your OpenShift cluster.
Navigate to Data Science Projects.
Click the Create data science project button.
Give your project a name, for example, "detoxify-sft".
Finally, click Create.

Create Workbench

You can define the cluster size and compute resources needed to run the workload.

Click the Workbenches tab and create a workbench with the following specifications:
Name: detoxify-sft
Image selection: TrustyAI
Version selection: 2024.1
Container size: Large
Accelerator: NVIDIA GPU
Number of accelerators: 2
Click Create workbench. You will be redirected to the project dashboard where the workbench is starting. Wait a few minutes for your workbench status to change from Starting to Running.
Access your workbench by clicking Open.

Set up development environment

Once you are in your Jupyter environment, click the Git icon on the left side of the screen. Click Clone a repository and paste the Detoxify SFT repository URL.

Our first step is to install the required Hugging Face libraries including Transformer Reinforcement Learning (TRL), Transformers, and Datasets. Navigate to detoxify-sft -> notebooks -> 1-sft.ipynb. A requirements.txt file has been preconfigured with the required libraries and their correct versions:

!pip install -r requirements.txt

TrustyAI Detoxify: Guardrailing LLMs during training

Share:

What is TrustyAI Detoxify?

How we can leverage SFTTrainer to optimize LLM detoxification?

Demo set up

Create Workbench

Set up development environment

Create and prepare the dataset

Fine-tune LLM using SFTTrainer

Model evaluation

Results

Conclusion

Products

Build

Quicklinks

Communicate

RED HAT DEVELOPER

Red Hat legal and privacy links

Red Hat legal and privacy links

Report a website issue