Prepare and label custom datasets with Label Studio

This is the fifth article in our “computer vision at the edge” series. In it, we prepare and label a custom dataset. The full series includes:

How to install single node OpenShift on AWS
How to install single node OpenShift on bare metal
Red Hat OpenShift AI installation and set up
Model training in Red Hat OpenShift AI
Prepare and label custom datasets with Label Studio
Deploy computer vision applications at the edge with MicroShift

Data labeling with Label Studio

Data labeling is the practice of annotating raw data, such as images, with the information a model is expected to learn to predict. It is crucial for supervised training because the annotations provide the ground truth that algorithms learn from: models pick up patterns from labeled examples and use them to make predictions or classifications. Accurate, consistent labels also help models generalize to unseen examples, which improves their performance and reliability in real-world applications. Ultimately, robust data labeling makes machine learning systems more trustworthy and valuable across industries and domains.

Label Studio is an open source data labeling tool that supports a wide range of machine learning tasks. Its intuitive interface makes labeling efficient, and because it is open source, it can be customized and integrated with existing workflows, which makes it accessible to both beginners and experienced data annotators. These qualities make it an excellent choice for what we are trying to achieve: annotating a dataset quickly and accurately.

If you are reading this, you are probably interested in how to prepare and label a custom dataset. We will use Label Studio to locate objects in images and draw the corresponding bounding boxes around them. In the next few sections, we will deploy the application on our single node OpenShift cluster and use it to prepare the custom dataset that will later be consumed during YOLO model training.

Label Studio deployment

As introduced earlier, Label Studio can be installed on a wide variety of infrastructures, both on-premise and in the cloud, using different installation methods. The one that best fits our environment is to deploy the application and its dependent components directly in our Red Hat OpenShift cluster.

First of all, we need to create a new namespace where we will deploy all the resources associated with the labeling tool.

oc new-project labelstudio

Once the new project is available, we can deploy the different components that make up the tool.

PostgreSQL

In this setup, Label Studio stores its metadata in a PostgreSQL database, so we need to deploy one first. In this step, we will show you how to get that database up and running in your environment.

Start by creating and applying the database ConfigMap:

vi db_configmap.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-config
  labels:
    app: postgres
data:
  POSTGRES_DB: db
  POSTGRES_USER: postgres
  POSTGRES_PASSWORD: postgres

oc apply -f db_configmap.yaml

This database will require a persistent volume to store the data. We can either create it from the web console or directly apply the following file:

vi db_pvc.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi

oc apply -f db_pvc.yaml

Now that the volume claim is created, we can proceed with the PostgreSQL deployment:

vi db_deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:alpine
          imagePullPolicy: "IfNotPresent"
          ports:
            - containerPort: 5432
          envFrom:
            - configMapRef:
                name: postgres-config
          env:
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
          volumeMounts:
            - mountPath: /var/lib/postgresql/data
              name: postgres-vol
      volumes:
        - name: postgres-vol
          persistentVolumeClaim:
            claimName: postgres-pvc

oc apply -f db_deployment.yaml

Finally, create the service that exposes the database:

vi db_service.yaml

apiVersion: v1
kind: Service
metadata:
  name: postgres
  labels:
    app: postgres
spec:
  type: NodePort
  ports:
    - port: 5432
  selector:
    app: postgres

oc apply -f db_service.yaml

Label Studio

With the database and its storage in place, we can continue with the Label Studio application itself.

The images and annotations created with Label Studio will be stored on our node, so we need to create another PVC for them:

vi ls_pvc.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: labelstudio-data-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi

oc apply -f ls_pvc.yaml

Apply the deployment. This pulls the latest label-studio image, mounts the PVC, and connects the application to the PostgreSQL database:

vi ls_deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: labelstudio
spec:
  replicas: 1
  selector:
    matchLabels:
      component: labelstudio
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        component: labelstudio
    spec:
      containers:
        - name: labelstudio
          image: heartexlabs/label-studio:latest
          imagePullPolicy: Always
          stdin: true
          tty: true
          env:
            - name: DJANGO_DB
              value: default
            - name: POSTGRE_NAME
              value: postgres
            - name: POSTGRE_USER
              value: postgres
            - name: POSTGRE_PASSWORD
              value: postgres
            - name: POSTGRE_PORT
              value: "5432"
            - name: POSTGRE_HOST
              value: postgres
          volumeMounts:
            - name: labelstudio-data-vol
              mountPath: /label-studio/data
      volumes:
        - name: labelstudio-data-vol
          persistentVolumeClaim:
            claimName: labelstudio-data-pvc

oc apply -f ls_deployment.yaml
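The POSTGRE_* variables in this deployment have to agree with the values in the postgres-config ConfigMap. As a rough sanity check that is not part of the original deployment, the following Python sketch compares the two sets of values copied from the manifests in this article. The pairing of variable names is an assumption: it treats POSTGRE_NAME as the database name and therefore pairs it with POSTGRES_DB. The function name find_mismatches is hypothetical.

```python
# Values copied from the manifests in this article.
postgres_configmap = {
    "POSTGRES_DB": "db",
    "POSTGRES_USER": "postgres",
    "POSTGRES_PASSWORD": "postgres",
}

labelstudio_env = {
    "DJANGO_DB": "default",
    "POSTGRE_NAME": "postgres",
    "POSTGRE_USER": "postgres",
    "POSTGRE_PASSWORD": "postgres",
    "POSTGRE_PORT": "5432",
    "POSTGRE_HOST": "postgres",
}

# Assumed correspondence between the database-side and application-side names.
PAIRS = {
    "POSTGRES_DB": "POSTGRE_NAME",
    "POSTGRES_USER": "POSTGRE_USER",
    "POSTGRES_PASSWORD": "POSTGRE_PASSWORD",
}


def find_mismatches(configmap, env, pairs=PAIRS):
    """Return (db_key, app_key, db_value, app_value) for every pair that differs."""
    mismatches = []
    for db_key, app_key in pairs.items():
        db_value = configmap.get(db_key)
        app_value = env.get(app_key)
        if db_value != app_value:
            mismatches.append((db_key, app_key, db_value, app_value))
    return mismatches


if __name__ == "__main__":
    for db_key, app_key, db_value, app_value in find_mismatches(
        postgres_configmap, labelstudio_env
    ):
        print(f"{db_key}={db_value!r} does not match {app_key}={app_value!r}")
```

With the manifests as written, the check reports that POSTGRES_DB is db while POSTGRE_NAME is postgres. Whether that matters depends on which databases exist in your PostgreSQL instance; aligning the two values is the safer option.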

Create the service that will expose port 8080:

vi ls_service.yaml

apiVersion: v1
kind: Service
metadata:
  name: labelstudio
spec:
  ports:
    - port: 8080
  selector:
    component: labelstudio
  clusterIP: None

oc apply -f ls_service.yaml

As a last step, we are going to create a route that will make the application accessible from our browser:

vi ls_route.yaml

apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: label-studio-route
spec:
  path: /
  to:
    kind: Service
    name: labelstudio
  port:
    targetPort: 8080

oc apply -f ls_route.yaml

We have just finished the Label Studio deployment. It’s time to start playing with it and prepare our custom dataset. Run this command to get the route we just created:

oc get route

NAME                 HOST/PORT                                                   PATH   SERVICES      PORT
label-studio-route   label-studio-route-labelstudio.sno.pemlab.rdu2.redhat.com   /      labelstudio   8080

Access the route from a web browser. If everything was configured correctly, the Label Studio login page should show (Figure 1). You can log in by creating a new account or using an existing one.

Figure 1: Label Studio’s home page to log in or sign up to the application.

Labeling a custom dataset

Once you've accessed the webpage and logged in, we can begin labeling our custom dataset.

First, create the project by selecting the Create Project button.
In the Project Name tab, enter a name that suits your dataset; in our case, we will name it Custom dataset.
Next, import the images you want to label in the Data Import tab.
On this page, you can either paste a URL to your dataset images or select Upload to add images from your computer.
Navigate to the Labeling Setup tab to select the template we will use to label our images.
Select Object Detection with Bounding Boxes. This opens a new wizard where we configure our classes.
From there, create the labels for your custom dataset: delete the existing labels, type the new ones, and click Add. In my case, I want to detect different aircraft, so my labels are A380 and B747, as shown in Figure 2.

Figure 2: Creation of two new custom labels for the A380 and B747 classes.
After that, we'll select Save in the upper-right corner to begin labeling our data.

From the project dashboard, select Label all tasks, which will take you to the first image to label. To select a label, you can either click the corresponding label or press the number on your keyboard that corresponds to the label. For example, the A380 is labeled as 1. After pressing 1, click and drag on the image to create a bounding box where the aircraft is located (Figure 3).

Figure 3: Generating a bounding box for the A380 aircraft.
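To make the click-and-drag step concrete, here is a minimal Python sketch of how a box drawn between two pixel corners can be expressed in the normalized form that YOLO label files use: center x, center y, width, and height, each divided by the image size. This only illustrates the arithmetic and is not how Label Studio is implemented internally. The function name pixel_box_to_yolo and the sample numbers are hypothetical.

```python
def pixel_box_to_yolo(x1, y1, x2, y2, image_width, image_height):
    """Convert two dragged corners (in pixels) to normalized YOLO values.

    Returns (x_center, y_center, width, height), each in the range 0..1.
    The corners may be given in any order, as with a drag in any direction.
    """
    left, right = sorted((x1, x2))
    top, bottom = sorted((y1, y2))
    x_center = (left + right) / 2 / image_width
    y_center = (top + bottom) / 2 / image_height
    width = (right - left) / image_width
    height = (bottom - top) / image_height
    return (x_center, y_center, width, height)


if __name__ == "__main__":
    # A box dragged from (100, 50) to (300, 150) on a 400x200 image.
    print(pixel_box_to_yolo(100, 50, 300, 150, 400, 200))
```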

Remember to create a separate box for each aircraft present in the image. If both aircraft types appear in the same image, make sure you select the corresponding label for each one. When done, select the next image on the left side of the screen. Figure 4 shows another example for a B747 (class 2).

Once you are done labeling, select Submit to save the annotations, and return to the project dashboard. In the upper-right corner, select Export. Since we will use this data to train YOLO object detection models, export the data in the YOLO format (Figure 5).

Figure 5: Selection of YOLO format for the dataset export.

This triggers the download of the prepared and labeled dataset. When the download finishes, unzip the file. Now, let’s take a look at what it includes:

/images: contains the original images.
/labels: contains one text file per image. Each line in the file describes one bounding box: the class number followed by the box's center coordinates, width, and height, all normalized to the image size.
classes.txt: lists the labels in order. In our case: A380, B747.
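To check an exported label file by hand, a short Python sketch like the one below can help. It assumes the standard YOLO text layout, where each line holds five whitespace-separated values (class number, center x, center y, width, height) with coordinates normalized to 0..1, and where the class number is a zero-based index into classes.txt. The helper names and the sample line are hypothetical.

```python
from typing import NamedTuple


class YoloBox(NamedTuple):
    class_id: int
    x_center: float
    y_center: float
    width: float
    height: float


def parse_yolo_line(line):
    """Parse one line of a YOLO label file into a YoloBox."""
    parts = line.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 values, got {len(parts)}: {line!r}")
    return YoloBox(int(parts[0]), *map(float, parts[1:]))


def to_pixel_corners(box, image_width, image_height):
    """Convert a normalized box to (left, top, right, bottom) in pixels."""
    half_w = box.width * image_width / 2
    half_h = box.height * image_height / 2
    center_x = box.x_center * image_width
    center_y = box.y_center * image_height
    return (center_x - half_w, center_y - half_h,
            center_x + half_w, center_y + half_h)


def load_class_names(classes_text):
    """Return the label names from the contents of classes.txt, in order."""
    return [name.strip() for name in classes_text.splitlines() if name.strip()]


if __name__ == "__main__":
    names = load_class_names("A380\nB747\n")
    box = parse_yolo_line("1 0.5 0.5 0.25 0.5")  # made-up sample line
    print(names[box.class_id], to_pixel_corners(box, 640, 480))
```

Note that under this assumption the class numbers in the label files start at 0, so they are one lower than the keyboard shortcuts used while labeling.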

Now that we have our dataset ready, we can push all these files into a Git repository to be imported to Red Hat OpenShift AI.
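Before pushing the files, it can be worth confirming that every image has a matching label file. The Python sketch below assumes the unzipped export has images and labels folders and that each label file shares its base name with its image (for example, plane1.jpg and plane1.txt). The function name find_unpaired is hypothetical.

```python
from pathlib import Path


def find_unpaired(dataset_dir):
    """Return (images_without_labels, labels_without_images) as sorted name lists.

    Names are file stems, that is, file names without their extension.
    """
    root = Path(dataset_dir)
    image_stems = {p.stem for p in (root / "images").iterdir() if p.is_file()}
    label_stems = {p.stem for p in (root / "labels").glob("*.txt")}
    missing_labels = sorted(image_stems - label_stems)
    missing_images = sorted(label_stems - image_stems)
    return missing_labels, missing_images
```

Calling find_unpaired on the unzipped folder returns two lists. If both are empty, images and labels line up; an image listed in the first one has no label file, which may simply mean it was never annotated and submitted.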

Video demo

Watch the following video demo to see how to prepare a custom dataset for AI/ML model training.

Next steps

In this article, you learned about the importance of accurately labeled data. To help us put this into practice, we have deployed Label Studio in our OpenShift cluster to label our custom data.

In the next article, we will train a YOLO object detection model with our newly labeled data using Red Hat OpenShift AI.

Last updated: May 23, 2024
