An Introduction to Hyperparameter Optimization
In the dynamic world of machine learning, optimizing model performance is not just a goal—it's a necessity. This comprehensive guide aims to simplify the intricate process of hyperparameter optimization, leveraging the power of OpenShift AI, Ray Tune, and Model Registry to enhance model accuracy and efficiency. This guide is meticulously detailed based on the example code provided in this repository, offering a practical and hands-on approach to the optimization process.
Setup
Before embarking on this journey, it's essential to have the right tools and resources at your disposal. You'll need:
- An OpenShift cluster (4.0+) with OpenShift AI (RHOAI) 2.10+ installed:
- The codeflare, dashboard, ray and workbenches components enabled;
- Sufficient worker nodes for your configuration(s)
- An AWS S3 bucket to store experimentation results;
Setting the Stage: Setting up a Data Science Project and Ray Clusters on OpenShift AI
The initial step in our optimization journey is setting up our Data Science project within the OpenShift AI cluster. To get started, ensure you have the Red Hat OpenShift AI operator installed from the Operator Hub. Once installed, this operator becomes available as a service, facilitating the creation and management of Data Science projects.
Assuming the installation is complete, access the OpenShift AI dashboard from the top navigation bar menu:
After this, create a new Data Science project. This is where you will be able to view and manage all of the workbenches you create.
Next, create a cluster storage (if one does not already exist), which will hold local data for the project.
Once this is complete, create a new workbench with a standard data science image. Below is an example of the workbench settings. The notebook image selection should be set to the latest version to avoid compatibility issues. If you wish to use preexisting persistent storage, change the configuration as necessary. When the workbench is operational, you can directly access the Jupyter notebook environment.
After creating the workbench, we can add a data connection with the relevant details. If you are simply running this as an example and wish to create local MinIO storage (not meant for production), feel free to follow the steps found here.
Assuming all the previous steps have been followed, we can now open the workbench and access the Jupyter notebook environment. Here we will clone the relevant repository and begin using Ray Tune to perform hyperparameter optimization.
After cloning, you will see three different examples in the examples folder. In these examples, you will use the CodeFlare SDK to configure and launch a Ray cluster, ensuring it is fully equipped to handle the demands of our machine learning tasks. OpenShift AI optimizes distributed workloads by integrating Ray clusters: a collection of worker nodes connected to a central Ray head node, enabling the efficient execution of distributed workloads.
To tailor the Ray cluster to our needs, we specify the CPU and memory resources allocated to each node. Once configured, we bring up the cluster, ready to be utilized for our Hyperparameter Optimization (HPO) tasks. This setup ensures that our project is not only well-organized but also prepared to handle the computational demands of our optimization process.
# Create and configure our cluster object (and appwrapper)
from codeflare_sdk import Cluster, ClusterConfiguration

cluster = Cluster(ClusterConfiguration(
    name='terrestial-raytest',
    num_workers=2,          # two Ray worker nodes
    min_cpus=1,             # CPU request per worker
    max_cpus=1,             # CPU limit per worker
    min_memory=4,           # memory request per worker (GiB)
    max_memory=4,           # memory limit per worker (GiB)
    num_gpus=0,
    image="quay.io/rhoai/ray:2.23.0-py39-cu121"
))
An example of the configuration using the CodeFlare SDK in /demos/raytune-oai-demo.ipynb
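Once the configuration is defined, the cluster can be brought up and its Ray endpoint captured for the next step. Below is a minimal sketch using standard CodeFlare SDK calls; the exact cells in the notebook may differ slightly.
# Bring up the Ray cluster and wait until it is ready to accept work
cluster.up()
cluster.wait_ready()

# Inspect the cluster and capture its Ray endpoint for ray.init() below
cluster.details()
ray_cluster_uri = cluster.cluster_uri()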
In the next few sections we will discuss the code contained in the examples folder of this repository. Feel free to follow along with the code contained in the notebooks.
The Heart of the Matter: Hyperparameter Optimization with Ray Tune
In this demo we're focusing on finding the optimal hyperparameters for a simple neural network model using Ray Tune. This involves tuning two key parameters: hidden_size and learning_rate. Given that we're working from a PyTorch example, it's crucial to ensure that all necessary packages, including torch and Ray Tune, are installed in the cluster environment. We do this by passing a runtime environment to ray.init() when connecting to the cluster, so the packages are installed on the cluster for us.
# Additional libs to install in the Ray runtime environment
import ray

runtime_env = {"pip": ["ipython", "torch", "onnx", "ray[train]", "protobuf==3.20.1"],
               "env_vars": {"PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION": "python"}}

# ray_cluster_uri comes from cluster.cluster_uri() above
ray.init(address=ray_cluster_uri, runtime_env=runtime_env)
print("Ray cluster is up and running: ", ray.is_initialized())
Ensuring the environment is correctly set up
Once the Ray Clusters are operational, we proceed to tune the model with Ray Tune, specifying the number of samples (trials) we wish to run. After each trial concludes, we return the model's accuracy and the trial model itself.
Upon completion of the Ray Tune process, we're presented with the best trial and the corresponding optimal hyperparameters. Ray Tune enables us to explore a broad spectrum of hyperparameters, testing various combinations to identify the one that achieves the highest accuracy. There are multiple strategies we can employ within Ray Tune to enhance our tuning process.
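For orientation, here is a minimal sketch of what that tuning step looks like with Ray Tune's Tuner API. The trainable below is a stand-in for the notebook's actual PyTorch training loop, and the search space mirrors the two hyperparameters mentioned above; the dummy accuracy formula is purely illustrative.
from ray import train, tune

def train_simple_nn(config):
    # Stand-in for the real PyTorch training loop: in the notebook, a small
    # network with config["hidden_size"] units is trained using
    # config["learning_rate"], and its validation accuracy is reported.
    accuracy = 1.0 / (1.0 + abs(config["learning_rate"] - 0.01))  # dummy score
    train.report({"accuracy": accuracy})

tuner = tune.Tuner(
    train_simple_nn,
    param_space={
        "hidden_size": tune.choice([16, 32, 64, 128]),
        "learning_rate": tune.loguniform(1e-4, 1e-1),
    },
    tune_config=tune.TuneConfig(metric="accuracy", mode="max", num_samples=10),
)
results = tuner.fit()
best = results.get_best_result()
print("Best config:", best.config, "accuracy:", best.metrics["accuracy"])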
The Importance of Metadata
Metadata in HPO experiments is the goldmine of insights. It includes details about each trial's configuration, performance metrics, and even the state of the model at the end of the trial. This information is invaluable for understanding the optimization process, identifying trends, and refining future experiments.
What is the Model Registry?
If we view the example raytune-oai-MR-gRPC-demo.ipynb, we can see that it utilizes the Model Registry. The Model Registry is a central repository where model developers store and manage model versions and artifact metadata. This Go-based application leverages the ml_metadata project and provides a Python API for ease of use. It is a key component in the process of managing models. In our use case it is particularly useful for managing the large number of model versions that are generated and used.
Integrating Model Registry for Metadata Management
To seamlessly integrate Model Registry into our HPO process, it's crucial to confirm its setup and readiness.
To utilize the default Model Registry service, install the model registry operator and start the service as per the instructions provided here.
Upon integration, each trial's metadata is captured and stored via the Model Registry. This encompasses the hyperparameters used, the performance metrics achieved, and any other pertinent details. The Model Registry ensures that this data is well organized, easily accessible, and ready for analysis.
In our example code, we generate various types of metadata, including:
- kf.HPOConfig (Artifact) for saving HPO configurations.
- kf.HPOExperiment (Context) for saving HPO experiment details, serving as a parent to HPOTrial.
- kf.HPOTrial (Context) for saving trial information for each experiment, acting as a child to HPOExperiment.
This metadata is saved as part of the HPO run, facilitating a comprehensive analysis. The example code uses the Python gRPC API to access the Model Registry metadata, with REST API support planned for the future. This allows different trials to be compared and the top optimized models to be identified in real time, even while the HPO experiment continues running other trials.
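Conceptually, each finished trial is turned into a small metadata record and pushed to the registry. The sketch below only illustrates the shape of that record; registry_client and its save_trial method are hypothetical placeholders standing in for the notebook's actual gRPC calls against the kf.HPOExperiment, kf.HPOTrial, and kf.HPOConfig types listed above.
# Hypothetical sketch: push each trial's metadata to the registry. The notebook
# does the equivalent through the Model Registry's python gRPC API.
for result in results:                                       # `results` from tuner.fit() above
    trial_record = {
        "hyperparameters": result.config,                    # hidden_size, learning_rate
        "metrics": {"accuracy": result.metrics["accuracy"]},
        "artifact_uri": result.checkpoint.path if result.checkpoint else None,
    }
    registry_client.save_trial("hpo-experiment", trial_record)   # hypothetical client call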
Enhancing Model Deployment
Having a comprehensive record of our HPO experiments is not just about understanding past performance. It also informs our future model deployments. By analyzing the metadata, we can identify the most effective configurations and apply them to new models. This ensures that our deployments are not just successful but also optimized for the best possible performance.
The Final Frontier: Saving and Sharing the Best Model
Upon identifying the optimal model, it's time to make it known to the world. We choose to save our model in ONNX format, a universal standard that guarantees compatibility across a wide range of platforms and frameworks. This step is pivotal for deploying our model in various environments, ensuring its accessibility to a broader audience.
OpenShift AI supports deployment in multiple frameworks, with ONNX being one of them. By saving our model in ONNX format, we align with this support, facilitating a smooth deployment process across different platforms.
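The export itself is a single call in PyTorch. The sketch below assumes best_model is the trained nn.Module recovered from the best trial; the input shape, tensor names, and file name are illustrative rather than the notebook's exact values.
import torch

# Export the best PyTorch model to ONNX; shape and names are illustrative.
dummy_input = torch.randn(1, 64)                    # one sample with 64 input features
torch.onnx.export(
    best_model,                                     # trained nn.Module from the best trial
    dummy_input,
    "best_model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch_size"}},      # allow variable batch size
)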
The Grand Finale: Deploying the Model for Inference
The culmination of our optimization journey is the deployment of our refined model for inference. We upload our model to an AWS S3 bucket, ensuring its accessibility for practical applications. To facilitate deployment, we navigate to our Data Science project, deploy the model directly from the dashboard, and obtain the inference URLs. This allows us to access the model for real-world applications.
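Uploading the ONNX file can be done with boto3; the bucket name and object key below are placeholders for the values from your own data connection.
import boto3

# Credentials and endpoint typically come from the workbench data connection
# (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_S3_ENDPOINT, AWS_S3_BUCKET).
s3 = boto3.client("s3")
s3.upload_file("best_model.onnx", "my-hpo-bucket", "models/best_model.onnx")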
If our interest lies in exploring the top 5 optimized models from a 50-trial experiment, we have the flexibility to save multiple models. This approach enables us to conduct further experiments with these models, enhancing our understanding and refining our optimization efforts.
Utilizing a REST API, we can now send data to our model and receive predictions, demonstrating the effectiveness of our optimized model in a practical setting.
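As an illustration, assuming the model is served with the KServe V2 REST protocol, a request looks like the sketch below; the inference URL, model name, and tensor shape are placeholders you would replace with the values shown on the dashboard.
import requests

infer_url = "https://<inference-endpoint>/v2/models/best-model/infer"   # from the dashboard
payload = {
    "inputs": [{
        "name": "input",            # must match the ONNX input name
        "shape": [1, 64],           # illustrative shape, matching the export above
        "datatype": "FP32",
        "data": [0.0] * 64,         # one flattened sample
    }]
}
response = requests.post(infer_url, json=payload)
print(response.json())              # the prediction is returned under "outputs"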
Conclusion: The Journey Continues
This guide has provided a glimpse into the world of hyperparameter optimization, showcasing how OpenShift AI, Ray Tune, and the Model Registry can be used to optimize machine learning models. As we continue our journey, we'll explore more advanced techniques and tools, always striving to push the boundaries of what's possible in machine learning.