Red Hat OpenShift AI (RHOAI) is a flexible, scalable MLOps platform built on open source technologies, with tools to build, deploy, and manage AI-enabled applications. The platform provides trusted, operationally consistent capabilities for teams to experiment, serve models, and deliver innovative applications. The following sections explain how RHOAI supports each stage of the MLOps lifecycle.
1.1 Gathering and preparing training data
In this step, data scientists import data into a specialized environment known as a workbench. This environment is designed for data science tasks and is hosted on OpenShift, offering a containerized, cloud-based workspace. Equipped with tools such as JupyterLab and libraries like TensorFlow, it also includes GPU support for data-intensive and generative AI computations. Data can be uploaded directly to the workbench, retrieved from S3 storage, sourced from databases, or streamed.
RHOAI provides:
- Pre-configured workbench images that include Pandas for loading data from various formats (CSV, JSON, SQL), Matplotlib for visualization, and NumPy for numerical preprocessing tasks (see the sketch after this list).
- Extensibility of its core data ingestion capabilities through integration with specialized data processing applications such as Starburst.
- Robust data management, storage, security, and compliance in combination with OpenShift, which are vital for adhering to stringent data protection standards.
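For example, inside a workbench notebook, a data connection can be used to pull raw files from S3-compatible storage and load them with Pandas. The following is a minimal sketch; the bucket name, object key, and environment variable names are assumptions that depend on how the data connection is configured:

```python
import os

import boto3
import pandas as pd

# Data connection values are typically exposed to the workbench as
# environment variables (variable names assumed here).
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["AWS_S3_ENDPOINT"],
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

# Download a raw CSV object from a hypothetical bucket and load it with Pandas
s3.download_file("training-data", "raw/transactions.csv", "transactions.csv")
df = pd.read_csv("transactions.csv")

# Basic preprocessing: drop incomplete rows and inspect the result
df = df.dropna()
print(df.describe())
```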
1.2 Developing, augmenting, and fine-tuning the model using ML frameworks
In this step, the model is trained using data that has already been cleaned and prepared. To assess how well the model performs, it is tested against subsets of the data set aside for validation and testing. These subsets check the model's effectiveness on data it has not previously encountered, ensuring that it can make accurate predictions on new samples. Data scientists typically go through several iterations of this process, repeating data preparation, model training, and evaluation, and refining the model until its performance metrics meet their expectations.
To support these activities, RHOAI offers:
- Specialized workbench images equipped with popular machine learning libraries such as TensorFlow, PyTorch, and Scikit-learn, facilitating model development (see the sketch after this list).
- Some of these images also provide access to underlying GPUs, significantly reducing training time.
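As an illustration of the iterative loop described above, the sketch below trains and evaluates a simple classifier with scikit-learn, one of the libraries included in the workbench images. The dataset and hyperparameters are placeholders:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Split off a test subset that the model never sees during training
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train and evaluate; in practice this is repeated with different data
# preparation and hyperparameters until the metrics meet expectations
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```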
1.3 Deploying the model to production
After the model is trained and evaluated, the model is converted to a format suitable for serving, such as ONNX, and the model files are uploaded to a model storage location (an AWS S3 bucket) by using the configuration values of a data connection to that storage. To serve inferences, engineers create model servers that fetch the exported model from the storage location (AWS S3) by using data connections, and expose the model through a REST or gRPC interface. To run this series of steps automatically when new data is available, engineers implement data science pipelines: workflows that execute scripts or Jupyter notebooks in an automated fashion.
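As a concrete sketch of the export-and-upload step, the example below converts a scikit-learn model to ONNX and uploads it to the bucket referenced by a data connection. The bucket name, object path, and environment variable names are assumptions:

```python
import os

import boto3
from skl2onnx import to_onnx
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Train a small model (stand-in for the model produced in the previous step)
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=42).fit(X, y)

# Convert the trained model to ONNX; a sample input is used to infer the signature
onnx_model = to_onnx(model, X[:1].astype("float32"))
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

# Upload the exported model to the storage location defined by the data connection
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["AWS_S3_ENDPOINT"],
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)
s3.upload_file("model.onnx", "models", "iris/1/model.onnx")
```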
RHOAI provides:
- Data science pipelines, implemented as a combination of Tekton, Kubeflow Pipelines, and Elyra.
- A choice for engineers to work at a high, visual level by creating pipelines with Elyra, or at a lower level by applying deeper Tekton and Kubeflow knowledge (a sketch of the lower-level approach follows this list).
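For the lower-level route, a pipeline can be defined in code with the Kubeflow Pipelines (kfp) SDK and compiled into a definition that is then imported into RHOAI. The component contents and names below are placeholders, and the exact SDK version in use may differ:

```python
from kfp import compiler, dsl


@dsl.component(base_image="python:3.11")
def prepare_data():
    print("fetch and clean the training data")


@dsl.component(base_image="python:3.11")
def train_model():
    print("train the model and export it to ONNX")


@dsl.pipeline(name="train-and-export")
def train_and_export():
    # Run the training step only after data preparation completes
    train_task = train_model()
    train_task.after(prepare_data())


if __name__ == "__main__":
    # Produce a pipeline definition that can be imported into the RHOAI dashboard
    compiler.Compiler().compile(train_and_export, "train_and_export.yaml")
```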
1.4 Monitoring model effectiveness in production
In dynamic production environments, deployed models face the challenge of model drift due to evolving data patterns. This drift can diminish the model's predictive power, making continuous monitoring essential. With RHOAI, machine learning engineers and data scientists can monitor the performance of a model in production by using the metrics gathered with Prometheus.
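As an illustration, the metrics gathered by Prometheus can also be queried programmatically and fed into drift detection or alerting logic. The route, token, and metric name in this sketch are assumptions that vary per cluster and serving runtime:

```python
import requests

# Hypothetical Prometheus/Thanos query route and metric name
PROMETHEUS_URL = "https://thanos-querier-openshift-monitoring.apps.example.com/api/v1/query"
QUERY = 'sum(rate(modelmesh_api_request_milliseconds_count{namespace="my-project"}[5m]))'
TOKEN = "<openshift-token-with-monitoring-access>"

resp = requests.get(
    PROMETHEUS_URL,
    params={"query": QUERY},
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()

# Print the current request rate for each time series returned by the query
for series in resp.json()["data"]["result"]:
    print(series["metric"], series["value"])
```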
1.5 Delivering inference to the application
After a model is deployed in the production environment, real-world data is provided as input and, through a process known as inference, the model calculates or generates the output. At this stage, it is of prime importance to ensure minimal latency for an optimal user experience.
RHOAI delivers:
- Model inference through secure API endpoints that automatically scale with fluctuating workloads, ensuring consistent responsiveness (see the sketch after this list).
- Additionally, RHOAI's integrated API gateways provide robust security for these endpoints and any sensitive data they manage, effectively preventing unauthorized access.
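For example, an application can request a prediction from a deployed model over its REST endpoint. This sketch assumes a model served over the KServe v2 (Open Inference Protocol) REST interface; the route, model name, and input tensor are placeholders:

```python
import requests

# Hypothetical inference route exposed for the deployed model
INFER_URL = "https://iris-model-my-project.apps.example.com/v2/models/iris-model/infer"

payload = {
    "inputs": [
        {
            "name": "float_input",
            "shape": [1, 4],
            "datatype": "FP32",
            "data": [5.1, 3.5, 1.4, 0.2],
        }
    ]
}

# Send the inference request and print the model's outputs;
# add an Authorization header if the endpoint requires a token
resp = requests.post(INFER_URL, json=payload, timeout=10)
resp.raise_for_status()
print(resp.json()["outputs"])
```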
1.6 Retraining the model for new data
Model retraining is triggered based on the performance metrics monitored in production. Because data commonly changes over time, it is important to standardize and automate model retraining via retraining pipelines, using the data science pipelines capabilities of RHOAI described previously.