Introduction to machine learning with Jupyter notebooks

May 21, 2021
Ishu Verma
Related topics: Artificial intelligence, Edge computing, Kubernetes
Related products: Red Hat OpenShift


    Recently, I was working on an edge computing demo that uses machine learning (ML) to detect anomalies at a manufacturing site. This demo is part of the AI/ML Industrial Edge Solution Blueprint announced last year. As stated in the documentation on GitHub, the blueprint enables declarative specifications that can be organized in layers and that define all the components used within an edge reference architecture, such as hardware, software, management tools, and tooling.

    At the beginning of the project, I had only a general understanding of machine learning and lacked the practitioner's knowledge to do something useful with it. Similarly, I’d heard of Jupyter notebooks but didn’t really know what they were or how to use one.

    This article is geared toward developers who want to understand machine learning and how to carry it out with a Jupyter notebook. You'll learn about Jupyter notebooks by building a machine learning model to detect anomalies in the vibration data for pumps used in a factory. An example notebook will be used to explain the notebook concepts and workflow. There are plenty of great resources available if you want to learn how to build ML models.

    What is a Jupyter notebook?

    Computational notebooks have been used as electronic lab notebooks to document procedures, data, calculations, and findings. Jupyter notebooks provide an interactive computational environment for developing data science applications.

    Jupyter notebooks combine software code, computational output, explanatory text, and rich content in a single document. Notebooks allow in-browser editing and execution of code and display the results of computations. A notebook is saved with an .ipynb extension. The Jupyter Notebook project supports dozens of programming languages; its name reflects its original support for Julia (Ju), Python (Py), and R.

    You can try a notebook by using a public sandbox or by running your own server, such as JupyterHub. JupyterHub serves notebooks for multiple users: It spawns, manages, and proxies multiple instances of the single-user Jupyter notebook server. In this article, JupyterHub will be running on Kubernetes.

    The Jupyter notebook dashboard

    When the notebook server first starts, it opens a new browser tab showing the notebook dashboard. The dashboard serves as a homepage for your notebooks. Its main purpose is to display the portion of the filesystem accessible by the user and to provide an overview of the running kernels, terminals, and parallel clusters. Figure 1 shows a notebook dashboard.

    Figure 1: A notebook dashboard.

    The following sections describe the components of the notebook dashboard.

    Files tab

    The Files tab provides a view of the filesystem accessible by the user. This view is typically rooted to the directory in which the notebook server was started.

    Adding a notebook

    A new notebook can be created by clicking the New button or uploaded by clicking the Upload button.

    Running tab

    The Running tab displays the currently running notebooks known to the server.

    Working with Jupyter notebooks

    When a notebook is opened, a new browser tab is created that presents the notebook's user interface. Components of the interface are described in the following sections.

    Header

    At the top of the notebook document is a header that contains the notebook title, a menu bar, and a toolbar, as shown in Figure 2.

    Figure 2: A notebook header.

    Body

    The body of a notebook is composed of cells. Cells can be included in any order and edited at will. The contents of the cells fall under the following types:

    • Markdown cells: These contain text with markdown formatting, explaining the code or containing other rich media content.
    • Code cells: These contain the executable code.
    • Raw cells: These are used when text needs to be included in raw form, without execution or transformation.

    Users can read the markdown and text cells and run the code cells. Figure 3 shows examples of cells.

    Figure 3: Examples of cells.

    Editing and executing a cell

    The notebook user interface is modal. This means that the keyboard behaves differently depending on what mode the notebook is in. A notebook has two modes: edit and command.

    When a cell is in edit mode, it has a green cell border and shows a prompt in the editor area, as shown in Figure 4. In this mode, you can type into the cell, like a normal text editor.

    Figure 4: A cell in edit mode.

    When a cell is in command mode, it has a blue cell border, as shown in Figure 5. In this mode, you can use keyboard shortcuts to perform notebook and cell actions. For example, pressing Shift+Enter in command mode executes the current cell.

    Figure 5: A cell in command mode.

    Running code cells

    To run a code cell:

    1. Click anywhere inside the [ ] area at the top left of a code cell. This will bring the cell into command mode.
    2. Press Shift+Enter or choose Cell > Run from the menu.

    Code cells are run in order; that is, each code cell runs only after all the code cells preceding it have run.

    Getting started with Jupyter notebooks

    The Jupyter Notebook project supports many programming languages. We’ll use IPython in this example. It uses the same syntax as Python but provides a more interactive experience. You’ll need the following Python libraries to do the mathematical computations needed for machine learning:

    • NumPy: For creating and manipulating vectors and matrices.
    • Pandas: For analyzing data and for data wrangling or munging. Pandas takes data such as a CSV file or a database, and creates from it a Python object called a DataFrame. A DataFrame is the central data structure in the Pandas API and is similar to a spreadsheet as follows:
      • A DataFrame stores data in cells.
      • A DataFrame has named columns (usually) and numbered rows.
    • Matplotlib: For visualizing data.
    • Scikit-learn (sklearn): For supervised and unsupervised learning. This library provides various tools for model fitting, data preprocessing, model selection, and model evaluation. It has built-in machine learning algorithms and models called estimators. Each estimator can be fitted to some data using its fit method.
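    As a minimal sketch (assuming the standard import conventions for these libraries), the imports used throughout this kind of notebook look like the following:

    ```python
    # Minimal sketch of the imports for the libraries listed above.
    import numpy as np                                  # vectors and matrices
    import pandas as pd                                 # DataFrames and data wrangling
    import matplotlib.pyplot as plt                     # data visualization
    from sklearn.tree import DecisionTreeClassifier     # estimator used later for training
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    ```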

    Using a Jupyter notebook for machine learning

    We’ll be using the MANUela ML model as a notebook example to explore various components needed for machine learning. The data used to train the model is located in the raw-data.csv file.

    The notebook follows the workflow shown in Figure 6. An explanation of the steps follows.

    Figure 6: Notebook workflow for machine learning.

    Step 1: Explore raw data

    Use a code cell to import the required Python libraries. Then, convert the raw data file (raw-data.csv) to a DataFrame with a time series, an ID for the pump, a vibration value, and a label indicating an anomaly. The required Python code is shown in a code cell in Figure 7.

    Figure 7: Importing libraries and converting raw data.
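    The exact code is in the notebook's code cell; a minimal sketch of the idea, assuming pandas is used to read the CSV file, could look like this:

    ```python
    import pandas as pd

    # Load the raw sensor data (timestamp, pump ID, vibration value, anomaly label)
    # into a DataFrame. The real column names are defined in raw-data.csv.
    raw_df = pd.read_csv("raw-data.csv")

    print(raw_df.head())    # inspect the first few rows
    print(raw_df.dtypes)    # check the column types
    ```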

    Running the cell produces a DataFrame with raw data, shown in Figure 8.

    Figure 8: DataFrame with raw data.

    Now visualize the DataFrame. The upper graph in Figure 9 shows a subset of the vibration data. The lower graph shows manually labeled data with anomalies (1 = anomaly, 0 = normal). These are the anomalies that the machine learning model should detect.

    Figure 9: Visualizing raw data and anomalies.
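    A sketch of that visualization with Matplotlib, assuming the DataFrame has a vibration column and a label column (both column names are assumptions for illustration):

    ```python
    import matplotlib.pyplot as plt

    subset = raw_df.iloc[:500]    # illustrative slice of the time series

    # Upper graph: vibration values; lower graph: manual labels (1 = anomaly, 0 = normal).
    fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(10, 6))
    ax1.plot(subset["vibration"])
    ax1.set_ylabel("vibration")
    ax2.plot(subset["label"], color="red")
    ax2.set_ylabel("anomaly label")
    plt.show()
    ```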

    Before it can be analyzed, the raw data needs to be transformed, cleaned, and structured into other formats more suitable for analysis. This process is called data wrangling or data munging.

    We’ll be converting the raw time series data into small episodes that can be used for supervised learning. The code is shown in Figure 10.

    Figure 10: Creating a new DataFrame.

    We want to convert the data to a new DataFrame with episodes of length 5. Figure 11 shows a sample time series data set.

    Figure 11: Example of time series data.

    If we convert our sample data into episodes with length = 5, we get results similar to Figure 12.

    Figure 12: New DataFrame with episodes.

    Let’s now convert our time series data into episodes, using the code in Figure 13.

    Figure 13: Converting data into episodes.

    Figure 14 shows the resulting data, with episodes of length 5 and the label in the last column.

    Figure 14: Episodes of length 5 and the label in the last column.

    Note: In Figure 14, column F5 is the latest data value and column F1 is the oldest data value for a given episode. The label L indicates whether there is an anomaly.

    The data is now ready for supervised learning.

    Step 2: Feature and target columns

    Like many machine learning libraries, scikit-learn requires separate feature (X) and target (Y) columns. Figure 15 shows the code that splits our data into feature and target columns.

    Figure 15: Splitting data into feature and target columns.

    Step 3: Training and testing data sets

    It’s good practice to divide your data set into two subsets: one to train a model and the other to test the trained model.

    Our goal is to create a model that generalizes well to new data. Our test set will serve as a proxy for new data. We’ll split the data set into 67% for the training set and 33% for the test set, as shown in Figure 16.

    Figure 16: Splitting data into training and test data sets.

    We can see that the anomaly rate for both the training and test sets is similar; that is, the data set is divided fairly evenly.

    Step 4: Model training

    We will perform model training with a DecisionTreeClassifier. Decision trees are a supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

    DecisionTreeClassifier is a class that can perform multi-class classification on a dataset, although in this example we’ll be using it for binary classification (anomaly or normal). DecisionTreeClassifier takes as input two arrays: an array X of features and an array Y of labels. After being fitted, the model can be used to predict the labels for the test data set. Figure 17 shows our code.

    Figure 17: Model training with DecisionTreeClassifier.

    We can see that the model achieves a high accuracy score.

    Step 5: Save the model

    Save the model and load it again to validate that it works, as shown in Figure 18.

    Figure 18: Saving the model.

    Step 6: Inference with the model

    Now that we've created the machine learning model, we can use it for inference on real-time data.

    In this example, we’ll be using Seldon to serve the model. For our model to run under Seldon, we need to create a class that has a predict method. The predict method can receive a NumPy array X and return the result of the prediction as:

    • A NumPy array
    • A list of values
    • A byte string

    Our code is shown in Figure 19.

    Figure 19: Using Seldon to serve the machine learning model.

    Finally, let’s test whether the model can predict anomalies for a list of values, as shown in Figure 20.

    Figure 20: Inference using the model.

    We can see that the model achieves a high score for inference, as well.

    References for this article

    See the following sources for more about the topics discussed in this article:

    • Welcome to Colaboratory
    • About GESIS Notebooks
    • The Project Jupyter homepage
    Last updated: August 15, 2022
