How to create a PyTorch model

In this learning path, you will set up options for your Jupyter notebook server and select your PyTorch preferences, then explore the data set you'll use to create your model. Finally, you will learn how to build, train, and run your PyTorch model.

Explore the Diabetes data set

After starting your server, three sections appear in JupyterLab's launcher:

  • Notebook
  • Console
  • Other

On the left side of the navigation pane, locate the Name explorer panel (Figure 3). This panel is where you can create and manage your project directories.

Figure 3: The Name explorer panel in a JupyterLab workspace shows available options.

Clone a GitHub repository

Now it's time to populate your JupyterLab notebook with a GitHub repository. Select the Git/Clone a Repository menu option. A dialog box appears (Figure 4).

Figure 4: Enter the URL for the Git repository you want to clone.

Enter the repository URL, which is https://github.com/rh-aiservices-bu/diabetes-pytorch-model, and select Clone to clone the diabetes-pytorch-model repository.

If the notebook prompts you for login information, enter your username and password, then select Ok.

The diabetes-pytorch-model repository contents

After you've cloned your repository, the diabetes-pytorch-model repository contents appear in a directory under the Name pane (Figure 5).

Figure 5: The user interface shows the contents of the diabetes-pytorch-model repository.

Change into the notebooks directory. The directory should contain the following files:

  • 00-getting-started.ipynb
  • 01-data-exploration.ipynb
  • 02-model-development.ipynb

Open the 01-data-exploration.ipynb file.

01-data-exploration.ipynb

The data set we are using is composed of diabetes readings for females. It can be used to predict the onset of diabetes based on medical diagnostic measurements. This data set is available on Kaggle and is described as follows:

“This data set is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the data set is to diagnostically predict whether a patient has diabetes, based on diagnostic measurements included in the data set. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.”

In this notebook, we'll take an opportunity to explore the data we'll be using. Data exploration and understanding are fundamental steps in machine learning.

The data set consists of 768 examples of various medical readings for female patients who are members of an indigenous nation. Some of the patients have diabetes. Knowing what medical readings look like for a person with diabetes, can we predict which people may have diabetes based on the medical readings we have gathered?

Note: In machine learning, a data set of 768 examples is considered extremely small.

Data exploration in 01-data-exploration.ipynb

Explore the Diabetes data set using the 01-data-exploration.ipynb notebook as shown in Figure 6. You will work through this notebook, read the explanations, and execute the notebook’s cells.

Figure 6: Open and work through the 01-data-exploration.ipynb notebook.

The sections that you will be working through include:

  1. Loading the diabetes.csv data into a DataFrame.
  2. Exploring the diabetes data using a DataFrame.
  3. Looking for correlations in the diabetes data set.

After finishing this notebook, you will understand what the data looks like, which will help you construct the AI model in the next Jupyter notebook.

Load the diabetes.csv data into a DataFrame

Pandas is a column-oriented data manipulation library. It is very commonly used and has a parallel in Apache Spark’s DataFrame API.

One of Pandas' conveniences is that it can load many data formats easily. Load the data from our diabetes data set by using the following Python code, which loads the data into a DataFrame named df:

import pandas as pd

df = pd.read_csv("diabetes.csv")

Explore the diabetes data using a DataFrame

Now that the data is loaded into a DataFrame, we can begin to explore the data. Let's take a look at our DataFrame using the df.info() method (Figure 7):
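
df.info()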

Figure 7: Use df.info() to examine the columns and data types in your data set.

info() shows the column names and associated data types. Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, Age, and Outcome are integer data types (e.g., 34, 2, 98). BMI and DiabetesPedigreeFunction are floating-point data types (e.g., 1.3, 4.56).

Let's use the describe() method (Figure 8) to learn about the statistical information (e.g., mean, standard deviation, min, max) for each of our columns:

df.describe()

Figure 8: Use the describe() method to view statistical information for the data set columns.

The data set consists of several medical predictor (independent) variables and one target (dependent) variable, Outcome. Independent variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on. Definitions of the variables follow:

  • Pregnancies: Number of times the individual has been pregnant.
  • Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
  • BloodPressure: Diastolic blood pressure (mm Hg).
  • SkinThickness: Triceps skinfold thickness (mm).
  • Insulin: 2-hour serum insulin (mu U/ml).
  • BMI: Body mass index (weight in kg/(height in m)^2).
  • DiabetesPedigreeFunction: Diabetes pedigree function, a score of diabetes likelihood based on family history.
  • Age: Age (years).
  • Outcome: Class variable; 268 of the 768 entries are 1 (has diabetes), while the others are 0 (does not have diabetes).
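
You can verify the class balance noted above directly from the DataFrame; since 268 of the 768 entries are 1, the remaining 500 are 0:

df["Outcome"].value_counts()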

Let's now look at our data using a series of histograms and plots in order to see how "distributed" it is. We can run the hist() function to see how our data is distributed for each column in our data set (Figure 9). The display will allow us to determine whether any of the columns have outliers:

# Plot a histogram of each column to inspect its distribution
df.hist(color="k", alpha=0.5, bins=20)

Figure 9: We can see that our data set is evenly distributed except for some outliers.

The histograms in Figure 9 show that our data set is evenly distributed except for some outliers in BMI, Outcome, and DiabetesPedigreeFunction.

Outliers are data points that differ significantly from most of the data points. They can be due to variability in measurement or may even indicate a measurement error. It is up to the data engineer or data scientist to determine whether outliers should be removed or kept within the data set. For now, we will keep them.

Let's take a closer look at the actual values in our data set. To examine the first 20 rows of data, plus column names such as Pregnancies and Glucose, we use the head() function (Figure 10):
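
df.head(20)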

Figure 10: The first values in the data set are displayed with their columns.

I have annotated the output to highlight 0 values for insulin levels and skin thickness. The 0 values are outliers indicating that we have missing or null data.

Let's check the last 20 rows of our data set, using the tail() method, and determine whether there are any 0 values (Figure 11):
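
df.tail(20)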

Figure 11: In the last 20 rows of our data set, there are also 0 values.

The head() and tail() methods allowed us to discover that we have a number of 0 values in our data set. Therefore, it is likely that there are more 0 values present throughout our data set. Let’s determine how many.

We can use the iloc method and some simple statistics to determine the number and percentage of '0' values in our data set:

# Determine the number of '0' values
# Drop the last column (Outcome), where 0 is a valid class label
df2 = df.iloc[:, :-1]

# Number and percentage of '0's for each attribute
numZero = (df2 == 0).sum()
perZero = (df2 == 0).sum() / 768 * 100

print("\nRows, Columns: ", df2.shape)
print("\nNumber of 0's:")
print(numZero)
print("\nPercentage of 0's:")
print(perZero)
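
If you prefer not to hard-code the number of rows, an equivalent calculation (a minor variation on the notebook's code) derives the count from the DataFrame itself:

# Equivalent percentage calculation without hard-coding 768
perZero = (df2 == 0).sum() / len(df2) * 100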

The output from this code is a list of the column names in our data set, along with the number and percentage of 0 values (Figure 12).

Figure 12: There are 227 zero values for SkinThickness and 374 zero values (about 50%) for Insulin.

Looking for correlations in the Diabetes data set

If we want to build and train a model, we need to address these 0 values. We need to ask whether the missing data values are in some way related (correlated) to each other. To look for correlations in our data set, we can use the corr() method, which computes the standard correlation coefficient between every pair of attributes:

corrM = df.corr()
corrM

The coefficient is measured on a scale from -1 to 1. A coefficient close to 1 indicates a strong positive correlation, and a coefficient close to -1 indicates a strong negative correlation. But when we take a look at our data, we don't see any attributes that are highly correlated (Figure 13). Even our attributes with lots of 0 values, Insulin and SkinThickness, have a correlation value of only 0.436783.
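
To pull a single value out of the correlation matrix, such as the Insulin/SkinThickness pair quoted above, you can index corrM directly:

corrM.loc["Insulin", "SkinThickness"]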

Figure 13: Our data shows no high correlations between columns, even between Insulin and SkinThickness.

Even though there are no high correlations in the data, the zero values are still erroneous, and we shouldn't include them in our model.

Missing values and other invalid outliers are common problems in analytical data. To deal with missing data, data scientists can use two primary methods: imputation or data removal. Imputation develops reasonable guesses for missing data, such as replacing these values with a median measurement. But we don't want to impute before we split the data set into training and testing sets for our model, because statistics computed on the full data set would leak information from the test set into training. So let's set aside this issue until then. In the meantime, since we are fairly happy with our data analysis, we can go to the next step, which is model development.
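
For reference, a minimal sketch of median imputation with pandas might look like the following. The column list here is an illustrative assumption (zeros are only implausible for certain measurements), and in practice you would compute the medians on the training split only:

import numpy as np

# Columns where a 0 reading is physiologically implausible (illustrative list)
cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Treat 0 as missing, then fill each gap with the column median
df[cols] = df[cols].replace(0, np.nan)
df[cols] = df[cols].fillna(df[cols].median())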
