Explore the MNIST dataset

Welcome to JupyterLab!
After starting your server, three sections appear in JupyterLab's launcher:
- Notebook
- Console
- Other
On the left side of the navigation pane, locate the Name explorer panel (Figure 2). This is where you can create and manage your project directories.
Clone a GitHub Repository
Now it's time to populate your JupyterLab notebook with a GitHub repository.
Select the Git/Clone a Repository menu option. A dialog box will appear (Figure 3).
Enter the URL of the repository, https://github.com/rh-aiservices-bu/mnist-tensorflow-model, and select Clone to clone the mnist-tensorflow-model repository.
Note: GitHub may ask for your credentials. Enter your username and password, then select Ok.
The mnist-tensorflow-model repository contents
After you've cloned your repository, the mnist-tensorflow-model repository contents will appear in a directory under the Name pane (Figure 4).
The repository should contain the following files:
- 01-MNIST-Data-Exploration.ipynb
- 02-MNIST-TensorFlow.ipynb
- README.md
- requirements.txt
- Resources.tar.gz
Let's use the 01-MNIST-Data-Exploration.ipynb file.
01-MNIST-Data-Exploration.ipynb
In this notebook, we'll take an opportunity to explore the data we'll be using. Data exploration and understanding are fundamental steps in machine learning (ML).
The data we'll be exploring is the MNIST data set. The data set consists of 60,000 examples of handwritten digits with a label corresponding to the intended digit.
Note: In ML, 60,000 examples are considered a fairly small data set. Creating an AI program to label handwritten digits from the MNIST data set is a very common, hello-world style of AI program.
Unzipping the Resources.tar.gz file
First, you'll have to unpack the tar file that contains the MNIST data.
Because the data set is large, it is provided as a compressed tar file. You will need to unpack the file from a Terminal window.
From the Launcher tab, select Terminal to launch a terminal window (Figure 5).
Now you are working in the Terminal (Figure 6). Change into the mnist-tensorflow-model directory using the following command:
$ cd mnist-tensorflow-model
Unzip the Resources.tar.gz file using the command:
$ tar xvzf Resources.tar.gz
Unzipping the Resources.tar.gz file creates a new Resources folder which appears in the mnist-tensorflow-model directory.
As shown in Figure 7, selecting the Resources directory will display the CSV files containing the MNIST data: mnist_test.csv and mnist_train.csv.
Now that we have a data set to work with, let’s explore it!
Data exploration in 01-MNIST-Data-Exploration.ipynb
Explore the MNIST data set using the 01-MNIST-Data-Exploration.ipynb notebook as shown in Figure 8. You will work through this notebook, reading the explanations and executing the notebook’s cells.
The sections that you will be working through include:
- Load the mnist_train.csv data into a DataFrame
- Explore the mnist_train data using a DataFrame
- Examine ‘features’ of the mnist_train data
- Examine the data graphically using plt.imshow()
After finishing this notebook, you will understand what the data looks like, which will help you when you construct the AI model in the second notebook.
Load the mnist_train.csv data into a DataFrame
Pandas is a column-oriented data manipulation library. It is very commonly used, and has a parallel in Spark’s DataFrame API.
One of Pandas’s conveniences is that it can load many data formats easily. Load the data by using the following Python code, which loads the data into a DataFrame called train_df.
import pandas as pd

train_df: pd.DataFrame = pd.read_csv('Resources/mnist_train.csv', header=None)
Explore the mnist_train data using a DataFrame
Now that the data is loaded into a DataFrame we can begin to explore the data. Let’s take a look at the train_df DataFrame (Figure 9).
The DataFrame will be truncated, so you can see only a few rows out of the total 60,000 rows of data.
Take a look at a few rows with the head() or tail() method. Each method accepts an argument specifying how many rows to display; the default is 5, which is probably what we want (Figure 10).
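For example, running something like the following in notebook cells shows the first and last rows (a minimal sketch; the exact rendering depends on your notebook environment):

train_df.head()    # displays the first 5 rows
train_df.tail()    # in a separate cell, displays the last 5 rows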
Note that when looking at the data, column 0 appears to have the label, while the rest appear to be zeros. If you were to look at a bit more expanded DataFrame, you would see that there's some padding around the digits. We'll examine it a bit closer in the next few steps.
You can also see some basic statistics about the data in the DataFrame with the describe() method. This is useful as a basic sanity check that the data corresponds to what you expect.
These statistics aren't particularly useful for this data set, but it's helpful to see some of them nonetheless. For instance, there are some non-zero pixels in columns 775-780 (Figure 11). That at least tells us that some images extend that far.
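As a sketch, the call is simply:

# Summary statistics (count, mean, std, min, max, quartiles) for each column
train_df.describe()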
Now, this helps us get a feel for the data set, but doesn't really show us what we need to see to understand the data.
Let's now look at a single example. Use the loc property to extract a single row and look at the values field to get the underlying NumPy array (Figure 12).
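A minimal sketch of that lookup (row index 0 here is just the first example):

# Extract row 0 as a pandas Series, then get the underlying NumPy array
row = train_df.loc[0]
row.values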
Now that you have some idea of what the data in the DataFrame looks like, we can proceed. But it is also good to build just a bit more intuition about the data, since we can see the values but they do not yet really give us a clear picture of what we're looking at.
In this case, the best way would be to look at the rows graphically, excluding the first element which is the label. Graphical representations of data often are the most intuitive way to examine data, and especially image data.
For a graphical view, we bust out our trusty Pyplot library. To get a good look at our data, take that first example and check that the label corresponds to the image in the way we expect (Figure 13).
The output shows that our label for row 0 is 5. Now let's look at the rest of it graphically.
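A minimal sketch of that check (the variable name is just illustrative):

# Column 0 of row 0 holds the label
label = train_df.loc[0, 0]
print(label)    # 5 for the first training example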
We take the rest of row 0 and call it "features," which is a machine learning term that roughly refers to the characteristics of a piece of data. Machine learning is all about generating or predicting a "label" given a set of "features." Let’s continue our examination of features in the next section.
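As a sketch, extracting those features might look like this (the name features is illustrative, not taken from the notebook):

# Everything after column 0: the 784 pixel intensity values for row 0
features = train_df.loc[0].values[1:]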
Examine ‘features’ of the mnist_train data set
Now, we could look at our data as a one-dimensional array, but instead since it's an image, we want to reshape it into the proper two-dimensional shape. That will give us the most clarity about what we want to see (Figure 14). Note that the reshape() method takes a single tuple, rather than a set of integer arguments.
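Continuing the sketch from the previous section, the reshape might look like this:

# reshape() takes a single tuple; 784 pixel values become a 28x28 image
image = features.reshape((28, 28))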
Examine the data graphically using plt.imshow()
Let’s examine the data using the pyplot.imshow() method, which takes an array of values (or an n-dimensional array) and renders an image based on those values. The method is useful for tasks such as displaying images from pixel data, heatmaps, and more. We're using it for the former, but it has many other uses.
In our case, we pass in the reshaped 28x28 array from the previous section, and tell imshow() to use the greyscale color map (Figure 15). You can check the documentation for Pyplot to learn more about the method’s features and other keywords it supports.
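A minimal sketch, assuming the reshaped 28x28 array from the previous section:

import matplotlib.pyplot as plt

# Render the pixel values as a grayscale image
plt.imshow(image, cmap='gray')
plt.show()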
Now that we know what our data looks like, we are ready to build our model. Continue to the next learning path where we will build, train, and run a TensorFlow model.