How to access, download, and analyze data for S3 usage

In this learning path, you will start your Jupyter notebook server and select preferences for S3 usage. You will also learn how to access and download the data you create as well as analyze it, using a variety of skills and tools.

Access and download S3 data

You are now inside your JupyterLab environment (Figure 2). It's a web-based environment, but everything you do here actually happens on the OpenShift Data Science cluster. This means you can conduct your data science work in this powerful and stable managed environment without installing or maintaining anything on your own computer, and without consuming local resources such as CPU and RAM.

Figure 2. The Name explorer panel in a JupyterLab workspace shows available options.

You are now in a window that resembles a file browser on a desktop. It displays the files and folders saved in your personal space inside OpenShift Data Science. The window is nearly empty right now, so the first thing we will do is add content to this environment by using Git.

Cloning a GitHub repository

You can clone a Git repository in JupyterLab through the left-hand toolbar or the Git menu option in the main menu (Figure 3).

Figure 3. Access to GitHub repositories is available in the main menu at the top of the screen and in the toolbar on the left.

Let's clone a repository using the left-hand toolbar. Click the Git icon, shown in Figure 4.

Figure 4. The Git icon is the third icon from the top in the JupyterLab toolbar.

Then click Clone a Repository (Figure 5).

Figure 5. After selecting the Git icon, select “Clone a Repository”.

Enter your Git repository URL, which for this learning path is https://github.com/rh-aiservices-bu/access-s3-data. Then click CLONE (Figure 6).

Figure 6. Finish cloning the repository by clicking the CLONE button.
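If you prefer to clone from a notebook cell instead of the toolbar, the same step can be sketched in Python. This is a minimal sketch, not part of the learning path itself: it assumes git is available on the PATH of your notebook server, and the helper names default_folder_name and clone_repository are made up for illustration.

```python
import subprocess
from urllib.parse import urlparse

# Repository URL used in this learning path.
REPO_URL = "https://github.com/rh-aiservices-bu/access-s3-data"


def default_folder_name(url):
    """Return the folder name git creates by default for a repository URL."""
    name = urlparse(url).path.rstrip("/").split("/")[-1]
    # Drop a trailing ".git" suffix if the URL has one.
    return name[:-4] if name.endswith(".git") else name


def clone_repository(url, target_dir=None):
    """Run `git clone` and return the folder that holds the working copy."""
    target = target_dir or default_folder_name(url)
    # check=True raises CalledProcessError if git exits with an error.
    subprocess.run(["git", "clone", url, target], check=True)
    return target


# Example call (needs git on the PATH and network access):
# clone_repository(REPO_URL)
```

Either way, the result should be a folder named access-s3-data next to your other files.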

Cloning takes a few seconds. When it finishes, double-click the newly created folder, access-s3-data, to open your cloned Git repository. It contains an empty datasets directory and the following files (Figure 7):

  • downloadData.ipynb
  • simpleCalc.ipynb
  • Requirements.txt
  • README.md
Figure 7. The user interface shows a list of downloaded files.

Access and download S3 data

In the Name menu, double-click the downloadData.ipynb notebook (Figure 8).

Figure 8. Open the downloadData Jupyter notebook.

Run each cell in the notebook by pressing Shift+Enter, and check the output of each cell as you go. Using this notebook, we will:

  • Make a connection to an AWS S3 storage bucket
  • Download a CSV file into the datasets folder
  • Rename the downloaded CSV file to newtruckdata.csv
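The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not a copy of downloadData.ipynb: it assumes the boto3 library is installed and that credentials are available through boto3's default lookup, and the function names, bucket, and key shown are placeholders. The notebook in the repository may connect and download differently.

```python
import os
from pathlib import Path


def make_s3_client():
    """Create an S3 client using boto3's default credential lookup.

    Assumption: boto3 is installed and credentials are configured in the
    environment. The repository's notebook may connect differently.
    """
    import boto3  # imported here so the helper below works without boto3

    return boto3.client("s3")


def download_and_rename(client, bucket, key, dest_dir="datasets",
                        new_name="newtruckdata.csv"):
    """Download one object into dest_dir, then rename it to new_name."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Save the object under its original file name first.
    downloaded = dest / Path(key).name
    client.download_file(bucket, key, str(downloaded))

    # Rename the downloaded file; os.replace overwrites an existing target.
    renamed = dest / new_name
    os.replace(downloaded, renamed)
    return renamed


# Example call (placeholder bucket and key, not the notebook's real values):
# download_and_rename(make_s3_client(), "example-bucket", "data/truckdata.csv")
```

Passing the client in as an argument keeps the download logic separate from the connection details, so you can swap in whatever connection settings your bucket needs.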

View your new CSV file

Inside the datasets directory, double-click the newtruckdata.csv file. The file contents should appear as shown in Figure 9.

Figure 9. The user interface shows the contents of the newtruckdata.csv file.
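You can also peek at the file from a notebook cell rather than the file viewer. The sketch below uses only Python's standard csv module; the helper name preview_csv is hypothetical, and it makes no assumption about which columns the file contains.

```python
import csv
from itertools import islice


def preview_csv(path, max_rows=5):
    """Return the header row and up to max_rows data rows from a CSV file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])  # empty list if the file has no rows
        rows = list(islice(reader, max_rows))
    return header, rows


# Example call, run from the access-s3-data folder:
# header, rows = preview_csv("datasets/newtruckdata.csv")
# print(header)
```

Reading only the first few rows keeps the preview fast even if the downloaded file is large.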

The file contains the data you will analyze. Now we can move to the next learning resource and perform some analytics.
