How to set up and reproduce data science experiments

In this learning path, you will learn how to set up data science projects. You will also learn how to consistently reproduce and execute the Jupyter notebooks in those projects, and how to serve the resulting models as a web service on top of Red Hat OpenShift.

Review and reproduce notebooks

The peer-review repository is structured as a guide to packaging data, notebooks, models, and related artifacts so that, once an experiment is published, new developers can more easily replicate the workflow (the combination of notebooks, models, and results).

The first step in the workflow is to configure the path, model, data, and run information in the configuration file. Before each run, update the version number under the [Run] section of the configuration (Figure 9). Incrementing the version number records the results of each run in its own versioned folder under the experiments directory. For example, version 1 of the run is recorded under experiments/experiments_1. Keeping a separate version for each run is helpful when you perform multiple runs of the experiments and compare the results with the published version.

Figure 9: Configure the parameters to run notebooks, including the version number.

The other parameters in the config file can be left untouched for now. To start with, we rerun the notebooks with the default data, that is, the data used by the author when publishing the experiment. In later steps, we will show how to point a run at custom data.
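To make the versioning mechanism concrete, here is a minimal Python sketch of how such a configuration could be read with the standard configparser module. The file name (config.ini) and the version key are assumptions for illustration; check the repository's configuration file for the exact names.

from configparser import ConfigParser
from pathlib import Path

# Hypothetical config file name and key names; adjust to match the repository.
config = ConfigParser()
config.read("config.ini")

# Read the run version from the [Run] section.
version = config.getint("Run", "version")

# Each run writes its artifacts into its own versioned folder under experiments/.
experiment_dir = Path("experiments") / f"experiments_{version}"
experiment_dir.mkdir(parents=True, exist_ok=True)
print(f"Run {version}: artifacts will be stored under {experiment_dir}")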

You can review as well as rerun the notebooks in the typical order of the data science workflow. Either execute all the cells in a notebook (using the Run all cells option from the Run drop-down menu, as shown in Figure 10) or step through the notebook and execute each cell with the play button (Run the selected cells and advance, as shown in Figure 11).

Figure 10: Running all cells of a notebook.

 

Figure 11: You can also run selected cells and advance through them.
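If you prefer to rerun the notebooks from a terminal rather than clicking through the JupyterLab interface, a tool such as papermill can execute them non-interactively. This is an optional sketch, not a step in the learning path; the notebook names are taken from the workflow described below, but their paths inside the repository may differ.

import papermill as pm

# Execute the notebooks in the order of the data science workflow and
# save each executed copy alongside the original notebook.
notebooks = [
    "01-explore-data.ipynb",
    "02-build-features.ipynb",
    "03-train-model.ipynb",
    "04-tune-model.ipynb",
    "05-model-inference.ipynb",
]
for nb in notebooks:
    pm.execute_notebook(nb, nb.replace(".ipynb", ".out.ipynb"))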

Repeat the run step for all the notebooks. The following artifacts are stored at the end of each notebook:

experiments
└── experiments_1              <- assuming the user sets the version in the config to 1
    ├── data
    │   ├── plots              <- 01-explore-data.ipynb
    │   └── train-test         <- 02-build-features.ipynb
    ├── models
    │   ├── trained            <- 03-train-model.ipynb
    │   └── tuned              <- 04-tune-model.ipynb
    └── prediction results     <- 05-model-inference.ipynb

After executing all the notebooks once, you will see the files shown in Figure 12 under experiments/experiments_1.

Figure 12: The first run leaves output files under experiments/experiments_1.
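As a quick sanity check, you can confirm that each notebook produced its expected output folder. The short sketch below assumes the run version is 1 and that the folder names match the tree shown earlier.

from pathlib import Path

# Assumed run version for this example.
experiment_dir = Path("experiments") / "experiments_1"

# Expected output folders, one per notebook in the workflow.
expected = [
    "data/plots",          # 01-explore-data.ipynb
    "data/train-test",     # 02-build-features.ipynb
    "models/trained",      # 03-train-model.ipynb
    "models/tuned",        # 04-tune-model.ipynb
    "prediction results",  # 05-model-inference.ipynb
]

for sub in expected:
    path = experiment_dir / sub
    print(("ok     " if path.exists() else "missing"), path)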