Mastering natural language processing (NLP)
Master NLP using Red Hat OpenShift Data Science
In this tutorial, you are an intern for a city transportation department. You have been given the job of processing potential bus repair issues that the drivers have noticed during their shifts. In order to keep the repair issues organized and visible, you need to learn how to categorize them.
Red Hat OpenShift Data Science provides the tools and the steps you'll need to categorize these repair issues.
Step 1: Starting a Jupyter environment
- Log in to the Red Hat OpenShift Dedicated platform (Figure 1).
- From the OpenShift Dedicated platform, go to your OpenShift Data Science platform by selecting the OpenShift Data Science icon in the upper right corner of the screen (Figure 2).
- You're now logged into OpenShift Data Science and are presented with the dashboard (Figure 3).
The Explore tab allows you to view other available managed services and partner applications. However, these services cannot be enabled in the sandbox environment.
OpenShift Data Science brings you on-demand Jupyter notebook environments.
Don’t worry if you’ve never used notebooks before because this workshop includes a small tutorial about what they are and how to use them.
Now that you are logged in to OpenShift Data Science, select Launch on the JupyterHub card (Figure 4).
If this is the first time you’re launching Jupyter, you are sent to a page that requires you to log in and that asks you to permit the use of your user account to authenticate to Jupyter. You should of course allow this access if you want to do the workshop.
Once you have authorized access, you will be taken to the JupyterHub Spawner Options page. Follow these steps:
- Locate the Start a notebook server page.
- For the Notebook Image, select TensorFlow, because this is the flavor of notebook we want to use. It includes the popular machine learning library TensorFlow, which we will use to classify the repair texts.
- From the Container size dropdown, select Default.
- At the bottom of the page, select Start Server (Figure 5).
While your environment is starting, go on to read the following section.
Step 2: The Jupyter environment
You are now inside your Jupyter environment. It’s a web-based environment, but everything you do here is in fact happening on the OpenShift Data Science cluster. This means that without having to install and maintain anything on your own computer, and without tying up lots of local resources such as CPU and RAM, you can conduct your data science work in this powerful and stable managed environment.
The file-browser window you’re in right now contains the files and folders saved inside OpenShift Data Science.
It’s pretty empty right now, though, so the first thing you need to do is bring the content of the workshop inside this environment as follows:
- On the left toolbar, select the GitHub icon (Figure 6).
- Select Clone a Repository (Figure 7).
- Enter the URL https://github.com/rh-aiservices-bu/metrobus-repairs-nlp-workshop.git and select CLONE (Figure 8).
- Cloning the repository takes a few seconds, after which you can double-click and navigate to the newly created folder, metrobus-repairs-nlp-workshop (Figures 9 and 10).
Ready? Let's go to the next step.
Step 3: Notebooks
This section provides a small introduction to Jupyter Notebooks. If you’re already at ease with Jupyter, you can head directly to the Time to Play section.
What’s a notebook?
A notebook is an environment with cells that can display formatted text or code.
Figure 11 shows both empty and populated cells.
Code cells contain code that can be run interactively. That means you can modify the code, then run it. The code will not run on your computer or in the browser, but directly in the environment to which you are connected, OpenShift Data Science in our case.
To run a code cell, just click in it, or on the left side of it, and then click the Run button from the toolbar. You can also press Ctrl+Enter to run a cell, or Shift+Enter to run the cell and automatically select the following one.
The Run button on the toolbar looks as shown in Figure 12.
When a cell finishes running, it displays the result of the code below it, as well as information about when this particular cell was last run. You can also enter notes into a cell by switching the cell type in the menu from Code to Markdown.
Note: When you save a notebook, both the code and the results are saved. So you can reopen the notebook to look at the results without having to run the program again, and while still having access to the code.
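For example, a code cell might contain a couple of lines of Python like the following (a throwaway illustration, not one of the workshop files). Running the cell prints the output directly beneath it:

```python
# A trivial code cell: define a repair description and print it.
repair_text = "I turn the key and nothing happens"
print("Repair issue recorded:", repair_text)
```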
Time to play
Now that we have covered the basics, give notebooks a try.
In your Jupyter environment (the file explorer-like interface), there is a file called 01_sandbox.ipynb. Double-click on it to launch the notebook, which will open another tab in the content section of the environment. Feel free to experiment: run the cells, add some more cells, and create functions. You can do whatever you want, because the notebook is your personal environment, and there is no risk of breaking anything or affecting other users. This environment isolation is a great advantage of OpenShift Data Science.
You can also create a new notebook by selecting File→New→Notebook from the menu on the top left, then selecting a Python 3 kernel. This selection asks Jupyter to create a new notebook that runs the code cells using Python 3.
You can also create a notebook by simply clicking on the icon in the launcher (Figure 13).
To learn more about notebooks, head to the Jupyter site.
Now that you’re more familiar with notebooks, you’re ready to go to the next section.
Step 4: Submitting metro bus repairs using NLP
Now that you know how the environment works, the real work can begin.
Still in your environment, open the 01-Create-Claims-Classification.ipynb file and follow the instructions directly in the notebook.
After you run the code in the notebook, it will look like Figure 14.
Once you are finished, you can come back here and head to the next section.
Step 5: Exposing the model as an API
In the previous section, we learned how to create the code that classifies a repair based on the free text we enter. But we can't use a notebook directly like this in a production environment. So now we will learn how to package this code as an API that you can directly query from other applications.
Some explanations first:
- The code that we wrote in the notebook has been repackaged as a single Python file, prediction.py. Basically, the file combines the code from all the cells of the notebook.
- To use this code as a function you can call, we added a function called predict that takes a string as input, classifies the repair, and sends back the resulting classification. Open the file directly in JupyterLab, and you should recognize our previous code along with this new function; a minimal sketch of its shape appears below.
- There are other files in the folder that provide functions to launch a web server, which we will use to serve our API.
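To make the shape of that function concrete, here is a minimal, self-contained sketch of a predict function. The keyword-to-category map is an invented stand-in for the trained model in the actual prediction.py, purely so the sketch runs on its own:

```python
# Illustrative sketch of prediction.py's predict() function.
# A trivial keyword lookup stands in for the trained model built
# in the notebook; the real file uses that model instead.

CATEGORIES = {  # hypothetical keyword-to-category map
    "key": "starter/electrical",
    "brake": "brakes",
    "tire": "tires",
}

def predict(single_repair_text: str) -> dict:
    """Take one free-text repair description and return its category."""
    text = single_repair_text.lower()
    for keyword, category in CATEGORIES.items():
        if keyword in text:
            return {"prediction": category}
    return {"prediction": "other"}

print(predict("I turn the key and nothing happens"))
# {'prediction': 'starter/electrical'}
```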
After these explanations, you are ready to open the 03_MBR_run_application.ipynb file and follow the instructions directly in the notebook.
Our API will be served directly from our container using Flask, a popular Python web framework. The Flask application, which will call our prediction function, is defined in the wsgi.py file.
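For orientation, a wsgi.py along these lines would expose predict over HTTP. This is a hedged sketch rather than the exact workshop file; the /prediction route matches the endpoint used later in this tutorial, while the text JSON field is an assumption reused in the query examples below:

```python
# wsgi.py - minimal Flask app wrapping predict() (illustrative sketch).
from flask import Flask, jsonify, request

from prediction import predict  # the function described above

application = Flask(__name__)

@application.route("/prediction", methods=["POST"])
def create_prediction():
    # Expects a JSON body such as:
    # {"text": "I turn the key and nothing happens"}
    data = request.get_json(force=True)
    return jsonify(predict(data["text"]))

if __name__ == "__main__":
    application.run(host="0.0.0.0", port=8080)
```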
Launch the Flask application (Figure 15). When you execute the cell that starts the server, the cell remains in a running state for as long as Flask is serving requests. Then query the API (Figure 16).
Once you are finished, you can come back here and head to the next step.
Now you are ready to test the Flask application. Open the Jupyter notebook named 04_MBR_test_application.ipynb and follow the instructions in the notebook.
Step 6: Packaging your application
Now that the application code is working, you’re ready to package it as a container image and run it directly in OpenShift as a service that you will be able to call from any other application.
Building the application inside OpenShift
You can access the OpenShift Dedicated dashboard from the application switcher in the top bar of the RHODS dashboard (Figure 17).
- Open your OpenShift UI and switch to the developer view from the menu on the top left:
- Make sure you are in the project that was assigned to you:
- From the +Add menu, click the From Git option:
- In the Git Repo URL field, enter https://github.com/rh-aiservices-bu/metrobus-repairs-nlp-workshop.git:
- Don't overlook this step (Figure 21): click Show advanced Git options, and in the Git reference field, enter main, because this is the branch to use in our GitHub project.
- Leave the other fields at their defaults and scroll down (Figure 22). The display shows that OpenShift automatically recognized that our repo contains Python code, and that the right base image has been selected. Pretty neat, eh?
- If you continue to scroll down, you will see that everything is automatically selected to create a deployment of your application, as well as a route through which you will be able to access it. You will need the URL of this deployment, so select the Routing advanced option before you click the Create button (Figure 23). The Routing page displays the URL of the route to the deployment you are creating:
- If you look at your deployment, you will see an hourglass icon denoting that your build is still running (Figure 24).
- The automated build process takes a few minutes. Some alerts may appear if OpenShift tries to deploy the application while the build is still running, but that's OK. OpenShift then deploys the application (rollout), and in the topology view you should see a screen similar to Figure 25, which shows the status of the deployment along with the memory and cores assigned to your build.
- Click on the application line as shown in Figure 26 to open an application detail panel on the right side of the screen. The panel shows further details, resources, and monitoring for your deployment.
- Scroll down in the detail panel and look for the Routes section, which contains the URL to which you will send your repair requests (Figure 27).
Once you are finished, you can come back here and head to the next step.
Step 7: Testing the application
You now have an application listening at the route that was created during the deployment. You're probably eager to try out the application—and that is what we do in this step.
App Status
You can test the application by simply clicking on the route's link, or by copying and pasting the link into your browser (Figure 28).
Sending repair text
Because our application is now a REST API endpoint, there are multiple ways to send repair text to it. Here are a few.
cURL on a Linux or macOS command-line
In a terminal shell such as Bash or zsh, enter a cURL command with sample text like "I turn the key and nothing happens," as shown in Figure 29. Replace the localhost in the command with the right hostname for the route, and make sure to include /prediction:
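A command along these lines should work. The port and the JSON field name (text) are assumptions based on the Flask sketch earlier in this tutorial, so match them to what the workshop's wsgi.py actually expects:

```bash
# Hedged example: adjust the hostname, port, and JSON field to your deployment.
curl -X POST -H "Content-Type: application/json" \
  -d '{"text": "I turn the key and nothing happens"}' \
  http://localhost:8080/prediction
```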
From Python code
Send a RESTful POST request with sample text like "I turn the key and nothing happens," as shown in the following example. Replace the localhost in the command with the right hostname for the route, and make sure to include /prediction:
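A minimal sketch using the requests library; as with the cURL example, the text field name is an assumption to be checked against the workshop's Flask application:

```python
# Hedged example: adjust the URL and JSON field to your deployment.
import requests

url = "http://localhost:8080/prediction"  # replace localhost with your route's hostname
payload = {"text": "I turn the key and nothing happens"}

response = requests.post(url, json=payload)
print(response.json())  # e.g., the predicted repair category
```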
From a notebook
You can also test the REST API endpoint from a Jupyter notebook. Open the notebook named 05_MBR_enter_repair.ipynb. In the first cell, replace the placeholders with the text, as shown in Figure 31:
- The repair text to be categorized
- The route to the service
The repair text goes in the my_text field in the file, and the route in the my_route field, as follows:
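For instance (the hostname below is a made-up placeholder; use the route URL from your own deployment):

```python
# First cell of 05_MBR_enter_repair.ipynb - placeholder values only.
my_text = "I turn the key and nothing happens"
my_route = "http://metrobus-repairs-myproject.apps.example.com"
```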
Run both cells and see the result (Figure 32).
Conclusion
We hope you have enjoyed this activity!
You have experienced first-hand how Red Hat OpenShift Data Science can simplify the onboarding of your AI/ML projects, providing an easy-to-use solution for your data scientists and data engineers.
To learn more about Red Hat OpenShift Data Science, please visit the Red Hat OpenShift Data Science page.