Mastering natural language processing (NLP)
How to create a natural language processing (NLP) application using OpenShift Data Science
In this activity, you are an intern for a city transportation department. You have been given the job of processing potential bus repair issues that the drivers have noticed during their shifts. In order to keep the repair issues organized and visible, you need to learn how to categorize them.
Red Hat OpenShift Data Science provides the tools you'll need to categorize these repair issues, and the Red Hat Developer Sandbox gives you free access to a Red Hat OpenShift cluster on which to perform this activity.
- The working environment for this activity is in the browser.
- You will need an OpenShift Sandbox account (it’s free).
What you’ll be doing
- Part 1: Get and/or log in to your OpenShift Sandbox account
- Part 2: Open the Red Hat OpenShift Data Science session
- Part 3: Launch Jupyter
- Part 4: Start a notebook server
- Part 5: Clone a GitHub repository into your environment
- Part 6: Time to play
- Part 7: Submitting metro bus repairs using NLP
- Part 8: Exposing the model as an API
- Part 9: Building the application inside OpenShift
- Part 10: Testing the application
- Part 11: Conclusion
Part 1: Get and/or log in to your OpenShift Sandbox account
Start by logging into your account. If you don’t have one, create one (Figure 1). The page is here: Developer Sandbox for Red Hat OpenShift | Red Hat Developer
Part 2: Open the Red Hat OpenShift Data Science session
From the sandbox dashboard, open your session by selecting the Red Hat OpenShift Data Science option in the upper right corner of the screen and clicking on the link (Figure 2).
You're now logged into OpenShift Data Science and are presented with the dashboard.
OpenShift Data Science brings you on-demand, Jupyter Notebook environments. Don’t worry if you’ve never used notebooks before because this activity includes a small tutorial about what they are and how to use them.
Part 3: Launch Jupyter
Now that you are logged in to Red Hat OpenShift Data Science, select Launch application on the Jupyter card (Figure 3).
If this is the first time you’re launching Jupyter, you are sent to a page that requires you to log in and that asks you for permission to use your user account to authenticate to Jupyter. Allow this access to continue with the workshop.
Part 4: Start a notebook server
Make sure the TensorFlow notebook image is selected, then select the Start server button at the bottom of the page (Figure 4).
Once the Jupyter server has started, open it.
About the Jupyter environment
You are now inside your Jupyter environment. It’s a web-based environment, but everything you do here is in fact happening on the OpenShift Data Science cluster. This means that without having to install and maintain anything on your own computer, and without tying up lots of local resources such as CPU and RAM, you can conduct your data science work in this powerful and stable managed environment.
The file-browser window you’re in right now contains the files and folders saved inside OpenShift Data Science.
Part 5: Clone a GitHub repository into your environment
It’s pretty empty right now, though, so the first thing you need to do is bring the content of the workshop inside this environment as follows:
On the left toolbar, select the GitHub icon (Figure 5.1):
Select Clone a Repository (Figure 5.2):
Enter the URL https://github.com/rh-aiservices-bu/metrobus-repairs-nlp-workshop.git and select CLONE. (Figure 5.3.)
Cloning the repository takes a few seconds, after which you can double-click the newly created folder, metrobus-repairs-nlp-workshop, and navigate into it (Figures 5.4 and 5.5).
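If you would rather clone from a code cell than from the toolbar, the same step can be sketched in Python. This is an illustration only: the helper names repo_folder and clone_command are hypothetical, and the snippet assumes git is available in the notebook image. It builds the command and the expected folder name without contacting GitHub.

```python
from urllib.parse import urlparse

REPO_URL = "https://github.com/rh-aiservices-bu/metrobus-repairs-nlp-workshop.git"

def repo_folder(url):
    """Return the folder name that `git clone` creates for this URL."""
    name = urlparse(url).path.rstrip("/").split("/")[-1]
    return name[:-4] if name.endswith(".git") else name

def clone_command(url):
    """Build the argument list for cloning the repository (hypothetical helper)."""
    return ["git", "clone", url]

print(clone_command(REPO_URL))
print(repo_folder(REPO_URL))

# To actually clone (needs git and network access):
# import subprocess
# subprocess.run(clone_command(REPO_URL), check=True)
```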
What’s a notebook?
A notebook is an environment with cells that can display formatted text or code.
Figure 5.6 below shows both empty and populated cells.
Code cells contain code that can be run interactively. That means you can modify the code, then run it. The code will not run on your computer or in the browser, but directly in the environment to which you are connected, which is OpenShift Data Science in our case.
To run a code cell, just click in it or on the left side of it, and then click the Run button from the toolbar. You can also press Ctrl+Enter to run a cell, or Shift+Enter to run the cell and automatically select the following one.
The Run button on the toolbar looks as shown in Figure 5.7.
Running the cell ends by showing the result of the code that was run in that cell, as well as information about when this particular cell has been run. You can also enter notes into a cell by switching the cell type in the menu from Code to Markdown.
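As a quick illustration, a code cell might contain something like the following. The function name and the second sample note are made up for this example and are not part of the workshop repository.

```python
# A small code cell: define a function, call it, and display the result.
def count_words(text):
    """Count the whitespace-separated words in a piece of text."""
    return len(text.split())

notes = ["I turn the key and nothing happens", "The rear door sticks"]
word_counts = {note: count_words(note) for note in notes}

# In a notebook, the value of the last expression is shown below the cell.
word_counts
```

Edit the notes list and run the cell again to see the displayed result change.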
Note: When you save a notebook, both the code and the results are saved. So you can reopen the notebook to look at the results without having to run the program again, and while still having access to the code.
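Results survive a save because a notebook file is a JSON document in which each code cell stores its outputs next to its source. The following minimal sketch uses a hand-built, one-cell notebook (not a file from the workshop) to show code and result coming back together after a save and reload.

```python
import json

# A hand-written, one-cell notebook in the nbformat 4 layout (illustrative only).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["print(2 + 3)"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["5\n"]}
            ],
        }
    ],
}

# Saving writes code and results together; reloading gives both back.
reloaded = json.loads(json.dumps(notebook))

stored = []
for cell in reloaded["cells"]:
    if cell["cell_type"] == "code":
        code = "".join(cell["source"])
        result = "".join("".join(out.get("text", [])) for out in cell["outputs"])
        stored.append((code, result))

print(stored)
```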
Part 6: Time to play
Now that we have covered the basics, give notebooks a try.
In your Jupyter environment (the file explorer-like interface), there is a file called 01_sandbox.ipynb. Double-click on it to launch the notebook, which will open another tab in the content section of the environment. Feel free to experiment, run the cells, add some more cells, and create functions. You can do whatever you want, because the notebook is your personal environment, and there is no risk of breaking anything or affecting other users. This environment isolation is a great advantage of OpenShift Data Science.
You can also create a new notebook by selecting File→New→Notebook from the menu on the top left, then selecting a Python 3 kernel. This selection asks Jupyter to create a new notebook that runs the code cells using Python 3.
You can also create a notebook by simply clicking on the icon in the launcher. (Figure 6.1.)
To learn more about notebooks, head to the Jupyter site.
Now that you’re more familiar with notebooks, you’re ready to go to the next section.
Part 7: Submitting metro bus repairs using NLP
Now that you know how the environment works, the real work can begin.
Still in your environment, open the 01-Create-Claims-Classification.ipynb file, and follow the instructions directly in the notebook.
After you run the code in the notebook, it will look like Figure 7.1.
Once you are finished, you can come back here and head to Part 8.
Part 8: Exposing the model as an API
In the previous section, we learned how to create the code that classifies a repair based on the free text we enter. But we can't use a notebook directly like this in a production environment. So now we will learn how to package this code as an API that you can directly query from other applications.
Some explanations first:
- The code that we wrote in the notebook has been repackaged as a single Python file, prediction.py. Basically, the file combines the code in all the cells of the notebook.
- To use this code as a function you can call, we added a function called predict that takes a string as an input, classifies the repair, and sends back the resulting classification. Open the file directly in JupyterLab, and you should recognize our previous code along with this new additional function.
- There are other files in the folder that provide functions to launch a web server, and that we will use to serve our API.
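To make the shape of that function concrete, here is a minimal sketch of a predict-style function. It is not the code in prediction.py: the real file uses the model built in the notebook, whereas this stand-in uses a hypothetical keyword table, and the category names are invented for illustration.

```python
# Hypothetical keyword table standing in for the trained model.
KEYWORDS = {
    "starter": ["key", "start", "ignition"],
    "brakes": ["brake", "stopping", "pedal"],
}

def predict(text):
    """Take free text describing a repair and return a category label."""
    words = [word.strip(".,!?") for word in text.lower().split()]
    # Score each category by how many of its keywords appear in the text.
    scores = {
        label: sum(word in terms for word in words)
        for label, terms in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to a catch-all label when no keyword matched.
    return best if scores[best] > 0 else "other"

print(predict("I turn the key and nothing happens"))
```

The point is the interface, a string in and a label out, which is what lets a web application call the classifier without knowing how it works internally.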
Open the 03_MBR_run_application.ipynb file and follow the instructions directly in the notebook.
Our API will be served directly from our container using Flask, a popular Python web framework. The Flask application, which will call our prediction function, is defined in the wsgi.py file.
When you execute the following cell, it will be in a permanent running state. That's normal, because the web server process will keep running. When you are finished with the test you can just select the cell and click the Stop button (next to Run).
Launch the Flask application (Figure 8.2)
Then query the API. (Figure 8.3.)
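For readers curious about what the serving layer does, the following is a rough sketch of the idea: a web application receives a POST request, passes the text to a prediction function, and returns the result as JSON. It is not the workshop's wsgi.py. To stay runnable without extra packages, it uses a plain WSGI callable from the standard library instead of Flask, and the predict stub and the data and prediction JSON keys are assumptions.

```python
import io
import json

def predict(text):
    """Stub standing in for the real classifier (assumption)."""
    return "starter" if "key" in text.lower() else "other"

def application(environ, start_response):
    """WSGI app: POST /prediction with {"data": "<text>"} returns a label."""
    if environ["REQUEST_METHOD"] != "POST" or environ["PATH_INFO"] != "/prediction":
        start_response("404 Not Found", [("Content-Type", "application/json")])
        return [json.dumps({"error": "use POST /prediction"}).encode("utf-8")]

    # Read exactly the number of bytes the client declared.
    length = int(environ.get("CONTENT_LENGTH") or 0)
    payload = json.loads(environ["wsgi.input"].read(length) or b"{}")
    body = json.dumps({"prediction": predict(payload.get("data", ""))})

    start_response("200 OK", [("Content-Type", "application/json")])
    return [body.encode("utf-8")]

# Call the app directly with a hand-built request, without starting a server.
request_body = json.dumps({"data": "I turn the key and nothing happens"}).encode("utf-8")
environ = {
    "REQUEST_METHOD": "POST",
    "PATH_INFO": "/prediction",
    "CONTENT_LENGTH": str(len(request_body)),
    "wsgi.input": io.BytesIO(request_body),
}
statuses = []
response = application(environ, lambda status, headers: statuses.append(status))
print(statuses[0], response[0].decode("utf-8"))
```

To serve a callable like this over HTTP, it could be passed to wsgiref.simple_server.make_server from the standard library; a Flask application object fills the same role in the workshop.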
Once you are finished, you can come back here and head to the next step.
Now you are ready to test the Flask application. Open the Jupyter notebook named 04_MBR_test_application.ipynb and follow the instructions in the notebook.
Part 9: Building the application inside OpenShift
Now that the application code is working, you’re ready to package it as a container image and run it directly in OpenShift as a service that you will be able to call from any other application. You can access the OpenShift Dedicated dashboard from the application switcher in the top bar of the Red Hat OpenShift Data Science (RHODS) dashboard (Figure 9.1).
Open your OpenShift UI and switch to the developer view from the menu on the top left:
From the +Add menu, click the From Git option:
In the GitHub Repo URL field (Figure 9.3), enter https://github.com/rh-aiservices-bu/metrobus-repairs-nlp-workshop.git:
Don’t overlook this step. (Figure 9.4.) Click Show advanced Git options, and in the GitHub reference field, enter main because this is the branch to use in our GitHub project.
Next, you will need to scroll to the bottom and expand the Resource Type section and select Deployment as the type of resource to be generated (Figure 9.5).
Click the Create button and the build process will begin (Figure 9.6). If you do not see the details panel on the right side, simply click the center of the application box (on the Python logo).
The automated build process will take a few minutes. Some alerts may appear if OpenShift tries to deploy the application while the build is still running, but that’s OK. Then OpenShift will deploy the application (rollout), and in the topology view, you should obtain a screen similar to Figure 9.6.1.
Note: If the build fails, as in the example here, simply click the Start Build button and try again. It may take multiple attempts, but it will eventually succeed. This behavior is caused by the memory limits of the free OpenShift Sandbox; in a regular cluster, you will not run into an out-of-memory condition.
Scroll to the bottom of the details panel to see the Routes value. This is the URL of your service (Figure 9.7).
Part 10: Testing the application
You now have an application, what’s known as an OpenShift service, listening at the route that was created during the deployment. You're probably eager to try out the application, and that is what we do next.
You can test the application by simply clicking on the route's link, or by copying and pasting the link into your browser. (Figure 10.1.)
Because our application is now a REST API endpoint, there are multiple ways to upload test data to it. Here are a few.
cURL from a terminal session:
You can use the OpenShift Web Terminal to access your service from a command line. Select the Web Terminal option (1) and a command line will appear (2):
In the terminal shell, enter a cURL command with sample text such as "I turn the key and nothing happens", as shown in Figure 10.2.1. Replace the localhost in the command with the right hostname for the route, and make sure to include /prediction:
From Python code:
Send a RESTful POST request with sample text such as "I turn the key and nothing happens", as shown in Figure 10.3. Replace the localhost in the command with the right hostname for the route, and make sure to include /prediction:
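Because the figure is not reproduced here, the following sketch shows one way such a request could be built with the Python standard library. The route value is a placeholder, and the data JSON key is an assumption about the payload format; check the workshop notebooks for the exact shape.

```python
import json
import urllib.request

# Placeholder: replace localhost with the hostname of your own route.
ROUTE = "http://localhost"

def build_request(route, text):
    """Build a JSON POST request for the /prediction endpoint."""
    url = route.rstrip("/") + "/prediction"
    body = json.dumps({"data": text}).encode("utf-8")  # "data" key is assumed
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_request(ROUTE, "I turn the key and nothing happens")
print(request.get_method(), request.full_url)

# Sending it requires a reachable service:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```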
From a notebook:
You can also test the REST API endpoint from a Jupyter Notebook. Open the notebook named 05_MBR_enter_repair.ipynb. In the first cell, replace the two placeholders, as shown in Figure 10.4:
- The repair text to be categorized
- The route to the service
The repair text goes in the my_text field in the file, and the route in the my_route field, as follows:
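Once filled in, the first cell might look roughly like this. The names my_text and my_route come from the notebook; the hostname is a made-up placeholder, and the endpoint lines are an addition for illustration that guards against a missing or doubled /prediction suffix.

```python
# The repair text to be categorized.
my_text = "I turn the key and nothing happens"

# The route to the service (placeholder hostname; copy yours from the Routes value).
my_route = "http://my-app-my-project.example.com/"

# Make sure the URL ends with /prediction exactly once.
endpoint = my_route.rstrip("/")
if not endpoint.endswith("/prediction"):
    endpoint += "/prediction"

print(endpoint)
```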
Run both cells and see the result. (Figure 10.4.1.)
Part 11: Conclusion
We hope you have enjoyed this activity!
You have experienced first-hand how Red Hat OpenShift Data Science can simplify the onboarding of your AI/ML projects, providing an easy-to-use solution for your data scientists and data engineers.
To learn more about Red Hat OpenShift Data Science, please visit the Red Hat OpenShift Data Science page.