Prerequisites and step-by-step guide
Prerequisites
Step-by-step guide
1. Launch a Jupyter notebook with TensorFlow server image
To get started, we'll establish a new Data Science project within OpenShift AI that leverages a pre-configured TensorFlow image. This image provides a ready-to-use environment for building and training your machine learning models. Before proceeding, make sure you have the Red Hat Developer Sandbox set up; if not, see the Red Hat Developer Sandbox documentation for setup instructions.
Creating a New Workbench in OpenShift AI
Log in to OpenShift AI and navigate to "Data Science Projects" in the left menu. You will find a data science project already created and associated with your username, similar to what is shown in Figure 1 below.
Click the listed project to open its Overview page, similar to what is shown in Figure 2 below.
1.1 Create a Workbench
- On the Overview page, the first tile at the top left is "Workbenches". Click "Create a workbench" to start the workbench creation process, as shown in Figure 2 below.
1.2 Configure the Workbench
- Provide a descriptive name for your workbench.
- Under "Notebook image" select "TensorFlow" and keep the version set to "Recommended".
- In the "Deployment size" section, choose "Medium" for the container size.
- Change the "Cluster storage" setting to 10Gi.
- Click on "Create workbench" to initiate the creation process and launch your new workbench environment.
After all details are filled in, the form will look similar to Figure 3 below.
1.3 Verify Workbench Status and access Jupyter Notebook
- After clicking on "Create workbench", monitor the workbench status. It should eventually transition to "Running", indicating the successful creation of the workbench. See Figure 4 below.
Click on the "Open ↗" button located next to the status.
You will then land on a login page to authenticate yourself. Click the "DevSandbox" button, as shown in Figure 5 below.
You might encounter a permission prompt. Select the option "Allow selected permissions" to grant the necessary access, as shown in Figure 6 below.
After allowing permissions, you will be redirected to the JupyterLab launcher, where you select the "Python 3.9" notebook tile, similar to Figure 7. The Jupyter notebook environment comes pre-configured with TensorFlow and its dependencies.
You will land on a blank Jupyter notebook as shown in Figure 8 below.
In the upcoming section, you will execute Python code in the Jupyter notebook. Copy each code snippet provided below, paste it into a notebook cell, and execute it.
Alternatively, you can clone this GitHub repository that contains the complete code in a Jupyter notebook. After cloning the repository, navigate to the "openshift-ai/4_Models_inferencing" directory and open the "mobilenetv2prediction_manual.ipynb" notebook. This notebook includes all the necessary code snippets, along with detailed explanations and a discussion of the results, essential for completing this learning exercise.
2. Install Python dependencies
In this exercise, we will create and train a Keras neural network model and make predictions with it, using scikit-learn for data preprocessing. We will save the test data and the scaler locally in pickle format and convert the trained model to ONNX, as shown in the code snippets below. This approach ensures that our model is both efficient and compatible with various deployment environments.
Install the Python packages required to run the code: onnx, onnxruntime, seaborn, and tf2onnx.
!pip install onnx onnxruntime seaborn tf2onnx
Expected output:
Collecting onnxruntime
Downloading onnxruntime-1.18.1-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (6.8 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.8/6.8 MB 70.9 MB/s eta 0:00:00
Collecting seaborn
Downloading seaborn-0.13.2-py3-none-any.whl (294 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 294.9/294.9 kB 313.6 MB/s eta 0:00:00
The imported libraries and modules facilitate various tasks in machine learning and data processing. NumPy and Pandas handle numerical and data manipulation, while Keras provides tools for building and training neural networks. Scikit-learn aids in data preprocessing and model evaluation, and tf2onnx along with onnx allows for model interoperability between frameworks. Pickle is used for object serialization, and Path simplifies file path operations.
import numpy as np
import pandas as pd
import datetime
from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization, Activation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import class_weight
import tf2onnx
import onnx
import pickle
from pathlib import Path
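Optionally, you can confirm that the workbench image provides the expected library versions before moving on. This quick check is not part of the original notebook; it only uses the libraries imported above plus TensorFlow itself.
# Optional check (not part of the original notebook): print the versions of
# key libraries bundled in the TensorFlow workbench image.
import tensorflow as tf
print("tensorflow:", tf.__version__)
print("numpy:", np.__version__)
print("onnx:", onnx.__version__)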
3. Load the CSV data
The CSV data used to train the model contains several fields relevant to transaction characteristics, as shown in the code snippets below. These include the distance from home and the last transaction, the ratio of the purchase price to the median purchase price, and whether the transaction was from a repeat retailer. It also records if the credit card chip or PIN number was used, and if the transaction was an online order. Finally, it indicates whether the transaction is classified as fraudulent.
Download the CSV data file and store it in the "data" directory.
!wget https://raw.githubusercontent.com/rh-aiservices-bu/fraud-detection/main/data/card_transdata.csv && mkdir data && mv card_transdata.csv data/
Expected output:
https://raw.githubusercontent.com/rh-aiservices-bu/fraud-detection/main/data/card_transdata.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 76277977 (73M) [text/plain]
Saving to: 'card_transdata.csv'
Load the CSV data file using the Pandas library and print the first 5 rows, as shown in Figure 9 below.
Data = pd.read_csv('data/card_transdata.csv')
Data.head()
Expected output: the first five rows of the dataset (see Figure 9).
The code prepares data for training a machine learning model by first separating features (X) from the target variable (y), then splitting the data into training, testing, and validation sets. It scales the training data using "StandardScaler" to standardize feature values, ensuring the model trains more effectively. Additionally, it saves the test data and scaler to disk using pickle and calculates class weights to address class imbalance by giving more importance to the minority class (fraudulent transactions).
X = Data.drop(columns = ['repeat_retailer','distance_from_home', 'fraud'])
y = Data['fraud']
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, shuffle = False)
X_train, X_val, y_train, y_val = train_test_split(X_train,y_train, test_size = 0.2, stratify = y_train)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train.values)
Path("artifact").mkdir(parents=True, exist_ok=True)
with open("artifact/test_data.pkl", "wb") as handle:
pickle.dump((X_test, y_test), handle)
with open("artifact/scaler.pkl", "wb") as handle:
pickle.dump(scaler, handle)
class_weights = class_weight.compute_class_weight('balanced',classes = np.unique(y_train),y = y_train)
class_weights = {i : class_weights[i] for i in range(len(class_weights))}
No output expected here.
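Optionally, you can inspect the class balance and the computed weights to see why weighting is needed. This check is not part of the original notebook; it reuses the y_train and class_weights variables from the cell above.
# Optional check (not part of the original notebook): fraudulent transactions
# are the minority class, so they receive a much larger weight.
print(y_train.value_counts(normalize=True))
print(class_weights)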
4. Build the model
Using the following code snippet, we define a neural network in Keras with several Dense layers, Dropout layers, and Batch Normalization, configured for binary classification. The code compiles the model with the Adam optimizer and binary cross-entropy loss, and then prints the model summary.
model = Sequential()
model.add(Dense(32, activation = 'relu', input_dim = len(X.columns)))
model.add(Dropout(0.2))
model.add(Dense(32))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(32))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation = 'sigmoid'))
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
model.summary()
The model summary shows a sequential neural network with various layers including Dense, Dropout, Batch Normalization, and Activation. It indicates the output shape and the number of trainable parameters for each layer, with a total of 2593 parameters.
Expected output:
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 dense_4 (Dense)             (None, 32)                192
 dropout_3 (Dropout)         (None, 32)                0
 dense_5 (Dense)             (None, 32)                1056
 batch_normalization_2       (None, 32)                128
 (BatchNormalization)
 activation_2 (Activation)   (None, 32)                0
 dropout_4 (Dropout)         (None, 32)                0
 dense_6 (Dense)             (None, 32)                1056
 batch_normalization_3       (None, 32)                128
 (BatchNormalization)
 activation_3 (Activation)   (None, 32)                0
 dropout_5 (Dropout)         (None, 32)                0
 dense_7 (Dense)             (None, 1)                 33
=================================================================
=================================================================
Total params: 2593 (10.13 KB)
Trainable params: 2465 (9.63 KB)
Non-trainable params: 128 (512.00 Byte)
_________________________________________________________________
5. Train the model
Training a model is frequently the most time-intensive phase of the machine learning process. Large models may require multiple GPUs and several days to complete. For this straightforward model, however, we can expect the training to take approximately a minute or more.
# Train the model and get performance
import os
epochs = 2
history = model.fit(X_train, y_train, epochs=epochs,
                    validation_data=(scaler.transform(X_val.values), y_val),
                    verbose=True, class_weight=class_weights)
print("Training of model is complete")
We are training the model for 2 epochs, with each epoch consisting of 20,000 steps.
Expected output:
Epoch 1/2
20000/20000 [==============================] - 35s 2ms/step - loss: 0.2558 - accuracy: 0.9307 - val_loss: 0.2805 - val_accuracy: 0.9301
Epoch 2/2
20000/20000 [==============================] - 33s 2ms/step - loss: 0.2337 - accuracy: 0.9489 - val_loss: 0.2573 - val_accuracy: 0.9378
Training of model is complete
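The 20,000 steps per epoch follow from the data split and Keras' default batch size of 32. A quick check, assuming the CSV contains 1,000,000 rows:
# 1,000,000 rows * 0.8 (train split) * 0.8 (after validation split) = 640,000 samples
# 640,000 samples / 32 (default batch size) = 20,000 steps per epoch
print(len(X_train) // 32)  # expected: 20000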
6. Save the model file
The following code snippet converts the Keras model to ONNX format for compatibility with ModelMesh, a platform for managing and serving machine learning models, and saves the converted model in the models directory.
# Save the model as ONNX for easy use of ModelMesh
model_proto, _ = tf2onnx.convert.from_keras(model)
os.makedirs("models/fraud/1", exist_ok=True)
onnx.save(model_proto, "models/fraud/1/model.onnx")
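As an optional sanity check that is not part of the original notebook, you can reload the saved file and validate it with the ONNX checker before serving it:
# Optional sanity check (not part of the original notebook): reload the saved
# ONNX model and verify its structure.
loaded_model = onnx.load("models/fraud/1/model.onnx")
onnx.checker.check_model(loaded_model)
print([inp.name for inp in loaded_model.graph.input])  # model input names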
7. Confirm the model file was created successfully
List the available models in the "models" directory using the following command. The output should include the model's name, size, and date.
! ls -alRh ./models/
Expected output:
./models/:
total 12K
drwxr-sr-x. 3 1004770000 1004770000 4.0K Jul 1 14:43 .
drwxrwsr-x. 13 1004770000 1004770000 4.0K Jul 1 15:52 ..
drwxr-sr-x. 3 1004770000 1004770000 4.0K Jul 1 14:43 fraud
./models/fraud:
total 12K
drwxr-sr-x. 3 1004770000 1004770000 4.0K Jul 1 14:43 .
drwxr-sr-x. 3 1004770000 1004770000 4.0K Jul 1 14:43 ..
drwxr-sr-x. 2 1004770000 1004770000 4.0K Jul 1 14:43 1
./models/fraud/1:
total 24K
drwxr-sr-x. 2 1004770000 1004770000 4.0K Jul 1 14:43 .
drwxr-sr-x. 3 1004770000 1004770000 4.0K Jul 1 14:43 ..
-rw-r--r--. 1 1004770000 1004770000 13K Jul 1 15:53 model.onnx
8. Test the model
Import the dependencies: the confusion matrix utility from scikit-learn, pickle for loading the saved scaler and test data, Seaborn and Matplotlib for plotting, and ONNX Runtime for inference.
from sklearn.metrics import confusion_matrix
import numpy as np
import pickle
import seaborn as sns
from matplotlib import pyplot as plt
import onnxruntime as rt
This loads the previously saved objects using pickle. It reads the scaler object from 'artifact/scaler.pkl' and the test data (X_test, y_test) from 'artifact/test_data.pkl'. This is useful for reusing the scaler to transform new data and for evaluating the model on previously saved test data without retraining.
with open('artifact/scaler.pkl', 'rb') as handle:
    scaler = pickle.load(handle)
with open('artifact/test_data.pkl', 'rb') as handle:
    (X_test, y_test) = pickle.load(handle)
Create an ONNX inference runtime session and predict values for all test inputs:
sess = rt.InferenceSession("models/fraud/1/model.onnx", providers=rt.get_available_providers())
input_name = sess.get_inputs()[0].name
output_name = sess.get_outputs()[0].name
y_pred_temp = sess.run([output_name], {input_name: scaler.transform(X_test.values).astype(np.float32)})
y_pred_temp = np.asarray(np.squeeze(y_pred_temp[0]))
threshold = 0.95
y_pred = np.where(y_pred_temp > threshold, 1,0)
It calculates the accuracy of predictions by comparing y_test (true labels) with y_pred (predicted labels) and prints the accuracy score. It then generates a confusion matrix using confusion_matrix from Scikit-learn, visualizes it with a heatmap using Seaborn, and displays it using Matplotlib. This helps in evaluating the model’s performance by showing the number of true positives, true negatives, false positives, and false negatives.
accuracy = np.sum(np.asarray(y_test) == y_pred) / len(y_pred)
print("Accuracy: " + str(accuracy))
c_matrix = confusion_matrix(np.asarray(y_test),y_pred)
ax = sns.heatmap(c_matrix, annot=True,fmt='d', cbar=False, cmap='Blues')
ax.set_xlabel("Prediction")
ax.set_ylabel("Actual")
ax.set_title('Confusion Matrix')
plt.show()
The confusion matrix heatmap reveals that the model achieves an accuracy of 97.21%, indicating strong overall performance. It correctly classifies a significant number of transactions, with 181,306 true negatives and 13,107 true positives. The model exhibits high precision, effectively identifying fraudulent transactions with minimal false positives (1,193). However, the recall is lower, with 4,394 false negatives, suggesting that some fraudulent transactions are not detected. Overall, while the model is accurate and precise, there is room for improvement in capturing more fraudulent activities.
Expected output:
Accuracy: 0.972065
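If you want per-class precision and recall explicitly instead of reading them off the heatmap, scikit-learn's classification_report can compute them from the same arrays. This is an optional addition, not part of the original notebook:
# Optional (not part of the original notebook): per-class precision and recall
# computed from the same y_test and y_pred arrays as above.
from sklearn.metrics import classification_report
print(classification_report(np.asarray(y_test), y_pred, target_names=["legitimate", "fraud"]))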
Here is the order of the fields from Sally's transaction details:
- distance_from_last_transaction
- ratio_to_median_price
- used_chip
- used_pin_number
- online_order
We are predicting whether a specific transaction (Sally's transaction) is fraudulent and determining the likelihood of fraud.
sally_transaction_details = [
    [0.3111400080477545,
     1.9459399775518593,
     1.0,
     0.0,
     0.0]
]
prediction = sess.run([output_name], {input_name: scaler.transform(sally_transaction_details).astype(np.float32)})
print("Is Sally's transaction predicted to be fraudulent? (true = YES, false = NO) ")
print(np.squeeze(prediction) > threshold)
print("How likely was Sally's transaction to be fraudulent? ")
print("{:.5f}".format(np.squeeze(prediction)) + "%")
The model predicts that Sally's transaction is not fraudulent, with a fraud likelihood of only 0.00119%. This suggests the transaction is considered safe based on the model's analysis.
Expected output:
Is Sally's transaction predicted to be fraudulent? (true = YES, false = NO)
False
How likely was Sally's transaction to be fraudulent?
0.00119%
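If you want to score other transactions the same way, the prediction logic can be wrapped in a small helper function. The sketch below is illustrative rather than part of the original notebook; the function name is an assumption, and it reuses the sess, scaler, input_name, output_name, and threshold variables defined above.
# Illustrative helper (not part of the original notebook): score an arbitrary
# transaction with the same ONNX session, scaler, and threshold used above.
def score_transaction(details):
    # details: [distance_from_last_transaction, ratio_to_median_price,
    #           used_chip, used_pin_number, online_order]
    scaled = scaler.transform([details]).astype(np.float32)
    prob = float(np.squeeze(sess.run([output_name], {input_name: scaled})[0]))
    return prob, prob > threshold

prob, flagged = score_transaction([0.3111400080477545, 1.9459399775518593, 1.0, 0.0, 0.0])
print("Fraud probability: {:.5f}, flagged as fraud: {}".format(prob, flagged))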
Summary
In this learning exercise, we focused on training, testing, and saving a fraud detection model, using OpenShift AI to simplify environment management.