Running Spark Jobs On OpenShift

Introduction:

Jobs are a feature of OpenShift, and in this article I will explain how you can use jobs to run your Spark machine learning and data science applications against Spark running on OpenShift. Jobs can run once as a batch or on a schedule, which provides cron-like functionality. If a job fails, by default OpenShift will retry the job creation. At the end of this article, I have a video demonstration of running Spark jobs from OpenShift templates against Spark running on OpenShift v3.

Environment:

  • Infinispan 9.0.0
  • Spark 2.0.1
  • OpenShift Dedicated v3.3
  • Oshinko

Spark Batch Job Example:

apiVersion: batch/v1
kind: Job
metadata:
  name: recommend-mllib-scheduled
spec:
  parallelism: 1
  completions: 1
  template:
    metadata:
      name: recommend-mllib
    spec:
      containers:
      - name: recommend-mllib-job
        image: docker.io/metadatapoc/recommend-mllib:latest
        imagePullPolicy: "Always"
        env:
        - name: SPARK_MASTER_URL
          value: "spark://instance:7077"
        - name: RECOMMEND_SERVICE_SERVICE_HOST
          value: "jboss-datagrid-service"
        - name: SPARK_USER
          value: bob
      restartPolicy: Never
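
A minimal sketch of submitting and monitoring this job with the `oc` CLI follows; the filename is an assumption, and the pod name placeholder must be filled in from the `oc get pods` output:

# Save the Job definition above as recommend-mllib-job.yaml (filename is an assumption)
oc create -f recommend-mllib-job.yaml

# Watch the job and the pod it creates
oc get jobs
oc get pods

# Inspect the driver output once the pod is running or complete
oc logs <pod-name>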

Scheduled Job Example (Running a Spark Job Every 5 Minutes):

apiVersion: batch/v2alpha1
kind: ScheduledJob
metadata:
  name: sparkrecommendcron
spec:
  schedule: "*/5 * * * ?"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: pi
            image: docker.io/metadatapoc/recommend-mllib:latest
            imagePullPolicy: "Always"
            env:
            - name: SPARK_MASTER_URL
              value: "spark://instance:7077"
            - name: RECOMMEND_SERVICE_SERVICE_HOST
              value: "jboss-datagrid-service"
            - name: SPARK_USER
              value: bob
          restartPolicy: Never
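
A sketch of managing the scheduled job with `oc` (the filename is an assumption, and note that the `batch/v2alpha1` API had to be enabled on the cluster at this point in time):

# Create the scheduled job (filename is an assumption)
oc create -f sparkrecommendcron.yaml
oc get scheduledjobs

# Every five minutes the schedule spawns a regular Job, visible alongside its pods
oc get jobs
oc get pods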

Environment Setup

# Start a local OpenShift cluster
oc cluster up
# Deploy Oshinko (shortened URL from the original post)
oc new-app -f http://goo.gl/ZU02P4
# Allow the oshinko service account to edit project resources
oc policy add-role-to-user edit -z oshinko
# Deploy Infinispan / JBoss Data Grid (shortened URL from the original post)
oc new-app -f https://goo.gl/XDddW5

Once you have Oshinko and Infinispan/JDG set up, you will need to spin up a Spark cluster.
You can follow these steps in the screenshots below:
 

Spark Job Template

Spark jobs may run as scheduled jobs or as one-time batch jobs. You have the option of using source-to-image (S2I) or building a custom container that extends our OpenShift-Spark image and runs a spark-submit job, all within OpenShift. I will be demonstrating the custom container approach with a spark-submit run. I have created a template that wraps the OpenShift job and runs our Spark job against the cluster; it requires a few inputs:
i) name of the job
ii) Spark master IP or service name
iii) JBoss Data Grid IP or service name
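
I have not reproduced the full template here, but a minimal sketch of such a wrapper might look like the following; the parameter names are illustrative assumptions, not necessarily the exact ones from my template:

apiVersion: v1
kind: Template
metadata:
  name: spark-job-template
parameters:
- name: JOB_NAME            # i) name of the job (assumed parameter name)
  required: true
- name: SPARK_MASTER_URL    # ii) Spark master IP or service name (assumed parameter name)
  required: true
- name: JDG_SERVICE_HOST    # iii) JBoss Data Grid IP or service name (assumed parameter name)
  required: true
objects:
- apiVersion: batch/v1
  kind: Job
  metadata:
    name: ${JOB_NAME}
  spec:
    template:
      metadata:
        name: ${JOB_NAME}
      spec:
        containers:
        - name: ${JOB_NAME}
          image: docker.io/metadatapoc/recommend-mllib:latest
          env:
          - name: SPARK_MASTER_URL
            value: ${SPARK_MASTER_URL}
          - name: RECOMMEND_SERVICE_SERVICE_HOST
            value: ${JDG_SERVICE_HOST}
        restartPolicy: Never

A template like this could then be instantiated with oc new-app -f <template-file> -p JOB_NAME=... -p SPARK_MASTER_URL=... -p JDG_SERVICE_HOST=...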
 

Video Demonstration:

Links to Project and Example Source Code Used in Demo

RadAnalytics – http://radanalytics.io/

Download and learn more about Red Hat JBoss Data Grid, a fast, distributed, and scalable in-memory data grid that accelerates performance and is independent from the data tier.


Join Red Hat Developers, a developer program for you to learn, share, and code faster – and get access to Red Hat software for your development.  The developer program and software are both free!

 


For more information about Red Hat OpenShift and other related topics, visit: OpenShift, OpenShift Online.
