Running Spark Jobs On OpenShift


Introduction:

A feature of OpenShift is jobs, and today I will explain how you can use jobs to run your Spark machine learning and data science applications against Spark running on OpenShift. Jobs can run as one-time batch jobs or on a schedule, which provides cron-like functionality. If a job fails, by default OpenShift will retry it by creating a new pod. At the end of this article, I have a video demonstration of running Spark jobs from OpenShift templates against Spark running on OpenShift v3.

Environment:

  • Infinispan 9.0.0
  • Spark 2.0.1
  • OpenShift Dedicated v3.3
  • Oshinko

Spark Batch Job Example:

apiVersion: batch/v1
kind: Job
metadata:
  name: recommend-mllib-scheduled
spec:
  parallelism: 1
  completions: 1
  template:
    metadata:
      name: recommend-mllib
    spec:
      containers:
      - name: recommend-mllib-job
        image: docker.io/metadatapoc/recommend-mllib:latest
        imagePullPolicy: "Always"
        env:
        - name: SPARK_MASTER_URL
          value: "spark://instance:7077"
        - name: RECOMMEND_SERVICE_SERVICE_HOST
          value: "jboss-datagrid-service"
        - name: SPARK_USER
          value: bob
      restartPolicy: Never
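
To run this batch job, save the YAML above to a file (recommend-mllib-job.yaml is just a name I am using here) and create it with oc. A minimal sketch:

# Create the job from the definition above
oc create -f recommend-mllib-job.yaml

# Watch the job until it reports one successful completion
oc get jobs

# The job controller labels its pods with job-name, so you can find the
# pod for this job and read the spark-submit output from its logs
oc get pods -l job-name=recommend-mllib-scheduled
oc logs <pod-name-from-the-previous-command>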

Scheduled Job (Running the Spark Job Every 5 Minutes):

apiVersion: batch/v2alpha1
kind: ScheduledJob
metadata:
  name: sparkrecommendcron
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: recommend-mllib-job
            image: docker.io/metadatapoc/recommend-mllib:latest
            imagePullPolicy: "Always"
            env:
            - name: SPARK_MASTER_URL
              value: "spark://instance:7077"
            - name: RECOMMEND_SERVICE_SERVICE_HOST
              value: "jboss-datagrid-service"
            - name: SPARK_USER
              value: bob
          restartPolicy: Never
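
To create the scheduled job, the batch/v2alpha1 API group must be enabled on your cluster. A minimal sketch of creating it and watching the jobs it spawns (note that in later versions of Kubernetes and OpenShift this resource was renamed CronJob, so the resource name in the commands would change accordingly):

# Create the scheduled job (requires batch/v2alpha1 to be enabled)
oc create -f sparkrecommendcron.yaml

# Every five minutes the controller spawns a new job; watch them appear
oc get scheduledjobs
oc get jobs --watch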

Environment Setup:

# Start a local OpenShift cluster
oc cluster up
# Deploy oshinko (template shortlink from the original post)
oc new-app -f http://goo.gl/ZU02P4
# Allow the oshinko service account to manage objects in the project
oc policy add-role-to-user edit -z oshinko
# Deploy JBoss Data Grid / Infinispan (template shortlink from the original post)
oc new-app -f https://goo.gl/XDddW5
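
The shortlinks above hide the exact template contents, but assuming they deployed oshinko and JBoss Data Grid cleanly, you can verify that everything is running before moving on:

# Watch the pods come up; exact names depend on the template versions
oc get pods -w

# Confirm the services, including the jboss-datagrid-service referenced
# by the job definitions above
oc get svc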

Once you have oshinko and Infinispan/JDG set up, you will need to spin up a Spark cluster. You can follow the steps shown in the screenshots below, or use the command-line sketch that follows:
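If you prefer the command line to the oshinko web UI, the oshinko CLI can create the cluster as well. A minimal sketch, assuming the oshinko CLI is installed and that these flags match your version (check oshinko --help):

# Create a Spark cluster named "instance" with two workers; the name
# matches the spark://instance:7077 master URL in the job definitions
oshinko create instance --workers=2

# Confirm the cluster and its master/worker pods
oshinko get instance
oc get pods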
 

Spark Job Template:

Spark jobs may run as scheduled jobs or as one-time batch jobs. You have the option of using Source-to-Image (S2I) or building a custom container that extends our OpenShift Spark image and runs a spark-submit job, all within OpenShift. I will be demonstrating the custom extended container with a spark-submit run. I have created a template that wraps the OpenShift job and runs our Spark job against the cluster; it requires a few inputs (a usage sketch follows the list):
i) name of the job
ii) Spark master IP or service name
iii) JBoss Data Grid IP or service name
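
A hedged sketch of instantiating such a template with oc process; the template file name and parameter names below are illustrative placeholders rather than the exact names from my template, and the -p flag syntax varies across oc versions:

# Fill in the template parameters and create the resulting job objects.
# File and parameter names are placeholders for illustration.
oc process -f spark-job-template.yaml \
    -p APPLICATION_NAME=recommend-mllib \
    -p SPARK_MASTER_URL=spark://instance:7077 \
    -p JDG_SERVICE=jboss-datagrid-service \
  | oc create -f -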
 

Video Demonstration:

Links to Project and Example Source Code Used in Demo

RadAnalytics – http://radanalytics.io/

Download and learn more about Red Hat JBoss Data Grid, a fast, distributed, and scalable in-memory data grid that accelerates application performance independently of the data tier.


