Containerizing things is particularly popular these days. Today we'll talk about the idioms we can use for containerization, and specifically play with Apache Spark and Cassandra in our use case for creating easily deployed, immutable microservices.


Note: This post is done using CentOS 7 as a base for the containers, but these same recipes will apply with RHEL and Fedora base images.

There are a few different ways to build a container. For example, beginners often build a container as a "lightweight VM", but in some ways this can be a bit of an anti-pattern. Rather than serving as lightweight VMs, containers are increasingly being designed in lock step with the microservices movement, which is quite different from the pre-container view of scalable systems (which typically involved mutable services on VMs and heavyweight provisioning).

One of the most exciting things about containers is that, if designed properly, they can embrace the immutability principle. Immutability is a fundamental concept behind many increasingly popular functional programming languages (which gave rise to frameworks such as Spark and MapReduce), and behind highly reliable software systems whose operations are easy to parallelize. Immutable services can, by definition, be deployed without any heavyweight installers or configuration management. This paves the way for a new paradigm in load balancing, high availability, and dynamic resource sharing.

Enough high level stuff... To really grok these connections, you need to build a distributed system with containers from scratch. In this post, I'll walk through two different ways to containerize an application. There are many ways to do it, but the microservices idiom, which is becoming a more popular way of using docker, provides a lot of benefits which cannot be realized with just "any" Dockerfile. Additionally, for folks interested in spark, we will demonstrate how to orchestrate and test a spark microservices cluster using Docker on CentOS. And, even if you don't fully buy into the microservices ecosystem in its current state, building clean and simple containers which do one thing perfectly is probably never a mistake.

For those wondering how you can use vagrant in the post-VM universe we are now living in, this post should also be helpful. To use the snippets here, you will need Docker and Vagrant installed. If you are on a non-Linux system, you can install VMware or VirtualBox and have vagrant launch the containers for you in a docker-friendly VM of your choice. In any case, I have tried to keep the code snippets to the point so that this post can be read in isolation.

I hope that, by the end of this post, a light bulb will go off in your head, and you too will see the connection between immutable infrastructure, idiomatically designed microservices, and portable testing of distributed applications. In this case we will use vagrant, but thanks to a microservice's self-sufficient and easily composable nature, you can test it with any orchestration framework you want, even a shell script.

This intimate connection will be realized in this post by building some dockerized spark containers (first, the wrong way), and then rebuilding them as proper standalone microservices.

So let's get started!

In the first attempt, we will create some simple containers, and use SSH to get into them so that we can start some services. This is a common design pattern for typical cloud-based infrastructures and testing, and it is extremely flexible and easy to hack around with, but it's also an anti-pattern. The mutable nature of the VMs leads to many things which can go wrong, and also to systems which might be poorly documented, with multipurpose functionality that isn't composable or easily load balanced.

However, by implementing this anti-pattern, we can better appreciate the microservices design pattern as it is properly implemented. This is important because the definition of a microservice is quite vague and, in my opinion, allows for quite a few anti-patterns, like SSH provisioning. According to Wikipedia:

"In computing, microservices is a software architecture design pattern, in which complex applications are composed of small, independent processes communicating with each other using language-agnostic APIs. These services are small, highly decoupled and focus on doing a small task."

This leaves a lot to the imagination. So... let's play around with some different ways to create a docker container that runs Apache Spark. Apache Spark is a distributed Big Data processing framework with a master/slave architecture (you need one master, and at least one slave). However, the principles can apply to any distributed system.

SSHD: A "naughty" microservice?

As a long-time vagrant user, I'm used to building vagrant infrastructure using this workflow:

  1. Define a "box" (i.e. an OS image).
  2. Make sure the "box" has SSH credentials, and SSH running.
  3. Write a Vagrantfile with my application semantics.

Come to think of it, vagrant boxes are in some ways designed along a microservice pattern, where the box itself does one thing, and one thing very well: spin up and run SSHD. After that, puppet, chef, or pure shell commands are run in the box to provision software.

So, here is an SSHD based microservice, in a Dockerfile.

FROM centos:centos7
RUN yum clean all
RUN yum install -y yum-utils
RUN yum-config-manager --save --setopt=fedora.skip_if_unavailable=true
RUN yum update -y
RUN yum install -y wget
RUN wget -O /etc/yum.repos.d/bigtop.repo

#### Now install SSH so we can layer in the spark components.
RUN yum -y install openssh-server openssh-clients sudo
RUN sed -i.bak 's/UsePAM yes/UsePAM no/' /etc/ssh/sshd_config
RUN ssh-keygen -q -N "" -t dsa -f /etc/ssh/ssh_host_dsa_key
RUN ssh-keygen -q -N "" -t rsa -f /etc/ssh/ssh_host_rsa_key
# requiretty off
RUN sed -i.bak 's/requiretty/!requiretty/' /etc/sudoers
# setup vagrant account
RUN mkdir /root/.ssh/
RUN chmod 0755 /root/.ssh
RUN wget --no-check-certificate -O /root/.ssh/authorized_keys
RUN chmod 0644 /root/.ssh/authorized_keys
CMD /usr/sbin/sshd -D

This Dockerfile does a few simple things. It adds some yum repos to an image and installs openssh*. This is preparation for starting and configuring spark as a service which we will run inside of a container. It's useful for legacy interop (that is, for creating a Dockerfile which can serve as a drop-in replacement for a VM). We can now start some services in the container from our provisioning framework.

(1..$spark_num_instances).each do |i|
 config.vm.define "scale#{i}" do |scale|
 scale.vm.provider "docker" do |d|
    d.build_dir = "spark/"
    d.create_args = ["--privileged=true", "-m", CONF["docker"]['memory_size'] + "m"]
    d.has_ssh = true
  if "#{i}" == "1"
scale.vm.provision "shell", inline: "yum install -y java-1.7.0-openjdk-devel.x86_64"
scale.vm.provision "shell", inline: "yum install -y spark-worker"
scale.vm.provision "shell", inline: "echo 'export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk/' >> /etc/spark/conf/"
scale.vm.provision "shell", inline: "echo 'export STANDALONE_SPARK_MASTER_HOST=scale1.docker' >> /etc/spark/conf/"

So, with these two snippets, we now have a Vagrantfile which is going to layer some services into our docker container, using the SSHD microservice as the mechanism to get into the container. While it's easy to modify the Vagrantfile to add new services into the image, there are some heavy costs we're paying:

  • The container is not immutable - its state is modified after it is started.  This means that it cannot be orchestrated as an atomic unit.
  • The container will take a while to start, because every time we start it we have to do some stuff to it after the fact.

Microservice Provisioning, the right way.

Consider the evolution of the container ecosystem. We now have tools such as Kubernetes, which is rapidly becoming a standard for highly available containerized applications. Kubernetes is based on the idea that a microservice, in and of itself, runs as an application service. In this sense, microservices can be composed into higher level applications which run in a fault tolerant, distributed context.

A microservice should be usable, with minimal changes, as part of software that is maintained by a higher level framework, such as the components of Red Hat's Atomic, or even a tool like vagrant.

Since we cannot really assume much about a higher level framework, our SSHD service container is not a particularly good design choice: it requires the framework to set up and install stuff on our containers, which assumes the framework can SSH into those containers, knows what to install on which container, and so on. This low level coupling is obviously suboptimal for a typical application, which would like to be loosely coupled to the services it is composed of, for example PostgreSQL (an RDBMS), Vowpal Wabbit (a real time classifier), Solr (a popular search engine), and Apache (the popular web server), which are typical resources an end to end application might rely on.

So, now let's look at a "real" microservice implementation, which satisfies the atomicity and immutability principles.

First, we will create a "jvm" container, which installs java and nothing else.  I won't show that container here, for space purposes.  But, once you have it, you can simply create it locally using "docker build -t jvm ./".  At that point, you can now build containers that leverage the base JVM container.
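Since the jvm base image is not shown, here is a minimal sketch of what its build context might look like, written as a small shell helper that generates the Dockerfile. The base image and the OpenJDK package name are the ones used elsewhere in this post; the helper name write_jvm_dockerfile and the single RUN layout are assumptions for illustration, not the actual file.

```shell
# Hypothetical helper: write a minimal "jvm" build context into a directory.
# Assumption: the image is just CentOS 7 plus the OpenJDK 1.7 package.
write_jvm_dockerfile() {
    target_dir="$1"
    mkdir -p "$target_dir"
    cat > "$target_dir/Dockerfile" <<'EOF'
FROM centos:centos7
RUN yum install -y java-1.7.0-openjdk-devel.x86_64 && yum clean all
EOF
}
```

Calling `write_jvm_dockerfile jvm` and then `docker build -t jvm jvm/` should give you an image that other Dockerfiles can build FROM.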

Now, for spark we will build a container which might look something like this (the if statement for looking at the hostname isn't particularly important here, it's an implementation detail for easy test spin up).

FROM jvm
RUN yum clean all
RUN yum install -y tar yum-utils wget
RUN yum-config-manager --save --setopt=fedora.skip_if_unavailable=true
RUN yum update -y
RUN yum install -y java-1.7.0-openjdk-devel.x86_64
COPY spark-1.2.0-bin-hadoop2.4.tgz /opt/
RUN tar -xzf /opt/spark-1.2.0-bin-hadoop2.4.tgz -C /opt/
RUN echo "SPARK_HOME=/opt/spark-1.2.0-bin-hadoop2.4" >> /etc/environment
RUN echo "JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk/" >> /opt/spark-1.2.0-bin-hadoop2.4/conf/
CMD if [[ `hostname` = 'scale1.docker' ]] ; then /opt/spark-1.2.0-bin-hadoop2.4/sbin/ ; else ping -c 2 scale1.docker && /opt/spark-1.2.0-bin-hadoop2.4/sbin/ -h spark://scale1.docker:7077 ; fi ; tailf /opt/spark-1.2.0-bin-hadoop2.4/logs/*

So what are the differences here?

  • This time, we ran directly from a spark tarball.  Since our container is doing one, and only one thing, the value of using an RPM or DEB packaged spark distribution is diminished, and we just use a tar distribution.  This is a common idiom you will see in microservices.
  • The container STARTS with a single service which is essential to our high level application.  Each container either runs a spark master, or a spark slave.
  • For those new to docker, note that we use COPY here to copy a LOCAL tgz file into /opt/.
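The hostname check in the CMD line of the Dockerfile above boils down to a small decision: scale1.docker becomes the master, and every other container becomes a worker that points at it. As a rough sketch, that decision can be pulled out into shell functions. The function names here are hypothetical, and the actual start scripts are deliberately left out.

```shell
# Decide which Spark role a container takes, based on its hostname.
# Mirrors the branch in the CMD line: scale1.docker is the master,
# any other hostname is treated as a worker.
spark_role() {
    case "$1" in
        scale1.docker) echo "master" ;;
        *)             echo "worker" ;;
    esac
}

# The URL a worker registers with (port 7077, as used in the CMD line).
spark_master_url() {
    echo "spark://scale1.docker:7077"
}
```

For example, `spark_role "$(hostname)"` run inside a container tells you which branch of the CMD that container would take.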

Idiomatic Microservices : Immutability leads to testability...

Now, let's look at how we orchestrate these components. In Vagrant, we can easily create a lightweight orchestration layer that leverages docker's --link option, as a replacement for heavyweight, fully functional orchestration frameworks. This allows us to test our microservices in a lightweight and cross-platform manner, and it highlights the benefit of building stand-alone microservices: they are testable in any environment, because they have a minimal dependency on the orchestration layer which launches them. Here is how our Vagrantfile for building these microservices will look. Again, this is a snippet for explaining the high level concepts.

# number of instances : First one is master.
$spark_num_instances = 2
$cassandra_num_instances = 1
Vagrant.configure("2") do |config|
    # nodes definition
    (1..$spark_num_instances).each do |i|
        config.vm.define "scale#{i}" do |scale|
            scale.vm.provider "docker" do |d|
                d.build_dir = "spark/"
       = "scale#{i}"
                d.create_args = ["--privileged=true"]
                d.remains_running = true
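                if "#{i}" == "1"
                    # The master publishes the application UI and the master port.
                    d.ports = [ "4040:4040", "7077:7077" ]
                else
                    # Every other container is linked to the master.
                    d.create_args = d.create_args << "--link" << "scale1:scale1.docker"
                end
            end
            scale.vm.synced_folder "./", "/scale-shared/"
            scale.vm.hostname = "scale#{i}.docker"
        end
    end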

    #With cassandra we don't have master/slave architecture.
    (1..$cassandra_num_instances).each do |i|
        config.vm.define "cassandra#{i}" do |scale|
            scale.vm.provider "docker" do |d|
                d.build_dir = "cassandra/"
                d.create_args = ["--privileged=true", "-m", $conf["docker"]['memory_size'] + "m"]
                d.remains_running = true
            end
            scale.vm.synced_folder "./", "/scale-shared/"
            scale.vm.hostname = "cassandra#{i}.docker"
        end
    end
end

The above Vagrantfile creates containers using the --link option of docker, which allows us to link one container to another when it spins up. It essentially follows the path outlined below. Note that this implementation only allows one cassandra node.

  1. If we are starting the 1st container, start up a spark master.
  2. If we are starting any other spark container, start a slave.  Link it to the already created master so that they can talk.
  3. Start a single cassandra node.
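The steps above can also be expressed as plain docker commands. The sketch below only prints the commands rather than running them. It assumes the spark image has been built and tagged spark (a hypothetical tag), that the master port is 7077 as in the spark:// URL, and it leaves out the shared folder and the cassandra node for brevity.

```shell
# Print (not run) the docker command for the i-th spark container.
# Assumptions: image tag "spark" is hypothetical; containers are named
# scale1, scale2, ... so that --link can refer to the master by name.
spark_run_command() {
    idx="$1"
    cmd="docker run -d --privileged=true --name scale$idx --hostname scale$idx.docker"
    if [ "$idx" -eq 1 ]; then
        # The master publishes the application UI and the master port.
        cmd="$cmd -p 4040:4040 -p 7077:7077"
    else
        # Workers are linked to the already created master.
        cmd="$cmd --link scale1:scale1.docker"
    fi
    echo "$cmd spark"
}

# First instance is the master, the rest are workers.
num_instances=2
i=1
while [ "$i" -le "$num_instances" ]; do
    spark_run_command "$i"
    i=$((i + 1))
done
```

Because the master must exist before a worker can link to it, the commands have to run in order, which is the same reason the test script uses `vagrant up --no-parallel`.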

To tie all of this together, here is how I test this setup.

sleep 5
# Remove all containers.
docker rm -f `docker ps --no-trunc -aq`
service docker restart
vagrant destroy --force && vagrant up --no-parallel
## Run calculate pi.
echo "RUNNING smoke tests..."
docker exec -i -t scale1 \
    --class org.apache.spark.examples.SparkPi \
    --master spark://scale1.docker:7077 /scale-shared/spark-examples_2.10-1.1.1.jar 10000

Note that the names vagrant assigned to the containers can be used to execute one-off commands in a smoke test of your microservices architecture. Of course, this can all be done quite easily using pure docker commands as well.
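The smoke test above relies on a fixed sleep and on vagrant up having finished. One possible alternative, sketched here with a hypothetical helper called wait_for, is to poll until a check succeeds before running the docker exec step:

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: wait_for <attempts> <delay_seconds> <command...>
wait_for() {
    attempts="$1"
    delay="$2"
    shift 2
    n=0
    while [ "$n" -lt "$attempts" ]; do
        # Stop as soon as the check passes.
        if "$@"; then
            return 0
        fi
        n=$((n + 1))
        # Only pause if another attempt is coming.
        if [ "$n" -lt "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}
```

For example, `wait_for 30 1 docker exec scale1 true` would keep trying for roughly 30 seconds until the scale1 container accepts commands, and return a non-zero status if it never does.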

This post should help get you started containerizing your bigdata, or other services, in a way which is truly immutable and thus easily testable. 

Watch it in action!

Here's a video which shows the entire application, including a distributed spark test against multiple running microservice containers in action. In this video, we are only running spark, but you can easily add some links to the cassandra container to test that manually as well. The current code for this is here, but it may be moved at some point.

Feel free to reach out to me personally or leave comments about these ideas. Special thanks to Tim St. Clair at Red Hat for helping me to develop some of these concepts and differentiate "real" microservices from "naughty" ones!

Last updated: March 15, 2023