Microservice principles and Immutability - demonstrated with Apache Spark and Cassandra


January 20, 2015
jay vyas
Related topics: Artificial intelligence, Containers, DevOps
Related products: Red Hat OpenShift

    Containerizing things is particularly popular these days. Today we'll talk about the idioms we can use for containerization, and specifically play with Apache Spark and Cassandra in a use case for creating easily deployed, immutable microservices.

     

    Note: This post uses CentOS 7 as the base for its containers, but the same recipes apply to RHEL and Fedora base images.

    There are a few different ways to build a container. For example, beginners can build a container as a "lightweight VM", but in some ways this is a bit of an anti-pattern. Rather than serving as lightweight VMs, containers are increasingly built following the microservices movement, which is quite different from the pre-container view of scalable systems (typically, mutable services on VMs with heavyweight provisioning).

    One of the most exciting things about containers is that, if designed properly, they can embrace the immutability principle: a fundamental concept behind many increasingly popular functional programming languages (which gave rise to frameworks such as Spark and MapReduce) and behind highly reliable software systems that are easy to parallelize. Immutable services can, by definition, be deployed without heavyweight installers or configuration management, which paves the way for a new paradigm in load balancing, high availability, and dynamic resource sharing.

    Enough high-level stuff. To really grok these connections, you need to build a distributed system with containers from scratch. In this post, I'll walk through two different ways to containerize an application. There are many ways to do it, but the microservices movement, which is becoming the more popular idiom for using Docker, provides benefits that cannot be realized with just "any" Dockerfile. Additionally, for folks interested in Spark, we will demonstrate how to orchestrate and test a Spark microservices cluster using Docker on CentOS. And even if you don't fully buy into the microservices ecosystem in its current state, building clean, simple containers that do one thing perfectly is probably never a mistake.

    For those wondering how to use Vagrant in the post-VM universe we now live in, this post should also be helpful. To leverage the snippets here, you will need Docker and Vagrant installed. If you are on a non-Linux system, you can install VMware or VirtualBox and have Vagrant launch the containers for you inside a Docker-friendly VM of your choice. In any case, I have tried to keep the code snippets to the point so that this post can be read in isolation.

    I hope that, by the end of this post, a light bulb will go off in your head, and you too will see the connection between immutable infrastructure, idiomatically designed microservices, and portable testing of distributed applications. (In this case we use Vagrant, but thanks to a microservice's self-sufficient and easily composable nature, you can test it with any orchestration framework, even a shell script.)

    This intimate connection will be realized in this post by building some Dockerized Spark containers (first, the wrong way), and then rebuilding them as proper standalone microservices.

    So let's get started!



    In the first attempt, we will create some simple containers and use SSH to get into them so that we can start some services. This is a common design pattern for typical cloud-based infrastructure and testing, and it is extremely flexible and easy to hack around with, but it's also an anti-pattern: the mutable nature of the containers leads to many things that can go wrong, and to systems that may be poorly documented, with multipurpose functionality that isn't composable or easily load balanced.

    However, by implementing this anti-pattern, we can better appreciate the microservices design pattern as it looks when properly implemented. This matters because the definition of a microservice is quite vague and, in my opinion, allows for quite a few anti-patterns, such as SSH provisioning. According to Wikipedia:

    "In computing, microservices is a software architecture design pattern, in which complex applications are composed of small, independent processes communicating with each other using language-agnostic APIs. These services are small, highly decoupled and focus on doing a small task."

    This leaves a lot to the imagination. So let's play around with some different ways to create a Docker container that runs Apache Spark. Apache Spark is a distributed big data processing framework with a master/slave architecture (you need one master and at least one slave), but the principles here apply to any distributed system.


    SSHD: A "naughty" microservice?

    As a long-time Vagrant user, I'm used to building Vagrant infrastructure using this workflow:

    1. Define a "box" (i.e. an OS image).
    2. Make sure the "box" has SSH credentials, and SSH running.
    3. Write a Vagrantfile with my application semantics.

    Come to think of it, Vagrant boxes are actually designed along a microservice pattern in some ways: the box itself does one thing, and one thing very well: spin up and run SSHD. After that, Puppet, Chef, or plain shell commands are run in the box to provision software.

    So, here is an SSHD based microservice, in a Dockerfile.

    FROM centos:centos7
    RUN yum clean all
    RUN yum install -y yum-utils
    RUN yum-config-manager --save --setopt=fedora.skip_if_unavailable=true
    RUN yum update -y
    RUN yum install -y wget
    RUN wget -O /etc/yum.repos.d/bigtop.repo http://bigtop01.cloudera.org:8080/view/Releases/job/Bigtop-0.8.0/label=centos6/6/artifact/output/bigtop.repo
    
    #### Now install SSH so we can layer in the spark components.
    RUN yum -y install openssh-server openssh-clients sudo
    RUN sed -i.bak 's/UsePAM yes/UsePAM no/' /etc/ssh/sshd_config
    RUN ssh-keygen -q -N "" -t dsa -f /etc/ssh/ssh_host_dsa_key
    RUN ssh-keygen -q -N "" -t rsa -f /etc/ssh/ssh_host_rsa_key
    # requiretty off
    RUN sed -i.bak 's/requiretty/!requiretty/' /etc/sudoers
    # setup vagrant account
    RUN mkdir /root/.ssh/
    RUN chmod 0755 /root/.ssh
    RUN wget http://github.com/mitchellh/vagrant/raw/master/keys/vagrant.pub --no-check-certificate -O /root/.ssh/authorized_keys
    RUN chmod 0644 /root/.ssh/authorized_keys
    CMD /usr/sbin/sshd -D
    

    This Dockerfile does a few simple things. It adds some yum repos to an image and installs openssh*. This is preparation for starting and configuring Spark as a service running inside the container. It's useful for legacy interop (that is, for creating a Dockerfile that can serve as a drop-in replacement for a VM). We can now start some services in the container from our provisioning framework.

    (1..$spark_num_instances).each do |i|
      config.vm.define "scale#{i}" do |scale|
        scale.vm.provider "docker" do |d|
          d.build_dir = "spark/"
          d.create_args = ["--privileged=true", "-m", CONF["docker"]['memory_size'] + "m"]
          d.has_ssh = true
        end
        if i == 1
          scale.vm.provision "shell", inline: "yum install -y java-1.7.0-openjdk-devel.x86_64"
          scale.vm.provision "shell", inline: "yum install -y spark-worker"
          scale.vm.provision "shell", inline: 'echo "export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk/" >> /etc/spark/conf/spark-env.sh'
          scale.vm.provision "shell", inline: 'echo "export STANDALONE_SPARK_MASTER_HOST=scale1.docker" >> /etc/spark/conf/spark-env.sh'
        end
      end
    end


    So, with these two snippets, we now have a Vagrantfile that layers some services into our Docker container, using the SSHD microservice as the mechanism to get into the container. While it's easy to modify the Vagrantfile to add new services into the image, there are some heavy costs we're paying:

    • The container is not immutable: its state is modified after it starts, so it cannot be orchestrated as an atomic unit.
    • The container takes a while to start, because provisioning work has to run on every startup.

    Microservice Provisioning, the right way.

    Consider the evolution of the container ecosystem. We now have tools such as Kubernetes, which is rapidly becoming a standard for highly available containerized applications. Kubernetes is based on the idea that a microservice, in and of itself, runs as an application service. In this sense, microservices can be composed into higher-level applications that run in a fault-tolerant, distributed context.

    A microservice should, at least in part, be designed so that it can be managed, with minimal changes, by a higher-level framework: the components of Red Hat's Atomic, or even a tool like Vagrant.

    Since we cannot really assume much about a higher-level framework, this pretty much means our SSHD service container is not a particularly good design choice: it requires the framework to set up and install things on our containers, which assumes the framework can SSH into those containers, knows what to install on which container, and so on. This low-level coupling is clearly suboptimal for a typical application, which would rather be loosely coupled to composed services such as PostgreSQL (an RDBMS), Vowpal Wabbit (a real-time classifier), Solr (a popular search engine), or Apache (the popular web server), the kinds of resources an end-to-end application might rely on.

    So now let's look at a "real" microservice implementation, one that satisfies the atomicity and immutability principles.

    First, we will create a "jvm" container, which installs Java and nothing else. I won't show that container here, for brevity. But once you have it, you can build it locally with "docker build -t jvm ./". At that point, you can build containers that layer on top of the base JVM container.
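    The base image itself isn't shown in the post; a minimal sketch of what such a "jvm" Dockerfile could look like follows (the package name here mirrors the Spark Dockerfile below; this is an assumption, not the author's original file):

```dockerfile
# Hypothetical "jvm" base image: installs Java and nothing else.
FROM centos:centos7
RUN yum clean all
RUN yum update -y
RUN yum install -y java-1.7.0-openjdk-devel.x86_64
```

    Building it with "docker build -t jvm ./" lets the "FROM jvm" line in the Spark Dockerfile below resolve.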

    Now, for Spark, we will build a container that might look something like this (the if statement inspecting the hostname isn't particularly important; it's an implementation detail for easy test spin-up).

    FROM jvm
    RUN yum clean all
    RUN yum install -y tar yum-utils wget
    RUN yum-config-manager --save --setopt=fedora.skip_if_unavailable=true
    RUN yum update -y
    RUN yum install -y java-1.7.0-openjdk-devel.x86_64
    COPY spark-1.2.0-bin-hadoop2.4.tgz /opt/
    RUN tar -xzf /opt/spark-1.2.0-bin-hadoop2.4.tgz -C /opt/
    RUN echo "SPARK_HOME=/opt/spark-1.2.0-bin-hadoop2.4" >> /etc/environment
    RUN echo "JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk/" >> /opt/spark-1.2.0-bin-hadoop2.4/conf/spark-env.sh
    CMD if [[ `hostname` = 'scale1.docker' ]] ; then /opt/spark-1.2.0-bin-hadoop2.4/sbin/start-master.sh ; else ping -c 2 scale1.docker && /opt/spark-1.2.0-bin-hadoop2.4/sbin/start-slave.sh -h spark://scale1.docker:7077 ; fi ; tailf /opt/spark-1.2.0-bin-hadoop2.4/logs/*
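    The hostname check in the CMD line above boils down to a simple role decision, which can be isolated as a small shell function (a sketch; scale1.docker is the master hostname used throughout this post):

```shell
# Pick the Spark role for a container based on its hostname:
# scale1.docker is the designated master; everything else is a worker.
spark_role() {
  if [ "$1" = "scale1.docker" ]; then
    echo "master"
  else
    echo "worker"
  fi
}

spark_role "scale1.docker"   # prints: master
spark_role "scale2.docker"   # prints: worker
```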
    

    So what are the differences here?

    • This time, we run directly from a Spark tarball. Since our container does one, and only one, thing, the value of an RPM- or DEB-packaged Spark distribution is diminished, so we just use a tar distribution. This is a common idiom you will see in microservices.
    • The container STARTS with a single service that is essential to our high-level application. Each container runs either a Spark master or a Spark slave.
    • For those new to Docker, note that we use COPY here to copy a LOCAL tgz file into /opt/.
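    The tarball idiom from the Dockerfile (COPY a .tgz in, extract it under /opt, append environment settings) can be tried outside Docker. This sketch builds a dummy distribution in a temp directory instead of using the real Spark tarball:

```shell
# Simulate the COPY + tar -xzf + env-file idiom with a dummy distribution.
WORK=$(mktemp -d)
mkdir -p "$WORK/src/spark-1.2.0-bin-hadoop2.4/conf"
tar -czf "$WORK/spark-1.2.0-bin-hadoop2.4.tgz" -C "$WORK/src" spark-1.2.0-bin-hadoop2.4

# Equivalent of the RUN tar / RUN echo lines, rooted at $WORK instead of /.
mkdir -p "$WORK/opt"
tar -xzf "$WORK/spark-1.2.0-bin-hadoop2.4.tgz" -C "$WORK/opt"
echo "SPARK_HOME=$WORK/opt/spark-1.2.0-bin-hadoop2.4" >> "$WORK/environment"
cat "$WORK/environment"
```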

    Idiomatic microservices: Immutability leads to testability

    Now let's look at how we orchestrate these components. In Vagrant, we can easily create a lightweight orchestration layer that leverages Docker's --link option, as a replacement for a heavyweight, fully functional orchestration framework. This lets us test our microservices in a lightweight, cross-platform manner, and it highlights a benefit of building standalone microservices: they are testable in any environment, because they have a minimal dependency on the orchestration layer that launches them. Here is how our Vagrantfile for these microservices looks; again, this is a snippet, meant to explain the high-level concepts.

    # number of instances : First one is master.
    $spark_num_instances = 2
    $cassandra_num_instances = 1
    Vagrant.configure("2") do |config|
        # nodes definition
        (1..$spark_num_instances).each do |i|
            config.vm.define "scale#{i}" do |scale|
                scale.vm.provider "docker" do |d|
                    d.build_dir = "spark/"
                    d.name = "scale#{i}"
                    d.create_args = ["--privileged=true"]
                    d.remains_running = true
                    if "#{i}" == "1"
                        d.ports = [ "4040:4040", "7707:7707" ]
                    else
                        d.create_args = d.create_args << "--link" << "scale1:scale1.docker"
                    end
                end
                scale.vm.synced_folder "./", "/scale-shared/"
                scale.vm.hostname = "scale#{i}.docker"
           end
        end
    
        #With cassandra we don't have master/slave architecture.
        (1..$cassandra_num_instances).each do |i|
            config.vm.define "cassandra#{i}" do |scale|
                scale.vm.provider "docker" do |d|
                    d.build_dir = "cassandra/"
                    d.create_args = ["--privileged=true", "-m", $conf["docker"]['memory_size'] + "m"]
                    d.remains_running = true
                end
                scale.vm.synced_folder "./", "/scale-shared/"
                scale.vm.hostname = "cassandra#{i}.docker"
            end
        end
    end

    The above Vagrantfile creates containers using Docker's --link option, which links one container to another when it spins up. It essentially follows the path outlined below. Note that this implementation only allows one Cassandra node.

    1. If we are starting the first container, start a Spark master.
    2. If we are starting any other Spark container, start a slave and link it to the already-created master so they can talk.
    3. Start a single Cassandra node.
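    For comparison, the same three steps can be wired up with plain docker commands, no Vagrant involved (a sketch, assuming you have built the images above with tags spark and cassandra; not a tested script):

```shell
# 1. Start the master first, so workers have something to link to.
docker run -d --name scale1 -h scale1.docker -p 4040:4040 spark
# 2. Each additional Spark container links to the master by name.
docker run -d --name scale2 -h scale2.docker --link scale1:scale1.docker spark
# 3. A single Cassandra node, standing alone.
docker run -d --name cassandra1 -h cassandra1.docker cassandra
```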

    To tie all of this together, here is how I test this setup.

    echo "WARNING REMOVING ALL CONTAINERS in 5 SECONDS !"
    sleep 5
    # Remove all containers.
    docker rm -f `docker ps --no-trunc -aq`
    echo "NOW RESTARTING DOCKER !"
    service docker restart
    echo "NOW CREATING VAGRANT DOCKER CLUSTER "
    vagrant destroy --force && vagrant up --no-parallel
    ## Run calculate pi.
    echo "RUNNING smoke tests..."
    docker exec -i -t scale1 \
      /opt/spark-1.2.0-bin-hadoop2.4/bin/spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master spark://scale1.docker:7077 /scale-shared/spark-examples_2.10-1.1.1.jar 10000
    echo "DONE TESTING .  RESULT OF PI CALCULATION ABOVE "
    

    Note that the names Vagrant assigned to the containers can be used to execute one-off commands in a smoke test of your microservices architecture. All of this can also, of course, be done quite easily with pure Docker commands.

    This post should help you get started containerizing your big data (or other) services in a way that is truly immutable, and thus easily testable.


    Watch it in action!

    Here's a video showing the entire application in action, including a distributed Spark test against multiple running microservice containers. In the video we only run Spark, but you can easily add links to the Cassandra container to test that manually as well. The current code for this is at https://github.com/jayunit100/SparkStreamingCassandraDemo/tree/master/deploy/, but it may be moved at some point.

    Feel free to reach out to me personally or leave comments about these ideas. Special thanks to Tim St. Clair at Red Hat for helping me develop some of these concepts and differentiate "real" microservices from "naughty" ones!

    Last updated: March 15, 2023

