Convergence, Immutability, and Image-based Deployments

As our industry continues to adopt lean methodologies in an effort to improve the workflow of product deliverables, it’s important that the products developed using these patterns are reliable. From an application infrastructure perspective, the Ops side of DevOps, this means that we must continue to improve resiliency, predictability, and consistency while streamlining our development workflows to allow for failing fast and failing often.

When faced with a critical incident, it’s dissatisfying to find that the root cause was an environment delta that only affected a subset of your infrastructure. You begin asking questions like, “Why aren’t all our nodes configured with the same parameters? Why aren’t we running the same package versions on all of our nodes? Why is the staging environment different from production?”
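To make the idea of an environment delta concrete, here is a minimal Python sketch that compares package versions across nodes and reports only the packages that differ. The node names, package versions, and the `find_package_deltas` helper are hypothetical and exist only for illustration; in practice the inventory would come from whatever your tooling already collects.

```python
from collections import defaultdict

MISSING = "<missing>"  # placeholder version for nodes that lack a package entirely


def find_package_deltas(inventory):
    """Return {package: {version: [nodes]}} for packages that differ across nodes.

    `inventory` maps a node name to a {package: version} dict (hypothetical shape).
    """
    versions = defaultdict(lambda: defaultdict(list))
    for node, packages in inventory.items():
        for name, version in packages.items():
            versions[name][version].append(node)

    deltas = {}
    for name, by_version in versions.items():
        # A package installed on some nodes but not others is also a delta.
        absent = [node for node, pkgs in inventory.items() if name not in pkgs]
        if absent:
            by_version[MISSING].extend(absent)
        if len(by_version) > 1:
            deltas[name] = {v: sorted(nodes) for v, nodes in by_version.items()}
    return deltas


# Illustrative data only: three web nodes, one of which has drifted.
example_inventory = {
    "web1": {"openssl": "1.0.1e", "nginx": "1.4.4"},
    "web2": {"openssl": "1.0.1e", "nginx": "1.4.4"},
    "web3": {"openssl": "1.0.1f", "nginx": "1.4.4"},
}
example_deltas = find_package_deltas(example_inventory)
```

A report like this only answers the “same package versions” question; configuration parameters and differences between staging and production would need a similar comparison of their own.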

In many IT shops, deploying gold-master machine images would seemingly answer a lot of these questions. By using ‘blessed’ machine images across your environments, you can ensure that the application stack is consistent across each of your node clusters, and having tested those stacks in the lower-level environments, you gain confidence in how these systems will perform when presented with production loads. As an added bonus, the common use-case of image-based deployments starts the application node builds ‘further down the road’, rather than starting stack configurations with nodes that merely have base OS installs.

Let’s merge this idea with the concept of configuration management. Inside (and outside) of the DevOps paradigm, many shops build ‘infrastructure as code’, leveraging configuration management tools like Puppet, Chef, or Salt. While the use of such tools draws parallels to image-based deployments, configuration tools are meant to maintain the desired node state; when applying configuration management code to base OS installs, you arrive at the desired state by building up the node’s packages and configuration parameters to meet the configuration requirements. This takes time, though eventually you end up with an application infrastructure that is standardized across your environments.

To increase the speed of rolling out application nodes, you can combine these two concepts by applying configuration management code to machine instances that have ‘most’ of the application stack previously installed and configured. This way, there is less Puppet code (for example) to apply to a node when prepping it for deployment to your infrastructure.

A marriage made in Eden, right? Well, let’s consider this scenario:

Your team has deployed nodes from machine images into all of the required environments. The state of these nodes is kept consistent with configuration management. The deployments are faster than they were previously because your images are now built with most of the required elements to deploy your applications. You’ve conducted test-driven deployments, so the number of nodes you need at varying points of your application infrastructure is well established, and the infrastructure holds up well under production load.

Next, new features are implemented in your application code, and they require changes to the infrastructure. No problem – the instances you’ve deployed run $CONFIG_MGMT_TOOL, so it’s simple to account for the changes in the infrastructure as required by the application. Changes are tested and pushed through the environments, and (therefore) the nodes that exist in them. This happens on a regular basis, and your nodes accept these changes without errors.

After months of operating with nodes that have been in production, the time comes when you need to spin up a new instance. Your images are now months old, and your development pipeline has been designed to simply modify the existing instances in place. How confident are you that new nodes can spin up without encountering any issues, and how fast will those nodes be placed into your environment and become useful inside your application’s infrastructure? This is the quandary of convergence.

A PuppetCamp London 2013 presentation entitled “Taking Control of Chaos with Docker and Puppet”, presented by Tomas Doran at Yelp, makes an argument for convergence and immutability. Doran argues that, “you’re doing it wrong, unless you converge in exactly one puppet run…you should never keep a machine… (and) unless you regularly rebuild, you don’t know that you can rebuild…” For me, this presentation summarizes the case for building immutable instances in your continuous deployment work stream. By continuously rebuilding nodes from scratch, you prove that your configuration management system is capable of building clean instances every time, and if you are dynamically growing and shrinking your infrastructure, you remain confident that new nodes will deploy and immediately begin adding value. Comedic divergence – that’s +10 points for using a derivative of the buzzword ‘value-add’, if you’re keeping score :-).
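One way to read “converge in exactly one run” is as a testable property: start from a clean state, apply the configuration once, and confirm that a second run has nothing left to change. The Python sketch below models that check with toy stand-ins. `apply_config`, `apply_with_ordering_bug`, and the state dictionaries are hypothetical illustrations, not how Puppet or any other tool works internally.

```python
import copy


def apply_config(state, desired):
    """Toy stand-in for one configuration management run.

    Mutates `state` toward `desired` and returns the number of changes made.
    """
    changes = 0
    for key, value in desired.items():
        if state.get(key) != value:
            state[key] = value
            changes += 1
    return changes


def apply_with_ordering_bug(state, desired):
    """Toy run with a hidden ordering dependency: the service is only
    configured if the package was already present before the run started."""
    changes = 0
    package_was_present = state.get("package") == desired["package"]
    if not package_was_present:
        state["package"] = desired["package"]
        changes += 1
    if package_was_present and state.get("service") != desired["service"]:
        state["service"] = desired["service"]
        changes += 1
    return changes


def converges_in_one_run(initial_state, desired, apply=apply_config):
    """True if one run reaches `desired` and a second run changes nothing."""
    state = copy.deepcopy(initial_state)  # leave the caller's state untouched
    apply(state, desired)
    second_run_changes = apply(state, desired)
    reached = all(state.get(k) == v for k, v in desired.items())
    return reached and second_run_changes == 0


desired_state = {"package": "nginx-1.4.4", "service": "running"}
clean_build_ok = converges_in_one_run({}, desired_state)
buggy_build_ok = converges_in_one_run({}, desired_state, apply=apply_with_ordering_bug)
```

Here the clean build passes while the buggy one fails, because it needs a second run to configure the service. Long-lived nodes that are only ever updated in place never exercise that first-run path, which is why regular rebuilds are the only way this kind of problem surfaces.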

Managing machine images (and the instances spawned from them) is a topic worthy of an article of its own. Even as I’m writing this article, in my mind the topic being addressed here is not whether machine images are useful inside your architecture; it is the idea that immutable node instances can be the deployable artifacts delivered to production, and how expensive taking that approach will be.

The new maxim being heard around DevOps shops is, “Containerization is the new virtualization”. VMs are currently our industry standard, but we now know that VMs come ‘batteries included’ with a heavy burden. Deploying VMs comes with the fixed cost of managing the entire OS, when perhaps the service(s) in use on that VM are minimal. Ah ha! Now, it makes sense. Containers are lightweight, they provide the appropriate level of isolation, and the convergence problem is minimized, since you’re not wasting cycles deploying entire machines full of configurations and services that are unusable or unimportant to the mission.

Technologies like Docker provide us with a way to package and ship isolated and purposeful applications/services inside of a lightweight container, though as cutting edge and ‘sexy’ as this technology is, the jury is still out on how well it measures up in production environments.

To close, it’s my view (as an Operations professional) that it’s time for our industry to validate the viability of incorporating machine images, deploy Linux containers, and ultimately solve the environmental deltas problem with a solution that conforms to our cycle time goals. As we adopt the DevOps culture within our development organizations, CI/CD certainly should not translate to: “Let’s deploy nodes to our application’s infrastructure more quickly, and more often, without taking stability, standardization, and operational confidence into consideration.”



  1. Thanks for the shout out, and great article. I’m being (a little) extreme in that presentation. :-)

    I _do_ firmly believe that you want to be able to converge in exactly one Puppet/Chef run. This is not negotiable, as you open a massive box of fail if you have to build machines in order, etc.

    For testing, it is 100% true that you don’t know if you can cleanly build a node unless you’re doing so from your tests (and so you should be!). The harder problem is you cannot _prove_ (it reduces to the halting problem) that the current config management code can update servers from an arbitrary state to the desired state – ergo building cleanly (which you can at least prove by doing it) is highly desirable.

    Always replacing machines (phoenix / immutable servers) is really, really powerful for cases where you have stateless instances (web servers, for example); however, being pragmatic, there are whole classes of nodes (e.g. MySQL servers, HDFS nodes) where blowing the machine away to do config updates is never going to be a workable thing.

    The set of cases where you can provably build cleanly, but not permute the current infrastructure into the new state is a fairly limited set, but still in some ways a scary possibility – despite much testing effort, you still don’t have any sort of real confidence…

    I explored this topic (infrastructure testing) in more detail in my puppet conf talk last year: http://puppetlabs.com/presentations/test-driven-infrastructure-development

    Containers allow you to explore a nice hybrid of the replace vs update strategies; for example I replace the postfix container for any and all configuration updates, but the postfix spool directory is kept on a persistent volume. This gives you the best of both worlds.
