One of the cool things about separating the container runtimes into different tools is that you can start to combine them to help secure one another.
Lots of people would like to build OCI/container images within a system like Kubernetes. Imagine you have a CI/CD system that is constantly building container images; a tool like Red Hat OpenShift/Kubernetes would be useful for distributing the load of builds. Until recently, most people were leaking the Docker socket into the container and then allowing the containers to do docker build. As I pointed out years ago, this is one of the most dangerous things you can do. Giving people root access on the system or sudo without requiring a password is more secure than allowing access to the Docker socket.
Because of this, many people have been attempting to run Buildah within a container. We have been watching and answering questions on this for a while. We have built an example of what we think is the best way to run Buildah inside of a container and have made these container images public at quay.io/buildah.
These images are built off the Dockerfiles provided in the Buildah repo `buildahimage` directory: https://github.com/containers/buildah/tree/master/contrib/buildahimage.
I will examine the stable Dockerfile.
    # stable/Dockerfile
    #
    # Build a Buildah container image from the latest
    # stable version of Buildah on the Fedoras Updates System.
    # https://bodhi.fedoraproject.org/updates/?search=buildah
    # This image can be used to create a secured container
    # that runs safely with privileges within the container.
    #
    FROM fedora:latest

    # Don't include container-selinux and remove
    # directories used by dnf that are just taking
    # up space.
    RUN yum -y install buildah fuse-overlayfs --exclude container-selinux; rm -rf /var/cache /var/log/dnf* /var/log/yum.*

    # Adjust storage.conf to enable Fuse storage.
    RUN sed -i -e 's|^#mount_program|mount_program|g' -e '/additionalimage.*/a "/var/lib/shared",' /etc/containers/storage.conf
We use the fuse-overlayfs program inside of the container rather than using the host kernel overlay. The reason is that, currently, kernel overlay mounts require the SYS_ADMIN capability, and we want to be able to run our Buildah containers with no more privileges than a normal root container. Fuse-overlayfs works quite well and gives us better performance than using the VFS storage driver. Note that using Fuse requires people running the Buildah container to provide the /dev/fuse device.
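Since a missing device only shows up once the build starts, a launch script could check for it first. The following is a minimal sketch; the helper name fuse_device_opt is hypothetical and is not part of Buildah or Podman.

```shell
#!/usr/bin/env bash
# Hypothetical preflight helper (not part of Buildah or Podman).
# Prints the --device option for podman if the FUSE device exists,
# otherwise reports the problem and returns non-zero.
fuse_device_opt() {
    local dev="${1:-/dev/fuse}"
    if [ -e "$dev" ]; then
        printf -- '--device %s\n' "$dev"
    else
        echo "error: $dev not found; is the fuse kernel module loaded?" >&2
        return 1
    fi
}
```

The printed option is the same one passed to podman by hand in the command that follows.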
    podman run --device /dev/fuse quay.io/buildahctr ...

    RUN mkdir -p /var/lib/shared/overlay-images /var/lib/shared/overlay-layers; touch /var/lib/shared/overlay-images/images.lock; touch /var/lib/shared/overlay-layers/layers.lock
Here I am setting up a directory for additional stores. Containers/storage supports this concept of adding additional read-only image stores. For example, you could set up an overlay storage area on one machine, NFS-mount the storage onto another machine, and use the images without having to pull them down. We plan to use this storage so that we can volume-mount some image storage from the host to be used within the container.
    # Set up environment variables to note that this is
    # not starting with user namespace and default to
    # isolate the filesystem with chroot.
    ENV _BUILDAH_STARTED_IN_USERNS="" BUILDAH_ISOLATION=chroot
Finally, we default the Buildah container to run with chroot isolation. Setting the environment variable BUILDAH_ISOLATION tells Buildah to default to using chroot. We don’t need to run with extra isolation because we are already running within a container. Having Buildah create its own namespace-separated containers requires SYS_ADMIN privileges and requires us to relax SELinux and SECCOMP rules on the running container, defeating the purpose of running builds within a locked-down container.
Running Buildah inside a container
The way we designed the Buildah container images above allows us to get the ultimate flexibility on how we launch the containers.
Security vs. speed
In the world of computer security, there is always a battle between the speed at which a process can run and the amount of security we can wrap around it. When building containers, we have the same tradeoffs. In the following section, I will describe the tradeoff between speed and security.
The container image above is going to keep its storage in /var/lib/containers, so we need to volume-mount content into this directory, and this volume can dramatically change the speed of building the container image.
Let’s look at three potential examples.
1. For the most secure option, I could create a new directory for containers/image for each container and volume-mount it into the container. We will also volume-mount the context directory into the container (the -v ./build:/build:z option below).

    # mkdir /var/lib/containers1
    # podman run -v ./build:/build:z -v /var/lib/containers1:/var/lib/containers:Z quay.io/buildah/stable \
        buildah bud -t image1 /build
    # podman run -v /var/lib/containers1:/var/lib/containers:Z quay.io/buildah/stable buildah push \
        image1 registry.company.com/myuser
    # rm -rf /var/lib/containers1
Security: Buildah running within this container is fully locked down, and it is running with dropped capabilities, SECCOMP enforcing, and SELinux enforcing. You could even run this container with user namespace separation by adding something like
Performance: This is the least performant because it will need to pull down all of the images it will use from container registries, and it cannot take advantage of caches. When the Buildah container is done, it should push the image to the registry and destroy the content. Future container images that you might want to build off this new container image will have to pull the new image from a registry, because the image was removed from the host.
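The create, build, push, destroy cycle described above can be wrapped in a script so the private storage is removed even when a build fails. This is only a sketch: the names run, build_isolated, DRY_RUN, and STORAGE_ROOT are hypothetical, and the --device /dev/fuse option is added on the assumption that the image needs it for fuse-overlayfs, as noted earlier.

```shell
#!/usr/bin/env bash
# Sketch of the most secure setup as a script. The function and variable
# names (run, build_isolated, DRY_RUN, STORAGE_ROOT) are hypothetical.

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# usage: build_isolated CONTEXT_DIR IMAGE_NAME DESTINATION
build_isolated() {
    local context="$1" image="$2" dest="$3" storage status=0

    # Fresh, private storage directory used by this build only.
    storage="$(mktemp -d "${STORAGE_ROOT:-/var/lib}/containers-build.XXXXXX")" || return 1

    # Build the image from the context directory.
    run podman run --device /dev/fuse \
        -v "$context:/build:z" \
        -v "$storage:/var/lib/containers:Z" \
        quay.io/buildah/stable buildah bud -t "$image" /build || status=$?

    # Push only if the build succeeded.
    if [ "$status" -eq 0 ]; then
        run podman run --device /dev/fuse \
            -v "$storage:/var/lib/containers:Z" \
            quay.io/buildah/stable buildah push "$image" "$dest" || status=$?
    fi

    # Destroy the storage whether or not the build worked.
    rm -rf "$storage"
    return "$status"
}
```

Calling it with DRY_RUN=1 prints the two podman commands without running them, which is a cheap way to review what a CI job would execute.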
2. If we want to match Docker performance, we could volume-mount the host's containers/storage into the container.
    # podman run -v ./build:/build:z -v /var/lib/containers:/var/lib/containers \
        --security-opt label=disable quay.io/buildah/stable buildah bud -t image2 /build
    # podman run -v /var/lib/containers:/var/lib/containers --security-opt label=disable \
        quay.io/buildah/stable buildah push image2 registry.company.com/myuser
Security: This is the least secure way of building containers. The container is allowed to modify the container storage on the host and could potentially cause Podman or CRI-O to do things with rogue images. Also, to make this work, I had to disable SELinux separation; otherwise SELinux would block the Buildah container processes from interacting with the host's storage. Note that this is still better than running with the Docker socket, because the container is still locked down with the other security features and cannot easily launch a container on the host.
Performance: This is the best performance because it can take advantage of the cache. If Podman or CRI-O previously pulled the image to the host, the Buildah process inside of the container will not need to re-pull the image. Future builds based on this image can also take advantage of the cache.
3. A third way to build the containers would be to create a project container image directory and to share this image directory between all of the containers in the project.
    # mkdir /var/lib/project3
    # podman run --security-opt label=level:s0:c100,c200 -v ./build:/build:z \
        -v /var/lib/project3:/var/lib/containers:Z quay.io/buildah/stable buildah bud -t image3 /build
    # podman run --security-opt label=level:s0:c100,c200 \
        -v /var/lib/project3:/var/lib/containers quay.io/buildah/stable buildah push image3 \
        registry.company.com/myuser
In the third example, I don’t remove the project directory (/var/lib/project3) between runs, so future builds within the same project can take advantage of the cache.
Security: This is the middle ground of security. The containers do not get access to the host's content and cannot cause Podman/CRI-O to do bad things by writing content to their image store. Containers can, however, affect other container builders within the same project.
Performance: This setup might be less performant than sharing the cache with the host, because it cannot take advantage of images previously pulled by Podman/CRI-O. But once one Buildah pulls an image, all of the other builds can take advantage of the image.
Containers/storage has a cool feature called additional stores, which allows container engines to use external container overlay image stores read-only when running and building a container. Basically, you can add one or more read-only stores to the storage.conf file, and then when running a container, the container engine will search each of the stores for the image you want to run. It will only pull the image from a registry if it cannot find the image in any of the stores. The container engine will only be able to write to its single writable store.
If you go back to look at the Dockerfile we used to build the quay.io/buildah/stable image, you will see these lines:
    # Adjust storage.conf to enable Fuse storage.
    RUN sed -i -e 's|^#mount_program|mount_program|g' -e '/additionalimage.*/a "/var/lib/shared",' /etc/containers/storage.conf
    RUN mkdir -p /var/lib/shared/overlay-images /var/lib/shared/overlay-layers; touch /var/lib/shared/overlay-images/images.lock; touch /var/lib/shared/overlay-layers/layers.lock
The first line is modifying /etc/containers/storage.conf inside of the container image. It is telling the storage driver to use "additionalimagestores" in the /var/lib/shared directory. In the next line, I create the shared directory and add a couple of lock files to keep containers/storage happy. Basically, this is creating an empty container image store.
If we volume-mount in containers/storage on top of this directory, then Buildah will be able to use the images.
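To see what those two RUN steps do without building the image, the same commands can be run against a scratch directory. The sketch below assumes GNU sed, and the storage.conf content is a made-up minimal fragment, not the file shipped in the image.

```shell
#!/usr/bin/env bash
# Replays the two Dockerfile RUN steps against a scratch directory.
# The storage.conf content below is a made-up minimal fragment, not the
# file shipped in the image. Assumes GNU sed (for -i and one-line "a").
root="$(mktemp -d)"
conf="$root/storage.conf"

cat > "$conf" <<'EOF'
[storage]
driver = "overlay"

[storage.options]
additionalimagestores = [
]

[storage.options.overlay]
#mount_program = "/usr/bin/fuse-overlayfs"
EOF

# Same expressions as in the Dockerfile: uncomment mount_program and
# append the shared store path after the additionalimagestores line.
sed -i -e 's|^#mount_program|mount_program|g' \
       -e '/additionalimage.*/a "/var/lib/shared",' "$conf"

# Create the empty store skeleton with its lock files.
shared="$root/var/lib/shared"
mkdir -p "$shared/overlay-images" "$shared/overlay-layers"
touch "$shared/overlay-images/images.lock" "$shared/overlay-layers/layers.lock"

# Show the edited configuration.
cat "$conf"
```

After the sed call, the mount_program line is no longer commented out and "/var/lib/shared", sits on the line directly after additionalimagestores, inside the array.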
If we go back to example two above, where we were able to take advantage of the host's containers/storage within the Buildah container, we get the best performance, because Podman/CRI-O might have previously pulled down the image. But we get the worst security because the container could write to the store. With additional stores, we can get the best of both worlds.
    # mkdir /var/lib/containers4
    # podman run -v ./build:/build:z -v /var/lib/containers/storage:/var/lib/shared:ro \
        -v /var/lib/containers4:/var/lib/containers:Z quay.io/buildah/stable \
        buildah bud -t image4 /build
    # podman run -v /var/lib/containers/storage:/var/lib/shared:ro \
        -v /var/lib/containers4:/var/lib/containers:Z quay.io/buildah/stable buildah push image4 \
        registry.company.com/myuser
    # rm -rf /var/lib/containers4
Notice how I mounted /var/lib/containers/storage from the host onto /var/lib/shared in the container read-only. When Buildah runs within the container, it can take advantage of any images previously pulled by Podman/CRI-O to speed things up, but it still can only write to its own storage. Note also that I can now do this with SELinux container separation enabled.
One potential issue with this is that you should not remove any images from the underlying storage. If you do, you might cause the Buildah container to blow up.
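The four setups described so far differ only in what is mounted where and how SELinux is configured, so a build script can choose between them with a single switch. The helper below is a hypothetical sketch: the function name and mode names are made up, and it uses the label=disable and label=level: spellings of the SELinux options from the podman-run documentation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: the function and the mode names are made up for this
# sketch. It prints the podman options for each storage setup.
# usage: storage_opts MODE [WRITABLE_DIR] [MCS_LEVEL]
storage_opts() {
    local mode="$1" dir="${2:-}" level="${3:-}"
    case "$mode" in
        isolated)   # private store, created and deleted per build
            [ -n "$dir" ] || return 2
            printf '%s\n' "-v $dir:/var/lib/containers:Z" ;;
        host)       # host store, SELinux separation disabled
            printf '%s\n' "-v /var/lib/containers:/var/lib/containers --security-opt label=disable" ;;
        project)    # store shared by builds that run with the same MCS level
            [ -n "$dir" ] && [ -n "$level" ] || return 2
            printf '%s\n' "--security-opt label=level:$level -v $dir:/var/lib/containers:Z" ;;
        additional) # host store read-only, private store writable
            [ -n "$dir" ] || return 2
            printf '%s\n' "-v /var/lib/containers/storage:/var/lib/shared:ro -v $dir:/var/lib/containers:Z" ;;
        *)
            echo "unknown mode: $mode" >&2
            return 2 ;;
    esac
}
```

For example, podman run $(storage_opts additional /var/lib/containers4) -v ./build:/build:z quay.io/buildah/stable buildah bud -t image4 /build would correspond to the additional-stores build above.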
But that’s not all...
Additional stores are even better than that. You can set up networked shared storage with all of your container images stored in it, and then share this storage with all of your Buildah containers. Imagine you had a hundred images that your CI/CD system used regularly for building container images. You could set up one host with all of the images pre-pulled into its storage. Then, you could use your favorite network storage tool (NFS, Gluster, Ceph, iSCSI, S3...) and share the storage with all of your Buildah or Kubernetes nodes.
Just volume-mount this networked storage into the Buildah containers at /var/lib/shared, and instantly your Buildah container no longer has to pull images at all. They are all pre-populated, and you are ready to roll.
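With networked storage, it is worth checking that the share is really mounted before handing it to a build. The sketch below tests for the overlay-images and overlay-layers directories that the Dockerfile creates for an overlay store; the helper name shared_store_opt is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical check: refuse to use a shared store directory that does not
# look like a containers/storage overlay store (for example, because the
# network share is not mounted).
shared_store_opt() {
    local dir="$1"
    if [ -d "$dir/overlay-images" ] && [ -d "$dir/overlay-layers" ]; then
        # Mount the store read-only at the path storage.conf points to.
        printf '%s\n' "-v $dir:/var/lib/shared:ro"
    else
        echo "error: $dir has no overlay-images/overlay-layers; is the share mounted?" >&2
        return 1
    fi
}
```

A launcher could call shared_store_opt with the local mount point of the share and abort the job if it returns non-zero.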
Of course, this could also be taken advantage of by your Kubernetes and container infrastructure to launch and run containers all over the place without having to pull the images at all. I could even imagine a container registry that, when receiving a container image pushed to it, would explode the container image onto the shared storage. Then, instantly, all of your nodes would have access to the updated image.
I have heard of huge multi-gigabyte container images. Using additional stores means you would no longer need to copy them around your environment, and your container startup times would be instantaneous.
In my next article, I will cover a new feature we are working on—overlay volume mounts—which will make building images even faster.
Running Buildah within a container in Kubernetes/CRI-O, Podman, or even Docker is easy to do, and it can be done much more securely than leaking in the Docker socket. We have added a lot of flexibility to the image to allow you to run it in different ways depending on your security and performance needs.
Additional stores can be used to help speed up or even eliminate the need to pull down container images.
We've also put together a demo to help illustrate the concepts discussed here.
Last updated: March 28, 2023