The need for speed and the kernel datapath – recent improvements in UDP packet processing

Networking hardware is becoming blazingly fast: 10 Gbps NICs are entry-level for server hardware, 100 Gbps cards are increasingly popular, and 200 Gbps cards are already surfacing. While the Linux kernel strives to cope with such speeds using large packets and all kinds of aggregation, ISPs are demanding much tougher workloads, with NFV and line-rate packet processing even for 64-byte packets.

Is everything lost, and are we all doomed to rely on some kernel-bypass solution? Possibly, but let’s first inspect the actual current status of packet processing in the kernel data path, with a look at the relevant history and the recent improvements.

We will focus on UDP packet reception: UDP flooding is a common tool for stressing the networking stack, since it allows arbitrarily small packets and defeats the packet aggregation (GRO) that is in place for other protocols.

Continue reading “The need for speed and the kernel datapath – recent improvements in UDP packet processing”

Towards Faster Ruby Hash Tables

Hash tables are an important part of dynamic programming languages. They are widely used because of their flexibility, and their performance matters for the overall performance of numerous programs. Ruby is no exception. In brief, Ruby hash tables provide the following API (a short usage sketch in Ruby follows the list):

  • insert an element with a given key if it is not yet in the table, or update the element’s value if it is
  • delete the element with a given key from the table
  • get the value of the element with a given key, if it is in the table
  • the shift operation (remove the earliest element inserted into the table)
  • traverse elements in their insertion order, calling a given function and, depending on its return value, either stop traversing, or delete the current element and continue traversing
  • get the first N keys or values (or all of them) of the elements in the table as an array
  • copy the table
  • clear the table
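
For readers less familiar with Ruby, here is a minimal sketch of how those operations look through Ruby’s core Hash class (the keys and values are arbitrary):

```ruby
h = {}                        # create a table
h[:a] = 1                     # insert an element with a new key
h[:a] = 2                     # update the value of an existing key
h[:b] = 3
h.delete(:a)                  # delete the element with a given key
h[:b]                         # => 3, get the value for a given key
h.shift                       # => [:b, 3], remove the earliest element
h[:c] = 4
h[:d] = 5
h.each { |k, v| break if v > 4 }  # traverse in insertion order
h.keys.first(1)               # => [:c], first N keys as an array
h.values                      # => [4, 5], all values as an array
g = h.dup                     # copy the table
h.clear                       # clear the table
```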

Continue reading “Towards Faster Ruby Hash Tables”

Docker project: Can you have overlay2 speed and density with devicemapper? Yep.

It’s been a while since our last deep dive into Docker project graph driver performance. Over two years, in fact! In that time, Red Hat engineers have made major strides in improving container storage.

All of that work has been in the name of providing enterprise-class stability, security, and supportability to our valued customers.

As discussed in our previous blog, there is a particular set of behaviors and attributes to take into account when choosing a graph driver, among them page cache sharing, POSIX compliance, and SELinux support.

Reviewing the technical differences between a union filesystem and the devicemapper graph driver as they relate to performance, standards compliance, and density, a union filesystem such as overlay2 is fast because:

  • It traverses less kernel and devicemapper code on container creation (devicemapper-backed containers get a unique kernel device allocated at startup).
  • Containers sharing the same base image start up faster because of a warm page cache.
  • For those speed and density benefits, you trade away POSIX compliance and SELinux support (well, not for long!).

There was no single graph driver that could give you all these attributes at the same time — until now.

How we can make devicemapper as fast as overlay2

With the industry move toward microservices, 12-factor guidelines, and dense multi-tenant platforms, many folks both inside Red Hat and in the community have been discussing read-only containers.  In fact, there has long been a --read-only option in both the Docker project and Kubernetes.  What this option does is create the container’s mount point as usual, but mount it read-only instead of read-write.  Read-only containers are an important security improvement as well, since they reduce the container’s attack surface.  More details on this can be found in a blog post from Dan Walsh last year.

When a container is launched in this mode, it can no longer write to locations it may expect to (e.g. /var/log) and may throw errors because of this.  As discussed in the Processes section of 12factor.net, re-architected applications should store stateful information (such as logs or web assets) in a stateful backing service.  Attaching a read-write persistent volume fulfills this design aspect: the container can be restarted anywhere in the cluster, and its persistent volume can follow it.
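
As a concrete sketch, such a pairing might look like the following Docker invocation (the image name and host path here are hypothetical; --read-only and -v are the standard flags):

```
docker run --read-only \
    -v /srv/app1/logs:/var/log:rw \
    myapp:latest
```

The container’s root filesystem is mounted read-only, while /var/log is backed by a writable volume that can follow the container around the cluster.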

In other words, for applications that are not completely stateless, an ideal deployment couples read-only containers with read-write persistent volumes.  This gets us to a place in the container world that the HPC (high-performance/scientific computing) world has occupied for decades: thousands of diskless, read-only, NFS-root-booted nodes that mount their necessary applications and storage over the network at boot time.  If a node dies…boot another.  If a container dies…start another.

Continue reading “Docker project: Can you have overlay2 speed and density with devicemapper? Yep.”

How to avoid wasting megabytes of memory a few bytes at a time

Maybe you have so much memory in your computer that you never have to worry about it — then again, maybe you find that some C or C++ application is using more memory than expected. This could be preventing you from running as many containers on a single system as you expected, it could be causing performance bottlenecks, and it could even be forcing you to pay for more memory in your servers.

You do some quick back-of-the-envelope calculations to estimate the amount of memory your application should be using, based on the size of each element in some key data structures and the number of those structures in each array. You think to yourself, “That doesn’t add up! Why is the application using so much more memory?” The reason it doesn’t add up is that you didn’t take into account the memory wasted by the layout of the data structures.
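
To make that concrete, here is a minimal C sketch (the struct and field names are made up) showing how alignment padding alone can inflate a structure by 50%:

```c
#include <stdio.h>

/* Poor field ordering: the compiler inserts padding to keep each
 * member naturally aligned. */
struct record_bad {
    char  flag;   /* 1 byte + 7 bytes of padding before the pointer */
    void *data;   /* 8 bytes */
    char  tag;    /* 1 byte + 3 bytes of padding before the int */
    int   count;  /* 4 bytes */
};                /* typically 24 bytes on a 64-bit system */

/* The same fields ordered largest-first need no internal padding. */
struct record_good {
    void *data;   /* 8 bytes */
    int   count;  /* 4 bytes */
    char  flag;   /* 1 byte */
    char  tag;    /* 1 byte + 2 bytes of tail padding */
};                /* typically 16 bytes on a 64-bit system */

int main(void)
{
    printf("bad: %zu bytes, good: %zu bytes\n",
           sizeof(struct record_bad), sizeof(struct record_good));
    return 0;
}
```

Multiply those 8 wasted bytes per element by a few million array elements, and the megabytes disappear a few bytes at a time.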

Continue reading “How to avoid wasting megabytes of memory a few bytes at a time”

“Don’t cross the streams”: Thread safety and memory accesses at the speed of light

The classic 1984 movie Ghostbusters offered an important safety tip for all of us:


“Don’t cross the streams.” – “Why not?” – “It would be bad.” – “I’m fuzzy on the whole good/bad thing. What do you mean, ‘bad’?” – “Try to imagine all life as you know it stopping instantaneously and every molecule in your body exploding at the speed of light.” – “Right. That’s bad. Okay. All right. Important safety tip. Thanks…”


Similarly, in computing, there are cases where data crossing through memory between different instruction streams has a similar effect on a software application: “all execution as we know it stopping instantaneously”.

This is due to the performance optimizations that both hardware and software implement to reorder and eliminate memory accesses. Ignoring these “memory access reordering” issues can result in extremely problematic debugging scenarios.

The “bug from hell” was a scenario in which OpenJDK’s parallel garbage collector very occasionally crashed because one thread’s write signaled that a data structure had been updated before the actual update writes (to the same data structure) had become visible; as a result, other threads would end up reading invalid values. We’re going to take a deeper look into this scenario to understand exactly what went on in this notorious issue.
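
As a minimal C sketch of the same pattern (the names are illustrative, not taken from the OpenJDK code), a writer publishes data behind a flag; without release/acquire ordering, the flag store can become visible before the data store, which is exactly the shape of the crash described above:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

int data = 0;               /* the structure being updated */
atomic_bool ready = false;  /* the flag announcing the update */

void *writer(void *arg)
{
    data = 42;  /* the actual update write comes first... */
    /* ...and the release store guarantees it becomes visible before
     * the flag does. With plain stores, the compiler or CPU may
     * reorder the two, letting readers see ready == true while data
     * still holds a stale value. */
    atomic_store_explicit(&ready, true, memory_order_release);
    return NULL;
}

void *reader(void *arg)
{
    /* The acquire load pairs with the writer's release store. */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;  /* spin until the writer signals */
    printf("data = %d\n", data);  /* prints 42, never a stale 0 */
    return NULL;
}

int main(void)
{
    pthread_t w, r;
    pthread_create(&r, NULL, reader, NULL);
    pthread_create(&w, NULL, writer, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return 0;
}
```

Compile with -pthread; replacing the release/acquire pair with relaxed stores and loads reintroduces the race on weakly ordered machines.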

Continue reading ““Don’t cross the streams”: Thread safety and memory accesses at the speed of light”
