Profiling NodeJS applications with Linux Performance Tools

Using Linux Perf Tools

The Performance Analysis Tool for Linux (perf) is a powerful tool for profiling applications. It uses a mix of hardware counters, which are fast, and software counters, all provided by the Linux Performance Counter (LPC) subsystem, which takes on the complex task of wrapping the CPU counters for the different types of CPUs. This gives you a very efficient way to get information about running processes, either through the subsystem's C API or, as in this case, through a convenient command (perf).

This command gives you access to a great variety of system-level and process-level events, but in this entry I will use it to investigate CPU-bound issues.



The need for speed and the kernel datapath – recent improvements in UDP packets processing

Networking hardware is becoming crazily fast: 10 Gbps NICs are entry-level for server hardware, 100 Gbps cards are increasingly popular, and 200 Gbps ones are already surfacing. While the Linux kernel strives to cope with such speeds using large packets and all kinds of aggregation, ISPs are requesting much more demanding workloads, with NFV and line-rate packet processing even for 64-byte packets.

Is everything lost, and are we all doomed to rely on some kernel bypass solution? Possibly, but let’s first inspect the current status of packet processing in the kernel data path, with a look back at the relevant history and the recent improvements.

We will focus on UDP packet reception: a UDP flood is a common tool for stressing the networking stack, since it allows arbitrarily small packets and defeats the packet aggregation (GRO) that is in place for other protocols.
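As a rough illustration of the kind of small-packet UDP stress described above, here is a minimal sketch that sends a burst of tiny datagrams over loopback and counts how many are delivered. The function name `udp_loopback_burst` and its parameters are hypothetical, chosen for this example only. The 18-byte default payload is an assumption: it is the payload size that would produce a minimum-size 64-byte Ethernet frame on a real wire (14 bytes of Ethernet header, 20 of IPv4, 8 of UDP, 4 of FCS), although loopback traffic carries no Ethernet framing.

```python
import socket


def udp_loopback_burst(count=100, payload_size=18, timeout=0.2):
    """Send `count` small UDP datagrams to a loopback socket and return
    how many of them the receiver was able to read."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Port 0 lets the kernel pick a free port for the receiver.
        receiver.bind(("127.0.0.1", 0))
        receiver.settimeout(timeout)
        address = receiver.getsockname()

        # Each datagram is sent on its own, so nothing is aggregated.
        payload = b"x" * payload_size
        for _ in range(count):
            sender.sendto(payload, address)

        # Drain the receive queue; stop once the socket stays silent.
        received = 0
        try:
            while received < count:
                receiver.recvfrom(2048)
                received += 1
        except socket.timeout:
            pass
        return received
    finally:
        sender.close()
        receiver.close()


if __name__ == "__main__":
    print("datagrams received:", udp_loopback_burst())
```

Because the whole burst is sent before the receiver starts reading, a large enough `count` can overflow the socket receive buffer, and the returned value then falls below `count`. A real benchmark would use dedicated traffic generators and separate hosts; this only shows the shape of the test.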



Towards Faster Ruby Hash Tables

Hash tables are an important part of dynamic programming languages. They are widely used because of their flexibility, and their performance is important for the overall performance of numerous programs. Ruby is not an exception. In brief, Ruby hash tables provide the following API:

  • insert an element with a given key if it is not yet in the table, or update the element's value if it is
  • delete the element with a given key from the table
  • get the value of the element with a given key if it is in the table
  • the shift operation (remove the earliest element inserted into the table)
  • traverse elements in their inclusion order, call a given function and depending on its return value, stop traversing or delete the current element and continue traversing
  • get the first N or all keys or values of elements in the table as an array
  • copy the table
  • clear the table
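The operations listed above can be sketched as a small class. This is only an illustration of the API's semantics, not Ruby's implementation: the class name `OrderedTable`, the method names, and the `CONTINUE`/`STOP`/`DELETE` return codes are all hypothetical, and the sketch leans on the fact that Python's built-in `dict` already preserves insertion order.

```python
from itertools import islice

# Hypothetical return codes for the traversal callback.
CONTINUE, STOP, DELETE = "continue", "stop", "delete"


class OrderedTable:
    """Insertion-ordered table exposing the operations listed above."""

    def __init__(self, items=()):
        self._entries = dict(items)

    def insert(self, key, value):
        # A new key goes to the end; an existing key keeps its position.
        self._entries[key] = value

    def delete(self, key):
        # Returns the removed value, or None if the key was absent.
        return self._entries.pop(key, None)

    def get(self, key, default=None):
        return self._entries.get(key, default)

    def shift(self):
        # Remove and return the earliest inserted (key, value) pair.
        if not self._entries:
            return None
        key = next(iter(self._entries))
        return key, self._entries.pop(key)

    def foreach(self, func):
        # Iterate over a snapshot so the callback may request deletion.
        for key, value in list(self._entries.items()):
            action = func(key, value)
            if action == STOP:
                break
            if action == DELETE:
                self._entries.pop(key, None)

    def keys(self, n=None):
        # First n keys, or all keys when n is None.
        return list(islice(self._entries, n))

    def values(self, n=None):
        return list(islice(self._entries.values(), n))

    def copy(self):
        return OrderedTable(self._entries)

    def clear(self):
        self._entries.clear()
```

For example, after inserting keys `"a"`, `"b"`, `"c"` and then updating `"a"`, `keys()` still lists `"a"` first, and `shift()` removes that same entry.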



Docker project: Can you have overlay2 speed and density with devicemapper? Yep.

It’s been a while since our last deep-dive into the Docker project graph driver performance.  Over two years, in fact!  In that time, Red Hat engineers have made major strides in improving container storage:

All of that, in the name of providing enterprise-class stability, security and supportability to our valued customers.

As discussed in our previous blog, there is a particular set of behaviors and attributes to take into account when choosing a graph driver.  Included in those are page cache sharing, POSIX compliance and SELinux support.

To review the technical differences between a union filesystem and the devicemapper graph driver as they relate to performance, standards compliance and density: a union filesystem such as overlay2 is fast because

  • It traverses less kernel and devicemapper code on container creation (devicemapper-backed containers get a unique kernel device allocated at startup).
  • Containers sharing the same base image start up faster because of a warm page cache.
  • For speed/density benefits, you trade POSIX compliance and SELinux (well, not for long!)

There was no single graph driver that could give you all these attributes at the same time — until now.

How we can make devicemapper as fast as overlay2

With the industry move towards microservices, 12-factor guidelines and dense multi-tenant platforms, many folks both inside Red Hat and in the community have been discussing read-only containers.  In fact, there’s been a --read-only option in both the Docker project and Kubernetes for a long time.  What this does is create a mount point as usual for the container, but mount it read-only as opposed to read-write.  Read-only containers are an important security improvement as well, as they reduce the container’s attack surface.  More details on this can be found in a blog post from Dan Walsh last year.

When a container is launched in this mode, it can no longer write to locations it may expect to (e.g. /var/log) and may throw errors because of this.  As discussed in the Processes section of the 12-factor guidelines, re-architected applications should store stateful information (such as logs or web assets) in a stateful backing service.  Attaching a persistent volume that is read-write fulfills this design aspect:  the container can be restarted anywhere in the cluster, and its persistent volume can follow it.

In other words, for applications that are not completely stateless, an ideal deployment would couple read-only containers with read-write persistent volumes.  This gets us to a place in the container world that the HPC (high performance/scientific computing) world has been at for decades:  thousands of diskless, read-only NFS-root booted nodes that mount their necessary applications and storage over the network at boot time.  No matter if a node dies…boot another.  No matter if a container dies…start another.



How to avoid wasting megabytes of memory a few bytes at a time

Maybe you have so much memory in your computer that you never have to worry about it — then again, maybe you find that some C or C++ application is using more memory than expected. This could be preventing you from running as many containers on a single system as you expected, it could be causing performance bottlenecks, and it could even be forcing you to pay for more memory in your servers.

You do some quick “back of the envelope” calculations to estimate the amount of memory your application should be using, based on the size of each element in some key data structures, and the number of those data structures in each array. You think to yourself, “That doesn’t add up! Why is the application using so much more memory?” The reason it doesn’t add up is that you didn’t take into account the memory space being wasted in the layout of the data structures.
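To make the effect concrete, here is a minimal sketch that uses Python's `ctypes` module to mirror how the platform's C compiler would lay out two structures holding the same three fields in different orders. The structure and field names (`Padded`, `Reordered`, `flag_a`, `value`, `flag_b`) and the helper functions are hypothetical, invented for this illustration.

```python
import ctypes


class Padded(ctypes.Structure):
    # Declaration order: 1-byte flag, 8-byte double, 1-byte flag.
    # The compiler typically inserts padding so the double is aligned.
    _fields_ = [
        ("flag_a", ctypes.c_char),
        ("value", ctypes.c_double),
        ("flag_b", ctypes.c_char),
    ]


class Reordered(ctypes.Structure):
    # Same fields, largest first, so the two flags share one padded slot.
    _fields_ = [
        ("value", ctypes.c_double),
        ("flag_a", ctypes.c_char),
        ("flag_b", ctypes.c_char),
    ]


def payload_bytes(struct_type):
    """Sum of the field sizes: the 'back of the envelope' estimate."""
    return sum(ctypes.sizeof(ftype) for _name, ftype in struct_type._fields_)


def wasted_bytes(struct_type, count=1):
    """Bytes lost to padding across `count` array elements."""
    per_element = ctypes.sizeof(struct_type) - payload_bytes(struct_type)
    return per_element * count


if __name__ == "__main__":
    for struct_type in (Padded, Reordered):
        print(
            struct_type.__name__,
            "size:", ctypes.sizeof(struct_type),
            "payload:", payload_bytes(struct_type),
            "padding per million elements:", wasted_bytes(struct_type, 1_000_000),
        )
```

Both structures carry 10 bytes of actual data. On a typical 64-bit Linux system the first layout occupies 24 bytes per element and the second 16, so an array of a million elements wastes roughly 14 MB in one ordering and 6 MB in the other. Exact figures depend on the platform's alignment rules.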
