Towards Faster Ruby Hash Tables

Hash tables are an important part of dynamic programming languages. They are widely used because of their flexibility, and their performance matters for the overall performance of numerous programs. Ruby is no exception. In brief, Ruby hash tables provide the following API (a rough C-level sketch follows the list):

  • insert an element with a given key if it is not yet in the table, or update the element's value if it is
  • delete an element with a given key from the table
  • get the value of an element with a given key if it is in the table
  • the shift operation (remove the earliest element inserted into the table)
  • traverse elements in their insertion order, calling a given function for each one and, depending on its return value, either stop traversing or delete the current element and continue
  • get the first N keys or values (or all of them) of the table's elements as an array
  • copy the table
  • clear the table
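
For concreteness, here is a minimal sketch of how this API is exercised at the C level through MRI's internal st.c interface. Treat it as illustrative: the exact signatures and header paths can differ slightly between Ruby versions.

```c
#include <stdio.h>
#include "ruby/st.h"   /* MRI's internal hash table interface */

/* Traversal callback: continue, stop, or delete the current entry. */
static int
print_entry(st_data_t key, st_data_t value, st_data_t arg)
{
    (void)arg;
    printf("%lu => %lu\n", (unsigned long)key, (unsigned long)value);
    return ST_CONTINUE;                  /* ST_STOP or ST_DELETE also possible */
}

void
example(void)
{
    st_table *tab = st_init_numtable();  /* table keyed by numbers        */
    st_data_t val;

    st_insert(tab, 1, 100);              /* insert or update              */
    st_insert(tab, 2, 200);
    if (st_lookup(tab, 1, &val))         /* get value by key              */
        printf("found %lu\n", (unsigned long)val);

    st_foreach(tab, print_entry, 0);     /* traverse in insertion order   */
    st_free_table(tab);
}
```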

Since Ruby 1.9, hash table traversal and the shift operation must follow the order in which elements were inserted; in other words, elements inserted earlier are traversed first. The current implementation of Ruby hash tables is based on an old package written by Peter Moore, with the modern MurmurHash and SipHash algorithms used for hashing Ruby values.

Moore’s original implementation has been changed a lot and the current Ruby hash tables have the following organization:

The typical Ruby hash table (very small hash tables have a different organization) has an array of bins, each bin containing a pointer to a list of table elements (entries, in Ruby hash table terminology). The calculated hash is translated into a table bin by a modulo operation, which is actually a mask operation, since the bin array size is a power of 2.

As you can see, entries have 3 pointers: two for a doubly linked list of entries in their order of inclusion and one for a list of entries corresponding to the same bin (in other words, for colliding entries). Using a singly linked list for collisions saves memory but makes deleting arbitrary entries during traversal a non-trivial task. So the actual Ruby hash table API is even more complicated than what I described.
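
A simplified sketch of this chained layout (field names are illustrative rather than the exact MRI ones) makes the three pointers per entry visible:

```c
/* Simplified sketch of the original (chained) layout; illustrative names. */
typedef unsigned long st_data_t;

struct entry {
    unsigned long hash;      /* cached hash of the key                        */
    st_data_t key;
    st_data_t record;        /* the value                                     */
    struct entry *next;      /* next entry in the same bin (collision chain)  */
    struct entry *fore;      /* next entry in insertion order                 */
    struct entry *back;      /* previous entry in insertion order             */
};

struct table {
    struct entry **bins;     /* bins[hash & mask] -> collision chain          */
    struct entry *head;      /* first inserted entry                          */
    struct entry *tail;      /* last inserted entry                           */
    unsigned long num_bins;  /* always a power of 2                           */
    unsigned long num_entries;
};
```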

When a Ruby hash table’s load factor becomes too big, a new, larger bin array is created.

Modern processors have several levels of cache. Usually, the CPU reads one or a few cache lines at a time from memory (or from another cache level), so it is much faster at reading data stored close together. The current implementation of Ruby hash tables does not fit well with modern processor cache organization, which requires better data locality for faster program speed.

So, how can we improve data locality?

First, we can remove the collision lists by switching to an open addressing hash table, where each bin corresponds to at most one entry. Although open addressing needs more bins than the original separate-chaining hash tables, we can afford this because we also remove one pointer per entry by making the switch. As a result, we improve data locality when processing entries by their key.

The tendency has been to move from chaining hash tables to open addressing hash tables because of their better fit with modern CPU memory organizations. CPython recently made this switch. GCC has widely used such hash tables internally for more than 15 years.

To improve data locality for processing entries in their inclusion order, we can remove the doubly linked list and put the entries into an array. This removes two more pointers, making the new entry exactly half the size of the original one, which further shrinks the entry footprint and improves data locality. Removing the lists also lets us avoid pointer chasing, a common cause of poor data locality.
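
With the lists removed, an entry boils down to just the hash, the key and the value. A rough sketch (the names approximate the new st.c, but treat them as illustrative):

```c
/* Rough sketch of an entry in the proposed array-based design. */
typedef unsigned long st_data_t;
typedef unsigned long st_hash_t;

typedef struct {
    st_hash_t hash;    /* cached hash; a reserved value marks a deleted entry */
    st_data_t key;
    st_data_t record;  /* the value */
} st_table_entry;
```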

So our proposed Ruby hash table has the following organization:

The entry array contains table entries in the same order as they were inserted. When the first entry is deleted, a variable containing the index of the current first entry (start) is incremented. In all other cases of deletion, we just mark the entry as deleted by using a reserved hash value.

Bins provide access to entries by their keys. The key hash is mapped to a bin containing the index of the corresponding entry in the entry array (using an index instead of a pointer can save memory when we use 8-, 16- or 32-bit indices for reasonably sized tables on 64-bit machines).
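
Putting the pieces together, the overall table might be sketched as follows. This is a simplification of the actual st_table, and the field names are approximations; in the real implementation the bins hold 8-, 16-, 32- or 64-bit indices depending on the table size.

```c
/* Simplified sketch of the proposed table layout; entry type as above. */
struct table {
    st_table_entry *entries;     /* entries kept in insertion order           */
    unsigned long entries_start; /* index of the first live entry ("start")   */
    unsigned long entries_bound; /* index just past the last inserted entry   */
    unsigned long allocated;     /* capacity of the entries array             */

    unsigned long *bins;         /* each bin holds an entry index; two        */
                                 /* reserved values mark EMPTY and DELETED    */
                                 /* bins; length is a power of two, twice the */
                                 /* entries array length                      */
    unsigned long bins_mask;     /* (number of bins) - 1, for hash masking    */
};
```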

The size of the bin array is always a power of two, which makes the mapping very fast: we just use the corresponding lower bits of the hash. Generally, it is not a good idea to ignore part of the hash, but the alternative approach is worse. For example, we could use a modulo operation for the mapping and a prime number for the size of the bin array. Unfortunately, the modulo operation for large 64-bit numbers is extremely slow (it can take more than 100 cycles on modern Intel CPUs).

Still, the remaining bits of the hash value can be used when the mapping results in a collision. That is also an improvement over the original tables. We use a secondary hash function analogous to the one in the new CPython hash table implementation. The secondary hash is a function of the current bin index and the original hash value. This choice guarantees that we can traverse all bins and eventually find the corresponding bin, because after several iterations the function becomes X_next = (5 * X_prev + 1) mod pow_of_2, which is a full-cycle linear congruential generator, since it satisfies the requirements of the Hull-Dobell theorem. By the way, I suspect that the function lookdict in CPython might cycle in extremely rare cases, as X_prev in that function can be larger than the size of the table.
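
A hedged sketch of this probe sequence, building on the table sketch above, looks roughly like the following. The constants and the helpers EMPTY_BIN, DELETED_BIN and keys_equal are hypothetical; the real code in st.c differs in details such as the perturb shift amount.

```c
/* Illustrative open-addressing lookup with a CPython-style probe sequence. */
static long
find_bin(const struct table *tab, st_hash_t hash, st_data_t key)
{
    unsigned long mask = tab->bins_mask;
    unsigned long ind = hash & mask;      /* primary mapping: low hash bits  */
    unsigned long perturb = hash;         /* remaining bits feed the probing */

    for (;;) {
        unsigned long entry_index = tab->bins[ind];

        if (entry_index == EMPTY_BIN)
            return -1;                    /* key is not in the table         */
        if (entry_index != DELETED_BIN) {
            st_table_entry *e = &tab->entries[entry_index];
            if (e->hash == hash && keys_equal(e->key, key))
                return (long) entry_index;
        }
        /* Secondary hash: mixes in more hash bits; once perturb reaches 0
         * this degenerates to ind = 5*ind + 1 (mod 2^k), a full-cycle LCG. */
        perturb >>= 5;
        ind = (ind * 5 + perturb + 1) & mask;
    }
}
```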

When an entry is removed from the table, besides marking the slot as deleted with a special hash value, we also mark its bin with a special value so that entries which collided with the removed entry can still be found.

There are two reserved values for bins. One denotes an empty bin, while the other one denotes a bin containing a deleted entry.

The length of the bin array is always twice the entry array length, which maintains a healthy table load factor. To improve the table memory footprint, we use 8-, 16- or 32-bit indices on a 64-bit system for smaller tables. Since the bins are a smaller part of the overall table footprint for small tables, we could even double the bin array again to decrease the load factor and the number of collisions further, but this approach is not currently used.
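
For example, the index width could be chosen from the entry array capacity along these lines (a sketch, not the exact st.c logic):

```c
/* Sketch: pick the smallest index width that can address all entries
 * (leaving room for the two reserved bin values). */
static size_t
bin_index_size(unsigned long allocated_entries)
{
    if (allocated_entries < (1UL << 7))  return 1;   /* 8-bit indices  */
    if (allocated_entries < (1UL << 15)) return 2;   /* 16-bit indices */
    if (allocated_entries < (1UL << 31)) return 4;   /* 32-bit indices */
    return 8;                                        /* 64-bit indices */
}
```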

The table is rebuilt whenever we cannot insert an entry at the end of the entry array. Unlike the original hash tables, the new arrays can be smaller after rebuilding; this can happen when many entries have been removed.

Table rebuilding is done by creating new entry and bin arrays of the appropriate size. We could try to reuse the arrays in some cases by moving non-deleted entries to the start of the array, but this makes little sense: the most expensive part is moving the entries, and such an approach only complicates the implementation.

Analysis of hash table usage in Ruby on Rails done by Koichi Sasada shows that the majority of hash tables are small. For very small tables (fewer than 5 entries) we avoid allocating bins at all and use a linear search instead. This saves memory and makes key-based access faster for such tables, because we no longer spend time maintaining and consulting the bins.
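
For such tables the lookup path is just a scan of the entry array. A sketch, reusing the names from the table sketch above (small_table_find and keys_equal are illustrative):

```c
/* Sketch: key lookup in a very small table that has no bins allocated. */
static long
small_table_find(const struct table *tab, st_hash_t hash, st_data_t key)
{
    unsigned long i;

    for (i = tab->entries_start; i < tab->entries_bound; i++) {
        st_table_entry *e = &tab->entries[i];
        /* Deleted entries carry a reserved hash value and never match. */
        if (e->hash == hash && keys_equal(e->key, key))
            return (long) i;
    }
    return -1;   /* not found */
}
```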

The original Ruby hash tables use the same approach for tables of fewer than 7 entries, but their implementation is more complicated because it needs a different table representation.

Here is the performance comparison of original and proposed hash tables on 27 MRI hash table benchmarks. We use two major targets for this: x86-64 and ARM.

The average performance improvement on Intel Haswell for the proposed hash tables is about 46%. On ARM, it is even bigger: about 59%. The proposed implementation also consumes less memory on average than the old one. The graph below shows memory consumption in bytes on x86-64 depending on the number of entries in the hash table.

The proposed implementation is also more straightforward. It would permit simplifying the Ruby hash table API in the future by avoiding the original hash tables' problems with deleting entries during traversal.

Although the new design seems simple, achieving acceptable results required a lot of testing of different variants of the implementation. That is how performance work is usually done.

As an example of such work, one variant of the new hash table implementation had a faster solution to the problem of possible denial-of-service attacks based on hash collisions:

  • When a new table entry was inserted, we simply counted collisions with entries having the same hash (more accurately, the same part of the hash) but different keys, and when some threshold was reached, we rebuilt the table and started to use a crypto-level hash function. In practice, the crypto-level function would never be used unless someone actually attempted a denial-of-service attack (see the sketch after this list).
  • Such an approach permitted the use of faster non-crypto-level hash functions in the majority of cases. It would also make it easy to switch to other, slower crypto-level functions (without losing speed in real-world scenarios), e.g. SHA-2 or SHA-3. The cryptographic function SipHash currently used in MRI is a fairly new function and is not as time-tested as older ones.
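
The idea can be sketched as follows. This variant never shipped, and every name below (the crypto_hashing flag, fast_hash, crypto_hash, find_or_reserve_bin, rebuild_table, store_entry, HIT_THRESHOLD) is a hypothetical illustration, not MRI code:

```c
/* Conceptual sketch of the collision-counting variant. */
enum { HIT_THRESHOLD = 32 };           /* rebuild after this many collisions */

static void
insert_with_collision_check(struct table *tab, st_data_t key, st_data_t value)
{
    st_hash_t hash = tab->crypto_hashing ? crypto_hash(key) : fast_hash(key);
    unsigned long collisions = 0;

    /* Probe as usual, counting bins already occupied by a different key
     * with the same (partial) hash value. */
    long bin = find_or_reserve_bin(tab, hash, key, &collisions);

    if (!tab->crypto_hashing && collisions > HIT_THRESHOLD) {
        tab->crypto_hashing = 1;       /* switch to a crypto-level hash      */
        rebuild_table(tab);            /* one-time rehash of all entries     */
        hash = crypto_hash(key);
        bin = find_or_reserve_bin(tab, hash, key, &collisions);
    }
    store_entry(tab, bin, hash, key, value);
}
```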

Although it is not documented, Ruby's hash method should match the hash function used inside the hash tables, and its value for an object should stay the same during one MRI run. Therefore, this approach did not survive into the final variant of the hash tables.

It took 8 months from submitting the initial code to committing it into the MRI repository (please see issue 12142). The new hash table implementation is now part of MRI 2.4.

The hash table implementation discussion contains more than 170 posts. Reading them can be quite educational and makes a good case study of the aspects and issues of open source development.

On a side note, I’d like to thank my managers for permission to invest part of my time to work on Ruby performance in addition to my major responsibilities.


 



Unlock your Red Hat JBoss Data Grid data with Red Hat JBoss Data Virtualization

Welcome to another episode of the series: “Unlock your Red Hat JBoss Data Grid (JDG) data with Red Hat JBoss Data Virtualization (JDV).”

This post will guide you through an example of connecting to a Red Hat JBoss Data Grid data source using Teiid Designer. In this example, we will demonstrate connecting to a local JDG data source. We're using JDG 6.6.1, but you can connect to any local or remote JDG 6.6.1 source using the same steps.

Continue reading “Unlock your Red Hat JBoss Data Grid data with Red Hat JBoss Data Virtualization”



Programmatic Debugging: Part 1 the challenge

As every developer knows, debugging an application can be difficult, and often enough you spend as much time or more debugging an application as you did writing it. Every programmer develops their own collection of tools and techniques. Traditionally these have included full-fledged debuggers, instrumentation of the code, and tracing and logging. Each of these has its particular strengths and weaknesses.

Continue reading “Programmatic Debugging: Part 1 the challenge”



Docker project: Can you have overlay2 speed and density with devicemapper? Yep.

It’s been a while since our last deep-dive into the Docker project graph driver performance.  Over two years, in fact!  In that time, Red Hat engineers have made major strides in improving container storage:

All of that, in the name of providing enterprise-class stability, security and supportability to our valued customers.

As discussed in our previous blog, there are a particular set of behaviors and attributes to take into account when choosing a graph driver.  Included in those are page cache sharing, POSIX compliance and SELinux support.

Reviewing the technical differences between a union filesystem and the devicemapper graph driver as they relate to performance, standards compliance and density, a union filesystem such as overlay2 is fast because:

  • It traverses less kernel and devicemapper code on container creation (devicemapper-backed containers get a unique kernel device allocated at startup).
  • Containers sharing the same base image start up faster because of a warm page cache.
  • For these speed/density benefits, you trade POSIX compliance and SELinux support (well, not for long!).

There was no single graph driver that could give you all these attributes at the same time — until now.

How we can make devicemapper as fast as overlay2

With the industry move towards microservices, 12-factor guidelines and dense multi-tenant platforms, many folks both inside Red Hat and in the community have been discussing read-only containers.  In fact, there has been a --read-only option in both the Docker project and Kubernetes for a long time.  What this does is create a mount point as usual for the container, but mount it read-only as opposed to read-write.  Read-only containers are also an important security improvement, as they reduce the container’s attack surface.  More details on this can be found in a blog post from Dan Walsh last year.

When a container is launched in this mode, it can no longer write to locations it may expect to (e.g. /var/log) and may throw errors because of this.  As discussed in the Processes section of 12factor.net, re-architected applications should store stateful information (such as logs or web assets) in a stateful backing service.  Attaching a persistent volume that is read-write fulfills this design aspect:  the container can be restarted anywhere in the cluster, and its persistent volume can follow it.

In other words, for applications that are not completely stateless an ideal deployment would be to couple read-only containers with read-write persistent volumes.  This gets us to a place in the container world that the HPC (high performance/scientific computing) world has been at for decades:  thousands of diskless, read-only NFS-root booted nodes that mount their necessary applications and storage over the network at boot time.  No matter if a node dies…boot another.  No matter if a container dies…start another.

Continue reading “Docker project: Can you have overlay2 speed and density with devicemapper? Yep.”



Lightweight Application Instrumentation with PCP

Wait… what?

I was involved in diagnosing a production system performance problem: a web application serving thousands of interactive users was acting up.  Symptoms included significant time running kernel code on behalf of the application (unexpectedly), and at those times substantial delays were observed by end users.

As someone with a systems programming background, I figured I had a decent shot at figuring this one out. Naively I reached for strace(1), the system call and signal tracer, to provide insights (this was long before perf(1) came along, in my defence).

Firing up strace, however, things rapidly went from bad to oh-so-much-worse, with the application becoming single threaded and almost entirely locking up under ptrace(2) control. Nothing was able to return responsiveness once that flat spin had been induced. Sadly an unscheduled downtime resulted, and I wandered off to lick my wounds, wondering what on earth just happened.

Why?

Without going into the details of what actually happened, nor the weird and wonderful things that are going on under the hood inside strace – suffice to say this was a pathological scenario and strace was certainly the wrong tool for the job. Hindsight is 20/20!

However, lesson learned – and it’s not only strace of course – there are many analysis tools which take the behavior modifying approach of “switch on special/new code paths, export lots of special/new diagnostics” that can make production system failure situations far, far worse.

The kernel and many system services provide a wealth of always-enabled instrumentation, and in my experience it provides a good return on investment when business-critical applications do the same. Knowing that counters, gauges and other measures are always there, always updated, and – ideally – always being sampled and recorded builds high levels of confidence in their safety, at acceptable (known, fixed, low) costs.

How?

There are many different projects and APIs for instrumenting applications, with a variety of different design goals, trade-offs and overheads. Many articles have been devoted to the sorts of things worth instrumenting within an application, so let's skip over that (extremely important!) topic here and instead focus on the underlying mechanisms.

One thing to note first up is that all the approaches require some form of inter-process communication mechanism, to get the metric values out of the application address space and into the monitoring tools – this can involve varying degrees of memory copying, context switching, synchronization and various other forms of impact on the running application.

In the Performance Co-Pilot (pcp.io) toolkit the MMV – “Memory Mapped Value” – approach tackles this issue of providing low-cost, lightweight metric value extraction from running applications.

The approach is built around shared memory, where the application registers metrics and is assigned fixed memory locations for the safe manipulation of each metric value. The application is then left to update each in-memory value according to its needs and the semantics of each metric.

The memory locations are allocated, and fixed, in such a way that they can also be safely accessed by separate (collector, monitoring and/or analysis) processes. Naturally, the overheads of actually counting events, accumulating byte counts, gauging utilization and so on cannot be removed, but the goal is to make that the only cost incurred.

[Diagram: Application instrumentation via shared memory mappings]

In the MMV model, at the points where metrics are updated, the only cost involved is the memory mapping update, which is a single memory store operation. There is no need to explicitly transfer control to any other thread or process, nor to allocate memory, nor to make system or library calls. The external PCP sampling process(es) will only sample values at times driven by those tools, placing no overhead on the instrumented application.
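
The mechanism is essentially this (a conceptual sketch of the shared-memory idea, not the actual PCP MMV API; the metrics file path, struct layout and helper names are assumptions for illustration):

```c
/* Conceptual sketch of memory-mapped instrumentation: the application maps
 * a file shared with the collector and bumps counters with plain stores. */
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

struct metrics {
    uint64_t requests_total;   /* counter: monotonically increasing */
    uint64_t bytes_sent;       /* counter                           */
    uint64_t active_sessions;  /* gauge: goes up and down           */
};

static struct metrics *
map_metrics(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, sizeof(struct metrics)) < 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, sizeof(struct metrics),
                   PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                 /* the mapping stays valid after close */
    return p == MAP_FAILED ? NULL : (struct metrics *)p;
}

/* On the hot path, an update is just a memory store: */
static inline void
count_request(struct metrics *m, uint64_t bytes)
{
    m->requests_total += 1;
    m->bytes_sent += bytes;
}
```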

The other good news is the MMV approach scales well as metrics are added; applications with many hundreds of metrics are able to update values with the same overheads as lightly instrumented applications.

On the other hand, to attain this level of performance there are trade-offs being made. It's assumed that always-enabled sampling is the analysis model (so this technique is not suited to event tracing, which is more the domain of complementary approaches like dtrace, ETW, LTTng and systemtap), and it is not suited to compound data structures. But for the kinds of performance values we're looking at here, where each metric is usually an independent numeric value, this proves to be a worthwhile trade-off in practice for always-enabled instrumentation.

Where?  When?

All Red Hat Enterprise Linux releases since 6.6 onward include MMV as an instrumentation approach you can use.  Sample instrumented application code is available in the pcp-devel package.

The service involved with reading the memory mappings is pmcd(1) and its pmdammv(1) shared library helper.  Many PCP tools exist that will record, visualize, infer and report on your new application metrics.

High-level language projects that generate MMV mappings natively (Speed for Golang, and Parfait for Java) are also available from Github and Maven Central.



JBoss EAP 7 Domain deployments – Part 2: Domain deployments through the EAP 7.0 Management Console

In this blog series we will present several ways to deploy an application on an EAP Domain. The series consists of 5 parts. Each one will be a standalone article, but the series as a whole will present a range of useful topics for working with JBoss EAP.

  • Part 1: Setup a simple EAP 7.0 Domain.
  • Part 2: Domain deployments through the new EAP 7.0 Management Console (this article)
  • Part 3: Introduction to DMR (Dynamic Model Representation) and domain deployments from the Command Line Interface (CLI).
  • Part 4: Domain deployment from the REST Management API.
  • Part 5: Manage EAP 6 Hosts from EAP 7.0 domain

In part 1 of this series on JBoss EAP 7 Domain deployments, we set up a simple EAP 7.0 domain with three hosts:

Review the domain Configuration

The domain consists of the domain controller host0 and two slave hosts running several EAP 7.0 instances.

[Diagram: JBoss EAP simple domain]

In the following tutorial we are going to see how to deploy an application on a JBoss EAP domain using the new EAP 7.0 Management Console.

Continue reading “JBoss EAP 7 Domain deployments – Part 2: Domain deployments through the EAP 7.0 Management Console”



The Benefits of Red Hat Enterprise Linux for Real Time

To deliver the best predictability for real-time workloads, Red Hat Enterprise Linux for Real Time provides state-of-the-art determinism on the bullet-proof RHEL (Red Hat Enterprise Linux) platform. The availability of this product raises some questions, like: Do I need a real-time operating system? What are the benefits and drawbacks of running RHEL for Real Time? This article aims to clarify how a real-time operating system can benefit your business, and what kinds of workloads or industries can benefit from RT.

Introduction

Linux is the best choice for High-Performance Computing (HPC) due to the years of Linux kernel optimization focused on delivering high average throughput for a vast number of different workloads. Being optimized for throughput means that the algorithms for processing data are geared towards processing the most amount of data in the least amount of time. Examples of throughput-oriented operations are transferring megabytes/second over a network connection or the amount of data read from or written to a storage medium. These optimizations are the basis for the success of RHEL on servers and HPC environments.

Nevertheless, these optimizations for high throughput can cause drawbacks for other specific workloads. For example, the RHEL kernel uses a busy-wait loop approach to avoid the scheduling overhead of some mutual exclusion methods, like spin locks and read/write locks. While busy waiting to enter a critical section of code, the waiting task delays the scheduling of other, potentially higher-priority tasks on the same CPU. As a result, the higher-priority task cannot be scheduled and executed until the waiting task has completed, causing a delay in the high-priority task's response.

Although this delay is acceptable for the majority of common workloads, it is not acceptable for the class of tasks where correctness depends on meeting timing deadlines. This class of tasks, often classified as real-time tasks, has strict timing constraints where an answer must be delivered within a certain time period, and a late answer means it is wrong or fails.

For example, processing a 30 frames per second video requires the ability to deliver one frame every 33 milliseconds. If the system fails to deliver a frame every 33 milliseconds, the video processing will be not only late but also wrong. It is natural to think, then, that real-time means delivering a quick response to an event, and to assume that real-time can be achieved only by making the system run faster. This assumption is a misconception, however. For instance, if the above-mentioned system can run fast enough to deliver a frame every 16.6 ms (60 frames per second), the video will be reproduced twice as fast. A faster response is not the expected behavior for processing this video, so the system will deliver not only early results but also wrong results. Hence, real-time systems are those that deliver predictable timing behavior instead of just trying to deliver faster results.

To provide an enterprise environment for real-time workloads on Linux, Red Hat provides the RHEL for Real Time product. RHEL for Real Time is composed of the RHEL kernel optimized for determinism, along with a set of integrated tuning tools, to provide state-of-the-art determinism on Linux. The deterministic timing behavior depends both on the application's own determinism, using its own algorithms, and on the Linux kernel's determinism in managing the system's shared resources.

Continue reading “The Benefits of Red Hat Enterprise Linux for Real Time”



DevNation Live Blog: Analyzing Java applications using Thermostat

Omair Majid, a Red Hat Senior Software Engineer, addressed the perennial issue of performance on the Java Virtual Machine. Performance issues originating in the OS, CPU, memory, and I/O plague modern systems and present complex problems to developers, so the Thermostat tool focuses on alleviating and easing serviceability while enhancing monitoring of the JVM.

After highlighting the many sources of performance information (CPU usage, memory regions, GC, classloading, JMX, JIT, I/O calls, threading, etc.) and the many ways to collect that information (heap dumps, invoking GC, deadlock detection, injecting custom code), Omair introduced the Thermostat project. Thermostat is a serviceability and monitoring tool for OpenJDK that integrates horizontally (across hosts/containers/nodes) and vertically (JVM/app server/middleware libraries/custom app code).

Thermostat can be run via the CLI or via a provided GUI, and these two interfaces have feature parity, so you will not have to choose one or the other because of missing features. Thermostat even has a rich plugin architecture that allows you to add your own features to the platform. Plugins provide a way to extend Thermostat to monitor domain-specific metrics, so instead of general information like object allocations, you can collect more relevant information, like how many emails are being sent or how long your transactions are taking. By using the custom plugin architecture in Thermostat you can extend its capabilities to fit your needs.

Thermostat can be run with different configurations, the two basic ones being local and shell. Just running thermostat local auto-detects JVMs running on your local machine, and you get some nice host-level metrics right away:

[Screenshot: Thermostat host-level metrics]

It will also break down the memory sections for fine-tuned garbage collection monitoring and debugging:

[Screenshot: Thermostat memory sections and GC view]

Omair then went on to live-debug garbage collection issues, memory leaks, multi-threading blocks, slow response times and more, all using Thermostat. After seeing how Thermostat can be used to easily monitor low-level operations and performance of your JVM, I know I will be using it the next time the JVM gives me trouble. Check out some of the latest features, such as a heap tree map, more profiling, GC details and tab completion, as well as future extensions such as Byteman integration and easier development of custom plugins. Contributions welcome 🙂

Thermostat is available via Red Hat Software Collections (RHSCL).



DevNation 2016: William Cohen on “Optimizing code for modern processors”

DevNation sneak peek is a behind-the-scenes preview of sessions and information that will take place at DevNation 2016. Sign up for DevNation at www.devnation.org. Learn more. Code more. Share more. Join the Nation.

The difference between processor peak performance and actual delivered performance can be very large. As such, the goal of my DevNation 2016 presentation is to talk about the hardware mechanisms that modern processors use to provide that incredible peak performance, such as caches, branch prediction, and superscalar execution. I'll show how developers might write code to better exploit those mechanisms, and how to identify problem areas in code.

The presentation isn’t going to instantly make you an expert in optimizing code.  However, it provides a good starting point for developers that want their code to run more quickly and efficiently on modern processors.

Performance is a topic that many people care about. I am happy that I am able to help people better understand performance issues by giving presentations and talking with people at DevNation and Red Hat Summit.

Additionally, last year Joe Mario organized a performance expert Birds-of-a-Feather session that let attendees talk to people with expertise in a number of different performance areas. This year there is a similar session scheduled for Thursday evening that I am really looking forward to. So many knowledgeable people in one place!

Tuesday
10:15 a.m. to 11:15 a.m.
Room 133
About the presenter:

William Cohen has been a developer of performance tools at Red Hat for the past decade (and beyond) and has worked on a number of the performance tools in Red Hat Enterprise Linux and Fedora — tools such as OProfile, PAPI, SystemTap, and Dyninst.



How to avoid wasting megabytes of memory a few bytes at a time

Maybe you have so much memory in your computer that you never have to worry about it — then again, maybe you find that some C or C++ application is using more memory than expected. This could be preventing you from running as many containers on a single system as you expected, it could be causing performance bottlenecks, and it could even be forcing you to pay for more memory in your servers.

You do some quick “back of the envelope” calculations to estimate the amount of memory your application should be using, based on the size of each element in some key data structures, and the number of those data structures in each array. You think to yourself, “That doesn’t add up! Why is the application using so much more memory?” The reason it doesn’t add up is that you didn’t take into account the memory space being wasted in the layout of the data structures.
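
For instance, here is a small illustration of the kind of padding the article goes on to discuss: merely reordering fields can shrink a structure.

```c
/* Field order determines padding. On a typical 64-bit ABI: */
#include <stdio.h>

struct wasteful {          /* sizeof == 24 on most 64-bit ABIs   */
    char   flag;           /* 1 byte + 7 bytes padding           */
    double value;          /* 8 bytes, must be 8-byte aligned    */
    int    count;          /* 4 bytes + 4 bytes trailing padding */
};

struct compact {           /* sizeof == 16 on most 64-bit ABIs   */
    double value;          /* largest/most aligned member first  */
    int    count;
    char   flag;           /* only 3 bytes of trailing padding   */
};

int main(void)
{
    printf("wasteful: %zu bytes, compact: %zu bytes\n",
           sizeof(struct wasteful), sizeof(struct compact));
    return 0;
}
```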

Continue reading “How to avoid wasting megabytes of memory a few bytes at a time”

