Performance

Speed up SystemTap scripts with statistical aggregates

A common question that SystemTap can answer is how often particular events occur on the system. Some events, such as system calls, can happen frequently, so the goal is to make the SystemTap script as efficient as possible.

Using statistical aggregates in place of regular integers is one way to improve the performance of SystemTap scripts. Statistical aggregates record data on a per-CPU basis, reducing the amount of coordination required between processors and allowing information to be recorded with less overhead. In this article, I’ll show an example of how to reduce overhead in SystemTap scripts.
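For a concrete sense of what this looks like, here is a minimal sketch (the probe point and variable names are my own, not necessarily the article’s) that counts read() calls with the <<< aggregation operator instead of incrementing a shared integer:

```
# Count read() system calls per process name with a statistical
# aggregate; <<< records each value per CPU, avoiding a global lock.
global reads

probe syscall.read {
    reads[execname()] <<< 1
}

probe end {
    foreach (name in reads+) {
        printf("%s: %d reads\n", name, @count(reads[name]))
    }
}
```

Because each CPU records into its own copy, the per-CPU values are only merged when the script reads them back with @count at the end.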

Continue reading “Speed up SystemTap scripts with statistical aggregates”

Speed up SystemTap script monitoring of system calls

SystemTap has extensive libraries called tapsets that allow developers to instrument various aspects of the kernel’s operation. SystemTap allows the use of wildcards to instrument multiple locations in particular subsystems. SystemTap has to perform a significant amount of work to create instrumentation for each of the places being probed. This overhead is particularly apparent when using wildcards for the system call tapset, which contains hundreds of entries (syscall.* and syscall.*.return). For some subsets of data collection, replacing the wildcard-matched syscall probes in SystemTap scripts with the kernel.trace("sys_enter") and kernel.trace("sys_exit") probes will produce smaller instrumentation modules that compile and start up more quickly. In this article, I’ll show a few examples of how this works.
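As a rough sketch of the idea (assuming the sys_enter tracepoint exposes the system call number as $id, which may vary by kernel version), a single tracepoint probe can count all system calls without expanding hundreds of syscall.* probes:

```
# One tracepoint probe replaces hundreds of syscall.* probes.
global counts

probe kernel.trace("sys_enter") {
    counts[$id] <<< 1
}

probe end {
    # Print the ten most frequent system call numbers.
    foreach (id in counts- limit 10) {
        printf("syscall %d: %d calls\n", id, @count(counts[id]))
    }
}
```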

Continue reading “Speed up SystemTap script monitoring of system calls”

How data layout affects memory performance

The mental model most people have of how computer memory (aka Random Access Memory or RAM) operates is inaccurate. The assumption that any access to any byte in memory has the same low cost does not hold on modern processors. In this article, I’ll explain what developers need to know about modern memory and how data layout can affect performance.

Current memory is starting to look more like an extremely fast block storage device. Rather than reading or writing individual bytes, the processor reads or writes groups of bytes that fill a cache line (commonly 32 to 128 bytes in size). An access to memory requires well over a hundred clock cycles, two orders of magnitude slower than executing an instruction on the processor. Thus, programmers who want better performance might reconsider the data structures their programs use.
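As a minimal sketch of the effect (the array size and layout are chosen arbitrarily for illustration), the two loops below touch exactly the same elements, but the first walks memory sequentially, one cache line after another, while the second strides across cache lines and incurs far more misses:

```c
#include <stdio.h>

#define N 4096

static int grid[N][N];   /* C stores rows contiguously (row-major order) */

int main(void)
{
    long sum = 0;

    /* Cache friendly: consecutive accesses fall in the same cache line. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += grid[i][j];

    /* Cache hostile: each access jumps N * sizeof(int) bytes ahead,
     * touching a new cache line almost every time. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += grid[i][j];

    printf("%ld\n", sum);
    return 0;
}
```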

Continue reading “How data layout affects memory performance”

Register Transfer Language for CRuby

For the last two years, I have been trying to improve CRuby performance. I have been working simultaneously on two major fronts: introducing register transfer language (RTL) for the CRuby virtual machine (VM) and just-in-time (JIT) compilation. For background on the goal of having Ruby 3 be 3 times faster than version 2 (3X3), see my previous article, “Towards the Ruby 3×3 Performance Goal”.

The JIT project (MJIT) is advancing successfully. The JIT approach and engine I proposed and implemented have been adopted by the CRuby community. Takashi Kokubun hardened the code and adapted it to the current CRuby stack machine, and MJIT recently became an experimental feature of the CRuby 2.6 release.

Introducing a Register Transfer Language (RTL) to the CRuby VM turned out to be an even harder task than introducing the initial JIT compiler. The required changes to the VM are far more invasive than the ones needed for the JIT compiler.

This article describes the advantages and disadvantages of RTL for CRuby.

Continue reading “Register Transfer Language for CRuby”

Algorithms != Programs and Programs are not “One size fits all”

You’ve probably been taught that picking an algorithm that has the best Big-O asymptotic cost will yield the best performance. You might be surprised to find that on current hardware, this isn’t always the case. Much of algorithmic analysis assumes very simple costs where the order of operations doesn’t matter. Memory access times are assumed to be the same. However, the difference between a cache hit (a few processor clock cycles) and a cache miss that requires access to main memory (a couple hundred cycles) is immense.
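As a minimal sketch of this effect (the data sizes and structures here are illustrative, not from the articles), the two functions below both sum n integers in O(n) time, yet the array version streams through memory while the linked-list version chases pointers and, when nodes are scattered by the allocator, can miss the cache on nearly every step:

```c
#include <stdlib.h>

struct node {
    long value;
    struct node *next;
};

/* O(n) over contiguous memory: the hardware prefetcher streams
 * whole cache lines ahead of the loop. */
long sum_array(const long *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Also O(n), but each iteration chases a pointer to a potentially
 * distant address, defeating the prefetcher. */
long sum_list(const struct node *head)
{
    long sum = 0;
    for (; head != NULL; head = head->next)
        sum += head->value;
    return sum;
}

int main(void)
{
    enum { N = 1000000 };
    long *a = malloc(N * sizeof *a);
    struct node *nodes = malloc(N * sizeof *nodes);
    struct node *head = NULL;

    for (size_t i = 0; i < N; i++) {
        a[i] = i;
        nodes[i].value = i;
        nodes[i].next = head;  /* in real workloads nodes end up scattered */
        head = &nodes[i];
    }

    long ok = sum_array(a, N) == sum_list(head);
    free(a);
    free(nodes);
    return ok ? 0 : 1;
}
```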

This article series grew out of a discussion between the authors (William Cohen and Ben Woodard) about the disconnect between the ideas of algorithmic efficiency typically taught in computer science and computer engineering and what is actually encountered on current computer systems.

Continue reading “Algorithms != Programs and Programs are not “One size fits all””

Automating tests and metrics gathering for Kubernetes and OpenShift (part 3)

This is the third of a series of three articles based on a session I held at Red Hat Tech Exchange EMEA. In the first article, I presented the rationale and approach for leveraging Red Hat OpenShift or Kubernetes for automated performance testing, and I gave an overview of the setup. In the second article, we looked at building an observability stack. In this third part, we will see how the execution of the performance tests can be automated and related metrics gathered.

An example of what is described in this article is available in my GitHub repository.

Continue reading “Automating tests and metrics gathering for Kubernetes and OpenShift (part 3)”

Speeding up Open vSwitch with partial hardware offloading

Open vSwitch (OVS) can use the kernel datapath or the userspace datapath. There are interesting developments in the kernel datapath using hardware offloading through the TC Flower packet classifier, but in this article, the focus will be on the userspace datapath accelerated with the Data Plane Development Kit (DPDK) and its new feature—partial flow hardware offloading—to accelerate the virtual switch even more.

This article explains how the virtual switch worked before, how it works now, and why the new feature can potentially save resources while improving the packet processing rate.

Continue reading “Speeding up Open vSwitch with partial hardware offloading”

Building an observability stack for automated performance tests on Kubernetes and OpenShift (part 2)

This is the second of a series of three articles based on a session I held at Red Hat Tech Exchange in EMEA. In the first article, I presented the rationale and approach for leveraging Red Hat OpenShift or Kubernetes for automated performance testing, and I gave an overview of the setup.

In this article, we will look at building an observability stack. In production, the observability stack can help verify that the system is working correctly and performing well. It can also be leveraged during performance tests to provide insight into how the application performs under load.

An example of what is described in this article is available in my GitHub repository.

Continue reading “Building an observability stack for automated performance tests on Kubernetes and OpenShift (part 2)”

Using a Kotlin-based gRPC API with Envoy proxy for server-side load balancing

These days, microservices-based architectures are being implemented almost everywhere. One business function could be using a few microservices that generate lots of network traffic in the form of messages being passed around. If we can make the way we pass messages more efficient by having a smaller message size, we could use the same infrastructure to handle higher loads.

Protobuf (short for “protocol buffers”) provides language- and platform-neutral mechanisms for serializing structured data for use in communications protocols, data storage, and more. gRPC is a modern, open source remote procedure call (RPC) framework that can run anywhere. Together, they provide an efficient, automatically compressed message format with first-class support for complex data structures, among other benefits (unlike JSON).

Microservices environments require lots of communication between services, and for this to happen, services need to agree on a few things. They need to agree on an API for exchanging data, for example, POST (or PUT) and GET to send and receive messages. They also need to agree on the format of the data, such as JSON. Clients calling the service additionally have to write lots of boilerplate code to make the remote calls (frameworks!). Protobuf and gRPC provide a way to define the schema of the message (something JSON cannot do) and to generate skeleton code for consuming a gRPC service (no frameworks required).
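As a minimal sketch of what that schema looks like (the service and message names below are hypothetical, not from the article), a single .proto file declares both the message layout and the RPC, and protoc then generates the client stubs and server skeletons, in Kotlin and Java among other languages:

```proto
syntax = "proto3";

package greeting;

// The message schema is explicit and versioned by field number,
// unlike free-form JSON.
message HelloRequest {
  string name = 1;
}

message HelloReply {
  string message = 1;
}

// protoc generates client stubs and a server skeleton for this
// service, so no hand-written HTTP boilerplate is needed.
service Greeter {
  rpc SayHello (HelloRequest) returns (HelloReply);
}
```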

Continue reading “Using a Kotlin-based gRPC API with Envoy proxy for server-side load balancing”

Using eXpress Data Path (XDP) maps in RHEL 8 Beta: Part 2

Diving into XDP

In the first part of this series on XDP, I introduced XDP and discussed the simplest possible example. Let’s now try to do something less trivial, exploring some more-advanced eBPF features—maps—and some common pitfalls.
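To make “maps” concrete before diving in, here is a minimal sketch (my own illustration, not the article’s code) of an XDP program that counts packets in a per-CPU array map, which a userspace program can then read via the bpf() syscall:

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* A per-CPU array with a single slot; each CPU updates its own
 * copy, so the fast path needs no atomic operations. */
struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} pkt_count SEC(".maps");

SEC("xdp")
int xdp_count(struct xdp_md *ctx)
{
    __u32 key = 0;
    __u64 *value = bpf_map_lookup_elem(&pkt_count, &key);

    if (value)
        (*value)++;        /* count this packet on the local CPU */

    return XDP_PASS;       /* let the packet continue up the stack */
}

char _license[] SEC("license") = "GPL";
```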

XDP is available in Red Hat Enterprise Linux 8 Beta, which you can download and run now.

Continue reading “Using eXpress Data Path (XDP) maps in RHEL 8 Beta: Part 2”
