performance

Quarkus: Modernize “helloworld” JBoss EAP quickstart, Part 2

Quarkus: Modernize “helloworld” JBoss EAP quickstart, Part 2

In part one of this series, we took a detailed look at Red Hat JBoss Enterprise Application Platform (JBoss EAP) quickstarts helloworld quickstart as a starting point for understanding how to modernize a Java application using technologies (CDI and Servlet 3) supported in Quarkus. In this part, we’ll continue our discussion of modernization with a look at memory consumption.

Measuring performances is a fundamental topic when dealing with a modernization process, and memory consumption reporting is part of performance analysis. It’s worth starting with these tools from the very beginning so that they can be used to evaluate the improvements achieved during the modernization process.

Continue reading “Quarkus: Modernize “helloworld” JBoss EAP quickstart, Part 2″

Share
How to reduce Red Hat Fuse image size

How to reduce Red Hat Fuse image size

Red Hat Fuse is a leading integration platform, which is capable of solving any given problem with simple enterprise integration patterns (EIP).  Over time, Red Hat Fuse has evolved to cater to a wide range of infrastructure needs.

For more information on each of these, check out the Red Hat Fuse documentation. The Fuse on Red Hat OpenShift flavor uses a Fuse image that has runtime components packaged inside a Linux container image.  This article will discuss how to reduce the size of the Fuse image. The same principle can be used for other images.

Continue reading “How to reduce Red Hat Fuse image size”

Share
Customize the compilation process with Clang: Making compromises

Customize the compilation process with Clang: Making compromises

In this two-part series, we’re looking at the Clang compiler and various ways of customizing the compilation process. These articles are an expanded version of the presentation, called Merci le Compilo, which was given at CPPP in June.

In part one, we looked at specific options for customization. And, in this article, we’ll look at some examples of compromises and tradeoffs involved in different approaches.

Continue reading “Customize the compilation process with Clang: Making compromises”

Share
Customize the compilation process with Clang: Optimization options

Customize the compilation process with Clang: Optimization options

When using C++, developers generally aim to keep a high level of abstraction without sacrificing performance. That’s the famous motto “costless abstractions.” Yet the C++ language actually doesn’t give a lot of guarantees to developers in terms of performance. You can have the guarantee of copy-elision or compile-time evaluation, but key optimizations like inlining, unrolling, constant propagation or, dare I say, tail call elimination are subject to the goodwill of the standard’s best friend: the compiler.

This article focuses on the Clang compiler and the various flags it offers to customize the compilation process. I’ve tried to keep this from being a boring list, and it certainly is not an exhaustive one.

Continue reading “Customize the compilation process with Clang: Optimization options”

Share
Probing golang runtime using SystemTap

Probing golang runtime using SystemTap

I recently saw an article from Uber Engineering describing an issue they were having with an increase in latency. The Uber engineers suspected that their code was running out of stack space causing the golang runtime to issue a stack growth, which would introduce additional latency due to memory allocation and copying. The engineers ended up modifying the golang runtime with additional instrumentation to report these stack growths to confirm their suspicions. This situation is a perfect example of where SystemTap could have been used.

SystemTap is a tool that can be used to perform live analysis of a running program. It is able to interrupt normal control flow and execute code specified by a SystemTap script, which can allow users to temporarily modify a running program without having to change the source and recompile.

Continue reading “Probing golang runtime using SystemTap”

Share
How to use the Linux perf tool to count software events

How to use the Linux perf tool to count software events

The Linux perf tool was originally written to allow access to the performance monitoring hardware that counts hardware events, such as instructions executed, processor cycles, and cache misses. However, it can also be used to count software events, which can be useful in gauging how frequently some part of the system software is executed.

Recently someone at Red Hat asked whether there was a way to get a count of system calls being executed on the system. The kernel has a predefined software trace point, raw_syscalls:sys_enter, which collects that exact information; it counts each time a system call is made. To use the trace point events, the perf command needs to be run as root.

Continue reading “How to use the Linux perf tool to count software events”

Share
Speed up SystemTap scripts with statistical aggregates

Speed up SystemTap scripts with statistical aggregates

A common question that SystemTap can be used to answer involves how often particular events occur on the system. Some events, such as system calls, can happen frequently and the goal is to make the SystemTap script as efficient as possible.

Using the statistical aggregate in the place of regular integers is one way to improve the performance of SystemTap scripts. The statistical aggregates record data on a per-CPU basis to reduce the amount of coordination required between processors, allowing information to be recorded with less overhead. In this article, I’ll show an example of how to reduce overhead in SystemTap scripts.

Continue reading “Speed up SystemTap scripts with statistical aggregates”

Share
Speed up SystemTap script monitoring of system calls

Speed up SystemTap script monitoring of system calls

SystemTap has extensive libraries called tapsets that allow developers to instrument various aspects of the kernel’s operation. SystemTap allows the use of wildcards to instrument multiple locations in particular subsystems.  SystemTap has to perform a significant amount of work to create instrumentation for each of the places being probed.  This overhead is particularly apparent when using the wildcards for the system call tapset that contains hundreds of entries (syscall.* and syscall.*.return). For some subsets of data collection, replacing the wildcard-matched syscall probes in SystemTap scripts with the kernel.trace("sys_enter")  and the kernel.trace("sys_exit") probe will produce smaller instrumentation modules that compile and start up more quickly. In this article, I’ll show a few examples of how this works.

Continue reading “Speed up SystemTap script monitoring of system calls”

Share
How data layout affects memory performance

How data layout affects memory performance

The mental model most people have of how computer memory (aka Random Access Memory or RAM) operates is inaccurate. The assumption that any access to any byte in memory has the same low cost does not hold on modern processors. In this article, I’ll explain what developers need to know about modern memory and how data layout can affect performance.

Current memory is starting to look more like an extremely fast block storage device. Rather than reading or writing individual bytes, the processor is reading or writing groups of bytes that fill a cache line (commonly 32 to 128 bytes in size). An access to memory requires well over a hundred clock cycles, two orders of magnitude slower than executing an instruction on the processor. Thus, programmers might reconsider the data structures used in their program if they are interested in obtaining better performance.

Continue reading “How data layout affects memory performance”

Share