Performance

Assembly Line for Computations

The simple programmer’s model of a processor executing machine language instructions is a loop of the following steps, with each step finished before moving on to the next:

  1. Fetch instruction
  2. Decode instruction and fetch register operands
  3. Execute arithmetic computation
  4. Possible memory access (read or write)
  5. Writeback results to register

As mentioned in the introductory blog article, even if the processor can get each step down to a single cycle, each instruction would take 2.5ns (5 × 0.5ns) on a 2GHz (2×10^9 cycles per second) processor, only 400 million instructions per second. If the instructions can be processed in assembly-line fashion, with the steps of different instructions overlapped, performance could improve significantly. Rather than completing one instruction every 2.5ns, the processor could potentially complete an instruction every clock cycle, a five-fold improvement in speed. This technique is known as pipelining.
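The arithmetic above can be checked directly. A minimal sketch, using the cycle time and stage count from the text, comparing the time to retire n instructions one-at-a-time versus pipelined:

```python
CYCLE_NS = 0.5   # one clock cycle on a 2 GHz processor
STAGES = 5       # fetch, decode, execute, memory access, writeback

def sequential_ns(n):
    # Each instruction occupies the processor for all five stages.
    return n * STAGES * CYCLE_NS

def pipelined_ns(n):
    # The first instruction takes STAGES cycles to fill the pipeline;
    # each later instruction completes one cycle after the previous one.
    return (STAGES + (n - 1)) * CYCLE_NS

n = 1_000_000
print(sequential_ns(n) / pipelined_ns(n))  # approaches 5.0 for large n
```

For a single instruction there is no benefit (2.5ns either way); the five-fold speedup is a throughput improvement that emerges as the pipeline stays full.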

Continue reading “Assembly Line for Computations”


Programmer’s Model of a Processor Executing Instructions Versus Reality

Everything on a computer system eventually ends up running as a sequence of machine instructions. People want to keep things simple and understandable, even if that is not really the way things work. The simple programmer’s model of a Reduced Instruction Set Computer (RISC) processor executing those machine language instructions is a loop of the following steps, with each step finished before moving on to the next:

Continue reading “Programmer’s Model of a Processor Executing Instructions Versus Reality”

Controlling resources with cgroups for performance testing

Introduction

Today I want to write about the options available for limiting the resources used by performance tests in a shared environment. A very powerful tool for this is cgroups [1], a Linux kernel feature that allows limiting the resource usage (CPU, memory, disk I/O, etc.) of a collection of processes.
Nowadays it is easy, with virtual machines or container technologies like Docker (which uses cgroups under the hood), to compartmentalize applications and make sure that they are not eating resources that have not been allocated to them. But you may face, as I did, situations where these technologies are not available; if your application happens to run on a Linux distribution like RHEL 6 or later, you can rely on cgroups directly for this purpose.

Continue reading “Controlling resources with cgroups for performance testing”

Dirty Tricks: Launching a helper process under memory and latency constraints (pthread_create and vfork)

You need to launch a helper process, and while Linux’s fork is copy-on-write (COW), the page tables still need to be duplicated, and for a large virtual address space that can exhaust memory and degrade performance. There is a wide array of solutions available, but one of them, namely vfork, is mostly avoided due to a few difficult issues. First, vfork pauses the parent thread while the child executes, until the child eventually calls an exec-family function; this is a huge latency problem for applications. Second, there are a great many considerations to take into account when using vfork in a threaded application, and missing any one of them can lead to serious problems.

It should be possible for posix_spawn to do all of this work safely via POSIX_SPAWN_USEVFORK, but often there is quite a lot of “work” that needs to be done just before the helper calls an exec-family function, and that has led to an ever-growing family of helper interfaces around posix_spawn: posix_spawn_file_actions_addclose, posix_spawn_file_actions_adddup2, posix_spawn_file_actions_destroy, posix_spawnattr_destroy, posix_spawnattr_getsigdefault, posix_spawnattr_getflags, posix_spawnattr_getpgroup, posix_spawnattr_getschedparam, posix_spawnattr_getschedpolicy, and posix_spawnattr_getsigmask. It might be simpler if the GNU C Library documented a small subset of functions you can safely call between vfork and exec, which is in fact what the preceding interfaces are modelling. If you happen to select a set of operations that posix_spawn cannot support with vfork, the implementation falls back to fork and you don’t know why. Therefore it is hard to use posix_spawn robustly.
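For a sense of what the file-actions machinery buys you, here is a minimal sketch using Python’s binding of the same API, `os.posix_spawn` (Python 3.8+; the helper name `spawn_quiet` is my own). On current glibc, posix_spawn takes a vfork-style path internally, so the parent’s page tables are not duplicated the way a full fork would:

```python
import os

def spawn_quiet(path, argv):
    """Launch a helper via posix_spawn with its stdout sent to /dev/null."""
    file_actions = [
        # Open /dev/null as the child's fd 1 (stdout) before exec.
        (os.POSIX_SPAWN_OPEN, 1, "/dev/null", os.O_WRONLY, 0o644),
    ]
    pid = os.posix_spawn(path, argv, dict(os.environ),
                         file_actions=file_actions)
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)

# prints 0 (the child's exit status); "hidden" goes to /dev/null
print(spawn_quiet("/bin/sh", ["sh", "-c", "echo hidden; exit 0"]))
```

The file-actions list stands in for the “work before exec” the article describes; anything not expressible this way is exactly what pushes people back toward hand-rolled vfork.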

Continue reading “Dirty Tricks: Launching a helper process under memory and latency constraints (pthread_create and vfork)”


Tuned: the tuning profile delivery mechanism for RHEL

What is “Tune-D” ?

Tuned is a tuning profile delivery mechanism included in Red Hat Enterprise Linux.  As demonstrated by D. John Shakshober (aka Shak) at Red Hat Summit, tuned improves performance for most workloads by quite a bit.  What’s a tuning profile, you ask?  Using the throughput-performance profile (enabled by default in RHEL7) as an example:

[Image: settings applied by the throughput-performance profile]

These settings tune RHEL for the datacenter, whether public cloud or private.  You can easily create your own profiles, too!
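A custom profile is just a small ini-style file dropped into its own directory under /etc/tuned. A minimal sketch (the profile name and sysctl value here are hypothetical) that extends the built-in throughput-performance profile:

```ini
# /etc/tuned/my-datacenter/tuned.conf
[main]
include=throughput-performance

[sysctl]
# Override one setting on top of the inherited profile.
vm.swappiness=10
```

It is then activated with `tuned-adm profile my-datacenter`.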

Red Hat delivers tuned profiles for most of our product portfolio:

Continue reading “Tuned: the tuning profile delivery mechanism for RHEL”

How to load test and tune performance on your API

The role of APIs has evolved a lot over the past few years. Not long ago, web APIs were mainly used as simple integration points between internal systems. That is no longer true. Nowadays, APIs are often the core system of a company, one on top of which several client applications, both web and mobile, are built.

When APIs were only used for back-office tasks such as extracting reports, their performance was never a key factor. However, APIs have slowly moved towards the critical path between an end-user and the service a company offers. This increase in criticality entails a direct consequence: performance of APIs really matters now.

Continue reading “How to load test and tune performance on your API”

Introducing the “rhel-tools” for RHEL Atomic Host

The rise of the purpose-built Linux distribution

Recently, several purpose-built distributions have been created specifically to run Linux containers.  There seem to be more popping up every day.  For our part, in April 2014 at the Red Hat Summit, Red Hat announced its intention to deliver a purpose-built, container-optimized version of Red Hat Enterprise Linux 7 called RHEL Atomic Host.  After over a year in the making, we are excited that launch day has finally come!

What’s important to know about Red Hat Enterprise Linux Atomic Host, you ask?  Well, plenty…but for the sake of this blog, I’ll stick to areas I know as a performance engineer:

  • RHEL Atomic leverages years of engineering effort that went into RHEL7.
  • It uses the exact same kernel as RHEL7.
  • It has a significantly reduced on-disk and in-memory footprint.
  • It utilizes OSTree technology for upgrades and rollbacks.
  • Device-mapper container storage performance is optimized out of the box.
  • Container scalability is optimized out of the box.
  • It includes a purpose-built rhel-tools container for system administration tasks.

Continue reading “Introducing the ‘rhel-tools’ for RHEL Atomic Host”

Low Latency Performance Tuning for Red Hat Enterprise Linux 7

Counting micro- and nanoseconds?  We are, because we know our customers are.  Some of the world’s largest stock exchanges including the Chicago Mercantile Exchange (CME), New York Stock Exchange (NYSE), E*TRADE, Union Bank, countless hedge funds and high-frequency trading shops run on Red Hat’s products.  In fact, the majority of the world’s financial transactions are executed with Red Hat Enterprise Linux in the critical path.

Continue reading “Low Latency Performance Tuning for Red Hat Enterprise Linux 7”
