Practical micro-benchmarking with 'ltrace' and 'sched'

March 11, 2016
DJ Delorie
Related topics: Linux
Related products: Red Hat Enterprise Linux


    Recently I was asked to look into an issue related to QEMU's I/O subsystem performance - specifically, to look for performance differences between glibc's malloc and other malloc implementations. After a good deal of benchmarking I was unable to see a clear difference between our malloc (glibc's) and the others - not because the implementations were similar, but because there was too much noise in the system. The benchmarks were being polluted by other activity, not just in QEMU but elsewhere in the operating system. I really just wanted to see how much time malloc was using, but it was a small signal in a large, noisy system.

    To get to the results I needed, I had to isolate malloc's activity so I could measure it more accurately. In this blog, I'll document how I extracted the malloc workload, and how to isolate that workload from other system activity so it can be measured more effectively.

    (Note, of course, that benchmarks are statistical things, and not all of the methods that worked for me may show statistically significant results for you. YMMV - "your metrics may vary" - but I am hopeful these techniques will prove useful in your own scenario.)

    Step 1 - ltrace

    The first step was to analyze the target workload to see what it was doing. If you're benchmarking your own function, you can add 'printf()' statements or other trace functionality to record every call, but in this case - benchmarking a system library - we don't have that option.

    (Ok, it's open source, we have that option, but it's not practical).

    The ltrace utility is designed to capture calls to system libraries, so it's a good fit for our situation. The invocation we'll use attaches to a running process and stores the output in a file. You'll need your process ID (pid) and the name of a file (filename) in which to place the ltrace output:

    ltrace -p <pid> -f -o <filename> -e malloc+free
    

    If we run QEMU in one window and this command in another, we can start ltrace right before starting the test case within QEMU, then CTRL-C ltrace once the test case is done, so that we capture only the relevant data. Your output should look something like this:

    15671 exe->malloc(144)                    = 0x7fa9f6e82eb0 <0.000356>
    15671 exe->malloc(128)                    = 0x7fa9f1513580 <0.000318>
    15662 exe->malloc(40)                     = 0x7fa9f2338f90 <0.000365>
    15662 exe->free(0x7fa9f79e0000)           = <void> <0.000366>
    15662 exe->free(0x7fa9f6c3d100)           = <void> <0.000363>
    

    The actual results are not relevant to this blog, but I observed that malloc was called many times with only a few different sizes requested, so if I benchmarked calls of those sizes I'd get a general idea of how our malloc compares to others. A simple sed/awk/grep/sort/uniq/pr script (ok, not so simple) gives us a histogram of malloc calls. Alternatively, you can just read the logs and interpret them yourself. ;)
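
    For instance, a one-liner along these lines (a sketch - it assumes the trace was saved to a file named ltrace.out, which is not the original's filename) counts how often each request size appears:

    # histogram of malloc request sizes, most frequent first
    grep -o 'malloc([0-9]*)' ltrace.out | sort | uniq -c | sort -rn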

    Step 2 - The Microbenchmark

    If you want to time something that takes minutes, you can use a stopwatch. If it only takes seconds, gettimeofday() is sufficient; however, a single call to malloc might only take microseconds or even nanoseconds.

    Despite the wide range of system performance tools available, we'll choose an old, simple solution - the ReaD Time Stamp Counter instruction (RDTSC) - which tells us how many CPU clock cycles (or the equivalent) have passed. Since we're comparing malloc implementations, relative times are sufficient, and we don't need to worry about converting clock cycles to microseconds. The RDTSC instruction is CPU-dependent, but it's available on the one I'm using (x86-64) and is typically used like this:

    #include <stdint.h>

    /* Read the CPU's time-stamp counter (x86-64). RDTSC leaves the low
       32 bits in EAX and the high 32 bits in EDX. */
    static __inline__ int64_t rdtsc(void)
    {
      unsigned a, d;
      asm volatile("rdtsc" : "=a" (a), "=d" (d));
      return ((unsigned long)a) | (((unsigned long)d) << 32);
    }
    
    . . .
    
    int64_t clocks_before, clocks_after, clocks_per_malloc;
    void *p;    /* result of the timed allocation */
    size_t i;   /* request size, e.g. one seen in the ltrace log */
    
    clocks_before = rdtsc ();
    p = malloc (i); /* Test goes here */
    clocks_after = rdtsc ();
    clocks_per_malloc = clocks_after - clocks_before;
    

    Now we just run that code a few thousand times and we should get some nice, clean results. Since there are only a small number of paths the code can follow through malloc, we should see only a few different times reported... or not?
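
    To produce a distribution like the graph below, we can wrap that measurement in a driver loop, sort the samples, and print percentiles. Here's a minimal sketch (the sample count and bench_one_size() helper are illustrative placeholders, not the original harness; it reuses the rdtsc() helper above):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define SAMPLES 10000   /* placeholder sample count */

    static int cmp64 (const void *a, const void *b)
    {
      int64_t x = *(const int64_t *)a, y = *(const int64_t *)b;
      return (x > y) - (x < y);
    }

    static void bench_one_size (size_t size)
    {
      static int64_t samples[SAMPLES];
      int i;

      for (i = 0; i < SAMPLES; i++)
        {
          int64_t before = rdtsc ();
          void *p = malloc (size);  /* the call under test */
          int64_t after = rdtsc ();
          free (p);                 /* untimed, so each pass starts clean */
          samples[i] = after - before;
        }

      /* Sort, then print every 10th percentile - the shape the graphs plot. */
      qsort (samples, SAMPLES, sizeof samples[0], cmp64);
      for (i = 0; i <= 100; i += 10)
        printf ("%3d%%  %10lld clocks\n", i,
                (long long) samples[(SAMPLES - 1) * i / 100]);
    }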

    [Graph: how long two malloc() implementations take for the same workload, sorted by duration, showing the percentile of malloc calls and the average time at that percentile]

    One drawback of the RDTSC instruction is that the CPU is allowed to reorder it relative to other instructions, which causes noise in our results. Fortunately, Intel provides an RDTSCP instruction that's more deterministic. We'll pair that with the CPUID instruction, which is serializing - the CPU won't reorder across it - resulting in this:

    /* Serialized read for the start of a timed region: CPUID keeps the
       CPU from reordering earlier instructions past the RDTSC. */
    static __inline__ int64_t rdtsc_s(void)
    {
      unsigned a, d;
      asm volatile("cpuid" ::: "%rax", "%rbx", "%rcx", "%rdx");
      asm volatile("rdtsc" : "=a" (a), "=d" (d));
      return ((unsigned long)a) | (((unsigned long)d) << 32);
    }
    
    /* Serialized read for the end of a timed region. RDTSCP also writes
       ECX, so it must be listed as clobbered. */
    static __inline__ int64_t rdtsc_e(void)
    {
      unsigned a, d;
      asm volatile("rdtscp" : "=a" (a), "=d" (d) : : "%rcx");
      asm volatile("cpuid" ::: "%rax", "%rbx", "%rcx", "%rdx");
      return ((unsigned long)a) | (((unsigned long)d) << 32);
    }
    
    . . .
    
    clocks_before = rdtsc_s ();
    p = malloc (i); /* Test goes here */
    clocks_after = rdtsc_e ();
    clocks_per_malloc = clocks_after - clocks_before;
    

    [Graph: results after switching to cpuid/rdtsc barriers]

    This is much better. Still, given that malloc is a relatively small function, you'd expect a discrete set of elapsed times - one for each possible code path, not counting the calls that go into the kernel and run off the chart at the upper right - but the graph shows a much wider range, with smooth transitions even among the "fast" calls. Clearly, other things are interfering with our measurements. There is still hope, however, because we can use the kernel's scheduling API to reduce the interference.

    Since the test case itself is multi-threaded, the first thing we need to do is pin each thread to a specific CPU so that our test doesn't interfere with itself. We can read /proc/cpuinfo to see how many CPUs exist, and use sched_setaffinity() to assign each thread to its own CPU:

    #define _GNU_SOURCE   /* for CPU_* macros and sched_setaffinity() */
    #include <sched.h>
    #include <stdio.h>
    
    cpu_set_t my_cpu;
    /* Skip CPU0 - let the OS run on that one */
    int my_cpu_num = (tid % (num_cpus-1))+1;   /* tid = this thread's index */
    
    CPU_ZERO (&my_cpu);
    CPU_SET (my_cpu_num, &my_cpu);
    if (sched_setaffinity (0, sizeof(my_cpu), &my_cpu) == -1)
      perror ("setaffinity failed");
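
    (For reference, the num_cpus value used above can also be obtained programmatically - a sketch; sysconf() reports the same count you'd get from /proc/cpuinfo:)

    #include <unistd.h>
    
    int num_cpus = (int) sysconf (_SC_NPROCESSORS_ONLN); /* CPUs online */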
    

    Next we need to make sure the kernel doesn't schedule anything else to the CPUs we're using, and there happens to be a kernel command line option to "isolate" CPUs from the system so that other, outside tasks won't get scheduled to them. We have 8 CPUs, and as the comment above notes, we let the OS use CPU #0. It's time to reboot and add this to the boot-loader kernel command line:

    linux . . . isolcpus=1,2,3,4,5,6,7
    

    Go ahead and boot with this additional configuration. Once the system is back up and running, you can confirm which CPUs a process may use via the taskset command - checking the PID of your shell, for example. Notice how the affinity mask shrinks once isolcpus is in effect:

    [without isolcpus]
    $ taskset -p $$
    pid 26993's current affinity mask: ff
    [with isolcpus]
    $ taskset -p $$
    pid 26993's current affinity mask: 01
    

    What else could be interrupting our tests? Interrupts! They're not tasks, so they're not governed by the task scheduler. Interrupt affinity is controlled separately, through /proc/irq/*/smp_affinity. If you look at those files, you'll see a mask similar to what taskset uses, and most of them are still ff despite the CPU isolation. You can assign interrupts to CPU0 like this (as root; it will probably take several tries to get them all):

    cd /proc/irq
    for i in */smp_affinity; do echo 1 > $i; done
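
    To check your progress, something like this (a sketch - it assumes your system shows simple hex masks like 01 or ff, rather than comma-separated groups) lists the IRQs still allowed on CPUs other than CPU0:

    # list IRQs whose affinity mask is not CPU0-only
    grep -v '^0*1$' /proc/irq/*/smp_affinity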
    

    You probably don't want to run your machine like this for long, as it's effectively a single-core machine; however, the results are now almost as clean as we can get them:

    [Graph: results after isolating the tests]

    Note that a larger percentage of the calls now shows the shortest time (i.e., the transitions have shifted right), and the transition to calls taking more cycles is sharper. This indicates that fewer of the fast calls are being interrupted by scheduling or hardware activity.

    This is starting to look pretty good, but there's one more thing we can do to clean it up further. Even now there's residual noise - perhaps from tasks that can't be moved off the relevant CPUs, or from hardware contention - and if you happen to have access to a machine with many CPU cores and plenty of RAM, you'll find the results improve further still. An example of those results can be seen below:

    [Graph: results on a many-CPU machine]

    (For the record, no, I can't afford a machine like that :-)

    Conclusion

    In summary, you can see a huge difference between the first and last graphs, but that improvement came from many separate steps, and there's still some room for improvement (an Intel paper [1] uses a kernel module to isolate its tests even further, but that's beyond the scope of what I need to test). As we run more benchmarks and clean up the results, we learn more about what we're benchmarking: what "noise" needs to be removed (as we've done above) and what "noise" is actually part of the result we're looking for.

    For now, this is good enough for my purposes. Look forward to a blog post about changes in malloc!


    [1] How to Benchmark Code Execution Times on Intel® IA-32 and IA-64 Instruction Set Architectures, Gabriele Paoloni, Linux Kernel/Embedded Software Engineer, Intel Corporation, September 2010

    Last updated: March 16, 2018
