Skip to main content
Redhat Developers  Logo
  • AI

    Get started with AI

    • Red Hat AI
      Accelerate the development and deployment of enterprise AI solutions.
    • AI learning hub
      Explore learning materials and tools, organized by task.
    • AI interactive demos
      Click through scenarios with Red Hat AI, including training LLMs and more.
    • AI/ML learning paths
      Expand your OpenShift AI knowledge using these learning resources.
    • AI quickstarts
      Focused AI use cases designed for fast deployment on Red Hat AI platforms.
    • No-cost AI training
      Foundational Red Hat AI training.

    Featured resources

    • OpenShift AI learning
    • Open source AI for developers
    • AI product application development
    • Open source-powered AI/ML for hybrid cloud
    • AI and Node.js cheat sheet

    Red Hat AI Factory with NVIDIA

    • Red Hat AI Factory with NVIDIA is a co-engineered, enterprise-grade AI solution for building, deploying, and managing AI at scale across hybrid cloud environments.
    • Explore the solution
  • Learn

    Self-guided

    • Documentation
      Find answers, get step-by-step guidance, and learn how to use Red Hat products.
    • Learning paths
      Explore curated walkthroughs for common development tasks.
    • Guided learning
      Receive custom learning paths powered by our AI assistant.
    • See all learning

    Hands-on

    • Developer Sandbox
      Spin up Red Hat's products and technologies without setup or configuration.
    • Interactive labs
      Learn by doing in these hands-on, browser-based experiences.
    • Interactive demos
      Click through product features in these guided tours.

    Browse by topic

    • AI/ML
    • Automation
    • Java
    • Kubernetes
    • Linux
    • See all topics

    Training & certifications

    • Courses and exams
    • Certifications
    • Skills assessments
    • Red Hat Academy
    • Learning subscription
    • Explore training
  • Build

    Get started

    • Red Hat build of Podman Desktop
      A downloadable, local development hub to experiment with our products and builds.
    • Developer Sandbox
      Spin up Red Hat's products and technologies without setup or configuration.

    Download products

    • Access product downloads to start building and testing right away.
    • Red Hat Enterprise Linux
    • Red Hat AI
    • Red Hat OpenShift
    • Red Hat Ansible Automation Platform
    • See all products

    Featured

    • Red Hat build of OpenJDK
    • Red Hat JBoss Enterprise Application Platform
    • Red Hat OpenShift Dev Spaces
    • Red Hat Developer Toolset

    References

    • E-books
    • Documentation
    • Cheat sheets
    • Architecture center
  • Community

    Get involved

    • Events
    • Live AI events
    • Red Hat Summit
    • Red Hat Accelerators
    • Community discussions

    Follow along

    • Articles & blogs
    • Developer newsletter
    • Videos
    • Github

    Get help

    • Customer service
    • Customer support
    • Regional contacts
    • Find a partner

    Join the Red Hat Developer program

    • Download Red Hat products and project builds, access support documentation, learning content, and more.
    • Explore the benefits

Practical micro-benchmarking with 'ltrace' and 'sched'

March 11, 2016
DJ Delorie
Related topics:
Linux
Related products:
Red Hat Enterprise Linux

    Recently I was asked to look into an issue related to QEMU's I/O subsystem performance - specifically, I was looking for differences in performance between glibc's malloc and other malloc implementations'. After a good deal of benchmarking I was unable to see a clear difference between our malloc (glibc) and others, not because the implementations were similar, but because there was too much noise in the system; the benchmarks were being polluted by other things, not just in QEMU, but elsewhere in the operating system. I really just wanted to see how much time malloc was using, but it was a small signal in a large noisy system.

    To get to the results I needed, I had to isolate malloc's activity so I could measure it more accurately. In this blog, I'll document how I extracted the malloc workload, and how to isolate that workload from other system activity so it can be measured more effectively.

    (Note, of course, that benchmarks are statistical things, and not all of the methods that worked for me may show statistically significant results for you. YMMV - "your metrics may vary" - but I am hopeful these techniques will prove useful in your own scenario.)

    Step One - ltrace

    The first step was to analyze the target workload to see what it was doing. If you're benchmarking your own function, you can add 'printf()' statements or other trace functionality to record every call, but in this case - benchmarking a system library - we don't have that option.

    (Ok, it's open source, we have that option, but it's not practical).

    The ltrace utility is designed to capture calls to system libraries, so it's a good fit for our situation. The syntax we'll use lets us connect to a running process, and store the output to a file. You'll need your process ID (pid) and the name of a file (filename) in which to place the ltrace output:

    ltrace -p <pid> -f -o <filename> -e malloc+free
    

    If we run QEMU in one window, and this command in another window, we can start ltrace right before starting the test case within QEMU, then CTRL-C'ing ltrace once the test case is done in order to capture only the relevant data. Your output should look something like this:

    15671 exe->malloc(144)                    = 0x7fa9f6e82eb0 <0.000356>
    15671 exe->malloc(128)                    = 0x7fa9f1513580 <0.000318>
    15662 exe->malloc(40)                     = 0x7fa9f2338f90 <0.000365>
    15662 exe->free(0x7fa9f79e0000)           = <void> <0.000366>
    15662 exe->free(0x7fa9f6c3d100)           = <void> <0.000363>
    

    The actual results are not relevant to this blog, but I observed that malloc was called many times with only a few different sizes requested, so if I benchmarked calls of those sizes I'd get a general idea of how our malloc compares to others. A simple sed/awk/grep/sort/uniq/pr script (ok, not so simple) gives us a histogram of malloc calls. Alternatively, you can just read the logs and interpret them yourself. ;)

    Step 2 - The Microbenchmark

    If you want to time something that takes minutes, you can use a stopwatch. If it only takes seconds, gettimeofday() is sufficient; however, a single call to malloc might only take microseconds or even nanoseconds.

    Despite the wide range of system performance tools available, we'll choose an old simple solution - ReaD Time Stamp Counter (RDTSC) - which tells us how many CPU clock cycles (or the equivalent) have passed. Since we're comparing malloc implementations, a relative time is sufficient and we don't need to know (or even worry about) converting clock cycles to microseconds. The RDTSC instruction is cpu-dependent, but it's available on the one I'm using (x86-64) and is typically used like this:

    static __inline__ int64_t rdtsc(void)
    {
      unsigned a, d;
      asm volatile("rdtsc" : "=a" (a), "=d" (d));
      return ((unsigned long)a) | (((unsigned long)d) << 32);
    }
    
    . . .
    
    int64_t clocks_before, clocks_after, clocks_per_malloc;
    
    clocks_before = rdtsc ();
    p = malloc (i); /* Test goes here */
    clocks_after = rdtsc ();
    clocks_per_malloc = clocks_after - clocks_before;
    

    Now we just run that code a few thousand times and we should get some nice clean results, and since there are only a small number of paths that the code can follow through malloc we should see only a few different times reported... or not?

    A graph of how long two malloc() implementations take for the same workload, sorted by duration, showing percentile of malloc calls and average time for that percentile.

    One drawback of the RDTSC instruction is that the CPU is allowed to reorder it relative to other instructions, which causes noise in our results. Fortunately, Intel has provided an RDTSCP instruction that's more deterministic. We'll pair that with a CPUID instruction which acts as a memory barrier, resulting in this:

    static __inline__ int64_t rdtsc_s(void)
    {
      unsigned a, d; 
      asm volatile("cpuid" ::: "%rax", "%rbx", "%rcx", "%rdx");
      asm volatile("rdtsc" : "=a" (a), "=d" (d)); 
      return ((unsigned long)a) | (((unsigned long)d) << 32); 
    }
    
    static __inline__ int64_t rdtsc_e(void)
    {
      unsigned a, d; 
      asm volatile("rdtscp" : "=a" (a), "=d" (d)); 
      asm volatile("cpuid" ::: "%rax", "%rbx", "%rcx", "%rdx");
      return ((unsigned long)a) | (((unsigned long)d) << 32); 
    }
    
    . . .
    
    clocks_before = rdtsc_s ();
    p = malloc (i); /* Test goes here */
    clocks_after = rdtsc_e ();
    clocks_per_malloc = clocks_after - clocks_before;
    

    Results after switching to cpuid/rdtsc barriers

    This is much better. Still, given that malloc is a relatively small function, you'd expect a discrete number of elapsed times, but the graph shows a much wider range of times (one for each possible code path - not counting the cases where the kernel is called and the graph goes off the chart on the upper right). Even among the "fast" calls there are smooth transitions. Clearly, other things are interfering with our measurement. There is still hope, however, because we can use the kernel's scheduling API to reduce interference.

    Since the test case itself is multi-threaded, the first thing we need to do is make sure each thread is assigned to a specific CPU to avoid our test interfering with itself. We can read /proc/cpuinfo to see how many CPUs exist, and use sched to assign each thread to its own CPU:

    cpu_set_t my_cpu;
    /* Skip CPU0 - let the OS run on that one */
    int my_cpu_num = (tid % (num_cpus-1))+1;
    
    CPU_ZERO (&my_cpu);
    CPU_SET (my_cpu_num, &my_cpu);
    if (sched_setaffinity (0, sizeof(my_cpu), &my_cpu) == -1)
      perror ("setaffinity failed");
    

    Next we need to make sure the kernel doesn't schedule anything else to the CPUs we're using, and there happens to be a kernel command line option to "isolate" CPUs from the system so that other, outside tasks won't get scheduled to them. We have 8 CPUs, and as the comment above notes, we let the OS use CPU #0. It's time to reboot and add this to the boot-loader kernel command line:

    linux . . . isolcpus=1,2,3,4,5,6,7
    

    Go ahead and boot using this additional configuration, and once the system is back up and running, you can confirm which CPUs are in use via the taskset command. You can check the PID of your shell, for example. Notice the "affinity mask" will be set when isolcpus is enabled:

    [without isolcpus]
    $ taskset -p $$
    pid 26993's current affinity mask: ff
    [with isolcpus]
    $ taskset -p $$
    pid 26993's current affinity mask: 01
    

    What else could be interrupting our tests? Interrupts! They're not tasks and are not governed by the task scheduler. Interrupt affinity is controlled separately, by /proc/irq/*/smp_affinity. If you look at those files, you'll see a mask similar to what taskset uses, and most of them are still ff despite using CPU isolation. You can assign interrupts to CPU0 like this (it will probably take several tries to get them all):

    cd /proc/irq
    for i in */smp_affinity; do echo 1 > $i; done
    

    You probably don't want to run your machine like this for long, as it's effectively a single-core machine; however, the results are now almost as clean as we can get them:

    Results after isolating the tests

    Note that a larger percentile of the calls show the shortest time (i.e. transitions have shifted right), and the transition to calls taking more cycles is sharper. This indicates that fewer of the fast calls are getting interrupted by scheduling or hardware issues.

    This is starting to look pretty good, but there's one more thing we can do to clean it up more. If you happen to have access to a machine with a lot of CPU cores and RAM, you'll discover that there's still room for improvement (could be tasks that can't be moved off the relevant CPUs, or hardware contention). An example of those results can be seen below:

    Results on a many-cpu machine

    (For the record, no, I can't afford a machine like that :-)

    Conclusion

    In summary, you can see a huge difference between the first and last graph, but that improvement is based on many steps, and there's still some room for improvement (an Intel paper uses a kernel module to isolate their tests even further, but that's beyond the scope of what I need to test). As we run more benchmarks and clean up the results, we learn more about what we're benchmarking. We learn what "noise" needs to be removed (like we've done above) and what "noise" is actually part of the result we're looking for.

    For now, this is good enough for my purposes. Look forward to a blog post about changes in malloc!


    1How to Benchmark Code Execution Times on Intel® IA-32 and IA-64 Instruction Set Architectures, Gabriele Paoloni, Linux Kernel/Embedded Software Engineer, Intel Corporation, September 2010

    Last updated: March 16, 2018

    Recent Posts

    • Every layer counts: Defense in depth for AI agents with Red Hat AI

    • Fun in the RUN instruction: Why container builds with distroless images can surprise you

    • Trusted software factory: Building trust in the agentic AI era

    • Build a zero trust AI pipeline with OpenShift and RHEL CVMs

    • Red Hat Hardened Images: Top 5 benefits for software developers

    Red Hat Developers logo LinkedIn YouTube Twitter Facebook

    Platforms

    • Red Hat AI
    • Red Hat Enterprise Linux
    • Red Hat OpenShift
    • Red Hat Ansible Automation Platform
    • See all products

    Build

    • Developer Sandbox
    • Developer tools
    • Interactive tutorials
    • API catalog

    Quicklinks

    • Learning resources
    • E-books
    • Cheat sheets
    • Blog
    • Events
    • Newsletter

    Communicate

    • About us
    • Contact sales
    • Find a partner
    • Report a website issue
    • Site status dashboard
    • Report a security problem

    RED HAT DEVELOPER

    Build here. Go anywhere.

    We serve the builders. The problem solvers who create careers with code.

    Join us if you’re a developer, software engineer, web designer, front-end designer, UX designer, computer scientist, architect, tester, product manager, project manager or team lead.

    Sign me up

    Red Hat legal and privacy links

    • About Red Hat
    • Jobs
    • Events
    • Locations
    • Contact Red Hat
    • Red Hat Blog
    • Inclusion at Red Hat
    • Cool Stuff Store
    • Red Hat Summit
    © 2026 Red Hat

    Red Hat legal and privacy links

    • Privacy statement
    • Terms of use
    • All policies and guidelines
    • Digital accessibility

    Chat Support

    Please log in with your Red Hat account to access chat support.