As a kernel quality engineer at Red Hat, I'm often asked what I do. How does one test a kernel for quality? Testing and debugging the Linux kernel can be challenging because you can't simply attach a debugger like GDB (the GNU Project debugger) if the kernel is crashing: When that happens, everything crashes, including GDB! Even if the kernel isn't crashing, if you ask a debugger to stop the kernel from running so you can inspect something, there is no way to resume running the kernel because the kernel itself is in charge of stopping and starting processes. Asking the kernel to stop itself is a dead-end road.
Note: There is a way to attach a debugger and stop and resume a kernel, but it requires a second computer and special hardware such as JTAG. Most of the time, I don't have access to that hardware, so that's out of the question.
Many people think that quality testing the Linux kernel must involve writing lots of
kernel code; for example, a module to exercise a part of the kernel. Although
that is a possibility, most testing actually happens from userspace. I spend a
lot of time reading kernel code to try to devise a test that usually runs
as a simple Bash script. There are many ways to interact with the kernel
from userspace, including system calls, logs, the
file systems, and more. In this article, I will introduce you to some of the simple tools I've used for investigating behavior in the Linux kernel.
dmesg and journalctl
The most basic check on Linux kernel operations, particularly to follow what they're doing with devices and modules, is to retrieve the messages that the kernel generates during its boot and later. The messages can be retrieved from the kernel's logs through a dmesg or journalctl command.
For instance, a feature I tested recently involved enabling support for AMD Secure Memory
Encryption ( (SME) on AMD's latest CPUs. To enable SME, you have to add
mem_encrypt=on as a kernel parameter and reboot. You can then check for the string "sme" in the
dmesg logs. See the
kernel-doc RPM in Red Hat Enterprise Linux (RHEL) 8 for more information.
Note: You may need to enable SME in your BIOS settings, too.
Enabling SME in dmesg logs
Let's take a look at the steps to enable SME in dmesg logs.
First, by default, because
mem_encrypt is unset,
dmesg shows no messages:
~]# dmesg | grep -i sme ~]#
Add the kernel parameter and reboot, and the message is logged:
~]# grubby --args="mem_encrypt=on" --update-kernel=ALL ~]# reboot ... ~]# dmesg | grep -i sme [ 0.001000] AMD Memory Encryption Features active: SME
To double-check your work, you can remove the parameter and reboot once more to verify that the message does not show up in the logs:
~]# grubby --remove-args="mem_encrypt=on" --update-kernel=ALL ~]# reboot ... ~]# dmesg | grep -i sme ~]#
Further testing could be required to show that SME is working as expected, but the
dmesg logs give the first evidence that the kernel is properly detecting the
hardware feature and enabling support when requested.
Logging with journalctl
I mentioned earlier that
journalctl can get the same output as
dmesg. This is because the journal recorded by
systemd includes the normal
dmesg logs along with userspace logs from
/var/log/messages and more, and each entry is
annotated with metadata. To extract just the
dmesg equivalent from
the journal, enter:
~]# journalctl --dmesg --output short-monotonic --no-hostname -- Logs begin at Wed 2021-05-26 11:08:19 EDT, end at Wed 2021-05-26 17:01:01 EDT. -- [ 0.000000] kernel: Linux version 4.18.0-305.el8.x86_64 (email@example.com) (gcc version 8.4.1 20200928 (Red Hat 8.4.1-1) (GCC)) #1 SMP Thu Apr 29 08:54:30 EDT 2021 [ 0.000000] kernel: Command line: BOOT_IMAGE=(hd0,gpt2)/vmlinuz-4.18.0-305.el8.x86_64 root=/dev/mapper/rhel_dell--per7425--02-root ro crashkernel=auto resume=/dev/mapper/rhel_dell--per7425--02-swap rd.lvm.lv=rhel_dell-per7425-02/root rd.lvm.lv=rhel_dell-per7425-02/swap console=ttyS0,115200n81 [ 0.000000] kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers' [ 0.000000] kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers' [ 0.000000] kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers' [ 0.000000] kernel: x86/fpu: xstate_offset: 576, xstate_sizes: 256 [ 0.000000] kernel: x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'compacted' format. [ 0.000000] kernel: BIOS-provided physical RAM map: [ 0.000000] kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000008efff] usable [ 0.000000] kernel: BIOS-e820: [mem 0x000000000008f000-0x000000000008ffff] ACPI NVS ... ...
Compare the previous
journalctl output to
~]# dmesg [ 0.000000] Linux version 4.18.0-305.el8.x86_64 (firstname.lastname@example.org) (gcc version 8.4.1 20200928 (Red Hat 8.4.1-1) (GCC)) #1 SMP Thu Apr 29 08:54:30 EDT 2021 [ 0.000000] Command line: BOOT_IMAGE=(hd0,gpt2)/vmlinuz-4.18.0-305.el8.x86_64 root=/dev/mapper/rhel_dell--per7425--02-root ro crashkernel=auto resume=/dev/mapper/rhel_dell--per7425--02-swap rd.lvm.lv=rhel_dell-per7425-02/root rd.lvm.lv=rhel_dell-per7425-02/swap console=ttyS0,115200n81 [ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers' [ 0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers' [ 0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers' [ 0.000000] x86/fpu: xstate_offset: 576, xstate_sizes: 256 [ 0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'compacted' format. [ 0.000000] BIOS-provided physical RAM map: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000008efff] usable [ 0.000000] BIOS-e820: [mem 0x000000000008f000-0x000000000008ffff] ACPI NVS ... ...
The two commmands show nearly identical output. The differences are that
journalctl output starts with a
-- Logs begin line and shows the source of each message (the
kernel: prefix before each message).
sysfs and procfs
/sys directory (aka
/proc directory (aka
are also important sources of kernel information, and can be used to control the kernel.
As an example, an important feature of some servers is the ability to hot plug CPUs. Before you swap out a CPU, though, you have to tell the kernel to disable that CPU, or the kernel might try to schedule work on a CPU that's in the middle of being swapped.
Let's look at the preparation to hot swap a CPU.
Disable a CPU for hot swapping
We start with an
lscpu command, which lists information about the CPUs, to find what CPUs we have with NUMA support. Then, we disable CPU 11 and check the CPU status again:
~]# lscpu | grep 'CPU(s)' | grep -v NUMA CPU(s): 96 On-line CPU(s) list: 0-95 ~]# echo 0 > /sys/devices/system/cpu/cpu11/online ~]# lscpu | grep 'CPU(s)' | grep -v NUMA CPU(s): 96 On-line CPU(s) list: 0-10,12-95 Off-line CPU(s) list: 11
A good way to double-check that the kernel is not trying to use CPU 11 is to verify that no interrupt handlers are assigned to it. To do this check, view the
/proc/interrupts file (the output is truncated here for better visibility):
~]# less -SFiX /proc/interrupts CPU0 .... CPU9 CPU10 CPU12 CPU13 C> 0: 169 .... 0 0 0 0 > 4: 0 .... 0 0 0 0 > 8: 0 .... 0 0 0 0 > 9: 0 .... 0 0 0 0 > 10: 0 .... 0 0 0 0 > ... ...
Note that the entire column for
CPU11 is missing.
After hot-swapping CPU 11 for a new one, turn it back on:
~]# echo 1 > /sys/devices/system/cpu/cpu11/online ~]# lscpu | grep 'CPU(s)' | grep -v NUMA CPU(s): 96 On-line CPU(s) list: 0-95 ~]# less -SFiX /proc/interrupts CPU0 .... CPU9 CPU10 CPU11 CPU12 C> 0: 169 .... 0 0 0 0 > 4: 0 .... 0 0 0 0 > 8: 0 .... 0 0 0 0 > 9: 0 .... 0 0 0 0 > 10: 0 .... 0 0 0 0 > ... ...
CPU 11 is online and ready to process interrupts again.
Disable CPU threads and cores
Of course, modern CPUs have more than one core on each CPU, and each core can have two or more threads, so you'll need to disable all of the threads and cores on a CPU (or "socket") in order to hot swap it. The CPU topology on this particular system is:
~]# lscpu | grep -e 'Thread(s)' -e 'Core(s)' -e 'Socket(s)' Thread(s) per core: 2 Core(s) per socket: 24 Socket(s): 2
So, to hot swap CPU 11 (that is, one thread of one core), we'll need to
disable all of its siblings. This information is also available in
~]# cat /sys/devices/system/cpu/cpu11/topology/core_siblings_list 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87,89,91,93,95
Disable all 48 of these CPUs, and then the system is ready for the physical CPU hot swap.
Several years ago, I added support for the
RDRAND assembly instruction to the kernel for Intel's Ivy Bridge line of CPUs. As I mentioned at the start of this article, it's difficult to attach a debugger to the kernel, so I can't just single-step through the assembly code and look for executions of
I looked at the upstream commit that enabled
RDRAND support and noticed a macro named
alternative_io that I had not seen before:
alternative_io("movl $0, %0\n\t" \ nop, \ "\n1: " rdrand "\n\t" \ "jc 2f\n\t" \ "decl %0\n\t" \ "jnz 1b\n\t" \ "2:", \ X86_FEATURE_RDRAND, \ ASM_OUTPUT2("=r" (ok), "=a" (*v)), \ "0" (RDRAND_RETRY_LOOPS)); \
After a bit of research, I learned that the
alternative set of macros (including variants like the
alternative_io macro shown here) is a mechanism for the kernel to do live patching on itself at boot under certain
conditions. As background, a default set of fairly generic assembly instructions is compatible with all x86 CPUs going back to at least the
Pentium (and likely back to 80386). But for newer CPUs that support a given
X86_FEATURE_RDRAND, in this case—the kernel
rewrites those default instructions with the
alternative code. It does the rewrite on the fly while the kernel is booting.
alternative macros are a very cool feature, much more efficient than using an
if statement. A simple approach to this problem, in the absence of
alternative macros, might look like the following (in pseudo-code):
if (CPU does not have feature foo) then generic compatible slow code else fast new code endif
The problem with this code is that the
if statement has to be evaluated every time
this chunk of code is executed, and branch mispredictions are terrible for CPU
performance. Modifying the instructions in-place avoids the branches.
Kernel on disk with RDRAND support
objdump to look at the
get_random_int function, along with the alternate instructions section,
~]# objdump -d /usr/lib/debug/lib/modules/2.6.32-279.el6.x86_64/vmlinux ... ... ffffffff81313780 <get_random_int>: ffffffff81313780: 55 push %rbp ffffffff81313781: 48 89 e5 mov %rsp,%rbp ffffffff81313784: 48 83 ec 20 sub $0x20,%rsp ffffffff81313788: 48 89 5d e8 mov %rbx,-0x18(%rbp) ffffffff8131378c: 4c 89 65 f0 mov %r12,-0x10(%rbp) ffffffff81313790: 4c 89 6d f8 mov %r13,-0x8(%rbp) ffffffff81313794: e8 27 76 cf ff callq ffffffff8100adc0 <mcount> ffffffff81313799: ba 0a 00 00 00 mov $0xa,%edx ffffffff8131379e: ba 00 00 00 00 mov $0x0,%edx ffffffff813137a3: 66 66 66 90 data32 data32 xchg %ax,%ax ffffffff813137a7: 85 d2 test %edx,%edx ffffffff813137a9: 75 50 jne ffffffff813137fb <get_random_int+0x7b> ffffffff813137ab: 65 48 8b 1c 25 b0 e0 mov %gs:0xe0b0,%rbx ... ... ffffffff81d3db6c <.altinstr_replacement>: ... ... ffffffff81d407e7: 48 0f c7 f0 rdrand %rax ffffffff81d407eb: 72 04 jb ffffffff81d407f1 <__alt_instructions_end+0x2c85> ffffffff81d407ed: ff c9 dec %ecx ffffffff81d407ef: 75 f6 jne ffffffff81d407e7 <__alt_instructions_end+0x2c7b> ... ...
objdump output, the following two lines are the default
instructions from the
ffffffff8131379e: ba 00 00 00 00 mov $0x0,%edx ffffffff813137a3: 66 66 66 90 data32 data32 xchg %ax,%ax
.altinstr_replacement section contains the instructions to use in their place
if the hardware supports
ffffffff81d407e7: 48 0f c7 f0 rdrand %rax ffffffff81d407eb: 72 04 jb ffffffff81d407f1 <__alt_instructions_end+0x2c85> ffffffff81d407ed: ff c9 dec %ecx ffffffff81d407ef: 75 f6 jne ffffffff81d407e7 <__alt_instructions_end+0x2c7b>
Kernel in live RAM with RDRAND support
Now we'll see the assembly language for the
get_random_int function in the running kernel. To be able to run the following command, install the
gdb, and matching
~]# crash -s /usr/lib/debug/lib/modules/2.6.32-279.el6.x86_64/vmlinux crash> disassemble get_random_int Dump of assembler code for function get_random_int: 0xffffffff81313780 <+0>: push %rbp 0xffffffff81313781 <+1>: mov %rsp,%rbp 0xffffffff81313784 <+4>: sub $0x20,%rsp 0xffffffff81313788 <+8>: mov %rbx,-0x18(%rbp) 0xffffffff8131378c <+12>: mov %r12,-0x10(%rbp) 0xffffffff81313790 <+16>: mov %r13,-0x8(%rbp) 0xffffffff81313794 <+20>: nopl 0x0(%rax,%rax,1) 0xffffffff81313799 <+25>: mov $0xa,%edx 0xffffffff8131379e <+30>: rdrand %eax 0xffffffff813137a1 <+33>: jb 0xffffffff813137a7 <get_random_int+39> 0xffffffff813137a3 <+35>: dec %edx 0xffffffff813137a5 <+37>: jne 0xffffffff8131379e <get_random_int+30> ... End of assembler dump.
The kernel recognized at boot that it had
RDRAND support, so it inserted the four lines in the
jnz instructions were decoded by
crash into the
jne instructions, but that's a topic
for another day.
This article was a quick introduction to some of the tools I have used to quality test the Linux kernel in userspace.