As a kernel quality engineer at Red Hat, I'm often asked what I do. How does one test a kernel for quality? Testing and debugging the Linux kernel can be challenging because you can't simply attach a debugger like GDB (the GNU Project debugger) if the kernel is crashing: When that happens, everything crashes, including GDB! Even if the kernel isn't crashing, if you ask a debugger to stop the kernel from running so you can inspect something, there is no way to resume running the kernel because the kernel itself is in charge of stopping and starting processes. Asking the kernel to stop itself is a dead-end road.
Note: There is a way to attach a debugger and stop and resume a kernel, but it requires a second computer and special hardware such as JTAG. Most of the time, I don't have access to that hardware, so that's out of the question.
Many people think that quality testing the Linux kernel must involve writing lots of kernel code; for example, a module to exercise a part of the kernel. Although that is a possibility, most testing actually happens from userspace. I spend a lot of time reading kernel code to try to devise a test that usually runs as a simple Bash script. There are many ways to interact with the kernel from userspace, including system calls, logs, the procfs
and sysfs
virtual file systems, and more. In this article, I will introduce you to some of the simple tools I've used for investigating behavior in the Linux kernel.
dmesg and journalctl
The most basic check on Linux kernel operations, particularly to follow what they're doing with devices and modules, is to retrieve the messages that the kernel generates during its boot and later. The messages can be retrieved from the kernel's logs through a dmesg or journalctl command.
For instance, a feature I tested recently involved enabling support for AMD Secure Memory Encryption ( (SME) on AMD's latest CPUs. To enable SME, you have to add mem_encrypt=on
as a kernel parameter and reboot. You can then check for the string "sme" in the dmesg
logs. See the /usr/share/doc/kernel-doc-4.18.0/Documentation/x86/amd-memory-encryption.txt
file from the kernel-doc
RPM in Red Hat Enterprise Linux (RHEL) 8 for more information.
Note: You may need to enable SME in your BIOS settings, too.
Enabling SME in dmesg logs
Let's take a look at the steps to enable SME in dmesg logs.
First, by default, because mem_encrypt
is unset, dmesg
shows no messages:
~]# dmesg | grep -i sme
~]#
Add the kernel parameter and reboot, and the message is logged:
~]# grubby --args="mem_encrypt=on" --update-kernel=ALL
~]# reboot
...
~]# dmesg | grep -i sme
[ 0.001000] AMD Memory Encryption Features active: SME
To double-check your work, you can remove the parameter and reboot once more to verify that the message does not show up in the logs:
~]# grubby --remove-args="mem_encrypt=on" --update-kernel=ALL
~]# reboot
...
~]# dmesg | grep -i sme
~]#
Further testing could be required to show that SME is working as expected, but the dmesg
logs give the first evidence that the kernel is properly detecting the hardware feature and enabling support when requested.
Logging with journalctl
I mentioned earlier that journalctl
can get the same output as dmesg
. This is because the journal recorded by systemd
includes the normal dmesg
logs along with userspace logs from /var/log/messages
and more, and each entry is annotated with metadata. To extract just the dmesg
equivalent from the journal, enter:
~]# journalctl --dmesg --output short-monotonic --no-hostname
-- Logs begin at Wed 2021-05-26 11:08:19 EDT, end at Wed 2021-05-26 17:01:01 EDT. --
[ 0.000000] kernel: Linux version 4.18.0-305.el8.x86_64 (mockbuild@x86-vm-07.build.eng.bos.redhat.com) (gcc version 8.4.1 20200928 (Red Hat 8.4.1-1) (GCC)) #1 SMP Thu Apr 29 08:54:30 EDT 2021
[ 0.000000] kernel: Command line: BOOT_IMAGE=(hd0,gpt2)/vmlinuz-4.18.0-305.el8.x86_64 root=/dev/mapper/rhel_dell--per7425--02-root ro crashkernel=auto resume=/dev/mapper/rhel_dell--per7425--02-swap rd.lvm.lv=rhel_dell-per7425-02/root rd.lvm.lv=rhel_dell-per7425-02/swap console=ttyS0,115200n81
[ 0.000000] kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[ 0.000000] kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[ 0.000000] kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[ 0.000000] kernel: x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
[ 0.000000] kernel: x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'compacted' format.
[ 0.000000] kernel: BIOS-provided physical RAM map:
[ 0.000000] kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000008efff] usable
[ 0.000000] kernel: BIOS-e820: [mem 0x000000000008f000-0x000000000008ffff] ACPI NVS
...
...
Compare the previous journalctl
output to dmesg
output:
~]# dmesg
[ 0.000000] Linux version 4.18.0-305.el8.x86_64 (mockbuild@x86-vm-07.build.eng.bos.redhat.com) (gcc version 8.4.1 20200928 (Red Hat 8.4.1-1) (GCC)) #1 SMP Thu Apr 29 08:54:30 EDT 2021
[ 0.000000] Command line: BOOT_IMAGE=(hd0,gpt2)/vmlinuz-4.18.0-305.el8.x86_64 root=/dev/mapper/rhel_dell--per7425--02-root ro crashkernel=auto resume=/dev/mapper/rhel_dell--per7425--02-swap rd.lvm.lv=rhel_dell-per7425-02/root rd.lvm.lv=rhel_dell-per7425-02/swap console=ttyS0,115200n81
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[ 0.000000] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
[ 0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'compacted' format.
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000008efff] usable
[ 0.000000] BIOS-e820: [mem 0x000000000008f000-0x000000000008ffff] ACPI NVS
...
...
The two commmands show nearly identical output. The differences are that journalctl
output starts with a -- Logs begin
line and shows the source of each message (the kernel:
prefix before each message).
sysfs and procfs
The /sys
directory (aka sysfs
) and /proc
directory (aka procfs
) are also important sources of kernel information, and can be used to control the kernel.
As an example, an important feature of some servers is the ability to hot plug CPUs. Before you swap out a CPU, though, you have to tell the kernel to disable that CPU, or the kernel might try to schedule work on a CPU that's in the middle of being swapped.
Let's look at the preparation to hot swap a CPU.
Disable a CPU for hot swapping
We start with an lscpu
command, which lists information about the CPUs, to find what CPUs we have with NUMA support. Then, we disable CPU 11 and check the CPU status again:
~]# lscpu | grep 'CPU(s)' | grep -v NUMA
CPU(s): 96
On-line CPU(s) list: 0-95
~]# echo 0 > /sys/devices/system/cpu/cpu11/online
~]# lscpu | grep 'CPU(s)' | grep -v NUMA
CPU(s): 96
On-line CPU(s) list: 0-10,12-95
Off-line CPU(s) list: 11
A good way to double-check that the kernel is not trying to use CPU 11 is to verify that no interrupt handlers are assigned to it. To do this check, view the /proc/interrupts
file (the output is truncated here for better visibility):
~]# less -SFiX /proc/interrupts
CPU0 .... CPU9 CPU10 CPU12 CPU13 C>
0: 169 .... 0 0 0 0 >
4: 0 .... 0 0 0 0 >
8: 0 .... 0 0 0 0 >
9: 0 .... 0 0 0 0 >
10: 0 .... 0 0 0 0 >
...
...
Note that the entire column for CPU11
is missing.
After hot-swapping CPU 11 for a new one, turn it back on:
~]# echo 1 > /sys/devices/system/cpu/cpu11/online
~]# lscpu | grep 'CPU(s)' | grep -v NUMA
CPU(s): 96
On-line CPU(s) list: 0-95
~]# less -SFiX /proc/interrupts
CPU0 .... CPU9 CPU10 CPU11 CPU12 C>
0: 169 .... 0 0 0 0 >
4: 0 .... 0 0 0 0 >
8: 0 .... 0 0 0 0 >
9: 0 .... 0 0 0 0 >
10: 0 .... 0 0 0 0 >
...
...
CPU 11 is online and ready to process interrupts again.
Disable CPU threads and cores
Of course, modern CPUs have more than one core on each CPU, and each core can have two or more threads, so you'll need to disable all of the threads and cores on a CPU (or "socket") in order to hot swap it. The CPU topology on this particular system is:
~]# lscpu | grep -e 'Thread(s)' -e 'Core(s)' -e 'Socket(s)'
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 2
So, to hot swap CPU 11 (that is, one thread of one core), we'll need to disable all of its siblings. This information is also available in sysfs
:
~]# cat /sys/devices/system/cpu/cpu11/topology/core_siblings_list
1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87,89,91,93,95
Disable all 48 of these CPUs, and then the system is ready for the physical CPU hot swap.
crash
Several years ago, I added support for the RDRAND
assembly instruction to the kernel for Intel's Ivy Bridge line of CPUs. As I mentioned at the start of this article, it's difficult to attach a debugger to the kernel, so I can't just single-step through the assembly code and look for executions of RDRAND
.
I looked at the upstream commit that enabled RDRAND
support and noticed a macro named alternative_io
that I had not seen before:
alternative_io("movl $0, %0\n\t" \
nop, \
"\n1: " rdrand "\n\t" \
"jc 2f\n\t" \
"decl %0\n\t" \
"jnz 1b\n\t" \
"2:", \
X86_FEATURE_RDRAND, \
ASM_OUTPUT2("=r" (ok), "=a" (*v)), \
"0" (RDRAND_RETRY_LOOPS)); \
After a bit of research, I learned that the alternative
set of macros (including variants like the alternative_io
macro shown here) is a mechanism for the kernel to do live patching on itself at boot under certain conditions. As background, a default set of fairly generic assembly instructions is compatible with all x86 CPUs going back to at least the Pentium (and likely back to 80386). But for newer CPUs that support a given hardware feature—X86_FEATURE_RDRAND
, in this case—the kernel rewrites those default instructions with the alternative
code. It does the rewrite on the fly while the kernel is booting.
The alternative
macros are a very cool feature, much more efficient than using an if
statement. A simple approach to this problem, in the absence of alternative
macros, might look like the following (in pseudo-code):
if (CPU does not have feature foo) then
generic compatible slow code
else
fast new code
endif
The problem with this code is that the if
statement has to be evaluated every time this chunk of code is executed, and branch mispredictions are terrible for CPU performance. Modifying the instructions in-place avoids the branches.
Kernel on disk with RDRAND support
Let's use objdump
to look at the get_random_int
function, along with the alternate instructions section, altinstr_replacement
:
java {.bash .numberLines}
~]# objdump -d /usr/lib/debug/lib/modules/2.6.32-279.el6.x86_64/vmlinux
...
...
ffffffff81313780 <get_random_int>:
ffffffff81313780: 55 push %rbp
ffffffff81313781: 48 89 e5 mov %rsp,%rbp
ffffffff81313784: 48 83 ec 20 sub $0x20,%rsp
ffffffff81313788: 48 89 5d e8 mov %rbx,-0x18(%rbp)
ffffffff8131378c: 4c 89 65 f0 mov %r12,-0x10(%rbp)
ffffffff81313790: 4c 89 6d f8 mov %r13,-0x8(%rbp)
ffffffff81313794: e8 27 76 cf ff callq ffffffff8100adc0 <mcount>
ffffffff81313799: ba 0a 00 00 00 mov $0xa,%edx
ffffffff8131379e: ba 00 00 00 00 mov $0x0,%edx
ffffffff813137a3: 66 66 66 90 data32 data32 xchg %ax,%ax
ffffffff813137a7: 85 d2 test %edx,%edx
ffffffff813137a9: 75 50 jne ffffffff813137fb <get_random_int+0x7b>
ffffffff813137ab: 65 48 8b 1c 25 b0 e0 mov %gs:0xe0b0,%rbx
...
...
ffffffff81d3db6c <.altinstr_replacement>:
...
...
ffffffff81d407e7: 48 0f c7 f0 rdrand %rax
ffffffff81d407eb: 72 04 jb ffffffff81d407f1 <__alt_instructions_end+0x2c85>
ffffffff81d407ed: ff c9 dec %ecx
ffffffff81d407ef: 75 f6 jne ffffffff81d407e7 <__alt_instructions_end+0x2c7b>
...
...
In this objdump
output, the following two lines are the default instructions from the alternative_io
macro:
ffffffff8131379e: ba 00 00 00 00 mov $0x0,%edx
ffffffff813137a3: 66 66 66 90 data32 data32 xchg %ax,%ax
And the .altinstr_replacement
section contains the instructions to use in their place if the hardware supports RDRAND
:
ffffffff81d407e7: 48 0f c7 f0 rdrand %rax
ffffffff81d407eb: 72 04 jb ffffffff81d407f1 <__alt_instructions_end+0x2c85>
ffffffff81d407ed: ff c9 dec %ecx
ffffffff81d407ef: 75 f6 jne ffffffff81d407e7 <__alt_instructions_end+0x2c7b>
Kernel in live RAM with RDRAND support
Now we'll see the assembly language for the get_random_int
function in the running kernel. To be able to run the following command, install the crash
, gdb
, and matching kernel-debuginfo
RPMs:
java {.bash .numberLines}
~]# crash -s /usr/lib/debug/lib/modules/2.6.32-279.el6.x86_64/vmlinux
crash> disassemble get_random_int
Dump of assembler code for function get_random_int:
0xffffffff81313780 <+0>: push %rbp
0xffffffff81313781 <+1>: mov %rsp,%rbp
0xffffffff81313784 <+4>: sub $0x20,%rsp
0xffffffff81313788 <+8>: mov %rbx,-0x18(%rbp)
0xffffffff8131378c <+12>: mov %r12,-0x10(%rbp)
0xffffffff81313790 <+16>: mov %r13,-0x8(%rbp)
0xffffffff81313794 <+20>: nopl 0x0(%rax,%rax,1)
0xffffffff81313799 <+25>: mov $0xa,%edx
0xffffffff8131379e <+30>: rdrand %eax
0xffffffff813137a1 <+33>: jb 0xffffffff813137a7 <get_random_int+39>
0xffffffff813137a3 <+35>: dec %edx
0xffffffff813137a5 <+37>: jne 0xffffffff8131379e <get_random_int+30>
...
End of assembler dump.
The kernel recognized at boot that it had RDRAND
support, so it inserted the four lines in the .altinstr_replacement
section.
Note: The jc
, decl
, and jnz
instructions were decoded by crash
into the functionally equivalent jb
, dec
, and jne
instructions, but that's a topic for another day.
Conclusion
This article was a quick introduction to some of the tools I have used to quality test the Linux kernel in userspace.
Last updated: August 14, 2023