Featured image for a Linux topic.

As a kernel quality engineer at Red Hat, I'm often asked what I do. How does one test a kernel for quality? Testing and debugging the Linux kernel can be challenging because you can't simply attach a debugger like GDB (the GNU Project debugger) if the kernel is crashing: When that happens, everything crashes, including GDB! Even if the kernel isn't crashing, if you ask a debugger to stop the kernel from running so you can inspect something, there is no way to resume running the kernel because the kernel itself is in charge of stopping and starting processes. Asking the kernel to stop itself is a dead-end road.

Note: There is a way to attach a debugger and stop and resume a kernel, but it requires a second computer and special hardware such as JTAG. Most of the time, I don't have access to that hardware, so that's out of the question.

Many people think that quality testing the Linux kernel must involve writing lots of kernel code; for example, a module to exercise a part of the kernel. Although that is a possibility, most testing actually happens from userspace. I spend a lot of time reading kernel code to try to devise a test that usually runs as a simple Bash script. There are many ways to interact with the kernel from userspace, including system calls, logs, the procfs and sysfs virtual file systems, and more. In this article, I will introduce you to some of the simple tools I've used for investigating behavior in the Linux kernel.

dmesg and journalctl

The most basic check on Linux kernel operations, particularly to follow what they're doing with devices and modules, is to retrieve the messages that the kernel generates during its boot and later. The messages can be retrieved from the kernel's logs through a dmesg or journalctl command.

For instance, a feature I tested recently involved enabling support for AMD Secure Memory Encryption ( (SME) on AMD's latest CPUs. To enable SME, you have to add mem_encrypt=on as a kernel parameter and reboot. You can then check for the string "sme" in the dmesg logs. See the /usr/share/doc/kernel-doc-4.18.0/Documentation/x86/amd-memory-encryption.txt file from the kernel-doc RPM in Red Hat Enterprise Linux (RHEL) 8 for more information.

Note: You may need to enable SME in your BIOS settings, too.

Enabling SME in dmesg logs

Let's take a look at the steps to enable SME in dmesg logs.

First, by default, because mem_encrypt is unset, dmesg shows no messages:

~]# dmesg | grep -i sme
~]#

Add the kernel parameter and reboot, and the message is logged:

~]# grubby --args="mem_encrypt=on" --update-kernel=ALL
~]# reboot
...
~]# dmesg | grep -i sme
[    0.001000] AMD Memory Encryption Features active: SME

To double-check your work, you can remove the parameter and reboot once more to verify that the message does not show up in the logs:

~]# grubby --remove-args="mem_encrypt=on" --update-kernel=ALL
~]# reboot
...
~]# dmesg | grep -i sme
~]#

Further testing could be required to show that SME is working as expected, but the dmesg logs give the first evidence that the kernel is properly detecting the hardware feature and enabling support when requested.

Logging with journalctl

I mentioned earlier that journalctl can get the same output as dmesg. This is because the journal recorded by systemd includes the normal dmesg logs along with userspace logs from /var/log/messages and more, and each entry is annotated with metadata. To extract just the dmesg equivalent from the journal, enter:

~]# journalctl --dmesg --output short-monotonic --no-hostname
-- Logs begin at Wed 2021-05-26 11:08:19 EDT, end at Wed 2021-05-26 17:01:01 EDT. --
[    0.000000] kernel: Linux version 4.18.0-305.el8.x86_64 (mockbuild@x86-vm-07.build.eng.bos.redhat.com) (gcc version 8.4.1 20200928 (Red Hat 8.4.1-1) (GCC)) #1 SMP Thu Apr 29 08:54:30 EDT 2021
[    0.000000] kernel: Command line: BOOT_IMAGE=(hd0,gpt2)/vmlinuz-4.18.0-305.el8.x86_64 root=/dev/mapper/rhel_dell--per7425--02-root ro crashkernel=auto resume=/dev/mapper/rhel_dell--per7425--02-swap rd.lvm.lv=rhel_dell-per7425-02/root rd.lvm.lv=rhel_dell-per7425-02/swap console=ttyS0,115200n81
[    0.000000] kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[    0.000000] kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[    0.000000] kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[    0.000000] kernel: x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[    0.000000] kernel: x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'compacted' format.
[    0.000000] kernel: BIOS-provided physical RAM map:
[    0.000000] kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000008efff] usable
[    0.000000] kernel: BIOS-e820: [mem 0x000000000008f000-0x000000000008ffff] ACPI NVS
...
...

Compare the previous journalctl output to dmesg output:

~]# dmesg
[    0.000000] Linux version 4.18.0-305.el8.x86_64 (mockbuild@x86-vm-07.build.eng.bos.redhat.com) (gcc version 8.4.1 20200928 (Red Hat 8.4.1-1) (GCC)) #1 SMP Thu Apr 29 08:54:30 EDT 2021
[    0.000000] Command line: BOOT_IMAGE=(hd0,gpt2)/vmlinuz-4.18.0-305.el8.x86_64 root=/dev/mapper/rhel_dell--per7425--02-root ro crashkernel=auto resume=/dev/mapper/rhel_dell--per7425--02-swap rd.lvm.lv=rhel_dell-per7425-02/root rd.lvm.lv=rhel_dell-per7425-02/swap console=ttyS0,115200n81
[    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[    0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'compacted' format.
[    0.000000] BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000008efff] usable
[    0.000000] BIOS-e820: [mem 0x000000000008f000-0x000000000008ffff] ACPI NVS
...
...

The two commmands show nearly identical output. The differences are that journalctl output starts with a -- Logs begin line and shows the source of each message (the kernel: prefix before each message).

sysfs and procfs

The /sys directory (aka sysfs) and /proc directory (aka procfs) are also important sources of kernel information, and can be used to control the kernel.

As an example, an important feature of some servers is the ability to hot plug CPUs. Before you swap out a CPU, though, you have to tell the kernel to disable that CPU, or the kernel might try to schedule work on a CPU that's in the middle of being swapped.

Let's look at the preparation to hot swap a CPU.

Disable a CPU for hot swapping

We start with an lscpu command, which lists information about the CPUs, to find what CPUs we have with NUMA support. Then, we disable CPU 11 and check the CPU status again:

~]# lscpu | grep 'CPU(s)' | grep -v NUMA
CPU(s):              96
On-line CPU(s) list: 0-95
~]# echo 0 > /sys/devices/system/cpu/cpu11/online
~]# lscpu | grep 'CPU(s)' | grep -v NUMA
CPU(s):               96
On-line CPU(s) list:  0-10,12-95
Off-line CPU(s) list: 11

A good way to double-check that the kernel is not trying to use CPU 11 is to verify that no interrupt handlers are assigned to it. To do this check, view the /proc/interrupts file (the output is truncated here for better visibility):

~]# less -SFiX /proc/interrupts
            CPU0       ....       CPU9       CPU10      CPU12      CPU13      C>
   0:        169       ....          0          0          0          0        >
   4:          0       ....          0          0          0          0        >
   8:          0       ....          0          0          0          0        >
   9:          0       ....          0          0          0          0        >
  10:          0       ....          0          0          0          0        >
...
...

Note that the entire column for CPU11 is missing.

After hot-swapping CPU 11 for a new one, turn it back on:

~]# echo 1 > /sys/devices/system/cpu/cpu11/online
~]# lscpu | grep 'CPU(s)' | grep -v NUMA
CPU(s):              96
On-line CPU(s) list: 0-95
~]# less -SFiX /proc/interrupts
            CPU0       ....       CPU9       CPU10      CPU11      CPU12      C>
   0:        169       ....          0          0          0          0        >
   4:          0       ....          0          0          0          0        >
   8:          0       ....          0          0          0          0        >
   9:          0       ....          0          0          0          0        >
  10:          0       ....          0          0          0          0        >
...
...

CPU 11 is online and ready to process interrupts again.

Disable CPU threads and cores

Of course, modern CPUs have more than one core on each CPU, and each core can have two or more threads, so you'll need to disable all of the threads and cores on a CPU (or "socket") in order to hot swap it. The CPU topology on this particular system is:

~]# lscpu | grep -e 'Thread(s)' -e 'Core(s)' -e 'Socket(s)'
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):           2

So, to hot swap CPU 11 (that is, one thread of one core), we'll need to disable all of its siblings. This information is also available in sysfs:

~]# cat /sys/devices/system/cpu/cpu11/topology/core_siblings_list
1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87,89,91,93,95

Disable all 48 of these CPUs, and then the system is ready for the physical CPU hot swap.

crash

Several years ago, I added support for the RDRAND assembly instruction to the kernel for Intel's Ivy Bridge line of CPUs. As I mentioned at the start of this article, it's difficult to attach a debugger to the kernel, so I can't just single-step through the assembly code and look for executions of RDRAND.

I looked at the upstream commit that enabled RDRAND support and noticed a macro named alternative_io that I had not seen before:

        alternative_io("movl $0, %0\n\t"                        \
                       nop,                                     \
                       "\n1: " rdrand "\n\t"                    \
                       "jc 2f\n\t"                              \
                       "decl %0\n\t"                            \
                       "jnz 1b\n\t"                             \
                       "2:",                                    \
                       X86_FEATURE_RDRAND,                      \
                       ASM_OUTPUT2("=r" (ok), "=a" (*v)),       \
                       "0" (RDRAND_RETRY_LOOPS));               \

After a bit of research, I learned that the alternative set of macros (including variants like the alternative_io macro shown here) is a mechanism for the kernel to do live patching on itself at boot under certain conditions. As background, a default set of fairly generic assembly instructions is compatible with all x86 CPUs going back to at least the Pentium (and likely back to 80386). But for newer CPUs that support a given hardware feature—X86_FEATURE_RDRAND, in this case—the kernel rewrites those default instructions with the alternative code. It does the rewrite on the fly while the kernel is booting.

The alternative macros are a very cool feature, much more efficient than using an if statement. A simple approach to this problem, in the absence of alternative macros, might look like the following (in pseudo-code):

    if (CPU does not have feature foo) then
       generic compatible slow code
    else
       fast new code
    endif

The problem with this code is that the if statement has to be evaluated every time this chunk of code is executed, and branch mispredictions are terrible for CPU performance. Modifying the instructions in-place avoids the branches.

Kernel on disk with RDRAND support

Let's use objdump to look at the get_random_int function, along with the alternate instructions section, altinstr_replacement:

~]# objdump -d /usr/lib/debug/lib/modules/2.6.32-279.el6.x86_64/vmlinux
...
...
ffffffff81313780 <get_random_int>:
ffffffff81313780:       55                      push   %rbp
ffffffff81313781:       48 89 e5                mov    %rsp,%rbp
ffffffff81313784:       48 83 ec 20             sub    $0x20,%rsp
ffffffff81313788:       48 89 5d e8             mov    %rbx,-0x18(%rbp)
ffffffff8131378c:       4c 89 65 f0             mov    %r12,-0x10(%rbp)
ffffffff81313790:       4c 89 6d f8             mov    %r13,-0x8(%rbp)
ffffffff81313794:       e8 27 76 cf ff          callq  ffffffff8100adc0 <mcount>
ffffffff81313799:       ba 0a 00 00 00          mov    $0xa,%edx
ffffffff8131379e:       ba 00 00 00 00          mov    $0x0,%edx
ffffffff813137a3:       66 66 66 90             data32 data32 xchg %ax,%ax
ffffffff813137a7:       85 d2                   test   %edx,%edx
ffffffff813137a9:       75 50                   jne    ffffffff813137fb <get_random_int+0x7b>
ffffffff813137ab:       65 48 8b 1c 25 b0 e0    mov    %gs:0xe0b0,%rbx
...
...
ffffffff81d3db6c <.altinstr_replacement>:
...
...
ffffffff81d407e7:       48 0f c7 f0             rdrand %rax
ffffffff81d407eb:       72 04                   jb     ffffffff81d407f1 <__alt_instructions_end+0x2c85>
ffffffff81d407ed:       ff c9                   dec    %ecx
ffffffff81d407ef:       75 f6                   jne    ffffffff81d407e7 <__alt_instructions_end+0x2c7b>
...
...

In this objdump output, the following two lines are the default instructions from the alternative_io macro:

ffffffff8131379e:       ba 00 00 00 00          mov    $0x0,%edx
ffffffff813137a3:       66 66 66 90             data32 data32 xchg %ax,%ax

And the .altinstr_replacement section contains the instructions to use in their place if the hardware supports RDRAND:

ffffffff81d407e7:       48 0f c7 f0             rdrand %rax
ffffffff81d407eb:       72 04                   jb     ffffffff81d407f1 <__alt_instructions_end+0x2c85>
ffffffff81d407ed:       ff c9                   dec    %ecx
ffffffff81d407ef:       75 f6                   jne    ffffffff81d407e7 <__alt_instructions_end+0x2c7b>

Kernel in live RAM with RDRAND support

Now we'll see the assembly language for the get_random_int function in the running kernel. To be able to run the following command, install the crash, gdb, and matching kernel-debuginfo RPMs:

~]# crash -s /usr/lib/debug/lib/modules/2.6.32-279.el6.x86_64/vmlinux
crash> disassemble get_random_int

Dump of assembler code for function get_random_int:
   0xffffffff81313780 <+0>: push   %rbp
   0xffffffff81313781 <+1>: mov    %rsp,%rbp
   0xffffffff81313784 <+4>: sub    $0x20,%rsp
   0xffffffff81313788 <+8>: mov    %rbx,-0x18(%rbp)
   0xffffffff8131378c <+12>:    mov    %r12,-0x10(%rbp)
   0xffffffff81313790 <+16>:    mov    %r13,-0x8(%rbp)
   0xffffffff81313794 <+20>:    nopl   0x0(%rax,%rax,1)
   0xffffffff81313799 <+25>:    mov    $0xa,%edx
   0xffffffff8131379e <+30>:    rdrand %eax
   0xffffffff813137a1 <+33>:    jb     0xffffffff813137a7 <get_random_int+39>
   0xffffffff813137a3 <+35>:    dec    %edx
   0xffffffff813137a5 <+37>:    jne    0xffffffff8131379e <get_random_int+30>
   ...
End of assembler dump.

The kernel recognized at boot that it had RDRAND support, so it inserted the four lines in the .altinstr_replacement section.

Note: The jc, decl, and jnz instructions were decoded by crash into the functionally equivalent jb, dec, and jne instructions, but that's a topic for another day.

Conclusion

This article was a quick introduction to some of the tools I have used to quality test the Linux kernel in userspace.

Comments