Often, when troubleshooting Open vSwitch (OVS) in the field, you might be left wondering if the issue is really OVS-related, or if it's a problem with the kernel being overloaded. The kernel_delay.py tool can help you quickly identify if the focus of your investigation should be OVS or the Linux kernel.
About kernel_delay.py
kernel_delay.py consists of a Python script that uses the BCC framework to install eBPF probes. The data the eBPF probes collect will be analyzed and presented to the user by the Python script. Some of the presented data can also be captured by the individual scripts included in the BBC framework.
kernel_delay.py has two modes of operation:
- In time mode, the tool runs for a specific time and collects the information.
- In trigger mode, event collection can be started and/or stopped based on a specific eBPF probe. Currently, we support the following probes:
- USDT probes
- Kernel tracepoints
- kprobe
- kretprobe
- uprobe
- uretprobe
In addition, the option --sample-count
exists to specify how many iterations you would like to do. When using triggers, you can also ignore samples if they are less than a number of nanoseconds with the --trigger-delta
option. The latter might be useful when debugging Linux syscalls that take a long time to complete. (More on this later.) Finally, you can configure the delay between two sample runs with the --sample-interval
option.
Before getting into more details, let's just run the tool without any options to see what the output looks like. Notice that it will try to automatically get the process ID of the running ovs-vsdwitchd
. You can overwrite this with the --pid
option.
$ sudo ./kernel_delay.py
# Start sampling @2023-06-08T12:17:22.725127 (10:17:22 UTC)
# Stop sampling @2023-06-08T12:17:23.224781 (10:17:23 UTC)
# Sample dump @2023-06-08T12:17:23.224855 (10:17:23 UTC)
TID THREAD <RESOURCE SPECIFIC>
---------- ---------------- ----------------------------------------------------------------------------
27090 ovs-vswitchd [SYSCALL STATISTICS]
<EDIT: REMOVED DATA FOR ovs-vswitchd THREAD>
31741 revalidator122 [SYSCALL STATISTICS]
NAME NUMBER COUNT TOTAL ns MAX ns
poll 7 5 184,193,176 184,191,520
recvmsg 47 494 125,208,756 310,331
futex 202 8 18,768,758 4,023,039
sendto 44 10 375,861 266,867
sendmsg 46 4 43,294 11,213
write 1 1 5,949 5,949
getrusage 98 1 1,424 1,424
read 0 1 1,292 1,292
TOTAL( - poll): 519 144,405,334
[THREAD RUN STATISTICS]
SCHED_CNT TOTAL ns MIN ns MAX ns
6 136,764,071 1,480 115,146,424
[THREAD READY STATISTICS]
SCHED_CNT TOTAL ns MAX ns
7 11,334 6,636
[HARD IRQ STATISTICS]
NAME COUNT TOTAL ns MAX ns
eno8303-rx-1 1 3,586 3,586
TOTAL: 1 3,586
[SOFT IRQ STATISTICS]
NAME VECT_NR COUNT TOTAL ns MAX ns
net_rx 3 1 17,699 17,699
sched 7 6 13,820 3,226
rcu 9 16 13,586 1,554
timer 1 3 10,259 3,815
TOTAL: 26 55,364
By default, the tool will run for half a second in time mode. To extend this, you can use the --sample-time
option.
What will it report?
The above sample output separates the captured data on a per-thread basis. For this, it displays the thread's id (TID
) and name (THREAD
), followed by resource-specific data. Which are:
SYSCALL STATISTICS
THREAD RUN STATISTICS
THREAD READY STATISTICS
HARD IRQ STATISTICS
SOFT IRQ STATISTICS
The following sections will describe in detail what statistics they report.
SYSCALL STATISTICS
SYSCALL STATISTICS
tell you which Linux system calls got executed during the measurement interval. This includes the number of times the syscall was called (COUNT
), the total time spent in the system calls (TOTAL ns
), and the worst-case duration of a single call (MAX ns
).
It also shows the total of all system calls, but it excludes the poll system call, as the purpose of this call is to wait for activity on a set of sockets, and usually, the thread gets swapped out.
Note that it only counts calls that started and stopped during the measurement interval!
THREAD RUN STATISTICS
THREAD RUN STATISTICS
tell you how long the thread was running on a CPU during the measurement interval.
Note that these statistics only count events where the thread started and stopped running on a CPU during the measurement interval. For example, if this was a PMD thread, you should see zero SCHED_CNT
and TOTAL_ns
. If not, there might be a misconfiguration.
THREAD READY STATISTICS
THREAD READY STATISTICS
tell you the time between the thread being ready to run and it actually running on the CPU.
Note that these statistics only count events where the thread was getting ready to run and started running during the measurement interval.
HARD IRQ STATISTICS
HARD IRQ STATISTICS
tell you how much time was spent servicing hard interrupts during the threads run time.
It shows the interrupt name (NAME
), the number of interrupts (COUNT
), the total time spent in the interrupt handler (TOTAL ns
), and the worst-case duration (MAX ns
).
SOFT IRQ STATISTICS
SOFT IRQ STATISTICS
tell you how much time was spent servicing soft interrupts during the threads run time.
It shows the interrupt name (NAME
), vector number (VECT_NR
), the number of interrupts (COUNT
), the total time spent in the interrupt handler (TOTAL ns
), and the worst-case duration (MAX ns
).
The --syscall-events option
In addition to reporting global syscall statistics in SYSCALL_STATISTICS
, the tool can also report each individual syscall. This can be a useful second step if the SYSCALL_STATISTICS
show high latency numbers.
All you need to do is add the --syscall-events
option, with or without the additional DURATION_NS
parameter. The DUTATION_NS
parameter allows you to exclude events that take less than the supplied time.
The --skip-syscall-poll-events
option allows you to exclude poll syscalls from the report.
Below is an example run; note that I have removed the resource-specific data to highlight the syscall events:
$ sudo ./kernel_delay.py --syscall-events 50000 --skip-syscall-poll-events
# Start sampling @2023-06-13T17:10:46.460874 (15:10:46 UTC)
# Stop sampling @2023-06-13T17:10:46.960727 (15:10:46 UTC)
# Sample dump @2023-06-13T17:10:46.961033 (15:10:46 UTC)
TID THREAD <RESOURCE SPECIFIC>
---------- ---------------- ----------------------------------------------------------------------------
3359686 ipf_clean2 [SYSCALL STATISTICS]
...
3359635 ovs-vswitchd [SYSCALL STATISTICS]
...
3359697 revalidator12 [SYSCALL STATISTICS]
...
3359698 revalidator13 [SYSCALL STATISTICS]
...
3359699 revalidator14 [SYSCALL STATISTICS]
...
3359700 revalidator15 [SYSCALL STATISTICS]
...
# SYSCALL EVENTS:
ENTRY (ns) EXIT (ns) TID COMM DELTA (us) SYSCALL
------------------- ------------------- ---------- ---------------- ---------- ----------------
2161821694935486 2161821695031201 3359699 revalidator14 95 futex
syscall_exit_to_user_mode_prepare+0x161 [kernel]
syscall_exit_to_user_mode_prepare+0x161 [kernel]
syscall_exit_to_user_mode+0x9 [kernel]
do_syscall_64+0x68 [kernel]
entry_SYSCALL_64_after_hwframe+0x72 [kernel]
__GI___lll_lock_wait+0x30 [libc.so.6]
ovs_mutex_lock_at+0x18 [ovs-vswitchd]
[unknown] 0x696c003936313a63
2161821695276882 2161821695333687 3359698 revalidator13 56 futex
syscall_exit_to_user_mode_prepare+0x161 [kernel]
syscall_exit_to_user_mode_prepare+0x161 [kernel]
syscall_exit_to_user_mode+0x9 [kernel]
do_syscall_64+0x68 [kernel]
entry_SYSCALL_64_after_hwframe+0x72 [kernel]
__GI___lll_lock_wait+0x30 [libc.so.6]
ovs_mutex_lock_at+0x18 [ovs-vswitchd]
[unknown] 0x696c003134313a63
2161821695275820 2161821695405733 3359700 revalidator15 129 futex
syscall_exit_to_user_mode_prepare+0x161 [kernel]
syscall_exit_to_user_mode_prepare+0x161 [kernel]
syscall_exit_to_user_mode+0x9 [kernel]
do_syscall_64+0x68 [kernel]
entry_SYSCALL_64_after_hwframe+0x72 [kernel]
__GI___lll_lock_wait+0x30 [libc.so.6]
ovs_mutex_lock_at+0x18 [ovs-vswitchd]
[unknown] 0x696c003936313a63
2161821695964969 2161821696052021 3359635 ovs-vswitchd 87 accept
syscall_exit_to_user_mode_prepare+0x161 [kernel]
syscall_exit_to_user_mode_prepare+0x161 [kernel]
syscall_exit_to_user_mode+0x9 [kernel]
do_syscall_64+0x68 [kernel]
entry_SYSCALL_64_after_hwframe+0x72 [kernel]
__GI_accept+0x4d [libc.so.6]
pfd_accept+0x3a [ovs-vswitchd]
[unknown] 0x7fff19f2bd00
[unknown] 0xe4b8001f0f
As you can see above, the output also shows the stackback trace. You can disable this using the --stack-trace-size 0
option.
As you can see above, the backtrace does not show a lot of useful information due to the BCC toolkit not supporting dwarf decoding. To further analyze system call backtraces, you could use perf. The following perf script can do this for you (refer to the embedded instructions): https://github.com/chaudron/perf_scripts/blob/master/analyze_perf_pmd_syscall.py
Using triggers
The tool supports both start and stop triggers. This will allow you to capture statistics triggered by a specific event. First, let's look at what combinations of stop-and-start triggers we can use.
If you only use --start-trigger
, the inspection start when the trigger happens and runs until the --sample-time
number of seconds has passed. The example below shows all the supported options in this scenario.
$ sudo ./kernel_delay.py --start-trigger up:bridge_run --sample-time 4 \
--sample-count 4 --sample-interval 1
If you only use --stop-trigger
, the inspection starts immediately and stops when the trigger happens. The example below shows all the supported options in this scenario.
$ sudo ./kernel_delay.py --stop-trigger upr:bridge_run \
--sample-count 4 --sample-interval 1
If you use both --start-trigger
and --stop-trigger
triggers, the statistics are captured between the two first occurrences of these events. The example below shows all the supported options in this scenario.
$ sudo ./kernel_delay.py --start-trigger up:bridge_run \
--stop-trigger upr:bridge_run \
--sample-count 4 --sample-interval 1 \
--trigger-delta 50000
Now that we know how these triggers can be used, let's investigate what triggers are supported. What we call triggers, BCC calls events; these are eBPF tracepoints you can attach to. For more details on the supported tracepoints, check out the BCC documentation.
The list below shows the supported triggers and their argument format:
- USDT probes:
-
[u]:{provider}:{probe}
- Kernel tracepoint:
-
[t:trace]:{system}:{event}
- kprobe:
-
[k:kprobe]:{kernel_function}
- kretprobe:
-
[kr:kretprobe]:{kernel_function}
- uprobe:
-
[up:uprobe]:{function}
- uretprobe:
-
[upr:uretprobe]:{function}
Here are a couple of trigger examples (more use case-specific examples can be found in the next section):
--start|stop-trigger u:udpif_revalidator:start_dump
--start|stop-trigger t:openvswitch:ovs_dp_upcall
--start|stop-trigger k:ovs_dp_process_packet
--start|stop-trigger kr:ovs_dp_process_packet
--start|stop-trigger up:bridge_run
--start|stop-trigger upr:bridge_run
Examples
This section will give some examples of how to use this tool in real-world scenarios. Let's start with the issue where Open vSwitch reports Unreasonably long XXXXms poll interval
on your revalidator threads. Note that there is a blog available explaining how the revalidator process works in OVS.
First, let me explain this log message. It gets logged if the time delta between two poll_block()
calls is more than 1 second. In other words, the process was spending a lot of time processing stuff that was made available by the return of the poll_block()
.
Do a run with the tool using the existing USDT revalidator probes as a start and stop trigger (note that I removed the resource-specific data from the none revalidator threads):
$ sudo ./kernel_delay.py --start-trigger u:udpif_revalidator:start_dump --stop-trigger u:udpif_revalidator:sweep_done
# Start sampling (trigger@791777093512008) @2023-06-14T14:52:00.110303 (12:52:00 UTC)
# Stop sampling (trigger@791778281498462) @2023-06-14T14:52:01.297975 (12:52:01 UTC)
# Triggered sample dump, stop-start delta 1,187,986,454 ns @2023-06-14T14:52:01.298021 (12:52:01 UTC)
TID THREAD <RESOURCE SPECIFIC>
---------- ---------------- ----------------------------------------------------------------------------
1457761 handler24 [SYSCALL STATISTICS]
NAME NUMBER COUNT TOTAL ns MAX ns
sendmsg 46 6110 123,274,761 41,776
recvmsg 47 136299 99,397,508 49,896
futex 202 51 7,655,832 7,536,776
poll 7 4068 1,202,883 2,907
getrusage 98 2034 586,602 1,398
sendto 44 9 213,682 27,417
TOTAL( - poll): 144503 231,128,385
[THREAD RUN STATISTICS]
SCHED_CNT TOTAL ns MIN ns MAX ns
[THREAD READY STATISTICS]
SCHED_CNT TOTAL ns MAX ns
1 1,438 1,438
[SOFT IRQ STATISTICS]
NAME VECT_NR COUNT TOTAL ns MAX ns
sched 7 21 59,145 3,769
rcu 9 50 42,917 2,234
TOTAL: 71 102,062
1457733 ovs-vswitchd [SYSCALL STATISTICS]
...
1457792 revalidator55 [SYSCALL STATISTICS]
NAME NUMBER COUNT TOTAL ns MAX ns
futex 202 73 572,576,329 19,621,600
recvmsg 47 815 296,697,618 405,338
sendto 44 3 78,302 26,837
sendmsg 46 3 38,712 13,250
write 1 1 5,073 5,073
TOTAL( - poll): 895 869,396,034
[THREAD RUN STATISTICS]
SCHED_CNT TOTAL ns MIN ns MAX ns
48 394,350,393 1,729 140,455,796
[THREAD READY STATISTICS]
SCHED_CNT TOTAL ns MAX ns
49 23,650 1,559
[SOFT IRQ STATISTICS]
NAME VECT_NR COUNT TOTAL ns MAX ns
sched 7 14 26,889 3,041
rcu 9 28 23,024 1,600
TOTAL: 42 49,913
You can see from the start of the output that the trigger took more than a second (1,187,986,454 nanoseconds), which we would already know by looking at the output of the ovs-vsctl upcall/show
command.
From the revalidator55's SYSCALL STATISTICS
statistics, we can see it spent almost 870 milliseconds handling syscalls, and there were no poll()
calls being executed. The THREAD RUN STATISTICS
statistics here are a bit misleading, as it looks like we only spent 394 milliseconds on the CPU. But earlier, we learned that this time does not include the time being on the CPU at the start or stop of an event. What is exactly the case here because we are using USDT probes.
From the above data and maybe some top
output, we can determine that the revalidator55 thread is taking a lot of CPU time, probably because it has to do a lot of revalidator work by itself. The solution is to increase the number of revalidator threads, so more work could be done in parallel.
Let's do another run of the same command in another scenario:
$ sudo ./kernel_delay.py --start-trigger u:udpif_revalidator:start_dump --stop-trigger u:udpif_revalidator:sweep_done
# Start sampling (trigger@795160501758971) @2023-06-14T15:48:23.518512 (13:48:23 UTC)
# Stop sampling (trigger@795160764940201) @2023-06-14T15:48:23.781381 (13:48:23 UTC)
# Triggered sample dump, stop-start delta 263,181,230 ns @2023-06-14T15:48:23.781414 (13:48:23 UTC)
TID THREAD <RESOURCE SPECIFIC>
---------- ---------------- ----------------------------------------------------------------------------
1457733 ovs-vswitchd [SYSCALL STATISTICS]
...
1457792 revalidator55 [SYSCALL STATISTICS]
NAME NUMBER COUNT TOTAL ns MAX ns
recvmsg 47 284 193,422,110 46,248,418
sendto 44 2 46,685 23,665
sendmsg 46 2 24,916 12,703
write 1 1 6,534 6,534
TOTAL( - poll): 289 193,500,245
[THREAD RUN STATISTICS]
SCHED_CNT TOTAL ns MIN ns MAX ns
2 47,333,558 331,516 47,002,042
[THREAD READY STATISTICS]
SCHED_CNT TOTAL ns MAX ns
3 87,000,403 45,999,712
[SOFT IRQ STATISTICS]
NAME VECT_NR COUNT TOTAL ns MAX ns
sched 7 2 9,504 5,109
TOTAL: 2 9,504
Here you can see the revalidator run took about 263 milliseconds, which does not look odd; however, the THREAD READY STATISTICS
information shows us we were waiting 87 milliseconds for a CPU to be run on. This means the revalidator process could have finished 87 milliseconds faster. Looking at the MAX ns
value, we see a worst-case delay of almost 46 milliseconds, which hints at an overloaded system.
The following is one final example where we use a uprobe
to get some statistics on a bridge_run()
execution that takes more than 1 millisecond:
$ sudo ./kernel_delay.py --start-trigger up:bridge_run --stop-trigger ur:bridge_run --trigger-delta 1000000
# Start sampling (trigger@2245245432101270) @2023-06-14T16:21:10.467919 (14:21:10 UTC)
# Stop sampling (trigger@2245245432414656) @2023-06-14T16:21:10.468296 (14:21:10 UTC)
# Sample dump skipped, delta 313,386 ns @2023-06-14T16:21:10.468419 (14:21:10 UTC)
# Start sampling (trigger@2245245505301745) @2023-06-14T16:21:10.540970 (14:21:10 UTC)
# Stop sampling (trigger@2245245506911119) @2023-06-14T16:21:10.542499 (14:21:10 UTC)
# Triggered sample dump, stop-start delta 1,609,374 ns @2023-06-14T16:21:10.542565 (14:21:10 UTC)
TID THREAD <RESOURCE SPECIFIC>
---------- ---------------- ----------------------------------------------------------------------------
3371035 <unknown:3366258/3371035> [SYSCALL STATISTICS]
... <REMOVED 7 MORE unknown THREADS>
3371102 handler66 [SYSCALL STATISTICS]
... <REMOVED 7 MORE HANDLER THREADS>
3366258 ovs-vswitchd [SYSCALL STATISTICS]
NAME NUMBER COUNT TOTAL ns MAX ns
futex 202 43 403,469 199,312
clone3 435 13 174,394 30,731
munmap 11 8 115,774 21,861
poll 7 5 92,969 38,307
unlink 87 2 49,918 35,741
mprotect 10 8 47,618 13,201
accept 43 10 31,360 6,976
mmap 9 8 30,279 5,776
write 1 6 27,720 11,774
rt_sigprocmask 14 28 12,281 970
read 0 6 9,478 2,318
recvfrom 45 3 7,024 4,024
sendto 44 1 4,684 4,684
getrusage 98 5 4,594 1,342
close 3 2 2,918 1,627
recvmsg 47 1 2,722 2,722
TOTAL( - poll): 144 924,233
[THREAD RUN STATISTICS]
SCHED_CNT TOTAL ns MIN ns MAX ns
13 817,605 5,433 524,376
[THREAD READY STATISTICS]
SCHED_CNT TOTAL ns MAX ns
14 28,646 11,566
[SOFT IRQ STATISTICS]
NAME VECT_NR COUNT TOTAL ns MAX ns
rcu 9 1 2,838 2,838
TOTAL: 1 2,838
3371110 revalidator74 [SYSCALL STATISTICS]
... <REMOVED 7 MORE NEW revalidator THREADS>
3366311 urcu3 [SYSCALL STATISTICS]
...
We removed some of the threads and their resource-specific data, but based on the <unknown:3366258/3371035>
thread name, you can see that some threads no longer exist. In the ovs-vswitchd
thread, you can see some clone3
syscalls, indicating threads were created. In this example, it was due to the deletion of a bridge, which resulted in the recreation of the revalidator and handler threads.
Using kernel_delay.py with OpenShift
This section describes how you would use the tool on a node in an OpenShift cluster. It assumes you have console access to the node, either directly or through a debug container.
We will use a base Fedora Linux 38 container through Podman, as this will allow us to install some additional tools and packages we need.
The first thing we need to do is to start the container:
[core@sno-master ~]$ sudo podman run -it --rm \
-e PS1='[(DEBUG)\u@\h \W]\$ ' \
--privileged --network=host --pid=host \
-v /lib/modules:/lib/modules:ro \
-v /sys/kernel/debug:/sys/kernel/debug \
-v /proc:/proc \
-v /:/mnt/rootdir \
quay.io/fedora/fedora:38-x86_64
[(DEBUG)root@sno-master /]#
Next, add the linux_delay.py
dependencies:
[(DEBUG)root@sno-master /]# dnf install -y bcc-tools perl-interpreter \
python3-pytz python3-psutil
You need to install Devel, debug, and source RPMs for your OVS and kernel version:
[(DEBUG)root@sno-master home]# rpm -i \
openvswitch2.17-debuginfo-2.17.0-67.el8fdp.x86_64.rpm \
openvswitch2.17-debugsource-2.17.0-67.el8fdp.x86_64.rpm \
kernel-devel-4.18.0-372.41.1.el8_6.x86_64.rpm
Now we can run the tool. Here we use the above bridge_run()
example:
[(DEBUG)root@sno-master home]# ./kernel_delay.py --start-trigger up:bridge_run --stop-trigger ur:bridge_run
# Start sampling (trigger@75279117343513) @2023-06-15T11:44:07.628372 (11:44:07 UTC)
# Stop sampling (trigger@75279117443980) @2023-06-15T11:44:07.628529 (11:44:07 UTC)
# Triggered sample dump, stop-start delta 100,467 ns @2023-06-15T11:44:07.628569 (11:44:07 UTC)
TID THREAD <RESOURCE SPECIFIC>
---------- ---------------- ----------------------------------------------------------------------------
1246 ovs-vswitchd [SYSCALL STATISTICS]
NAME NUMBER COUNT TOTAL ns MAX ns
getdents64 217 2 8,560 8,162
openat 257 1 6,951 6,951
accept 43 4 6,942 3,763
recvfrom 45 1 3,726 3,726
recvmsg 47 2 2,880 2,188
stat 4 2 1,946 1,384
close 3 1 1,393 1,393
fstat 5 1 1,324 1,324
TOTAL( - poll): 14 33,722
[THREAD RUN STATISTICS]
SCHED_CNT TOTAL ns MIN ns MAX ns
[THREAD READY STATISTICS]
SCHED_CNT TOTAL ns MAX ns
Conclusion
By incorporating the kernel_delay.py utility into your development toolkit, you can swiftly pinpoint the problem's source and initiate focused debugging efforts.
Last updated: September 19, 2023