We often receive reports of users escalating the "long poll interval" and "blocked waiting" log messages as issues with Open vSwitch (OVS). In most cases, however, the problems are tied to system resource issues; because OVS raises alerts about them, it ends up being wrongly identified as the culprit.
Before delving into the debugging process, let's first clarify the meaning of these messages.
Within its internal operations, OVS employs a mechanism known as a poll loop to handle various events. Its function is similar to more advanced solutions like libevent or glib. In essence, it comprises a set of APIs enabling the registration of events, such as a timeout or the availability of data on a socket. After registering these events, your process can enter a sleep state, awaiting the occurrence of one of them. When an event fires, the process wakes up and handles it; once event processing is complete, it returns to a sleep state, anticipating the next event.
The "Unreasonably long poll interval" message is logged when the duration between waking up for an event and re-entering the sleep state exceeds one second. In other words, when processing an event takes more than one second.
The "blocked waiting to quiesce" message is logged when one thread is waiting for another to complete its work. This is related to ovs-vswitchd’s RCU implementation, more information can be found in the source. It is generally an indication that one or more threads are completing their work slowly and often accompanies "Unreasonably long poll interval" messages.
Reported problem
In this case, we have a system showing log messages like this:
|timeval|WARN|Unreasonably long 10540ms poll interval (42ms user, 430ms system)
|timeval|WARN|Unreasonably long 6116ms poll interval (24ms user, 80ms system)
|timeval|WARN|Unreasonably long 8699ms poll interval (15ms user, 104ms system)
|timeval|WARN|Unreasonably long 14754ms poll interval (51ms user, 209ms system)
These log messages include three timings: the first indicates elapsed wall-clock time, the second is the amount of time ovs-vswitchd spent in userspace, and the third is how much time the kernel spent servicing ovs-vswitchd’s requests.
The initial observation here is that the cumulative user + system time does not align with the long poll interval time. This discrepancy suggests that the kernel might be busy doing other activities, preventing the ovs-vswitchd main thread from receiving any CPU time. However, this is not always the case, so we are going to do some additional troubleshooting.
It's also important to note that in these instances no thread name follows timeval, which is the log module name. This absence indicates that the log message is reported by the main ovs-vswitchd thread, also referred to as the bridge run thread.
Below are examples illustrating other ovs-vswitchd threads, namely handler340 and revalidator391, reporting similar issues:
|timeval(handler340)|WARN|Unreasonably long 1259ms poll interval (0ms user, 692ms system)
|timeval(revalidator391)|WARN|Unreasonably long 1371ms poll interval (0ms user, 853ms system)
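When triaging, it can help to see how often each thread logs this warning. Below is a minimal sketch, assuming the messages have been saved to a local ovs-vswitchd.log copy; it counts occurrences per reporting thread (the main thread shows up as plain timeval):

$ grep 'Unreasonably long' ovs-vswitchd.log | grep -o 'timeval[^|]*' | sort | uniq -c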
Getting the initial data
As you can never be sure when this problem will happen again, we started requesting the data below. In our case, we were "lucky," as this problem was happening quite frequently.
The kernel_delay script's output is particularly interesting, as it provides insights into the actions of ovs-vswitchd. Refer to the article Troubleshooting Open vSwitch: Is the kernel to blame? for additional information about this script and how to run it in a Red Hat OpenShift environment.
Here we are taking 120 samples of 30 seconds each, assuming this will be enough to capture the problem. If this does not fit your environment, you can adjust the parameters, or restart the script if the issue has not yet occurred.
./kernel_delay.py --sample-time 30 --sample-count 120 | tee delay_results.txt
During occurrences of such issues, we may also encounter messages like the following:
|ovs_rcu(urcu5)|WARN|blocked 1000 ms waiting for handler496 to quiesce
|ovs_rcu(urcu5)|WARN|blocked 2001 ms waiting for handler496 to quiesce
|ovs_rcu(urcu5)|WARN|blocked 4001 ms waiting for handler496 to quiesce
|ovs_rcu(urcu5)|WARN|blocked 8000 ms waiting for handler496 to quiesce
In such instances, it becomes valuable to examine the thread’s activity, as the thread may be blocked in a system call. Therefore, we should proactively request the following, which dumps the kernel call stack of each ovs-vswitchd thread every 30 seconds:
for ((;;)); do date; \
find /proc/$(pidof -s ovs-vswitchd)/task -mindepth 1 -maxdepth 1 \
-type d -exec \
bash -c "echo '{}' && cat '{}'/comm && cat '{}'/stack" \; ; \
sleep 30; \
done | tee syscall_log
Lastly, we would like to capture the CPU statistics by executing the following. This output could provide us with a general understanding of the CPU load during the issue.
mpstat -A 1 | tee cpu_log
Now, we need to actively monitor for the occurrence of the issue, indicated by the relevant log messages appearing in ovs-vswitchd.log. Once identified, stop all the aforementioned captures, and save both the captured data and the actual ovs-vswitchd.log for further analysis.
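To get notified as soon as one of these messages appears, a simple watch such as the one below can run alongside the captures. This is only a sketch; the log path shown is the common default and may differ on your system:

tail -F /var/log/openvswitch/ovs-vswitchd.log | \
grep --line-buffered -E 'Unreasonably long|blocked .* waiting'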
In general, if your system supports collecting SoS reports, it would be advisable to do so after gathering all the other relevant data. For additional information, please refer to the SoS report project's wiki.
It's important to note that not all of the collected data will be analyzed. However, it could provide valuable insights if the issue points to a different cause.
Analyzing the initial results
The first thing we look at is the ovs-vswitchd log showing the problem:
06:47:10.571Z||timeval|WARN|Unreasonably long 1435ms poll interval (7ms user, 28ms system)
06:59:37.304Z||timeval|WARN|Unreasonably long 4529ms poll interval (13ms user, 125ms system)
07:00:44.796Z||timeval|WARN|Unreasonably long 11115ms poll interval (18ms user, 215ms system)
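Each kernel_delay sample in delay_results.txt is bracketed by "# Start sampling" and "# Stop sampling" header lines, so listing those boundaries makes it easy to find the window surrounding each event. A small sketch, assuming the delay_results.txt file produced by the earlier tee:

$ grep -n -E '^# (Start|Stop) sampling' delay_results.txt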
Now that we have three events, let’s try to correlate them to the kernel_delay script’s output:
# Start sampling @06:46:47.538441 (06:46:47 UTC)
# Stop sampling @06:47:17.566780 (06:47:17 UTC)
# Sample dump @06:47:17.566919 (06:47:17 UTC)
TID THREAD <RESOURCE SPECIFIC>
---------- ---------------- ----------------------------------------------------------------------------
4056 ovs-vswitchd [SYSCALL STATISTICS]
NAME NUMBER COUNT TOTAL ns MAX ns
poll 7 69 27,731,273,101 511,454,506
sendmsg 46 2340 1,549,422,997 1,393,773,073
sendto 44 1375 42,493,227 7,949,970
getdents64 217 140 7,710,665 172,940
futex 202 283 6,403,133 235,639
recvmsg 47 2760 5,162,714 11,086
openat 257 70 2,380,907 53,532
accept 43 210 1,202,824 25,403
recvfrom 45 106 598,966 20,157
getrusage 98 74 399,409 131,849
read 0 70 279,191 11,853
fstat 5 70 160,894 7,148
close 3 71 117,364 5,633
clone 56 1 107,854 107,854
write 1 26 65,078 16,364
rt_sigprocmask 14 2 3,744 3,032
fcntl 72 2 2,270 1,693
TOTAL( - poll): 7600 1,616,511,237
[THREAD RUN STATISTICS]
SCHED_CNT TOTAL ns MIN ns MAX ns
174 249,326,672 3,044 24,811,988
[THREAD READY STATISTICS]
SCHED_CNT TOTAL ns MAX ns
174 784,491 74,757
[SOFT IRQ STATISTICS]
NAME VECT_NR COUNT TOTAL ns MAX ns
sched 7 21 158,027 13,963
rcu 9 23 83,012 15,794
timer 1 9 58,868 8,354
TOTAL: 53 299,907
The extract above shows the time slot for the first event in the log file, and only the main ovs-vswitchd thread, as this is the one reporting the event. The only thing that stands out in the output above is the one-plus seconds spent in the sendmsg() system call.
Generally, a large value in the MAX ns column is notable, as most of these syscalls are non-blocking. A function like sendmsg() blocking for a full second is a good indication of something to investigate further. Two notable exceptions are poll() and futex(): both are expected to block while waiting on something else and do not point to the root cause directly.
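Rather than scanning each sample by eye, awk can flag the suspicious entries. The sketch below assumes the column layout shown above (MAX ns in the fifth column of the syscall rows) and prints every syscall, other than poll and futex, whose worst case exceeded one second. It does not tell you which sample or thread a line belongs to, so use it to spot candidates and then jump back to the matching sample:

$ awk '{v=$5; gsub(",","",v); if (v+0 > 1000000000 && $1 != "poll" && $1 != "futex") print}' delay_results.txt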
We notice a similar behavior in the time slot of the second event:
# Start sampling @06:59:12.245780 (06:59:12 UTC)
# Stop sampling @06:59:42.274987 (06:59:42 UTC)
# Sample dump @06:59:42.275134 (06:59:42 UTC)
TID THREAD <RESOURCE SPECIFIC>
4056 ovs-vswitchd [SYSCALL STATISTICS]
NAME NUMBER COUNT TOTAL ns MAX ns
poll 7 65 23,711,425,967 510,370,473
sendmsg 46 6287 5,160,045,530 4,383,277,804
recvmsg 47 7417 308,630,468 8,693,126
socket 41 4891 22,110,916 36,111
getdents64 217 138 7,566,298 267,129
futex 202 593 5,775,618 323,246
ioctl 16 4996 5,541,061 11,631
sendto 44 37 3,806,988 476,704
close 3 4960 2,583,594 6,997
openat 257 69 2,114,669 44,723
accept 43 204 1,289,991 32,191
recvfrom 45 91 518,709 15,596
getrusage 98 70 379,898 110,071
read 0 66 274,984 11,724
write 1 25 208,880 25,577
fstat 5 69 171,851 9,024
clone 56 2 166,940 84,744
bind 49 6 30,762 9,308
getsockname 51 6 10,925 3,149
rt_sigprocmask 14 4 8,910 5,262
TOTAL( - poll): 29931 5,521,236,992
Here you can see the sendmsg() system call took four-plus seconds, also in line with the log message.
The data from the syscall_log provides additional details about what the kernel is doing during this slow syscall, but in this case doesn’t elucidate the situation further:
ovs-vswitchd
[<0>] rtnetlink_rcv_msg+0x268/0x370
[<0>] netlink_rcv_skb+0x4c/0x130
[<0>] netlink_unicast+0x19a/0x230
[<0>] netlink_sendmsg+0x204/0x3d0
[<0>] sock_sendmsg+0x50/0x60
[<0>] ____sys_sendmsg+0x1ef/0x250
[<0>] ___sys_sendmsg+0x7c/0xc0
[<0>] __sys_sendmsg+0x57/0xa0
[<0>] do_syscall_64+0x5b/0x1b0
[<0>] entry_SYSCALL_64_after_hwframe+0x61/0xc6
Unfortunately, the eleven-second delay event wasn't captured, as the kernel_delay script restarted right in the midst of it.
To understand the reason behind the long duration of the sendmsg() call, the next step involves initiating the trace-cmd command below. This command will capture the function graph of the system call, including timestamps:
trace-cmd record -p function_graph -g __sys_sendmsg -P $(pidof ovs-vswitchd) -c
Analyzing the trace-cmd data
First, we need to make sure the issue happened while the trace was being taken; in our case, 74 log messages were recorded. Now we can decode the data using the trace-cmd report command. The output of this command looks something like this:
ovs-vswitchd-4056 [025] 14673003.226125: funcgraph_entry: | __sys_sendmsg() {
ovs-vswitchd-4056 [025] 14673003.226128: funcgraph_entry: | smp_irq_work_interrupt() {
ovs-vswitchd-4056 [025] 14673003.226129: funcgraph_entry: | irq_enter() {
ovs-vswitchd-4056 [025] 14673003.226129: funcgraph_entry: | irq_enter_rcu() {
ovs-vswitchd-4056 [025] 14673003.226129: funcgraph_entry: 0.104 us | irqtime_account_irq();
ovs-vswitchd-4056 [025] 14673003.226130: funcgraph_exit: 0.730 us | }
ovs-vswitchd-4056 [025] 14673003.226130: funcgraph_exit: 1.229 us | }
ovs-vswitchd-4056 [025] 14673003.226131: funcgraph_entry: | __wake_up() {
...
...
ovs-vswitchd-4056 [025] 14673003.226142: funcgraph_entry: | irq_exit() {
ovs-vswitchd-4056 [025] 14673003.226142: funcgraph_entry: | irq_exit_rcu() {
ovs-vswitchd-4056 [025] 14673003.226142: funcgraph_entry: 0.076 us | irqtime_account_irq();
ovs-vswitchd-4056 [025] 14673003.226143: funcgraph_entry: 0.047 us | idle_cpu();
ovs-vswitchd-4056 [025] 14673003.226143: funcgraph_exit: 0.798 us | }
ovs-vswitchd-4056 [025] 14673003.226143: funcgraph_exit: 1.179 us | }
ovs-vswitchd-4056 [025] 14673003.226143: funcgraph_exit: + 15.070 us | }
ovs-vswitchd-4056 [025] 14673003.226209: funcgraph_exit: + 64.289 us | }
Rather than going over this long log file, we could grep/awk/sort for the total time each call took. In our case this resulted in the following:
$ cat trace-cmd_report.txt | grep "| }" | \
awk '{for(i=1;i<NF;i++)if($i=="us")print $(i-1) " " $1 " " $2 " " $3}' | \
sort -n
...
6048.482 ovs-vswitchd-4056 [114] 14673003.893160:
6059.754 ovs-vswitchd-4056 [114] 14673003.994271:
6345.768 ovs-vswitchd-4056 [114] 14673004.012454:
6377.403 ovs-vswitchd-4056 [114] 14673003.914774:
6410.477 ovs-vswitchd-4056 [114] 14673004.078818:
11346.105 ovs-vswitchd-4056 [119] 14673009.406664:
From the above, you can see that in the worst case the system call took roughly 11ms (11,346 us in the last line above), which is not even close to the 12 seconds reported in ovs-vswitchd.log for this test. If long delays had been reported, we would need to dig into the file to figure out what is causing them.
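If the sorted list had shown multi-second entries, one way to dig in is to jump back to the matching timestamp in the full report and read the function graph leading up to it; the funcgraph_exit record carries the total time, so the interesting calls are the ones just above it. A rough sketch, using the slowest entry from the list above:

$ grep -B 200 '14673009.406664' trace-cmd_report.txt | less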
The next step would be to gather some scheduling information to determine what the kernel might be doing with our process. To do this we use the perf command. More details on perf and its use cases can be found online and at the perf wiki. For our use case, we execute the following:
perf sched record -a --output perf-sched.data -- sleep 60
In this context, we assume that the issue will manifest within the 60-second window. If not, it is advisable to restart perf, rather than increasing the run time, to prevent an excessive amount of data from being collected.
Analyzing the perf sched data
We could now look at all the kernel scheduling events with the perf sched timehist -i perf-sched.data -f command. However, this will generate a huge amount of output. We can further limit the output to only the main thread of ovs-vswitchd by including the --tid 4056 option.
When we look at this output, we see the 500ms wait times, which are expected because ovs-vswitchd's main thread times out after 500ms if no poll events are available:
$ perf sched timehist -i perf-sched.data -f --tid=4056
time cpu task name wait time sch delay run time
[tid/pid] (msec) (msec) (msec)
--------------- ------ ------------------------------ --------- --------- ---------
14918892.771863 [0022] ovs-vswitchd[4056] 0.000 0.004 0.301
14918893.273458 [0022] ovs-vswitchd[4056] 501.336 0.005 0.258
14918893.273485 [0022] ovs-vswitchd[4056] 0.003 0.000 0.023
14918893.273565 [0022] ovs-vswitchd[4056] 0.002 0.000 0.078
14918893.775038 [0022] ovs-vswitchd[4056] 501.455 0.005 0.016
14918893.775070 [0022] ovs-vswitchd[4056] 0.025 0.003 0.007
14918893.775089 [0022] ovs-vswitchd[4056] 0.011 0.003 0.007
14918894.276799 [0022] ovs-vswitchd[4056] 501.365 0.005 0.021
...
Let’s filter for wait time and sch delay values of more than 550ms:
$ perf sched timehist -i perf-sched.data -f --tid=4056 | \
awk '$4 > 550 || $5 > 550' | wc -l
1
Going over the list manually, we can see an event where we waited for more than two seconds, corresponding to the log event we saw for this capture:
time cpu task name wait time sch delay run time
[tid/pid] (msec) (msec) (msec)
--------------- ------ ------------------------------ --------- --------- ---------
14918919.383197 [0003] ovs-vswitchd[4056] 2388.540 0.010 6.492
|timeval|WARN|Unreasonably long 2412ms poll interval (5ms user, 12ms system)
We should further analyze the sched data to see what was running during this timeframe, and perhaps why ovs-vswitchd was not scheduled. Let’s first find the above event in the raw data:
$ perf script -i perf-sched.data --tid=4056 -f
…
ovs-vswitchd 4056 [003] 14918916.986227: sched:sched_stat_runtime: comm=ovs-vswitchd pid=4056 runtime=816399 [ns] vruntime=899326138452 [ns]
ovs-vswitchd 4056 [003] 14918916.987228: sched:sched_stat_runtime: comm=ovs-vswitchd pid=4056 runtime=985902 [ns] vruntime=899326244187 [ns]
ovs-vswitchd 4056 [003] 14918916.988159: sched:sched_stat_runtime: comm=ovs-vswitchd pid=4056 runtime=915643 [ns] vruntime=899326342387 [ns]
ovs-vswitchd 4056 [003] 14918916.988164: sched:sched_switch: ovs-vswitchd:4056 [110] T ==> swapper/3:0 [120]
ovs-vswitchd 4056 [003] 14918919.376769: sched:sched_waking: comm=kworker/101:2 pid=3703381 prio=120 target_cpu=101
ovs-vswitchd 4056 [003] 14918919.377173: sched:sched_stat_runtime: comm=ovs-vswitchd pid=4056 runtime=474745 [ns] vruntime=899326393302 [ns]
ovs-vswitchd 4056 [003] 14918919.377184: sched:sched_waking: comm=kworker/3:0 pid=3693404 prio=120 target_cpu=003
…
Now let’s find the raw data for what happened on this CPU between the two events, and save it to a file:
$ perf script -i perf-sched.data \
--time 14918916.988164,14918919.377173 \
--cpu 3 -f | tee perf-sched-cpu3.txt | wc -l
1240
We can now obtain a general list of the threads that were active during this period using the following:
$ grep sched_stat_runtime perf-sched-cpu3.txt | awk '{print $1}' | sort | uniq -c
1 :-1
1 csi-node-driver
4 kworker/3:0-mm_
5 runc
2 trace-cmd
622 vector-worker
The data above shows that other threads were running on the CPU with higher priority; however, if that were the whole story, it should have been flagged by the THREAD READY STATISTICS. Taking a closer look at the raw trace data, we see that the ovs-vswitchd thread was stopped, indicated by the T task state in the fragment below:
ovs-vswitchd 4056 [003] 14918916.988164: sched:sched_switch: ovs-vswitchd:4056 [110] T ==> swapper/3:0 [120]
So the next step would be to further investigate why the kernel or another process stopped ovs-vswitchd. However, it's important to note that this investigation goes beyond the scope of this article, as we have established that the problem does not originate from Open vSwitch.
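Although that follow-up is out of scope here, a possible starting point is to record which task sends stop signals to ovs-vswitchd. The sketch below uses the kernel's signal:signal_generate tracepoint and assumes SIGSTOP (signal 19 on x86_64) is involved; if the thread is being stopped for another reason, for example by a tracer, the filter needs adjusting:

$ perf record -e signal:signal_generate --filter 'sig == 19' -a -- sleep 60
$ perf script   # shows the sending task and the receiving comm/pid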
Conclusion
During this investigation, we began with only a scary-looking log message in the ovs-vswitchd log files; it is possible that poor network or system performance drew our attention to these logs. However, after thoroughly analyzing several sources of telemetry, Open vSwitch was ruled out as the root cause. This highlights how the "Unreasonably long poll interval" and "blocked waiting" log messages aren’t necessarily an indication of any bug or problem with Open vSwitch, but are instead more of an indication of overall system health.