We often receive reports of users escalating the "long poll interval" and "blocked waiting" log messages as issues with Open vSwitch (OVS). In most cases, however, the problems are tied to system resource issues; because OVS is the component raising the alert, it ends up being wrongly identified as the culprit.

Before delving into the debugging process, let's first clarify the meaning of these messages.

Within its internal operations, OVS employs a mechanism known as a poll loop to handle various events. Its function is similar to more advanced solutions like libevent or glib. In essence, it comprises a set of APIs enabling the registration of events, such as a timeout or the availability of data on a socket. After registering these events, your process can enter a sleep state, awaiting the occurrence of one of these events. Once event processing is complete, the process returns to a sleep state, anticipating the next event.

The "Unreasonably long poll interval" message is logged when the duration between waking up for an event and re-entering the sleep state exceeds one second. In other words, when processing an event takes more than one second.

The "blocked waiting to quiesce" message is logged when one thread is waiting for another to complete its work. This is related to ovs-vswitchd’s RCU implementation, more information can be found in the source. It is generally an indication that one or more threads are completing their work slowly and often accompanies "Unreasonably long poll interval" messages.

Reported problem

In this case, we have a system showing log messages like this:

|timeval|WARN|Unreasonably long 10540ms poll interval (42ms user, 430ms system)
|timeval|WARN|Unreasonably long 6116ms poll interval (24ms user, 80ms system)
|timeval|WARN|Unreasonably long 8699ms poll interval (15ms user, 104ms system)
|timeval|WARN|Unreasonably long 14754ms poll interval (51ms user, 209ms system)

These log messages include three time values: the first indicates the elapsed wall-clock time, the second is the amount of time ovs-vswitchd spent in userspace, and the third is how much time the kernel spent servicing ovs-vswitchd’s requests.

The initial observation here is that the cumulative user + system time does not come close to the long poll interval time. This discrepancy suggests that the kernel might be busy doing other activities, preventing the ovs-vswitchd main thread from receiving any CPU time. However, this is not always the case, so we are going to do some additional troubleshooting.
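
To make the discrepancy concrete: for the first message above, 42 ms + 430 ms of CPU time accounts for less than half a second of a 10,540 ms interval, so the thread spent roughly 10 seconds neither running in userspace nor being serviced by the kernel. The small helper below (not part of OVS) computes this for every warning; it assumes the message format shown above, and the log path is the common default, which may differ on your system:

grep 'Unreasonably long' /var/log/openvswitch/ovs-vswitchd.log | \
    sed -E 's/.*long ([0-9]+)ms poll interval \(([0-9]+)ms user, ([0-9]+)ms system\).*/\1 \2 \3/' | \
    awk '{ printf "wall=%dms cpu=%dms unaccounted=%dms\n", $1, $2 + $3, $1 - $2 - $3 }'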

It's also important to note that in these instances no thread name follows timeval, which is the log module name. This absence indicates that the log message is reported by the main ovs-vswitchd thread, also referred to as the bridge run thread.

Below are examples illustrating other ovs-vswitchd threads, namely handler340 and revalidator391, reporting similar issues:

|timeval(handler340)|WARN|Unreasonably long 1259ms poll interval (0ms user, 692ms system) 
|timeval(revalidator391)|WARN|Unreasonably long 1371ms poll interval (0ms user, 853ms system)

Getting the initial data

As you can never be sure when the problem will happen again, we started requesting the data below. In our case, we were "lucky," as the problem was happening quite frequently.

The kernel_delay script's output is particularly interesting, as it provides insights into the actions of ovs-vswitchd. Refer to the article Troubleshooting Open vSwitch: Is the kernel to blame? for additional information about this script and how to run it in a Red Hat OpenShift environment.

Here we take 120 samples of 30 seconds each, assuming this will be enough to capture the problem. If this does not work for your environment, you can adjust the parameters, or restart the script if the issue has not yet occurred.

./kernel_delay.py --sample-time 30 --sample-count 120 | tee delay_results.txt

During occurrences of such issues, we may also encounter messages like the following:

|ovs_rcu(urcu5)|WARN|blocked 1000 ms waiting for handler496 to quiesce
|ovs_rcu(urcu5)|WARN|blocked 2001 ms waiting for handler496 to quiesce
|ovs_rcu(urcu5)|WARN|blocked 4001 ms waiting for handler496 to quiesce
|ovs_rcu(urcu5)|WARN|blocked 8000 ms waiting for handler496 to quiesce
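
When these messages pile up, a quick tally of which threads are being waited on can point at the slow thread(s). This is a rough helper based on the message format above; the log path is the common default and may differ on your system:

grep 'waiting for .* to quiesce' /var/log/openvswitch/ovs-vswitchd.log | \
    sed -E 's/.*waiting for (.*) to quiesce.*/\1/' | sort | uniq -c | sort -rn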

In such instances, it becomes valuable to examine the thread’s activity, as the thread may be blocked in a system call. Therefore, we should proactively request the output of the loop below, which dumps the kernel call stack for each ovs-vswitchd thread every 30 seconds:

for ((;;)) do date; \
  find /proc/$(pidof -s ovs-vswitchd)/task -mindepth 1 -maxdepth 1 \
    -type d -exec \
  bash -c "echo '{}' && cat '{}'/comm && cat '{}'/stack" \; ; \
  sleep 30; \
done | tee syscall_log

Lastly, we would like to capture the CPU statistics by executing the following. This output could provide us with a general understanding of the CPU load during the issue.

mpstat -A 1 | tee cpu_log

Now, we need to actively monitor for the occurrence of the issue, indicated by the relevant log messages appearing in the ovs-vswitchd.log. Once identified, stop all the aforementioned captures, saving both the captured data and the actual ovs-vswitchd.log for further analysis.
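
A simple way to spot the moment the issue reappears is to follow the log and filter for the two warnings. This is just a convenience, again assuming the common default log path:

tail -F /var/log/openvswitch/ovs-vswitchd.log | \
    grep --line-buffered -E 'Unreasonably long|blocked .* waiting for .* to quiesce'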

In general, if your system supports collecting SoS reports, it would be advisable to do so after gathering all the other relevant data. For additional information on SoS reports, please refer to the SoS project wiki.

It's important to note that not all of the collected data will be analyzed. However, it could provide valuable insights if the issue points to a different cause.

Analyzing the initial results

The first thing we look at is the ovs-vswitchd log showing the problem:

06:47:10.571Z||timeval|WARN|Unreasonably long 1435ms poll interval (7ms user, 28ms system)
06:59:37.304Z||timeval|WARN|Unreasonably long 4529ms poll interval (13ms user, 125ms system)
07:00:44.796Z||timeval|WARN|Unreasonably long 11115ms poll interval (18ms user, 215ms system)

Now that we have three events, let’s try to correlate them to the kernel_delay script’s output:
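
Every sample in delay_results.txt is delimited by "# Start sampling" and "# Stop sampling" header lines, so listing those headers makes it easy to match each warning timestamp to its 30-second window (a small helper, not part of the script):

grep -E '^# (Start|Stop) sampling' delay_results.txt

The sample below is the one covering the first warning at 06:47:10: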

# Start sampling @06:46:47.538441 (06:46:47 UTC)
# Stop sampling @06:47:17.566780 (06:47:17 UTC)
# Sample dump @06:47:17.566919 (06:47:17 UTC)
TID        THREAD           <RESOURCE SPECIFIC>
---------- ---------------- ----------------------------------------------------------------------------
       4056 ovs-vswitchd     [SYSCALL STATISTICS]
                            NAME                 NUMBER       COUNT          TOTAL ns            MAX ns
                            poll                      7          69    27,731,273,101       511,454,506
                            sendmsg                  46        2340     1,549,422,997     1,393,773,073
                            sendto                   44        1375        42,493,227         7,949,970
                            getdents64              217         140         7,710,665           172,940
                            futex                   202         283         6,403,133           235,639
                            recvmsg                  47        2760         5,162,714            11,086
                            openat                  257          70         2,380,907            53,532
                            accept                   43         210         1,202,824            25,403
                            recvfrom                 45         106           598,966            20,157
                            getrusage                98          74           399,409           131,849
                            read                      0          70           279,191            11,853
                            fstat                     5          70           160,894             7,148
                            close                     3          71           117,364             5,633
                            clone                    56           1           107,854           107,854
                            write                     1          26            65,078            16,364
                            rt_sigprocmask           14           2             3,744             3,032
                            fcntl                    72           2             2,270             1,693
                            TOTAL( - poll):                    7600     1,616,511,237

                            [THREAD RUN STATISTICS]
                            SCHED_CNT           TOTAL ns            MIN ns            MAX ns
                                   174       249,326,672             3,044        24,811,988

                            [THREAD READY STATISTICS]
                            SCHED_CNT           TOTAL ns            MAX ns
                                   174           784,491            74,757

                            [SOFT IRQ STATISTICS]
                            NAME                 VECT_NR       COUNT          TOTAL ns            MAX ns
                            sched                      7          21           158,027            13,963
                            rcu                        9          23            83,012            15,794
                            timer                      1           9            58,868             8,354
                            TOTAL:                                53           299,907

The extract above shows the time slot for the first event in the logfile, and only for the main ovs-vswitchd thread, as this is the one reporting the event. The only thing that stands out in the output is the one-plus seconds spent in the sendmsg() system call.

Generally, a large value in the MAX ns column is notable, as most of these syscalls are non-blocking. A function like sendmsg() blocking for a full second is a good indication of something to investigate further. Two notable exceptions are poll() and futex(); both are expected to block while waiting on something else and do not point to the root cause directly.
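
To avoid scanning every sample by hand, the MAX ns column can be filtered for calls that blocked longer than a chosen threshold. The sketch below (not part of the kernel_delay script) assumes the column layout shown above and flags any call, other than poll() and futex(), that blocked for more than half a second:

awk '/SYSCALL STATISTICS/ { tid = $1 }
     /^ *[a-z_0-9]+ +[0-9]+ +[0-9,]+ +[0-9,]+ +[0-9,]+ *$/ {
         max = $NF; gsub(",", "", max)          # strip thousands separators
         if (max + 0 > 500000000 && $1 != "poll" && $1 != "futex")
             print "tid " tid ": " $1 " blocked for " max " ns"
     }' delay_results.txt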

We notice a similar behavior in the time slot of the second event:

# Start sampling @06:59:12.245780 (06:59:12 UTC)
# Stop sampling @06:59:42.274987 (06:59:42 UTC)
# Sample dump @06:59:42.275134 (06:59:42 UTC)
TID        THREAD           <RESOURCE SPECIFIC>
      4056 ovs-vswitchd     [SYSCALL STATISTICS]
                            NAME                 NUMBER       COUNT          TOTAL ns            MAX ns
                            poll                      7          65    23,711,425,967       510,370,473
                            sendmsg                  46        6287     5,160,045,530     4,383,277,804
                            recvmsg                  47        7417       308,630,468         8,693,126
                            socket                   41        4891        22,110,916            36,111
                            getdents64              217         138         7,566,298           267,129
                            futex                   202         593         5,775,618           323,246
                            ioctl                    16        4996         5,541,061            11,631
                            sendto                   44          37         3,806,988           476,704
                            close                     3        4960         2,583,594             6,997
                            openat                  257          69         2,114,669            44,723
                            accept                   43         204         1,289,991            32,191
                            recvfrom                 45          91           518,709            15,596
                            getrusage                98          70           379,898           110,071
                            read                      0          66           274,984            11,724
                            write                     1          25           208,880            25,577
                            fstat                     5          69           171,851             9,024
                            clone                    56           2           166,940            84,744
                            bind                     49           6            30,762             9,308
                            getsockname              51           6            10,925             3,149
                            rt_sigprocmask           14           4             8,910             5,262
                            TOTAL( - poll):                   29931     5,521,236,992

Here you can see the sendmsg() system call took four-plus seconds, again in line with the log message.

The data from the syscall_log provides additional details about what the kernel is doing during this slow syscall, but in this case doesn’t elucidate the situation further:

ovs-vswitchd
[<0>] rtnetlink_rcv_msg+0x268/0x370
[<0>] netlink_rcv_skb+0x4c/0x130
[<0>] netlink_unicast+0x19a/0x230
[<0>] netlink_sendmsg+0x204/0x3d0
[<0>] sock_sendmsg+0x50/0x60
[<0>] ____sys_sendmsg+0x1ef/0x250
[<0>] ___sys_sendmsg+0x7c/0xc0
[<0>] __sys_sendmsg+0x57/0xa0
[<0>] do_syscall_64+0x5b/0x1b0
[<0>] entry_SYSCALL_64_after_hwframe+0x61/0xc6

Unfortunately, the eleven-second delay event wasn't captured as the kernel_delay script restarted right in the midst of it.

To understand the reason behind the long duration of the sendmsg() call, the next step involves initiating the trace-cmd below. This command will capture the function graph of the system call, including timestamps:

trace-cmd record -p function_graph -g __sys_sendmsg -P $(pidof ovs-vswitchd) -c

Analyzing the trace-cmd data

First, we need to make sure the issue happened during the time the trace was taken, and in our case 74 log messages were recorded. Now, we can decode the data using the trace-cmd report command. The output of this command looks something like this:

ovs-vswitchd-4056  [025] 14673003.226125: funcgraph_entry:                   |  __sys_sendmsg() {
ovs-vswitchd-4056  [025] 14673003.226128: funcgraph_entry:                   |  smp_irq_work_interrupt() {
ovs-vswitchd-4056  [025] 14673003.226129: funcgraph_entry:                   |    irq_enter() {
ovs-vswitchd-4056  [025] 14673003.226129: funcgraph_entry:                   |      irq_enter_rcu() {
ovs-vswitchd-4056  [025] 14673003.226129: funcgraph_entry:        0.104 us   |        irqtime_account_irq();
ovs-vswitchd-4056  [025] 14673003.226130: funcgraph_exit:         0.730 us   |      }
ovs-vswitchd-4056  [025] 14673003.226130: funcgraph_exit:         1.229 us   |    }
ovs-vswitchd-4056  [025] 14673003.226131: funcgraph_entry:                   |    __wake_up() {
...
...
ovs-vswitchd-4056  [025] 14673003.226142: funcgraph_entry:                   |    irq_exit() {
ovs-vswitchd-4056  [025] 14673003.226142: funcgraph_entry:                   |      irq_exit_rcu() {
ovs-vswitchd-4056  [025] 14673003.226142: funcgraph_entry:        0.076 us   |        irqtime_account_irq();
ovs-vswitchd-4056  [025] 14673003.226143: funcgraph_entry:        0.047 us   |        idle_cpu();
ovs-vswitchd-4056  [025] 14673003.226143: funcgraph_exit:         0.798 us   |      }
ovs-vswitchd-4056  [025] 14673003.226143: funcgraph_exit:         1.179 us   |    }
ovs-vswitchd-4056  [025] 14673003.226143: funcgraph_exit:       + 15.070 us  |  }
ovs-vswitchd-4056  [025] 14673003.226209: funcgraph_exit:       + 64.289 us  |  }

Rather than going over this long log file, we could grep/awk/sort for the total time each call took. In our case this resulted in the following:

$ cat trace-cmd_report.txt | grep "|  }" | \
    awk '{for(i=1;i<NF;i++)if($i=="us")print $(i-1) " " $1 " " $2 " " $3}' | \
    sort -n
...
6048.482 ovs-vswitchd-4056 [114] 14673003.893160:
6059.754 ovs-vswitchd-4056 [114] 14673003.994271:
6345.768 ovs-vswitchd-4056 [114] 14673004.012454:
6377.403 ovs-vswitchd-4056 [114] 14673003.914774:
6410.477 ovs-vswitchd-4056 [114] 14673004.078818:
11346.105 ovs-vswitchd-4056 [119] 14673009.406664:

From the above, you can see that in the worst case the system call took about 11.3 ms, which is not even close to the 12 seconds reported in ovs-vswitchd.log for this test. If long delays had been reported here, we would need to dig into this file to figure out what is causing them.

The next step would be to gather some scheduling information to determine what the kernel might be doing with our process. To do this we use the perf command. More details on perf and its use cases can be found online and at the perf wiki. For our use case, we execute the following:

perf sched record -a --output perf-sched.data -- sleep 60

In this context, we assume that the issue will manifest within the 60-second window. If not, it is advisable to restart perf, rather than increasing the run time, to prevent an excessive amount of data from being collected.
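
One way to follow that advice without babysitting the capture is to re-run perf in fixed-size chunks until a new warning shows up. This is a rough sketch, assuming the common default log path (and it does not handle log rotation):

LOG=/var/log/openvswitch/ovs-vswitchd.log      # adjust for your environment
start=$(wc -l < "$LOG")
while :; do
    perf sched record -a --output perf-sched.data -- sleep 60
    # Stop once a new "Unreasonably long" warning has been logged since we started.
    tail -n "+$((start + 1))" "$LOG" | grep -q 'Unreasonably long' && break
done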

Analyzing the perf sched data

We could now look at all the kernel scheduling events with the perf sched timehist -i perf-sched.data -f command. However, this will generate a huge amount of output. We could further limit the output to only the main thread of ovs-vswitchd, by including the --tid 4056 option.

Looking at this output, we see 500 ms wait times, which are expected because ovs-vswitchd's main thread times out after 500 ms if no poll events are available:

$ perf sched timehist -i perf-sched.data -f --tid=4056
           time    cpu  task name                       wait time  sch delay   run time
                        [tid/pid]                          (msec)     (msec)     (msec)
--------------- ------  ------------------------------  ---------  ---------  ---------
14918892.771863 [0022]  ovs-vswitchd[4056]                  0.000      0.004      0.301 
14918893.273458 [0022]  ovs-vswitchd[4056]                501.336      0.005      0.258 
14918893.273485 [0022]  ovs-vswitchd[4056]                  0.003      0.000      0.023 
14918893.273565 [0022]  ovs-vswitchd[4056]                  0.002      0.000      0.078 
14918893.775038 [0022]  ovs-vswitchd[4056]                501.455      0.005      0.016 
14918893.775070 [0022]  ovs-vswitchd[4056]                  0.025      0.003      0.007 
14918893.775089 [0022]  ovs-vswitchd[4056]                  0.011      0.003      0.007 
14918894.276799 [0022]  ovs-vswitchd[4056]                501.365      0.005      0.021
... 

Let’s filter for wait time or sch delay values of more than 550 ms:

$ perf sched timehist -i perf-sched.data -f  --tid=4056 | \
    awk '$4 > 550 || $5 > 550' | wc -l
1

Going over the list manually, we can see an event where we waited for two-plus seconds, corresponding to the log event we saw during this capture:

         time    cpu  task name                       wait time  sch delay   run time
                        [tid/pid]                          (msec)     (msec)     (msec)
--------------- ------  ------------------------------  ---------  ---------  ---------
14918919.383197 [0003]  ovs-vswitchd[4056]               2388.540      0.010      6.492 
|timeval|WARN|Unreasonably long 2412ms poll interval (5ms user, 12ms system)

We should further analyze the sched data to see who was running during this timeframe, and perhaps why ovs-vswitchd was not scheduled. Let’s first find the above event in the raw data:

$ perf script -i perf-sched.data --tid=4056 -f
…
    ovs-vswitchd  4056 [003] 14918916.986227: sched:sched_stat_runtime: comm=ovs-vswitchd pid=4056 runtime=816399 [ns] vruntime=899326138452 [ns]
    ovs-vswitchd  4056 [003] 14918916.987228: sched:sched_stat_runtime: comm=ovs-vswitchd pid=4056 runtime=985902 [ns] vruntime=899326244187 [ns]
    ovs-vswitchd  4056 [003] 14918916.988159: sched:sched_stat_runtime: comm=ovs-vswitchd pid=4056 runtime=915643 [ns] vruntime=899326342387 [ns]
    ovs-vswitchd  4056 [003] 14918916.988164:       sched:sched_switch: ovs-vswitchd:4056 [110] T ==> swapper/3:0 [120]
    ovs-vswitchd  4056 [003] 14918919.376769:       sched:sched_waking: comm=kworker/101:2 pid=3703381 prio=120 target_cpu=101
    ovs-vswitchd  4056 [003] 14918919.377173: sched:sched_stat_runtime: comm=ovs-vswitchd pid=4056 runtime=474745 [ns] vruntime=899326393302 [ns]
    ovs-vswitchd  4056 [003] 14918919.377184:       sched:sched_waking: comm=kworker/3:0 pid=3693404 prio=120 target_cpu=003
…

Now let’s find the raw data for what happened on this CPU between the two events, and save it to a file:

$ perf script -i perf-sched.data \
    --time 14918916.988164,14918919.377173 \
    --cpu 3 -f | tee perf-sched-cpu3.txt | wc -l
1240

We can now obtain a general list of the threads that were active during this period using the following:

$ grep sched_stat_runtime perf-sched-cpu3.txt | awk '{print $1}' | sort | uniq -c
      1 :-1
      1 csi-node-driver
      4 kworker/3:0-mm_
      5 runc
      2 trace-cmd
    622 vector-worker

The data above shows that other threads were running on the CPU with higher priority. However, if ovs-vswitchd had simply been waiting for the CPU, this would have been flagged in the THREAD READY STATISTICS. Taking a closer look at the raw trace data, we see that the ovs-vswitchd thread was stopped, as indicated by the T in the fragment below:

    ovs-vswitchd  4056 [003] 14918916.988164:       sched:sched_switch: ovs-vswitchd:4056 [110] T ==> swapper/3:0 [120]

The next step would be to further investigate why the kernel or another process stopped ovs-vswitchd. However, that investigation goes beyond the scope of this article, as we have established that the problem does not originate in Open vSwitch.
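
While that follow-up is out of scope here, one possible starting point would be to trace signal events system-wide and look for the stop-signal family (SIGSTOP/SIGTSTP/SIGTTIN/SIGTTOU, numbers 19-22 on Linux) involving ovs-vswitchd. A rough sketch:

# Capture signal generation/delivery for a minute, then look for stop signals
# involving ovs-vswitchd threads.
perf record -e signal:signal_generate -e signal:signal_deliver -a \
    --output perf-signal.data -- sleep 60
perf script -i perf-signal.data | grep ovs-vswitchd | grep -E 'sig=(19|2[012]) '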

Conclusion

During this investigation, we began with only a scary-looking log message in the ovs-vswitchd log files; it is possible that poor network or system performance drew our attention to these logs. However, after thoroughly analyzing several sources of telemetry, we ruled out Open vSwitch as the root cause. This highlights that the "Unreasonably long poll interval" and "blocked waiting" log messages are not necessarily an indication of a bug or problem in Open vSwitch, but rather an indication of overall system health.