Debugging vHost user TX contention in Open vSwitch

It isn't always easy to understand how Open vSwitch (OVS) cycles are spent, especially because various parameters and configuration options can affect how OVS behaves. Members of the Open vSwitch community are actively working to understand what causes packet drops in Open vSwitch. Efforts so far have included adding a custom statistic for vHost TX retries, tracking vHost TX contention, and adding a coverage counter to count vHost IRQs. We are particularly interested in the user space datapath that uses the Data Plane Development Kit (DPDK) for fast I/O.

Adding these statistics is an ongoing effort, and we won't cover every corner case here. In some cases, the statistics alone still leave doubts about what is causing a given behavior.

In this article, I will introduce a new counter we've added to learn more about contention in the vHost transmission path. I'll also show you how to use the new counter with perf, and I'll discuss what's next for our ongoing efforts.

The test environment for reproducing contention

In this section, we'll set up a test environment to reproduce contention in the vHost transmission path. Our reference system is running Red Hat Enterprise Linux (RHEL) 7.7 with openvswitch2.11-2.11.0-35.el7fdp.x86_64. You can skip this section if your environment is set up and running already.

Configuring OVS

Assuming you already have Red Hat Enterprise Linux 7.7 running on your system, you can configure OVS with a single bridge that has two physical ports and two vhost-user-client ports plugged in.

The physical ports are connected to a TRex Realistic Traffic Generator. The traffic generator sends a unidirectional flow of packets to the bridge, which passes them to the vhost-user-client ports. We've crafted these packets so that traffic lands on both receive queues of the first physical port on the OVS side (for example, by varying the L4 ports so that receive-side scaling spreads the flows). The vhost-user-client ports are connected to a virtual machine (VM) that sends the packets back using testpmd in io forward mode.

If you need more details about setting up a TRex traffic generator, configuring a host running OVS, or other aspects of this setup, see Eelco Chaudron's introduction to Automated Open vSwitch PVP testing.

Configuring the bridge and host

The setup shown in the ASCII diagram below differs slightly from the Physical interface to Virtual interface back to Physical interface (PVP) setup described in the linked article above.

+------+   +-----+   +---------+
|      |   |     |   |         |
|     0+---+1   4+---+0        |
| tgen |   | ovs |   | testpmd |
|     1+---+2   5+---+1        |
|      |   |     |   |         |
+------+   +-----+   +---------+

Configure the bridge on the host as follows. Note that the pmd-cpu-mask value of 0x00008002 places the poll mode driver (PMD) threads on cores 1 and 15:

# ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
# ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x00008002
# ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev
# ovs-vsctl add-port br0 dpdk0 -- \
    set Interface dpdk0 type=dpdk -- \
    set Interface dpdk0 options:dpdk-devargs=0000:01:00.0 -- \
    set Interface dpdk0 ofport_request=1 -- \
    set Interface dpdk0 options:n_rxq=2
# ovs-vsctl add-port br0 dpdk1 -- \
    set Interface dpdk1 type=dpdk -- \
    set Interface dpdk1 options:dpdk-devargs=0000:01:00.1 -- \
    set Interface dpdk1 ofport_request=2 -- \
    set Interface dpdk1 options:n_rxq=2
# ovs-vsctl add-port br0 vhost0 -- \
    set Interface vhost0 type=dpdkvhostuserclient -- \
    set Interface vhost0 options:vhost-server-path="/tmp/vhost-sock0" -- \
    set Interface vhost0 ofport_request=4
# ovs-vsctl add-port br0 vhost1 -- \
    set Interface vhost1 type=dpdkvhostuserclient -- \
    set Interface vhost1 options:vhost-server-path="/tmp/vhost-sock1" -- \
    set Interface vhost1 ofport_request=5

Check the polling configuration:

# ovs-appctl dpif-netdev/pmd-rxq-show
pmd thread numa_id 0 core_id 1:
  isolated : false
  port: dpdk0             queue-id:  0  pmd usage: NOT AVAIL
  port: dpdk1             queue-id:  1  pmd usage: NOT AVAIL
  port: vhost0            queue-id:  0  pmd usage: NOT AVAIL
pmd thread numa_id 0 core_id 15:
  isolated : false
  port: dpdk0             queue-id:  1  pmd usage: NOT AVAIL
  port: dpdk1             queue-id:  0  pmd usage: NOT AVAIL
  port: vhost1            queue-id:  0  pmd usage: NOT AVAIL

We could leave this OVS bridge with a NORMAL action, in which case it would behave like a standard switch, learning Media Access Control (MAC) addresses on its ports. To simplify the setup, let's instead write a few OpenFlow rules for a basic mapping:

  • Receiving on the physical port dpdk0 pushes packets to vhost0.
  • Receiving on the virtual port vhost0 pushes packets to dpdk0.
  • Receiving on the physical port dpdk1 pushes packets to vhost1.
  • Receiving on the virtual port vhost1 pushes packets to dpdk1.

Here's that mapping:

# ovs-ofctl del-flows br0
# ovs-ofctl add-flow br0 in_port=1,actions=4
# ovs-ofctl add-flow br0 in_port=4,actions=1
# ovs-ofctl add-flow br0 in_port=2,actions=5
# ovs-ofctl add-flow br0 in_port=5,actions=2
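
You can confirm the rules are in place, and later watch their packet counters increase as traffic flows, by dumping the flow table:

# ovs-ofctl dump-flows br0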

That completes the test environment.

Catching vHost TX contention

Given the polling configuration above, both PMD threads receive packets from dpdk0 (one queue each), and our OpenFlow rules make both of them transmit to vhost0. Because the two threads share the same vhost0 transmit queue, they compete for its lock. Let's take a look at the new coverage counter that tracks this in OVS:

# ovs-appctl coverage/show |grep vhost
vhost_tx_contention      39082.8/sec 11553.017/sec      192.5503/sec   total: 758359
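
The three per-second figures are the counter's average rates over the last five seconds, minute, and hour, and the total counts events since the daemon started. If a monitoring script only needs the running total for a single counter, newer OVS releases can also query it directly (check that your version supports coverage/read-counter):

# ovs-appctl coverage/read-counter vhost_tx_contention
758359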

Adding a perf probe

As it stands, the counter leaves open the question of which cores are affected by the contention. We can use perf to gather more information without stopping OVS. Just add a probe in the branch where the contention occurs (here, line 22 of the __netdev_dpdk_vhost_send function), capturing the port name and the queue ID:

# perf probe -x $(which ovs-vswitchd) 'netdev_dpdk_vhost_tx_lock=__netdev_dpdk_vhost_send:22 netdev->name:string qid'
Added new event:
  probe_ovs:netdev_dpdk_vhost_tx_lock (on __netdev_dpdk_vhost_send:22 in /usr/sbin/ovs-vswitchd with name=netdev->name:string qid)
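
The probe stays attached until you remove it. The standard perf probe options let you list active probes and delete this one once you're done:

# perf probe --list
# perf probe --del probe_ovs:netdev_dpdk_vhost_tx_lock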

Now you can use this event with all of your perf tools.

Using the probe event in perf

Here, we ask perf to record a specific event:

# perf record -e probe_ovs:netdev_dpdk_vhost_tx_lock -aR sleep 1
[ perf record: Woken up 15 times to write data ]
[ perf record: Captured and wrote 3.938 MB perf.data (44059 samples) ]
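
Before aggregating anything, we can dump the individual samples with perf script. Each sample carries a timestamp, the PMD thread that hit the probe, and the name and qid values captured by it:

# perf script | head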

We can also generate a report from this perf session:

# perf report -F +pid --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 44K of event 'probe_ovs:netdev_dpdk_vhost_tx_lock'
# Event count (approx.): 44059
#
# Overhead      Pid:Command  Trace output
# ........  ...............  ..................................
#
    61.30%    33003:pmd60    (55ef4abe5494) name="vhost0" qid=0
    38.70%    33006:pmd61    (55ef4abe5494) name="vhost0" qid=0

#
# (Tip: For a higher level overview, try: perf report --sort comm,dso)
#

This report removes the guesswork left by the coverage counter. We can see that the contention happens between pmd60 (on core 1, per the OVS logs) and pmd61 (on core 15): both PMD threads are trying to send packets on queue zero of the vhost0 port, which matches what we predicted from the polling configuration.
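
If you'd rather not dig through the logs, another way to map the pmd thread names to cores is to list the ovs-vswitchd threads with ps. The psr column shows the processor each thread last ran on, which for pinned PMD threads is their assigned core:

# ps -L -p $(pidof ovs-vswitchd) -o lwp,comm,psr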

Conclusion

Using perf to debug contention is interesting, and it worked in this case because we were trying to catch events on an error or slow path. But perf involves context switches that have a visible effect on performance, so we can't use it without accounting for that impact.

Even if it's fine for developers to place a probe after reading the source code, support and operations teams will prefer higher-level tools or traces. The DPDK community has started a workgroup to add minimal-impact traces at key places in the DPDK infrastructure code. We are still far from something as rich as perf, but this is likely to be a focus for part of the community over the next year.

Last updated: June 26, 2020