It isn't always easy to understand how Open vSwitch (OVS) cycles are spent, especially because various parameters and configuration options can affect how OVS behaves. Members of the Open vSwitch community are actively working to understand what causes packet drops in Open vSwitch. Efforts so far have included adding a custom statistic for vHost TX retries, tracking vHost TX contention, and adding a coverage counter to count vHost IRQs. We are particularly interested in the user space datapath that uses the Data Plane Development Kit (DPDK) for fast I/O.
Adding these statistics is an ongoing effort, and we won't cover every corner case. In some cases, the statistics still leave doubts about what is causing a given behavior.
In this article, I will introduce a new counter we've added to learn more about contention in the vHost transmission path. I'll also show you how to use the new counter with perf, and I'll discuss what's next for our ongoing efforts.
The test environment for reproducing contention
In this section, we'll set up a test environment to reproduce contention in the vHost transmission path. Our reference system is running Red Hat Enterprise Linux (RHEL) 7.7 with openvswitch2.11-2.11.0-35.el7fdp.x86_64. You can skip this section if your environment is set up and running already.
Configuring OVS
Assuming you have Red Hat Enterprise Linux 7.7 already running on your system, you can configure OVS with a single bridge that has two plugged-in physical ports and two vhost-user-client ports.
The physical ports are connected to a TRex Realistic Traffic Generator. The traffic generator sends a unidirectional flow of packets to the bridge, which forwards them to the vhost-user-client ports. We've crafted these packets so that traffic reaches both queues of the first physical port on the OVS side. The vhost-user-client ports are connected to a virtual machine (VM) that sends the packets back using testpmd in io forward mode.
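If you want a concrete starting point inside the VM, a testpmd session along these lines runs io forwarding between the two virtio ports. Treat it as a sketch: the core list, memory settings, and queue counts here are assumptions that depend on your VM layout.

# testpmd -l 0,1,2 --socket-mem 1024 -n 4 -- -i --nb-cores=2 --rxq=1 --txq=1
testpmd> set fwd io
testpmd> start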
If you need more details about setting up a TRex traffic generator, configuring a host running OVS, or other aspects of this setup, see Eelco Chaudron's introduction to Automated Open vSwitch PVP testing.
Configuring the bridge and host
The setup in the ASCII diagram here differs slightly from the Physical interface to Virtual interface back to Physical interface (PVP) setup shown in the linked article above.
+------+   +-----+   +---------+
|      |   |     |   |         |
|     0+---+1   4+---+0        |
| tgen |   | ovs |   | testpmd |
|     1+---+2   5+---+1        |
|      |   |     |   |         |
+------+   +-----+   +---------+
Configure the bridge on the host as follows:
# ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
# ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x00008002
# ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev
# ovs-vsctl add-port br0 dpdk0 -- \
    set Interface dpdk0 type=dpdk -- \
    set Interface dpdk0 options:dpdk-devargs=0000:01:00.0 -- \
    set Interface dpdk0 ofport_request=1 -- \
    set Interface dpdk0 options:n_rxq=2
# ovs-vsctl add-port br0 dpdk1 -- \
    set Interface dpdk1 type=dpdk -- \
    set Interface dpdk1 options:dpdk-devargs=0000:01:00.1 -- \
    set Interface dpdk1 ofport_request=2 -- \
    set Interface dpdk1 options:n_rxq=2
# ovs-vsctl add-port br0 vhost0 -- \
    set Interface vhost0 type=dpdkvhostuserclient -- \
    set Interface vhost0 options:vhost-server-path="/tmp/vhost-sock0" -- \
    set Interface vhost0 ofport_request=4
# ovs-vsctl add-port br0 vhost1 -- \
    set Interface vhost1 type=dpdkvhostuserclient -- \
    set Interface vhost1 options:vhost-server-path="/tmp/vhost-sock1" -- \
    set Interface vhost1 ofport_request=5
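Before wiring up the flows, it's worth confirming that the DPDK ports attached without errors. One quick check (a standard ovs-vsctl query, shown here as a suggestion) is to list the error column of each interface, which should be empty:

# ovs-vsctl --columns=name,error list Interface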
Check the polling configuration. (The pmd-cpu-mask value 0x00008002 has bits 1 and 15 set, so the poll mode driver (PMD) threads run on cores 1 and 15.)
# ovs-appctl dpif-netdev/pmd-rxq-show
pmd thread numa_id 0 core_id 1:
  isolated : false
  port: dpdk0             queue-id:  0  pmd usage: NOT AVAIL
  port: dpdk1             queue-id:  1  pmd usage: NOT AVAIL
  port: vhost0            queue-id:  0  pmd usage: NOT AVAIL
pmd thread numa_id 0 core_id 15:
  isolated : false
  port: dpdk0             queue-id:  1  pmd usage: NOT AVAIL
  port: dpdk1             queue-id:  0  pmd usage: NOT AVAIL
  port: vhost1            queue-id:  0  pmd usage: NOT AVAIL

Note that dpdk0's two receive queues are split across the two PMD threads, while vhost0 has a single queue. Once traffic arrives on both dpdk0 queues, both threads will transmit to vhost0 and contend for its single TX queue.
We could leave this OVS bridge with a NORMAL action, in which case it would behave like a standard switch, learning Media Access Control (MAC) addresses on its ports. To simplify the setup, let's instead write a few OpenFlow rules for a basic mapping:
- Receiving on the physical port dpdk0 pushes packets to vhost0.
- Receiving on the virtual port vhost0 pushes packets to dpdk0.
- Receiving on the physical port dpdk1 pushes packets to vhost1.
- Receiving on the virtual port vhost1 pushes packets to dpdk1.
Here's that mapping:
# ovs-ofctl del-flows br0
# ovs-ofctl add-flow br0 in_port=1,actions=4
# ovs-ofctl add-flow br0 in_port=4,actions=1
# ovs-ofctl add-flow br0 in_port=2,actions=5
# ovs-ofctl add-flow br0 in_port=5,actions=2
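For comparison, the NORMAL behavior mentioned earlier would replace all four rules with a single flow. We don't use it here, but it is a one-liner:

# ovs-ofctl add-flow br0 actions=NORMAL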
That completes the test environment.
Catching vHost TX contention
Now let's take a look at the new coverage counter for OVS. It is incremented every time a PMD thread fails to immediately take the lock that protects a vHost TX queue and has to wait for it. For each counter, coverage/show reports the per-second rate averaged over the last 5 seconds, the last minute, and the last hour, followed by the total count:
# ovs-appctl coverage/show | grep vhost
vhost_tx_contention      39082.8/sec  11553.017/sec   192.5503/sec  total: 758359
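To see how the counter evolves while you vary the traffic, you can simply poll it. For example, with watch (assuming it is available on the host), differences are highlighted between refreshes:

# watch -d -n1 'ovs-appctl coverage/show | grep vhost_tx_contention'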
Adding a perf probe
As it is, the counter leaves open the question of which cores are impacted by contention. We can use perf to gather more information without stopping OVS. Just add a probe in the branch where the contention occurs:
# perf probe -x $(which ovs-vswitchd) 'netdev_dpdk_vhost_tx_lock=__netdev_dpdk_vhost_send:22 netdev->name:string qid'
Added new event:
  probe_ovs:netdev_dpdk_vhost_tx_lock (on __netdev_dpdk_vhost_send:22 in /usr/sbin/ovs-vswitchd with name=netdev->name:string qid)
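The :22 offset points at the slow path taken when the lock is already held, and it can move between OVS versions. If you need to locate it yourself, perf can list the probeable lines of a function and the variables accessible at a given spot, provided the matching debuginfo packages are installed:

# perf probe -x $(which ovs-vswitchd) -L __netdev_dpdk_vhost_send
# perf probe -x $(which ovs-vswitchd) -V __netdev_dpdk_vhost_send:22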
Now you can use this new probe event in all of your perf tools.
Using the new probe in perf
Here, we ask perf to record a specific event:
# perf record -e probe_ovs:netdev_dpdk_vhost_tx_lock -aR sleep 1
[ perf record: Woken up 15 times to write data ]
[ perf record: Captured and wrote 3.938 MB perf.data (44059 samples) ]
We can also generate a report from this perf session:
# perf report -F +pid --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
# Total Lost Samples: 0
#
# Samples: 44K of event 'probe_ovs:netdev_dpdk_vhost_tx_lock'
# Event count (approx.): 44059
#
# Overhead  Pid:Command      Trace output
# ........  ...............  ..................................
#
    61.30%  33003:pmd60      (55ef4abe5494) name="vhost0" qid=0
    38.70%  33006:pmd61      (55ef4abe5494) name="vhost0" qid=0

#
# (Tip: For a higher level overview, try: perf report --sort comm,dso)
#
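The report aggregates all the samples. If you would rather inspect individual events, with their timestamps and the CPU they fired on, perf script dumps them one per line:

# perf script | head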
The report makes the contention much easier to interpret. We can see that the contention happened between pmd60 (on core 1, by looking at the OVS logs) and pmd61 (on core 15): both pmd threads are trying to send packets to queue zero of the vhost0 port.
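As an alternative to digging through the OVS logs, you can map the pmd thread names to the CPU they are currently running on with a plain ps query (psr is the processor column):

# ps -T -o tid,comm,psr -p $(pidof ovs-vswitchd)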
Conclusion
Using perf to debug contention is interesting, and it worked in this case because we were trying to catch events on an error or slow path. But perf involves context switches that have a visible effect on performance, so we can't use it without accounting for that impact.
Even if it's fine for developers to place a probe by reading the source code, support and operations teams will prefer higher-level tools or traces. The DPDK community has started a workgroup to add minimal-impact traces at key places in the DPDK infrastructure code. We are still far from something as rich as perf, but this is likely to be a focus for part of the community over the next year.