Open vSwitch (OvS), an open source tool for creating virtual Layer 2 networks, relies in some use cases on connection tracking. The recent 3.0.0 release of OvS included this patch series to improve multithreaded scalability, which makes connection tracking more efficient when OvS runs on multiple CPUs. This article shows how to measure the performance of connection tracking with OvS.

What is connection tracking and why is it critical?

Connection tracking, or conntrack, maintains an internal table of logical network connections (also called flows). The table identifies all packets that make up each flow so that they can be handled consistently.

Conntrack is a requirement for network address translation (NAT)—in IP address masquerading, for example (described in detail in RFC 3022). Conntrack is also required for stateful firewalls, load balancers, intrusion detection and prevention systems, and deep packet inspection. More specifically, OvS conntrack rules are used to isolate different OpenStack virtual networks (aka security groups).

Connection tracking is usually implemented by storing known connection entries in a table, indexed by a bidirectional 5-tuple consisting of a protocol, source address, destination address, source port, and destination port. Each entry also has a state as seen from the connection tracking system. The state (new, established, closed, etc.) is updated every time a packet matching its 5-tuple is processed. If a received packet does not match any existing conntrack entry, a new one is created and inserted into the table.
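To make the description concrete, here is a toy model of such a table in Python. This is purely illustrative (not how OvS implements conntrack); the key normalization is what makes the 5-tuple bidirectional, so a request and its reply hit the same entry:

```python
# Toy connection tracking table (illustrative only, not OvS's implementation).
table = {}

def conn_key(proto, src, dst, sport, dport):
    """Normalize the 5-tuple so both directions of a flow map to the same key."""
    a, b = (src, sport), (dst, dport)
    return (proto, a, b) if a <= b else (proto, b, a)

def track(proto, src, dst, sport, dport):
    """Look up (or create) the entry for a packet and update its state."""
    key = conn_key(proto, src, dst, sport, dport)
    if key not in table:
        table[key] = "new"          # no match: insert a fresh entry
    elif table[key] == "new":
        table[key] = "established"  # a second packet promotes the state
    return table[key]
```

With this sketch, `track("tcp", "10.0.0.1", "10.0.0.2", 1234, 80)` creates a `new` entry, and the reverse-direction call for the reply finds the same entry and promotes it to `established`.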

Performance aspects

There are two aspects to consider when measuring conntrack performance.

  • How many new connections can be handled per second? The answer depends on the following details:

    • What is the cost of looking up an existing connection entry for each received packet?
    • Can multiple threads insert and destroy conntrack entries concurrently?
    • What is the cost of creating a conntrack entry for a new connection?
    • How many packets are exchanged per connection?
  • How many concurrent connections can the system support? The answer depends on the following details:

    • What is the size of the conntrack table?
    • What is the duration of each individual connection?
    • After a connection has been closed, how long does the conntrack entry linger in the table until it is expunged to make room for new connections? What if the connection is not closed but no longer exchanges traffic (because the client or server crashed or disconnected)?
    • What happens when the conntrack table is full?

These two aspects of performance are linked: even a modest rate of new connections will eventually fill the conntrack table if the connections are long-lived.

In order to properly size the connection tracking table, one needs to know the average number of new connections per second and their average duration. Testing also requires tuning the timeout values of the conntrack engine.
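The sizing intuition is Little's law: the number of entries in flight is roughly the new-connection rate multiplied by how long each entry lives (connection duration plus the time a closed entry lingers before being expunged). A back-of-the-envelope calculation, with made-up numbers, looks like this:

```python
# Back-of-the-envelope conntrack table sizing (illustrative numbers).
new_conns_per_sec = 50_000   # average rate of new connections
avg_duration_s = 30.0        # average connection lifetime
linger_timeout_s = 1.0       # how long a closed entry stays in the table

# Little's law: entries in flight = arrival rate x time spent in the table.
expected_entries = new_conns_per_sec * (avg_duration_s + linger_timeout_s)
print(f"{expected_entries:,.0f} entries")  # 1,550,000 entries
```

This is why both the average rate and the average duration matter, and why aggressive timeouts shrink the table a high connection rate would otherwise fill.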

Benchmarking process

To take the measurements necessary to answer the questions in the previous section, you need a way to simulate clients and servers. Such a system must specify how many clients and servers to test, how many connections per second they are creating, how long the connections are, and how much data is exchanged in each connection.

A few commercial traffic generators have these capabilities, more or less refined. This article describes how to carry out the simulation with TRex—an open source traffic generator based on the Data Plane Development Kit (DPDK).

TRex has multiple modes of operation. This article uses the advanced stateful (ASTF) mode, which allows TRex to simulate TCP and UDP endpoints. I have tailored a script using the TRex Python API to perform benchmarks in a manner like RFC 2544, but focusing on how many new connections can be created per second.

Basically, this script connects to a running TRex server started in ASTF mode and creates TCP and UDP connection profiles. These profiles are state machines representing clients and servers with dynamic IP addresses and ports. You can define the number of data exchanges and their sizes, add some arbitrary wait time to simulate network latency, etc. TRex takes care of translating your specifications into real traffic.

Here is a stripped-down example, in Python, of a TCP connection profile:

client = ASTFProgram(stream=True)
server = ASTFProgram(stream=True)
for _ in range(num_messages):
    client.send(message_size * b"x")
    server.recv(message_size)
    if server_wait > 0:
        server.delay(server_wait * 1000)  # trex wants microseconds
    server.send(message_size * b"y")
    client.recv(message_size)

tcp_profile = ASTFTemplate(
    client_template=ASTFTCPClientTemplate(
        program=client,
        port=8080,
        cps=99, # base value which is changed during the binary search
        cont=True,
    ),
    server_template=ASTFTCPServerTemplate(
        program=server, assoc=ASTFAssociationRule(port=8080)
    ),
)
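The RFC 2544-style search that cps_ndr.py performs (visible in the lower/current/upper values of the runs shown later) can be sketched as a plain binary search. In this simplified sketch, measure() is a hypothetical stand-in for running the TRex profile at a given connection rate and returning the observed drop ratio:

```python
def find_ndr(measure, lo=1_000, hi=100_000, error_threshold=0.02,
             max_iterations=8):
    """Binary-search the highest conn/s rate whose drop ratio stays
    under error_threshold (a simplified sketch of cps_ndr.py)."""
    best = lo
    for _ in range(max_iterations):
        current = (lo + hi) // 2
        if measure(current) <= error_threshold:
            best = lo = current   # under the threshold: search higher
        else:
            hi = current          # too many drops: search lower
    return best

# Fake device under test that starts dropping packets above 42K conn/s:
ndr = find_ndr(lambda cps: 0.0 if cps <= 42_000 else 0.05)
```

Each iteration halves the search interval, so eight iterations narrow a 1K-100K range down to within a few hundred connections per second of the true non-drop rate.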

Setup

The device under test (DUT) runs the ovs-vswitchd Open vSwitch daemon with the user-space DPDK datapath, but the same setup can be used to benchmark any connection-tracking device. The procedure is deliberately simple and does not represent an actual production workload; however, it lets you stress the connection tracking code path without worrying about external details.

Figure 1 illustrates the relationship between the DUT and the traffic generator. Traffic simulating the clients travels from port0 to port1 on the traffic generator, through the DUT. Server traffic travels from port1 to port0 on the traffic generator. Conntrack flows are programmed on br0 to allow new connections to be established only from port0 to port1 (from "clients" to "servers"), and to let reply packets on established connections pass from port1 to port0 (from "servers" to "clients").

Network topology diagram
Figure 1: Network topology.

Base system

Both the OvS user-space datapath and TRex use DPDK. The settings shown in this section are common to both machines.

DPDK requires compatible network interfaces. The example in this article runs on the last two ports of an Intel X710 PCI network interface. The following commands show the hardware in use:

[root@* ~]# lscpu | grep -e "^Model name:" -e "^NUMA" -e MHz
NUMA node(s):        1
Model name:          Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
CPU MHz:             2700.087
NUMA node0 CPU(s):   0-23
[root@* ~]# grep ^MemTotal /proc/meminfo
MemTotal:       65373528 kB
[root@* ~]# lspci | grep X710 | tail -n2
18:00.2 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
18:00.3 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)

Note: To make things simpler, all commands in this article are executed as the root user.

The CPUs used by TRex and OvS need to be isolated in order to minimize disturbance from the other tasks running on Linux. Therefore, the following commands isolate CPUs from the NUMA node where the PCI NIC is connected. CPUs 0 and 12 are left to Linux:

dnf install -y tuned tuned-profiles-cpu-partitioning
cat > /etc/tuned/cpu-partitioning-variables.conf <<EOF
isolated_cores=1-11,13-23
no_balance_cores=1-11,13-23
EOF
tuned-adm profile cpu-partitioning

Finally, DPDK applications require huge pages. It is best to allocate them at boot time to ensure that they are all mapped to contiguous chunks of memory:

cat >> /etc/default/grub <<EOF
GRUB_CMDLINE_LINUX="\$GRUB_CMDLINE_LINUX intel_iommu=on iommu=pt"
GRUB_CMDLINE_LINUX="\$GRUB_CMDLINE_LINUX hugepagesz=1G hugepages=32"
EOF
grub2-mkconfig -o /etc/grub2.cfg
dnf install -y driverctl
driverctl set-override 0000:18:00.2 vfio-pci
driverctl set-override 0000:18:00.3 vfio-pci
# reboot is required to apply isolcpus and allocate hugepages on boot
systemctl reboot

TRex and the traffic generator

TRex needs to be compiled from source. The following commands download and build the program:

dnf install -y python3 git numactl-devel zlib-devel gcc-c++ gcc
git clone https://github.com/cisco-system-traffic-generator/trex-core ~/trex
cd ~/trex/linux_dpdk
./b configure
taskset 0xffffffffff ./b build

We use the following configuration in /etc/trex_cfg.yaml:

- version: 2
  interfaces:
    - "18:00.2"
    - "18:00.3"
  rx_desc: 4096
  tx_desc: 4096
  port_info:
    - dest_mac: "04:3f:72:f2:8f:33"
      src_mac:  "04:3f:72:f2:8f:32"
    - dest_mac: "04:3f:72:f2:8f:32"
      src_mac:  "04:3f:72:f2:8f:33"

  c: 22
  memory:
    mbuf_64: 30000
    mbuf_128: 500000
    mbuf_256: 30717
    mbuf_512: 30720
    mbuf_1024: 30720
    mbuf_2048: 4096

  platform:
    master_thread_id: 0
    latency_thread_id: 12
    dual_if:
      - socket: 0
        threads: [
           1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11,
          13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,
        ]

Finally, we can start TRex:

cd ~/trex/scripts
./t-rex-64 -i --astf

The TRex daemon runs in the foreground. In a separate terminal, the cps_ndr.py script connects to the daemon via the JSON-RPC API.

The device under test

First, let's compile and install DPDK:

dnf install -y git meson ninja-build gcc python3-pyelftools
git clone -b v21.11 https://github.com/DPDK/dpdk ~/dpdk
cd ~/dpdk
meson build
taskset 0xffffff ninja -C ~/dpdk/build install

Then compile and install OVS. In the following console excerpt, I explicitly check out version 2.17.2. Version 3.0.0 will be recompiled before running all tests again:

dnf install -y gcc-c++ make libtool autoconf automake
git clone -b v2.17.2 https://github.com/openvswitch/ovs ~/ovs
cd ~/ovs
./boot.sh
PKG_CONFIG_PATH="/usr/local/lib64/pkgconfig" ./configure --with-dpdk=static
taskset 0xffffff make install -j24
/usr/local/share/openvswitch/scripts/ovs-ctl start

Here I enable the DPDK user-space datapath and configure a bridge with two ports. For now, there is only one receive (RX) queue per port, and one CPU is assigned to poll them. I will increase these parameters along the way.

I set the conntrack table size to a relatively large value (5 million entries) to reduce the risk of it getting full during tests. Also, I configure the various timeout policies to match the traffic profiles I am about to send. These aggressive timeouts help prevent the table from getting full. The default timeout values are very conservative—they're too long to achieve high numbers of connections per second without filling the conntrack table:

ovs-vsctl set open_vswitch . other_config:dpdk-init=true
ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x4"
/usr/local/share/openvswitch/scripts/ovs-ctl restart
ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev
ovs-vsctl add-port br0 port0 -- \
    set Interface port0 type=dpdk options:dpdk-devargs=0000:18:00.2
ovs-vsctl add-port br0 port1 -- \
    set Interface port1 type=dpdk options:dpdk-devargs=0000:18:00.3

ovs-appctl dpctl/ct-set-maxconns 5000000
# creating an empty datapath record is required to add a zone timeout policy
ovs-vsctl -- --id=@m create Datapath datapath_version=0 -- \
    set Open_vSwitch . datapaths:"netdev"=@m
ovs-vsctl add-zone-tp netdev zone=0 \
    udp_first=1 udp_single=1 udp_multiple=30 tcp_syn_sent=1 \
    tcp_syn_recv=1 tcp_fin_wait=1 tcp_time_wait=1 tcp_close=1 \
    tcp_established=30

cat > ~/ct-flows.txt << EOF
priority=1 ip ct_state=-trk                   actions=ct(table=0)
priority=1 ip ct_state=+trk+new in_port=port0 actions=ct(commit),normal
priority=1 ip ct_state=+trk+est               actions=normal
priority=0 actions=drop
EOF

Test procedure

The cps_ndr.py script that I have written has multiple parameters to control the nature of the generated connections:

  • Ratio of TCP connections to UDP connections
  • Number of data messages (request + response) exchanged per connection (excluding protocol overhead)
  • Size of data messages in bytes (to emulate the TCP maximum segment size)
  • Time in milliseconds that the simulated servers wait before sending a response to a request

In the context of this benchmark, I intentionally keep the size of data messages fixed to 20 bytes, to avoid being limited by the 10Gbit bandwidth.
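A quick sanity check shows why 20-byte messages are safe even at the highest rate TRex reaches back to back in this setup (about 1.8M conn/s, as shown in the baseline results later). The payload bandwidth stays far below the 10 Gbit/s link; most of what actually goes on the wire is protocol overhead:

```python
# Payload bandwidth of the short-lived profile at TRex's maximum rate.
conn_rate = 1_800_000        # conn/s, back-to-back maximum of this setup
payload_bytes = 2 * 20       # one 20-byte request + one 20-byte reply
payload_bits_per_sec = conn_rate * payload_bytes * 8

print(payload_bits_per_sec / 1e9)  # ~0.58 Gbit/s on a 10 Gbit/s link
```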

I run two types of test: one with short-lived connections and one with long-lived connections. Both profiles are tested against OVS versions 2.17.2 and 3.0.0, in several configurations, to check whether performance scales with the number of CPUs and receive queues.

Short-lived connections

The parameters of this test consist of sending 40 data bytes per connection (1 request + 1 reply of 20 bytes each), with no wait by the server before sending the replies. These parameters stress the conntrack creation and destruction code path.

An example run follows:

[root@tgen scripts]# ./cps_ndr.py --sample-time 30 --max-iterations 8 \
>    --error-threshold 0.02 --udp-percent 1 --num-messages 1 \
>    --message-size 20 --server-wait 0 -m 1k -M 100k
... iteration #1: lower=1.0K current=50.5K upper=100K
▼▼▼ Flows: active 26.8K (50.1K/s) TX: 215Mb/s (345Kp/s) RX: 215Mb/s (345Kp/s) Size: ~4.5B
err dropped: 1.6K pkts (1.6K/s) ~ 0.4746%
... iteration #2: lower=1.0K current=25.8K upper=50.5K
▲▲▲ Flows: active 12.9K (25.7K/s) TX: 112Mb/s (179Kp/s) RX: 112Mb/s (179Kp/s) Size: ~4.5B
... iteration #3: lower=25.8K current=38.1K upper=50.5K
▲▲▲ Flows: active 19.1K (38.1K/s) TX: 166Mb/s (266Kp/s) RX: 166Mb/s (266Kp/s) Size: ~4.5B
... iteration #4: lower=38.1K current=44.3K upper=50.5K
▼▼▼ Flows: active 22.2K (44.2K/s) TX: 192Mb/s (307Kp/s) RX: 191Mb/s (307Kp/s) Size: ~4.5B
err dropped: 1.3K pkts (125/s) ~ 0.0408%
... iteration #5: lower=38.1K current=41.2K upper=44.3K
▲▲▲ Flows: active 20.7K (41.2K/s) TX: 178Mb/s (286Kp/s) RX: 178Mb/s (286Kp/s) Size: ~4.5B
... iteration #6: lower=41.2K current=42.8K upper=44.3K
▼▼▼ Flows: active 21.5K (42.6K/s) TX: 185Mb/s (296Kp/s) RX: 185Mb/s (296Kp/s) Size: ~4.5B
err dropped: 994 pkts (99/s) ~ 0.0335%
... iteration #7: lower=41.2K current=42.0K upper=42.8K
▼▼▼ Flows: active 21.0K (41.8K/s) TX: 181Mb/s (290Kp/s) RX: 181Mb/s (290Kp/s) Size: ~4.5B
err dropped: 877 pkts (87/s) ~ 0.0301%
... iteration #8: lower=41.2K current=41.6K upper=42.0K
▲▲▲ Flows: active 20.9K (41.4K/s) TX: 180Mb/s (289Kp/s) RX: 180Mb/s (289Kp/s) Size: ~4.5B

Long-lived connections

The parameters of this test consist of sending 20K data bytes per connection (500 requests + 500 replies of 20 bytes each) over 25 seconds. These parameters stress the conntrack lookup code path.
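These parameters also determine the connection duration and, through it, the number of concurrent flows: 500 request/response exchanges with a 50 ms server wait take about 25 seconds per connection, so at a rate near 1K conn/s roughly 25K flows are active at any time, which matches the sample run shown next:

```python
# Expected connection duration and concurrency for the long-lived profile.
num_messages = 500        # request/response exchanges per connection
server_wait_s = 0.050     # --server-wait 50 (milliseconds)
conn_rate = 1_000         # conn/s, near the measured non-drop rate

duration_s = num_messages * server_wait_s   # 25.0 s per connection
active_flows = conn_rate * duration_s       # ~25,000 concurrent flows
```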

An example run follows:

[root@tgen scripts]# ./cps_ndr.py --sample-time 120 --max-iterations 8 \
>    --error-threshold 0.02 --udp-percent 1 --num-messages 500 \
>    --message-size 20 --server-wait 50 -m 500 -M 2k
... iteration #1: lower=500 current=1.2K upper=2.0K
▼▼▼ Flows: active 48.5K (1.2K/s) TX: 991Mb/s (1.5Mp/s) RX: 940Mb/s (1.4Mp/s) Size: ~13.3B
err dropped: 1.8M pkts (30.6K/s) ~ 2.4615%
... iteration #2: lower=500 current=875 upper=1.2K
▲▲▲ Flows: active 22.5K (871/s) TX: 871Mb/s (1.3Mp/s) RX: 871Mb/s (1.3Mp/s) Size: ~13.3B
... iteration #3: lower=875 current=1.1K upper=1.2K
▼▼▼ Flows: active 33.8K (1.1K/s) TX: 967Mb/s (1.4Mp/s) RX: 950Mb/s (1.4Mp/s) Size: ~13.3B
err dropped: 621K pkts (10.3K/s) ~ 0.7174%
... iteration #4: lower=875 current=968 upper=1.1K
▲▲▲ Flows: active 24.9K (965/s) TX: 961Mb/s (1.4Mp/s) RX: 962Mb/s (1.4Mp/s) Size: ~13.3B
... iteration #5: lower=968 current=1.0K upper=1.1K
▼▼▼ Flows: active 29.8K (1.0K/s) TX: 965Mb/s (1.4Mp/s) RX: 957Mb/s (1.4Mp/s) Size: ~13.3B
err dropped: 334K pkts (5.6K/s) ~ 0.3830%
... iteration #6: lower=968 current=992 upper=1.0K
▼▼▼ Flows: active 25.5K (989/s) TX: 964Mb/s (1.4Mp/s) RX: 964Mb/s (1.4Mp/s) Size: ~13.3B
err dropped: 460 pkts (460/s) ~ 0.0314%
... iteration #7: lower=968 current=980 upper=992
▼▼▼ Flows: active 25.3K (977/s) TX: 962Mb/s (1.4Mp/s) RX: 962Mb/s (1.4Mp/s) Size: ~13.3B
err dropped: 397 pkts (397/s) ~ 0.0272%
... iteration #8: lower=968 current=974 upper=980
▲▲▲ Flows: active 25.1K (971/s) TX: 969Mb/s (1.5Mp/s) RX: 969Mb/s (1.5Mp/s) Size: ~13.3B

Performance statistics

This section presents results of runs with varying numbers of CPUs and queues on my test system. The numbers that I measured should be taken with a grain of salt. Connection tracking performance is highly dependent on hardware, traffic profile, and overall system load. I provide the statistics here just to give a general idea of the improvement brought by OVS 3.0.0.

Baseline results for comparison

For reference, the tests were executed with a cable connecting port0 and port1 of the traffic generator machine. This is the maximum performance TRex is able to achieve with this configuration and hardware.

Table 1: Maximum traffic generator performance.
Type Connection rate Active flows Bandwidth Packet rate
Short-lived 1.8M conn/s 1.7M 8.4G bit/s 12.7M pkt/s
Long-lived 11.1K conn/s 898K 8.0G bit/s 11.4M pkt/s

1 CPU, 1 queue per port, without connection tracking

The results in this section were achieved with the following DUT configuration:

ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x4"
ovs-vsctl set Interface port0 options:n_rxq=1
ovs-vsctl set Interface port1 options:n_rxq=1
ovs-ofctl del-flows br0
ovs-ofctl add-flow br0 action=normal
Table 2: Short-lived connections with 1 CPU, 1 queue per port, without connection tracking.
Version Short-lived connections Active flows Bandwidth Packet rate Difference
2.17.2 1.0M conn/s 524.8K 4.5G bit/s 7.3M pkt/s  
3.0.0 1.0M conn/s 513.1K 4.5G bit/s 7.1M pkt/s -1.74%
Table 3: Long-lived connections with 1 CPU, 1 queue per port, without connection tracking.
Version Long-lived connections Active flows Bandwidth Packet rate Difference
2.17.2 3.1K conn/s 79.9K 3.1G bit/s 4.7M pkt/s  
3.0.0 2.8K conn/s 71.9K 2.8G bit/s 4.2M pkt/s -9.82%

There is a drop in performance, without connection tracking enabled, between versions 2.17.2 and 3.0.0. This drop is completely unrelated to the conntrack optimization patch series I am focusing on. It might be caused by some discrepancies in the test procedure, but it might also have been introduced by another patch series between the two tested versions.

1 CPU, 1 queue per port

The results in this section were achieved with the following DUT configuration:

ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x4"
ovs-vsctl set Interface port0 options:n_rxq=1
ovs-vsctl set Interface port1 options:n_rxq=1
ovs-ofctl del-flows br0
ovs-ofctl add-flows br0 ~/ct-flows.txt
Table 4: Short-lived connections with 1 CPU, 1 queue per port, with connection tracking.
Version Short-lived connections Active flows Bandwidth Packet rate Difference
2.17.2 39.7K conn/s 20.0K 172.0M bit/s 275.8K pkt/s  
3.0.0 48.2K conn/s 24.3K 208.9M bit/s 334.9K pkt/s +21.36%
Table 5: Long-lived connections with 1 CPU, 1 queue per port, with connection tracking.
Version Long-lived connections Active flows Bandwidth Packet rate Difference
2.17.2 959 conn/s 24.7K 956.6M bit/s 1.4M pkt/s  
3.0.0 1.2K conn/s 31.5K 1.2G bit/s 1.8M pkt/s +28.15%

Already here, we can see that the patch series improves the single-threaded performance of connection tracking in the creation, destruction, and lookup code paths. Keep these results in mind when looking at the multithreaded improvements.

2 CPUs, 1 queue per port

The results in this section were achieved with the following DUT configuration:

ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x2002"
ovs-vsctl set Interface port0 options:n_rxq=1
ovs-vsctl set Interface port1 options:n_rxq=1
ovs-ofctl del-flows br0
ovs-ofctl add-flows br0 ~/ct-flows.txt
Table 6: Short-lived connections with 2 CPUs, 1 queue per port.
Version Short-lived connections Active flows Bandwidth Packet rate Difference
2.17.2 39.9K conn/s 20.0K 172.8M bit/s 277.0K pkt/s  
3.0.0 46.8K conn/s 23.5K 202.7M bit/s 325.0K pkt/s +17.28%
Table 7: Long-lived connections with 2 CPUs, 1 queue per port.
Version Long-lived connections Active flows Bandwidth Packet rate Difference
2.17.2 885 conn/s 22.7K 883.1M bit/s 1.3M pkt/s  
3.0.0 1.1K conn/s 28.6K 1.1G bit/s 1.7M pkt/s +25.19%

It is worth noting that assigning twice as many CPUs to packet processing does not double the performance. Far from it: the numbers are essentially the same as (if not lower than) with only one CPU.

This surprising result is probably explained by the fact that there is only one RX queue per port, so each CPU ends up polling a single port.

2 CPUs, 2 queues per port

The results in this section were achieved with the following DUT configuration:

ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x2002"
ovs-vsctl set Interface port0 options:n_rxq=2
ovs-vsctl set Interface port1 options:n_rxq=2
ovs-ofctl del-flows br0
ovs-ofctl add-flows br0 ~/ct-flows.txt
Table 8: Short-lived connections with 2 CPUs, 2 queues per port.
Version Short-lived connections Active flows Bandwidth Packet rate Difference
2.17.2 48.3K conn/s 24.3K 208.8M bit/s 334.8K pkt/s  
3.0.0 65.9K conn/s 33.2K 286.8M bit/s 459.9K pkt/s +36.41%

For short-lived connections, we begin to see improvement beyond the single-threaded performance gain. Lock contention was reduced during the insertion and deletion of conntrack entries.

Table 9: Long-lived connections with 2 CPUs, 2 queues per port.

Version Long-lived connections Active flows Bandwidth Packet rate Difference
2.17.2 1.1K conn/s 29.1K 1.1G bit/s 1.7M pkt/s  
3.0.0 1.4K conn/s 37.0K 1.4G bit/s 2.2M pkt/s +26.77%

With two CPUs and two queues, once the single-threaded gain is factored out, there seems to be no additional improvement in conntrack lookup for long-lived connections.

4 CPUs, 2 queues per port

The results in this section were achieved with the following DUT configuration:

ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x6006"
ovs-vsctl set Interface port0 options:n_rxq=2
ovs-vsctl set Interface port1 options:n_rxq=2
ovs-ofctl del-flows br0
ovs-ofctl add-flows br0 ~/ct-flows.txt
Table 10: Short-lived connections with 4 CPUs, 2 queues per port.
Version Short-lived connections Active flows Bandwidth Packet rate Difference
2.17.2 47.4K conn/s 23.9K 206.2M bit/s 330.6K pkt/s  
3.0.0 49.1K conn/s 24.7K 212.1M bit/s 340.1K pkt/s +3.53%

Compared to the previous configuration (2 CPUs, 2 queues per port), the short-lived connection rate has dropped in 3.0.0. This is not a fluke: the numbers are consistent across multiple runs. The drop warrants some scrutiny, but it does not invalidate the work that has been done.

Table 11: Long-lived connections with 4 CPUs, 2 queues per port.
Version Long-lived connections Active flows Bandwidth Packet rate Difference
2.17.2 981 conn/s 25.2K 977.7M bit/s 1.5M pkt/s  
3.0.0 2.0K conn/s 52.4K 2.0G bit/s 3.1M pkt/s +108.31%

With four CPUs and two queues per port, long-lived connection tracking starts to scale up.

4 CPUs, 4 queues per port

The results in this section were achieved with the following DUT configuration:

ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x6006"
ovs-vsctl set Interface port0 options:n_rxq=4
ovs-vsctl set Interface port1 options:n_rxq=4
ovs-ofctl del-flows br0
ovs-ofctl add-flows br0 ~/ct-flows.txt
Table 12: Short-lived connections with 4 CPUs, 4 queues per port.
Version Short-lived connections Active flows Bandwidth Packet rate Difference
2.17.2 66.1K conn/s 33.2K 286.4M bit/s 459.2K pkt/s  
3.0.0 100.8K conn/s 50.6K 437.0M bit/s 700.6K pkt/s +52.55%
Table 13: Long-lived connections with 4 CPUs, 4 queues per port.
Version Long-lived connections Active flows Bandwidth Packet rate Difference
2.17.2 996 conn/s 25.9K 994.2M bit/s 1.5M pkt/s  
3.0.0 2.6K conn/s 67.0K 2.6G bit/s 3.9M pkt/s +162.89%

8 CPUs, 4 queues per port

The results in this section were achieved with the following DUT configuration:

ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x1e01e"
ovs-vsctl set Interface port0 options:n_rxq=4
ovs-vsctl set Interface port1 options:n_rxq=4
ovs-ofctl del-flows br0
ovs-ofctl add-flows br0 ~/ct-flows.txt
Table 14: Short-lived connections with 8 CPUs, 4 queues per port.
Version Short-lived connections Active flows Bandwidth Packet rate Difference
2.17.2 62.2K conn/s 31.3K 269.8M bit/s 432.5K pkt/s  
3.0.0 90.1K conn/s 45.2K 390.9M bit/s 626.7K pkt/s +44.89%
Table 15: Long-lived connections with 8 CPUs, 4 queues per port.
Version Long-lived connections Active flows Bandwidth Packet rate Difference
2.17.2 576 conn/s 17.1K 567.2M bit/s 852.5K pkt/s  
3.0.0 3.8K conn/s 97.8K 3.8G bit/s 5.7M pkt/s +562.76%

8 CPUs, 8 queues per port

The results in this section were achieved with the following DUT configuration:

ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x1e01e"
ovs-vsctl set Interface port0 options:n_rxq=8
ovs-vsctl set Interface port1 options:n_rxq=8
ovs-ofctl del-flows br0
ovs-ofctl add-flows br0 ~/ct-flows.txt
Table 16: Short-lived connections with 8 CPUs, 8 queues per port.
Version Short-lived connections Active flows Bandwidth Packet rate Difference
2.17.2 50.6K conn/s 25.5K 219.5M bit/s 351.9K pkt/s  
3.0.0 100.9K conn/s 50.7K 436.0M bit/s 698.9K pkt/s +99.36%
Table 17: Long-lived connections with 8 CPUs, 8 queues per port.
Version Long-lived connections Active flows Bandwidth Packet rate Difference
2.17.2 541 conn/s 14.0K 539.2M bit/s 810.3K pkt/s  
3.0.0 4.8K conn/s 124.1K 4.8G bit/s 7.2M pkt/s +792.83%

Performance improvements in version 3.0.0 of Open vSwitch

Using the tools in this article, I have been able to record advances made in version 3.0.0 in scaling and in handling long-lived connections.

Scaling

Figure 2 shows how many insertions and deletions per second were achieved on different system configurations for short-lived connections.

Chart showing improvements in scaling of short-lived connections tracking in version 3.0.0
Figure 2: Improvements in scaling of short-lived connections tracking in version 3.0.0.

Apart from the small blip with 4 CPUs and 2 queues per port, the conntrack insertion and deletion code path has improved consistently in OvS 3.0.0. Multithreaded lock contention remains, but it is less pronounced than in OvS 2.17.2.

Figure 3 shows how many insertions and deletions per second were achieved on different system configurations for long-lived connections.

Chart showing improvements in scaling of long-lived connections tracking in version 3.0.0.
Figure 3: Improvements in scaling of long-lived connections tracking in version 3.0.0.

Long-lived connection tracking is where the optimizations in OvS 3.0.0 really shine. The reduced lock contention in conntrack lookups makes performance scale significantly better with the number of CPUs.

Performance during high traffic

The following commands generate profiling reports using the Linux perf tool. I measured both version 2.17.2 and version 3.0.0 with 8 CPUs and 8 RX queues under maximum load for long-lived connections, with conntrack flows enabled. Only the events of a single CPU were captured:

perf record -g -C 1 sleep 60
perf report -U --no-children | grep '\[[\.k]\]' | head -15 > profile-$version.txt

In the subsections that follow, I have manually annotated lines that are directly related to acquiring mutexes so that they start with a * character. When a CPU is waiting for a mutex acquisition, it is not processing any network traffic, but waiting for another CPU to release the lock.

Performance in version 2.17.2

The profiled CPU spends almost 40% of its cycles acquiring locks and waiting for other CPUs to release locks:

* 30.99%  pmd-c01/id:5  libc.so.6          [.] pthread_mutex_lock@@GLIBC_2.2.5
  12.27%  pmd-c01/id:5  ovs-vswitchd       [.] dp_netdev_process_rxq_port
   5.18%  pmd-c01/id:5  ovs-vswitchd       [.] netdev_dpdk_rxq_recv
   4.24%  pmd-c01/id:5  ovs-vswitchd       [.] pmd_thread_main
   3.93%  pmd-c01/id:5  ovs-vswitchd       [.] pmd_perf_end_iteration
*  3.63%  pmd-c01/id:5  libc.so.6          [.] __GI___pthread_mutex_unlock_usercnt
   3.62%  pmd-c01/id:5  ovs-vswitchd       [.] i40e_recv_pkts_vec_avx2
*  2.76%  pmd-c01/id:5  [kernel.kallsyms]  [k] syscall_exit_to_user_mode
*  0.91%  pmd-c01/id:5  libc.so.6          [.] __GI___lll_lock_wait
*  0.18%  pmd-c01/id:5  [kernel.kallsyms]  [k] __x64_sys_futex
*  0.17%  pmd-c01/id:5  [kernel.kallsyms]  [k] futex_wait
*  0.12%  pmd-c01/id:5  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
*  0.11%  pmd-c01/id:5  libc.so.6          [.] __GI___lll_lock_wake
*  0.08%  pmd-c01/id:5  [kernel.kallsyms]  [k] do_syscall_64
*  0.06%  pmd-c01/id:5  [kernel.kallsyms]  [k] do_futex
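A quick way to verify the "almost 40%" figure is to add up the starred percentages. A small sketch, with the lock-related values from the report above pasted in by hand:

```python
# Sum the percentages of the lock-related (starred) lines in the
# 2.17.2 profile: mutex lock/unlock, lll_lock_wait, and the futex
# syscall path underneath them.
lock_percentages = [30.99, 3.63, 2.76, 0.91, 0.18, 0.17, 0.12, 0.11, 0.08, 0.06]
total = round(sum(lock_percentages), 2)
print(total)  # 39.01 -> almost 40% of the profiled CPU's cycles
```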

Performance in version 3.0.0

It is obvious that 3.0.0 has much less lock contention and therefore scales better with the number of CPUs:

  15.30%  pmd-c01/id:5  ovs-vswitchd       [.] dp_netdev_input__
   8.62%  pmd-c01/id:5  ovs-vswitchd       [.] conn_key_lookup
   7.88%  pmd-c01/id:5  ovs-vswitchd       [.] miniflow_extract
   7.75%  pmd-c01/id:5  ovs-vswitchd       [.] cmap_find
*  6.92%  pmd-c01/id:5  libc.so.6          [.] pthread_mutex_lock@@GLIBC_2.2.5
   5.15%  pmd-c01/id:5  ovs-vswitchd       [.] dpcls_subtable_lookup_mf_u0w4_u1w1
   4.16%  pmd-c01/id:5  ovs-vswitchd       [.] cmap_find_batch
   4.10%  pmd-c01/id:5  ovs-vswitchd       [.] tcp_conn_update
   3.86%  pmd-c01/id:5  ovs-vswitchd       [.] dpcls_subtable_lookup_mf_u0w5_u1w1
   3.51%  pmd-c01/id:5  ovs-vswitchd       [.] conntrack_execute
   3.42%  pmd-c01/id:5  ovs-vswitchd       [.] i40e_xmit_fixed_burst_vec_avx2
   0.77%  pmd-c01/id:5  ovs-vswitchd       [.] dp_execute_cb
   0.72%  pmd-c01/id:5  ovs-vswitchd       [.] netdev_dpdk_rxq_recv
   0.07%  pmd-c01/id:5  ovs-vswitchd       [.] i40e_xmit_pkts_vec_avx2
   0.04%  pmd-c01/id:5  ovs-vswitchd       [.] dp_netdev_input

Final words

I hope this article gave you some ideas for benchmarking and profiling connection tracking with TRex and perf. Please leave any questions in the comments on this article.

Kudos to Paolo Valerio and Gaëtan Rivet for their work on optimizing the user space OvS conntrack implementation.
