
Benchmarking improved conntrack performance in OvS 3.0.0

November 17, 2022
Robin Jarry

    Open vSwitch (OvS), an open source tool for creating virtual Layer 2 networks, relies on connection tracking in some use cases. The recent 3.0.0 release of OvS includes this patch series to improve multithread scalability, which makes connection tracking more efficient when OvS runs on multiple CPUs. This article shows how to measure the performance of connection tracking with OvS.

    What is connection tracking and why is it critical?

    Connection tracking, or conntrack, maintains an internal table of logical network connections (also called flows). The table identifies all packets that make up each flow so that they can be handled consistently.

    Conntrack is a requirement for network address translation (NAT)—in IP address masquerading, for example (described in detail in RFC 3022). Conntrack is also required for stateful firewalls, load balancers, intrusion detection and prevention systems, and deep packet inspection. More specifically, OvS conntrack rules are used to isolate different OpenStack virtual networks (aka security groups).

    Connection tracking is usually implemented by storing known connection entries in a table, indexed by a bidirectional 5-tuple consisting of a protocol, source address, destination address, source port, and destination port. Each entry also has a state as seen from the connection tracking system. The state (new, established, closed, etc.) is updated every time a packet matching its 5-tuple is processed. If a received packet does not match any existing conntrack entry, a new one is created and inserted into the table.
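
    To make the lookup logic concrete, the following minimal Python sketch shows the general idea. It is purely illustrative (the names and the simplified state handling are mine; real implementations also handle per-protocol state machines, timeouts, and entry eviction):

    from dataclasses import dataclass

    @dataclass
    class ConntrackEntry:
        state: str        # "new", "established", "closed", ...
        last_seen: float  # timestamp of the last matching packet

    table = {}  # the conntrack table, indexed by a bidirectional 5-tuple

    def conn_key(proto, src_addr, src_port, dst_addr, dst_port):
        # Normalize the key so that packets from both directions of the
        # same connection map to the same entry.
        a = (src_addr, src_port)
        b = (dst_addr, dst_port)
        return (proto,) + (a + b if a <= b else b + a)

    def track(proto, src_addr, src_port, dst_addr, dst_port, now):
        key = conn_key(proto, src_addr, src_port, dst_addr, dst_port)
        entry = table.get(key)
        if entry is None:
            # No match: create and insert a new entry for this connection.
            entry = table[key] = ConntrackEntry(state="new", last_seen=now)
        else:
            # Match: update the entry's state machine (grossly simplified here).
            entry.state = "established"
            entry.last_seen = now
        return entry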

    Performance aspects

    There are two aspects to consider when measuring conntrack performance.

    • How many new connections can be handled per second? This question depends on the following details:
      • What is the cost of looking up an existing connection entry for each received packet?
      • Can multiple threads insert and destroy conntrack entries concurrently?
      • What is the cost of creating a conntrack entry for a new connection?
      • How many packets are exchanged per connection?
    • How many concurrent connections can the system support? This question depends on the following details:
      • What is the size of the conntrack table?
      • What is the duration of each individual connection?
      • After a connection has been closed, how long does the conntrack entry linger in the table until it is expunged to make room for new connections? What if the connection is not closed but no longer exchanges traffic (because the client or server crashed or disconnected)?
      • What happens when the conntrack table is full?

    These two aspects of performance are connected: even a low rate of new connections will eventually fill up the conntrack table if the connections are very long-lived.

    In order to properly size the connection tracking table, one needs to know the average number of new connections per second and their average duration. Testing also requires tuning the timeout values of the conntrack engine.
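
    As a rough, illustrative calculation (the numbers below are made up for the example, not taken from this benchmark), the required table size is approximately the new-connection rate multiplied by how long each entry occupies a slot, including the time a closed entry lingers before being expunged:

    # Back-of-the-envelope conntrack table sizing (illustrative numbers).
    new_connections_per_second = 50_000  # average rate of new connections
    average_duration = 30.0              # seconds a connection stays active
    close_timeout = 10.0                 # seconds a closed entry lingers in the table

    entries_needed = new_connections_per_second * (average_duration + close_timeout)
    print(f"~{entries_needed:,.0f} conntrack entries")  # ~2,000,000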

    Benchmarking process

    To take the measurements necessary to answer the questions in the previous section, you need a way to simulate clients and servers. Such a system must specify how many clients and servers to test, how many connections per second they are creating, how long the connections are, and how much data is exchanged in each connection.

    A few commercial traffic generators have these capabilities, more or less refined. This article describes how to carry out the simulation with TRex—an open source traffic generator based on the Data Plane Development Kit (DPDK).

    TRex has multiple modes of operation. This article uses the advanced stateful (ASTF) mode, which allows TRex to simulate TCP and UDP endpoints. I have tailored a script using the TRex Python API to perform benchmarks in a manner similar to RFC 2544, but focused on how many new connections can be created per second.

    Basically, this script connects to a running TRex server started in ASTF mode and creates TCP and UDP connection profiles. These profiles are state machines representing clients and servers with dynamic IP addresses and ports. You can define the number of data exchanges and their sizes, add some arbitrary wait time to simulate network latency, etc. TRex takes care of translating your specifications into real traffic.

    Here is a stripped down example, in Python, of a TCP connection profile:

    from trex.astf.api import *  # ASTFProgram, ASTFTemplate, ASTFTCPClientTemplate, ...

    # num_messages, message_size and server_wait come from the script's
    # command-line options (see the test procedure below).
    client = ASTFProgram(stream=True)
    server = ASTFProgram(stream=True)
    for _ in range(num_messages):
        client.send(message_size * b"x")
        server.recv(message_size)
        if server_wait > 0:
            server.delay(server_wait * 1000)  # trex wants microseconds
        server.send(message_size * b"y")
        client.recv(message_size)

    tcp_profile = ASTFTemplate(
        client_template=ASTFTCPClientTemplate(
            program=client,
            port=8080,
            cps=99,  # base value which is changed during the binary search
            cont=True,
        ),
        server_template=ASTFTCPServerTemplate(
            program=server, assoc=ASTFAssociationRule(port=8080)
        ),
    )

    Setup

    The device under test (DUT) runs the ovs-vswitchd Open vSwitch daemon with the user-space DPDK datapath. The setup can be used to benchmark any connection-tracking device. This procedure is deliberately simple and does not represent an actual production workload, but it allows you to stress the connection tracking code path without worrying about external details.

    Figure 1 illustrates the relationship between the DUT and the traffic generator, along with the traffic that the test creates. Traffic simulating the clients travels from port0 to port1 on the traffic generator, passing through the DUT; server traffic travels from port1 to port0. Conntrack flows are programmed on br0 to allow new connections to be established only from port0 to port1 (from "clients" to "servers") and to let reply packets on established connections pass from port1 to port0 (from "servers" to "clients").

    Figure 1: Network topology.

    Base system

    Both the OvS user-space datapath and TRex use DPDK. The settings shown in this section are common to both machines.

    DPDK requires compatible network interfaces. The example in this article runs on the last two ports of an Intel X710 PCI network interface. The following commands show the hardware in use:

    [root@* ~]# lscpu | grep -e "^Model name:" -e "^NUMA" -e MHz
    NUMA node(s):        1
    Model name:          Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
    CPU MHz:             2700.087
    NUMA node0 CPU(s):   0-23
    [root@* ~]# grep ^MemTotal /proc/meminfo
    MemTotal:       65373528 kB
    [root@* ~]# lspci | grep X710 | tail -n2
    18:00.2 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
    18:00.3 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
    

    Note: To make things simpler, all commands in this article are executed as the root user.

    The CPUs used by TRex and OvS need to be isolated in order to minimize disturbance from the other tasks running on Linux. Therefore, the following commands isolate CPUs from the NUMA node where the PCI NIC is connected. CPUs 0 and 12 are left to Linux:

    dnf install -y tuned tuned-profiles-cpu-partitioning
    cat > /etc/tuned/cpu-partitioning-variables.conf <<EOF
    isolated_cores=1-11,13-23
    no_balance_cores=1-11,13-23
    EOF
    tuned-adm profile cpu-partitioning
    

    Finally, DPDK applications require huge pages. It is best to allocate them at boot time to ensure that they are all mapped to contiguous chunks of memory:

    cat >> /etc/default/grub <<EOF
    GRUB_CMDLINE_LINUX="\$GRUB_CMDLINE_LINUX intel_iommu=on iommu=pt"
    GRUB_CMDLINE_LINUX="\$GRUB_CMDLINE_LINUX hugepagesz=1G hugepages=32"
    EOF
    grub2-mkconfig -o /etc/grub2.cfg
    dnf install -y driverctl
    driverctl set-override 0000:18:00.2 vfio-pci
    driverctl set-override 0000:18:00.3 vfio-pci
    # reboot is required to apply isolcpus and allocate hugepages on boot
    systemctl reboot
    

    TRex and the traffic generator

    TRex needs to be compiled from source. The following commands download and build the program:

    dnf install -y python3 git numactl-devel zlib-devel gcc-c++ gcc
    git clone https://github.com/cisco-system-traffic-generator/trex-core ~/trex
    cd ~/trex/linux_dpdk
    ./b configure
    taskset 0xffffffffff ./b build
    

    We use the following configuration in /etc/trex_cfg.yaml:

    - version: 2
      interfaces:
        - "18:00.2"
        - "18:00.3"
      rx_desc: 4096
      tx_desc: 4096
      port_info:
        - dest_mac: "04:3f:72:f2:8f:33"
          src_mac:  "04:3f:72:f2:8f:32"
        - dest_mac: "04:3f:72:f2:8f:32"
          src_mac:  "04:3f:72:f2:8f:33"
    
      c: 22
      memory:
        mbuf_64: 30000
        mbuf_128: 500000
        mbuf_256: 30717
        mbuf_512: 30720
        mbuf_1024: 30720
        mbuf_2048: 4096
    
      platform:
        master_thread_id: 0
        latency_thread_id: 12
        dual_if:
          - socket: 0
            threads: [
               1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11,
              13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,
            ]
    

    Finally, we can start TRex:

    cd ~/trex/scripts
    ./t-rex-64 -i --astf
    

    The TRex daemon runs in the foreground. The cps_ndr.py script is run in a separate terminal and connects to the daemon through its JSON-RPC API.
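
    To give an idea of what such a driver looks like, here is a heavily simplified sketch using the TRex ASTF Python client. This is not the actual cps_ndr.py (which handles TCP/UDP mixes, drop-rate thresholds, and reporting, and searches over absolute connections per second rather than a multiplier); it only shows the general shape of the RFC 2544-style binary search around the tcp_profile template defined earlier. The IP ranges and the acceptable() placeholder are illustrative assumptions:

    import time
    from trex.astf.api import *  # ASTFClient, ASTFProfile, ASTFIPGen, ...

    # Wrap the tcp_profile template from the previous section into a profile.
    ip_gen = ASTFIPGen(
        glob=ASTFIPGenGlobal(ip_offset="1.0.0.0"),
        dist_client=ASTFIPGenDist(ip_range=["16.0.0.1", "16.0.0.254"], distribution="seq"),
        dist_server=ASTFIPGenDist(ip_range=["48.0.0.1", "48.0.0.254"], distribution="seq"),
    )
    profile = ASTFProfile(default_ip_gen=ip_gen, templates=tcp_profile)

    def measure(client, mult, sample_time=30):
        """Run the profile at mult x its base cps and return the TRex stats."""
        client.reset()
        client.load_profile(profile)
        client.clear_stats()
        client.start(mult=mult)  # scales the cps values defined in the templates
        time.sleep(sample_time)  # let traffic ramp up, then sample it
        stats = client.get_stats()
        client.stop()
        return stats

    def acceptable(stats, error_threshold=0.02):
        # Placeholder: the real script derives a packet drop ratio from the
        # statistics and compares it against --error-threshold.
        raise NotImplementedError

    client = ASTFClient(server="127.0.0.1")
    client.connect()
    try:
        lower, upper = 1, 1000         # bounds of the binary search (multipliers)
        for _ in range(8):             # --max-iterations
            current = (lower + upper) // 2
            if acceptable(measure(client, current)):
                lower = current        # the DUT kept up: search higher
            else:
                upper = current        # too many drops: search lower
        print(f"highest acceptable rate: {lower} x base cps")
    finally:
        client.disconnect()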

    The device under test

    First, let's compile and install DPDK:

    dnf install -y git meson ninja-build gcc python3-pyelftools
    git clone -b v21.11 https://github.com/DPDK/dpdk ~/dpdk
    cd ~/dpdk
    meson build
    taskset 0xffffff ninja -C ~/dpdk/build install
    

    Then compile and install OVS. In the following console excerpt, I explicitly check out version 2.17.2. Version 3.0.0 will be recompiled before running all tests again:

    dnf install -y gcc-c++ make libtool autoconf automake
    git clone -b v2.17.2 https://github.com/openvswitch/ovs ~/ovs
    cd ~/ovs
    ./boot.sh
    PKG_CONFIG_PATH="/usr/local/lib64/pkgconfig" ./configure --with-dpdk=static
    taskset 0xffffff make install -j24
    /usr/local/share/openvswitch/scripts/ovs-ctl start
    

    Here I enable the DPDK user-space datapath and configure a bridge with two ports. For now, there is only one receive (RX) queue per port, and one CPU is assigned to poll them. I will increase these parameters along the way.

    I set the conntrack table size to a relatively large value (5 million entries) to reduce the risk of it getting full during tests. Also, I configure the various timeout policies to match the traffic profiles I am about to send. These aggressive timeouts help prevent the table from getting full. The default timeout values are very conservative—they're too long to achieve high numbers of connections per second without filling the conntrack table:

    ovs-vsctl set open_vswitch . other_config:dpdk-init=true
    ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x4"
    /usr/local/share/openvswitch/scripts/ovs-ctl restart
    ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev
    ovs-vsctl add-port br0 port0 -- \
        set interface port0 type=dpdk options:dpdk-devargs=0000:18:00.2
    ovs-vsctl add-port br0 port1 -- \
        set Interface port1 type=dpdk options:dpdk-devargs=0000:18:00.3
    
    ovs-appctl dpctl/ct-set-maxconns 5000000
    # creating an empty datapath record is required to add a zone timeout policy
    ovs-vsctl -- --id=@m create Datapath datapath_version=0 -- \
        set Open_vSwitch . datapaths:"netdev"=@m
    ovs-vsctl add-zone-tp netdev zone=0 \
        udp_first=1 udp_single=1 udp_multiple=30 tcp_syn_sent=1 \
        tcp_syn_recv=1 tcp_fin_wait=1 tcp_time_wait=1 tcp_close=1 \
        tcp_established=30
    
    cat > ~/ct-flows.txt << EOF
    priority=1 ip ct_state=-trk                   actions=ct(table=0)
    priority=1 ip ct_state=+trk+new in_port=port0 actions=ct(commit),normal
    priority=1 ip ct_state=+trk+est               actions=normal
    priority=0 actions=drop
    EOF
    

    Test procedure

    The cps_ndr.py script that I have written has multiple parameters to control the nature of the generated connections:

    • Ratio of TCP connections to UDP connections
    • Number of data messages (request + response) exchanged per connection (excluding protocol overhead)
    • Size of data messages in bytes (to emulate the TCP maximum segment size)
    • Time in milliseconds that the simulated servers wait before sending a response to a request

    In the context of this benchmark, I intentionally keep the size of data messages fixed at 20 bytes, to avoid being limited by the 10 Gbit/s link bandwidth.

    I run two types of test: One with short-lived connections and the other with long-lived connections. Both the short-lived and long-lived connection profiles are tested against OVS versions 2.17.2 and 3.0.0. Different configurations are tested to check whether performance scales with the number of CPUs and receive queues.

    Short-lived connections

    The parameters of this test consist of sending 40 data bytes per connection (1 request + 1 reply of 20 bytes each), with no wait by the server before sending the replies. These parameters stress the conntrack creation and destruction code path.

    An example run follows:

    [root@tgen scripts]# ./cps_ndr.py --sample-time 30 --max-iterations 8 \
    >    --error-threshold 0.02 --udp-percent 1 --num-messages 1 \
    >    --message-size 20 --server-wait 0 -m 1k -M 100k
    ... iteration #1: lower=1.0K current=50.5K upper=100K
    ▼▼▼ Flows: active 26.8K (50.1K/s) TX: 215Mb/s (345Kp/s) RX: 215Mb/s (345Kp/s) Size: ~4.5B
    err dropped: 1.6K pkts (1.6K/s) ~ 0.4746%
    ... iteration #2: lower=1.0K current=25.8K upper=50.5K
    ▲▲▲ Flows: active 12.9K (25.7K/s) TX: 112Mb/s (179Kp/s) RX: 112Mb/s (179Kp/s) Size: ~4.5B
    ... iteration #3: lower=25.8K current=38.1K upper=50.5K
    ▲▲▲ Flows: active 19.1K (38.1K/s) TX: 166Mb/s (266Kp/s) RX: 166Mb/s (266Kp/s) Size: ~4.5B
    ... iteration #4: lower=38.1K current=44.3K upper=50.5K
    ▼▼▼ Flows: active 22.2K (44.2K/s) TX: 192Mb/s (307Kp/s) RX: 191Mb/s (307Kp/s) Size: ~4.5B
    err dropped: 1.3K pkts (125/s) ~ 0.0408%
    ... iteration #5: lower=38.1K current=41.2K upper=44.3K
    ▲▲▲ Flows: active 20.7K (41.2K/s) TX: 178Mb/s (286Kp/s) RX: 178Mb/s (286Kp/s) Size: ~4.5B
    ... iteration #6: lower=41.2K current=42.8K upper=44.3K
    ▼▼▼ Flows: active 21.5K (42.6K/s) TX: 185Mb/s (296Kp/s) RX: 185Mb/s (296Kp/s) Size: ~4.5B
    err dropped: 994 pkts (99/s) ~ 0.0335%
    ... iteration #7: lower=41.2K current=42.0K upper=42.8K
    ▼▼▼ Flows: active 21.0K (41.8K/s) TX: 181Mb/s (290Kp/s) RX: 181Mb/s (290Kp/s) Size: ~4.5B
    err dropped: 877 pkts (87/s) ~ 0.0301%
    ... iteration #8: lower=41.2K current=41.6K upper=42.0K
    ▲▲▲ Flows: active 20.9K (41.4K/s) TX: 180Mb/s (289Kp/s) RX: 180Mb/s (289Kp/s) Size: ~4.5B
    

    Long-lived connections

    The parameters of this test consist of sending 20K data bytes per connection (500 requests + 500 replies of 20 bytes each) over 25 seconds. These parameters stress the conntrack lookup code path.

    An example run follows:

    [root@tgen scripts]# ./cps_ndr.py --sample-time 120 --max-iterations 8 \
    >    --error-threshold 0.02 --udp-percent 1 --num-messages 500 \
    >    --message-size 20 --server-wait 50 -m 500 -M 2k
    ... iteration #1: lower=500 current=1.2K upper=2.0K
    ▼▼▼ Flows: active 48.5K (1.2K/s) TX: 991Mb/s (1.5Mp/s) RX: 940Mb/s (1.4Mp/s) Size: ~13.3B
    err dropped: 1.8M pkts (30.6K/s) ~ 2.4615%
    ... iteration #2: lower=500 current=875 upper=1.2K
    ▲▲▲ Flows: active 22.5K (871/s) TX: 871Mb/s (1.3Mp/s) RX: 871Mb/s (1.3Mp/s) Size: ~13.3B
    ... iteration #3: lower=875 current=1.1K upper=1.2K
    ▼▼▼ Flows: active 33.8K (1.1K/s) TX: 967Mb/s (1.4Mp/s) RX: 950Mb/s (1.4Mp/s) Size: ~13.3B
    err dropped: 621K pkts (10.3K/s) ~ 0.7174%
    ... iteration #4: lower=875 current=968 upper=1.1K
    ▲▲▲ Flows: active 24.9K (965/s) TX: 961Mb/s (1.4Mp/s) RX: 962Mb/s (1.4Mp/s) Size: ~13.3B
    ... iteration #5: lower=968 current=1.0K upper=1.1K
    ▼▼▼ Flows: active 29.8K (1.0K/s) TX: 965Mb/s (1.4Mp/s) RX: 957Mb/s (1.4Mp/s) Size: ~13.3B
    err dropped: 334K pkts (5.6K/s) ~ 0.3830%
    ... iteration #6: lower=968 current=992 upper=1.0K
    ▼▼▼ Flows: active 25.5K (989/s) TX: 964Mb/s (1.4Mp/s) RX: 964Mb/s (1.4Mp/s) Size: ~13.3B
    err dropped: 460 pkts (460/s) ~ 0.0314%
    ... iteration #7: lower=968 current=980 upper=992
    ▼▼▼ Flows: active 25.3K (977/s) TX: 962Mb/s (1.4Mp/s) RX: 962Mb/s (1.4Mp/s) Size: ~13.3B
    err dropped: 397 pkts (397/s) ~ 0.0272%
    ... iteration #8: lower=968 current=974 upper=980
    ▲▲▲ Flows: active 25.1K (971/s) TX: 969Mb/s (1.5Mp/s) RX: 969Mb/s (1.5Mp/s) Size: ~13.3B
    

    Performance statistics

    This section presents results of runs with varying numbers of CPUs and queues on my test system. The numbers that I measured should be taken with a grain of salt. Connection tracking performance is highly dependent on hardware, traffic profile, and overall system load. I provide the statistics here just to give a general idea of the improvement brought by OVS 3.0.0.

    Baseline results for comparison

    For reference, the tests were executed with a cable connecting port0 and port1 of the traffic generator machine. This is the maximum performance TRex is able to achieve with this configuration and hardware.

    Table 1: Maximum traffic generator performance.
    Type         Connection rate   Active flows   Bandwidth    Packet rate
    Short-lived  1.8M conn/s       1.7M           8.4G bit/s   12.7M pkt/s
    Long-lived   11.1K conn/s      898K           8.0G bit/s   11.4M pkt/s

    1 CPU, 1 queue per port, without connection tracking

    The results in this section were achieved with the following DUT configuration:

    ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x4"
    ovs-vsctl set Interface port0 options:n_rxq=1
    ovs-vsctl set Interface port1 options:n_rxq=1
    ovs-ofctl del-flows br0
    ovs-ofctl add-flow br0 action=normal
    
    Table 2: Short-lived connections with 1 CPU, 1 queue per port, without connection tracking.
    Version   Short-lived connections   Active flows   Bandwidth     Packet rate    Difference
    2.17.2    1.0M conn/s               524.8K         4.5G bit/s    7.3M pkt/s
    3.0.0     1.0M conn/s               513.1K         4.5G bit/s    7.1M pkt/s     -1.74%
    Table 3: Long-lived connections with 1 CPU, 1 queue per port, without connection tracking.
    Version   Long-lived connections   Active flows   Bandwidth     Packet rate    Difference
    2.17.2    3.1K conn/s              79.9K          3.1G bit/s    4.7M pkt/s
    3.0.0     2.8K conn/s              71.9K          2.8G bit/s    4.2M pkt/s     -9.82%

    There is a drop in performance, without connection tracking enabled, between versions 2.17.2 and 3.0.0. This drop is completely unrelated to the conntrack optimization patch series I am focusing on. It might be caused by some discrepancies in the test procedure, but it might also have been introduced by another patch series between the two tested versions.

    1 CPU, 1 queue per port

    The results in this section were achieved with the following DUT configuration:

    ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x4"
    ovs-vsctl set Interface port0 options:n_rxq=1
    ovs-vsctl set Interface port1 options:n_rxq=1
    ovs-ofctl del-flows br0
    ovs-ofctl add-flows br0 ~/ct-flows.txt
    
    Table 4: Short-lived connections with 1 CPU, 1 queue per port, with connection tracking.
    Version   Short-lived connections   Active flows   Bandwidth      Packet rate     Difference
    2.17.2    39.7K conn/s              20.0K          172.0M bit/s   275.8K pkt/s
    3.0.0     48.2K conn/s              24.3K          208.9M bit/s   334.9K pkt/s    +21.36%
    Table 5: Long-lived connections with 1 CPU, 1 queue per port, with connection tracking.
    Version   Long-lived connections   Active flows   Bandwidth      Packet rate    Difference
    2.17.2    959 conn/s               24.7K          956.6M bit/s   1.4M pkt/s
    3.0.0     1.2K conn/s              31.5K          1.2G bit/s     1.8M pkt/s     +28.15%

    Already here, we can see that the patch series improves the single-threaded performance of connection tracking in the creation, destruction, and lookup code paths. Keep these results in mind when looking at improvements in multithreaded performance.

    2 CPUs, 1 queue per port

    The results in this section were achieved with the following DUT configuration:

    ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x2002"
    ovs-vsctl set Interface port0 options:n_rxq=1
    ovs-vsctl set Interface port1 options:n_rxq=1
    ovs-ofctl del-flows br0
    ovs-ofctl add-flows br0 ~/ct-flows.txt
    
    Table 6: Short-lived connections with 2 CPUs, 1 queue per port.
    Version   Short-lived connections   Active flows   Bandwidth      Packet rate     Difference
    2.17.2    39.9K conn/s              20.0K          172.8M bit/s   277.0K pkt/s
    3.0.0     46.8K conn/s              23.5K          202.7M bit/s   325.0K pkt/s    +17.28%
    Table 7: Long-lived connections with 2 CPUs, 1 queue per port.
    Version   Long-lived connections   Active flows   Bandwidth      Packet rate    Difference
    2.17.2    885 conn/s               22.7K          883.1M bit/s   1.3M pkt/s
    3.0.0     1.1K conn/s              28.6K          1.1G bit/s     1.7M pkt/s     +25.19%

    It is worth noting that assigning twice as many CPUs to packet processing does not double the performance. Far from it, in fact: the numbers are essentially the same as with only one CPU, if not lower.

    This surprising result is likely explained by the fact that there is only one RX queue per port, so each CPU ends up processing a single port.

    2 CPUs, 2 queues per port

    The results in this section were achieved with the following DUT configuration:

    ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x2002"
    ovs-vsctl set Interface port0 options:n_rxq=2
    ovs-vsctl set Interface port1 options:n_rxq=2
    ovs-ofctl del-flows br0
    ovs-ofctl add-flows br0 ~/ct-flows.txt
    
    Table 8: Short-lived connections with 2 CPUs, 2 queues per port.
    Version   Short-lived connections   Active flows   Bandwidth      Packet rate     Difference
    2.17.2    48.3K conn/s              24.3K          208.8M bit/s   334.8K pkt/s
    3.0.0     65.9K conn/s              33.2K          286.8M bit/s   459.9K pkt/s    +36.41%

    For short-lived connections, we begin to see improvement beyond the single-threaded performance gain. Lock contention was reduced during the insertion and deletion of conntrack entries.

    Table 9: Long-lived connections with 2 CPUs, 2 queues per port.
    Version   Long-lived connections   Active flows   Bandwidth     Packet rate    Difference
    2.17.2    1.1K conn/s              29.1K          1.1G bit/s    1.7M pkt/s
    3.0.0     1.4K conn/s              37.0K          1.4G bit/s    2.2M pkt/s     +26.77%

    With two CPUs and two queues, if we take the single-threaded performance out of the picture, there seems to be no improvement in conntrack lookup for long-lived connections.

    4 CPUs, 2 queues per port

    The results in this section were achieved with the following DUT configuration:

    ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x6006"
    ovs-vsctl set Interface port0 options:n_rxq=2
    ovs-vsctl set Interface port1 options:n_rxq=2
    ovs-ofctl del-flows br0
    ovs-ofctl add-flows br0 ~/ct-flows.txt
    
    Table 10: Short-lived connections with 4 CPUs, 2 queues per port.
    Version   Short-lived connections   Active flows   Bandwidth      Packet rate     Difference
    2.17.2    47.4K conn/s              23.9K          206.2M bit/s   330.6K pkt/s
    3.0.0     49.1K conn/s              24.7K          212.1M bit/s   340.1K pkt/s    +3.53%

    The short-lived connection rate in 3.0.0 has dropped compared to the 2-CPU, 2-queue configuration. This is not a fluke: the numbers are consistent across multiple runs. This drop warrants some scrutiny, but does not invalidate all the work that has been done.

    Table 11: Long-lived connections with 4 CPUs, 2 queues per port.
    Version   Long-lived connections   Active flows   Bandwidth      Packet rate    Difference
    2.17.2    981 conn/s               25.2K          977.7M bit/s   1.5M pkt/s
    3.0.0     2.0K conn/s              52.4K          2.0G bit/s     3.1M pkt/s     +108.31%

    With four CPUs and two queues per port, long-lived connection tracking is starting to scale up.

    4 CPUs, 4 queues per port

    The results in this section were achieved with the following DUT configuration:

    ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x6006"
    ovs-vsctl set Interface port0 options:n_rxq=4
    ovs-vsctl set Interface port1 options:n_rxq=4
    ovs-ofctl del-flows br0
    ovs-ofctl add-flows br0 ~/ct-flows.txt
    
    Table 12: Short-lived connections with 4 CPUs, 4 queues per port.
    Version   Short-lived connections   Active flows   Bandwidth      Packet rate     Difference
    2.17.2    66.1K conn/s              33.2K          286.4M bit/s   459.2K pkt/s
    3.0.0     100.8K conn/s             50.6K          437.0M bit/s   700.6K pkt/s    +52.55%
    Table 13: Long-lived connections with 4 CPUs, 4 queues per port.
    Version   Long-lived connections   Active flows   Bandwidth      Packet rate    Difference
    2.17.2    996 conn/s               25.9K          994.2M bit/s   1.5M pkt/s
    3.0.0     2.6K conn/s              67.0K          2.6G bit/s     3.9M pkt/s     +162.89%

    8 CPUs, 4 queues per port

    The results in this section were achieved with the following DUT configuration:

    ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x1e01e"
    ovs-vsctl set Interface port0 options:n_rxq=4
    ovs-vsctl set Interface port1 options:n_rxq=4
    ovs-ofctl del-flows br0
    ovs-ofctl add-flows br0 ~/ct-flows.txt
    
    Table 14: Short-lived connections with 8 CPUs, 4 queues per port.
    Version   Short-lived connections   Active flows   Bandwidth      Packet rate     Difference
    2.17.2    62.2K conn/s              31.3K          269.8M bit/s   432.5K pkt/s
    3.0.0     90.1K conn/s              45.2K          390.9M bit/s   626.7K pkt/s    +44.89%
    Table 15: Long-lived connections with 8 CPUs, 4 queues per port.
    Version   Long-lived connections   Active flows   Bandwidth      Packet rate     Difference
    2.17.2    576 conn/s               17.1K          567.2M bit/s   852.5K pkt/s
    3.0.0     3.8K conn/s              97.8K          3.8G bit/s     5.7M pkt/s      +562.76%

    8 CPUs, 8 queues per port

    The results in this section were achieved with the following DUT configuration:

    ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x1e01e"
    ovs-vsctl set Interface port0 options:n_rxq=8
    ovs-vsctl set Interface port1 options:n_rxq=8
    ovs-ofctl del-flows br0
    ovs-ofctl add-flows br0 ~/ct-flows.txt
    
    Table 16: Short-lived connections with 8 CPUs, 8 queues per port.
    Version   Short-lived connections   Active flows   Bandwidth      Packet rate     Difference
    2.17.2    50.6K conn/s              25.5K          219.5M bit/s   351.9K pkt/s
    3.0.0     100.9K conn/s             50.7K          436.0M bit/s   698.9K pkt/s    +99.36%
    Table 17: Long-lived connections with 8 CPUs, 8 queues per port.
    Version   Long-lived connections   Active flows   Bandwidth      Packet rate     Difference
    2.17.2    541 conn/s               14.0K          539.2M bit/s   810.3K pkt/s
    3.0.0     4.8K conn/s              124.1K         4.8G bit/s     7.2M pkt/s      +792.83%

    Performance improvements in version 3.0.0 of Open vSwitch

    Using the tools in this article, I have been able to record advances made in version 3.0.0 in scaling and in handling long-lived connections.

    Scaling

    Figure 2 shows how many insertions and deletions per second were achieved on different system configurations for short-lived connections.

    Figure 2: Improvements in scaling of short-lived connections tracking in version 3.0.0.

    Apart from the small blip with 4 CPUs and 2 queues per port, the conntrack insertion and deletion code path has improved consistently in OvS 3.0.0. The multithreaded lock contention remains, but is less noticeable than with OvS 2.17.2.

    Figure 3 shows how many insertions and deletions per second were achieved on different system configurations for long-lived connections.

    Figure 3: Improvements in scaling of long-lived connections tracking in version 3.0.0.

    Long-lived connection tracking is where the optimizations made in OvS 3.0.0 really shine. The reduced multithreaded lock contention in the conntrack lookup path makes performance scale significantly better with the number of CPUs.
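
    To put numbers on that, here is a quick calculation based on the long-lived connection rates reported in the tables above (1 CPU/1 queue versus 8 CPUs/8 queues, with connection tracking enabled):

    # Long-lived connection rates (conn/s) taken from the tables above.
    ovs_2_17_2 = {"1 CPU, 1 rxq": 959, "8 CPUs, 8 rxq": 541}
    ovs_3_0_0 = {"1 CPU, 1 rxq": 1_200, "8 CPUs, 8 rxq": 4_800}  # 1.2K and 4.8K

    for name, rates in (("2.17.2", ovs_2_17_2), ("3.0.0", ovs_3_0_0)):
        scale = rates["8 CPUs, 8 rxq"] / rates["1 CPU, 1 rxq"]
        print(f"OvS {name}: x{scale:.1f} from 1 CPU to 8 CPUs")
    # OvS 2.17.2: x0.6 from 1 CPU to 8 CPUs (throughput actually degrades)
    # OvS 3.0.0: x4.0 from 1 CPU to 8 CPUs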

    Performance during high traffic

    The following commands generate profiling reports using the Linux kernel's perf command. I measured the performance of both version 2.17.2 and version 3.0.0 for 8 CPUs and 8 RX queues under a maximum load for long-lived connections, with conntrack flows enabled. Only the events of a single CPU were captured:

    perf record -g -C 1 sleep 60   # record call graphs on CPU 1 for 60 seconds
    perf report -U --no-children | grep '\[[\.k]\]' | head -15 > profile-$version.txt
    # $version is the OVS version under test: 2.17.2 or 3.0.0
    

    In the subsections that follow, I have manually annotated lines that are directly related to acquiring mutexes so that they start with a * character. When a CPU is waiting for a mutex acquisition, it is not processing any network traffic, but waiting for another CPU to release the lock.

    Performance in version 2.17.2

    The profiled CPU spends almost 40% of its cycles acquiring locks and waiting for other CPUs to release locks:

    * 30.99%  pmd-c01/id:5  libc.so.6          [.] pthread_mutex_lock@@GLIBC_2.2.5
      12.27%  pmd-c01/id:5  ovs-vswitchd       [.] dp_netdev_process_rxq_port
       5.18%  pmd-c01/id:5  ovs-vswitchd       [.] netdev_dpdk_rxq_recv
       4.24%  pmd-c01/id:5  ovs-vswitchd       [.] pmd_thread_main
       3.93%  pmd-c01/id:5  ovs-vswitchd       [.] pmd_perf_end_iteration
    *  3.63%  pmd-c01/id:5  libc.so.6          [.] __GI___pthread_mutex_unlock_usercnt
       3.62%  pmd-c01/id:5  ovs-vswitchd       [.] i40e_recv_pkts_vec_avx2
    *  2.76%  pmd-c01/id:5  [kernel.kallsyms]  [k] syscall_exit_to_user_mode
    *  0.91%  pmd-c01/id:5  libc.so.6          [.] __GI___lll_lock_wait
    *  0.18%  pmd-c01/id:5  [kernel.kallsyms]  [k] __x64_sys_futex
    *  0.17%  pmd-c01/id:5  [kernel.kallsyms]  [k] futex_wait
    *  0.12%  pmd-c01/id:5  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
    *  0.11%  pmd-c01/id:5  libc.so.6          [.] __GI___lll_lock_wake
    *  0.08%  pmd-c01/id:5  [kernel.kallsyms]  [k] do_syscall_64
    *  0.06%  pmd-c01/id:5  [kernel.kallsyms]  [k] do_futex
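
    As a quick sanity check, summing the percentages of the starred (mutex-related) lines in the report above gives the "almost 40%" figure:

    # Percentages of the lines marked with * in the 2.17.2 profile above.
    mutex_related = [30.99, 3.63, 2.76, 0.91, 0.18, 0.17, 0.12, 0.11, 0.08, 0.06]
    print(f"{sum(mutex_related):.2f}% of cycles spent on lock handling")  # 39.01%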

    Performance in version 3.0.0

    Version 3.0.0 clearly has much less lock contention, and therefore scales better with the number of CPUs:

      15.30%  pmd-c01/id:5  ovs-vswitchd       [.] dp_netdev_input__
       8.62%  pmd-c01/id:5  ovs-vswitchd       [.] conn_key_lookup
       7.88%  pmd-c01/id:5  ovs-vswitchd       [.] miniflow_extract
       7.75%  pmd-c01/id:5  ovs-vswitchd       [.] cmap_find
    *  6.92%  pmd-c01/id:5  libc.so.6          [.] pthread_mutex_lock@@GLIBC_2.2.5
       5.15%  pmd-c01/id:5  ovs-vswitchd       [.] dpcls_subtable_lookup_mf_u0w4_u1w1
       4.16%  pmd-c01/id:5  ovs-vswitchd       [.] cmap_find_batch
       4.10%  pmd-c01/id:5  ovs-vswitchd       [.] tcp_conn_update
       3.86%  pmd-c01/id:5  ovs-vswitchd       [.] dpcls_subtable_lookup_mf_u0w5_u1w1
       3.51%  pmd-c01/id:5  ovs-vswitchd       [.] conntrack_execute
       3.42%  pmd-c01/id:5  ovs-vswitchd       [.] i40e_xmit_fixed_burst_vec_avx2
       0.77%  pmd-c01/id:5  ovs-vswitchd       [.] dp_execute_cb
       0.72%  pmd-c01/id:5  ovs-vswitchd       [.] netdev_dpdk_rxq_recv
       0.07%  pmd-c01/id:5  ovs-vswitchd       [.] i40e_xmit_pkts_vec_avx2
       0.04%  pmd-c01/id:5  ovs-vswitchd       [.] dp_netdev_input
    

    Final words

    I hope this article gave you some ideas for benchmarking and profiling connection tracking with TRex and perf. Please leave any questions you have in the comments on this article.

    Kudos to Paolo Valerio and Gaëtan Rivet for their work on optimizing the user space OvS conntrack implementation.

    Last updated: May 30, 2024
