
Open vSwitch (OVS) is an open source framework for software-defined networking (SDN) and is useful in virtualized environments. Just like conventional network stacks, OVS can offload tasks to the hardware running on the network interface card (NIC) to speed up the processing of network packets.

However, dozens of functions are invoked in a chain to achieve hardware offload. This article takes you through the chain of functions to help you debug networking problems with OVS.

This article assumes that you understand the basics of OVS and hardware offload. You should also be familiar with Linux networking commands, particularly tc (traffic control), which you can use to dump traffic flows and see whether they are offloaded.

For the flow illustrated in this article, I used a Mellanox NIC.

The start of a network transmission

Let's start our long journey from an add/modify OVS operation to the hardware drivers with the first few functions called. Figure 1 shows each function at the beginning of the process, and the file in OVS that defines the function.

Figure 1. OVS launches the chain of functions that eventually lead to hardware offload.

There are two ways to install flows in the datapath. One is dpctl_add_flow, which manually injects the flow into the datapath, as shown in Figure 1. But ovs-dpctl is not a common way of injecting datapath flows. Typically, the handler thread in OVS receives an upcall from dpif, processes it, and installs the flow via dpif_operate(), as shown in Figure 2.

Figure 2. udpif_upcall_handler receives the upcall from dpif and installs the flow.

OVS offload operations

The dpif_netlink_operate function is registered to the function pointer dpif->dpif_class->operate. Calling the function leads to the call stack in Figure 3.

Figure 3. A put operation invokes the function assigned to a function pointer for that purpose.

The lib/netdev-offload.c file in OVS defines a netdev_register_flow_api_provider function. The chain of calls continues through a function pointer registered as follows:

netdev_register_flow_api_provider(&netdev_offload_tc);

The netdev_tc_flow_put function is assigned to the .flow_put struct member as shown in the following excerpt:

const struct netdev_flow_api netdev_offload_tc = {
    .type = "linux_tc",
    …
    .flow_put = netdev_tc_flow_put,
    …
};

After the call reaches netdev_tc_flow_put, the chain of calls continues as shown in Figure 4.

Figure 4. Another sequence eventually calls sendmsg.
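The registration pattern described above can be sketched in a few lines of standalone C. This is a simplified illustration, not the OVS implementation: struct flow_api, register_flow_api_provider, flow_put, and the demo_ names are hypothetical stand-ins, and the real netdev_flow_api structure has many more members and a different flow_put signature.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for OVS's struct netdev_flow_api. */
struct flow_api {
    const char *type;
    int (*flow_put)(const char *match, const char *actions);
};

#define MAX_PROVIDERS 8
static const struct flow_api *providers[MAX_PROVIDERS];
static size_t n_providers;

/* Records a provider; returns 0 on success, -1 if the table is full. */
static int register_flow_api_provider(const struct flow_api *api)
{
    assert(api != NULL && api->flow_put != NULL);
    if (n_providers == MAX_PROVIDERS) {
        return -1;
    }
    providers[n_providers++] = api;
    return 0;
}

/* Finds the provider by type and calls through its flow_put pointer. */
static int flow_put(const char *type, const char *match, const char *actions)
{
    for (size_t i = 0; i < n_providers; i++) {
        if (strcmp(providers[i]->type, type) == 0) {
            return providers[i]->flow_put(match, actions);
        }
    }
    return -1; /* No provider registered for this type. */
}

/* Stand-in for netdev_tc_flow_put; it only counts how often it is called. */
static int tc_flow_put_calls;

static int demo_tc_flow_put(const char *match, const char *actions)
{
    (void)match;
    (void)actions;
    tc_flow_put_calls++;
    return 0;
}

static const struct flow_api demo_offload_tc = {
    .type = "linux_tc",
    .flow_put = demo_tc_flow_put,
};
```

After register_flow_api_provider(&demo_offload_tc), a call such as flow_put("linux_tc", "in_port=1", "output:2") reaches demo_tc_flow_put through the function pointer. The caller never names the tc-specific function directly, which is why this kind of indirection can be hard to follow when you read the code top-down.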

Sequence from a tc command

Let's leave our pursuit of offloading in the chain of OVS calls for a moment and look at a more conventional sequence of calls. Without OVS in the picture, a call from the tc utility proceeds as shown in Figure 5.

Figure 5. You can run a tc command and trace the sequence from sendmsg.

Whether sendmsg is issued from tc, from OVS, or from another sender, the message goes to the kernel and then to the hardware driver.

The call to sendmsg

Now let's continue from where we had paused earlier in sendmsg. The chain of functions continues as shown in Figure 6.

Figure 6. sendmsg invokes functions from the Routing Netlink (rtnl) subsystem.

The Linux kernel registers the following functions with the Routing Netlink (rtnl) subsystem:

  • tc_new_tfilter
  • tc_del_tfilter
  • tc_get_tfilter
  • tc_ctl_chain

These functions are registered by calling rtnl_register in the net/sched/cls_api.c file. In turn, rtnl_register invokes rtnl_register_internal, defined in net/core/rtnetlink.c. The RTM_NEWCHAIN, RTM_GETCHAIN, and RTM_DELCHAIN operations all take place in tc_ctl_chain.
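To make the idea of registering handlers per message type concrete, here is a minimal sketch of a dispatch table in standalone C. Everything prefixed with demo_ or DEMO_ is hypothetical: the enum values are not the kernel's RTM_* numbers, and the real rtnl_register takes more arguments than this simplified version.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative message types; the numeric values are NOT the kernel's. */
enum demo_msgtype {
    DEMO_NEWTFILTER,
    DEMO_DELTFILTER,
    DEMO_GETTFILTER,
    DEMO_NEWCHAIN,
    DEMO_DELCHAIN,
    DEMO_GETCHAIN,
    DEMO_MSG_MAX
};

typedef int (*demo_doit_fn)(int msgtype);

static demo_doit_fn demo_handlers[DEMO_MSG_MAX];

/* Similar in spirit to rtnl_register: bind one handler to one type. */
static int demo_register(int msgtype, demo_doit_fn doit)
{
    assert(doit != NULL);
    if (msgtype < 0 || msgtype >= DEMO_MSG_MAX) {
        return -1;
    }
    demo_handlers[msgtype] = doit;
    return 0;
}

/* Returns the handler bound to a type, or NULL if there is none. */
static demo_doit_fn demo_lookup(int msgtype)
{
    if (msgtype < 0 || msgtype >= DEMO_MSG_MAX) {
        return NULL;
    }
    return demo_handlers[msgtype];
}

/* Stand-ins for the four functions listed above. */
static int demo_new_tfilter(int msgtype) { (void)msgtype; return 1; }
static int demo_del_tfilter(int msgtype) { (void)msgtype; return 2; }
static int demo_get_tfilter(int msgtype) { (void)msgtype; return 3; }
static int demo_ctl_chain(int msgtype)   { (void)msgtype; return 4; }

static void demo_register_all(void)
{
    demo_register(DEMO_NEWTFILTER, demo_new_tfilter);
    demo_register(DEMO_DELTFILTER, demo_del_tfilter);
    demo_register(DEMO_GETTFILTER, demo_get_tfilter);
    /* One handler serves all three chain operations. */
    demo_register(DEMO_NEWCHAIN, demo_ctl_chain);
    demo_register(DEMO_DELCHAIN, demo_ctl_chain);
    demo_register(DEMO_GETCHAIN, demo_ctl_chain);
}
```

After demo_register_all(), looking up DEMO_NEWTFILTER yields demo_new_tfilter, while all three chain types resolve to the same demo_ctl_chain handler, mirroring how tc_ctl_chain serves the three chain operations.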

The sequence continues based on functions registered to the rtnl subsystem. tc_new_tfilter, defined in net/sched/cls_api.c, invokes the function pointer registered to tp->ops->change, and ends up calling fl_change from the net/sched/cls_flower.c file.

fl_change checks whether the skip_hw or skip_sw policy is present. If the tc-policy is skip_hw, the flow is just added to tc and the function returns.

Figure 7 takes a deeper look into the fl_change function. It has changed somewhat in the latest kernel version, but the control flow is pretty much the same as the one shown in the figure.

Figure 7. The fl_change function checks for skip_sw and skip_hw.

If tc-policy is unset or skip_sw, the call sequence tries to add the flow to the hardware. Because we are interested in flows that get offloaded to hardware, we continue our journey further. The sequence of calls is the following:

fl_hw_replace_filter (cls_flower.c) --> tc_setup_cb_add (cls_api.c) --> __tc_setup_cb_call (cls_api.c)
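The policy check can be summarized as a small decision function. The sketch below is my own simplification, not kernel code: the DEMO_ names are hypothetical, the flag bits are assumed to follow the layout of the kernel's skip_hw and skip_sw classifier flags, and the fallback behavior (an unset policy keeps the flow in software if the hardware refuses it, while skip_sw turns that refusal into an error) is an assumption based on my reading of cls_flower.c that you should verify against your kernel version.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit layout for the two skip flags (hypothetical names). */
#define DEMO_FLAG_SKIP_HW (1u << 0)
#define DEMO_FLAG_SKIP_SW (1u << 1)

enum demo_result {
    DEMO_INVALID,    /* Both skip flags set: no place left for the flow. */
    DEMO_SW_ONLY,    /* Flow lives only in the tc software datapath. */
    DEMO_HW_ONLY,    /* Flow lives only in hardware. */
    DEMO_HW_AND_SW,  /* Flow is in software and offloaded to hardware. */
    DEMO_ERROR       /* Hardware was required but refused the flow. */
};

/*
 * Decides where a flow ends up. hw_accepts models whether the driver
 * callback would accept the rule.
 */
static enum demo_result demo_add_flow(uint32_t flags, bool hw_accepts)
{
    /* This sketch models only the two skip flags. */
    assert((flags & ~(DEMO_FLAG_SKIP_HW | DEMO_FLAG_SKIP_SW)) == 0);

    bool skip_hw = (flags & DEMO_FLAG_SKIP_HW) != 0;
    bool skip_sw = (flags & DEMO_FLAG_SKIP_SW) != 0;

    if (skip_hw && skip_sw) {
        return DEMO_INVALID;
    }
    if (skip_hw) {
        /* The flow is just added to tc; hardware is never consulted. */
        return DEMO_SW_ONLY;
    }
    if (skip_sw) {
        /* Hardware is mandatory, so a refusal is an error. */
        return hw_accepts ? DEMO_HW_ONLY : DEMO_ERROR;
    }
    /* Policy unset: try hardware, fall back to software (assumption). */
    return hw_accepts ? DEMO_HW_AND_SW : DEMO_SW_ONLY;
}
```

For debugging, the practical takeaway is that a flow missing from hardware under an unset policy is not necessarily reported as a failure, whereas with skip_sw the same situation surfaces as an error to the caller.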

Finally, in the device driver

From here, the sequence goes to the hardware driver that was registered for the sender when Linux set up traffic control as part of its init sequence. For instance, the following code defines our Mellanox driver as the recipient of the message:

.ndo_setup_tc            = mlx5e_setup_tc,

The mlx5e_setup_tc function issues the following call to register the driver's flow block callback (CB):

flow_block_cb_setup_simple(type_data, &mlx5e_block_cb_list, mlx5e_setup_tc_block_cb, priv, priv, true);

In our case, the Mellanox driver function named mlx5e_setup_tc_block_cb gets called.
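The relationship between registering a block callback and having it invoked later can be sketched as follows. This is a hypothetical model with demo_ names, not the kernel's flow_block API: the real structures are linked lists with reference counting and locking, and the err_stop behavior shown here is my simplified reading of how __tc_setup_cb_call treats callback errors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback type: returns 0 if the driver accepts the rule. */
typedef int (*demo_setup_cb)(int type, void *type_data, void *cb_priv);

struct demo_block_cb {
    demo_setup_cb cb;
    void *cb_priv; /* Driver-private context handed back on each call. */
};

#define DEMO_MAX_CBS 4
struct demo_block {
    struct demo_block_cb cbs[DEMO_MAX_CBS];
    size_t n_cbs;
};

/* Registers a driver callback on a block; -1 if the block is full. */
static int demo_block_cb_register(struct demo_block *block,
                                  demo_setup_cb cb, void *cb_priv)
{
    assert(block != NULL && cb != NULL);
    if (block->n_cbs == DEMO_MAX_CBS) {
        return -1;
    }
    block->cbs[block->n_cbs].cb = cb;
    block->cbs[block->n_cbs].cb_priv = cb_priv;
    block->n_cbs++;
    return 0;
}

/*
 * Calls every registered callback. Returns how many accepted the rule,
 * or the first error if err_stop is set and a callback fails.
 */
static int demo_setup_cb_call(const struct demo_block *block, int type,
                              void *type_data, bool err_stop)
{
    int ok_count = 0;

    for (size_t i = 0; i < block->n_cbs; i++) {
        int err = block->cbs[i].cb(type, type_data, block->cbs[i].cb_priv);
        if (err) {
            if (err_stop) {
                return err;
            }
        } else {
            ok_count++;
        }
    }
    return ok_count;
}

/* Stand-in driver callback: cb_priv points at a counter of accepted rules. */
static int demo_driver_cb(int type, void *type_data, void *cb_priv)
{
    int *installed = cb_priv;

    (void)type;
    (void)type_data;
    (*installed)++;
    return 0;
}

/* Stand-in for a driver that cannot offload the rule. */
static int demo_failing_cb(int type, void *type_data, void *cb_priv)
{
    (void)type;
    (void)type_data;
    (void)cb_priv;
    return -1;
}
```

In this model, the private pointer passed at registration time plays the role of priv in the flow_block_cb_setup_simple call above: the core code stores it without interpreting it and hands it back to the driver on every callback.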

So now we have reached the Mellanox driver code. A few more calls and we can see how the flow rule is added to the flow table for hardware offload (Figure 8).

Figure 8. The Mellanox hardware driver checks flags to add the flow.

The drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c file registers the following function, and the sequence continues as shown in Figure 9.

.create_fte = mlx5_cmd_create_fte,

Figure 9. The Mellanox driver adds the flow to the hardware.

The final function in Figure 9 invokes a command that adds the flow rule to the hardware. With this result, we have reached our destination.

Conclusion

I hope this helps you understand what happens while adding a flow for hardware offload, and helps you troubleshoot problems you might encounter. To learn more about the basics of Open vSwitch hardware offload, I recommend reading Haresh Khandelwal's blog post on the subject.

Last updated: December 14, 2021
