Multipath TCP (MPTCP) extends traditional TCP to allow reliable end-to-end delivery over multiple simultaneous TCP paths, and is coming as a tech preview on Red Hat Enterprise Linux 8.3. This is the first of two articles for users who want to practice with the new MPTCP functionality on a live system. In this first part, we show you how to enable the protocol in the kernel and let client and server applications use the MPTCP sockets. Then, we run diagnostics on the kernel in a sample test network, where endpoints are using a single subflow.

Multipath TCP in Red Hat Enterprise Linux 8

Multipath TCP is a relatively new extension for the Transmission Control Protocol (TCP), and its official Linux implementation is even more recent. Early users might want to know what to expect in RHEL 8.3. In this article, you will learn how to:

  • Enable the Multipath TCP protocol in the kernel.
  • Let an application open an IPPROTO_MPTCP socket.
  • Use tcpdump to inspect MPTCP options with live traffic.
  • Inspect the subflow status with ss.

Enabling Multipath TCP in the kernel

Multipath TCP registers as an upper-layer protocol (ULP) for TCP. Users can ensure that mptcp is available in the kernel by checking the available ULPs:

# sysctl net.ipv4.tcp_available_ulp
net.ipv4.tcp_available_ulp = espintcp mptcp

Unlike in upstream Linux, MPTCP is disabled in the default Red Hat Enterprise Linux (RHEL) 8.3 runtime. To allow applications to create MPTCP sockets, system administrators must enable the protocol with sysctl:

# sysctl -w net.mptcp.enabled=1
# sysctl net.mptcp.enabled
net.mptcp.enabled = 1
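For scripted checks, the same two settings can be read from procfs. The following Python sketch is one possible way to do it. It assumes the usual /proc/sys layout behind the dotted sysctl names; the helper names read_sysctl and mptcp_status are hypothetical and not part of any RHEL tool.

```python
from pathlib import Path

def read_sysctl(name, root="/proc/sys"):
    """Return the value of a sysctl as a string, or None if it is missing.

    `name` uses the dotted form accepted by sysctl, for example
    "net.mptcp.enabled". `root` is a parameter only so that the helper
    can be pointed at a scratch directory.
    """
    path = Path(root, *name.split("."))
    try:
        return path.read_text().strip()
    except OSError:
        return None

def mptcp_status(root="/proc/sys"):
    """Report whether the mptcp ULP is listed and whether MPTCP is enabled."""
    ulps = (read_sysctl("net.ipv4.tcp_available_ulp", root) or "").split()
    enabled = read_sysctl("net.mptcp.enabled", root)
    return {"ulp_available": "mptcp" in ulps, "enabled": enabled == "1"}

print(mptcp_status())
```

On a system where the sysctl files do not exist, both fields are simply reported as False.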

Preparing the system for its first MPTCP socket

With MPTCP enabled in the RHEL 8.3 kernel, user-space programs have a new protocol available for the socket system call. There are two potential use cases for the new protocol.

Native MPTCP applications

Applications supporting MPTCP natively can open a SOCK_STREAM socket specifying IPPROTO_MPTCP as the protocol and AF_INET or AF_INET6 as the address family:

fd = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);

After the application creates a socket, the kernel will operate one or more TCP subflows that will use the standard MPTCP option (IANA number = 30). Client and server semantics are the same as those used by a regular TCP socket (meaning that they will use bind(), listen(), connect(), and accept()).
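The same socket can be requested from higher-level languages. Here is a minimal Python sketch, assuming a Linux system where IPPROTO_MPTCP has the value 262 (Python exposes the constant starting with version 3.10). The fallback to plain TCP and the helper name open_stream_socket are illustrative choices, not something the kernel requires.

```python
import socket

# IPPROTO_MPTCP is 262 on Linux; older Python versions lack the constant.
IPPROTO_MPTCP = getattr(socket, "IPPROTO_MPTCP", 262)

def open_stream_socket(family=socket.AF_INET):
    """Try MPTCP first and fall back to plain TCP.

    Returns (sock, is_mptcp). The fallback covers kernels without MPTCP
    and systems where net.mptcp.enabled is 0.
    """
    try:
        return socket.socket(family, socket.SOCK_STREAM, IPPROTO_MPTCP), True
    except OSError:
        return socket.socket(family, socket.SOCK_STREAM, socket.IPPROTO_TCP), False

sock, is_mptcp = open_stream_socket()
print("MPTCP socket" if is_mptcp else "plain TCP socket (MPTCP unavailable)")
sock.close()
```

Whichever branch is taken, the returned object is used with bind(), listen(), connect(), and accept() exactly as a TCP socket would be.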

Legacy TCP applications converted to MPTCP

Most user-space applications have no knowledge of IPPROTO_MPTCP, nor would it be realistic to patch and rebuild all of them to add native support for MPTCP. Because of this, the community opted for using an eBPF program that wraps the socket() system call and overrides the value of protocol.
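To make the idea of overriding the protocol concrete, the rule such a wrapper applies can be sketched in user space. This is only an illustration of the decision logic, not the eBPF program itself: the function name mptcpify and the choice to restrict the rewrite to AF_INET and AF_INET6 stream sockets are assumptions made for this example.

```python
import socket

IPPROTO_MPTCP = getattr(socket, "IPPROTO_MPTCP", 262)  # 262 on Linux

def mptcpify(family, sock_type, protocol):
    """Return the protocol a wrapper would hand to socket().

    IPv4 and IPv6 stream sockets that ask for TCP, either explicitly or
    through the default protocol 0, are redirected to MPTCP. Every other
    combination is returned unchanged.
    """
    if (family in (socket.AF_INET, socket.AF_INET6)
            and sock_type == socket.SOCK_STREAM
            and protocol in (0, socket.IPPROTO_TCP)):
        return IPPROTO_MPTCP
    return protocol

print(mptcpify(socket.AF_INET, socket.SOCK_STREAM, 0))                   # 262
print(mptcpify(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP))   # 17
```

Checking the address family keeps local stream sockets, such as AF_UNIX ones, out of the rewrite.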

In RHEL 8.3, this program will run on control groups (cgroups) so that system administrators can specify which applications should run MPTCP while others continue with TCP. We will discuss the eBPF helper upstream in the coming weeks, but we want to support early RHEL 8.3 users who want to try their own applications with MPTCP.

You can use a systemtap script as a workaround to intercept calls to __sys_socket() in the kernel. You can then allow a kernel probe to replace IPPROTO_TCP with IPPROTO_MPTCP. You will need to add packages to install a probe in the kernel with stap. You'll also use the good-old ncat tool from the nmap-ncat package to run the client and the server:

# dnf -y install \
> kernel-headers \
> kernel-devel \
> kernel-debuginfo \
> kernel-debuginfo-common-x86_64 \
> systemtap-client \
> systemtap-client-devel \
> nmap-ncat

Use the following command to start the systemtap script:

# stap -vg mptcp.stap

Protocol smoke test: A single subflow using ncat

The test network topology shown in Figure 1 consists of a client and a server that run in separate namespaces, connected through a virtual ethernet device (veth).

Figure 1: A network topology for basic MPTCP testing.

Assigning more IP addresses to the endpoints will simulate multiple L4 paths between them. First, the server opens a passive socket, listening on a TCP port:

# ncat -l 192.0.2.1 4321

Then, the client connects to the server:

# ncat 192.0.2.1 4321

From a functional point of view, the interaction is the same as using ncat with regular TCP: When the user writes a line in the client's standard input, the server displays that line in the standard output. Similarly, typing a line in the server's standard input results in transmitting it back to the client's standard output. In this example, the client uses ncat to send a "hello world (1)\n" message to the server, waits for a second, sends "hello world (2)\n", and then closes the connection.
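For readers who want to reproduce the traffic pattern without ncat, here is a loopback sketch in Python in which a client sends two lines about a second apart and then closes, as the ncat client does in this test. It is an approximation under stated assumptions: it uses 127.0.0.1 instead of the namespaces of Figure 1, it falls back to plain TCP when MPTCP sockets cannot be created, and the function names are hypothetical.

```python
import socket
import threading
import time

IPPROTO_MPTCP = getattr(socket, "IPPROTO_MPTCP", 262)  # 262 on Linux

def stream_socket():
    """Return an MPTCP socket when the kernel allows it, else plain TCP."""
    try:
        return socket.socket(socket.AF_INET, socket.SOCK_STREAM, IPPROTO_MPTCP)
    except OSError:
        return socket.socket(socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_TCP)

def run_exchange(pause=1.0, host="127.0.0.1"):
    """Send two lines from a client to a server; return what the server read."""
    listener = stream_socket()
    listener.settimeout(5)
    listener.bind((host, 0))           # port 0: let the kernel pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    chunks = []

    def serve():
        conn, _ = listener.accept()
        with conn:
            conn.settimeout(5 + pause)
            while True:
                data = conn.recv(4096)
                if not data:           # empty read: the client closed
                    break
                chunks.append(data)

    server = threading.Thread(target=serve)
    server.start()
    with stream_socket() as client:
        client.settimeout(5)
        client.connect((host, port))
        client.sendall(b"hello world (1)\n")
        time.sleep(pause)              # the second line follows after a pause
        client.sendall(b"hello world (2)\n")
    server.join()
    listener.close()
    return b"".join(chunks)
```

Calling run_exchange() while tcpdump listens on the loopback interface should produce two 16-byte data segments from the client followed by a FIN.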

Note: Current Linux MPTCP does not support mixed IPv4/IPv6 addresses. Therefore, all addresses involved in client/server connectivity must belong to the same family.

Capturing traffic and examining it with tcpdump

The Red Hat Enterprise Linux 8 version of tcpdump doesn't yet support dissecting MPTCP v1 suboptions in TCP headers. We can overcome this problem by building a binary from the upstream repository. Alternatively, we can replace it with a more recent binary. With either of those changes, it's possible to inspect the MPTCP suboption.

Three-way handshake: The MP_CAPABLE suboption

During the three-way handshake, the client and server exchange 64-bit keys using the MP_CAPABLE suboption; the keys are visible in the output of tcpdump in the braces ({}) after mptcp capable. These keys are later used to compute the DSN/DACK and token. The MP_CAPABLE suboption that originates in the client is also present after a successful connection setup, and it remains present until the server explicitly acknowledges it using a data sequence signal (DSS) suboption:

# tcpdump -#tnnr capture.pcap
1  IP 192.0.2.2.44176 > 192.0.2.1.4321: Flags [S], seq 1721499445, win 29200, options [mss 1460,sackOK,TS val 33385784 ecr 0,nop,wscale 7,mptcp capable v1], length 0
2  IP 192.0.2.1.4321 > 192.0.2.2.44176: Flags [S.], seq 3341831007, ack 1721499446, win 28960, options [mss 1460,sackOK,TS val 4061152149 ecr 33385784,nop,wscale 7,mptcp capable v1 {0xbb206e3023b47a2d}], length 0
3  IP 192.0.2.2.44176 > 192.0.2.1.4321: Flags [.], ack 1, win 229, options [nop,nop,TS val 33385785 ecr 4061152149,mptcp capable v1 {0x41923206b75835f5,0xbb206e3023b47a2d}], length 0
4  IP 192.0.2.2.44176 > 192.0.2.1.4321: Flags [P.], seq 1:17, ack 1, win 229, options [nop,nop,TS val 33385785 ecr 4061152149,mptcp capable v1 {0x41923206b75835f5,0xbb206e3023b47a2d},nop,nop], length 16
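When a capture is long, it can help to pull the keys out of the text programmatically. The sketch below parses the option strings of packets 1 through 3. It assumes the output format of the tcpdump build used for this capture; other builds may print the suboption differently, and the helper name mp_capable_keys is hypothetical.

```python
import re

# Matches "mptcp capable v1" optionally followed by " {0x...,0x...}".
MP_CAPABLE = re.compile(r"mptcp capable v(\d+)(?: \{([^}]*)\})?")

def mp_capable_keys(line):
    """Return (version, [keys]) for a tcpdump line, or None without MP_CAPABLE."""
    match = MP_CAPABLE.search(line)
    if match is None:
        return None
    raw = match.group(2)
    keys = [int(k, 16) for k in raw.split(",")] if raw else []
    return int(match.group(1)), keys

packets = [
    "[mss 1460,sackOK,TS val 33385784 ecr 0,nop,wscale 7,mptcp capable v1]",
    "[mss 1460,sackOK,TS val 4061152149 ecr 33385784,nop,wscale 7,"
    "mptcp capable v1 {0xbb206e3023b47a2d}]",
    "[nop,nop,TS val 33385785 ecr 4061152149,"
    "mptcp capable v1 {0x41923206b75835f5,0xbb206e3023b47a2d}]",
]
for number, options in enumerate(packets, start=1):
    version, keys = mp_capable_keys(options)
    print(number, version, [hex(k) for k in keys])
```

The SYN carries no key, the SYN/ACK carries the server's key, and the final ACK carries both.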

MPTCP-level sequence numbers: The DSS suboption

After that, TCP segments will carry the DSS suboption that contains MPTCP sequence numbers. More specifically, we can observe the data sequence number (DSN) and data acknowledgment (DACK) values, as shown here:

5  IP 192.0.2.1.4321 > 192.0.2.2.44176: Flags [.], ack 17, win 227, options [nop,nop,TS val 4061152149 ecr 33385785,mptcp dss ack 1711754507747579648], length 0
6  IP 192.0.2.2.44176 > 192.0.2.1.4321: Flags [P.], seq 17:33, ack 1, win 229, options [nop,nop,TS val 33386778 ecr 4061152149,mptcp dss ack 1331650533424046587 seq 1711754507747579648 subseq 17 len 16,nop,nop], length 16
7  IP 192.0.2.1.4321 > 192.0.2.2.44176: Flags [.], ack 33, win 227, options [nop,nop,TS val 4061153142 ecr 33386778,mptcp dss ack 1711754507747579664], length 0

With a single subflow, DSN and DACK increase by the same amount as the TCP sequence and acknowledgment numbers. When the connection ends, the subflow is closed with a FIN packet, just like a regular TCP flow. Because the MPTCP socket is closed as well, the data fin bit is set in the DSS suboption, as shown here:

8  IP 192.0.2.2.44176 > 192.0.2.1.4321: Flags [F.], seq 33, ack 1, win 229, options [nop,nop,TS val 33387798 ecr 4061153142,mptcp dss fin ack 1331650533424046587 seq 1711754507747579664 subseq 0 len 1,nop,nop], length 0
9  IP 192.0.2.1.4321 > 192.0.2.2.44176: Flags [.], ack 34, win 227, options [nop,nop,TS val 4061154203 ecr 33387798,mptcp dss ack 1711754507747579664], length 0
10  IP 192.0.2.1.4321 > 192.0.2.2.44176: Flags [F.], seq 1, ack 34, win 227, options [nop,nop,TS val 4061162156 ecr 33387798,mptcp dss fin ack 1711754507747579664 seq 1331650533424046587 subseq 0 len 1,nop,nop], length 0
11  IP 192.0.2.2.44176 > 192.0.2.1.4321: Flags [.], ack 2, win 229, options [nop,nop,TS val 33395793 ecr 4061162156,mptcp dss ack 1331650533424046587], length 0
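The statement that DSN and DACK advance in step with the TCP numbers can be checked with the values from packets 5 through 7 of the capture. The following Python lines only redo that arithmetic; note that the TCP sequence numbers are the relative ones printed by tcpdump.

```python
# Values copied from the capture.
ack5, dack5 = 17, 1711754507747579648        # packet 5: ack 17, dss ack ...648
seq_start, seq_end = 17, 33                  # packet 6: seq 17:33
dsn6, map_len = 1711754507747579648, 16      # packet 6: dss seq ...648 len 16
ack7, dack7 = 33, 1711754507747579664        # packet 7: ack 33, dss ack ...664

payload = seq_end - seq_start
assert payload == map_len == 16

# Both acknowledgment levels move forward by the payload length.
assert ack7 - ack5 == payload
assert dack7 - dack5 == payload

# The distance between the two sequence spaces stays constant.
assert dack5 - ack5 == dsn6 - seq_start == dack7 - ack7
print("offset between DSN and relative TCP sequence:", dsn6 - seq_start)
```

With more than one subflow this constant distance no longer holds for the connection as a whole, because each subflow has its own TCP sequence space.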

Inspecting subflow data with ss

Because MPTCP uses TCP as a transport protocol, network administrators can query the kernel to retrieve information on TCP connections that are being used by the main MPTCP socket. In this example, we're running ss on the client filtering on the server listening port, where information relevant to MPTCP can be read after tcp-ulp-mptcp:

# ss -nti '( dport :4321 )' dst 192.0.2.1
State Recv-Q Send-Q Local Address:Port  Peer Address:PortProcess
ESTAB 0      0          192.0.2.2:44176    192.0.2.1:4321
cubic wscale:7,7 [...] bytes_sent:32 bytes_acked:33 [...] tcp-ulp-mptcp flags:Mmec token:0000(id:0)/768f615c(id:0) seq:127af91ad1b321fb sfseq:1 ssnoff:c7304b5f maplen:0

SS command output explained

The fields after tcp-ulp-mptcp show the state reported by ss in the client namespace immediately after the transmission of packet 6 in the previous section:

  • Each value of token is a truncated hash of the remote peer's key, which the client receives during the three-way handshake (RFC 8684 defines the token as the most significant 32 bits of the SHA-256 hash of the key). Further MP_JOIN SYN packets will use that value to prove that they have not been spoofed. The id is the subflow identifier as specified in the RFC. For non-MP_JOIN sockets, only the local token and ID are available.
  • flags is a bitmask containing information on the subflow state. For instance, M/m records the presence of the MP_CAPABLE suboption in the three-way handshake. The c means that the client received the server's key (that is, it acknowledged the SYN/ACK), while e means that the exchange of both MPTCP keys is complete.
  • seq denotes the next MPTCP sequence number that the endpoint expects on reception, or, equivalently, the DACK value for the next transmitted packet.
  • sfseq is the subflow sequence number, meaning that it is the current TCP ACK value for this subflow.
  • ssnoff is the current difference between the TCP sequence number and the MPTCP sequence number for this subflow. If you are using a single subflow, this value will not change during the connection. If you are using more than one subflow to simultaneously carry data segments, then this value can increase or decrease depending on the path capacity.
  • maplen indicates how many bytes are left to fill the current DSS map.
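The fields listed above are easy to collect in a script. The sketch below splits the text that follows tcp-ulp-mptcp into a dictionary of strings. It assumes the key:value layout shown in this example and that nothing unrelated follows the MPTCP fields on the line; the helper name parse_mptcp_info is hypothetical.

```python
import re

def parse_mptcp_info(ss_line):
    """Return the key:value pairs that follow tcp-ulp-mptcp, or None."""
    _, marker, tail = ss_line.partition("tcp-ulp-mptcp")
    if not marker:
        return None
    # Each field is a word, a colon, and a run of non-space characters.
    return dict(re.findall(r"(\w+):(\S+)", tail))

line = ("cubic wscale:7,7 [...] bytes_sent:32 bytes_acked:33 [...] "
        "tcp-ulp-mptcp flags:Mmec token:0000(id:0)/768f615c(id:0) "
        "seq:127af91ad1b321fb sfseq:1 ssnoff:c7304b5f maplen:0")
info = parse_mptcp_info(line)
for key, value in info.items():
    print(f"{key:8} {value}")
```

The token entry keeps both halves, including the id annotations, so that nothing printed by ss is lost.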

Note that we can compute the value of seq from the server key in the SYN/ACK (packet 2 of the capture): the server's Initial Data Sequence Number (IDSN) is obtained by truncating sha256(ntohll(bb206e3023b47a2d)) to its least-significant 64 bits, as specified by RFC 8684.
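The derivation can be tried with a few lines of Python. This sketch follows the RFC 8684 description: the key is hashed in network byte order, the token is the most significant 32 bits of the SHA-256 digest, and the IDSN is the least significant 64 bits. The function name token_and_idsn is hypothetical. Because RFC 8684 has the MP_CAPABLE SYN occupy the first octet of the data sequence space, the sketch also prints the IDSN plus one, so that either value can be compared with what ss and tcpdump report.

```python
import hashlib

def token_and_idsn(key):
    """Derive (token, idsn) from a 64-bit MPTCP key.

    token: most significant 32 bits of SHA-256(key in network byte order)
    idsn:  least significant 64 bits of the same digest
    """
    digest = hashlib.sha256(key.to_bytes(8, "big")).digest()
    token = int.from_bytes(digest[:4], "big")
    idsn = int.from_bytes(digest[-8:], "big")
    return token, idsn

server_key = 0xbb206e3023b47a2d          # from the SYN/ACK, packet 2
token, idsn = token_and_idsn(server_key)
next_dsn = (idsn + 1) & (2**64 - 1)      # wrap around at 64 bits
print(f"token={token:08x} idsn={idsn:016x} idsn+1={next_dsn:016x}")
```

No expected output is listed here on purpose; run it and compare the result with the seq field in the ss output.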

Also note that, because the client is not receiving any data from the server, seq remains equal to the IDSN throughout the connection's lifetime. For the same reason, the value of sfseq stays equal to 1 in the example. We can see the IDSN in the DSN number of packet 10 and in the DACK number of packets 6 and 8 (in decimal format: 1331650533424046587), as well as in the output of ss (in hex format: 127af91ad1b321fb). Similarly, in this example the SSN offset (c7304b5f in the ss output) stays equal to the initial TCP sequence number (3341831007 in the SYN/ACK, packet 2 of the capture output).
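Since ss prints these values in hexadecimal and tcpdump prints them in decimal, a quick conversion confirms that they are the same numbers. The lines below use only values that appear in the example output.

```python
# Hexadecimal values from ss, decimal values from tcpdump.
ss_seq, ss_ssnoff = "127af91ad1b321fb", "c7304b5f"
tcpdump_dack = 1331650533424046587       # DACK of packets 6 and 8, DSN of packet 10
tcpdump_synack_seq = 3341831007          # TCP sequence number of packet 2

print(int(ss_seq, 16) == tcpdump_dack)             # True
print(int(ss_ssnoff, 16) == tcpdump_synack_seq)    # True
```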

Conclusion and what's next

In realistic scenarios, MPTCP will generally use more than one subflow. In this way, sockets can preserve connectivity even after an event causes a failure in one of the L4 paths. In the next article, we will show you how to use iproute2 to configure multiple TCP paths on RHEL 8.3, and how to watch ncat doing multipath for real.

Last updated: August 18, 2020