
What is path MTU discovery?

A maximum transmission unit (MTU) is the largest packet that can be transmitted as a single entity over a network connection. A sending node determines the largest packet size that a given path can carry through a technique called path MTU discovery (PMTUD). The goal of PMTUD is to choose the most efficient packet size that will succeed in reaching the recipient. In this article, you'll learn how this process works in the Linux kernel's implementation of the Stream Control Transmission Protocol (SCTP).

Linux SCTP uses an algorithm called Datagram Packetization Layer Path MTU Discovery (DPLPMTUD, or just PLPMTUD), which is described in RFC 8899. Unlike earlier forms of PMTUD, this method does not rely on reception and validation of Packet Too Big (PTB) ICMP messages. The new implementation is therefore more robust than the classical PMTUD.

PLPMTUD for SCTP was implemented in the Linux kernel some months ago and will be supported on versions 8.6 and 9.0 of Red Hat Enterprise Linux.

PLPMTUD runs in the kernel's packetization layer (PL). The PL sends probe packets of various sizes to determine the largest unfragmented datagram that can be sent over a network path. If a probe packet is successfully delivered (as determined by the PL), the PLPMTU is raised to the size of the successful probe. If a black hole is detected (that is, if packets of size PLPMTU are consistently not received), the PL reduces the PLPMTU.

Probes and acknowledgements

RFC 8899 defines the contents of a probe packet and the acknowledgement (ACK) returned by the recipient.

Each probe packet consists of an SCTP common header followed by a HEARTBEAT chunk and a PAD chunk. The HEARTBEAT chunk causes the recipient to send back a HEARTBEAT ACK packet to confirm that the probe was successful. The PAD chunk pads the packet out to the size being probed.

The HEARTBEAT chunk also carries a Heartbeat Information parameter that includes the size of the probe packet in a PROBED_SIZE field. Because the recipient returns this value in its HEARTBEAT ACK packet, the sender can be certain as to what the size of the original probe is, and therefore what is safe to assign as a packet size for further traffic. If the PROBED_SIZE field in a successful exchange is bigger than the PLPMTU, the sender can start using the larger size.
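The probe layout described above can be sketched in a few lines of Python. This is only an illustration, not the kernel's code: the chunk and parameter type numbers come from the SCTP specifications, but the Heartbeat Information payload used here (a 4-byte probed size followed by an 8-byte nonce) is a hypothetical layout, the checksum is left at zero, and the function name `build_probe` is made up for this example.

```python
import struct

CHUNK_HEADER_LEN = 4        # type (1 byte), flags (1 byte), length (2 bytes)
CHUNK_HEARTBEAT = 4         # HEARTBEAT chunk type
CHUNK_PAD = 0x84            # PAD chunk type (RFC 4820)
PARAM_HEARTBEAT_INFO = 1    # Heartbeat Information parameter type


def build_probe(probed_size, src_port=8888, dst_port=8888, vtag=0, nonce=0):
    """Return an SCTP packet of exactly probed_size bytes (IP header excluded)."""
    if probed_size % 4:
        raise ValueError("this sketch only handles sizes that are multiples of 4")

    # Hypothetical Heartbeat Information payload: probed size plus a nonce.
    info = struct.pack("!IQ", probed_size, nonce)
    param = struct.pack("!HH", PARAM_HEARTBEAT_INFO, 4 + len(info)) + info
    heartbeat = struct.pack(
        "!BBH", CHUNK_HEARTBEAT, 0, CHUNK_HEADER_LEN + len(param)
    ) + param

    # Common header: ports, verification tag, checksum (left as zero here;
    # a real sender fills in CRC32c).
    header = struct.pack("!HHII", src_port, dst_port, vtag, 0)

    # The PAD chunk fills whatever is left so the packet reaches probed_size.
    pad_len = probed_size - len(header) - len(heartbeat)
    if pad_len < CHUNK_HEADER_LEN:
        raise ValueError("probed_size is too small to hold a PAD chunk")
    pad = struct.pack("!BBH", CHUNK_PAD, 0, pad_len) + bytes(pad_len - CHUNK_HEADER_LEN)

    return header + heartbeat + pad
```

Calling `build_probe(1200)` yields a 1200-byte buffer whose PAD chunk absorbs everything not used by the common header and the HEARTBEAT chunk.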

The implementation on Linux uses the timers and variables described in the following subsections.

Timers

There is one timer per transport. The timer is used as the PMTU_RAISE_TIMER defined in the RFC when path MTU discovery is in the Search Complete state, and as the PROBE_TIMER in other states. (You'll get an outline of these various states later in this article.)

The timer is started once PLPMTUD is enabled and path MTU discovery enters the Base state. The timer times out after every PROBE_INTERVAL, causing the node to resend any lost probe packets.

When in the Search Complete state, the timer times out every 30 PROBE_INTERVAL periods. The node then goes back to the Search state, using the timer to trigger resends when necessary.
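As a small illustration of the dual role of this timer, the following sketch picks the timeout from the current state. The state names and the function `timer_period_ms` are invented for this example and do not correspond to kernel identifiers.

```python
RAISE_MULTIPLIER = 30  # Search Complete waits 30 PROBE_INTERVAL periods


def timer_period_ms(state, probe_interval_ms):
    """Return how long the per-transport timer runs before it expires."""
    if state == "search_complete":
        # The timer acts as the PMTU_RAISE_TIMER.
        return RAISE_MULTIPLIER * probe_interval_ms
    # In every other state the timer acts as the PROBE_TIMER.
    return probe_interval_ms
```

With a PROBE_INTERVAL of 5000 ms, the timer in this model expires after 5 seconds while probing and after 150 seconds in the Search Complete state.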

Variables

The following variables track values and state in Linux path MTU discovery:

  • PLPMTU: Keeps the most recently confirmed PROBED_SIZE. The value equals path MTU - sizeof(IP/IPv6 header). When path MTU discovery enters the Search Complete state, the path MTU used for transmission is updated to the value PLPMTU + sizeof(IP/IPv6 header).
  • PROBE_COUNT: A count for the number of successive unsuccessful probe packets that have been sent. When a probe packet is acknowledged, the value is set to zero.
  • PROBE_INTERVAL: The time interval (in milliseconds) used to schedule the PLPMTUD probe timer. The timer expires if the node fails to receive an acknowledgement to a probe packet after this period. This variable is also the time interval between probes for the current path MTU when probe searching is done.
  • PROBED_SIZE: The size of the current probe packet as determined at the PL. This value is a tentative value for the PLPMTU, awaiting confirmation by an acknowledgement.
  • PTB/PTB_SIZE: As noted above, PTB stands for Packet Too Big; this is sometimes also called Fragmentation Needed. PTB_SIZE is the path MTU reported in the PTB message, with the IP header length subtracted from the value.
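The relationship between the PLPMTU and the path MTU is simple arithmetic. The sketch below assumes an IPv4 header without options (20 bytes) or a fixed IPv6 header without extension headers (40 bytes); the function names are made up for this example.

```python
IPV4_HEADER_LEN = 20  # assumes no IP options
IPV6_HEADER_LEN = 40  # assumes no extension headers


def plpmtu_from_path_mtu(path_mtu, ipv6=False):
    """Size available to the SCTP packet once the IP header is accounted for."""
    return path_mtu - (IPV6_HEADER_LEN if ipv6 else IPV4_HEADER_LEN)


def path_mtu_from_plpmtu(plpmtu, ipv6=False):
    """Path MTU to set on the transport once a PLPMTU has been confirmed."""
    return plpmtu + (IPV6_HEADER_LEN if ipv6 else IPV4_HEADER_LEN)
```

For an IPv4 path, a link MTU of 1400 corresponds to a PLPMTU of 1380, and a link MTU of 1500 corresponds to a PLPMTU of 1480.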

Constants

The following constants are used in making path MTU discovery decisions:

  • BASE_PLPMTU: A configured size that is expected to work for most paths, set to 1200. PLPMTUD starts its probe with this value as PROBED_SIZE. If the real PLPMTU turns out to be smaller, the kernel enters the Error state and allows IP fragmentation.
  • MAX_PROBES: The maximum value of the PROBE_COUNT counter. When the number of consecutive unacknowledged probes reaches this value, the probe is treated as failed, and the state machine can change state or switch to a different step size. This constant is set to 3.
  • BIG_STEP, MIN_STEP: The increments added to PROBED_SIZE when a probe succeeds. BIG_STEP is used when the Search state starts. If a probe fails, MIN_STEP is used instead. BIG_STEP is set to 32 and MIN_STEP to 4.

Commands to set options for path MTU discovery

System administrators or developers can change runtime parameters through the mechanisms in this section.

sysctl

The sysctl command provides a default value for new sockets' PROBE_INTERVAL value:


  sysctl -w net.sctp.plpmtud_probe_interval=5000

The parameter is set for the network namespace (netns). A new association takes the value from its socket, and a new transport takes the value from its association. Changes to this parameter affect only sockets that are created subsequently.
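To check the current default from a program rather than from the shell, the sysctl value can be read through procfs, where dots in the sysctl name map to path separators. The sketch below is a minimal example; the helper names are made up, and it returns `None` when the file is missing (for instance, when the SCTP module is not loaded or the kernel predates the feature).

```python
from pathlib import Path


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name to its file under procfs."""
    return Path(root) / name.replace(".", "/")


def read_probe_interval(root="/proc/sys"):
    """Return the default PROBE_INTERVAL in milliseconds, or None if unavailable."""
    path = sysctl_path("net.sctp.plpmtud_probe_interval", root)
    try:
        return int(path.read_text().strip())
    except (FileNotFoundError, NotADirectoryError, PermissionError):
        return None
```

Because the parameter is per network namespace, the value read this way is the one for the namespace of the calling process.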

setsockopt

To configure the PROBE_INTERVAL on a fine-grained basis, the setsockopt() system call with the SCTP_PLPMTUD_PROBE_INTERVAL option can change the value for a socket, an association, or even a transport:


  setsockopt(SCTP_PLPMTUD_PROBE_INTERVAL, interval)

State machine and steps in path MTU discovery

Figure 1 shows the stages in path MTU discovery and how the node moves between them.

Figure 1: The stages in path MTU discovery.

In the following subsections, we'll examine the most common state transitions.

Base → Search → Search Complete

A normal probe starts from the Base state and probes with a PROBED_SIZE of BASE_PLPMTU (1200). Once a packet is acknowledged, path MTU discovery enters the Search state. The next probe starts by incrementing the PROBED_SIZE by BIG_STEP (32).

This probe-ack-increment-probe sequence advances until a probe packet fails to be acknowledged. When that happens, the node resends the probe MAX_PROBES-1 more times (2 times, by default) with the same PROBED_SIZE following a wait of PROBE_INTERVAL milliseconds.

If these probes do not succeed, the node goes back to the last confirmed PLPMTU and increments the probe size by MIN_STEP (4) each time. It continues until one probe fails in the same way as before. The node then assumes that the proper PLPMTU has been found. The node updates the path MTU on the transport and enters the Search Complete state.

Search Complete → Search → Search Complete

In the Search Complete state, the node waits for 30 intervals of PROBE_INTERVAL milliseconds unless it notices a data retransmission.

When there is a data retransmission or the 30 PROBE_INTERVAL periods expire, the node enters the Search state. It starts probing with a probe of PROBED_SIZE+MIN_STEP. If this probe gets acknowledged, the next probe increments the size by BIG_STEP. When a new proper PLPMTU is found, the node updates the transport path MTU with that value and re-enters the Search Complete state.
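The two search passes described above (one starting from Base, one starting from Search Complete) can be modeled with a short simulation. This is a simplified sketch, not the kernel's code: it treats a probe as failed as soon as its size exceeds the real limit (instead of resending it MAX_PROBES times), it skips the probes that merely re-confirm the current PLPMTU, and it does not model black holes. The function name and its parameters are invented for this example, so intermediate sizes in a real trace can differ from what this model produces.

```python
BIG_STEP = 32
MIN_STEP = 4


def simulate_search(plpmtu, path_limit, first_step):
    """Return (sizes probed, final PLPMTU) for one search pass.

    plpmtu     -- last confirmed size when the search starts
    path_limit -- largest size the path really carries (unknown to the sender)
    first_step -- BIG_STEP when coming from Base, MIN_STEP from Search Complete
    """
    probed = []
    high = None                      # smallest size known to fail so far
    size = plpmtu + first_step
    while high is None or size < high:
        probed.append(size)
        if size <= path_limit:       # probe acknowledged: raise the PLPMTU
            plpmtu = size
            size += BIG_STEP if high is None else MIN_STEP
        else:                        # probe lost: remember the ceiling, use fine steps
            high = size
            size = plpmtu + MIN_STEP
    return probed, plpmtu
```

For example, `simulate_search(1200, 1380, BIG_STEP)` ends with a PLPMTU of 1380, and `simulate_search(1380, 1480, MIN_STEP)` probes 1384, 1416, 1448, 1480, 1512, and 1484 before settling on 1480.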

Search/Search Complete → Base

During a search, any probe that fails with the PROBED_SIZE set to the PLPMTU causes a "Black Hole detected" error, sending the node back to the Base state.

If a probe fails even with the BASE_PLPMTU, the node enters the Error state, where IP fragmentation is allowed.
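The failure handling described in this section can be summarized as a small decision function. The sketch below is an interpretation of the rules in this article, not the kernel's code; the `State` enum and the function `on_probe_failed` are made up for this example, and the function is meant to be called only after a probe has gone unacknowledged MAX_PROBES times.

```python
from enum import Enum

BASE_PLPMTU = 1200


class State(Enum):
    BASE = "base"
    SEARCH = "search"
    SEARCH_COMPLETE = "search_complete"
    ERROR = "error"


def on_probe_failed(state, probed_size, plpmtu):
    """Return (next state, next PROBED_SIZE) after MAX_PROBES failed attempts."""
    if state in (State.BASE, State.ERROR):
        # Even BASE_PLPMTU does not get through: allow IP fragmentation
        # and keep probing with BASE_PLPMTU.
        return State.ERROR, BASE_PLPMTU
    if probed_size == plpmtu:
        # Black hole: the previously confirmed size stopped working.
        return State.BASE, BASE_PLPMTU
    # Ordinary search failure: fall back to the last confirmed size.
    return State.SEARCH, plpmtu
```

For instance, a failed probe of 1480 while the PLPMTU is 1480 returns `(State.BASE, 1200)`, whereas a failed probe of 1512 with a PLPMTU of 1480 stays in the Search state.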

Packet Too Big or Fragmentation Needed

During PTB packet processing, if the PTB_SIZE is between the PLPMTU and PROBED_SIZE, the next probe starts with the PROBED_SIZE set to PTB_SIZE. This could save some rounds of probing when finding the proper PLPMTU.
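The shortcut described above can be sketched as follows. The function name is made up for this example, the IP header is assumed to be a 20-byte IPv4 header, and rounding PTB_SIZE down to a multiple of 4 is an assumption of this sketch rather than something stated in this article. PTB messages that fall outside the range between the PLPMTU and PROBED_SIZE are not modeled here and simply return `None`.

```python
IPV4_HEADER_LEN = 20  # assumes no IP options


def next_probe_after_ptb(plpmtu, probed_size, ptb_mtu, ip_header_len=IPV4_HEADER_LEN):
    """Return the next PROBED_SIZE suggested by a PTB message, or None.

    ptb_mtu is the MTU reported in the PTB message.
    """
    # PTB_SIZE is the reported MTU without the IP header.
    # Assumption: round down to a multiple of 4 so chunks stay aligned.
    ptb_size = (ptb_mtu - ip_header_len) & ~3
    if plpmtu < ptb_size < probed_size:
        return ptb_size
    return None
```

With a PLPMTU of 1384, an outstanding probe of 1416, and a PTB message reporting an MTU of 1430, this model suggests 1408 as the next probe size.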

Path MTU discovery example scenarios

This section contains examples showing how PROBED_SIZE changes during PLPMTUD probing in different scenarios in Linux SCTP. The examples are based on the topology in Figure 2. The topology has two hosts (a client and a server) and a router with two interfaces:

  • A host (client) at link1_1 exchanges packets with the router at its link1_2 interface.
  • A host (server) at link2_2 exchanges packets with the router at its link2_1 interface.

Figure 2: Network topology for our examples.

We begin by starting an SCTP connection from client to server:


  sctp_darn -H 192.168.2.1 -P 8888 -l  # on Server
  sctp_darn -H 192.168.1.1 -P 8888 -h 192.168.2.1 -p 8888 -s  # on Client

Each of the scenarios that follow shows system administration commands that trigger a path MTU discovery sequence change, and the steps followed by the kernel to set the path MTU.

Basic sequence

Many sequences in path MTU discovery return to the one in this section. The sequence can be triggered through the following commands:


  iptables -A INPUT -p icmp -j DROP  # on Client, disable the classical PMTUD
  ip link set link2_1 mtu 1400  # on Router

Steps in path MTU discovery (Base → Search → Complete):

  1. Probed size: 1200 (Starts at BASE_PLPMTU (1200), tries to confirm)
  2. Probed size: 1200 (Confirmed and enters Search, increments by BIG_STEP)
  3. Probed size: 1232 → 1264 → ... → 1356
  4. Probed size: 1388 (3-time-rtx failed, goes back to 1356)
  5. Probed size: 1356 (increments by MIN_STEP) → 1360 → 1364 → ... → 1380
  6. Probed size: 1384 (3-time-rtx failed, tries to confirm 1380)
  7. Probed size: 1380 (confirmed, enters Complete and sets path MTU)
  8. Probed size: 1380 (raise-timer up, enters Search, increments by MIN_STEP)
  9. Probed size: 1384 (3-time-rtx failed, tries to confirm 1380)
  10. Probed size: 1380 (confirmed, enters Complete and sets path MTU)

Other sample sequences

In this section, you'll see the sequences that can be triggered by a number of specific commands.

If this command is entered:


  ip link set link2_1 mtu 1500  # on Router

Then these are the steps in path MTU discovery (Complete → Search → Complete):

  1. Probed size: 1380 (raise-timer up, tries to confirm 1380)
  2. Probed size: 1380 (confirmed, enters Search, increments by MIN_STEP)
  3. Probed size: 1384 (confirmed, increments by BIG_STEP) → 1416 → 1448 → ... → 1480
  4. Probed size: 1512 (3-time-rtx failed, goes back to 1480)
  5. Probed size: 1480 (increments by MIN_STEP)
  6. Probed size: 1484 (3-time-rtx failed, tries to confirm 1480)
  7. Probed size: 1480 (confirmed, enters Complete and sets path MTU)

If this command is entered:


  ip link set link2_1 mtu 1400  # on Router

Then these are the steps in path MTU discovery (Complete → Base → Search):

  1. Probed size: 1480 (raise-timer up, tries to confirm 1480)
  2. Probed size: 1480 (3-time-rtx failed, enters Base, goes back to 1200, sets path MTU)
  3. Probed size: 1200 (confirmed, enters Search, increments by BIG_STEP)
  4. Probed size: 1232
  5. Starts basic sequence

If this command is entered:


  ip link set link2_1 mtu 1000  # on Router

Then these are the steps in path MTU discovery (Complete → Search → Base → Error):

  1. Probed size: 1380 (raise-timer up, tries to confirm 1380)
  2. Probed size: 1380 (3-time-rtx failed, enters Base, goes back to 1200 and sets path MTU)
  3. Probed size: 1200 (3-time-rtx failed, enters Error, allows IP fragmentation)
  4. Probed size: 1200 (3-time-rtx failed, enters Error, allows IP fragmentation)
  5. Probed size: 1200 (...)

If this command is entered:


  ip link set link2_1 mtu 1400  # on Router

Then these are the steps in path MTU discovery (Error → Base → Search → Complete):

  1. Probed size: 1200 (confirmed, enters Base, tries to confirm 1200 again)
  2. Probed size: 1200 (confirmed, enters Search, increments by BIG_STEP)
  3. Probed size: 1232
  4. Starts basic sequence

If this command is entered:


  ip link set link1_1 mtu 1500  # on Client

Then these are the steps in path MTU discovery (Complete → Base):

  1. Probed size: 1380 (rtx-timer reset, enters Base, goes back to 1200)
  2. Probed size: 1200 (confirmed, enters Search, increments by BIG_STEP)
  3. Probed size: 1232
  4. Starts basic sequence

If these commands are entered:


  iptables -D INPUT -p icmp -j DROP  # on Client, enable the classical PMTUD
  ip link set link1_1 mtu 1430  # on Client

Then these are the steps in path MTU discovery (Complete → Search → Complete):

  1. Probed size: 1380 (raise-timer up, tries to confirm 1380)
  2. Probed size: 1380 (confirmed, enters Search, increments by MIN_STEP)
  3. Probed size: 1384 (confirmed, increments by BIG_STEP)
  4. Probed size: 1416 (PTB received (path MTU == 1430), tries to confirm the path MTU from it)
  5. Probed size: 1408 (confirmed, increments by BIG_STEP)
  6. Probed size: 1440 (3-time-rtx failed, goes back to 1408)
  7. Probed size: 1408 (increments by MIN_STEP)
  8. Probed size: 1412 (3-time-rtx failed, tries to confirm 1408)
  9. Probed size: 1408 (enters Complete and sets path MTU)

If this command is entered:


  ip link set link2_1 mtu 1400  # on Router

Then these are the steps in path MTU discovery (Complete → Base):

  1. Probed size: 1408 (raise-timer up, tries to confirm 1408)
  2. Probed size: 1408 (PTB received (path MTU < 1408), enters Base, goes back to 1200 and sets path MTU)
  3. Probed size: 1200 (confirmed, enters Search, increments by BIG_STEP)
  4. Probed size: 1232
  5. Starts basic sequence

If these commands are entered:


  iptables -A INPUT -p icmp -j DROP  # on Client, disable the classical PMTUD
  ip link set link2_1 mtu 1300  # on Router
  # on Client, input 1350 bytes data in sctp_darn

Then these are the steps in path MTU discovery (Complete → Base):

  1. Probed size: 1380 (Data RTX happens, tries to confirm 1380)
  2. Probed size: 1380 (3-time-rtx failed, enters Base, goes back to 1200 and sets path MTU)
  3. Probed size: 1200 (confirmed, enters Search, increments by BIG_STEP)
  4. Probed size: 1232
  5. Starts basic sequence

Conclusion

ICMP PTB or Fragmentation Needed messages are often dropped or disabled by routers or servers. When that happens, classical PMTUD cannot determine the proper path MTU, which causes inefficient data transmission and even packet loss. PLPMTUD provides an effective way to overcome this. If you are an SCTP user, this article has shown you the details of how PLPMTUD works in SCTP, and how it can be used in your SCTP programs.

Last updated: February 12, 2024