According to recent statistics, IPv6 traffic has surpassed the 12% mark and its volume keeps growing. This draws more and more attention to scalability issues within the IPv6 stack, some of which were solved (in some cases long ago) in the IPv4 networking stack, while other limitations are new and specific to the Linux IPv6 implementation. This blog post discusses one issue in the IPv6 routing implementation which can surface on highly loaded IPv6 setups (see below if you want to check whether you might be affected by this problem).

The core networking code needs to save some metric information to a temporary but stable storage location within the kernel. Examples of such metrics are the round trip time last measured for a TCP connection when its socket was closed, or the discovered path maximum transmission unit (PMTU) towards a particular end host. While the IPv4 stack handles this information lazily, using copy-on-write and other more elaborate techniques when such a metric needs to be saved, the IPv6 stack greedily allocated the space on first contact with an end host: it cloned the looked-up node and reinserted the clone into the routing trie under the specific end-host address.

In particular, systems that exchange only a few packets with each of many different hosts were forced to allocate a lot of memory upfront. Memory allocations in the kernel should always be bounded by some limit, otherwise it becomes possible to drive the Linux kernel into memory pressure and cause other kernel subsystems to fail. Enforcing those limits is the job of the IPv6 routing subsystem's garbage collector.

Multiple tunables define how the garbage collector keeps the routing trie under control. In the /proc filesystem one can find settings for, among other things, the maximum number of routing entries and when the garbage collector should start collecting and cleaning up the trie (see the sketch after this list):

  • interval of garbage collector invocations
  • maximum number of routing entries
  • expiration time of routing entries
  • number of routing entries at which the garbage collector is engaged and further allocations have to wait
  • minimum timespan between garbage collector invocations
  • etc.
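
On most kernels these knobs live under /proc/sys/net/ipv6/route/. The short Python sketch below simply reads and prints a few of them; note that the mapping of file names to the items in the list above is my interpretation and the exact set of files can differ between kernel versions.

    import os

    # Typical IPv6 routing garbage-collector tunables; the mapping to the
    # list above is an interpretation and may vary between kernel versions.
    TUNABLES = {
        "gc_interval":        "interval of garbage collector invocations (seconds)",
        "max_size":           "maximum number of routing entries",
        "gc_timeout":         "expiration time of routing entries (seconds)",
        "gc_thresh":          "entry count at which the garbage collector engages",
        "gc_min_interval_ms": "minimum timespan between invocations (milliseconds)",
    }

    BASE = "/proc/sys/net/ipv6/route"

    for name, description in TUNABLES.items():
        path = os.path.join(BASE, name)
        try:
            with open(path) as f:
                value = f.read().strip()
        except OSError:
            value = "<not available on this kernel>"
        print(f"{name:20s} = {value:10s} # {description}")

The same values can of course be read and set with sysctl; the script is only meant to show where the knobs live.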

These tunables put system administrators into a complex situation when tuning the routing code. In the past it was usually enough to raise the maximum number of routing entries and keep the default settings, which invoke the garbage collector frequently enough, but as IPv6 usage grew this was no longer sufficient: densely populated routing tries led to other performance issues. It can be done better now!

The new implementation in RHEL 7.3, backported from the upstream Linux kernel, was done by Martin Lau of Facebook. The TCP metrics store in RHEL 7 was already handled the same way as in the IPv4 implementation: TCP metrics live in a dedicated AVL tree, also known as the inet_peer cache, and routing nodes can lazily reference the information in this tree instead of providing the storage on their own. That left only the storage for the discovered path MTU to be provided. The routing code was adapted to allocate those specific routing entries only on demand, namely when an ICMPv6 Packet-Too-Big notification is received, which should be a rare event. This relieved the routing code from allocating entries prematurely and thus also relaxed the garbage collector: the tunables mentioned above are now only relevant for cleaning up PMTU exceptions and redirects.
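
To get a feel for what is left in the per-destination state after this change, the sketch below shells out to two iproute2 commands: "ip -6 route show cache", which on recent kernels lists only the PMTU exceptions and redirects, and "ip tcp_metrics show", which dumps the separate TCP metrics store. Availability and output format depend on your iproute2 and kernel versions, so treat this as an illustration rather than a guaranteed interface.

    import subprocess

    # PMTU exceptions and redirects now live in a small exception cache; on
    # recent iproute2 versions they can be listed with "ip -6 route show cache".
    print(subprocess.run(["ip", "-6", "route", "show", "cache"],
                         capture_output=True, text=True).stdout)

    # The TCP metrics store is kept separately from the routing trie and can be
    # inspected with "ip tcp_metrics show" (per-destination RTT, cwnd, etc.).
    print(subprocess.run(["ip", "tcp_metrics", "show"],
                         capture_output=True, text=True).stdout)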

This IPv6 routing improvement is now available in RHEL 7.3.

These routing optimizations should not be confused with the removal of the IPv4 routing cache. That cache consisted of a separate lookup data structure (a hash table) that stored the results of common routing lookups. Several reasons, not particularly performance-oriented ones, led to its removal, including the fact that the hash was remotely attackable.

To check whether you are affected by the problem described in this article, regularly check the NoRoutes counters in /proc/net/snmp6 (or use nstat as a frontend to read them). In a correctly configured IPv6 network setup, those counters should not increase constantly. If they do, further analysis, for example with dropwatch, might be necessary.
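
As a starting point, the sketch below polls /proc/net/snmp6 and reports how much the Ip6InNoRoutes and Ip6OutNoRoutes counters grew in each interval; the exact counter names are the ones typically exposed, but they may differ slightly on your kernel. A delta that keeps rising on an otherwise correctly configured network is the hint to dig deeper.

    import time

    # Counter names as they appear in /proc/net/snmp6 on typical kernels;
    # adjust if your kernel exposes slightly different names.
    WATCHED = ("Ip6InNoRoutes", "Ip6OutNoRoutes")

    def read_counters():
        counters = {}
        with open("/proc/net/snmp6") as f:
            for line in f:
                name, value = line.split()
                if name in WATCHED:
                    counters[name] = int(value)
        return counters

    previous = read_counters()
    while True:            # stop with Ctrl-C
        time.sleep(10)
        current = read_counters()
        for name in WATCHED:
            delta = current.get(name, 0) - previous.get(name, 0)
            if delta:
                print(f"{name} increased by {delta} in the last 10s")
        previous = current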