
When examining Linux firewall performance, there is a second aspect to packet processing—namely, the cost of firewall setup manipulations. In a world of containers, distinct network nodes spawn quickly enough for firewall ruleset adjustment delay to become a significant factor. At the same time, rulesets tend to become huge given the number of containers even a moderately specced server might host.

In the past, considerable effort was put into legacy iptables to speed up the handling of large rulesets. With the recent push upstream and downstream to establish iptables-nft as the standard variant, a reassessment of this quality is in order. To see how bad things really are, I created a bunch of benchmarks to run with both variants and compare the results.

Benchmarks used for testing

Disclaimer: All but one of the tests that follow were created by me, someone who doesn't deal with large, high-performance setups on a regular basis. Instead, my focus was to test aspects of the code, given that I know the implementation behind the different commands.

Each of these benchmarks runs multiple times, and the mean and variation of the measured runtimes are recorded to build confidence in the results. Running the same benchmark for increasing ruleset sizes adds a second dimension to the results—a scaling factor. The following description of tests uses "N" for the value being scaled; a minimal timing harness illustrating the approach is sketched after the benchmark descriptions:

  1. Measure iptables-restore runtime loading a dump file with:
    1. N rules
    2. N chains
    3. 200 chains containing N rules each
  2. In a ruleset with N rules in the filter table's chain X, measure:
    1. Appending a rule to the filter table's chain X.
    2. Deleting the last rule in the filter table's chain X (by rule spec).
    3. Inserting a rule on top of the filter table's chain X.
    4. Appending a rule to the filter table's empty chain Y.
    5. Deleting that rule from the filter table's chain Y again.
    6. Appending a rule to the nat table's chain X.
    7. Deleting that rule from the nat table's chain X again.
    8. Listing the first rule in the filter table's chain X.
    9. Listing the last rule in the filter table's chain X.
    10. Listing the filter table's empty chain Y.
    11. Listing the nat table's empty chain X.
    12. Flushing the filter table's empty chain Y.
    13. Flushing the nat table's empty chain X.
    14. Flushing the filter table's chain X.
  3. In a ruleset with N chains, measure:
    • Flushing all chains.
  4. In a ruleset with 200 chains containing N rules each, measure:
    • Flushing all chains.
  5. In a ruleset with N ruleset add-ons, each consisting of two user-defined
    chains (containing one and two rules, respectively) plus two rules appended to a common chain, measure:
    • Adding one further ruleset add-on as described above.

This last benchmark is supposed to be a "real-world example." It stems from Kubernetes and allegedly resembles the firewall config changes happening upon container startup. The less boring aspect, in my opinion, is the fact that the --noflush option of iptables-restore is used. This is a common trick to batch several iptables commands, reducing program startup and caching overhead, and it is typically used in performance-critical situations. At the same time, though, optimizing its performance is a bigger challenge: at program startup, not only is the kernel ruleset content unknown, but the user input is as well.
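As a rough illustration of the methodology rather than the actual benchmark scripts, a small harness along the following lines times a command over several runs and reports mean and standard deviation. The command line and the run count here are mere placeholders:

```c
/* bench.c: run a command several times and report mean and standard
 * deviation of its runtime. A minimal sketch of the methodology only;
 * the default command below is a placeholder, not one of the real tests.
 */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

enum { RUNS = 10 };

static double now(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(int argc, char **argv)
{
	const char *cmd = argc > 1 ? argv[1] : "iptables-nft-restore < ruleset.dump";
	double t[RUNS], mean = 0, var = 0;

	for (int i = 0; i < RUNS; i++) {
		double start = now();

		if (system(cmd) != 0)	/* e.g., load a dump with N rules */
			return 1;
		t[i] = now() - start;
		mean += t[i];
	}
	mean /= RUNS;
	for (int i = 0; i < RUNS; i++)
		var += (t[i] - mean) * (t[i] - mean);
	printf("mean %.3fs, stddev %.3fs\n", mean, sqrt(var / RUNS));
	return 0;
}
```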

Baseline results

The initial test results were sobering. Legacy iptables performance was superior to that of iptables-nft in practically all cases, although the regular iptables-restore tests (1.1 through 1.3) were within close range, as you can see in Figure 1.

Figure 1: The initial results from test 1.1.

The rest varied from "performs a bit worse" to "horrible," as you can see in Figures 2 and 3.

Figure 2: The initial results from test 2.1.
Figure 3: The initial results from test 2.6.

My main suspects for why iptables-nft performed so poorly were kernel ruleset caching and the internal conversion from nftables rules in libnftnl data structures to iptables rules in libxtables data structures. The latter is hard to avoid since iptables-nft shares large portions of the parser with legacy iptables, so I focused on improving the caching algorithm.

Maxing out the receive buffer

The kernel ruleset is fetched via netlink in chunks of up to 16KB at a time, a limit determined by the user-space receive buffer. The kernel, though, supports dumping twice that amount at once, so all it took was to double the user-space receive buffer size—a simple fix that improved performance at larger ruleset sizes in every test that involved caching, in ideal situations by as much as 50%.
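A minimal sketch of the receiving side shows where the buffer size comes into play, assuming the dump request itself has already been sent on the socket. This is an illustration only, not the actual iptables-nft code:

```c
/* Sketch: drain a netlink dump using a 32KB receive buffer, i.e., twice the
 * traditional 16KB. Illustrates the buffer-size change only.
 */
#include <linux/netlink.h>
#include <stdbool.h>
#include <sys/socket.h>
#include <unistd.h>

#define RECV_BUF_SIZE (32 * 1024)	/* was effectively 16 * 1024 before */

static int drain_dump(int fd)
{
	char buf[RECV_BUF_SIZE];
	bool done = false;

	while (!done) {
		ssize_t ret = recv(fd, buf, sizeof(buf), 0);
		struct nlmsghdr *nlh;
		int len;

		if (ret <= 0)
			return -1;
		len = ret;

		/* Each recv() may now carry up to 32KB worth of messages. */
		for (nlh = (struct nlmsghdr *)buf; NLMSG_OK(nlh, len);
		     nlh = NLMSG_NEXT(nlh, len)) {
			if (nlh->nlmsg_type == NLMSG_DONE) {
				done = true;
				break;
			}
			if (nlh->nlmsg_type == NLMSG_ERROR)
				return -1;
			/* ...parse a chain or rule message here... */
		}
	}
	return 0;
}

int main(void)
{
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_NETFILTER);

	if (fd < 0)
		return 1;
	/* ...send a chain/rule dump request here before draining... */
	drain_dump(fd);
	close(fd);
	return 0;
}
```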

Figure 4 is a perfect example of how performance increasingly improved with larger rulesets. The reduced incline of the blue curve also indicates slightly better scaling, although it is still much worse overall than legacy iptables.

Figure 4: The results from test 2.1 in v1.8.3-059-gb5cb6e631c828.

Caching chains but not rules

The first real caching optimization was to skip fetching rules when they are not needed. Taking tests 2.1 and 2.3 as simple examples, there is no reason that the runtime of a rule append or insert (at the beginning of the chain) should depend on the number of rules already in that chain.

Implementing this feature was a simple task. The back-end code in iptables-nft has the function nft_chain_list_get, which is supposed to return a linked list of the chains a given table contains. A call to nft_chain_list_get implicitly triggered full cache population (unless the cache was already fully populated). I changed this function to fetch only the table's list of chains and to make callers request a full cache update if needed.
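The idea can be pictured with a hypothetical, simplified two-stage cache. The names and structures below are made up for illustration and do not match the actual iptables-nft sources:

```c
/* Hypothetical illustration of a two-stage cache: fetch the chain list on
 * demand, but defer fetching rules until a command really needs them.
 */
enum cache_level {
	CACHE_NONE,
	CACHE_CHAINS,	/* chain list fetched, rules not yet */
	CACHE_RULES,	/* chains and their rules fetched */
};

struct chain_list;	/* opaque, stands in for the libnftnl chain list */

struct table_cache {
	enum cache_level level;
	struct chain_list *chains;
};

struct chain_list *fetch_chains_from_kernel(const char *table);		/* hypothetical */
void fetch_rules_from_kernel(const char *table, struct chain_list *chains);	/* hypothetical */

/* Commands like a plain --append or an --insert at the top only need this. */
static struct chain_list *get_chain_list(struct table_cache *c, const char *table)
{
	if (c->level < CACHE_CHAINS) {
		c->chains = fetch_chains_from_kernel(table);
		c->level = CACHE_CHAINS;
	}
	return c->chains;
}

/* Callers that really need rule contents ask for a full cache update. */
static void require_rule_cache(struct table_cache *c, const char *table)
{
	get_chain_list(c, table);
	if (c->level < CACHE_RULES) {
		fetch_rules_from_kernel(table, c->chains);
		c->level = CACHE_RULES;
	}
}
```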

To my own surprise, this change significantly improved the results for eight benchmarks. Half of them started to show O(1) complexity (i.e., a constant runtime irrespective of the scaling value). In detail, the results were:

  • Tests 2.1, 2.3, 2.14 and 4.1: Runtime still depends on scaling, but performance is consistently better than legacy iptables. In these cases, the bottleneck is on the kernel side, and user space is already perfect.
  • Tests 2.6 and 2.13: Constant runtime, performance on par with legacy. This shows that legacy iptables can avoid caching in some cases and that its cache is per table.
  • Tests 2.4 and 2.12: Constant runtime, legacy performance depends on scaling value. Those are the real jewels, a no-op in both user and kernel space, delayed only by inflexible caching in legacy iptables.

Selective rule cache per chain

The results of the previous changes were good motivation to push even further in that direction: not just deciding whether a rule cache is needed, but also for which chain(s). This way, a large chain won't slow down rule-cache-requiring operations on other chains, as it did in tests 2.5, 2.7, and 2.11.

Implementing this change felt natural since there are code paths that accept an optional chain name, like the --list command. My approach was to extend the signature of the nft_chain_list_get function to accept this chain name. If non-NULL, the called code will fetch only that specific chain from the kernel instead of a full chain dump per table. Consequently, the returned chain list contains only a single entry. In addition to that, I changed the full cache update routine (nft_build_cache) to accept that optional chain name as well, so rule fetching can also happen per-chain.

A downside of the above is that the code has to be aware of partial caches: if a full cache update followed a partial one, we would end up with duplicate entries. Therefore, routines inserting chains into the cache must skip already existing ones. This logic is not feasible for rules, so the rule cache update is simply skipped for chains that already contain rules.
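The resulting cache logic might look roughly like the following sketch, again with hypothetical names and structures rather than the actual implementation:

```c
/* Hypothetical sketch of a per-chain rule cache with duplicate avoidance.
 * Mirrors the approach described above, not the real code.
 */
#include <stdbool.h>

struct chain {
	struct chain *next;
	const char *name;
	bool rules_fetched;	/* rules for this chain are already cached */
	/* ...list of cached rules... */
};

struct chain *find_chain(struct chain *head, const char *name);	/* hypothetical */
struct chain *fetch_one_chain(const char *table, const char *name);	/* hypothetical */
void fetch_rules_for_chain(const char *table, struct chain *c);		/* hypothetical */

/* Insert a freshly fetched chain unless it is already cached, so that a
 * full cache update following a partial one does not create duplicates.
 */
static struct chain *cache_chain(struct chain **head, struct chain *fresh)
{
	struct chain *known = find_chain(*head, fresh->name);

	if (known)
		return known;	/* skip the duplicate, keep the cached entry */
	fresh->next = *head;
	*head = fresh;
	return fresh;
}

/* Fetch rules for a single chain only, and skip chains whose rules are
 * already cached, since duplicate detection is not feasible for rules.
 */
static void update_rule_cache(const char *table, struct chain **head,
			      const char *name)
{
	struct chain *c = find_chain(*head, name);

	if (!c)
		c = cache_chain(head, fetch_one_chain(table, name));
	if (c->rules_fetched)
		return;
	fetch_rules_for_chain(table, c);
	c->rules_fetched = true;
}
```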

Streamlining compatibility checks

An interesting problem is highlighted by the fact that the previous optimizations failed to improve test 2.10, although it should have been covered. The reason was hidden in a sanity check performed for "list" commands, because iptables-nft needs the ruleset parts it cares about to be in a compatible state: base chains need to have the right name and priority, rules can't contain unknown expressions, and so on.

This ruleset "preview" was a bit clumsy in that it applied to all table contents, not just the relevant ones. After making it more fine-grained, test 2.10 started giving expected results as well, as you can see in Figure 5.

Figure 5: The results of test 2.10 in v1.8.3-064-g48a21d5c7af07.

Optimizing the flush command

Flushing every chain in a table performed poorly, as expected, with a large number of chains because iptables-nft had to fetch the full chain list and create flush requests for each of them. More by coincidence than intention, I noticed that the kernel already supported flushing all of a table's chains in one go, namely by omitting the chain name attribute from the respective netlink message.

If it weren't for verbose mode, this would be a trivial fix. But since legacy iptables in verbose mode prints each flushed chain's name, iptables-nft has to fall back to the old logic if the user passed --verbose.
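At the netlink level, the difference boils down to whether the NFTA_RULE_CHAIN attribute is present in the NFT_MSG_DELRULE request. A simplified sketch using libmnl, with batching, sending, and error handling omitted (and not taken from the actual iptables-nft sources), might look like this:

```c
/* Sketch: build an NFT_MSG_DELRULE request with libmnl. Omitting the
 * NFTA_RULE_CHAIN attribute asks the kernel to flush every chain of the
 * table in a single request.
 */
#include <stddef.h>
#include <time.h>
#include <libmnl/libmnl.h>
#include <linux/netfilter.h>
#include <linux/netfilter/nfnetlink.h>
#include <linux/netfilter/nf_tables.h>

/* buf must be large enough, e.g., MNL_SOCKET_BUFFER_SIZE bytes. */
static struct nlmsghdr *build_flush(char *buf, const char *table,
				    const char *chain /* NULL = all chains */)
{
	struct nlmsghdr *nlh = mnl_nlmsg_put_header(buf);
	struct nfgenmsg *nfg;

	nlh->nlmsg_type = (NFNL_SUBSYS_NFTABLES << 8) | NFT_MSG_DELRULE;
	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
	nlh->nlmsg_seq = time(NULL);

	nfg = mnl_nlmsg_put_extra_header(nlh, sizeof(*nfg));
	nfg->nfgen_family = NFPROTO_IPV4;
	nfg->version = NFNETLINK_V0;
	nfg->res_id = 0;

	mnl_attr_put_strz(nlh, NFTA_RULE_TABLE, table);
	if (chain)	/* verbose mode still flushes chain by chain */
		mnl_attr_put_strz(nlh, NFTA_RULE_CHAIN, chain);
	return nlh;
}
```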

Scaling still looks a bit "funny" in test 3.1, as shown in Figure 6, but that's probably due to the kernel code not scaling as well as user space does.

Figure 6: The results of test 3.1 in v1.8.3-065-gc41b98babd55f.

iptables-nft-restore --noflush

Sorting out cache management in iptables-nft-restore required a bit more effort than in the other tools. The problem is mostly caused by the fact that the input is not known in advance. Assume that after handling a few commands that don't require a rule cache (simple --append commands, for instance), the input contains a command that does require one (for example, --insert with an index). In that situation, previously added rules would have to be merged with the rules fetched from the kernel after the fact. Since a rule may be either inserted (at the beginning of a chain) or appended (at the end), the fetched rules would potentially go somewhere in between the previously added ones.

In order to avoid this problem, and possibly others related to inconsistent caching, I decided to try a different approach and buffer the input up to a certain amount. The relevant changes can be found in commit 09cb517949e69 ("xtables-restore: Improve performance of --noflush operation"). A 64KB buffer is used to store input lines after testing them for cache requirements; a simplified sketch of the loop follows the list below. This process continues until either:

  • Buffer space is exhausted.
  • The input stream signals end-of-file.
  • A command is read that requires a rule cache.
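The buffering loop can be sketched as follows; the helper names are made up for illustration and this is not the actual xtables-restore code:

```c
/* Hypothetical sketch of the --noflush input buffering: collect lines in a
 * 64KB buffer until it is full, EOF is hit, or a line needs the rule cache.
 */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define PREFETCH_BUF_SIZE (64 * 1024)

bool line_needs_rule_cache(const char *line);	/* e.g., --insert with an index */
void fetch_rule_cache(void);			/* fetch kernel rules before replaying */
void run_command(const char *line);		/* replay one buffered or live line */

static void restore_noflush(FILE *in)
{
	static char buf[PREFETCH_BUF_SIZE];
	char line[1024];
	size_t used = 0;
	bool need_rules = false, pending = false;

	/* Stage 1: buffer input while no rule cache is required. */
	while (fgets(line, sizeof(line), in)) {
		size_t len = strlen(line);

		if (line_needs_rule_cache(line)) {
			need_rules = pending = true;
			break;
		}
		if (used + len + 1 > sizeof(buf)) {
			pending = true;	/* buffer exhausted, replay this line later */
			break;
		}
		memcpy(buf + used, line, len + 1);
		used += len + 1;
	}

	/* Stage 2: fetch the cache only if necessary, then replay. */
	if (need_rules)
		fetch_rule_cache();
	for (char *p = buf; p < buf + used; p += strlen(p) + 1)
		run_command(p);
	if (pending)
		run_command(line);
	/* ...the rest of the input is then processed line by line as before... */
}
```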

With input buffering in place, test 5.1 started to show perfect results, as you can see in Figure 7.

Figure 7: The results of test 5.1 in v1.8.3-094-g09cb517949e69.

Loose ends

Some test cases remain unoptimized, which means that iptables-nft's performance is worse than that of legacy in those situations. Here's why: Tests 2.2, 2.8, and 2.9 are slow because nftables identifies rules by handle, and a translation from either rulespec or index is required in any case. Since nftables doesn't support fetching a specific rule (for example, by index), user space has no leeway for optimization here. A partial fix would be to implement rule identification by index in the kernel, but in the past this approach was frowned upon due to its inherent concurrency problems.
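In practice, this translation means walking the chain's cached rules and comparing each one against the given rulespec until a match yields the handle to delete, a linear scan sketched below with hypothetical helpers (not the actual code):

```c
/* Hypothetical illustration: deleting by rulespec or index requires walking
 * the chain's cached rules to find the nftables handle to delete.
 */
#include <stdint.h>

struct cached_rule {
	struct cached_rule *next;
	uint64_t handle;	/* kernel-assigned rule handle */
	/* ...parsed matches and target... */
};

struct rulespec;	/* parsed command-line rule specification */

int rule_matches_spec(const struct cached_rule *r, const struct rulespec *spec);	/* hypothetical */

/* Returns the handle of the first matching rule, or 0 if none matches.
 * Runtime is linear in the chain's rule count, which is why tests 2.2,
 * 2.8, and 2.9 still scale with N.
 */
static uint64_t find_handle(const struct cached_rule *head,
			    const struct rulespec *spec)
{
	for (const struct cached_rule *r = head; r; r = r->next) {
		if (rule_matches_spec(r, spec))
			return r->handle;
	}
	return 0;
}
```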

All of the (flushing) restore tests (1.1, 1.2, and 1.3) remain slow, but that sounds worse than it actually is, looking at the initial test results of 1.3 as shown in Figure 8.

Figure 8: The initial results from test 1.3.

These results show that iptables-nft is slower by a factor ranging from about 1.2 to 1.7. In other words, restoring a dump containing a little over 200k lines takes 1.7s instead of 1.0s. Given the very similar scaling, I suspect the culprit is slightly less optimal code, or the fact that data has to be packed into netlink messages instead of being copied to the kernel in one big blob via setsockopt().

Summary

The good news is that iptables-nft now outperforms legacy iptables by a large margin in most cases and doesn't lose much ground in the others. This testing also shows how well-prepared the nftables kernel code is for high loads, despite the somewhat inelegant nft_compat module used to access the kernel's xtables extensions. So, while migrating to nftables and making use of the extra goodies is still the best path to maximum performance, swapping iptables variants behind the backs of legacy applications might be an effective way to improve performance.

Of course, all of these results are to be taken with a grain of salt. On one hand, I certainly forgot specific use cases—all of my tests combined probably don't even reach full code coverage. On the other hand, a sophisticated ruleset leveraging what legacy iptables has to offer (including ipset) will likely prevent user space tool performance from becoming the problem in the first place.
