When examining Linux firewall performance, there is a second aspect to packet processing—namely, the cost of firewall setup manipulations. In a world of containers, distinct network nodes spawn quickly enough for firewall ruleset adjustment delay to become a significant factor. At the same time, rulesets tend to become huge given the number of containers even a moderately specced server might host.
In the past, considerable effort was put into legacy
iptables to speed up the handling of large rulesets. With the recent push upstream and downstream to establish
iptables-nft as the standard variant, a reassessment of this quality is in order. To see how bad things really are, I created a bunch of benchmarks to run with both variants and compare the results.
Continue reading “Optimizing iptables-nft large ruleset performance in user space”
For the past three years, I’ve been participating in adding just-in-time compilation (JIT) to CRuby. Now, CRuby has the method-based just-in-time compiler (MJIT), which improves performance for non-input/output-bound programs.
The most popular approach to implementing a JIT is to use LLVM or GCC JIT interfaces, like ORC or LibGCCJIT. GCC and LLVM developers spend huge effort to implement the optimizations reliably, effectively, and to work on a lot of targets. Using LLVM or GCC to implement JIT, we can just utilize these optimizations for free. Using the existing compilers was the only way to get JIT for CRuby in the short time before the Ruby 3.0 release, which has the goal of improving CRuby performance by three times.
So, CRuby MJIT utilizes GCC or LLVM, but what is unique about this JIT?
Continue reading “MIR: A lightweight JIT compiler project”
You’ve probably been taught that picking an algorithm that has the best Big-O asymptotic cost will yield the best performance. You might be surprised to find that on current hardware, this isn’t always the case. Much of algorithmic analysis assumes very simple costs where the order of operations doesn’t matter. Memory access times are assumed to be the same. However, the difference between a cache hit (a few processor clock cycles) and a cache miss that requires access to main memory (a couple hundred cycles) is immense.
This article series is the result of the authors (William Cohen and Ben Woodard) discussion that there is a disconnect on the typical ideas of algorithm efficiency taught in computer science and computer engineering versus what is currently encountered in actual computer systems.
Continue reading “Algorithms != Programs and Programs are not “One size fits all””
This blog post is about my work to improve CRuby performance by introducing new virtual machine instructions and a JIT. It is loosely based on my presentation at RubyKaigi 2017 in Hiroshima, Japan.
As many Ruby people know, the author of Ruby, Yukihiro Matsumoto (Matz), set up a very ambitious goal for performance of CRuby version 3. Version 3 should be 3 times faster than version 2.
Koichi Sasada did a great job improving the performance of CRuby version 2 by about 3 times over version 1, by introducing a byte code virtual machine (VM). So I guess it is symbolic to set up the same goal for CRuby version 3.
Continue reading “Towards The Ruby 3×3 Performance Goal”