Dude, where’s my PaaS memory? Tuning Java’s footprint in OpenShift (Part 2)
The test web service
The test web service implements a simple file cache storing up to 10 copies of any given named file. Uploading copies beyond the 10th one causes the oldest version to be discarded. The server supports a variety of requests allowing
- a new version of a file to be uploaded
- an existing file version to be downloaded
- listing of the name and version counts of all files in the cache
- deletion of all copies of a named file
- deletion of the whole cache
All requests are submitted as HTTP PUTs and responses are returned as HTML formatted text. File upload is achieved by attaching the file contents as a form field using a multipart message. File download includes embedding the file contents in an HTML formatted response between <pre> and </pre> tags.
The test server can be driven via a web page included in the deployed app but it can also be driven from the command line and this latter option is used during testing.
The test setup
The server was exercised in each of four different configurations comprised of a default configuration and a tuned configuration for both the serial and parallel garbage collectors. n.b. the other OpenJDK GCs (CMS and G1) were not tested because they do not provide support for footprint management.
The 4 configurations were run using the following GC options (n.b. option -XX:+UseParallelGC is the default choice so it does not need to be specified for the pardef and paropt configurations)
pardef: -XX:UseParallelGC serdef: -XX:+UseSerialGC seropt: -XX:+UseSerialGC -XX:MinHeapFreeRatio=20 -XX:MaxHeapFreeRatio=40 paropt: -XX:UseParallelGC -XX:MinHeapFreeRatio=20 -XX:MaxHeapFreeRatio=40 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90
All four configurations were configured with both a heap minimum and a heap maximum using the following command line options
What is being configured?
The parallel and serial garbage collectors both work by scanning through all objects referenced from the running application and marking them as alive. A scan starts with references to objects from static fields and thread stacks and recursively chases references through object fields. Once the scan has completed the live objects are relocated to the bottom end of the heap (this requires updating any references from one object to another), freeing the rest of the space that was previously occupied (dead objects don’t get moved, they simply get dropped). The biggest difference between the two collectors is that the parallel collector uses more than one thread to do the marking which, in theory, should make it run faster.
Both collectors actually split the heap into a nursery and mature region which they collect at different rates. New objects are allocated in the nursery and are are collected by relatively frequent young GCs. Whenever the nursery space gets filled a young GC scans it for live objects and they are compacted down to one end of the nursery space. If an object is still referenced by the application after several young GCs it gets promoted to the mature space. Most objects only live a short time so the mature space fills fairly slowly. Old GCs happen when the mature space gets filled. An old GC scans both the nursery and mature spaces for live objects, compacting them down to one end of the mature space.
At any GC the collector can decide to map more of the available heap pages into the nursery space or mature space. It can also decide to unmap pages and work in less space. The footprint control model makes mapping decisions based upon the values of two parameters, MinHeapFreeRatio and MaxHeapFreeRatio, with default values 40 and 70.
These two heap ratios specify what %ge excess memory should be mapped beyond that occupied by the live set. Let’s assume, for simplicity, that at GC the live objects occupy 100Mb. The defaults specify that the mapped pages should lie between 140Mb and 170Mb. If the currently mapped heap space is less than 140Mb it needs to be extended by mapping more physical pages. If it is more than 170b it needs to be trimmed by unmapping pages. Obvously, these limits are themselves constrained by the heap minimum and maximum supplied on the java command line (-Xms and -Xms settings).
The paropt and seropt configurations both reset the heap ratios to 20 and 40. This makes the GC trim the extra heap space much more tightly, keeping it much closer to the live data set size. So, with 100Mb of live data the heap would be adjusted to lie between 120Mb and 140Mb, i.e. – there would be about half as much excess space. If the application’s live set size and allocation rate remain constant then this means that GCs would have to happen about twice as often with these settings.
The parallel GC has a slightly more complex footprint control model than the serial GC. It tries to balance memory use against time spent in garbage collection. As we have seen, increasing heap size means less time in GC and decreasing heap size means more time. So these two goals are in conflict and need to be balanced against each other.
The time goal is configured by two parameters, GCTimeRatio and AdaptiveSizePolicyWeight, with default values 99 and 10. GCTimeRatio specifies the worst case GC time the collector should target. A value of 99 means no more than 1% of time should be spent in GC. In practice, that means that the parallel GC has to play cautious. So, it regularly trades off space for time even when the actual GC time is a tiny fraction of 1%. When a young GC occurs it tends just to add more heap, ignoring the MaxFreeHeapRatio value. The result is that the heap size just keeps rising, often up to the heap maximum.
When an old GC occurs the MaxFreeHeapRatio is used to trim the heap but then, as more young GCs occur, heap growth resumes. This uncontrolled heap growth makes pardef perform much worse than serdef even though one might expect a parallel collector to have show better performance than a serial collector.
paropt resets GCTimeRatio to 4, i.e. – a worst case goal of 20%. This effectively places most of the weight in the competing footprint management goals on space rather than time reduction. With this setting the time goal no longer dominates and the heap stays between the limits defined by MinFreeHeapRatio and MaxFreeHeapRatio.
The AdaptiveSizePolicyWeight parameter controls how much previous GC times are taken into account when checking the timing goal. The default setting, 10, bases the timing goal check 90% on previous GC times and 10% on the current GC time. Resetting this to 90 means that the timing goal check is mostly based on to the current GC execution time, i.e. it is more responsive to current rather than historical memory use. This greater responsiveness also usefully limits the extent to which space gets traded off against time.
How’s my garbage collecting?
Normally GC performance is measured by switching on GC logging and analysing the resulting file filled with statistics generated each time the young or old collector runs. This is problematic for two reasons. Obtaining full details of GC performance involves switching on more verbose trace. Since data is written at every GC this can mean that time spent in writes to the log can perturb application execution. Secondly, the log format differs for different GCs and different versions of OpenJDK making it harder to automate data analysis. Both these issues were addressed by implementing a lightweight, GC monitoring program based on OpenJDK’s JVMTI native agent API.
JVMTI (Java Virtual Machine Tools Interface) is an extension mechanism provided by OpenJDK which allows monitoring and debugging tools written in either C++ or a mix of C++ and Java to be dynamically loaded and linked into a running JVM. A loaded agent program registers callbacks with the JVM which allow it to track execution and record performance.
The test monitoring agent addresses both the problems associated with GC logging. It tracks every garbage collection, irrespective of the GC in use, identifying heap usage and GC timings, and computing the same statistics including
- total time spent in GC
- total time spent not in GC
- live data set size
- mapped memory size
The data set and mapped physical memory size values are accumulated as two running averages; the first averaged across the whole run; the second over the last 10 garbage collections. Hi and lo water mark sizes are also recorded.
When a GC occurs the agent decides whether or not to print these summary statistics alongside details specific to the GC itself. The agent always prints the first GC. After that it ensures there is at least a 20 second gap before printing details of a subsequent major (old) GC and at least a 60 second gap before before printing details of a subsequent minor (young) GC.
How do I drive this thing?
The server was deployed as a conventional OpenShift Online application accessed via a public internet address. It was exercised from a single client machine using multiple threads to send requests to the server in parallel. In phase one of the test a single thread warmed up the server by requesting upload of 200 different files. Each file contained just under 2K of data. So, an upload request and subsequent response involved transferring roughly 4K of data. By contrast, a download request only requires transfer of about 2K of data
In phase two of the test 200 threads were created to dispatch a mix of download and upload requests, 2000 requests in total per thread. Each thread performed 9 successive downloads and then an upload. Moving on to a new file at each request and repeating this pattern until all 2000 requests were sent.
Every thread started at its own unique offset into the file sequence, wrapping round to the start after dispatching a request for the 200th file. This ensured that by the end of the first 100 requests for each thread the file cache was fully loaded with 10 copies of each file.
Performance results (40Mb working set)
The following table shows the test timings
Configuration PARDEF SERDEF SEROPT PAROPT ------------- ------ ------ ------ ------ Upload time 309 301 291 299 Execute time 4496 4413 4436 4440 Total time 4805 4714 4727 4739 Total gc time 10.259 16.766 21.422 33.937 (0.21%) (0.36%) (0.45%) (0.72%)
All times are provided in seconds. The upload and execute timings are much the same for all four tests. The very small variation (about 1%) in timings is not significant and is probably due to variations in the availability of internet bandwidth over the 4 runs. Whatever the outcome it is clear that the choice and configuration of GC does not really affect server throughput.
The tiny effect on throughput can also be seen in the total GC times. The absolute figures show a steady increase in the time spent in GC which is the flipside of the reduction in the total amount of heap space in use (see below). However, the GC time as a percentage of total execution time is only increased by 0.5%,
The average and peak values for the live data sizes and mapped physical page sizes are shown in the following table:
Configuration PARDEF SERDEF SEROPT PAROPT ------------- ------ ------ ------ ------ Live avge 56751 52645 43079 41225 (23.8%) (21.5%) (17.6%) (15.9%) Live peak 85179 72961 50755 48741 (35.7%) (29.8%) (20.7%) (18.7%) Phys avge 138573 94072 68499 58471 (58.6%) (38.4%) (28.0%) (22.5%) Phys peak 155136 10266 71576 60672 (65.0%) (41.9%) (29.2%) (23.3%)
Figures are quoted in Mb and as a %ge of the available 256Mb heap maximum.
The true live data set of the application once all files have been loaded is somewhere around 40Mb. Yet the average and peak figures start much higher than that and decrease from each configuration to the next. Why are the live data sizes not all 40Mb?
The values used to compute the average or peak live size are sampled whenever a garbage collection occurs. Each sample value represents the amount of memory the collector is using to using to hold onto objects it thinks are live. However, at certain points during the GC cycle some of the retained space contains dead objects.
When a full garbage collection occurs the live figure is exact. Every object counted in the total is definitely referred to, whether directly or indirectly, from a thread stack or static field. At young garbage collections the live count is overestimated because some of the objects sitting in the mature region are actually dead.
Clearly, the configurations which waste the most space are the ones holding on to the most dead objects (the space has to be wasted on something). This makes sense. The successively better tuned configurations keep the in-use heap size nearer to the true live size. They do so by running collections more often. That means dead objects are squeezed out of the heap faster, lowering the proportion of dead to live objects.
The physical memory average and peak figures are also sampled at every garbage collection. Mappings are only adjusted at GC points so these are truly accurate measures of how much physical memory the JVM has grabbed from the OS. The paropt configuration provides an astonishingly high improvement in footprint, dropping the average and peak figures by a factor of just under 3x relative to pardef and by just under 2x relative to serdef.
The paropt configuration is actually about as tight a configuration as is possible. The figures above show the physical page average only 17Mb above the live set average. Inspection of individual GC statistics at old GC points indicates that the mature space is normally only a few Mb larger than the live set. The remaining 15Mb of mapped memory is the (empty) nursery space.
Performance results (85Mb working set)
The test was rerun with the same thread and iteration counts (200 and 2000, respectively) but with 1200 files. This bumped the live data set up to about 85Mb. Another variation also used 1200 files but ensured that around 15% of the files were deleted after every 200 iterations. This meant that the size of the live data set fluctuated between 70Mb and 80Mb. The results obtained were much the same for both tests so only those for the first test are presented.
Timing results are shown in the following table
Configuration PARDEF SERDEF SEROPT PAROPT ------------- ------ ------ ------ ------ Upload time 1785 1752 1734 1783 Execute time 4572 4542 4554 4530 Total time 6358 6294 6287 6313 Total gc time 16.853 13.440 15.700 37.537 (0.26%) (0.21%) (0.25%) (0.59%)
With this larger data set the overall execution time increased, but mostly that was because because the initial load (which is performed by one thread) took a lot longer. The second phase was very slightly slower (a few percent) which might be because of extra costs managing a larger data set. However it is small enough that it might just be down to variation in available internet bandwidth.
The overall time spent in GC increased with the better tuned configurations but GC costs were still extremely low and the overall test times showed no significant variation across configurations.
Memory utilization figures are shown in the following table:
Configuration PARDEF SERDEF SEROPT PAROPT ------------- ------ ------ ------ ------ Live avge 68920 72308 63123 57940 (29.0%) (29.5%) (25.8%) (22.3%) Live peak 136227 141401 104010 87561 (57.2%) (57.7%) (42.5%) (33.7%) Phys avge 144532 127122 99421 74987 (60.8%) (52.0%) (40.6%) (28.8%) Phys peak 200448 200348 146208 99584 (84.1%) (81.9%) (59.7%) (38.3%)
n.b. the average figures cited here underestimate the actual working set of 85Mb because the average includes the much longer initial upload (which lasts for 1/3rd of the total execution time and only has at most 10% of the full set of files in memory) . The true data set size was obtained by looking at the actual heap size for the last full GC in one of the log files.
These figures indicate that there was still a very large improvement in average physical memory footprint, albeit slightly less than for the 40Mb data set. paropt improved on pardef by a factor just over 2x and improved on serdef by a factor of roughly 1.7x
The drop from 3x to 2x relative to pardef is partly explained by the fact that pardef was already wasting most of the available heap space. Essentially, there was not much room for it to get much worse whereas a larger data set meant pardef had to allocate a proportionately larger amount of headroom.
The drop from 2x to 1.7x relative to serdef was only for the average memory use. The peak figure for serdef was very close to 2x that for paropt i.e. a larger working set gave much the same improvement at worst case. The lower results in this test probably also reflect the fact that the averages have been lowered by the relatively longer time spent in the load phase where the different performance is less differentiated. The bad performance of the first 2 configurations only starts to be significant as the application starts using more memory.