We've all been there. You've spent hours architecting a performance test, convinced you're about to uncover groundbreaking insights. You spin up a big RDS instance, fire up HammerDB, and wait for those "new orders per minute" (NOPM) numbers to skyrocket. But instead, you get a flat line (or worse, a zig-zag). You double the number of virtual users, but the throughput doesn't budge. When I started benchmarking AWS RDS PostgreSQL performance, I expected a straightforward "plug-and-play" experience. Instead, I found that without rigorous optimization, you aren't measuring your database's power—you're measuring the limitations of your testing environment. Here's how I identified and eliminated the hidden bottlenecks that were sabotaging my data.
1. The observer effect: When the benchmarker becomes the bottleneck
In performance benchmarking, the load generator itself can become the limiting factor if it lacks sufficient resources to sustain the intended workload. When this happens, the benchmark no longer reflects the true capacity of the system under test, but instead the limits of the client generating requests.
The problem:
As I increased concurrent users, the performance metrics (NOPM/TPM) stopped scaling and remained mostly flat. At first glance, it appeared that the database had stalled or reached saturation. However, the database CPU utilization remained relatively low, indicating the database was not actually under heavy stress.
Reality check
The real bottleneck was the HammerDB client itself. My driver instance (initially an m7i.4xlarge) was consistently reaching 85%+ CPU utilization. In this state, the client struggled to efficiently manage worker threads, scheduling, and network operations. As a result, it could not generate transaction requests fast enough to fully utilize the database. The apparent throughput plateau was therefore caused by client-side saturation rather than a database limitation.
The fix
To eliminate the load generator as a bottleneck, I upgraded the client instance to an m7i.12xlarge. By significantly over-provisioning the driver environment, I ensured sufficient CPU and system headroom for workload generation. This shifted the bottleneck away from the client and allowed the benchmark to accurately stress the RDS endpoint and capture reliable throughput characteristics.
2. I/O illusion: CPU-bound vs disk-bound
If you are trying to test a CPU's compute power, you must ensure it isn't waiting on the storage layer. This is a common trap in HammerDB due to the use_all_warehouses parameter.
Challenge
Setting use_all_warehouses=true sounds realistic because it touches more data. However, this forces the database into a random I/O pattern. Instead of processing transactions, the CPU enters a Wait state while the SSD controllers fetch disparate data blocks.
The fix
I set use_all_warehouses=false. This narrowed the working set of data, allowing it to be cached entirely within the PostgreSQL buffer pool (RAM).
By eliminating disk latency, I forced the database into a CPU-bound state. Only then could I see the true raw processing power of the underlying architecture without the noise of EBS volume throttling or IOPS limits.
3. Silencing the background noise: The WAL strategy
PostgreSQL is designed for data integrity, which means it does a lot of work behind the scenes to ensure your data survives a crash. In a benchmark, this work manifests as massive performance dips.
The WAL runway
PostgreSQL uses a write-ahead log (WAL) to ensure durability. When the WAL fills up, the database triggers a checkpoint, flushing "dirty" buffers to disk. This process is I/O intensive and causes NOPM graphs to flatline periodically.
The fix
I increased max_wal_size to 200GB. This essentially gave the database a long enough of a "runway" to store every single transaction log for the duration of the test without needing to pause for cleanup.
A word of caution for production
While a massive max_wal_size is a "cheat code" for benchmarking only. Don't copy this to production. Increasing this value increases the time it takes for the database to recover after a crash (recovery time objective). In a benchmark, we don't care about crash recovery. We care about making the CPU the only bottleneck. In production, you have to balance performance with safety.
The pre-flight routine
To ensure every test run started on a level playing field, I used Red Hat Ansible Automation Platform to automate a rigorous cleanup:
- Manual checkpoint: Forces the database to flush everything to disk before the timer starts.
- Vacuum: Cleans up dead tuples (bloat). Without this, the autovacuum daemon might kick in during your peak load, stealing CPU cycles.
4. Scripting for stability
Even with the right hardware, my results were still unpredictable. I eventually realized that the way the benchmark was being cycled was the final piece of the puzzle.
By integrating loadscript and vudestroy into my automation, the results finally became predictable. These commands ensured the script is fully parsed in memory and that zombie connections from previous runs are wiped clean. Only then did the throughput finally climb until it hit the true hardware limit of the RDS CPU.
Benchmarking checklist
If you want data you can actually trust, you must isolate the variables:
- Over-provision the client: Your tester should never be the bottleneck.
- Isolate the bottleneck: If you're testing CPU, stay in the buffer pool (RAM).
- Expand the WAL: Prevent mid-test checkpoints, but remember the trade-off with recovery time in real-world scenarios.
- Automate stability: Use tools like Ansible Automation Platform to run tests multiple times. Aim for a stability percentage (variation) of less than 5%.
By removing hidden bottlenecks, I stopped measuring the noise of my configuration and finally started measuring the true performance of the cloud.