Why your database benchmarking data is probably wrong (and how I fixed mine)

We've all been there. You've spent hours architecting a performance test, convinced you're about to uncover groundbreaking insights. You spin up a big RDS instance, fire up HammerDB, and wait for those "new orders per minute" (NOPM) numbers to skyrocket. But instead, you get a flat line (or worse, a zig-zag). You double the number of virtual users, but the throughput doesn't budge. When I started benchmarking AWS RDS PostgreSQL performance, I expected a straightforward "plug-and-play" experience. Instead, I found that without rigorous optimization, you aren't measuring your database's power—you're measuring the limitations of your testing environment. Here's how I identified and eliminated the hidden bottlenecks that were sabotaging my data.

1. The observer effect: When the benchmarker becomes the bottleneck

In performance benchmarking, the load generator itself can become the limiting factor if it lacks sufficient resources to sustain the intended workload. When this happens, the benchmark no longer reflects the true capacity of the system under test, but instead the limits of the client generating requests.

The problem:

As I increased concurrent users, the performance metrics (NOPM/TPM) stopped scaling and remained mostly flat. At first glance, it appeared that the database had stalled or reached saturation. However, the database CPU utilization remained relatively low, indicating the database was not actually under heavy stress.

Reality check

The real bottleneck was the HammerDB client itself. My driver instance (initially an m7i.4xlarge) was consistently reaching 85%+ CPU utilization. In this state, the client struggled to efficiently manage worker threads, scheduling, and network operations. As a result, it could not generate transaction requests fast enough to fully utilize the database. The apparent throughput plateau was therefore caused by client-side saturation rather than a database limitation.

The fix

To eliminate the load generator as a bottleneck, I upgraded the client instance to an m7i.12xlarge. By significantly over-provisioning the driver environment, I ensured sufficient CPU and system headroom for workload generation. This shifted the bottleneck away from the client and allowed the benchmark to accurately stress the RDS endpoint and capture reliable throughput characteristics.

2. I/O illusion: CPU-bound vs disk-bound

If you are trying to test a CPU's compute power, you must ensure it isn't waiting on the storage layer. This is a common trap in HammerDB due to the use_all_warehouses parameter.

Challenge

Setting use_all_warehouses=true sounds realistic because it touches more data. However, this forces the database into a random I/O pattern. Instead of processing transactions, the CPU enters a Wait state while the SSD controllers fetch disparate data blocks.

The fix

I set use_all_warehouses=false. This narrowed the working set of data, allowing it to be cached entirely within the PostgreSQL buffer pool (RAM).

By eliminating disk latency, I forced the database into a CPU-bound state. Only then could I see the true raw processing power of the underlying architecture without the noise of EBS volume throttling or IOPS limits.

3. Silencing the background noise: The WAL strategy

PostgreSQL is designed for data integrity, which means it does a lot of work behind the scenes to ensure your data survives a crash. In a benchmark, this work manifests as massive performance dips.

The WAL runway

PostgreSQL uses a write-ahead log (WAL) to ensure durability. When the WAL fills up, the database triggers a checkpoint, flushing "dirty" buffers to disk. This process is I/O intensive and causes NOPM graphs to flatline periodically.

The fix

I increased max_wal_size to 200GB. This essentially gave the database a long enough of a "runway" to store every single transaction log for the duration of the test without needing to pause for cleanup.

A word of caution for production

While a massive max_wal_size is a "cheat code" for benchmarking only. Don't copy this to production. Increasing this value increases the time it takes for the database to recover after a crash (recovery time objective). In a benchmark, we don't care about crash recovery. We care about making the CPU the only bottleneck. In production, you have to balance performance with safety.

The pre-flight routine

To ensure every test run started on a level playing field, I used Red Hat Ansible Automation Platform to automate a rigorous cleanup:

Manual checkpoint: Forces the database to flush everything to disk before the timer starts.
Vacuum: Cleans up dead tuples (bloat). Without this, the autovacuum daemon might kick in during your peak load, stealing CPU cycles.

4. Scripting for stability

Even with the right hardware, my results were still unpredictable. I eventually realized that the way the benchmark was being cycled was the final piece of the puzzle.

By integrating loadscript and vudestroy into my automation, the results finally became predictable. These commands ensured the script is fully parsed in memory and that zombie connections from previous runs are wiped clean. Only then did the throughput finally climb until it hit the true hardware limit of the RDS CPU.

Benchmarking checklist

If you want data you can actually trust, you must isolate the variables:

Over-provision the client: Your tester should never be the bottleneck.
Isolate the bottleneck: If you're testing CPU, stay in the buffer pool (RAM).
Expand the WAL: Prevent mid-test checkpoints, but remember the trade-off with recovery time in real-world scenarios.
Automate stability: Use tools like Ansible Automation Platform to run tests multiple times. Aim for a stability percentage (variation) of less than 5%.

By removing hidden bottlenecks, I stopped measuring the noise of my configuration and finally started measuring the true performance of the cloud.

Why your database benchmarking data is probably wrong (and how I fixed mine)

Eliminating hidden bottlenecks in AWS RDS PostgreSQL benchmarking

1. The observer effect: When the benchmarker becomes the bottleneck

The problem:

Reality check

The fix

2. I/O illusion: CPU-bound vs disk-bound

Challenge

The fix

3. Silencing the background noise: The WAL strategy

The WAL runway

The fix

A word of caution for production

The pre-flight routine

4. Scripting for stability

Benchmarking checklist

Layered sandboxing for AI agents: OpenShift and OpenShell

How obs-mcp boosts AI-native OpenShift observability

Red Hat build of Agent Sandbox: Isolated workload management with Kubernetes

Run Claude Code locally with vLLM and OpenShift AI

Verified boot in automotive with AutoSD

Platforms

Build

Quicklinks

Communicate

RED HAT DEVELOPER

Red Hat legal and privacy links

Red Hat legal and privacy links