Performance on your API

The role of APIs has evolved a lot over the past few years. Not long ago, web APIs were mainly used as simple integration points between internal systems. That is no longer true. Nowadays, APIs are often the core system of a company, one on top of which several client applications – web and mobile – are built.

When APIs were only used for back-office tasks such as extracting reports, their performance was never a key factor. However, APIs have slowly moved towards the critical path between an end-user and the service a company offers. This increase in criticality entails a direct consequence: performance of APIs really matters now.

It doesn’t matter how well built your front-end applications are if the API data sources take several seconds to respond – or worse, if their performance is unreliable. Performance matters a great deal, and even more so in a world of microservices, where what a client application shows is probably being aggregated from multiple APIs behind the scenes.

We could go as far as to say that the best feature your API can have is great performance. And we know that the only true way to improve towards a goal is to carefully pick key metrics and iteratively measure and tweak your system until the stated goals are met. For APIs, the process for measuring and improving performance is load or stress testing.

This article will focus on describing how to successfully run a load test on your API. We’ll start with a simple, unmeasured API, progressively add an access control layer, and make sure that everything is battle-tested and ready to handle production traffic. Let’s go then!


Deciding what to test

A good starting point is always to decide what will be tested. It could be a general test of all your API endpoints, a single one of them, or a subset that you want to troubleshoot and improve.

For the rest of this article, we’re going to use a sample API in all our tests. This is a Node.js API for the Cards Against Humanity game. It has three endpoints:

  • /question – returns a random black card
  • /answer – returns a random white card
  • /pick – returns a random pair of question and answer

Load tests are most useful if the workload that is being tested is as similar as possible to the one that your API will be handling in real life. It’s not very useful to know that your API can keep up with 400 requests per second if you don’t know whether your real traffic will be higher or lower than that or if the load will be uniform across all endpoints.

For that reason, you should start by gathering usage data from your API. You can probably get this data either directly from your API server logs or from any application performance tool you use (such as New Relic). Before running the first tests on your API, you should have answers to:

  • Average throughput in requests per second
  • Peak throughput (What is the most traffic that you get over a certain period?)
  • Throughput distribution by API endpoint (Do you have any endpoint that gets substantially more traffic than any others?)
  • Throughput distribution by users (A few generate most traffic, or is it more evenly distributed?)
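As a sketch of how you might answer those questions, the following script derives average and peak throughput plus a per-endpoint breakdown from raw access-log lines. The log format here is hypothetical – adapt the parsing to whatever your server actually writes:

```javascript
// Derive throughput numbers from access-log lines. Input lines are assumed
// to look like "<epoch-seconds> <method> <path> <status>" (hypothetical).
function analyze(lines) {
  const perSecond = new Map();   // requests bucketed by second
  const perEndpoint = new Map(); // requests bucketed by path
  for (const line of lines) {
    const [ts, , path] = line.split(' ');
    perSecond.set(ts, (perSecond.get(ts) || 0) + 1);
    perEndpoint.set(path, (perEndpoint.get(path) || 0) + 1);
  }
  return {
    averageRps: lines.length / perSecond.size, // mean requests per second
    peakRps: Math.max(...perSecond.values()),  // busiest single second
    byEndpoint: Object.fromEntries(perEndpoint),
  };
}
```

In practice you would stream the log file line by line instead of holding it in memory, but the aggregation logic stays the same.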

One key thing to think about is what the traffic you are going to simulate during the test will look like. The main options here are:

  1. Repetitive load generation
  2. Simulated traffic patterns
  3. Real traffic

As always, it’s best to start with the simplest approach and evolve progressively towards more realistic options. Running your first tests with repetitive load generation against your API endpoints is a great way to validate that your load testing environment is stable. More importantly, it will also let you find the absolute maximum throughput of your API and therefore establish an upper bound on the performance you’ll be able to achieve.

Once you have found that maximum, you can start shaping the traffic you generate to be more realistic. Using real traffic is the ideal scenario, although not always a feasible one. It might be too hard or simply take too long to set up, so we suggest an intermediate step: studying your traffic analytics and doing a simple probabilistic simulation. For instance, if you have an API with 100 endpoints, you might examine last month’s usage and find that 80% of the traffic goes to 20 endpoints, and that the top 3 endpoints take 50% of all traffic. You could then create a list of requests that follows those probabilities and feed it to your load testing tool. That is relatively quick to do, and most of the time it will be close enough to surface the problems you would hit with real traffic.
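A probabilistic simulation like this can be sketched with a simple weighted sampler. The endpoint weights below are illustrative, not measured – derive the real ones from your own analytics:

```javascript
// Illustrative traffic shares; replace with numbers from your analytics.
const endpoints = [
  { path: '/question', weight: 0.5 },
  { path: '/answer',   weight: 0.3 },
  { path: '/pick',     weight: 0.2 },
];

// Pick one endpoint according to the weights above.
function pickEndpoint(r = Math.random()) {
  let cumulative = 0;
  for (const endpoint of endpoints) {
    cumulative += endpoint.weight;
    if (r < cumulative) return endpoint.path;
  }
  return endpoints[endpoints.length - 1].path; // guard against float rounding
}

// A list of 10,000 requests following that probability distribution,
// ready to be fed to a load testing tool.
const requests = Array.from({ length: 10000 }, () => pickEndpoint());
```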

Finally, if you have access to the production logs of the API you are testing, you can replay them to get the most realistic possible test. Most of the load testing tools we’ll talk about in a moment accept a file with a list of requests as input. You can use your production logs with some minimal formatting changes to adapt them to the format each tool expects. Once you have that, you are ready to replay production traffic as many times as you want, at the rate you want.
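As a sketch, converting log lines into the “METHOD URL” target list that tools such as Vegeta accept can be as simple as the following. Both the input log format and the host name are assumptions for illustration:

```javascript
// Hypothetical sandbox host; point this at your test environment.
const HOST = 'http://api-server';

// Input lines assumed as "<epoch-seconds> <method> <path> <status>".
// Output is one "METHOD URL" target per line.
function toTargets(logLines) {
  return logLines
    .map((line) => {
      const [, method, path] = line.split(' ');
      return `${method} ${HOST}${path}`;
    })
    .join('\n');
}
```

Write the result to a file and point your load tool at it to replay the traffic.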

Your load testing setup

Once you are clear about what you want to test, the last part of your preparation work is to set up your testing environment. You’ll want to have a dedicated environment for this – you should never run performance tests against your live production environment (unless you have good reasons for it).

If you already have a pre-production or sandbox environment up and running with your API, then you are all set. Since we’re using a sample API for this article, we’re going to set it up on an AWS server instance.

In our case, we’re working with a simple API that doesn’t need to read from disk or keep a large dataset in memory. Therefore we’ll choose an AWS instance that is tailored for a CPU-bound workload. We’ll go with a Linux c4.large instance.

Note: We made this choice after testing the same application on general purpose instances, which have similar processing resources and more memory, and seeing that the memory went largely unused.

Next, we’ll also spin up an instance that will perform the load injection. This is just a server running a program that will simulate our API users by repeatedly sending requests from several concurrent connections to our API server. The higher the load you need to simulate, the more powerful this server will need to be. Once again, this will also be a CPU intensive workload. Here we chose a c4.xlarge AWS instance with 4 virtual cores and an optimized processor with 16 ECU.

We chose to deploy all instances in the same availability zone in order to minimize the impact of external factors related to the network from our test results.

Choose your tool belt wisely

At this point, we have a sandboxed environment running our API and an additional server prepared to start pumping in load. If this is your first time doing a performance test, you may be wondering about the best way to do that. In this section, we’ll share how we chose a load generation tool and briefly review some well-known options in the market.


JMeter

The leader of the pack in awareness is probably Apache JMeter. This is an open-source Java application whose key feature is a powerful and complete GUI for creating test plans. A test plan is composed of test components that define every piece of the test, such as:

  • The threads used to inject load
  • The parametrization of the HTTP requests used in the test
  • The listeners, widget-like test components used to display results in different ways


These are JMeter’s main strengths:

  • It’s the best tool for functional load testing. You can model complex user flows, using conditions, and create assertions to validate the behavior.
  • It’s relatively easy to simulate non-trivial HTTP requests, such as requests that require logging in first or file uploads.
  • Very extensible. There is a large number of community plugins to modify and extend the built-in behaviors.
  • Open source and free


And these are its main drawbacks:

  • The GUI has a steep learning curve. It’s bloated with options, and there is a large number of concepts to learn before you can run your first test.
  • When testing with high loads, the workflow becomes cumbersome. You first use the GUI to generate the XML test plan, then run the test in non-GUI mode, since the GUI consumes too many of the resources needed to generate load. You also have to make sure that all listeners (the components that gather data and show measurements) are disabled, since they too consume resources. Finally, you import the raw results data back into the GUI to see the results.
  • If your goal is to test a sustained throughput over time (e.g. 1,000 requests per second for 60 seconds), it’s hard to find the right combination of concurrent threads and timers between requests to get a steady number.

JMeter is the tool we chose when we started our tests, but we quickly started searching for alternatives. The reason is that even though JMeter is probably the best tool if your goal is to stress test complex user flows on a web application, it can be overkill when you just need to run a performance test on a few HTTP API endpoints.


Wrk

Wrk is a tool very similar to the traditional Apache Benchmark tool, ab (which was first designed to benchmark the Apache server). Compared with JMeter, both wrk and ab are radically different beasts:

  • Everything is configured and executed through a command line tool
  • Few but powerful settings, only the essential to generate HTTP load
  • High performance

However, wrk has several improvements over the more traditional ab, the most remarkable of which are:

  • It’s multi-threaded, which makes it much easier to generate higher loads, as it can fully take advantage of multi-core processors.
  • It’s easy to extend the default behavior thanks to Lua scripting support

As a downside, the default reporting is limited both in content and format (text only, no plots). We’ve found wrk to be the best tool when your goal is to find the maximum load your API can handle. There just isn’t a quicker tool for that job.


Vegeta

Vegeta is an open-source command line tool, but one that takes a different approach from the previous tools. It focuses on making it easy to achieve and sustain a target rate of requests per second. It’s meant to test how a service behaves at X requests per second, which is extremely useful when you have actual data or an estimate of your peak traffic and you want to validate that your API will be able to keep up with it.
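The core idea – pacing requests to hit an exact rate rather than firing them as fast as possible – can be sketched as a precomputed send schedule (a simplified take on the approach, not Vegeta’s actual implementation):

```javascript
// Precompute the instant (in ms from test start) at which each request
// should be dispatched to sustain an exact target rate.
function schedule(ratePerSecond, durationSeconds) {
  const intervalMs = 1000 / ratePerSecond;
  const total = ratePerSecond * durationSeconds;
  return Array.from({ length: total }, (_, i) => i * intervalMs);
}

// 100 req/s sustained for 2 seconds -> 200 send times, 10 ms apart.
const sendTimesMs = schedule(100, 2);
```

A real injector would walk this schedule with timers and fire one HTTP request per entry, then report how the service held up at that rate.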

SaaS tools

As you have seen up until this point, running even a simple load test requires some preparation to set up an environment. Lately, some products have appeared that offer load testing infrastructure as a service. We’ve tried two of them: Loader.io and Blazemeter.

Note: We only tried the free plan of both those services, therefore any feedback only applies to those plans.


Blazemeter

This product targets the very same problem we mentioned when reviewing JMeter: if you need to generate high loads, you have to create the test plans in the GUI and then load those plans on another server running JMeter in non-GUI mode. Blazemeter lets you upload a JMeter test plan and run it from their cloud infrastructure. Unfortunately, the free plan is set too low for our purposes at 50 concurrent users, but we’ll surely come back to this tool in the future.

Loader.io

Loader.io is a simple and powerful cloud load testing service from SendGrid with just the right features and nice visual reports. The free plan is generous, allowing a throughput of up to 10,000 requests per second, which means you can use it to run a real load test.

We recommend using more than one tool, both to double check the results and also to benefit from the different features and approaches each of them take to benchmarking.

Establishing a baseline

We’re going to start by trying to find the maximum throughput that our API can sustain. We define this metric as the number of requests per second that will bring our API server to the maximum CPU utilization, without the API returning any errors or timing out.

Note that CPU is the limiting resource in our case; for your own API, it’s important to identify early in the process which resource becomes the bottleneck.

It’s essential to have some instrumentation on the API server so we can monitor the resource utilization during the tests. We’re using the excellent PM2 module for this.

Our Node.js application runs a very simple HTTP server. Node.js is single-threaded by design, but in order to leverage the two cores available in the c4.large AWS instance, we’re using the clustering feature that PM2 includes to run two worker processes of the application.

Since our API is completely stateless, it would have been easy to use the core cluster module directly (which PM2 in fact uses internally). However, clustering through PM2 provides some nice shortcut commands to start/stop/reload the application, and it also monitors the processes.

First run

We’ll begin with a first run of the test using Loader.io against our API. These are the results of a 30-second test at 10,000 requests per second, which is the maximum throughput that Loader.io allows on its free plan.

During this test, we observed that the processor of the API server reached 100% capacity only a few times.

This indicates that our API is probably able to handle an even higher throughput. We verify this with a second test by running wrk on our load injector server. The goal is to push our API server to its limit.

wrk -t 4 -c 1000 -d 60 --latency --timeout 3s http://api-server/question

And here are the results of one of the several repetitions we did of this test:
Running 1m test @ http://api-server/question
  4 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    62.23ms   30.85ms    1.35s   99.39%
    Req/Sec     4.07k    357.61     5.27k    94.29%
  Latency Distribution
     50%   60.04ms
     75%   63.85ms
     90%   64.17ms
     99%   75.86ms
  972482 requests in 1.00m, 189.89MB read
Requests/sec:  16206.04
Transfer/sec:      3.16MB

The results confirmed our hunch: the API reached 16,206 requests/sec while maintaining a reasonable latency, with the 99th percentile at only 75.86ms. We’ll take this as our baseline maximum throughput, since this time we saw the API server performing at its maximum capacity.

We just saw an easy way to find the maximum traffic load your API is prepared for, while introducing and discussing some tools we discovered during the process.

Last updated: July 31, 2023