Rate Limiting / API Throttling System Design

Build Solutions
4 min read · Sep 24, 2022


Breaking Down the Concepts of Rate Limiting or API throttling in Backend Applications

[Image: Speed limit sign, illustrating a requests limit]

Rate limiting, or API throttling, simply means controlling the number of requests that reach your server, limiting how often it has to perform computational work on a client's behalf. This strategy limits traffic to your network. It applies both to average users (to prevent server abuse) and to bots, as it is a way to ease and measure the network traffic on the server.

For example, when you try to log in to a website but keep entering the wrong password within a short timeframe, you are asked to wait for a period of time before you are granted access again. What this means is that your ability to interact with the endpoints behind the wall (the rate limiter program) has been limited. Rate limiting is also helpful against malicious attacks such as DDoS, brute force, and web bots.

How Does Rate Limiting or API Throttling Work?

A rate limiting or API throttling program is an independent program integrated with the server to help mitigate malicious traffic or any form of network abuse that harms the server's productivity. It is independent in the sense that, if it were detached from the server, the server would still handle its tasks. This type of program is integrated with the server as a wall, or as middleware in MVC architectures. It is positioned to read every request that comes in, either to the server as a whole or to specific endpoints; a program that limits login attempts, for example, will not limit password reset attempts.

How does this work for APIs? Some APIs are open to public consumption, whether free or paid, and for this external usage a token is attached to identify every request to the API, so the requests can be measured by the program. What happens when a request has no token attached? The answer is simple: that request never gets to the server, because a core restriction of the rate limiter is to prevent every request that comes without a token from reaching the server.
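The wall-in-front-of-the-server idea can be sketched as a small middleware function. This is a minimal illustration, not any particular framework's API: the request shape (a dict with a `headers` key), the `X-API-Token` header name, and the `limiter.allow()` interface are all assumptions made for the example.

```python
# Hypothetical middleware sketch: the rate limiter sits in front of the
# application, rejects tokenless requests outright, and only lets requests
# through when the limiter still permits that token.

def rate_limit_middleware(app, limiter):
    def wrapped(request):
        token = request.get("headers", {}).get("X-API-Token")
        if token is None:
            # Core restriction: requests without a token never reach the app.
            return {"status": 401, "body": "missing API token"}
        if not limiter.allow(token):
            return {"status": 429, "body": "rate limit exceeded"}
        return app(request)  # the server only does real work past this point
    return wrapped

# Usage with a trivial app and a stub limiter that allows everything:
class AllowAll:
    def allow(self, token):
        return True

app = lambda request: {"status": 200, "body": "ok"}
guarded = rate_limit_middleware(app, AllowAll())
print(guarded({"headers": {"X-API-Token": "abc"}})["status"])  # 200
print(guarded({"headers": {}})["status"])                      # 401
```

Because the wall is a wrapper around the application, detaching it leaves the app working unchanged, which is exactly the independence described above.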

Generally, rate limiting is used to protect open endpoints, as those are the major risk points: login, password reset, registration, and so on. There are fewer reasons to abuse closed endpoints, but it is necessary to protect them as well, because DoS or DDoS attacks generally aim to weigh your server down and cause downtime.

Rate Limiting System Designs And How they Work

Fixed Window: The fixed window design sets a specified number of requests for a particular timeframe or window (a second, minute, or hour) for every IP address, username, or ID, depending on the method of identification. When the permitted number of requests is exhausted, all further incoming requests within that time frame are not handled by the server. For example, if a server can only handle 25 requests per minute, every request that comes in increments the count of processed requests for that minute, and after the 25th request is processed, every other request within that minute is discarded. There is a downside, though: since the program is expected to process 25 requests per minute no matter when within that minute they arrive, it tries to handle all the permitted requests that came in during that minute. So if 10 requests come in between 09:30:00 and 09:30:55 but the remaining 15 arrive at 09:30:58, the server is overloaded, as it has 2 seconds to process them before the next set of requests starts coming in at 09:31:00.
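The fixed window logic above can be sketched in a few lines. The limit of 25 requests and the 60-second window are just the example numbers from the text; the `client_id` key could equally be an IP address or username.

```python
# Minimal fixed-window rate limiter sketch (illustrative, not production code).
import time

class FixedWindowLimiter:
    def __init__(self, limit=25, window_seconds=60):
        self.limit = limit
        self.window = window_seconds
        self.counts = {}  # client_id -> (window_start_time, request_count)

    def allow(self, client_id):
        now = time.time()
        start, count = self.counts.get(client_id, (now, 0))
        if now - start >= self.window:
            # A new window has begun; reset the counter.
            start, count = now, 0
        if count >= self.limit:
            return False  # limit exhausted for this window; discard the request
        self.counts[client_id] = (start, count + 1)
        return True

# With a limit of 3, the 4th request in the same window is rejected:
limiter = FixedWindowLimiter(limit=3, window_seconds=60)
results = [limiter.allow("alice") for _ in range(4)]
print(results)  # [True, True, True, False]
```

Note that nothing here smooths out bursts: all 3 permitted requests may arrive in the final second of the window, which is exactly the downside described above.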

Token Bucket: This system is a bit more complex and is used by most of the big players in the software industry. The logic of this design is to allocate a specified number of tokens per user per minute (or another time unit; I will use a minute for this explanation). Whenever the user makes a request to the server, the program fetches the number of tokens allocated to that user. If the user has tokens left for that minute, it performs the requested task and decrements the user's token count by one. If the user exhausts their tokens for that minute, all subsequent requests within that minute are not served, once the program confirms that there are no tokens left for that user.
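A common way to implement this is to refill tokens continuously rather than all at once per minute; the sketch below takes that approach. The capacity and refill rate are made-up example values, and the zero refill rate in the demo just makes the exhaustion behavior easy to see.

```python
# Minimal token bucket sketch (illustrative values, not any specific library).
import time

class TokenBucket:
    def __init__(self, capacity=10, refill_rate=1.0):
        self.capacity = capacity          # maximum tokens the bucket can hold
        self.refill_rate = refill_rate    # tokens added back per second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Top the bucket up based on time elapsed since the last check,
        # never exceeding capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1  # spend one token on this request
            return True
        return False          # bucket empty: reject the request

# Demo with no refill, so the bucket simply drains:
bucket = TokenBucket(capacity=2, refill_rate=0.0)
outcomes = [bucket.allow() for _ in range(3)]
print(outcomes)  # [True, True, False]
```

With a nonzero refill rate, a user who pauses briefly earns tokens back, which is what lets this design tolerate short bursts better than a fixed window.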

Leaky Bucket: This type of system processes one request at a time and keeps a specific number of requests staged in a queue. This means that if it can only store 5 requests while processing 1, then any request that arrives while 5 requests are already stored overflows from the bucket (is discarded), and requests that arrive when the bucket isn't full are stored for processing. This system is efficient because at no point is the server overworked: it only processes requests at the rate it chooses. The downside is that overflowing requests are lost.
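The leaky bucket is essentially a bounded FIFO queue: requests enter at the top (or overflow), and the server drains them at its own fixed pace. A minimal sketch, with the capacity of 2 chosen purely for the demo:

```python
# Minimal leaky bucket sketch: a bounded queue drained at the server's own rate.
from collections import deque

class LeakyBucket:
    def __init__(self, capacity=5):
        self.capacity = capacity
        self.queue = deque()

    def submit(self, request):
        # If the bucket is full, the request "overflows" and is discarded.
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(request)
        return True

    def leak(self):
        # Called at a fixed rate by the server to process one queued request.
        return self.queue.popleft() if self.queue else None

bucket = LeakyBucket(capacity=2)
accepted = [bucket.submit(i) for i in range(3)]
print(accepted)       # [True, True, False] -- the third request overflows
print(bucket.leak())  # 0 -- requests drain in first-in, first-out order
```

Because `leak()` is driven by the server's own schedule, the outflow is perfectly smooth regardless of how bursty the inflow is, at the cost of dropping whatever overflows.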

These are some of the system designs for rate limiting or API throttling. I hope this was educative enough to help you understand the rate limiting concept better.

Thank you for reading.


Build Solutions

I am a software engineer and digital marketer. I write about things in my sphere to help people who are yet to have the experiences I've had on my journey.