Advanced System Design — Scale TinyURL
Discussing how to scale the barebones TinyURL system with a Load Balancer and Database Horizontal Partitioning, along with related topics such as partition key design strategy, partition vs replica vs backup, and SQL vs NoSQL.

In the last couple of articles, we discussed how to design a barebones working solution for TinyURL, and how introducing a cache lets us serve roughly 90% of the estimated read workload from the cache.
The system diagram below shows what we currently have. The single web server handles all the HTTP requests from the clients, and the single database stores all the data and handles all the read and write requests. Let’s trace the workflow and see how we can scale each component in the diagram, starting from the web server.

Web Server
HTTP requests are stateless, which means each HTTP request contains enough information to be understood and handled by any web server. This allows us to have more than one web server serving the HTTP requests. Once we have multiple web servers available, the users’ HTTP requests can be routed to any web server in the system (there is a concept of sticky sessions, which we will cover shortly). To achieve this, we will need to add another component, the Load Balancer, which takes care of dispatching requests between the Users and the Web Servers.

Introducing Load Balancer
The role of a Load Balancer is to distribute incoming requests across the Web Servers and then return the HTTP responses to the users. There are various approaches a Load Balancer can take to route requests, such as Random, Least Busy, Round Robin, Sticky Session, etc. For TinyURL, there is no particular requirement on how a request should be dispatched, so we can just use Random or Least Busy for simplicity.
On a side note, Sticky Sessions are used when requests from the same user session should all be sent to one web server; this allows that server to keep session data available for all subsequent requests.
Now that we have added more web servers, our web server cluster can technically handle an unbounded amount of HTTP requests. Another benefit is that a single web server failure is no longer a single point of failure. While dispatching itself is quick because it requires minimal computation, we should still run multiple Load Balancers so that a Load Balancer failure does not become a single point of failure either.
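To make the dispatching concrete, here is a toy sketch of a Round Robin dispatcher (the server addresses are made up for illustration; a real Load Balancer would also handle health checks, timeouts, and connection draining):

```python
import itertools

class RoundRobinBalancer:
    """Toy dispatcher: rotate through a fixed pool of web servers."""

    def __init__(self, servers):
        # itertools.cycle repeats the server list forever, one server at a time.
        self._servers = itertools.cycle(servers)

    def pick_server(self):
        return next(self._servers)

balancer = RoundRobinBalancer(["web-1:8080", "web-2:8080", "web-3:8080"])
for _ in range(4):
    print(balancer.pick_server())  # web-1, web-2, web-3, web-1, ...
```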
Database Server
At the very beginning of the design phase, we estimated that one database server should be able to handle the estimated queries per second (QPS) and also store all the data. To recap, here is the list of estimated numbers, with a quick sanity check after the list:
- Active Users: 1 million
- Average Write QPS: 10
- Average Read QPS: 100
- Storage: 40G per year
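As a quick back-of-envelope sanity check, the write QPS and the yearly storage figure are consistent with each other if we assume an average record size of roughly 128 bytes (the record size is my assumption, not part of the original estimates):

```python
# Back-of-envelope sanity check of the recap numbers above.
SECONDS_PER_YEAR = 365 * 24 * 3600   # ~31.5 million seconds
WRITE_QPS = 10
BYTES_PER_RECORD = 128               # assumed: short URL key + long URL

records_per_year = WRITE_QPS * SECONDS_PER_YEAR               # ~315 million rows
storage_per_year_gb = records_per_year * BYTES_PER_RECORD / 1e9
print(records_per_year, round(storage_per_year_gb))           # 315360000, ~40 GB
```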
What if we are now told the market is much better than we originally expected?
Our single database server can no longer store all the data or efficiently handle all the read and write requests.
The solution is to store our data on multiple database servers instead of one single database server. To do that, we need a way to partition the data (partitioning, i.e., breaking the whole dataset into separate chunks) and then store different chunks on different database servers. You may also see partitioning referred to as sharding elsewhere. For consistency, I will use “partitioning” for the rest of this article.
What is partitioning?
Specifically, we will use horizontal partitioning, which means each row of data is stored in its entirety on a single server, while different rows may be stored on different servers. Vertical partitioning can sometimes be used to split up the columns of a table, but our TinyURL table only has two columns, so we won’t need vertical partitioning in this case.
Partition vs Replica vs Backup
It is worth noting that a partition is different from a replica. The purpose of partitioning is to spread the full dataset across multiple database servers so that each server only stores a portion of the data and handles a portion of the read and write requests. The purpose of replicas is to keep multiple copies, normally three, of the full dataset; each replica can share some of the read load, and replicas can also be used to restore another replica when one fails.
A replica is also different from a backup. Replicas are synced online as soon as data gets written, which is why they can share the read load. Backups are performed offline on a set schedule. Because backup files are offline, they can only be used for restore purposes and cannot share any read load.
How is partitioning achieved?
Partitioning is typically achieved by a technique called “Consistent Hashing”. The basic workflow is:
- Identify one column to act as the Partition Key column.
- Calculate the hash value of that column using a hash function. There are many types of hash functions, but they all share the same goal: return a different integer for a different partition key*. The returned integer is in the range [0, Hash Size). In real-world use cases, the Hash Size is often 2⁶⁴.
* Please note that, mathematically, it is possible for a hash function to return the same integer for two different partition keys; this is called a hash “collision”. Because 2⁶⁴ is an extremely large number, the probability is so low that you will practically never run into a collision in real-world use cases.
- The hash value determines which database server the row of data should be stored on (a small sketch follows this list). For simplicity, the diagram below shows how one row of data is mapped to a database server based on its hash value. This type of diagram is called a “Consistent Hash Ring”. Any hash value that falls in the range [0, 2⁶⁴/3) is mapped to database #1, any hash value in [2⁶⁴/3, 2·2⁶⁴/3) is mapped to database #2, and so on.
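A minimal sketch of that mapping, assuming MD5 as the hash function and three equally sized ranges on the ring (both choices are illustrative, not prescribed here):

```python
import hashlib

HASH_SIZE = 2 ** 64   # the hash space used on the ring
NUM_SERVERS = 3       # database #1, #2 and #3

def hash_partition_key(key: str) -> int:
    """Map a partition key to an integer in [0, HASH_SIZE)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")   # first 8 bytes -> 64-bit integer

def database_for(key: str) -> int:
    """Map the hash value onto one of NUM_SERVERS equal ranges of the ring."""
    h = hash_partition_key(key)
    return h * NUM_SERVERS // HASH_SIZE + 1    # returns 1, 2 or 3

print(database_for("abc123"))   # e.g. 2 -> store this row on database #2
```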

In real-world implementations, the Consistent Hash Ring is actually broken into many virtual segments, e.g. 1,000, and each segment belongs to a specific database server. The diagram below shows a simplified ring with just 2 virtual segments per database server for demonstration purposes.
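Here is a simplified sketch of such a ring with virtual segments (again assuming MD5 for hashing; a real system would also rebalance segments when servers are added or removed):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Simplified consistent hash ring with a few virtual segments per server."""

    def __init__(self, servers, segments_per_server=2):
        self._ring = []   # sorted list of (position, server) pairs
        for server in servers:
            for i in range(segments_per_server):
                # Each virtual segment gets its own position on the ring.
                self._ring.append((self._hash(f"{server}#vnode{i}"), server))
        self._ring.sort()

    @staticmethod
    def _hash(value: str) -> int:
        digest = hashlib.md5(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")   # integer in [0, 2**64)

    def server_for(self, partition_key: str) -> str:
        """Walk clockwise from the key's position to the next virtual segment."""
        position = self._hash(partition_key)
        index = bisect.bisect_right(self._ring, (position,)) % len(self._ring)
        return self._ring[index][1]

ring = ConsistentHashRing(["db-1", "db-2", "db-3"], segments_per_server=2)
print(ring.server_for("abc123"))   # partition key -> owning database server
```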

How to determine the Partition Key?
The best practice is to first identify how the data will be read, because the partition key determines which database server a data record is retrieved from. The goal is to include the partition key in the majority of your read requests so that they become single-partition reads. A single-partition read only has to look at one specific partition, so it is much faster and consumes fewer resources than a cross-partition read, which has to scan multiple (or all) partitions and is therefore resource-intensive and costly.
For the TinyURL design, one of the most common read requests is to retrieve the long URL given a specific short URL key, i.e., “given a ShortUrlKey, return the matching LongUrl” (see the sketch below). So we will use ShortUrlKey as our partition key for the Url table.
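As a rough illustration of that single-partition lookup, reusing the hypothetical ring from the sketch above and plain dictionaries as stand-ins for the partitioned database servers:

```python
# Hypothetical in-memory stand-ins for the partitions of the Url table.
partitions = {"db-1": {}, "db-2": {}, "db-3": {}}

def save_url(short_url_key: str, long_url: str) -> None:
    """Write the row to the single partition that owns this ShortUrlKey."""
    partitions[ring.server_for(short_url_key)][short_url_key] = long_url

def get_long_url(short_url_key: str) -> str | None:
    """Single-partition read: only the owning database server is touched."""
    return partitions[ring.server_for(short_url_key)].get(short_url_key)

save_url("abc123", "https://example.com/some/very/long/path")
print(get_long_url("abc123"))
```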
Please note that using ShortUrlKey as the partition key means we won’t be able to do a single-partition read when we look up by a specific Long URL. We will discuss how to cover that case later.
Revisit SQL vs NoSQL
In the last article, we compared SQL vs NoSQL for the barebones TinyURL service. Now that we have to add partitioning into the picture, let’s add a few new criteria to the comparison table below.

As you can see, even though it is possible for a SQL database to support horizontal partitioning, the developers may need to design and implement the consistent hashing workflow themselves and worry about data migration when more database servers are added. NoSQL databases, such as Cosmos DB and Cassandra, handle partitioning automatically, and the developer only needs to choose the partition keys.
If we are going to use NoSQL so we can take advantage of the automatic partitioning feature, I would have two containers (equivalent to tables in SQL):
- A ShortToLong container, with the Short URL Key as the partition key. This container serves read requests that look up a long URL by a short URL key.
- A LongToShort container, with the Long URL as the partition key. This container serves read requests that check whether a long URL already exists and, if so, returns its short URL key.
For each write request, make sure the same record is written into both containers. This does produce some duplicated data, but because NoSQL is extremely fast in write I/O and technically can be scaled out infinitely, we do not need to worry about storage size. In Azure Cosmos DB, you can create a stored procedure to make the two writes transactional. For more information, please see Microsoft documentation.
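A minimal sketch of the dual-write pattern, using plain dictionaries as hypothetical stand-ins for the two containers:

```python
# Hypothetical in-memory stand-ins for the two NoSQL containers.
short_to_long = {}   # partitioned by Short URL Key
long_to_short = {}   # partitioned by Long URL

def create_mapping(short_url_key: str, long_url: str) -> None:
    """Write the same record into both containers so that each lookup
    direction stays a single-partition read; in production the two writes
    should be made transactional, as discussed above."""
    short_to_long[short_url_key] = long_url
    long_to_short[long_url] = short_url_key

def lookup_long_url(short_url_key: str) -> str | None:
    return short_to_long.get(short_url_key)

def lookup_short_key(long_url: str) -> str | None:
    return long_to_short.get(long_url)

create_mapping("abc123", "https://example.com/some/very/long/path")
print(lookup_long_url("abc123"))                                    # the long URL
print(lookup_short_key("https://example.com/some/very/long/path"))  # abc123
```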
Scalable TinyURL System
Finally, let’s see what we have come up with so far.

You will notice that every component in the system is now scalable and can handle a large-scale workload.
- Users (hoping that they will grow organically)
- Load Balancers — Dispatching is fast. Having a set of load balancers allows us to dispatch a large amount of requests without having to worry about a single point of failure caused by any one load balancer.
- Web Servers — For the TinyURL service, because each HTTP request is stateless and we do not need to track any sticky sessions, our web server cluster should be able to process a large amount of requests, and it is very easy to add or remove web servers as needed. If you are using Azure or AWS, scaling out web servers is just a simple configuration change!
- Database Servers — With the help of partitioning, we can now scale out the number of database servers. Each database server holds a portion of the data and handles a portion of the read and write load. It is possible to keep scaling out as long as the partition key has high cardinality, i.e., lots of distinct values.
- Cache Servers — It is possible to run distributed cache servers if the amount of data you want to cache cannot fit into the memory of a single server.
Conclusion
System Design is not easy: it covers a lot of topics and requires the software engineer to have an in-depth understanding of how each component works and of the protocols that connect the components. To overcome this, we started small and designed a barebones working solution first. With that in place, we identified the performance bottlenecks and optimized the major workload (the read load in this case). The last step was to scale each individual component, starting from the users. I hope you have enjoyed the design journey so far.
Many thanks for reading!