URL shortener
1. why? (brief)
optimize links across devices, track individual links to analyze audience and campaign performance, and hide the original (often affiliate) URLs.
2. requirements (given + clarified)
Functional Requirements:
- Given a URL, our service should generate a shorter and unique alias of it. This is called a short link. This link should be short enough to be easily copied and pasted into applications.
- When users access a short link, our service should redirect them to the original link.
- Users should optionally be able to pick a custom short link for their URL.
- Links will expire after a standard default timespan. Users should be able to specify the expiration time.
Non-Functional Requirements:
- The system should be highly available. This is required because, if our service is down, all the URL redirections will start failing.
- URL redirection should happen in real-time with minimal latency.
- Shortened links should not be guessable (not predictable).
Extended Requirements:
- Analytics; e.g., how many times a redirection happened?
- Our service should also be accessible through REST APIs by other services.
3. capacity estimation and constraints
- read-heavy
- lots of redirection requests
- assumed ratio of 100:1
traffic estimates (Queries Per Second - QPS)
URL shortening = write event
URL redirection = read event
ballpark number of URL shortenings: 500M
read/write ratio: 100:1
100 * 500M = 50B redirections per month
QPS
500 million / (30 days * 24 hours * 3600 seconds) = 500 million / ~2.5 million seconds = ~200 URL shortenings/s (write)
with read:write ratio of 100:1, URL redirections per second
200 URLs/s * 100 = 20k redirections per second (read)
OR
50B redirections per month / 2.5 million seconds = 20K redirections per second
READ = 20K per second, WRITE = 200 per second
total QPS is dominated by reads (~20K/s), because this system is read-heavy
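the arithmetic above can be sketched as a quick back-of-envelope script (all inputs are the assumed estimates from these notes, not measurements):

```python
# Back-of-envelope traffic estimate, using the assumed numbers from these
# notes: 500M new URLs per month and a 100:1 read:write ratio.
SECONDS_PER_MONTH = 30 * 24 * 3600  # ~2.6 million seconds

new_urls_per_month = 500_000_000
read_write_ratio = 100

write_qps = new_urls_per_month / SECONDS_PER_MONTH  # ~193/s, round to ~200
read_qps = write_qps * read_write_ratio             # ~19.3K/s, round to ~20K
```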
storage estimates
unit stored = URL shortening request (write)
assumed length of storage = 5 years
ballpark stored object size = 500 bytes
how many objects?
500 million per month * 5 years * 12 months = 30 billion objects
how big?
30 billion * 500 bytes = 15 terabytes (TB)
bandwidth estimates
write requests expect 200 URLs per second
200 * 500 bytes = 100 KB/s
read requests expect ~20K URL redirections per second
20K * 500 bytes = ~10MB/s
READ = ~10MB/s
WRITE = 100KB/s
memory estimates - caching hot URLs (read)
follow the 80/20 rule
20% of the hot URLs are frequently accessed, generating 80% of traffic
then we would like to cache these 20% hot URLs.
requests per second = 20K
requests per day = 20K * 3600 seconds * 24 hours = ~1.7 billion requests per day
to cache 20% of these requests
1.7 billion * 500 bytes = 850 GB
1.7 billion * 500 bytes * 0.2 = 170 GB
cache memory required = 170GB
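the storage, bandwidth and cache numbers can be checked the same way (again, all inputs are the assumed estimates above):

```python
# Storage, bandwidth and cache estimates, assuming 500-byte objects,
# 5-year retention, and the rounded QPS figures estimated above.
OBJ_BYTES = 500
urls_per_month = 500_000_000

total_objects = urls_per_month * 12 * 5    # 30 billion objects over 5 years
storage_bytes = total_objects * OBJ_BYTES  # 15 TB

write_qps, read_qps = 200, 20_000
incoming_bw = write_qps * OBJ_BYTES        # 100 KB/s
outgoing_bw = read_qps * OBJ_BYTES         # 10 MB/s

requests_per_day = read_qps * 24 * 3600           # ~1.7 billion
cache_bytes = requests_per_day * OBJ_BYTES * 0.2  # ~170 GB (80/20 rule)
```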
high level estimates
| Type | Estimate |
|---|---|
| New URLs (WRITE) | 200/s |
| URL redirections (READ) | 20K/s |
| Incoming data | 100KB/s |
| Outgoing data | 10MB/s |
| Storage for 5 years | 15TB |
| Memory for cache | 170GB |
4. System APIs
in order to explicitly state what is expected from the system.
APIs for creating and deleting URLs (no dedicated read API is needed: reads happen through the redirect itself)
Create URL
createURL(api_dev_key, original_url, custom_alias=None, user_name=None, expiry_date=None)
| Parameters | type | description |
|---|---|---|
| api_dev_key | string | API developer key. e.g. to throttle users based on their quota |
| original_url | string | the original URL |
| custom_alias | string | optional custom name for URL |
| user_name | string | optional user name, e.g. for encoding or url path |
| expiry_date | string | optional expiration date for shortened URL |
| Returns | event | type | description |
|---|---|---|---|
| shortened URL | success | string | the URL to be shared by users (end result) |
| error code | error | Error object | error code and message |
Delete URL
deleteURL(api_dev_key, url_key)
| Parameters | type | description |
|---|---|---|
| api_dev_key | string | Api developer key |
| url_key | string | the shortened URL to be deleted |
| Returns | event | type | description |
|---|---|---|---|
| "Url Removed" | success | string | message indicating success |
| error code | error | Error object | error code and message |
how do we detect and prevent abuse?
currently there is no limit to how many times a URL can be consumed; a malicious user could hammer a URL (or create URLs en masse) to exhaust our redirection capacity. to prevent abuse, we can throttle users via their api_dev_key, capping the number of URL creations and redirections per time window (limits can differ per key, e.g. with tiered pricing).
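a minimal sketch of such per-key throttling, assuming a fixed-window counter (the class name, limit and window size are illustrative, not part of the API above):

```python
import time
from collections import defaultdict
from typing import Optional

class FixedWindowRateLimiter:
    """Per-key fixed-window throttle. The key would be the api_dev_key;
    the limit and window values are illustrative, not part of the spec."""

    def __init__(self, max_requests: int, window_seconds: int):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.counters = defaultdict(int)  # (key, window index) -> request count

    def allow(self, api_dev_key: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        window = int(now) // self.window_seconds
        bucket = (api_dev_key, window)
        if self.counters[bucket] >= self.max_requests:
            return False  # quota exhausted for this window: reject
        self.counters[bucket] += 1
        return True
```

a real deployment would keep these counters in a shared store (e.g. Redis) so every app server sees the same quota.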
5. Database Design
defining DB schemas in the early stages helps us understand the data flow, and will later guide data partitioning.
the nature of data we will store:
- we need to store billions of records
- each object we store is small (less than 1K)
- there are no relationships between records - other than storing which user created a URL
- our service is read-heavy
two tables
URL table
| key | type |
|---|---|
| Primary Key - Hash | varchar(16) |
| Original URL | varchar(512) |
| Creation Date | datetime |
| Expiration Date | datetime |
| UserID | int |
User table
| key | type |
|---|---|
| Name | varchar(20) |
| Email | varchar(32) |
| Creation Date | datetime |
| Last Login | datetime |
NOSQL, because:
- we anticipate storing billions of rows
- we don't need to use relationships between objects
- easier to scale
6. Basic System Design and Algorithm
explore two options
a. Encoding actual URL
- compute unique hash (md5, sha256)
- then encode hash to display
a reasonable question: what should the length of the short key be? 6, 8 or 10 characters?
Using base64 encoding, 6 letters would result in 64^6 = ~68.7 billion possible strings
Using base64 encoding, 8 letters would result in 64^8 = ~281 trillion possible strings
assuming we're sticking to 5 year expiry, we only need to store 30 billion strings, so 68.7 billion would suffice for our system
if we use the md5 algorithm as our hash function, it will produce a 128 bit hash value.
1 base64 digit represents 6 bits of data.
encoding it to base64 turns it into ceil(128 / 6) = 22 characters (24 with '=' padding)
sidenote - use base64url encoding (which is url-safe as it replaces characters '+' and '/')
since we only have space for 6 characters per short key, how will we choose our key?
taking just the first 6 letters could result in duplication, and to resolve that, we can choose some other characters out of the encoding string or swap some characters.
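a sketch of this encoding approach, assuming md5 + base64url and truncation to 6 characters (the function name is mine; the collision issues discussed next still apply):

```python
import base64
import hashlib

def short_key(original_url: str, length: int = 6) -> str:
    """Hash the URL with md5 (128 bits), base64url-encode the digest,
    and keep the first `length` characters. Name is illustrative."""
    digest = hashlib.md5(original_url.encode("utf-8")).digest()  # 16 bytes
    encoded = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    # the full digest encodes to 22 characters; we truncate to the key length
    return encoded[:length]
```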
issues with this solution
- if multiple users enter the same URL, they can get the same shortened URL, (clarify whether or not this is acceptable. this means we can't tie this url to a user, therefore cannot cap limits)
- what if parts of the URL are URL encoded? is that really a problem? read more here
workaround for issues
- append a sequence number counting up, then generate a hash
- append user id to input url
b. generating keys offline
we can have a standalone Key Generation Service (KGS) that generates 6 letter strings beforehand and stores them in a database.
this acts as an index for the URLs, linked to real URLs.
this service will ensure that there is no collision in the DB (use SQL DB for this)
this service has to support 30 billion keys (for the 30 billion URLs that we will potentially store in the database)
can concurrency cause problems? *(ASYNC)
- if there are multiple servers reading keys concurrently, we might get a scenario where two or more servers try to read the same key from the database.
solution ^
- use two tables to store keys
- one for fresh keys
- one for used keys
- as soon as KGS gives keys to one of the servers, it can move them to the used keys table.
- for simplicity, as soon as KGS loads some keys into memory (RAM), it can move them to the used-keys table. this ensures each server gets unique keys. if the service dies before assigning all the keys in memory, we lose those keys - which is acceptable, given the huge number of unique keys we have.
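a toy in-memory sketch of the hand-out logic described above (a real KGS would persist both sets as the two database tables; all names are illustrative):

```python
import secrets
import string

BASE64URL_ALPHABET = string.ascii_letters + string.digits + "-_"

class KeyGenerationService:
    """Toy in-memory KGS: pre-generates unique 6-character keys and moves
    handed-out keys to a 'used' set, so no two app servers get the same key."""

    def __init__(self, pregenerate: int = 1000):
        self.fresh = set()
        self.used = set()
        while len(self.fresh) < pregenerate:  # the set dedupes any collisions
            key = "".join(secrets.choice(BASE64URL_ALPHABET) for _ in range(6))
            self.fresh.add(key)

    def get_keys(self, batch_size: int) -> list:
        batch = [self.fresh.pop() for _ in range(batch_size)]
        self.used.update(batch)  # marked used the moment they are handed out
        return batch
```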
what would be the key-DB size?
- with base64 encoding, we can generate 68.7 billion 6-letter keys. assuming we need 1 byte to store one character, we can store these keys in:
6 bytes (characters per key) * 68.7B (unique keys) = 412 GB
isn't KGS a single point of failure? (SPOF)
- yes
- to solve this, we can have a standby replica of KGS
- whenever the primary server dies, the standby server can take over to generate and provide keys.
we can look up the key in our DB to get the full URL. if it's present in the DB, issue an HTTP 302 redirect with the stored URL in the 'Location' header. if the key is not present, return HTTP 404 Not Found, or redirect the user to the homepage.
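the redirect path above, sketched with a dict standing in for the real datastore:

```python
from datetime import datetime, timezone

# `url_store` stands in for the datastore: key -> (original URL, expiry)
url_store = {}

def redirect(key: str):
    """Return (status, location) for a short-link hit, per the flow above."""
    record = url_store.get(key)
    if record is None:
        return 404, None  # unknown key: Not Found (or redirect to homepage)
    original_url, expiry = record
    if expiry is not None and expiry < datetime.now(timezone.utc):
        return 404, None  # expired links behave like missing ones
    return 302, original_url  # the Location header carries original_url
```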
should we impose size limits on custom aliases?
our service supports custom aliases. users can pick any 'key' they like, but providing a custom alias is not mandatory. (if the user picks a key that clashes, the DB should return 'key has already been taken'.) a size limit is reasonable, e.g. 16 characters, matching the varchar(16) key column in our schema.
key clashes
- to prevent key clashes, we can also use namespacing, by adding the user's username in the url route, eg
tinyurl.com/username/urlbase64key
7. Data Partitioning and Replication
- how to divide and store our billions of URLs
a. Range Based Partitioning (horizontal partitioning: split rows across partitions by key range)
store URLs in separate partitions based on the first letter of the hash key
WIP https://www.educative.io/courses/grokking-the-system-design-interview/m2ygV4E81AR
b. Hash Based Partitioning
take a unique value from each record (e.g. the short-URL key) and run it through a hash function that maps it to a number in a predetermined range (e.g. 1-10).
tangliqun_5673 --> hash function --> 3
map that number to a database server. with this, we can scale to 10 databases. however, this is quite inflexible: if we add or remove servers, most keys map to different numbers and the data must be reshuffled. To solve this, we use consistent hashing. (link to consistent hashing here)
Some key properties of the hash function used in Hash based partitioning:
- our hash function distributes URLs across the partitions
- a good hash function produces a roughly uniform (even) distribution
- the hash function MUST be a pure function: it must always return the same output for the same input (we need this to find the partition again the next time we access the data)
- the hash function maps the input to a number in a range we define (e.g. [1 to 256])
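a minimal sketch of hash-based partition selection, assuming md5 and a fixed partition count (this modulo step is exactly what consistent hashing replaces):

```python
import hashlib

def partition_for(key: str, num_partitions: int = 256) -> int:
    """Map a short-URL key to a partition index in [0, num_partitions).
    md5 is pure and roughly uniform; the modulo is the naive step that
    breaks when servers are added or removed."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```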
8. Caching
- an LRU (least recently used) cache eviction policy fits
- the hotter the URL, the more we should cache it
- if a URL is created but rarely used, it gets evicted first, so when cache memory runs out we only slow down the few users accessing rare URLs
- we can have caches that get updated ONLY when there is a cache miss
- we can also have multiple instances of caches to distribute load
- we can even have multiple caches near different CDNs, and each cache stores different lists of urls, in order to be able to serve the hottest urls based on location. (e.g. lowyat.net in malaysia is used a lot more in malaysia and singapore, while taskrabbit is used more in USA and other countries that have the service.)
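a minimal LRU cache along these lines, using Python's OrderedDict (the capacity here is a toy number; in production it would be sized toward the ~170 GB estimate above):

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache for hot URLs, backed by OrderedDict."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()  # key -> original URL, oldest first

    def get(self, key: str):
        if key not in self.data:
            return None  # cache miss: caller fetches from DB, then put()s
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key: str, url: str) -> None:
        self.data[key] = url
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used
```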
9. Load Balancing
- round robin load balancing is a good default: easy to implement, with negligible overhead
- add load balancing layers at 3 places:
  - between clients and application servers (browser <-> app)
  - between application servers and database servers (app <-> db)
  - between application servers and cache servers (app <-> cache)
purging or DB cleanup
- what happens to expired links?
two ways:
- actively search for and delete expired links
- puts a lot of pressure on the database
- lazy cleanup
- when a user tries to access an expired link, we delete the link and return an error to the user
- have a separate cleanup service that slowly (periodically) removes expired links from storage and cache. run this service only when the user traffic is expected to be low.
- default expiration time (2 years) to ensure this flow will run and there will never be dead links that don't get deleted
- after removing expired link, we can add key back into the Key generation service DB to be reused
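the lazy/periodic cleanup plus key-recycling step can be sketched as follows (`url_store` and `fresh_keys` are stand-ins for the real DB and KGS tables; names are illustrative):

```python
from datetime import datetime, timezone

def cleanup_expired(url_store: dict, fresh_keys: set, now=None) -> int:
    """Drop expired records and return their keys to the KGS pool for reuse.
    `url_store` maps key -> (original URL, expiry or None); `fresh_keys`
    stands in for the KGS fresh-keys table."""
    now = now or datetime.now(timezone.utc)
    expired = [key for key, (_, expiry) in url_store.items()
               if expiry is not None and expiry < now]
    for key in expired:
        del url_store[key]
        fresh_keys.add(key)  # key goes back into the KGS DB for reuse
    return len(expired)
```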
10. Telemetry (analytics)
- what are the most important things to watch to keep our system maintained and prepared for spikes / downtime?
- country of visitor - helps us decide (when we scale) where to place our servers (closest to countries where there are most users)
- date and time of access - helps us gauge peak times
- web page referred to on the click - helps us detect spam(?)
- browser / platform where the url is being accessed from - dunno?
11. Security and Permissions
- can users create private urls?
- we can store a private/public flag with each url, to be checked when being accessed.
- can the user set a whitelist of users who can access a URL? (authorization)
- we can create a new table to store UserIDs that have permission to view a specific URL