URL shortener
1. why? (brief)
optimize links across devices, track individual links to analyze audience and campaign performance, and hide the original (often affiliate) URLs.
2. requirements (given + clarified)
Functional Requirements:
- Given a URL, our service should generate a shorter and unique alias of it. This is called a short link. This link should be short enough to be easily copied and pasted into applications.
- When users access a short link, our service should redirect them to the original link.
- Users should optionally be able to pick a custom short link for their URL.
- Links will expire after a standard default timespan. Users should be able to specify the expiration time.
Non-Functional Requirements:
- The system should be highly available. This is required because, if our service is down, all the URL redirections will start failing.
- URL redirection should happen in real-time with minimal latency.
- Shortened links should not be guessable (not predictable).
Extended Requirements:
- Analytics; e.g., how many times a redirection happened?
- Our service should also be accessible through REST APIs by other services.
3. capacity estimation and constraints
- read-heavy
- lots of redirection requests
- assumed ratio of 100:1
traffic estimates (Queries Per Second - QPS)
URL shortening = write event
URL redirection = read event
ballpark number of URL shortenings: 500M
read/write ratio: 100:1
100 * 500M = 50B redirections per month
QPS
500 million / (30 days * 24 hours * 3600 seconds) = 500 million / ~2.5 million seconds = ~200 URL shortenings/s (write)
with read:write ratio of 100:1, URL redirections per second
200 URLs/s * 100 = 20k redirections per second (read)
OR
50B redirections per month / 2.5 million seconds = 20K redirections per second
READ = 20K per second, WRITE = 200 per second
total QPS is dominated by reads (~20K/s), because this system is read-heavy
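the arithmetic above can be sketched as a quick back-of-envelope script (all inputs are the assumed estimates from these notes, not measurements):

```python
# Back-of-envelope traffic estimate, using the assumed numbers from these
# notes: 500M new URLs per month and a 100:1 read:write ratio.
SECONDS_PER_MONTH = 30 * 24 * 3600  # ~2.6 million seconds

new_urls_per_month = 500_000_000
read_write_ratio = 100

write_qps = new_urls_per_month / SECONDS_PER_MONTH  # ~193/s, round to ~200
read_qps = write_qps * read_write_ratio             # ~19.3K/s, round to ~20K
```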
storage estimates
unit stored = URL shortening request (write)
assumed length of storage = 5 years
ballpark stored object size = 500 bytes
how many objects?
500 million per month * 5 years * 12 months = 30 billion objects
how big?
30 billion * 500 bytes = 15 terabytes (TB)
bandwidth estimates
write requests expect 200 URLs per second
200 * 500 bytes = 100 KB/s
read requests expect ~20K URL redirections per second
20K * 500 bytes = ~10MB/s
READ = ~10MB/s
WRITE = 100KB/s
memory estimates - caching hot URLs (read)
follow the 80/20 rule
20% of the hot URLs are frequently accessed, generating 80% of traffic
then we would like to cache these 20% hot URLs.
requests per second = 20K
requests per day = 20K * 3600 seconds * 24 hours = ~1.7 billion requests per day
to cache 20% of these requests
1.7 billion * 500 bytes = 850 GB
1.7 billion * 500 bytes * 0.2 = 170 GB
cache memory required = 170GB
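the storage, bandwidth and cache numbers can be checked the same way (again, all inputs are the assumed estimates above):

```python
# Storage, bandwidth and cache estimates, assuming 500-byte objects,
# 5-year retention, and the rounded QPS figures estimated above.
OBJ_BYTES = 500
urls_per_month = 500_000_000

total_objects = urls_per_month * 12 * 5    # 30 billion objects over 5 years
storage_bytes = total_objects * OBJ_BYTES  # 15 TB

write_qps, read_qps = 200, 20_000
incoming_bw = write_qps * OBJ_BYTES        # 100 KB/s
outgoing_bw = read_qps * OBJ_BYTES         # 10 MB/s

requests_per_day = read_qps * 24 * 3600           # ~1.7 billion
cache_bytes = requests_per_day * OBJ_BYTES * 0.2  # ~170 GB (80/20 rule)
```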
high level estimates
| Type | Estimate |
|---|---|
| New URLs (WRITE) | 200/s |
| URL redirections (READ) | 20K/s |
| Incoming data | 100KB/s |
| Outgoing data | 10MB/s |
| Storage for 5 years | 15TB |
| Memory for cache | 170GB |
4. System APIs
in order to explicitly state what is expected from the system.
APIs for creating and deleting URLs (no dedicated read API is needed: reads happen through the redirect itself)
Create URL
createURL(api_dev_key, original_url, custom_alias=None, user_name=None, expiry_date=None)
| Parameters | type | description |
|---|---|---|
| api_dev_key | string | API developer key. e.g. to throttle users based on their quota |
| original_url | string | the original URL |
| custom_alias | string | optional custom name for URL |
| user_name | string | optional user name, e.g. for encoding or url path |
| expiry_date | string | optional expiration date for shortened URL |
| Returns | event | type | description |
|---|---|---|---|
| shortened URL | success | string | the URL to be shared by users (end result) |
| error code | error | Error object | error code and message |
Delete URL
deleteURL(api_dev_key, url_key)
| Parameters | type | description |
|---|---|---|
| api_dev_key | string | Api developer key |
| url_key | string | the shortened URL to be deleted |
| Returns | event | type | description |
|---|---|---|---|
| "Url Removed" | success | string | message indicating success |
| error code | error | Error object | error code and message |
how do we detect and prevent abuse?
currently there is no limit to how many times a URL can be consumed; a malicious user could hammer a URL (or create URLs en masse) to exhaust our redirection capacity. to prevent abuse, we can throttle users via their api_dev_key, capping the number of URL creations and redirections per time window (limits can differ per key, e.g. with tiered pricing).
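a minimal sketch of such per-key throttling, assuming a fixed-window counter (the class name, limit and window size are illustrative, not part of the API above):

```python
import time
from collections import defaultdict
from typing import Optional

class FixedWindowRateLimiter:
    """Per-key fixed-window throttle. The key would be the api_dev_key;
    the limit and window values are illustrative, not part of the spec."""

    def __init__(self, max_requests: int, window_seconds: int):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.counters = defaultdict(int)  # (key, window index) -> request count

    def allow(self, api_dev_key: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        window = int(now) // self.window_seconds
        bucket = (api_dev_key, window)
        if self.counters[bucket] >= self.max_requests:
            return False  # quota exhausted for this window: reject
        self.counters[bucket] += 1
        return True
```

a real deployment would keep these counters in a shared store (e.g. Redis) so every app server sees the same quota.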
5. Database Design
defining DB schemas in the early stages helps us understand the data flow, and will later guide data partitioning.
the nature of data we will store:
- we need to store billions of records
- each object we store is small (less than 1K)
- there are no relationships between records - other than storing which user created a URL
- our service is read-heavy
two tables
URL table
| key | type |
|---|---|
| Primary Key - Hash | varchar(16) |
| Original URL | varchar(512) |
| Creation Date | datetime |
| Expiration Date | datetime |
| UserID | int |
User table
| key | type |
|---|---|
| Name | varchar(20) |
| Email | varchar(32) |
| Creation Date | datetime |
| Last Login | datetime |
NOSQL, because:
- we anticipate storing billions of rows
- we don't need to use relationships between objects
- easier to scale
6. Basic System Design and Algorithm
explore two options
a. Encoding actual URL
- compute unique hash (md5, sha256)
- then encode hash to display
a reasonable question: what should the length of the short key be? 6, 8 or 10 characters?
Using base64 encoding, 6 letters would result in 64^6 = ~68.7 billion possible strings
Using base64 encoding, 8 letters would result in 64^8 = ~281 trillion possible strings
assuming we're sticking to 5 year expiry, we only need to store 30 billion strings, so 68.7 billion would suffice for our system
if we use the md5 algorithm as our hash function, it will produce a 128 bit hash value.
1 base64 digit represents 6 bits of data.
encoding it to base64 turns it into ceil(128 / 6) = 22 characters (24 with '=' padding)
sidenote - use base64url encoding (which is url-safe as it replaces characters '+' and '/')
since we only have space for 6 characters per short key, how will we choose our key?
taking just the first 6 letters could result in duplication, and to resolve that, we can choose some other characters out of the encoding string or swap some characters.
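a sketch of this encoding approach, assuming md5 + base64url and truncation to 6 characters (the function name is mine; the collision issues discussed next still apply):

```python
import base64
import hashlib

def short_key(original_url: str, length: int = 6) -> str:
    """Hash the URL with md5 (128 bits), base64url-encode the digest,
    and keep the first `length` characters. Name is illustrative."""
    digest = hashlib.md5(original_url.encode("utf-8")).digest()  # 16 bytes
    encoded = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    # the full digest encodes to 22 characters; we truncate to the key length
    return encoded[:length]
```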
issues with this solution
- if multiple users enter the same URL, they can get the same shortened URL, (clarify whether or not this is acceptable. this means we can't tie this url to a user, therefore cannot cap limits)
- what if parts of the URL are URL encoded? is that really a problem? read more here
workaround for issues
- append a sequence number counting up, then generate a hash
- append user id to input url
b. generating keys offline
we can have a standalone Key Generation Service (KGS) that generates 6 letter strings beforehand and stores them in a database.
this acts as an index for the URLs, linked to real URLs.
this service will ensure that there is no collision in the DB (use SQL DB for this)
this service has to support 30 billion keys (for the 30 billion URLs that we will potentially store in the database)
can concurrency cause problems? *(ASYNC)
- if there are multiple servers reading keys concurrently, we might get a scenario where two or more servers try to read the same key from the database.
solution ^
- use two tables to store keys
- one for fresh keys
- one for used keys
- as soon as KGS gives keys to one of the servers, it can move them to the used keys table.
- for simplicity, as soon as KGS loads some keys into memory (RAM), it can move them to the used-keys table. this ensures each server gets unique keys. if the service dies before assigning all the keys in memory, we lose those keys - which is acceptable, given the huge number of unique keys we have.
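a toy in-memory sketch of the hand-out logic described above (a real KGS would persist both sets as the two database tables; all names are illustrative):

```python
import secrets
import string

BASE64URL_ALPHABET = string.ascii_letters + string.digits + "-_"

class KeyGenerationService:
    """Toy in-memory KGS: pre-generates unique 6-character keys and moves
    handed-out keys to a 'used' set, so no two app servers get the same key."""

    def __init__(self, pregenerate: int = 1000):
        self.fresh = set()
        self.used = set()
        while len(self.fresh) < pregenerate:  # the set dedupes any collisions
            key = "".join(secrets.choice(BASE64URL_ALPHABET) for _ in range(6))
            self.fresh.add(key)

    def get_keys(self, batch_size: int) -> list:
        batch = [self.fresh.pop() for _ in range(batch_size)]
        self.used.update(batch)  # marked used the moment they are handed out
        return batch
```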
what would be the key-DB size?
- with base64 encoding, we can generate 68.7 billion 6-letter keys. assuming we need 1 byte to store one character, we can store these keys in:
6 bytes (characters per key) * 68.7B (unique keys) = 412 GB
isn't KGS a single point of failure? (SPOF)
- yes
- to solve this, we can have a standby replica of KGS
- whenever the primary server dies, the standby server can take over to generate and provide keys.
we can look up the key in our DB to get the full URL. if it's present in the DB, issue an HTTP 302 redirect with the stored URL in the 'Location' header. if the key is not present, return HTTP 404 Not Found, or redirect the user to the homepage.
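the redirect path above, sketched with a dict standing in for the real datastore:

```python
from datetime import datetime, timezone

# `url_store` stands in for the datastore: key -> (original URL, expiry)
url_store = {}

def redirect(key: str):
    """Return (status, location) for a short-link hit, per the flow above."""
    record = url_store.get(key)
    if record is None:
        return 404, None  # unknown key: Not Found (or redirect to homepage)
    original_url, expiry = record
    if expiry is not None and expiry < datetime.now(timezone.utc):
        return 404, None  # expired links behave like missing ones
    return 302, original_url  # the Location header carries original_url
```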
should we impose size limits on custom aliases?
our service supports custom aliases. users can pick any 'key' they like, but providing a custom alias is not mandatory. (if the user picks a key that clashes, the DB should return 'key has already been taken'.) a size limit is reasonable, e.g. 16 characters, matching the varchar(16) key column in our schema.
key clashes
- to prevent key clashes, we can also use namespacing, by adding the user's username in the url route, eg
tinyurl.com/username/urlbase64key
7. Data Partitioning and Replication
- how to divide and store our billions of URLs
a. Range Based Partitioning (horizontal partitioning: split rows across partitions by key range)
store URLs in separate partitions based on the first letter of the hash key
WIP https://www.educative.io/courses/grokking-the-system-design-interview/m2ygV4E81AR
b. Hash Based Partitioning
take a unique value from each record (e.g. the short-URL key) and run it through a hash function that maps it to a number in a predetermined range (e.g. 1-10).
tangliqun_5673 --> hash function --> 3
map that number to a database server. with this, we can scale to 10 databases. however, this is quite inflexible: if we add or remove servers, most keys map to different numbers and the data must be reshuffled. To solve this, we use consistent hashing. (link to consistent hashing here)
Some key properties of the hash function used in Hash based partitioning:
- our hash function distributes URLs across the partitions
- a good hash function produces a roughly uniform (even) distribution
- the hash function MUST be a pure function: it must always return the same output for the same input (we need this to find the partition again the next time we access the data)
- the hash function maps the input to a number in a range we define (e.g. [1 to 256])
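a minimal sketch of hash-based partition selection, assuming md5 and a fixed partition count (this modulo step is exactly what consistent hashing replaces):

```python
import hashlib

def partition_for(key: str, num_partitions: int = 256) -> int:
    """Map a short-URL key to a partition index in [0, num_partitions).
    md5 is pure and roughly uniform; the modulo is the naive step that
    breaks when servers are added or removed."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```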
8. Caching
- an LRU (least recently used) cache eviction policy fits
- the hotter the URL, the more we should cache it
- if a URL is created but rarely used, it gets evicted first, so when cache memory runs out we only slow down the few users accessing rare URLs
- we can have caches that get updated ONLY when there is a cache miss
- we can also have multiple instances of caches to distribute load
- we can even have multiple caches near different CDNs, and each cache stores different lists of urls, in order to be able to serve the hottest urls based on location. (e.g. lowyat.net in malaysia is used a lot more in malaysia and singapore, while taskrabbit is used more in USA and other countries that have the service.)
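a minimal LRU cache along these lines, using Python's OrderedDict (the capacity here is a toy number; in production it would be sized toward the ~170 GB estimate above):

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache for hot URLs, backed by OrderedDict."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()  # key -> original URL, oldest first

    def get(self, key: str):
        if key not in self.data:
            return None  # cache miss: caller fetches from DB, then put()s
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key: str, url: str) -> None:
        self.data[key] = url
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used
```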
9. Load Balancing
- round robin load balancing is a good default: easy to implement, with negligible overhead
- add load balancing layers at 3 places:
  - between clients and application servers (browser <-> app)
  - between application servers and database servers (app <-> db)
  - between application servers and cache servers (app <-> cache)
purging or DB cleanup
- what happens to expired links?
two ways:
- actively search for and delete expired links
- puts a lot of pressure on the database
- lazy cleanup
- when a user tries to access an expired link, we delete the link and return an error to the user
- have a separate cleanup service that slowly (periodically) removes expired links from storage and cache. run this service only when the user traffic is expected to be low.
- default expiration time (2 years) to ensure this flow will run and there will never be dead links that don't get deleted
- after removing expired link, we can add key back into the Key generation service DB to be reused
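the lazy/periodic cleanup plus key-recycling step can be sketched as follows (`url_store` and `fresh_keys` are stand-ins for the real DB and KGS tables; names are illustrative):

```python
from datetime import datetime, timezone

def cleanup_expired(url_store: dict, fresh_keys: set, now=None) -> int:
    """Drop expired records and return their keys to the KGS pool for reuse.
    `url_store` maps key -> (original URL, expiry or None); `fresh_keys`
    stands in for the KGS fresh-keys table."""
    now = now or datetime.now(timezone.utc)
    expired = [key for key, (_, expiry) in url_store.items()
               if expiry is not None and expiry < now]
    for key in expired:
        del url_store[key]
        fresh_keys.add(key)  # key goes back into the KGS DB for reuse
    return len(expired)
```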
10. Telemetry (analytics)
- what are the most important things to watch to keep our system maintained and prepared for spikes / downtime?
- country of visitor - helps us decide (when we scale) where to place our servers (closest to countries where there are most users)
- date and time of access - helps us gauge peak times
- web page referred to on the click - helps us detect spam(?)
- browser / platform where the url is being accessed from - dunno?
11. Security and Permissions
- can users create private urls?
- we can store a private/public flag with each url, to be checked when being accessed.
- can the user set a whitelist of users who can access a URL? (authorization)
- we can create a new table to store UserIDs that have permission to view a specific URL