URL shortener

1. why? (brief)

short links optimize sharing across devices, allow tracking of individual links to analyze audience and campaign performance, and hide the original (e.g. affiliate) URLs.

2. requirements (given + clarified)

Functional Requirements:

  1. Given a URL, our service should generate a shorter and unique alias of it. This is called a short link. This link should be short enough to be easily copied and pasted into applications.
  2. When users access a short link, our service should redirect them to the original link.
  3. Users should optionally be able to pick a custom short link for their URL.
  4. Links will expire after a standard default timespan. Users should be able to specify the expiration time.

Non-Functional Requirements:

  1. The system should be highly available. This is required because, if our service is down, all the URL redirections will start failing.
  2. URL redirection should happen in real-time with minimal latency.
  3. Shortened links should not be guessable (not predictable).

Extended Requirements:

  1. Analytics; e.g., how many times a redirection happened?
  2. Our service should also be accessible through REST APIs by other services.

3. capacity estimation and constraints

traffic estimates (Queries Per Second - QPS)

URL shortening = write event
URL redirection = read event
ballpark number of URL shortenings: 500M per month
read/write ratio: 100:1

100 * 500M = 50B redirections per month

QPS

500 million / (30 days * 24 hours * 3600 seconds) = ~200 URL shortenings/s (write)

in other words: 500 million queries divided by ~2.5 million seconds (1 month in seconds) equals about 200 URLs per second

with read:write ratio of 100:1, URL redirections per second

200 URLs/s * 100 = 20k redirections per second (read)

OR

50B redirections per month / 2.5 million seconds = 20K redirections per second

READ (QPS) = 20K per second - because this system is read-heavy

WRITE (QPS) = 200 per second

storage estimates

unit stored = URL shortening request (write)
assumed length of storage = 5 years
ballpark stored object size = 500 bytes

how many objects?

500 million per month * 5 years * 12 months = 30 billion objects

how big?

30 billion * 500 bytes = 15 terabytes (TB)

bandwidth estimates

write requests expect 200 URLs per second

200 * 500 bytes = 100 KB/s

read requests expect ~20K URL redirections per second

20K * 500 bytes = ~10MB/s

READ = ~10MB/s

WRITE = 100KB/s

memory estimates - caching hot URLs (read)

follow the 80/20 rule: 20% of URLs are "hot" and frequently accessed, generating 80% of the traffic, so we would like to cache these 20% hot URLs.

requests per second = 20K, so requests per day = 20K * 3600 seconds * 24 hours = ~1.7 billion requests per day

to cache 20% of these requests

1.7 billion * 500 bytes = 850 GB (all daily requests)

1.7 billion * 500 bytes * 0.2 = 170 GB (hottest 20%)

cache memory required = 170GB

high level estimates

| type | unit |
| --- | --- |
| New URLs (WRITE) | 200/s |
| URL redirections (READ) | 20K/s |
| Incoming data | 100KB/s |
| Outgoing data | 10MB/s |
| Storage for 5 years | 15TB |
| Memory for cache | 170GB |
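The estimates above can be reproduced with a short script (order-of-magnitude arithmetic only; the 500-byte object size and 80/20 split are the assumptions stated earlier):

```python
# Back-of-envelope estimates for the URL shortener (order-of-magnitude only).
SECONDS_PER_MONTH = 30 * 24 * 3600          # ~2.6 million seconds

new_urls_per_month = 500_000_000            # 500M shortenings/month (write)
read_write_ratio = 100

write_qps = new_urls_per_month / SECONDS_PER_MONTH       # ~193  -> "~200/s"
read_qps = write_qps * read_write_ratio                  # ~19K  -> "~20K/s"

object_size_bytes = 500                     # assumed average stored object
objects_5_years = new_urls_per_month * 12 * 5            # 30 billion objects
storage_bytes = objects_5_years * object_size_bytes      # ~15 TB

incoming_bw = write_qps * object_size_bytes              # ~100 KB/s
outgoing_bw = read_qps * object_size_bytes               # ~10 MB/s

requests_per_day = read_qps * 24 * 3600                  # ~1.7 billion/day
cache_bytes = requests_per_day * object_size_bytes * 0.2 # ~170 GB (80/20 rule)

print(f"write QPS ~{write_qps:.0f}, read QPS ~{read_qps:.0f}")
print(f"storage ~{storage_bytes / 1e12:.0f} TB, cache ~{cache_bytes / 1e9:.0f} GB")
```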

4. System APIs

we define APIs in order to explicitly state what is expected from the system.

API for creating and deleting URL (no read needed really)

Create URL

createURL(api_dev_key, original_url, custom_alias=None, user_name=None, expiry_date=None)

| Parameters | type | description |
| --- | --- | --- |
| api_dev_key | string | API developer key, e.g. to throttle users based on their quota |
| original_url | string | the original URL |
| custom_alias | string | optional custom name for the URL |
| user_name | string | optional user name, e.g. for encoding in the URL path |
| expiry_date | string | optional expiration date for the shortened URL |

| Returns | event | type | description |
| --- | --- | --- | --- |
| shortened URL | success | string | the URL to be shared by users (end result) |
| error code | error | Error object | error code and message |

Delete URL

deleteURL(api_dev_key, url_key)

| Parameters | type | description |
| --- | --- | --- |
| api_dev_key | string | API developer key |
| url_key | string | the shortened URL to be deleted |

| Returns | event | type | description |
| --- | --- | --- | --- |
| "Url Removed" | success | string | message indicating success |
| error code | error | Error object | error code and message |

how do we detect and prevent abuse?

currently there is no limit to how many times a URL can be consumed; a malicious user could hit a URL repeatedly and exhaust our redirection capacity. to prevent abuse, users can be rate-limited via their api_dev_key: cap the number of URL creations and redirections per time window (the cap can differ per user key, with tiered pricing).

5. Database Design

defining DB schemas in the early stages helps us understand the data flow, and will later guide data partitioning.

the nature of data we will store:

two tables

URL table

| key | type |
| --- | --- |
| Primary Key - Hash | varchar(16) |
| Original URL | varchar(512) |
| Creation Date | datetime |
| Expiration Date | datetime |
| UserID | int |

User table

| key | type |
| --- | --- |
| Name | varchar(20) |
| Email | varchar(32) |
| Creation Date | datetime |
| Last Login | datetime |

NoSQL, because: we will store billions of rows with no relationships between objects, and a key-value store is easier to scale horizontally.

6. Basic System Design and Algorithm

explore two options

a. Encoding actual URL

reasonable question should be: what should be the length of the short key?, 6, 8 or 10 characters?

Using base64 encoding, 6 letters would result in 64^6 = ~68.7 billion possible strings. Using base64 encoding, 8 letters would result in 64^8 = ~281 trillion possible strings.

assuming we're sticking to 5 year expiry, we only need to store 30 billion strings, so 68.7 billion would suffice for our system

if we use the md5 algorithm as our hash function, it will produce a 128 bit hash value.

1 base64 digit represents 6 bits of data.

encoding to base64 turns it into ceil(128 / 6) = 22 characters

sidenote - use base64url encoding (which is url-safe as it replaces characters '+' and '/')

since we only have space for 6 characters per short key, how will we choose our key?

taking just the first 6 letters could result in duplication; to resolve that, we can choose some other characters out of the encoded string or swap some characters.

issues with this solution

workaround for issues

b. generating keys offline

we can have a standalone Key Generation Service (KGS) that generates 6 letter strings beforehand and stores them in a database.

this acts as an index for the URLs, linked to real URLs.

this service will ensure that there is no collision in the DB (use SQL DB for this)

this service has to support 30 billion keys (for the 30 billion URLs that we will potentially store in the database)

can concurrency cause problems? *(ASYNC)

solution ^
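A minimal in-memory sketch of the KGS idea: pre-generate unique 6-char keys, and hand them out under a lock so two concurrent app servers never receive the same key. (In the real design the used/unused split lives in the key DB, not in process memory; this only illustrates the handout logic.)

```python
import secrets
import string
import threading

ALPHABET = string.ascii_letters + string.digits + "-_"  # base64url alphabet

class KeyGenerationService:
    """Pre-generates unique 6-char keys and hands each out exactly once."""

    def __init__(self, pregenerate: int = 1000):
        self._unused: set[str] = set()
        self._lock = threading.Lock()
        # The set silently absorbs random collisions, so all keys are unique.
        while len(self._unused) < pregenerate:
            self._unused.add("".join(secrets.choice(ALPHABET) for _ in range(6)))

    def get_key(self) -> str:
        # The lock ensures two concurrent callers can't receive the same key.
        with self._lock:
            return self._unused.pop()
```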

what would be the key-DB size?

6 bytes (characters per key) * 68.7B (unique keys) = 412 GB

isn't KGS a single point of failure? (SPOF)

how do we perform a key lookup?

we can look up the key in our DB to get the full URL. if it's present in the DB, issue an HTTP 302 redirect with the stored URL in the 'Location' header. if the key is not present, return HTTP 404 Not Found, or redirect the user to the homepage.
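The lookup step reduces to a tiny function; the dict stands in for the URL table and the homepage URL is a placeholder:

```python
_db = {"abc123": "https://example.com/landing"}  # stand-in for the URL table
HOMEPAGE = "https://short.example/"              # hypothetical homepage URL

def resolve(key: str) -> tuple[int, str]:
    """Return (HTTP status, Location header value) for a short key."""
    original = _db.get(key)
    if original is not None:
        return 302, original   # temporary redirect to the stored URL
    return 404, HOMEPAGE       # not found: send the user to the homepage
```

Using 302 (rather than 301) keeps every hit flowing through our servers, which matters for the analytics requirement.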

should we impose size limits on custom aliases?

our service supports custom aliases. users can pick any 'key' they like, but providing a custom alias is not mandatory. (if the user picks a key that clashes, the DB should return 'key has already been taken'.) it is reasonable to impose a size limit on custom aliases, e.g. 16 characters to match the varchar(16) key column.

key clashes

7. Data Partitioning and Replication

a. Range Based Partitioning (horizontal partitioning - split by 'difference', as defined by me)

store URLs in separate partitions based on the first letter of the hash key
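A sketch of first-letter routing, assuming the 64-character base64url alphabet and a hypothetical count of 4 DB servers (contiguous alphabet ranges map to partitions):

```python
import string

PARTITIONS = 4  # hypothetical number of DB servers

# base64url alphabet, in a fixed order so ranges are contiguous
_ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "-_"

def range_partition(hash_key: str) -> int:
    """Route by the first character of the key: each partition owns an
    equal-sized contiguous slice of the alphabet."""
    index = _ALPHABET.index(hash_key[0])
    return index * PARTITIONS // len(_ALPHABET)
```

The known drawback of range-based partitioning is imbalance: if keys starting with certain letters are more common, those partitions become hotspots.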

WIP https://www.educative.io/courses/grokking-the-system-design-interview/m2ygV4E81AR

b. Hash Based Partitioning

have a hash function that maps a unique attribute of each stored object (e.g. the key) to a partition number in a fixed range (e.g. 1-10).

tangliqun_5673 --> hash function --> 3

map that number to a database server. with this, we can scale to 10 databases. however, this is quite inflexible: the mapping breaks whenever we add or remove servers. to solve this, we use consistent hashing. (link to consistent hashing here)

Some key properties of the hash function used in Hash based partitioning:

8. Caching

9. Load Balancing

purging or DB cleanup

two ways:

  1. actively search for and delete expired links
  2. lazy cleanup

10. Telemetry (analytics)

  1. country of visitor - helps us decide (when we scale) where to place our servers (closest to countries where there are most users)
  2. date and time of access - helps us gauge peak times
  3. referring page of the click - helps us detect spam(?)
  4. browser / platform where the url is being accessed from - dunno?

11. Security and Permissions