Overview - First things first
what is architecture and why does it matter
from martin fowler
brief video 14 mins from the article above
step 1 - clarify your requirements
for example, for a twitter-like service:
- (C)RUD - will the users of our service be able to post tweets and follow other people?
- C(R)UD - should we also design to generate and display the user's timeline?
- Large files - will tweets contain photos and videos?
- are we focusing on the backend only or are we developing the front-end too?
- Go through large amounts of data - will users be able to search tweets?
- Data manipulation - Do we need to display hot trending topics?
- Difficult features - will there be any push notification for new (or important) tweets?
- other questions i can come up with:
- will the user be able to sort and filter tweets by certain criteria?
- will the user be able to edit/undo the tweet after posting it?
step 2 - back-of-the-envelope estimation
always a good idea to estimate the scale of the system we're going to design.
helps later when:
- scaling
- partitioning
- load balancing
- caching
things to estimate for:
- what scale is expected
- number of new tweets, number of tweet views, number of timeline generations per sec etc
- how much storage needed
- different requirements if users can have photos and videos in tweets (sql vs something else i guess)
- network bandwidth usage expected
- this will be crucial in deciding how we will manage traffic and balance load between servers
- <span class="text-highlight">how to estimate this?</span>
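one way to start: plug assumed numbers into simple arithmetic. everything below is an assumption for illustration (DAU, tweets/user, read:write ratio, tweet size), not real Twitter stats:

```python
# back-of-envelope sketch for a twitter-like service.
# all numbers are assumptions for illustration only.

DAU = 200_000_000            # assumed daily active users
tweets_per_user_per_day = 2  # assumed average
read_write_ratio = 100       # assumed: 100 timeline reads per tweet written
avg_tweet_bytes = 300        # ~140 chars + metadata, assumed

seconds_per_day = 86_400

write_qps = DAU * tweets_per_user_per_day / seconds_per_day
read_qps = write_qps * read_write_ratio
storage_per_day_gb = DAU * tweets_per_user_per_day * avg_tweet_bytes / 1e9

print(f"write QPS ~ {write_qps:,.0f}")                  # ~4.6k writes/sec
print(f"read QPS  ~ {read_qps:,.0f}")                   # ~460k reads/sec
print(f"storage   ~ {storage_per_day_gb:,.0f} GB/day")  # ~120 GB/day
```

the exact numbers matter less than the ratios: the ~100x read amplification is what justifies read replicas and caching later.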
key question: identify how your system is expected to grow.
for example, in a twitter-like service the user count will grow; identify the parts that will scale more than linearly with it. (i guess: identify the one-to-many relationships here, i.e. 1:n and m:n relationships)
which will lead to:
- more reads (how much more? does it scale linearly or exponentially?)
- exponentially more replies per tweet
- multiplied by the ways one can provide 'feedback', like
- favouriting
- retweeting
- replying
- exponentially more push notifications
<span class="text-highlight">seems like a worked out solution is needed for the tweet thread / reply thread problem</span>
but will not lead to exponentially more ____ per user:
- User entities
- tweets
- timeline generations
as these scale linearly, per user.
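the linear-vs-superlinear distinction above can be sketched with a toy model. the numbers are assumptions; the point is that per-user quantities grow with users, while deliveries (timeline reads, notifications) grow with users × average followers, so if the follow graph densifies as the network grows, deliveries grow faster than linearly:

```python
# toy fan-out model: tweets are linear in users, deliveries are not
# (assumed numbers, for illustration only).

def totals(users: int, avg_followers: int, tweets_per_user: int = 2):
    tweets = users * tweets_per_user     # linear in users
    deliveries = tweets * avg_followers  # each tweet fans out to followers
    return tweets, deliveries

# a bigger network often means more follows per user on average,
# so deliveries outpace user growth:
for users, avg_followers in [(1_000, 10), (10_000, 30), (100_000, 90)]:
    tweets, deliveries = totals(users, avg_followers)
    print(users, tweets, deliveries)
```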
step 3 - system interface definition
define what APIs are expected from the system.
establishes the exact contract expected, and also ensures we haven't gotten any requirements wrong.
example APIs
postTweet(user_id, tweet_data, tweet_location, user_location, timestamp, ...)
generateTimeline(user_id, current_time, user_location, ...)
setTweetAsFavourite(user_id, tweet_id, timestamp, ...)
step 4 - defining data model
- to clarify how data flows between the different components of the system.
- later, it guides data partitioning and management
be able to:
- identify various entities of the system
- how they will interact with each other
- different aspects of data management:
- storage
- transportation
- encryption
some entities for the twitter-like service:
- User: UserID, Name, Email, DoB, CreationDate, LastLogin, etc
- Tweet: TweetID, Content, TweetLocation, NumberOfLikes, TimeStamp, Attachment(if photos or videos supported), etc.
- UserFollow: UserID1, UserID2 note: <span class="text-highlight">why decide to have a separate follow table/collection for this?</span>
- FavouriteTweets: UserID, TweetID, Timestamp <span class="text-highlight">also this</span>
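on the highlighted question: follow and favourite are many-to-many relationships (one user follows many users and is followed by many), so in a relational model they get their own join table rather than a column on User. a sketch with dataclasses, field names following the notes above:

```python
# entities as dataclasses; UserFollow/FavouriteTweet are separate tables
# because both relationships are many-to-many.
from dataclasses import dataclass

@dataclass
class User:
    user_id: int
    name: str
    email: str

@dataclass
class Tweet:
    tweet_id: int
    user_id: int
    content: str
    number_of_likes: int = 0

@dataclass
class UserFollow:          # one (follower, followee) pair per row
    follower_id: int
    followee_id: int

@dataclass
class FavouriteTweet:      # same pattern: user<->tweet is many-to-many
    user_id: int
    tweet_id: int
    timestamp: float
```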
step 5 - high level design
draw block diagrams with 5-6 boxes representing the core components of the system.
identify enough components to solve the actual problem from end to end.
for twitter at a high level:
- we need multiple app servers to serve all the read/write requests
- load balancers in front to distribute traffic
- if more read traffic vs write
- have separate servers to handle reads
- efficient read-heavy optimized database <span class="text-highlight">like what</span> that can store all tweets and support a huge number of reads
aside: read-heavy workloads vs write-heavy workloads
read-heavy
- examples: articles in a newspaper or a blog are rarely updated, but are read very frequently.
- caching and replication helps
- additional read-only servers (master-slave db)
what is good for read-heavy components
- Redis
- Hadoop - stores data in 128 MB blocks (HDFS default) and is designed to read and process massive amounts of data
write-heavy
- examples: writing access logs in a system or bank transactions.
- caching and replication can actually hurt performance here.
what is good for write-heavy components
- MySQL - very good write performance to disk
- Cassandra apparently, but depending on what type of write.
- Hive (on Hadoop HDFS) - built for batch reads and writes over enormous datasets
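for write-heavy paths (like the access-log example) a common trick is buffering: accumulate writes in memory and flush them in batches, trading a little durability for far fewer expensive storage writes. a sketch, with the batch size as an assumed parameter:

```python
# write-buffer sketch for a write-heavy path: batch many small writes
# into one storage write. batch size is an assumption for illustration.

class BufferedWriter:
    def __init__(self, flush_every: int = 100):
        self.buffer: list[str] = []
        self.flush_every = flush_every
        self.flushes = 0  # how many batched writes hit "disk"

    def write(self, record: str) -> None:
        self.buffer.append(record)
        if len(self.buffer) >= self.flush_every:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.flushes += 1  # one batched disk/db write
            self.buffer.clear()

w = BufferedWriter(flush_every=100)
for i in range(250):
    w.write(f"log line {i}")
w.flush()
print(w.flushes)  # 3 batched writes instead of 250 individual ones
```

note this is why caching doesn't help here: nothing is re-read, so the win comes from coalescing writes, not remembering reads.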
stuff that is good for read AND write heavy components
other databases to check out
new databases such as Google Spanner, Azure Data Warehouse, and MemSQL
this post
aside: what are the core components?
8 commonly used scalable system design patterns
<span class="text-highlight">TODO: learn more about the core components needing consideration</span>
Step 6: Detailed Design
dig deeper into 2-3 major components (based on what the interviewer wants to know more about, their feedback should guide us toward what parts of the system need further discussion).
should be able to:
- present different approaches
- their pros and cons
- explain why we prefer one approach over the other
- the important thing is to consider tradeoffs <span class="text-highlight">(how much you consider is the key here i think)</span> while keeping system constraints in mind
some questions to consider:
- since we will be storing a massive amount of data, <span class="text-highlight">how should we partition our data</span> to distribute it to multiple databases?
- how to handle hot users (traffic spike) who tweet a lot or follow lots of people?
- since a user's timeline will contain the most recent (most relevant) tweets, should we store data in a way that is optimized for scanning the latest tweets? (how the user approaches our app)
- how much and <span class="text-highlight">at which layer</span> should we introduce cache to speed things up (the network layer thingy?)
- what components need better load balancing? <span class="text-highlight">TODO: what does this mean</span>
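a minimal sketch of the partitioning question above: hash-partition tweets by user_id so all of one user's tweets land on one shard. shard count and the md5-based scheme are assumptions for illustration:

```python
# hash-partitioning sketch: map user_id -> shard. NUM_SHARDS is assumed.
import hashlib

NUM_SHARDS = 8

def shard_for(user_id: int) -> int:
    # stable hash (unlike python's builtin hash()) so the mapping
    # survives process restarts
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS
```

this also makes the "hot user" question concrete: every read and write for a celebrity lands on a single shard, so partitioning by user alone doesn't solve traffic spikes.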
step 7: Identifying and resolving bottlenecks
- single points of failure
- what are we doing to mitigate it
- do we have enough replicas of the data
- so if we lose <span class="text-highlight">A FEW SERVERS</span>, we can still serve our users
- do we have enough copies of different services so that a few failures will not cause a total system shutdown?
- how are we monitoring the performance of our service?
- do we get alerts when critical components fail or their performance degrades?
- also identifying what should be a critical alert (because rn everything has an alert)
other components in a system to consider
from here
- CAP theorem
- How to become horizontally scalable in every layer
- Vertical partitioning vs. horizontal partitioning
- Read-heavy vs. write heavy
- Clustering: partitioning vs. replication
- Real-time vs. near real-time vs offline
- How to make web server horizontally scalable using reverse proxy
- How to make reverse proxy horizontally scalable
- How to make database horizontally scalable
- Shared nothing database cluster vs. shared storage database cluster
- MySQL NDB vs. Percona vs. Oracle Database Cluster vs. SQL Server Cluster
- Using cloud database service vs. scaling self-hosted database
- Scaling system on cloud hosted environment vs. self hosted environment
- DNS round robin
- Caching
- Disk caching
- Local memory caching
- Distributed memory caching
- Hardware scaling
- RAID storage
- SAN storage
- Fiber network interface
- Network capacity consideration
- Local cache vs. network distributed cache
- local pre-compute to reduce network traffic (e.g. MapReduce Combiner)
- Storage scaling
- Increasing reading bandwidth (RAID, replication, memory caching, distributed network storage, hdfs, etc.)
- Increasing writing bandwidth (RAID, partitioning, memory write buffer, distributed network storage, hdfs, etc.)
- Scaling storage capacity (RAID, distributed network storage, data compression, etc.)
- Database/datawarehouse optimization
- Database index
- Column store index vs. row store index
- Search optimization (ElasticSearch, Solr)
- Peak-time preparation strategies (cloud vs. self-hosting, AWS Auto Scaling, Google Cloud Autoscaler)
- Cost optimization
- Google Cloud Preemptible, AWS Spot Instance
- Free CDN: CloudFlare, Incapsula
- Resources monitoring, debugging, troubleshooting
- Automate everything
- Load test vs. stress test
refs:
picking a database (based on read or write-heavy applications)
designing twitter-like
intro