Overview - First things first
what is architecture and why does it matter
from martin fowler
brief video 14 mins from the article above
step 1 - clarify your requirements
for example, for a twitter-like service:
- (C)RUD - will the users of our service be able to post tweets and follow other people?
- C(R)UD - should we also design to generate and display the user's timeline?
- Large files - will tweets contain photos and videos?
- are we focusing on the backend only or are we developing the front-end too?
- Go through large amounts of data - will users be able to search tweets?
- Data manipulation - Do we need to display hot trending topics?
- Difficult features - will there be any push notification for new (or important) tweets?
- other questions i can come up with:
- will the user be able to sort and filter tweets by certain criteria?
- will the user be able to edit/undo the tweet after posting it?
step 2 - back-of-the-envelope estimation
always a good idea to estimate the scale of the system we're going to design.
helps later when:
- scaling
- partitioning
- load balancing
- caching
things to estimate for:
- what scale is expected
- number of new tweets, number of tweet views, number of timeline generations per sec etc
- how much storage needed
- different requirements if users can have photos and videos in tweets (sql vs something else i guess)
- network bandwidth usage expected
- this will be crucial in deciding how we will manage traffic and balance load between servers
- <span class="text-highlight">how to estimate this?</span>
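one way to start: plug assumed numbers into simple arithmetic. everything below is an assumption for illustration (DAU, tweets/user, read:write ratio, tweet size), not real Twitter stats:

```python
# back-of-envelope sketch for a twitter-like service.
# all numbers are assumptions for illustration only.

DAU = 200_000_000            # assumed daily active users
tweets_per_user_per_day = 2  # assumed average
read_write_ratio = 100       # assumed: 100 timeline reads per tweet written
avg_tweet_bytes = 300        # ~140 chars + metadata, assumed

seconds_per_day = 86_400

write_qps = DAU * tweets_per_user_per_day / seconds_per_day
read_qps = write_qps * read_write_ratio
storage_per_day_gb = DAU * tweets_per_user_per_day * avg_tweet_bytes / 1e9

print(f"write QPS ~ {write_qps:,.0f}")                  # ~4.6k writes/sec
print(f"read QPS  ~ {read_qps:,.0f}")                   # ~460k reads/sec
print(f"storage   ~ {storage_per_day_gb:,.0f} GB/day")  # ~120 GB/day
```

the exact numbers matter less than the ratios: the ~100x read amplification is what justifies read replicas and caching later.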
key question: identify how your system is expected to grow.
for example, in a twitter-like service the user count will grow; identify the parts that will scale more than linearly with it. (i guess: identify the one-to-many relationships here, i.e. 1:n and m:n relationships)
which will lead to:
- more reads (how much more? does it scale linearly or exponentially?)
- exponentially more replies per tweet
- multiplied by the ways one can provide 'feedback', like
- favouriting
- retweeting
- replying
- exponentially more push notifications
<span class="text-highlight">seems like a worked out solution is needed for the tweet thread / reply thread problem</span>
but will not lead to exponentially more ____ per user:
- User entities
- tweets
- timeline generations
as these scale linearly, per user.
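the linear-vs-superlinear distinction above can be sketched with a toy model. the numbers are assumptions; the point is that per-user quantities grow with users, while deliveries (timeline reads, notifications) grow with users × average followers, so if the follow graph densifies as the network grows, deliveries grow faster than linearly:

```python
# toy fan-out model: tweets are linear in users, deliveries are not
# (assumed numbers, for illustration only).

def totals(users: int, avg_followers: int, tweets_per_user: int = 2):
    tweets = users * tweets_per_user     # linear in users
    deliveries = tweets * avg_followers  # each tweet fans out to followers
    return tweets, deliveries

# a bigger network often means more follows per user on average,
# so deliveries outpace user growth:
for users, avg_followers in [(1_000, 10), (10_000, 30), (100_000, 90)]:
    tweets, deliveries = totals(users, avg_followers)
    print(users, tweets, deliveries)
```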
step 3 - system interface definition
define what APIs are expected from the system.
establishes the exact contract expected, and also ensures we haven't gotten any requirements wrong.
example APIs
postTweet(user_id, tweet_data, tweet_location, user_location, timestamp, ...)
generateTimeline(user_id, current_time, user_location, ...)
setTweetAsFavourite(user_id, tweet_id, timestamp, ...)
step 4 - defining data model
- to clarify how data flows between the different components of the system.
- later, it guides data partitioning and management
be able to:
- identify various entities of the system
- how they will interact with each other
- different aspects of data management:
- storage
- transportation
- encryption
some entities for the twitter-like service:
- User: UserID, Name, Email, DoB, CreationDate, LastLogin, etc
- Tweet: TweetID, Content, TweetLocation, NumberOfLikes, TimeStamp, Attachment(if photos or videos supported), etc.
- UserFollow: UserID1, UserID2 note: <span class="text-highlight">why decide to have a separate follow table/collection for this?</span>
- FavouriteTweets: UserID, TweetID, Timestamp <span class="text-highlight">also this</span>
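on the highlighted question: follow and favourite are many-to-many relationships (one user follows many users and is followed by many), so in a relational model they get their own join table rather than a column on User. a sketch with dataclasses, field names following the notes above:

```python
# entities as dataclasses; UserFollow/FavouriteTweet are separate tables
# because both relationships are many-to-many.
from dataclasses import dataclass

@dataclass
class User:
    user_id: int
    name: str
    email: str

@dataclass
class Tweet:
    tweet_id: int
    user_id: int
    content: str
    number_of_likes: int = 0

@dataclass
class UserFollow:          # one (follower, followee) pair per row
    follower_id: int
    followee_id: int

@dataclass
class FavouriteTweet:      # same pattern: user<->tweet is many-to-many
    user_id: int
    tweet_id: int
    timestamp: float
```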
step 5 - high level design
draw block diagrams with 5-6 boxes representing the core components of the system.
identify enough components to solve the actual problem from end to end.
for twitter at a high level:
- we need multiple app servers to serve all the read/write requests
- load balancers in front to distribute traffic
- if more read traffic vs write
- have separate servers to handle reads
- efficient read-heavy optimized database <span class="text-highlight">like what</span> that can store all tweets and support a huge number of reads
aside: read-heavy workloads vs write-heavy workloads
read-heavy
- examples: articles in a newspaper or a blog are rarely updated, but are read very frequently.
- caching and replication helps
- additional read-only servers (master-slave db)
what is good for read-heavy components
- Redis
- Hadoop - stores data in 128 MB blocks (HDFS default) and is designed to read and process massive amounts of data
write-heavy
- examples: writing access logs in a system or bank transactions.
- caching and replication can actually hurt performance here.
what is good for write-heavy components
- MySQL - very good write performance to disk
- Cassandra apparently, but depending on what type of write.
- Hive (on Hadoop HDFS) - built for batch reads and writes over enormous datasets
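for write-heavy paths (like the access-log example) a common trick is buffering: accumulate writes in memory and flush them in batches, trading a little durability for far fewer expensive storage writes. a sketch, with the batch size as an assumed parameter:

```python
# write-buffer sketch for a write-heavy path: batch many small writes
# into one storage write. batch size is an assumption for illustration.

class BufferedWriter:
    def __init__(self, flush_every: int = 100):
        self.buffer: list[str] = []
        self.flush_every = flush_every
        self.flushes = 0  # how many batched writes hit "disk"

    def write(self, record: str) -> None:
        self.buffer.append(record)
        if len(self.buffer) >= self.flush_every:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.flushes += 1  # one batched disk/db write
            self.buffer.clear()

w = BufferedWriter(flush_every=100)
for i in range(250):
    w.write(f"log line {i}")
w.flush()
print(w.flushes)  # 3 batched writes instead of 250 individual ones
```

note this is why caching doesn't help here: nothing is re-read, so the win comes from coalescing writes, not remembering reads.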
stuff that is good for read AND write heavy components
other databases to check out
new databases such as Google Spanner, Azure Data Warehouse, and MemSQL
this post
aside: what are the core components?
8 commonly used scalable system design patterns
<span class="text-highlight">TODO: learn more about the core components needing consideration</span>
Step 6: Detailed Design
dig deeper into 2-3 major components (based on what the interviewer wants to know more about, their feedback should guide us toward what parts of the system need further discussion).
should be able to:
- present different approaches
- their pros and cons
- explain why we prefer one approach over the other
- the important thing is to consider tradeoffs <span class="text-highlight">(how much you consider is the key here i think)</span> while keeping system constraints in mind
some questions to consider:
- since we will be storing a massive amount of data, <span class="text-highlight">how should we partition our data</span> to distribute it to multiple databases?
- how to handle hot users (traffic spike) who tweet a lot or follow lots of people?
- since a user's timeline will contain the most recent (most relevant) tweets, should we store data in a way that is optimized for scanning the latest tweets? (how the user approaches our app)
- how much and <span class="text-highlight">at which layer</span> should we introduce cache to speed things up (the network layer thingy?)
- what components need better load balancing? <span class="text-highlight">TODO: what does this mean</span>
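a minimal sketch of the partitioning question above: hash-partition tweets by user_id so all of one user's tweets land on one shard. shard count and the md5-based scheme are assumptions for illustration:

```python
# hash-partitioning sketch: map user_id -> shard. NUM_SHARDS is assumed.
import hashlib

NUM_SHARDS = 8

def shard_for(user_id: int) -> int:
    # stable hash (unlike python's builtin hash()) so the mapping
    # survives process restarts
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS
```

this also makes the "hot user" question concrete: every read and write for a celebrity lands on a single shard, so partitioning by user alone doesn't solve traffic spikes.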
step 7: Identifying and resolving bottlenecks
- single points of failure
- what are we doing to mitigate it
- do we have enough replicas of the data
- so if we lose <span class="text-highlight">A FEW SERVERS</span>, we can still serve our users
- do we have enough copies of different services so that a few failures will not cause a total system shutdown?
- how are we monitoring the performance of our service?
- do we get alerts when critical components fail or their performance degrades?
- also identifying what should be a critical alert (because rn everything has an alert)
other components in a system to consider
from here
- CAP theorem
- How to become horizontally scalable in every layer
- Vertical partitioning vs. horizontal partitioning
- Read-heavy vs. write heavy
- Clustering: partitioning vs. replication
- Real-time vs. near real-time vs offline
- How to make web server horizontally scalable using reverse proxy
- How to make reverse proxy horizontally scalable
- How to make database horizontally scalable
- Shared nothing database cluster vs. shared storage database cluster
- MySQL NDB vs. Percona vs. Oracle Database Cluster vs. SQL Server Cluster
- Using cloud database service vs. scaling self-hosted database
- Scaling system on cloud hosted environment vs. self hosted environment
- DNS round robin
- Caching
- Disk caching
- Local memory caching
- Distributed memory caching
- Hardware scaling
- RAID storage
- SAN storage
- Fiber network interface
- Network capacity consideration
- Local cache vs. network distributed cache
- local pre-compute to reduce network traffic (e.g. MapReduce Combiner)
- Storage scaling
- Increasing reading bandwidth (RAID, replication, memory caching, distributed network storage, hdfs, etc.)
- Increasing writing bandwidth (RAID, partitioning, memory write buffer, distributed network storage, hdfs, etc.)
- Scaling storage capacity (RAID, distributed network storage, data compression, etc.)
- Database/datawarehouse optimization
- Database index
- Column store index vs. row store index
- Search optimization (ElasticSearch, Solr)
- Peak-time preparation strategies (cloud vs. self-hosting, AWS Auto Scaling, Google Cloud Autoscaler)
- Cost optimization
- Google Cloud Preemptible, AWS Spot Instance
- Free CDN: CloudFlare, Incapsula
- Resources monitoring, debugging, troubleshooting
- Automate everything
- Load test vs. stress test
refs:
picking a database (based on read or write-heavy applications)
designing twitter-like
intro