Key Characteristics of Distributed Systems

scalability

the capability of a system, process, or a network to grow and manage increased demand.

a scalable system has to be able to continuously evolve in order to support the growing amount of work

reasons to scale:

increased data volume (more data sources, more datapoints being recorded)
increased amount of work (more data report requests, more transactions)

generally the performance of a system declines with system size, due to:

management cost
environment cost - network speed drops because machines tend to be far apart from each other

some tasks cannot be distributed
exercise: identify which parts of a system cannot be distributed (at work or in an example large system)

horizontal scaling - add servers

good examples are Cassandra and MongoDB
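
a minimal sketch of why adding servers spreads the work, assuming keys are routed to servers by hashing. the server names and the shard_for helper are hypothetical, and this is not a description of how Cassandra or MongoDB actually place data.

```python
import hashlib
from collections import Counter


def shard_for(key: str, servers: list[str]) -> str:
    """Pick a server for a key by hashing the key (hypothetical routing rule)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


def load_per_server(keys: list[str], servers: list[str]) -> Counter:
    """Count how many keys land on each server."""
    return Counter(shard_for(key, servers) for key in keys)


keys = [f"user-{i}" for i in range(10_000)]
two = load_per_server(keys, ["server-a", "server-b"])
four = load_per_server(keys, ["server-a", "server-b", "server-c", "server-d"])

# With more servers, the busiest server holds fewer keys.
print(max(two.values()), max(four.values()))
```

note: plain modulo hashing moves most keys to a different server whenever the server count changes, so treat this only as an illustration of load spreading.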

vertical scaling - add ram/cpu/storage etc. to your one server

usually has downtime when adding capacity (because it's one server, you gotta restart it)
good example is MySQL, as it allows for an easy way to switch from smaller to bigger machines
note: but how?

reliability

reliability = the probability a system will not fail in a given period

for example, at Amazon, a user transaction should never be canceled due to a failure of the machine that is running the transaction. if a user has added an item to their shopping cart, the system is expected not to lose it.
other examples:
- a scheduled report should not fail if the machine processing it fails
- a saved report should not be lost if the machine crashes

availability

availability = the percentage of time a system remains operational to perform its required function in a specific period.

if an aircraft/app is down for maintenance, it is considered not available during that time.
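
a small sketch of the arithmetic, assuming availability is measured as uptime divided by total time in the period. the function names and the sample numbers are made up for illustration.

```python
def availability(uptime_hours: float, total_hours: float) -> float:
    """Fraction of the period during which the system was operational."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return uptime_hours / total_hours


def allowed_downtime_minutes(target: float, period_hours: float) -> float:
    """Downtime budget, in minutes, for a target availability over a period."""
    return (1.0 - target) * period_hours * 60.0


HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year

# A system that was down 2 hours for maintenance in a 30-day month.
print(f"{availability(30 * 24 - 2, 30 * 24):.4%}")

# Downtime budget for 99.99% availability over one year.
print(f"{allowed_downtime_minutes(0.9999, HOURS_PER_YEAR):.1f} minutes")
```

so 99.99% availability leaves a budget of about 52.6 minutes of downtime per year, maintenance included.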

reliability vs availability

if a system is reliable, it is available. However, if it is available, it is not necessarily reliable.

Let’s take the example of an online retail store that has 99.99% availability for the first two years after its launch. However, the system was launched without any information security testing. The customers are happy with the system, but they don’t realize that it isn’t very reliable as it is vulnerable to likely risks. In the third year, the system experiences a series of information security incidents that suddenly result in extremely low availability for extended periods of time. This results in reputational and financial damage to the customers.

efficiency

two standard measures of efficiency are:

response time (latency) - speed
throughput (bandwidth) - volume

The two measures correspond to the following unit costs:

number of messages globally sent by the nodes of the system regardless of message size
size of messages representing the volume of data exchanges.

the complexity of operations supported by distributed data structures (e.g. searching for a specific key in a distributed index) can be characterized as a function of one of these cost units.
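
a rough sketch of the two cost units, assuming every message an operation sends is recorded in a log. the Message type, the node names, and the payload sizes are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Message:
    sender: str
    receiver: str
    payload_bytes: int


def message_count(log: list[Message]) -> int:
    """Cost unit 1: number of messages sent, regardless of size."""
    return len(log)


def data_volume(log: list[Message]) -> int:
    """Cost unit 2: total size of the messages exchanged, in bytes."""
    return sum(m.payload_bytes for m in log)


# Hypothetical key lookup: the client asks node-1, which forwards to node-2,
# and the answer travels back along the same path.
lookup = [
    Message("client", "node-1", 64),
    Message("node-1", "node-2", 64),
    Message("node-2", "node-1", 1024),
    Message("node-1", "client", 1024),
]

print(message_count(lookup), data_volume(lookup))
```

the first number relates to response time (more hops, more waiting), the second to throughput (more bytes on the wire).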

serviceability or manageability

serviceability or manageability = the speed with which a system can be repaired or maintained

if time to fix a failed system increases, availability will decrease.
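
a sketch of that relationship, assuming the common steady-state model availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to repair. these terms and the sample numbers are not from the notes above.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical system that fails on average once every 1000 hours.
# Availability drops as the time to repair grows.
for mttr in (1, 10, 100):
    print(mttr, f"{steady_state_availability(1000, mttr):.4f}")
```

with the failure rate held fixed, the only thing that changes between the three rows is how long a repair takes.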

things to consider:

ease of diagnosing and understanding problems when they occur
ease of making updates or modifications
how simple the system is to operate

early detection of faults can decrease or avoid system downtime.