2018-02-10

Failure recovery

I've been categorizing distributed system designs into four groups, according to how they recover from the loss of a single critical element (e.g. a piece of server hardware). Recently I realized that there's a fifth category, perhaps more popular than the other four.

Fault Tolerance

Element deaths and slow responses are expected and tolerated 100% of the time, with no noticeable degradation of service when a single failure happens. Examples: ADIRS, RPC hedging.
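
To make the RPC hedging example concrete, here is a minimal sketch, not any particular RPC library's implementation. The function name hedged_call, the hedge_delay parameter, and the idea that each replica is a plain Python callable are all assumptions made for illustration. The client sends the request to one replica, and if no answer arrives within a short delay it sends the same request to the next one, taking whichever good answer comes back first.

```python
import concurrent.futures as cf
import time


def hedged_call(replicas, request, hedge_delay=0.05):
    """Return the first successful reply from any replica.

    `replicas` is a list of callables (hypothetical stand-ins for RPC stubs).
    A new replica is tried whenever the ones already asked have not produced
    a good answer within `hedge_delay` seconds, or have failed outright.
    """
    pool = cf.ThreadPoolExecutor(max_workers=len(replicas))
    pending = set()
    try:
        for replica in replicas:
            pending.add(pool.submit(replica, request))
            done, pending = cf.wait(
                pending, timeout=hedge_delay, return_when=cf.FIRST_COMPLETED
            )
            for fut in done:
                if fut.exception() is None:
                    return fut.result()
            # Nothing usable yet: fall through and hedge to the next replica.
        # Every replica has been asked; wait for whatever is still running.
        while pending:
            done, pending = cf.wait(pending, return_when=cf.FIRST_COMPLETED)
            for fut in done:
                if fut.exception() is None:
                    return fut.result()
        raise RuntimeError("all replicas failed")
    finally:
        # Do not block on slow replicas; their late answers are discarded.
        pool.shutdown(wait=False)


def slow_replica(request):
    time.sleep(0.5)
    return "slow:" + request


def fast_replica(request):
    return "fast:" + request


def dead_replica(request):
    raise ConnectionError("replica is down")


if __name__ == "__main__":
    print(hedged_call([slow_replica, fast_replica], "ping"))
    print(hedged_call([dead_replica, fast_replica], "ping"))
```

The caller never sees the slow or dead replica, which is the point of this category. The price is extra load, since some requests are executed more than once, so this only works as-is for requests that are safe to repeat.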

High Availability

Loss of an element provokes the automatic withdrawal of the dead element from service. Clients which were talking to the now-dead element automatically recover (either by fast keepalive timeout or asynchronous notification), and they replay the lost requests to other in-service elements. Non-idempotent requests are handled correctly, though it takes extra time to ensure that they are not committed twice. The loss of a single service element cannot, on its own, cause the loss of any request. However, significant performance degradation can attend the recovery. Examples: DNS, Alertmanager.
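
One common way to replay a non-idempotent request without committing it twice is to tag it with a client-chosen request ID and have the service deduplicate on that ID. The sketch below assumes that approach. The Element class, the fail_mode values, and the shared ledger dict are hypothetical; the dict stands in for whatever replicated commit state a real system would keep.

```python
import uuid


class Element:
    """Hypothetical service element; all elements share one commit ledger."""

    def __init__(self, name, ledger, fail_mode=None):
        self.name = name
        self.ledger = ledger        # request_id -> amount, shared state
        self.fail_mode = fail_mode  # None, "dead", or "dies_after_commit"

    def handle(self, request_id, amount):
        if self.fail_mode == "dead":
            raise ConnectionError(self.name + " is down")
        # Deduplicate: a replayed request ID is never committed a second time.
        if request_id not in self.ledger:
            self.ledger[request_id] = amount
        if self.fail_mode == "dies_after_commit":
            # The commit happened, but the reply is lost with the element.
            raise ConnectionError(self.name + " died before replying")
        return sum(self.ledger.values())


def send_with_replay(elements, amount):
    """Try each in-service element in turn, reusing the same request ID."""
    request_id = str(uuid.uuid4())
    for element in elements:
        try:
            return element.handle(request_id, amount)
        except ConnectionError:
            continue  # element lost: replay the request to the next one
    raise RuntimeError("no element in service")


if __name__ == "__main__":
    ledger = {}
    elements = [
        Element("a", ledger, fail_mode="dies_after_commit"),
        Element("b", ledger),
    ]
    print(send_with_replay(elements, 10))
    print(len(ledger))
```

The awkward case is the element that commits and then dies before replying. The client cannot tell that apart from a request that was never processed, so it has to replay, and the ID lookup is the extra work that keeps the replay from being applied twice.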

Failover

Loss of an element automatically triggers the withdrawal of the dead element from service, including the promotion of hot standby elements to serving where necessary to restore service. In-flight requests are lost, and some clients may experience full timeouts and errors. Examples: MariaDB/PostgreSQL "high availability", NGINX Plus "high availability".

Disaster Recovery

Loss of an element leads to an urgent automated alert, but no recovery of service happens until a human approves it. The service is partially unavailable until the recovery happens. Examples: NFS, DRBD.

Dunning-Kruger Mode

Loss of an element leads to an urgent automated alert, but no recovery of service happens until a human figures out how to rebuild the system from scratch. The service is partially unavailable for the next couple of weeks, as service users gradually ask what happened to functionality they had come to rely on. Examples: your email server, your source code repository, your SSO server...
