Failure is an Option
Submitted by t3rmin4t0r (@t3rmin4t0r) on Wednesday, 29 June 2011
This talk is about failure in the cloud: how likely it is, and how the remedial measures differ from those in a data-centre setup.
The fundamental assumption of the cloud is that someone else runs your machines, buys your disks and routes your network. Unfortunately, that means you have no real way to tell when a machine will fail, when a storage setup will error out or when your network connectivity will choke.
Even in such an error-prone and unreliable setup, it is still possible to build a reliable system with great uptime by making your applications more agile. Most importantly, recovery from failure takes a different approach from that of a fixed-node setup: a dynamic system always rolls forward, replacing what failed instead of repairing it in place.
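As a rough illustration of what "always roll forward" could mean in practice, here is a minimal sketch in Python. It is not taken from the talk: the `Pool` class and the `provision` and `is_healthy` callbacks are hypothetical names standing in for whatever your cloud provider's API and your own health checks would supply. The only idea it tries to show is that a failed node is discarded and replaced by a fresh one, never repaired.

```python
import itertools


class Pool:
    """Hypothetical pool of interchangeable nodes (illustration only)."""

    def __init__(self, size, provision, is_healthy):
        # provision() -> a new node; is_healthy(node) -> bool.
        # Both are assumptions standing in for real provider calls.
        self.provision = provision
        self.is_healthy = is_healthy
        self.nodes = [provision() for _ in range(size)]

    def reconcile(self):
        """Replace every unhealthy node with a fresh one.

        Returns the list of nodes that were discarded. Nothing is
        repaired in place: the pool only ever moves forward.
        """
        replaced = []
        for i, node in enumerate(self.nodes):
            if not self.is_healthy(node):
                replaced.append(node)
                self.nodes[i] = self.provision()
        return replaced


# Simulated provider: nodes are just names, failures are a set of names.
counter = itertools.count(1)
failed = set()


def provision():
    return f"node-{next(counter)}"


def is_healthy(node):
    return node not in failed


pool = Pool(3, provision, is_healthy)
failed.add("node-2")          # simulate a machine dying
discarded = pool.reconcile()  # roll forward instead of repairing
print(discarded, pool.nodes)
```

In a real deployment `reconcile` would run periodically, and the application would have to tolerate a node vanishing at any moment, which is the agility the abstract refers to.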
Targeted at Developer/Ops