With two recent outages in AWS's US East region, I believe it's worth writing a post about the oft-touted tenet of architecting for failure. If you're using AWS, your application should be built and tested* to survive the failure of an entire AWS availability zone, as happened in the two most recent failures in the US East region. Providers like Heroku should really be offering this sort of resilience. My site, which runs on Heroku, went down (shock horror, being such a precious web service and all!) in both outages, and I've been talking to the Heroku team about this shortcoming. I'll post back here when I have a better understanding of their fault-tolerance strategies.
While Amazon regularly espouses "architect for failure", few companies fully understand how to do it, and even fewer actually test for such an event. Certainly part of the blame has to lie with AWS for failing to test its redundant power supply setups, especially given the same problem has happened twice in as many weeks, but nonetheless, this is the reason they always build at least two AZs in any one region, and in the case of US-EAST-1, four.
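One cheap way to keep yourself honest about multi-AZ coverage is to check it programmatically. The sketch below is hypothetical (the fleet data and `check_az_spread` helper are my own illustration, not an AWS API); it assumes you can list your instances with their AZs, say via the EC2 API, and simply verifies the fleet actually spans at least two zones:

```python
from collections import Counter

def check_az_spread(instances, min_azs=2):
    """Given (instance_id, availability_zone) pairs, verify the
    fleet spans at least `min_azs` zones, so a single-AZ outage
    can't take everything down. Returns the per-AZ instance counts."""
    zones = Counter(az for _, az in instances)
    if len(zones) < min_azs:
        raise RuntimeError(
            f"Fleet spans only {len(zones)} AZ(s): {sorted(zones)} -- "
            "a single-AZ outage takes everything down")
    return dict(zones)

# Example: a small fleet spread across two US East zones
fleet = [("i-001", "us-east-1a"), ("i-002", "us-east-1b"),
         ("i-003", "us-east-1a")]
print(check_az_spread(fleet))  # {'us-east-1a': 2, 'us-east-1b': 1}
```

A check like this belongs in your deployment pipeline, not a wiki page: it fails loudly the day someone consolidates the fleet into one zone to save money.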
Cloud Advantage specializes in fault-tolerant cloud architectures, but one of the missing pieces is a clear test strategy for new AWS users. Simon Elisha, principal solution architect for AWS here in Australia, provides a great introductory presentation on this sort of thinking. While Netflix admits some disruption of service to customers in the latest AWS failure, its architecture is also renowned for being extremely resilient. The team was so focussed on designing for failure that they invented Chaos Monkey to simulate a variety of different failure events, and more specifically Chaos Gorilla to simulate an entire AZ failure.
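The idea behind these tools is simple enough to sketch in a few lines. To be clear, this is my own hypothetical illustration of the concept, not Netflix's code: randomly kill an instance (or, in the gorilla variant, a whole AZ's worth) so your failover paths get exercised for real rather than only on paper. The `terminate` callback stands in for whatever your platform uses to kill an instance, e.g. an EC2 TerminateInstances call:

```python
import random

def chaos_monkey(fleet, terminate, rng=random):
    """Pick one random instance from the fleet and terminate it.
    `terminate` is injected so the logic stays testable."""
    if not fleet:
        return None
    victim = rng.choice(fleet)
    terminate(victim)
    return victim

def chaos_gorilla(fleet_by_az, terminate, rng=random):
    """The AZ-failure variant: pick one availability zone and
    terminate every instance in it."""
    az = rng.choice(sorted(fleet_by_az))
    for instance in fleet_by_az[az]:
        terminate(instance)
    return az
```

The point isn't the twenty lines of code; it's the organizational commitment to run something like this against production and treat any resulting customer impact as an architecture bug.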
The truth is that the cloud is the future of tech infrastructure, but a paradigm (please excuse the use of that word) shift is required in application development. A good place to start is AWS's own whitepaper.
Despite what customers and others have said about this latest disruption, I'm very interested to hear what comes out of the AWS team's postmortem on root cause. When an event like this happens to a provider as big as AWS, I believe the actual downtime for the various affected sites is less than initially reported in the media. Secondly, it would be good to see a thorough postmortem from properties like Instagram and Netflix on why their services failed to fail over to other availability zones sooner.
If you're reading this blog, I'd love the chance to chat with your firm about building a robust cloud infrastructure; with the right design, I believe you can almost always achieve a better result than your current infrastructure delivers.
Finally, an AWS postmortem: http://aws.amazon.com/message/67457/