This summer’s storms and the earlier outages on Amazon Web Services (AWS) and Salesforce.com are a reminder that disaster recovery plans are an important part of your cloud services.
I have read many of the articles and blog posts about these outages, and I believe there are lessons here both for users of Amazon and other cloud services and for those of us who provide cloud services.
If you want to provide a high availability service…
1) Architect your service to assume that failures will happen.
2) Don’t provide your service out of a single data center (on AWS, a single availability zone).
3) Power outages continue to be one of the most likely causes of a data center outage, even with backup generators. Understand how your data center or data center provider handles power outages, including both the design and how it is tested.
4) Have a documented disaster recovery plan and test it regularly.
Even though the above is very straightforward, it is clear that many users of Amazon and other cloud services have not done this, and many SaaS providers could improve in this area.
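The case for running in more than one data center can be put in rough numbers. The sketch below is mine, not something from any provider's documentation: the function name and the 99% figure are illustrative, and it assumes that each data center fails independently of the others. That assumption is optimistic, since a regional storm or a shared power feed can take out several sites at once, so treat the result as a best case.

```python
def combined_availability(single: float, copies: int) -> float:
    """Availability of a service that stays up while at least one copy is up.

    Assumes every copy has the same availability and that failures are
    independent (an assumption made for illustration only).
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The service is down only when all copies are down at the same time.
    return 1.0 - (1.0 - single) ** copies


if __name__ == "__main__":
    # Hypothetical figure: each data center is available 99% of the time.
    for copies in (1, 2, 3):
        print(f"{copies} data center(s): {combined_availability(0.99, copies):.6f}")
```

Under these assumptions a second data center takes a 99% service to roughly 99.99%, which is why the single data center is usually the first thing to fix.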
If you have a major outage…
1) Communicate quickly, accurately, and transparently with your customers about the problem and the estimated time of resolution.
2) As soon as you know the root cause, tell your customers what it was and what the long-term resolution will be.
Amazon got high marks for communicating clearly what their problems were. Customers will always resent having to pry information out of you, or feeling that you aren’t being totally transparent about the cause of the problem. Unfortunately, transparency does not let you off the hook for missing important things that you should have done when designing your system. Customers will make a judgement about whether the outage “should have” occurred or whether it was unavoidable. If you have several outages that customers feel you should have avoided, of course they are going to be unhappy.
The availability of a SaaS or cloud service is critical, and customers expect you to design your solution with the appropriate level of availability. Not every application needs five 9’s of availability; you need to pick the level that is appropriate for your solution.
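To make the choice of level concrete, it helps to translate each number of 9’s into the downtime it allows per year. This is a small sketch of that arithmetic; the function name is my own, and it uses a 365-day year and counts all downtime, planned or not.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability with `nines` 9's.

    For example, nines=3 means 99.9% availability.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10 ** -nines  # e.g. 0.001 for three 9's
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in range(2, 6):
        print(f"{n} nines: {allowed_downtime_minutes(n):.1f} minutes per year")
```

Five 9’s leaves a little over five minutes of downtime a year, while three 9’s leaves close to nine hours. That gap is the difference between an architecture that must fail over automatically and one that can tolerate a manual recovery.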
These outages are not an indictment of any one provider or of cloud computing in general, but they are a reminder that disaster recovery needs to be in the architecture, that written plans need to exist, and that those plans need to be tested. Focus on the disasters which are most likely. The spring and summer outages this year are a reminder that loss of power continues to be the most likely problem.