On Feb 28th, Amazon Web Services (AWS) went down for approximately 4 hours. The outage occurred in Amazon’s US-East-1 facility in Virginia, their biggest region, and disrupted business continuity for some of the world’s most popular web services, including Docker, Airbnb, Slack, GitHub, MailChimp, Wix, Trello, Canva, Salesforce, and many more.
This outage had huge implications for business continuity because of the sheer number of services that rely on S3, the AWS storage service at the center of the incident. By the time the entire AWS system was running properly again, the 4-hour downtime had caused over $310 million in losses, according to estimates by analytics firm Cyence.
Let’s take a look at some of the biggest lessons we have learned, including what steps businesses can take to mitigate the risk of a cloud outage.
When dealing with analytics infrastructure, you must expect the unexpected. If this outage has taught us anything, it is that such occurrences, although extremely rare, are inevitable even for the most well-established cloud service providers. Organizations must think ahead and devise a robust mitigation strategy before they are exposed to any downtime.
When it comes to mitigation strategies, there are a few possible paths: opt for a purely on-premise solution, build a hybrid on-premise/cloud solution, go with multiple cloud vendors, or adopt a Multi-Region cloud redundancy strategy within a single provider.
In our opinion, a Multi-Region strategy offers the best balance of cost and benefit. Still, with several mitigation strategies to choose from, organizations should conduct a careful cost-benefit analysis to determine which strategy will best serve their specific needs.
If the recent AWS outage has led you to believe that you should keep all of your organization’s data on-premise, think again. While this may seem like the safe choice, it only increases your organization’s business continuity risk in case of an outage.
Think about it: if this can happen to Microsoft, Google, or Amazon, then it can certainly happen to your on-premise servers, and in that case you must rely on your own IT team to have the knowledge, expertise and resources to resolve the issue quickly.
Not only is an outage much less likely in the cloud in the first place, but top cloud vendors will also do everything they can to fix the issue as quickly and as painlessly as possible for their clients. In the end, their business depends on it.
A hybrid approach combines the use of the public cloud with an on-prem solution. Instead of having to build all the necessary infrastructure on-premises to withstand the occasional peaks in system usage, companies can utilize the public cloud to offload system resources as needed, and only pay for it during times of heavy usage.
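The bursting idea described above can be sketched in a few lines. This is only an illustration under simple assumptions: the function name `split_load` and the capacity figures are hypothetical, and a real hybrid setup would make this decision in a load balancer or scheduler rather than in application code.

```python
from typing import Tuple


def split_load(requests_per_sec: int, on_prem_capacity: int) -> Tuple[int, int]:
    """Split incoming load between on-prem servers and the public cloud.

    On-prem hardware is filled first; only the overflow is sent to the
    cloud, so cloud resources are paid for only during peaks.
    Both arguments are assumed to be non-negative.
    """
    on_prem = min(requests_per_sec, on_prem_capacity)
    cloud = requests_per_sec - on_prem
    return on_prem, cloud


# Hypothetical numbers: on-prem hardware sized for 1,000 requests per second.
normal_day = split_load(400, 1000)   # everything stays on-prem
peak_day = split_load(1500, 1000)    # the overflow goes to the cloud
print(normal_day, peak_day)
```

Note that the on-prem servers still carry the baseline load in this sketch, so an on-prem failure affects the system whether or not the cloud is available.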
While a hybrid approach can potentially save operational costs in the long term, it takes the right mix of tools and skilled workers to develop and execute an effective hybrid strategy. The immediate costs and headaches of further investment in skilled human capital might outweigh the long-term benefits of this solution. Furthermore, it does not necessarily reduce the risk of an outage. In fact, as previously noted, having any on-premise servers might actually increase the risks for business continuity in the event of an outage.
Spreading workloads across multiple cloud vendors puts more redundancy safeguards in place, which dramatically reduces the likelihood of a system suffering any downtime. However, it is a prohibitively expensive option for most organizations.
With that said, the most reliable, secure and cost-effective strategy for mitigating the risk of a cloud outage is replication across multiple regions within a single cloud provider. Amazon Web Services, for example, operates within 16 different regions in the world, with 6 of these regions in North America alone.
AWS offers a Cross-Region Replication (CRR) feature for S3, which lets users replicate data across multiple regions within the AWS ecosystem, greatly increasing redundancy and reducing the chance of complete system downtime to nearly zero.
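As a rough sketch of what enabling CRR on an S3 bucket can look like, the helper below builds the replication configuration that boto3's `put_bucket_replication` call accepts. The helper name, bucket names, and role ARN are all placeholders, not values from any real account. We also assume that versioning is already enabled on both buckets and that the IAM role allows S3 to replicate objects, since replication needs both.

```python
def build_replication_config(role_arn: str, destination_bucket: str,
                             prefix: str = "") -> dict:
    """Build a ReplicationConfiguration dict for S3 Cross-Region Replication.

    role_arn: IAM role that S3 assumes to copy objects (placeholder value).
    destination_bucket: name of a bucket located in a different region.
    prefix: replicate only keys starting with this prefix ("" means all keys).
    """
    return {
        "Role": role_arn,
        "Rules": [
            {
                "ID": "replicate-to-second-region",  # hypothetical rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
            }
        ],
    }


config = build_replication_config(
    role_arn="arn:aws:iam::123456789012:role/example-replication-role",
    destination_bucket="example-backup-bucket-us-west-2",
)

# Applying the configuration requires boto3 and AWS credentials, so it is
# left commented out here:
# import boto3
# s3 = boto3.client("s3", region_name="us-east-1")
# s3.put_bucket_replication(Bucket="example-source-bucket",
#                           ReplicationConfiguration=config)
```

Once a rule like this is active, new objects written to the source bucket are copied asynchronously to the destination region, so a copy of the data survives a single-region failure.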
Without such redundancy, a massive outage leaves your data completely stuck: you cannot query it, and you cannot write any new data. With a multi-zone service, you can at least save new data in a different region while working on getting the affected servers back up.
However, the CRR feature is not available for Amazon Redshift. Fortunately, Panoply’s multi-zone Redshift service provides an excellent option for any organization whose data warehouse is hosted on Redshift.
Although an organization must take an inherent leap of faith when passing uptime responsibility to a cloud vendor, the costs and complexity of establishing redundancy, either on-premises or with a hybrid cloud, are typically too daunting for most organizations to consider.
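The idea of continuing to accept writes in a second region can be illustrated with a small failover routine. This is a generic sketch, not a description of how Panoply or AWS implements it: the `write_with_failover` function and the in-memory writers are hypothetical stand-ins for real regional endpoints.

```python
from typing import Callable, List, Sequence, Tuple

Writer = Callable[[dict], None]


def write_with_failover(record: dict,
                        writers: Sequence[Tuple[str, Writer]]) -> str:
    """Try each (region, writer) pair in order and return the region that
    accepted the record. Raise RuntimeError if every region fails."""
    errors: List[str] = []
    for region, write in writers:
        try:
            write(record)
            return region
        except ConnectionError as exc:
            # This region is unreachable; remember why and try the next one.
            errors.append(f"{region}: {exc}")
    raise RuntimeError("all regions failed: " + "; ".join(errors))


# Simulated regions: the primary is down, the secondary is an in-memory list.
def failing_writer(record: dict) -> None:
    raise ConnectionError("region unavailable")


secondary_store: List[dict] = []
accepted_by = write_with_failover(
    {"event": "signup"},
    [("us-east-1", failing_writer), ("us-west-2", secondary_store.append)],
)
print(accepted_by)
```

In a real deployment, the records accepted by the secondary region would still need to be reconciled with the primary once it recovers, which this sketch does not attempt.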
With the dust still settling from the recent AWS outage, the elite cloud providers, including AWS, still offer the best value, security, and reliability to their customers. Moreover, the option to greatly increase reliability via Multi-Region Cloud Redundancy provides an attractive and practical solution for any organization looking to keep their workloads running seamlessly.
Panoply’s data warehouses survive data center outages. We constantly maintain uptime for all of our internal systems, and for those of our customers. We’ve achieved this breakthrough in reliability with our pgproxy utility. For more details, read our blog on the topic.