Today’s data management best practices advocate building resilient architectures that span multiple data centers, regions or even continents.
This week’s massive internet earthquake, caused by the Amazon S3 outage in the us-east region, was a stark reminder of why. When a technology as core as S3 goes down, the outage is rarely isolated to that technology; it affects the wider ecosystem. This week’s episode was no exception: many of Amazon’s services, from EC2 to Lambda, use S3 internally, so most infrastructure deployed in Amazon’s us-east region became unstable.
The idea behind this best practice is that when one data center goes down for whatever reason, be it a software bug or a physical disaster, another one can pick up the load, providing users with a seamless and error-free experience. While there’s nothing new about this, there is one problem: some web services do not yet support wide-area-network (WAN) replication, which makes it very difficult and error-prone to build such fully robust systems.
One such service is Amazon’s powerful data warehouse, Redshift. At first glance, 100% uptime might seem less critical for an analytical database like Redshift. But sometimes analytics are critical and urgent, especially during a crisis in which you lose half of your infrastructure. In big or complex teams with a strong data-driven culture, you might have critical data, from revenue to costs, that you must have constant access to.
In times of turmoil, as you transition all of your traffic to a new region, it becomes even more critical to have access to your analytics so you can make sure that everything runs smoothly for your users.
One popular solution for load-balancing Postgres connections is PgBouncer. PgBouncer is generally a great tool when you already have several Redshift clusters deployed in several regions, yet using it properly requires you to keep them in sync. Keeping them in sync means that every write operation, from inserting new data to schema changes, has to be performed in parallel on all of your clusters. Duplicating your database administration tasks while handling errors and full data synchronization very quickly becomes tedious, error-prone and a massive time sink.
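To make the bookkeeping concrete, here is a minimal Python sketch of what "performing every write on all clusters" involves when you do it yourself. It is an illustration under stated assumptions, not a recommended tool: `fan_out_write` and `out_of_sync` are hypothetical helper names, the region names are placeholders, and each cluster is represented by a plain callable that executes SQL, so no real database driver is involved.

```python
from concurrent.futures import ThreadPoolExecutor


def fan_out_write(statement, clusters):
    """Apply one write statement to every cluster in parallel.

    `clusters` maps a cluster name to a callable that executes SQL on it
    (an assumption of this sketch; a real setup would wrap driver connections).
    Returns {cluster_name: None on success, or the exception that was raised}.
    """
    def run(item):
        name, execute = item
        try:
            execute(statement)
            return name, None
        except Exception as exc:  # record the failure instead of aborting the others
            return name, exc

    with ThreadPoolExecutor(max_workers=max(1, len(clusters))) as pool:
        return dict(pool.map(run, clusters.items()))


def out_of_sync(results):
    """Names of clusters that missed the write and now need a repair step."""
    return sorted(name for name, error in results.items() if error is not None)


if __name__ == "__main__":
    applied = []  # in-memory stand-in for a healthy cluster

    def unavailable(statement):
        raise ConnectionError("region unavailable")

    clusters = {"us-east-1": unavailable, "us-west-2": applied.append}
    results = fan_out_write("ALTER TABLE events ADD COLUMN source VARCHAR(32)", clusters)
    print(out_of_sync(results))
```

Even in this toy form, the hard part is visible: once one cluster rejects a write, the clusters have diverged, and something has to replay or reconcile that statement before reads can safely be balanced across them again.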
At Panoply we’ve been hard at work finding a solution to this problem - we needed our data warehouses to survive data center outages, both for internal use and for our customers.
For this reason, we are proud to announce our latest achievement in this space: support for a Multi-Zone Redshift cluster inside your Panoply architecture.
We’ve achieved this capability by developing our own pgproxy utility, which both load balances connections across multiple Redshift clusters for increased concurrency and duplicates all write operations, including schema changes. This guarantees that each change only has to be issued once, yet is applied to all clusters across all regions.
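The routing idea described here, spreading reads across clusters while sending every write to all of them, can be sketched in a few lines of Python. This is not pgproxy’s implementation, which isn’t shown in this post; `ReplicatingRouter`, `is_write` and the cluster names are hypothetical, and classifying a statement by its first keyword is a deliberate simplification that would misjudge statements such as `WITH ... INSERT` or `SELECT ... INTO`.

```python
import itertools

# Statements starting with these keywords are treated as writes (a simplification).
WRITE_KEYWORDS = {
    "INSERT", "UPDATE", "DELETE", "CREATE", "ALTER",
    "DROP", "TRUNCATE", "COPY", "GRANT", "REVOKE",
}


def is_write(statement):
    """Guess whether a statement modifies data or schema from its first keyword."""
    words = statement.strip().split(None, 1)
    return bool(words) and words[0].upper() in WRITE_KEYWORDS


class ReplicatingRouter:
    """Decide which clusters should receive a statement."""

    def __init__(self, cluster_names):
        if not cluster_names:
            raise ValueError("at least one cluster is required")
        self.cluster_names = list(cluster_names)
        self._next_reader = itertools.cycle(self.cluster_names)

    def route(self, statement):
        if is_write(statement):
            # Writes and schema changes go to every cluster to keep them identical.
            return list(self.cluster_names)
        # Reads go to a single cluster, chosen round-robin.
        return [next(self._next_reader)]


if __name__ == "__main__":
    router = ReplicatingRouter(["us-east-1", "us-west-2"])
    for sql in ("SELECT count(*) FROM events",
                "SELECT count(*) FROM users",
                "INSERT INTO events VALUES (1)"):
        print(sql, "->", router.route(sql))
```

Round-robin is used here only because it is the simplest balancing policy to show; a production proxy would also need health checks, transaction awareness and a way to recover a cluster that missed a write.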
The result is an array of clusters with identical schema and data that behaves like WAN replication, with a layer of load balancing on top. Obviously, such replication involves duplicating the data, and therefore the cost, multiple times: once for every region. For this reason, this feature isn’t automatically applied to all of our customers and datasets, but is enabled specifically for critical data on an as-needed basis.
As a nice side-effect, this feature also duplicates the CPU, memory and concurrency resources, which means that these replicated clusters deliver superior performance. If you are interested in applying this feature to your cluster, ask your rep or a data architect for more information.