Although there are several data warehousing systems on the market, Amazon Redshift and Google BigQuery are the industry behemoths. There is a “prevailing consensus that Amazon is leading the pack.” However, Google is swiftly working to take the lead in terms of cost and functionality.
Though we’ve written extensive comparisons on the technical similarities and differences between the two in our Redshift vs. BigQuery white paper, we recognize that cost is a driving factor in choosing the best data management system.
While we won’t be diving deep into the technical configurations of Amazon Redshift architecture, there are technical considerations for its pricing model. An understanding of nodes versus clusters, the differences between data warehousing on solid state disks versus hard disk drives, and the part virtual cores play in data processing is helpful for examining Redshift’s cost effectiveness.
Essentially, Amazon Redshift is priced by the amount of data you store and by the number of nodes. The number of nodes is expandable. Depending upon the current volume of data you need to manage, your data team can set up anything from a single node, which has 160 GB (0.16 TB) of solid state disk space, to a 128-node cluster with 16 TB of hard disk drive space per node.
There are a couple of other caveats to keep in mind. But, before we present those, it’s important for you to understand that Amazon separates its nodal definitions into two meta-types: Dense Compute and Dense Storage. Each of the sub-types within the classifications stores a maximum amount of compressed data.
Dense Compute: Recommended for less than 500GB of data
The smallest in the Dense Compute class is the dc1.large, with 2 virtual cores and 0.16 TB of SSD storage capacity. You can increase dc1.large from one node to a cluster of 32 nodes, which expands the SSD capacity to 5.12 TB.
Meanwhile, the dc1.8xlarge runs 32 virtual cores and is scalable from a cluster of 2 to 128 nodes, which allows a maximum of 326 TB of SSD storage space.
Dense Storage: Recommended for cost effective scalability for over 500GB of data
For data management using hard disk drive space and a larger number of virtual cores, Redshift has two options. The ds2.xlarge can be initiated with only a single node that has up to a 2TB capacity. However, the single node can be increased to a maximum cluster of 32 nodes.
Need a larger cluster instead? You can switch to the ds2.8xlarge option with 36 virtual cores, starting with 2 nodes and expandable to a 128-node cluster, with a maximum of 16 TB of magnetic HDD space per node.
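To make the node arithmetic concrete, here is a minimal sketch that multiplies per-node storage by node count for two of the node types described above. The per-node sizes and node limits come from the text; the `NodeType` class and its method name are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeType:
    """Hypothetical helper: one Redshift node type and its scaling limits."""
    name: str
    tb_per_node: float
    min_nodes: int
    max_nodes: int

    def capacity_tb(self, nodes: int) -> float:
        """Total cluster storage in TB for the given number of nodes."""
        if not self.min_nodes <= nodes <= self.max_nodes:
            raise ValueError(
                f"{self.name} supports {self.min_nodes} to {self.max_nodes} nodes"
            )
        return round(self.tb_per_node * nodes, 2)


# Figures taken from the article text above.
DC1_LARGE = NodeType("dc1.large", tb_per_node=0.16, min_nodes=1, max_nodes=32)
DS2_XLARGE = NodeType("ds2.xlarge", tb_per_node=2.0, min_nodes=1, max_nodes=32)

if __name__ == "__main__":
    print(DC1_LARGE.capacity_tb(32))   # full dc1.large cluster
    print(DS2_XLARGE.capacity_tb(32))  # full ds2.xlarge cluster
```

A full 32-node dc1.large cluster works out to the 5.12 TB quoted above, and asking for a node count outside the supported range raises an error rather than returning a misleading figure.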
Now here is where pricing gets tricky. Redshift has no upfront costs (there is an exception which is described below). However, the price per hour is largely dependent on your region. For example, let’s say your region option is US West (Northern California), and you're running a small startup with a single node. Redshift’s current pricing structure states that you’ll pay $0.33 per hour for that 0.16 TB SSD. On the other hand, running a single node in the US East (Northern Virginia) region costs you $0.25 per hour.
These are all categorized as Redshift’s “On Demand” pricing. If you want “Reserved Instance” pricing (for further information on this, Google has a blog available), then a longer-term commitment of either one or three years is available.
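As a rough illustration of how the regional difference adds up, the sketch below turns the two hourly rates quoted above into a monthly figure. The 730-hour month (365 × 24 / 12) and the function name are assumptions made for this example, and the rates are simply the ones stated in this article, not a current price list.

```python
# Hourly On Demand rates for a single dc1.large node, as quoted in the text.
HOURLY_RATE_USD = {
    "US West (Northern California)": 0.33,
    "US East (Northern Virginia)": 0.25,
}

# Assumption: an average month of 365 * 24 / 12 = 730 hours, always on.
HOURS_PER_MONTH = 730


def monthly_cost_usd(region: str, nodes: int = 1, hours: int = HOURS_PER_MONTH) -> float:
    """Estimate the On Demand cost of running `nodes` nodes for `hours` hours."""
    return round(HOURLY_RATE_USD[region] * nodes * hours, 2)


if __name__ == "__main__":
    for region in HOURLY_RATE_USD:
        print(f"{region}: ${monthly_cost_usd(region):.2f} per month")
```

Under these assumptions, the single node costs about $240.90 per month in US West (Northern California) and about $182.50 in US East (Northern Virginia), a difference of roughly $58 per node per month.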
A few things to keep in mind regarding infrastructure and pricing for Redshift:
BigQuery presents a challenge when attempting to compare its pricing to Redshift. While Amazon has set parameters for its On Demand, per hour rates, BigQuery offers a calculator where you select a Table Name and then estimate how much storage you need, how many streaming inserts will be required, and how much data you expect to query beyond the first TB per month. (Their website states that you’re only charged for queries past the 1 TB mark, and that queries that return an error do not cost you anything.)
On the upside, BigQuery does provide flat rate pricing, which ranges from $40,000 to $100,000 per month and includes between 2,000 and 5,000 slots. According to the Google pricing website, a slot is a proprietary measure that combines CPU, memory, and networking resources. So, to put it bluntly, they aren’t going to reveal the specifics of their slot infrastructure.
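The query-billing rule described above can be sketched in a few lines. This is only an illustration of the two conditions mentioned in the text (the first 1 TB per month is free, and failed queries are not charged); the function name and the input shape are hypothetical, and no per-TB price is assumed.

```python
from typing import Iterable, Tuple


def billable_query_tb(queries: Iterable[Tuple[float, bool]], free_tb: float = 1.0) -> float:
    """Return the TB of query data that would be billed in a month.

    `queries` is a sequence of (tb_scanned, succeeded) pairs.
    Queries that returned an error are ignored, and the first
    `free_tb` of successful query volume is not billed.
    """
    successful_tb = sum(tb for tb, succeeded in queries if succeeded)
    return max(0.0, successful_tb - free_tb)


if __name__ == "__main__":
    month = [(0.5, True), (1.0, True), (2.0, False)]  # last query errored
    print(billable_query_tb(month))
```

In this example only 1.5 TB counts toward billing, so 0.5 TB is billable once the free first TB is subtracted.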
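Since a slot is an opaque unit, one simple way to read the flat rate figures is as a price per slot. The sketch below divides the monthly prices quoted above by the slot counts, assuming the low price pairs with the low slot count and the high price with the high slot count; the function name is hypothetical.

```python
def usd_per_slot(monthly_usd: float, slots: int) -> float:
    """Monthly flat rate price divided by the number of slots included."""
    return monthly_usd / slots


# (slots, monthly price in USD) pairs, assuming the quoted ranges line up.
FLAT_RATE_TIERS = [(2_000, 40_000), (5_000, 100_000)]

if __name__ == "__main__":
    for slots, monthly_usd in FLAT_RATE_TIERS:
        print(f"{slots} slots: ${usd_per_slot(monthly_usd, slots):.2f} per slot per month")
```

Under that assumption, both ends of the range come to $20 per slot per month.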
For an in depth study of the Redshift versus BigQuery pricing, our team produced a detailed analysis of the two data management platforms which can be found here. Ultimately, our conclusion was that if you want to use the data you’re storing, rather than it just sitting inert in a bucket, Redshift came out on top.
Amazon Redshift supports automated and manual snapshots of the data warehouse. These backups are stored in Amazon S3. Redshift doesn't charge for backup storage up to 100% of the provisioned storage of an active data warehouse cluster. Beyond this limit or if the Redshift cluster is terminated, backups will be billed at the standard Amazon S3 rate.
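The backup rule above can be expressed as a small calculation. This is a minimal sketch of the rule as stated in the text, with a hypothetical function name; it returns only the billable volume and deliberately leaves out any S3 price, since the article does not quote one.

```python
def billable_backup_gb(provisioned_gb: float, backup_gb: float, cluster_active: bool = True) -> float:
    """Backup storage (in GB) that would be billed at the standard S3 rate.

    While the cluster is active, backups up to 100% of provisioned
    storage are free. Once the cluster is terminated, all retained
    backup storage is billable.
    """
    if not cluster_active:
        return backup_gb
    return max(0.0, backup_gb - provisioned_gb)


if __name__ == "__main__":
    # A single 160 GB node holding 200 GB of snapshots.
    print(billable_backup_gb(160, 200))                        # active cluster
    print(billable_backup_gb(160, 200, cluster_active=False))  # terminated cluster
```

For a single 160 GB node with 200 GB of snapshots, 40 GB would be billable while the cluster is running, and the full 200 GB once it is terminated.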
In addition to the backups, Amazon Redshift provides automatic recovery support for disk and node failures. To implement this, Redshift mirrors each drive's data to other nodes within a cluster. While provisioning compute nodes, Redshift internally allocates additional storage for each of the nodes.
Although Redshift allocates 2.5 to 3 times the advertised storage for recovery, both the additional storage used for mirroring and the recovery automation are provided free of charge to users.
Machine learning and artificial intelligence are gaining traction for interfacing, analyzing, and automating data management processes. There are patterns in data. That’s what all data personnel are working to find. But separating the signal from the noise (as well as cleaning and organizing data) takes significant resources, and often the data lake is really a data swamp.
Rather than having your data team spend an inordinate amount of time dealing with lower level data management processes, use Panoply’s self-optimizing data warehouse architecture, which applies machine learning and natural language processing (NLP) to streamline the data journey from source to analysis. In the long run, this will speed up your data processing and free up your data engineers, data scientists, and data analysts to do what they do best (and lower your costs as well).