As we speak the future of cloud computing is being decided. Amazon and Google, as well as Microsoft and a few others, offer multiple cloud solutions for practically everything. Despite the prevailing consensus that Amazon is leading the pack Google is investing heavily in their cloud platform offering to gain ground. Both are releasing colossal features and services on almost a weekly basis, and we, the developers, are richer for it. In our opinion, some of the most interesting innovations are taking place in the space of analytical data warehousing, where on one hand we have Amazon’s Redshift and on the other Google’s BigQuery.
With Panoply.io’ s inception, we had to make a choice: Redshift or BigQuery. As our platform delivers full-stack data automation, a critical chunk of the stack hinges not only on the massively parallel data warehouse used internally to store hundreds of terabytes of data, but the capability to analyze it in minutes. This choice would define us so we were determined to do a thorough comparison of the two and pick the one best for us.
Spoiler Alert! On almost all fronts we found Amazon Redshift to deliver superior results. Significantly so for usability, performance, and cost for almost all analytical use-cases, especially at scale. And yes, at a glance there are apparent complexities to Redshift, but what it surrenders in terms of simplicity it gains in terms of functionality.
This article has been written in retrospect, so throughout the blog we’ll walk you through our logic and argue our case.
Recap: Redshift and BigQuery
This article assumes some familiarity with Redshift and BigQuery, as well as basic knowledge in columnar MPP data warehouses. In this section we’ll cover the basics before drilling down into our comparison. If you already got this covered feel free to skip ahead.
It all comes down to how transactional operations and analytical queries differ from each other. Normal transactional operations are usually indexed and very fast. This is because they must support a huge number of concurrent queries, each targeted to a specific row, which are resolved in the sub-second range. Analytical queries on the other hand are normally performed by just a few analysts where each query is a batch process over a huge dataset that can take minutes or even hours to compute.
Normal relational database system, like Postgres and MySQL, store data internally in row form: all data rows are stored together and are usually indexed by a primary key to facilitate efficient accessibility. This setup is perfect for most transactional operations, like reading rows by id to display your user profile, for example, however it’s ineffective for many analytical use-cases. In analytics, we’re rarely interested in specific rows but in the aggregates of all (or most rows). For example, computing the average age of all of our users, grouped by location. Performing these queries on classic row-oriented databases requires them to read through the entire database, along with all unused columns, to produce the results.
This massive inefficiency is addressed with the advent of columnar databases. Columnar databases store data by column instead of by row. As a result, all values of the same column can be stored sequentially on disk. This structure significantly speeds up analytical queries by only reading the columns involved in the query, resulting in accelerated disk access and CPU cache.
Both Amazon Redshift and Google BigQuery (along with a few others) benefit from a columnar storage structure, making them ideal for analytical use-cases. In addition, they’re designed to be massively parallel where every query is distributed across multiple servers to accelerate query execution.
In analytics, where queries over hundreds of gigabytes are the norm, performance is paramount. Normal transactional queries are performed by your application’s code and therefore affect performance for your users. In an analytical context performance has a direct effect on the productivity of your team: running a query for hours means days of iterations between business question (think compilation time of the past). We needed to select a technology that would scale well not only in terms of data volume but just as importantly, in quantity and complexity of the queries and analysts on our team.
There is a pittance of published benchmarks comparing Redshift and BigQuery performance, and there are none that we could find that ran a real world comparison with Redshift configured with correct sortkeys and distkeys. This was the motivation behind running our own test which is based on “A Comparison of Approaches to Large-Scale Data Analysis” by Pavlo et al. (SIGMOD 2009).
|Load from AWS||460GB||484||438 + 7820|
We used the same test data set from the aforementioned benchmark, but only employed the largest data size available (labeled “5nodes”). That narrowed it down to 2 tables: one (ranking) with 90 million rows at 5.2GB, and the other (uservisits) with 750 million rows at 455GB. The first step was to load the data to both Redshift and BigQuery, and as you can see in the table above, BigQuery’s load operation performs slightly better. The only caveat being that if the data is generated in AWS (that’s where we run our production infrastructure), the time required to transfer the data to Google Cloud Storage came out to an additional 7820 seconds (roughly 2 hours) using Google’s transfer service.
We used 2x DS2 HDD Redshift nodes, in order to maintain consistent with the instance types considered in the cost section below. These are Redshift’s most cost-effective node types but one of the least-performing ones. Meaning that all of the Redshift results shown here can be significantly improved on by trading cost for performance. This approach affords considerable flexibility in fine-tune performance to business needs, especially when we consider distributing data across different clusters/types. For example: we store frequently-accessed critical data on DC1 SSD which gives us just over a 10x performance boost, while everything else is kept on slower/cheaper instance types. BigQuery has no hardware configuration to describe.
For queries, we took a different approach from the mentioned benchmark; instead of using vanilla Redshift, un-optimized or configured in any way, we applied some best-practice optimizations, like compressing tables and setting sort and distribution keys. Previous comparisons measured Redshift’s worst case performance while we needed to evaluate a more realistic use-case, one where engineers fine-tune their databases. That said, we didn’t go to the extreme. This setup is far from perfectly optimized, but it presents a more balanced and realistic comparison between the two databases.
We evaluated 3 queries, each with three variants for measuring different scales of processed data: small, medium and large. The definitions of these variables are borrowed from the Big Data Benchmark from U.C. Berkeley AMPLab which includes code and test data for the comparison. Each query was executed 10 times with an hour interval and the median runtime in seconds was logged. BigQuery has excellent query caching, meaning that all repeated queries complete within a few seconds regardless of the processed data size. Unfortunately, that also made query repetition an ineffective means of producing a statistical measure of BigQuery’s performance. To work around that, with each execution we made a tiny parameterized change to the query (like changing the variable X from 1000 to 1000.01, see below) in order to invalidate the cache. The same approach was used in Redshift, however Redshift caches data, not queries, which gives it an edge at handling similar queries with modified parameters, while under performing at repetitive identical queries.
The first query measures normal scan speed. It has three variations where X = 1000 (small), 100 (medium) and 10 (large):
Note that BigQuery doesn’t allow output results to exceed 128MB. While that’s definitely a lot, we found it to be insufficient for use cases such as utilizing Spark SQL for deep data mining. Instead, we had to save the results of the medium and large variants to a destination table. Obviously, in order to compare apples-to-apples, we applied the same approach in Redshift although it has no such limitation.
The second query measures aggregation speed. It has three variations where X = 8 (small), 10 (medium) and 12 (large):
The third and final query measures join speed. It has three variations where X = 1980-04-01 (small), 1983-01-01 (medium) and 2010-01-01 (large):
Contrary to previous findings that didn’t consider optimization, when reasonably optimized, Redshift outperforms BigQuery in 9 out of 11 use-cases hands down. This is doubly true when the rest of your infrastructure is already on AWS. The only instance where BigQuery has superior performance is in big join operations. In a future blog post we’ll discuss how Redshift joins can be further improved to eliminate the gap.
On the surface, it might seem that Redshift is more expensive. Per GB, Redshift costs $0.08, per month ($1000/TB/Year), compared to BigQuery’s $0.02. However, the devil is in the details.
BigQuery’s cost of $0.02/GB only covers storage, not queries. You pay separately per query based on the amount of data processed at a $5/TB rate. Because BigQuery doesn’t provide any indexes, and many analytical queries cover entire database, we can assume that each query will need to scan a big chunk of the data. Say you have 1TB spread evenly across 50 columns (in several tables). A query that scans through 5 of these columns will process 100GB at a cost of $0.5. This means, that per GB you’ll pay an additional $0.005 per query. If you have 12 such queries per month it will actually cost you $0.08 (0.02 + 0.005 * 12). Which is the same as Redshift. Beyond that, BigQuery costs more.
The realization we came to is that cheap data storage is worthless disjointed from utilization. It’s arguably comparable to storing data on just Amazon S3. When you actually use the data, you’ll start paying big.
In addition to the flat cost, there are three additional drawbacks to BigQuery’s pricing model:
- It’s unpredictable. There is no way to estimate in advance how much you’ll end up paying at the end of the month. Not to mention that costs can fluctuate from month to month based on query patterns. Luckily, this is mitigated with BigQuery’s Cost Control feature at the cost of worsening the following items.
- It’s complex. Just trying to figure out how much each query will cost, and how much you’ll pay per GB is extremely difficult as it requires a thorough examination of each query and the underlying data it’s going to access.
- It discourages data usage. Instead of encouraging analysts to query the database in any and all ways they can imagine you’ll end up worrying about needing to limit them and come up with processes for controlling the volume data being used. As a data-driven company, this notion contradicted our company values.
Usability is probably the most debated, intangible and subjective aspect of all database comparisons. Both Redshift and Bigquery list tens to hundreds of capabilities and features, each with its own configurations, use-cases, edge-cases and bugs. This vast soup of intertwined capabilities is so complex to reason through that it often comes down to personal preferences. That said it cannot be skipped so we mapped out all the key subjects we cared about in terms of usability for the comparison of the two databases:
Simplicity is by far the biggest advantage BigQuery holds over Redshift throughout this entire comparison. BigQuery abstracts away the details of the underlying hardware, database and all configurations. It mostly works out of the box, where in Redshift you need a deep knowledge and particular skillset in order to use and optimize it effectively. This often translates into dedicated engineers battling to keep your company’s infrastructure up to par with your company’s data scale and query patterns as well as industry’s best practices. That’s where Panoply.io comes in; bridging this gap by making Redshift even simpler to use than BigQuery.
Integrations with BI tools are pretty easy and widely supported for both of these databases. Redshift supports standard JDBC/ODBC compatible SQL syntax, as it’s built on top of Postgres and supports many of its features. This means that all of the tools that are compatible with Postgres are also compatible with Redshift. BigQuery on the other hand uses an SQL-like grammar that’s well documented on its website. While it isn’t standard, Google made sure that it’s supported by many of the popular tools out there.
Supported Data Types for both databases include a wide range of classic data types, like text, integers, floats, booleans, etc. Redshift also supports the Numeric type which stores values with user-defined precision, which is important for exact numeric operations, for example, when dealing with financial data where absolute precision is important. Unfortunately, BigQuery doesn’t support a user-defined precision alternative, so you’re bound to have some inaccuracies when dealing with floating point numbers. However, BigQuery does support the Record data type for nested structures which are very useful for semi-structured data.
Updates & Upserts aren’t trivial for both Redshift and BigQuery. While Redshift does support UPDATE and DELETE SQL commands internally the data is always in-append mode, which will result in in performance degradation over time until a VACUUM operation is manually triggered. BigQuery doesn’t support updates or deletions and changing a value would require re-creating the entire table. Neither Redshift or Bigquery supports schema updates or native upsert operations.
Nearline storage is supported by BigQuery as it allows you to offload some of your less critical data to a slower, cheaper storage. Redshift currently doesn’t support that nor the ability to scale compute separately from storage. At Panoply.io we worked around that limitation by providing our own implementation of Nearline storage for our customers.
Loading data into Redshift is a very powerful yet involved process handled by the standard COPY command. It supports a handful of formats, like CSV, JSON and AVRO, each with its own configurations like delimiters and compressions. On the other hand, loading data to BigQuery is a simpler, yet somewhat more limited. For example, it can only be done via the web UI or API, not with standard SQL.
The ecosystem for both Redshift and BigQuery is pretty big. Google promotes the latter aggressively so there are many tools and communities out there to help out with pretty much everything. But for us Redshift’s ecosystem is bigger and stronger. First, AWS is the world’s leading cloud provider, by far, and Redshift is one of their fastest growing products. Second, as we’re already well versed in the AWS ecosystem, and use it for everything else, Redshift fits in perfectly with the other tools we use, like Kinesis. Third, Redshift is built on top of Postgres, which has a huge community and set of tools, like pgbouncer, which we use extensively.
BigQuery is an awesome database, and much of what we do at Panoply.io is inspired by it. We’re working hard to make our platform as easy, simple and fun to use as BigQuery. Simplicity is one of most important aspects of a product, and BigQuery is way ahead on that front.
Inspiring as BigQuery and Google in general may be, practically across the board we found Redshift to be superior. It provides significantly superiorusability, performance, and cost for the majority of analytical use-cases, especially at scale. The only critical apparent drawback of Redshift is its relative complexity as it requires constant low-level tuning of the virtualized hardware and database configurations. This apparent complexity is double edged as it can be seen as greater flexibility that allows us to fine-tune Redshift to our specific needs which will result in an even greater advantage in performance and cost.
Panoply.io’s purpose is to abstract away all of these complexities into a single click without compromising on functionality. So when we set out to develop our platform it was easy to justify the tradeoff of simplicity for functionality and focus on elevating Redshift to the same level of simplicity as BigQuery, and then take it way beyond.