As we speak, the future of cloud computing is being fought over. Amazon, Google, Microsoft, Snowflake, and a few others offer cloud solutions for practically everything. Despite the prevailing consensus that Amazon is leading the pack with Redshift and its columnar storage, Google is gaining ground with its cloud platform.
Both release new features and services on almost a weekly basis, and we, the developers, are richer for it. In our opinion, some of the most interesting innovations are taking place in the space of analytical data warehousing, where on one hand we have Amazon’s Redshift and on the other Google’s BigQuery. This post compares Redshift vs. BigQuery in detail.
With Panoply's inception, we had to make a choice: Redshift or BigQuery. As our platform delivers full-stack data automation, a critical chunk of the stack hinges not only on the massively parallel data warehouse used internally to store hundreds of terabytes of data, but also on the capability to analyze it in minutes. This choice would define us, so we were determined to do a thorough comparison of the two and pick the one best for us.
Spoiler Alert! When we were building Panoply, we found Redshift delivered superior results on almost all fronts, especially usability, performance, and cost. And yes, there are apparent complexities to Redshift, but what it surrenders in terms of simplicity it gains in terms of functionality.
This article has been written in retrospect, so throughout the blog we’ll walk you through our logic and argue our case.
Basics: Redshift vs. BigQuery
This article assumes some familiarity with Redshift and BigQuery, as well as basic knowledge of columnar MPP data warehouses. In this section we’ll cover the basics before drilling down into our comparison. If you already have this covered, feel free to skip ahead.
It all comes down to how transactional operations and analytical queries differ from each other. Normal transactional operations are usually indexed and very fast. This is because they must support a huge number of concurrent queries, each targeting a specific row and resolving in the sub-second range. Analytical queries, on the other hand, are normally performed by just a few analysts, where each query is a batch process over a huge dataset that can take minutes or even hours to compute.
Normal relational database systems, like Postgres and MySQL, store data internally in row form: all data rows are stored together and are usually indexed by a primary key to facilitate efficient accessibility. This setup is perfect for most transactional operations, like reading rows by id to display your user profile, for example.
However, a row-centric setup is ineffective for many analytical use-cases because in analytics we’re rarely interested in specific rows but in aggregates over all (or most) rows: computing the average age of all of our users, grouped by location, for example. Performing such queries on classic row-oriented databases requires reading through the entire database, including all unused columns, to produce the results.
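The difference is easy to see with a concrete query. A minimal sketch of the age-by-location example above, assuming a hypothetical users table:

```sql
-- Hypothetical users table; only the age and location columns are needed.
-- A row-oriented database still reads every column of every row to answer
-- this, while a columnar database reads just these two columns from disk.
SELECT location, AVG(age) AS avg_age
FROM users
GROUP BY location;
```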
This massive inefficiency was addressed by the advent of columnar databases, which store data by column instead of by row. As a result, all values of the same column can be stored sequentially on disk. A columnar structure significantly speeds up analytical queries by reading only the columns involved in the query, resulting in faster disk access and better CPU cache utilization.
Both Amazon Redshift and Google BigQuery (along with a few others) benefit from a columnar storage structure, making them ideal for analytical use-cases. In addition, they’re designed to be massively parallel where every query is distributed across multiple servers to accelerate query execution.
Performance: Redshift vs. BigQuery
In analytics, where queries over hundreds of gigabytes are the norm, performance is paramount. Normal transactional queries are performed by your application’s code and therefore affect performance for your users.
In an analytical context, performance has a direct effect on the productivity of your team: running a query for hours means days of iteration on a single business question (think of the compilation times of the past). We wanted technology that would scale well not only in terms of data volume but, just as importantly, in the quantity and complexity of the queries and the number of analysts on our team.
Several comparisons have been published recently that are worth looking at, as they put Redshift and BigQuery up against large data sets to see how they would perform.
In 2020, Fivetran ran a comparison of multiple cloud data warehouses, including BigQuery and Redshift. To do it, they generated a 1 TB TPC-DS data set to use for their benchmark. The TPC-DS data set has 24 tables in a snowflake schema; the tables represent the web, catalog, and store sales of an imaginary retailer. The largest fact table had 4 billion rows.
From there, they ran the 99 TPC-DS queries. These queries vary in complexity: they have lots of joins, aggregations, and subqueries. Fivetran ran each query only once, to prevent the warehouse from caching previous results.
For the configuration of the data warehouses, they used Redshift (5x ra3.4xlarge nodes, $16.30/hour) and BigQuery (flat-rate, 600 slots, $16.44/hour). One notable choice Fivetran made was to not fine-tune the data warehouses at all, e.g. by using any form of distribution key; they simply compared the performance of the data warehouses out of the box. Here were their results:
| Data Warehouse | Avg. Time (seconds) | Avg. Cost (cents) |
| --- | --- | --- |
In this case, Redshift slightly edged out BigQuery. Overall, though, both performance results are acceptable for ad hoc queries, and larger queries will likely require increased fine-tuning regardless of the data warehouse you decide on.
As mentioned earlier, there is a dearth of published benchmarks comparing BigQuery and Amazon Redshift performance, and we could find none that ran a real-world comparison with Redshift configured with correct sortkeys and distkeys.
For that reason, when we evaluated Redshift and BigQuery back in 2017, we ran our own test based on “A Comparison of Approaches to Large-Scale Data Analysis” by Pavlo et al. (SIGMOD 2009).
| Operation | Data Size | Redshift (seconds) | BigQuery (seconds) |
| --- | --- | --- | --- |
| Load from AWS | 460GB | 484 | 438 + 7820 |
During that 2017 test, we used the same test data set from the aforementioned benchmark, but only employed the largest data size available (labeled “5nodes”). That narrowed it down to 2 tables: one (ranking) with 90 million rows at 5.2GB, and the other (uservisits) with 750 million rows at 455GB.
The first step was to load the data into both Redshift and BigQuery, and as you can see in the table above, BigQuery’s load operation performed slightly better. The only caveat is that if the data is generated in AWS (where we run our production infrastructure), transferring it to Google Cloud Storage took an additional 7820 seconds (roughly 2 hours) using Google’s transfer service.
We used 2x DS2 HDD Redshift nodes in order to maintain consistency with the instance types considered in the cost section below. These were Redshift’s most cost-effective node types, but also among the lowest-performing; the Redshift results shown here can be significantly improved by trading cost for performance.
This approach affords considerable flexibility to fine-tune performance to business needs, especially when we considered distributing data across different clusters/types. For example, storing frequently-accessed critical data on DC1 SSD nodes would give us just over a 10x performance boost, while everything else was kept on slower, cheaper instance types. BigQuery has no hardware configuration to speak of.
For queries, we took a different approach from the benchmark mentioned above; instead of using vanilla Redshift, un-optimized and unconfigured, we applied some best-practice optimizations, like compressing tables and setting sort and distribution keys. Previous comparisons measured Redshift’s worst-case performance, while we needed to evaluate a more realistic use-case, one where engineers fine-tune their databases. That said, we didn’t go to the extreme: this setup was far from perfectly optimized, but it presented a more balanced and realistic comparison between the two databases.
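As a rough sketch of what these best-practice optimizations look like, the benchmark's uservisits table could be declared along these lines (the column types, encodings, and key choices here are illustrative, not our exact setup):

```sql
-- Illustrative Redshift DDL: per-column compression encodings, a
-- distribution key to co-locate rows joined on destinationURL across
-- nodes, and a sort key on visitDate so range filters can skip blocks.
-- The right choices depend on your actual query patterns.
CREATE TABLE uservisits (
  sourceIP       VARCHAR(116) ENCODE lzo,
  destinationURL VARCHAR(100) ENCODE lzo,
  visitDate      DATE         ENCODE delta,
  adRevenue      FLOAT        ENCODE raw
)
DISTKEY (destinationURL)
SORTKEY (visitDate);
```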
We evaluated three queries, each with three variants measuring different scales of processed data: small, medium, and large. The definitions of these variants are borrowed from the Big Data Benchmark from U.C. Berkeley AMPLab, which includes code and test data for the comparison.
Each query was executed 10 times with an hour interval and the median runtime in seconds was logged. BigQuery has excellent query caching, meaning that all repeated queries complete within a few seconds regardless of the processed data size. Unfortunately, that also made query repetition an ineffective means of producing a statistical measure of BigQuery’s performance.
To work around that, with each execution we made a tiny parameterized change to the query (like changing the variable X from 1000 to 1000.01, see below) in order to invalidate the cache. We used the same approach in Redshift; however, Redshift caches data, not queries, which gave it an edge at handling similar queries with modified parameters, while underperforming at repeated identical queries.
The first query measured normal scan speed. It had three variations where X = 1000 (small), 100 (medium) and 10 (large):
```sql
SELECT pageURL, pageRank
FROM rankings
WHERE pageRank > X
```
Note that BigQuery didn't allow output results to exceed 128MB. While that’s definitely a lot, we found it to be insufficient for use cases such as utilizing Spark SQL for deep data mining. Instead, we had to save the results of the medium and large variants to a destination table. Obviously, in order to compare apples-to-apples, we applied the same approach in Redshift although it has no such limitation.
The second query measured aggregation speed. It had three variations where X = 8 (small), 10 (medium) and 12 (large):
```sql
SELECT SUBSTRING(sourceIP, 1, X), SUM(adRevenue)
FROM uservisits
GROUP BY 1
```
The third and final query measured join speed. It had three variations where X = 1980-04-01 (small), 1983-01-01 (medium) and 2010-01-01 (large):
```sql
SELECT sourceIP,
       SUM(adRevenue) AS totalRevenue,
       AVG(pageRank) AS pageRank
FROM rankings r
JOIN (
  SELECT sourceIP, destinationURL, adRevenue
  FROM uservisits uv
  WHERE uv.visitDate > '1980-01-01' AND uv.visitDate < X
) nuv ON (r.pageURL = nuv.destinationURL)
GROUP BY sourceIP
ORDER BY totalRevenue DESC
LIMIT 1
```
Contrary to previous findings that didn’t consider optimization, when reasonably optimized, Redshift outperformed BigQuery hands down in 9 out of 11 use-cases. At the time we ran the test, the only instances where BigQuery had superior performance were big join operations, which is no small win given the direction data is heading.
Cost: Redshift vs. BigQuery
On the surface, it might seem that Redshift is more expensive than BigQuery. Per GB per month, Redshift costs $0.08 (roughly $1,000/TB/year), compared to BigQuery’s $0.02. However, the devil is in the details.
BigQuery’s cost of $0.02/GB only covers storage, not queries. You pay separately per query based on the amount of data processed at a $5/TB rate.
Because BigQuery doesn’t provide any indexes, and many analytical queries cover the entire database, we can assume that each query will need to scan a big chunk of the data. Say you have 1TB spread evenly across 50 columns (in several tables). A query that scans 5 of those columns will process 100GB at a cost of $0.50. Spread over the full 1TB you’re storing, that’s an additional $0.0005 per stored GB per query. Run 120 such queries per month and you’re at $0.08 per GB (0.02 + 0.0005 * 120), which is the same as Redshift. Beyond that, BigQuery costs more.
The realization we came to is that cheap data storage is worthless when it’s divorced from utilization. It’s arguably comparable to storing data on plain Amazon S3: when you actually use the data, you’ll start paying big.
In addition to the cost itself, there are three drawbacks to BigQuery’s pricing model:
- It’s unpredictable. There is no way to estimate in advance how much you’ll end up paying at the end of the month, and costs can fluctuate from month to month based on query patterns. Luckily, this is mitigated by BigQuery’s Cost Control feature, at the cost of exacerbating the next two issues.
- It’s complex. Just trying to figure out how much each query will cost, and how much you’ll pay per GB is extremely difficult as it requires a thorough examination of each query and the underlying data it’s going to access.
- It discourages data usage. Instead of encouraging analysts to query the database in any and all ways they can imagine, you’ll end up worrying about limiting them and coming up with processes for controlling the volume of data being used. As a data-driven company, this notion contradicted our company values.
Usability: Redshift vs. BigQuery
Usability is probably the most debated, intangible and subjective aspect of all database comparisons. Both Redshift and BigQuery list tens to hundreds of capabilities and features, each with its own configurations, use-cases, edge-cases and bugs.
This vast soup of intertwined capabilities is so complex to reason through that comparisons often come down to personal preference. That said, usability cannot be skipped, so we mapped out the key subjects we cared about:
Simplicity is by far the biggest advantage BigQuery holds over Redshift. BigQuery abstracts away the details of the underlying hardware, database, and all configurations. It mostly works out of the box, whereas Redshift requires deep knowledge and a particular skillset to use and optimize effectively.
This often translates into dedicated engineers battling to keep your infrastructure up to par with your company’s data scale and query patterns, as well as industry best practices. That’s where Panoply comes in, bridging this gap by making Redshift even simpler to use than BigQuery.
Integrating with BI tools is pretty easy and widely supported for both of these databases. Redshift supports standard JDBC/ODBC compatible SQL syntax, as it’s built on top of Postgres and supports many of its features. This means that all of the tools that are compatible with Postgres are also compatible with Redshift.
Supported data types
Supported data types for both databases cover a wide range of classic types, like text, integers, floats, booleans, times, and timestamps. However, BigQuery also supports the Record data type for nested structures, which is very useful for semi-structured data.
Stored procedures
Stored procedures are another beneficial feature that has recently been added to both Redshift and BigQuery. This newer feature allows developers to encapsulate their code into concise portions of logic. These procedures can be very helpful when writing data transformations or trying to wrap up complex processes that require a specific order of steps.
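As a sketch, a Redshift stored procedure wrapping a multi-step transformation might look like this (the table and procedure names are hypothetical); BigQuery's scripting syntax differs but follows the same idea:

```sql
-- Hypothetical Redshift stored procedure: rebuilds a daily revenue
-- summary in a fixed order of steps. Redshift procedures are written
-- in PL/pgSQL.
CREATE OR REPLACE PROCEDURE refresh_daily_revenue()
AS $$
BEGIN
  TRUNCATE daily_revenue;
  INSERT INTO daily_revenue
  SELECT visitDate, SUM(adRevenue)
  FROM uservisits
  GROUP BY visitDate;
END;
$$ LANGUAGE plpgsql;

-- Invoked with:
CALL refresh_daily_revenue();
```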
The lack of stored procedures was a major stumbling block for both BigQuery and Redshift for a long time, as most other data warehouse options have long provided this feature.
Updates and upserts
In Redshift, UPDATE and DELETE operations don’t immediately reclaim disk space; rows are only marked for deletion. Thankfully, as of 2018, Amazon Redshift automatically runs the VACUUM DELETE operation. This ensures that your Redshift instance doesn’t waste disk space on rows marked for deletion by previous UPDATE and DELETE operations. It also defragments the tables to free up consumed space and improve performance for your workloads.
BigQuery also supports updates and deletions, as long as you opt to use Standard SQL. There is also a Legacy SQL option for BigQuery, which is far more limiting and less familiar to the general user. The move to Standard SQL gives analysts and engineers the ability to write several DML statements that weren’t possible under the Legacy SQL option.
In addition to standard deletes and updates, BigQuery’s Standard SQL also gives developers the ability to use the MERGE statement. Instead of needing a staging table and several complex steps, as in Redshift, BigQuery users can write a single statement that combines INSERT, UPDATE, and DELETE operations and performs them atomically based on a set of conditions.
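For illustration, a BigQuery MERGE that reconciles a target table against newly arrived rows might look roughly like this (the dataset, table, and column names are hypothetical):

```sql
-- Hypothetical upsert in BigQuery Standard SQL: update matching rows,
-- insert new ones, all atomically in a single statement.
MERGE analytics.users AS t
USING analytics.users_updates AS s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET email = s.email, updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (id, email, updated_at)
  VALUES (s.id, s.email, s.updated_at);
```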
Nearline storage
BigQuery supports nearline storage, allowing you to offload some of your less critical data to slower, cheaper storage. On the Redshift side, since April 2017, Redshift Spectrum allows users to query files stored in S3, enabling the separation of storage and compute.
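A minimal Spectrum sketch (the schema, database, and IAM role names are placeholders): register an external schema backed by the data catalog, then query S3-resident tables with ordinary SQL:

```sql
-- Placeholder names throughout. The external schema maps to a data
-- catalog database; tables in it read directly from files on S3.
CREATE EXTERNAL SCHEMA s3_archive
FROM DATA CATALOG
DATABASE 'archive_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Cold data on S3 can then be queried, or joined with hot data
-- stored in the cluster:
SELECT COUNT(*) FROM s3_archive.old_uservisits;
```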
At Panoply, we worked around the limitation of coupled storage and compute by providing our own implementation of nearline storage for our customers.
Data loading
Loading data into Redshift is a powerful yet involved process handled by the standard COPY command. It supports a handful of formats, like CSV, JSON, and AVRO, each with its own configurations, like delimiters and compression.
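For example, a typical gzipped-CSV load from S3 looks something like this (the bucket path and IAM role are placeholders):

```sql
-- Placeholder S3 path and IAM role. COPY loads files in parallel
-- across the cluster's slices; format, delimiter, and compression
-- are all configurable.
COPY uservisits
FROM 's3://my-bucket/uservisits/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS CSV
GZIP;
```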
On the other hand, loading data to BigQuery is simpler, yet somewhat more limited. For example, it can only be done via the web UI or API, but not with standard SQL.
Ecosystem
The ecosystem around both Redshift and BigQuery is pretty big. Google promotes BigQuery aggressively, so there are many tools and communities out there to help with pretty much everything.
But for us, Redshift’s ecosystem is bigger and stronger. First, AWS is the world’s leading cloud provider, by far, and Redshift is one of its fastest-growing products. Second, as we’re already well versed in the AWS ecosystem and use it for everything else, Redshift fits in perfectly with the other tools we use, like Kinesis. Third, Redshift is built on top of Postgres, which has a huge community and set of tools, like pgbouncer, which we use extensively.
Recap: Redshift vs. BigQuery
BigQuery is an awesome database, and much of what we do at Panoply is inspired by it. We’re working hard to make our platform as easy, simple, and fun to use as BigQuery. Simplicity is one of the most important aspects of a product, and BigQuery is way ahead on that front.
Inspiring as BigQuery and Google in general may be, practically across the board we found Redshift to be superior. It provides significantly better usability, performance, and cost for the majority of analytical use-cases, especially at scale.
The only critical drawback of Redshift is its relative complexity, as it requires constant low-level tuning of the virtualized hardware and database configurations. This complexity is double-edged: it can also be seen as greater flexibility, allowing us to fine-tune Redshift to our specific needs for an even greater advantage in performance and cost.
Panoply’s purpose is to abstract away all of these complexities into a single click without compromising on functionality. So when we set out to develop our platform it was easy to justify the tradeoff of simplicity for functionality and focus on elevating Redshift to the same level of simplicity as BigQuery, and then take it way beyond.