Data has become the lifeblood of business, and data warehouses are an essential part of that. But knowing which data warehouse makes sense for your business can be tricky.
Today, we continue our discussion of modern data warehouses as we compare Redshift and Snowflake and outline core considerations when choosing a data warehouse.
Both Snowflake and Redshift are powerful cloud data warehouses, and both offer some really interesting options in terms of managing data.
To begin to differentiate the two, Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. One of the cool points here is that you can start with just a few hundred gigabytes of data and scale to a petabyte or more as your requirements grow.
Snowflake Computing sells a cloud-based data storage and analytics service called the Snowflake Elastic Data Warehouse. With this solution, corporate users can store and analyze data using cloud-based hardware and software, with the underlying data stored in Amazon S3. Rather than relying on technologies like Hadoop, Snowflake leverages the public cloud ecosystem.
As mentioned, both of these solutions are powerful and offer some unique features when it comes to managing data. But, there are definitely differences. With that, let’s dive in.
If you’re working with an Amazon ecosystem, Redshift should be on your list. Redshift integrates with a variety of AWS services such as Kinesis Data Firehose, SageMaker, EMR, Glue, DynamoDB, Athena, Database Migration Service (DMS), Schema Conversion Tools (SCT), CloudWatch, etc.
On the other hand, you can absolutely find Snowflake on the AWS Marketplace with really cool on-demand functions. Integrating Amazon services like Kinesis, Glue, and S3 is possible with Snowflake, although it can require more work. For example, your team can use Snowpipe and a few other AWS services to ingest AWS Kinesis data, or for S3 you can configure storage integrations on the Snowflake side.
Both options offer extensive integrations and have healthy partner ecosystems. Redshift is more established and has a bit of a leg up, but Snowflake has come a long way. And if you're looking to simplify your data stack, Panoply offers a cloud data platform that combines storage with tons of pre-built data integrations that automatically sync and store your data for easy querying.
At a very high level, we took a look at pricing models from both Redshift and Snowflake and found that Redshift is often less expensive than Snowflake for on-demand use. That said, it’s important to note that major data warehouse players like BigQuery, Redshift, and Snowflake each have different pricing models.
Redshift charges per-hour per-node, which covers both computational power and data storage. With Redshift, you can calculate the monthly price by multiplying the price per hour by the size of the cluster and the number of hours in a month.
Redshift Monthly Price = [Price Per Hour] x [Cluster Size] x [Hours Per Month]
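As a quick sanity check, the formula can be sketched in Python. The hourly rate below is illustrative, not a price quote; check AWS's pricing page for current node rates.

```python
def redshift_monthly_price(price_per_hour, cluster_size, hours_per_month=730):
    """Estimate the monthly on-demand cost of a Redshift cluster.

    price_per_hour:  on-demand rate for one node of the chosen type
    cluster_size:    number of nodes in the cluster
    hours_per_month: 730 approximates an always-on month (24 * 365 / 12)
    """
    return price_per_hour * cluster_size * hours_per_month

# Example: a hypothetical 4-node cluster at $3.26 per node-hour, running 24/7.
estimate = redshift_monthly_price(price_per_hour=3.26, cluster_size=4)
print(f"${estimate:,.2f} per month")  # roughly $9,519 for an always-on month
```

Note that this assumes the cluster runs around the clock; pausing a cluster you only use during business hours shrinks the `hours_per_month` term accordingly.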
Snowflake’s pricing depends heavily on your usage pattern. Since data storage is decoupled from the computational warehouses, the two are billed separately. Snowflake offers a dynamic pricing model—clusters will stop when no queries are running and automatically resume when they are, and they can flexibly resize themselves based on a changing workload. This can potentially save you money when query load decreases.
As an example, using US pricing as a reference: Snowflake storage starts at a flat rate of $23/TB per month (based on the average compressed amount stored, accrued daily). Compute costs $0.00056 per second, per credit, on Snowflake On Demand Standard Edition. But this is where things can get a bit confusing: Snowflake offers seven different tiers of computational warehouses. The smallest, X-Small, costs one credit per hour, or about $2/hour, and the number of credits per hour doubles at each level.
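To make the doubling concrete, here is a sketch of how warehouse sizes map to credits and hourly cost. The $2/credit figure matches the On Demand Standard Edition rate above; the size names beyond X-Small are assumptions based on Snowflake's usual tier naming, so verify them against your own edition's price sheet.

```python
SIZES = ["X-Small", "Small", "Medium", "Large", "X-Large", "2X-Large", "3X-Large"]
PRICE_PER_CREDIT = 2.00            # On Demand Standard Edition, US (illustrative)
PRICE_PER_CREDIT_SECOND = 0.00056  # the per-second rate quoted above

def credits_per_hour(size):
    """Credits double at each tier: X-Small = 1, Small = 2, Medium = 4, ..."""
    return 2 ** SIZES.index(size)

def cost_per_hour(size):
    return credits_per_hour(size) * PRICE_PER_CREDIT

for size in SIZES:
    print(f"{size:>8}: {credits_per_hour(size):3d} credits/hr = ${cost_per_hour(size):.2f}/hr")

# Because billing is per second, a Medium warehouse that auto-suspends after
# 15 minutes of actual query time costs 4 credits * 0.00056 * 900s ~= $2.02,
# not the full $8 an always-on hour would cost.
```

The per-second billing and auto-suspend behavior are what make Snowflake's effective cost so workload-dependent: bursty usage pays for far less than wall-clock time.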
In the end, comparing Amazon Redshift's 2, 4, and 8 node DC2.8XL clusters with equivalently sized Medium, Large, and X-Large Snowflake configurations, Redshift generally comes out less expensive for on-demand use.
For several years, one of the major advantages Snowflake offered was how it treated semi-structured data and JSON. Snowflake allowed developers to query on top of JSON and other less structured data types. Because Snowflake can be treated like a data lake or a data warehouse, this gave it a huge advantage over Redshift.
Redshift officially closed this gap in December 2020 with native support for JSON. Redshift provides this capability through a new data type, `SUPER`, which allows you to store semi-structured data in Redshift tables.
In addition, Redshift supports the PartiQL query language to seamlessly query and process semi-structured data. This enables advanced analytics that combine classic structured SQL data (such as strings, numerics, and timestamps) with semi-structured `SUPER` data, with strong performance, flexibility, and ease of use. This schemaless data type lets data engineers and analysts access data more quickly without first committing it to a table.
These advancements on Redshift's side make the two cloud data warehouses almost comparable in terms of data types. However, if you work with XML, Avro, or Parquet, Snowflake wins out.
There are plenty of articles that have compared all of the various cloud data warehouses. However, one that stands out is Fivetran's recent benchmarking report.
To set it up, Fivetran generated a 1 TB TPC data set to use for their benchmark. The TPC data set had 24 tables in a snowflake schema. The largest fact table had 4 billion rows.
From there, they ran 99 TPC-DS queries. These queries varied in complexity: they involve many joins, aggregations, and subqueries. Each query was run only once, so the warehouse couldn't serve cached results from a previous run.
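Fivetran's run-each-query-once methodology can be mimicked with a small timing harness. The sketch below uses SQLite as a stand-in engine so it runs anywhere; against a real warehouse you would swap in that warehouse's driver and connection, and the toy table here is only a placeholder for the 24-table TPC-DS schema.

```python
import sqlite3
import time

def run_benchmark(conn, queries):
    """Run each query exactly once and record its wall-clock time.

    Running a query only once avoids measuring the warehouse's result
    cache instead of its actual execution speed.
    """
    timings = {}
    for name, sql in queries.items():
        start = time.perf_counter()
        conn.execute(sql).fetchall()   # force full materialization of results
        timings[name] = time.perf_counter() - start
    return timings

# Toy stand-in fact table; a real run would load the TPC-DS data set.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(i, i * 1.5) for i in range(1000)])

queries = {
    "q1_total": "SELECT SUM(amount) FROM sales",
    "q2_agg":   "SELECT id % 10, AVG(amount) FROM sales GROUP BY id % 10",
}
timings = run_benchmark(conn, queries)
avg_seconds = sum(timings.values()) / len(timings)
print(f"average query time: {avg_seconds:.6f}s")
```

The same harness, pointed at two warehouses with the same data set, is essentially what the benchmark's average-time column summarizes.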
For the configuration, they used a 5-node ra3.4xlarge Redshift cluster ($16.30/hour) and a Large Snowflake warehouse ($16.00/hour). One major decision Fivetran made was not to fine-tune the data warehouses with any form of distribution key.
As a result, Fivetran's benchmark report compares the performance of the data warehouses out of the box:
Here were the results:
| Data warehouse | Avg. time (seconds) | Avg. cost (cents) |
| --- | --- | --- |
| Redshift | 8.24 | 7.46 |
| Snowflake | 8.21 | 7.29 |
The performance difference was negligible. You probably shouldn't choose a data warehouse based on a few fractions of a second.
What this does tell you is that it's possible to perform complex queries on pretty large data sets regardless of which data warehouse you choose. So if you do run into performance problems, it's likely due to poor design, not your tools.
When it comes to data, security is a critical foundation. All the data that we create opens up new vulnerabilities to private and sensitive information. Both Redshift and Snowflake take security very seriously, but they have slightly different approaches.
Amazon Redshift database security is distinct from other types of Amazon Redshift security. In addition to database security, Amazon Redshift provides features to manage security, including:

- Load data encryption: encrypted data files can be uploaded to Amazon S3, and the `COPY` command decrypts the data as it loads the table.
- Encryption of data in transit: Redshift encrypts data used in `COPY`, `UNLOAD`, backup, and restore operations.

Like Redshift, Snowflake provides industry-leading features that ensure the highest levels of security for your account and users, as well as the data you store in Snowflake.
My only point of caution is to make sure you know which Snowflake edition you're working with—some features aren't available with all editions. For example, if you need security validations such as HIPAA or PCI DSS compliance, you need to be working with Snowflake's Enterprise Edition for Sensitive Data (ESD).
Whenever you’re working with data, you want results as quickly as possible. But moving fast shouldn't mean breaking things.
Take your time to find out whether a particular data warehouse has the features you need. Conducting a trial or proof of concept (POC) is a great way to test the waters. Plus, it'll help you understand integration points and how to manage the entire platform.
The real key is to actually get started. Beginning by identifying your business needs makes tool evaluation easier because you'll know what actually matters...and what doesn't.