Data has become the lifeblood of business and data warehouses are an essential part of that. But knowing which data warehouse makes sense for your business can be tricky.
Today, we continue our discussion of modern data warehouses as we compare Redshift and Snowflake, as well as outlining core considerations when choosing a data warehouse.
Snowflake and Redshift: The basics
Both Snowflake and Redshift are powerful cloud data warehouses, and both offer some really interesting options in terms of managing data.
To begin to differentiate the two, Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. One of the cool points here is that you can start with just a few hundred gigabytes of data and scale to a petabyte or more as your requirements grow.
Snowflake Computing sells a cloud-based data storage and analytics service called Snowflake Elastic Data Warehouse. With this solution, corporate users are able to store and analyze data using cloud-based hardware and software. From there, the data is stored in Amazon S3. Rather than relying on technologies like Hadoop, Snowflake actually leverages the public cloud ecosystem.
As mentioned, both of these solutions are powerful and offer some unique features when it comes to managing data. But, there are definitely differences. With that, let’s dive in.
Ecosystems and integrations
If you’re working with an Amazon ecosystem, Redshift should be on your list. Redshift integrates with a variety of AWS services such as Kinesis Data Firehose, SageMaker, EMR, Glue, DynamoDB, Athena, Database Migration Service (DMS), Schema Conversion Tools (SCT), CloudWatch, etc.
On the other hand, you can absolutely find Snowflake on the AWS Marketplace with really cool on-demand functions. Integrating Amazon services like Kinesis, Glue, S3, is possible with Snowflake although could require more work. For example, your team can use Snowpipe and a few other AWS services to integrate AWS Kinesis data or for S3 you could implement storage integrations on the Snowflake side.
Both options offer extensive integrations and have healthy ecosystem partners. Redshift is more established and has a bit of a leg up, but Snowflake has come a long way. But if you're looking to simplify your data stack, Panoply offers a cloud data platform that combines storage with tons of pre-built data integrations that automatically sync and store your data for easy querying.
Price: Redshift vs Snowflake
At a very high level, we took a look at pricing models from both Redshift and Snowflake and found that Redshift is often less expensive than Snowflake for on-demand use. That said, it’s important to note that major data warehouse players like BigQuery, Redshift, and Snowflake each have different pricing models.
Redshift's pricing model
Redshift charges per-hour per-node, which covers both computational power and data storage. With Redshift, you can calculate the monthly price by multiplying the price per hour by the size of the cluster and the number of hours in a month.
Redshift Monthly Price = [Price Per Hour] x [Cluster Size x [Hours per Month]
Snowflake's pricing model
Snowflake’s pricing depends heavily on your usage pattern. Since data storage is decoupled from the computational warehouses, the two are billed separately. Snowflake offers a dynamic pricing model—clusters will stop when no queries are running and automatically resume when they are, and they can flexibly resize themselves based on a changing workload. This can potentially save you money when query load decreases.
As examples, and using the US as a reference, Snowflake storage costs begin at a flat rate of $23/TB, average compressed amount, per month accrued daily. Meanwhile, compute costs $0.00056 per second, per credit, for their Snowflake On Demand Standard Edition. But this is where things can get a bit confusing: Snowflake offers seven different tiers of computational warehouses. The smallest cluster, X-Small, costs one credit per hour, or $2/hour. At each level, the number of credits per hour doubles.
The final word on cost
In the end, when comparing Amazon Redshift’s 2, 4, and 8 node DC2.8XL clusters with equivalently sized Medium, Large and X-Large Snowflake configurations:
- Redshift is 1.3x less expensive than Snowflake for on-demand pricing
- Redshift is 1.9x to 3.7x less expensive than Snowflake with the purchase of a 1 or 3 year Reserved Instance (RI)
Data support: Snowflake vs. Redshift
For several years one of the major advantages Snowflake offered was how it treated semi-structured data and JSON. Snowflake allowed developers to query on top of JSON and other less structured data types. This was due to the fact that Snowflake can be treated like a data lake or a data warehouse, giving it a huge advantage over Redshift.
Redshift officially closed this gap as of December 2020 with new native support for JSON. Redshift has provided this new data type under the data type
SUPER that allows you to store semi-structured data in Redshift tables.
In addition, Redshift also supports PartiQL query language to seamlessly query and process semi-structured data. This functionality enables advanced analytics that combine classic structured SQL data (such as strings, numerics, and timestamps) with the semi-structured
SUPER data with superior performance, flexibility, and ease-of-use. This new schemaless data source allows data engineers and analysts to more quickly access data without committing it to a table first.
These advancements on Redshift's side make both cloud data warehouses almost comparable in terms of data types. However, if you work with XML, Arvo, or Parquet, Snowflake wins out.
Performance: Redshift vs. Snowflake
There are plenty of articles that have compared all of the various cloud data warehouses. However, one that stands out is Fivetran's recent benchmarking report.
To set it up, Fivetran generated a 1 TB TPC data set to use for their benchmark. The TPC data set had 24 tables in a snowflake schema. The largest fact table had 4 billion rows.
From there they ran 99 TPC-DS queries. These queries varied in complexity: They have lots of joins, aggregations, and subqueries. They then ran each query only once to prevent the warehouse from caching previous results.
For the configuration they used Redshift 5x ra3.4xlarge $16.30 and Snowflake Large $16.00. One major decision that Fivetran made was to not fine tune the data warehouses by using any form of distribution key.
As a results, Fivetran's benchmark report compares the performance of the data warehouses out of the box:
Here were the results:
|Data warehouse||Avg. time
The performance difference was negligible. You probably shouldn’t make your choice on data warehouse based on a few fractions of a second.
What this does tell you is that it's possible to perform complex queries on pretty large data sets regardless of which data warehouse you choose. So if you do run into performance problems, it's likely due to poor design...not your tools
Security: Redshift vs Snowflake
When it comes to data, security is a critical foundation. All the data that we create opens up new vulnerabilities to private and sensitive information. Both Redshift and Snowflake take security very seriously, but they have slightly different approaches.
Amazon Redshift database security is distinct from other types of Amazon Redshift security. In addition to database security, Amazon Redshift provides these features to manage security:
- Sign-in credentials - Access to your Amazon Redshift Management Console is controlled by your AWS account privileges.
- Access management - To control access to specific Amazon Redshift resources, you define AWS Identity and Access Management (IAM) accounts.
- Cluster security groups - To grant users inbound access to an Amazon Redshift cluster, you define a cluster security group and associate it with a cluster.
- VPC - To protect access to your cluster by using a virtual networking environment, you can launch your cluster in an Amazon Virtual Private Cloud (VPC).
- Cluster encryption - To encrypt the data in user-created tables, you can enable cluster encryption when you launch the cluster.
- SSL connections - To encrypt the connection between your SQL client and your cluster, you can use secure sockets layer (SSL) encryption.
- Load data encryption - To encrypt your table load data files when you upload them to Amazon S3, you can use either server-side encryption or client-side encryption. When you load from server-side encrypted data, Amazon S3 handles decryption transparently. When you load from client-side encrypted data, the Amazon Redshift
COPYcommand decrypts the data as it loads the table.
- Data in transit - To protect your data in transit within the AWS cloud, Amazon Redshift uses hardware accelerated SSL to communicate with Amazon S3 or Amazon DynamoDB for
UNLOAD, backup, and restore operations.
- Compliance - Redshift has an array of compliance certifications that may be relevant to your unique business needs.
Like Redshift, Snowflake provides industry-leading features that ensure the highest levels of security for your account and users, as well as the data you store in Snowflake:
- Network/site access - Site access is controlled through IP whitelisting and blacklisting, managed through network policies, while private/direct communication between Snowflake and your other VPCs happens through AWS PrivateLink.
- Account/user authentication - MFA (multi-factor authentication) for increased security for account access by users. Support for user SSO (single sign-on) through federated authentication.
- Object security - Controlled access to all objects in the account (users, warehouses, databases, tables, etc. through a hybrid model of discretionary access control (DAC) and role-based access control (RBAC).
- Data security - All data is automatically encrypted using AES 256 strong encryption. All files stored in stages (for data loading/unloading) are automatically encrypted using either AES 128 standard or 256 strong encryption. Encrypted data is periodically rekeyed and Snowflake offers support for encrypting data using customer-managed keys.
- Security Validations - Soc 2 Type II and PCI DSS compliant. Support for HIPAA compliance.
My only point of caution is to make sure you know which Snowflake edition you’re working with—some features aren't available with all editions. For example, if you want to leverage security validations features and work with HIPAA or PCI DSS, you need to be working with Snowflake’s Enterprise Edition for Sensitive Data (ESD).
The data warehouse decision
Whenever you’re working with data, you want results as quickly as possible. But moving fast shouldn't mean breaking things.
Take your time to find out whether a particular data warehouse has the features you need. Conducting a trial or POC (proof of concept) is a great way to test out the waters. Plus, it’ll help you understand integration points and how to manage the entire platform.
The real key is to actually get started. Beginning by identifying your business needs makes tool evaluation easier because you'll know what actually matters...and what doesn't.