If you're thinking about creating a data warehouse from scratch, one of the options you are probably considering is Amazon Redshift. Redshift is very powerful, but figuring out how much it'll cost can be tricky.
In this article, we'll explain how Amazon Redshift pricing works and what issues you'll need to consider to figure out how much it would cost to build your data warehouse on Redshift.
- Key terminology for understanding Redshift pricing structure
- Additional features
- Tools for managing Redshift spending
Key terminology for understanding Amazon's data warehouse pricing structure
Redshift's pricing uses a lot of jargon. But only a handful of terms are critical:
- Concurrency: Redshift is designed around the idea that it should be as easy as possible to throw a ton of resources at your analytics/BI solution whenever you need them. The key to this strategy is being able to use many resources in parallel. A bunch of CPUs working together, for example, is more affordable than one mega computer—and they are much cheaper to scale up or down as your needs change.
- Node: Because Amazon Redshift is organized around concurrency, you don't rent an actual machine on Redshift. Instead, you'll use one or more "nodes," each of which is a bucket of computing power and storage. As you'll see in the next section, Redshift offers several types of nodes.
- Cluster: A cluster is a group of one or more nodes of the same type that share an Amazon Redshift engine.
You pay for nodes and clusters by the hour, with pricing depending on your location (the prices listed below are for Northern California).
Basic Redshift pricing
If you were building a data warehouse on Redshift a year ago, you'd have to wrestle with questions such as, “do I want ‘dense compute’ or ‘dense storage’ nodes?” Understanding these options could be pretty confusing. But due to new services Amazon has rolled out, the choices are simpler.
Today, all you need to do is ask yourself, do I need more than 1 TB of storage?
For smaller data warehouses: Use DC2 nodes
If your data warehouse isn't going to be very big, odds are you'll need less than 1 TB of storage. In that case, the AWS Redshift pricing guide suggests you use DC2 nodes.
DC2 nodes are optimized for building small data warehouses that are fast and affordable. They give you a decent amount of computing power, and your data is stored on a zippy solid-state drive. DC2 nodes come in 2 sizes:
- dc2.large: If you have a small amount of data (less than 1/6 of a terabyte), you can get what you need for pennies. For $0.33/hour, you get the equivalent of 2 CPUs, 15 GB of memory, and 0.16 TB of storage.
- dc2.8xlarge: For $6.40/hour, you get the equivalent of 32 CPUs, 244 GB of memory, and 2.56 TB of storage.
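To turn those hourly rates into a monthly figure, multiply the rate by the number of nodes and the hours the cluster runs. The sketch below uses the Northern California rates quoted above; the 730-hour month and the function name are assumptions made for illustration, not anything Amazon publishes.

```python
# On-demand hourly rates quoted in this article (Northern California).
DC2_HOURLY_RATES = {
    "dc2.large": 0.33,
    "dc2.8xlarge": 6.40,
}

# Assumption: an average month has about 730 hours (365 * 24 / 12).
HOURS_PER_MONTH = 730


def dc2_monthly_cost(node_type, node_count, hours=HOURS_PER_MONTH):
    """Estimate the on-demand cost of a DC2 cluster that runs for `hours` hours."""
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    return DC2_HOURLY_RATES[node_type] * node_count * hours


if __name__ == "__main__":
    # Two dc2.large nodes running around the clock for a month.
    print(f"${dc2_monthly_cost('dc2.large', 2):,.2f}")
```

With these numbers, a two-node dc2.large cluster that never shuts down comes to a little under $500 a month, before any data transfer or add-on charges.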
For larger data warehouses: Use RA3 nodes
If your data warehouse is going to handle a ton of data, you'll need to use RA3 nodes. RA3 nodes offer a formidable amount of computing power. And since solid-state drives are too expensive to use for all of the data stored in a large data warehouse, RA3 nodes use a neat trick called Managed Storage, where Redshift dynamically stores frequently used data on solid-state drives and offloads the rest of the data to Amazon's S3.
With a smaller database, it makes sense to pay one rate for a preconfigured amount of computing power and storage, just like you might buy a preconfigured laptop on Amazon. But with big data warehouses, one-size-fits-all doesn't make sense. If you need gobs and gobs of storage, you shouldn't have to also buy a ton of computing power unless you actually need it. So with RA3 nodes, you pay separately for computing power and storage.
- Computing power:
  - ra3.4xlarge: For $3.0606/hour, you get the equivalent of 12 CPUs and 96 GB of memory
  - ra3.16xlarge: For $14.424/hour, you get the equivalent of 48 CPUs and 384 GB of memory
- Storage: $0.0271 per gigabyte per month
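Because RA3 bills computing power and storage separately, a monthly estimate is the sum of two terms. Here is a minimal sketch using the rates listed above; the 730-hour month, the function name, and the example cluster size are assumptions for illustration.

```python
# On-demand compute rates quoted in this article (Northern California).
RA3_HOURLY_RATES = {
    "ra3.4xlarge": 3.0606,
    "ra3.16xlarge": 14.424,
}

# Managed storage rate quoted in this article, in dollars per GB per month.
RA3_STORAGE_RATE_PER_GB_MONTH = 0.0271

# Assumption: an average month has about 730 hours (365 * 24 / 12).
HOURS_PER_MONTH = 730


def ra3_monthly_cost(node_type, node_count, stored_gb, hours=HOURS_PER_MONTH):
    """Return (compute, storage, total) cost estimates for one month."""
    compute = RA3_HOURLY_RATES[node_type] * node_count * hours
    storage = RA3_STORAGE_RATE_PER_GB_MONTH * stored_gb
    return compute, storage, compute + storage


if __name__ == "__main__":
    # Hypothetical cluster: two ra3.4xlarge nodes holding 5,000 GB of data.
    compute, storage, total = ra3_monthly_cost("ra3.4xlarge", 2, 5000)
    print(f"compute ${compute:,.2f} + storage ${storage:,.2f} = ${total:,.2f}")
```

In this example storage is a small fraction of the bill, which is the point of the split: you can add terabytes without also paying for more nodes.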
Data transfer pricing
Regardless of which type of nodes you use, you may also be charged for importing data into and exporting data from your Redshift data warehouse. If the data you are transferring is stored in Amazon S3 and that data is stored in the same Amazon Web Services region as your data warehouse, you won't be charged for data transfer. If not, you'll be charged at standard AWS transfer rates.
Redshift's pricing for additional features
If you have more sophisticated needs, Redshift offers a variety of optional features. Here are a few of the more frequently used options you might consider adding to your Redshift setup. They may cost a little extra, but they could save you time, hassle, or unexpected budget overages.
Redshift Spectrum and federated query
One of the real pains about building a data warehouse is that you have to import all of the data you're going to use even if you only infrequently use most of that data. But if you store a lot of your data on AWS, Redshift can query that data without importing it:
- Redshift Spectrum: Redshift can query data you've stored in Amazon S3 at a cost of $5 per terabyte of data scanned as well as some additional charges (e.g., you're charged when you make a request against one of your S3 buckets).
- Federated query: Redshift can query data in Amazon RDS and Aurora PostgreSQL databases. There aren't any additional costs for using federated query beyond the charges you pay for using Redshift and these databases.
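Spectrum's scan charge is easy to estimate once you know how much data your queries read. The sketch below applies the $5 per terabyte rate quoted above to a list of per-query scan sizes. It assumes 1 TB = 1,024 GB and leaves out the additional S3 request charges; the function name and sample numbers are made up for illustration.

```python
# Scan rate quoted in this article.
SPECTRUM_RATE_PER_TB = 5.00

# Assumption: 1 TB is treated as 1,024 GB for this estimate.
GB_PER_TB = 1024


def spectrum_scan_cost(gb_scanned_per_query):
    """Estimate the Spectrum scan charge for a batch of queries.

    Only the per-terabyte scan fee is included; S3 request charges are not.
    """
    total_gb = sum(gb_scanned_per_query)
    return total_gb / GB_PER_TB * SPECTRUM_RATE_PER_TB


if __name__ == "__main__":
    # Three hypothetical queries that together scan exactly 1 TB.
    print(f"${spectrum_scan_cost([200, 56, 768]):.2f}")  # prints $5.00
```

Since the charge depends on data scanned rather than data stored, anything that shrinks the scan (for example, querying fewer columns of a columnar file) lowers the bill.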
What if your data warehouse gets hit with a usage spike? Redshift's got you covered. With Concurrency Scaling, you can set up your data warehouse to automatically grab more resources when your needs jump dramatically, then automatically release these resources when they are no longer needed.
AWS Redshift pricing for Concurrency Scaling is a bit tricky. Every Amazon Redshift cluster earns one hour of free Concurrency Scaling for every day of normal usage, and each cluster can accumulate up to 30 hours of free Concurrency Scaling usage. If you go over your free credits, you're charged for the additional cluster(s) for every second you use them.
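The credit rules above can be sketched as a small day-by-day simulation. This is an illustration under stated assumptions, not Amazon's billing logic: it assumes the daily credit is earned before that day's usage is counted, and it takes the billable hourly rate as a parameter because the rate depends on your cluster.

```python
SECONDS_PER_HOUR = 3600

# Free credits accumulate up to 30 hours per cluster.
MAX_CREDIT_SECONDS = 30 * SECONDS_PER_HOUR


def concurrency_scaling_charges(daily_usage_seconds, hourly_rate):
    """Return the billable amount for each day of Concurrency Scaling usage.

    daily_usage_seconds: seconds of Concurrency Scaling used on each day.
    hourly_rate: assumed on-demand rate for the extra cluster, in dollars/hour.
    """
    credit = 0
    charges = []
    for used in daily_usage_seconds:
        # Earn one hour of free credit per day, capped at 30 hours.
        credit = min(credit + SECONDS_PER_HOUR, MAX_CREDIT_SECONDS)
        # Spend free credit first; anything left over is billed per second.
        covered = min(used, credit)
        credit -= covered
        billable_seconds = used - covered
        charges.append(billable_seconds * hourly_rate / SECONDS_PER_HOUR)
    return charges


if __name__ == "__main__":
    # Day 1: no usage. Day 2: 2.5 hours of usage against 2 hours of credit.
    print(concurrency_scaling_charges([0, 9000], hourly_rate=0.66))
```

In the example, the second day's spike exceeds the banked credit by 30 minutes, so only that half hour is billed.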
Amazon Redshift automatically backs up your data warehouse for free. But sometimes it's also useful to take a snapshot of your data at a point in time. For clusters using RA3 nodes, you will be charged for this additional backup storage at standard Amazon S3 rates. For clusters using DC nodes, you'll be charged for any manual backup storage that takes up space beyond the amount of storage included in the rates for your DC nodes.
In addition to offering on-demand rates, Redshift also offers Reserved Instances, which provide a significant discount if you commit to a one- or three-year term. The Amazon Redshift pricing page says that "customers typically purchase Reserved Instances after running experiments and proof-of-concepts to validate production configurations," which is a good practice to follow with any long-term data warehouse contract.
Tools for keeping your Redshift spending under control
Because many components of AWS Redshift pricing are dynamic, there's always a danger that your costs will spike. This can be a particular concern if you are trying to make your Redshift data warehouse as self-service as possible. If one department gets carried away in how hard they are hitting the data warehouse, it could blow a hole in your budget.
Luckily, over the past year Amazon has added a variety of features and tools to help you keep a lid on costs and catch increases in usage before they get out of hand. Here are a few examples:
- You can set up a cluster so it has daily, weekly, and/or monthly limits on the usage of Concurrency Scaling and Redshift Spectrum. And you can configure that cluster so that when it hits those limits, it either temporarily disables the feature, sends an alert, or logs the alert to a system table.
- You can set storage limits on schemas, which are named collections of database objects. For example, Yelp created a schema called 'tmp' where Yelp employees could prototype database tables. Yelp used to run into the problem that staff experiments sometimes ate up so much storage that they slowed down the entire data warehouse. After Redshift added controls for setting schema storage limits, Yelp used these controls to eliminate the problem.
- Redshift has added query monitoring that makes it easy to identify which queries are chewing up CPU time. That allows you to get on top of potential problems before they spiral out of control, for example by rewriting a CPU-intensive query so it's more efficient.
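As an illustration of the usage limits in the first bullet above, here is a sketch that assembles the settings for one limit. The field names and allowed values follow my understanding of the `create_usage_limit` call in boto3's Redshift client; treat them as assumptions and check the current AWS documentation before relying on them. The helper function and the cluster name are hypothetical.

```python
# Assumed mapping: Concurrency Scaling limits are time-based (minutes),
# Spectrum limits are based on data scanned (terabytes).
LIMIT_TYPES = {"concurrency-scaling": "time", "spectrum": "data-scanned"}
PERIODS = {"daily", "weekly", "monthly"}
# Assumed action names: log to a system table, emit an alert metric,
# or disable the feature until the period resets.
BREACH_ACTIONS = {"log", "emit-metric", "disable"}


def usage_limit_request(cluster_id, feature, amount, period="monthly",
                        breach_action="log"):
    """Build the keyword arguments for one usage limit (hypothetical helper)."""
    if feature not in LIMIT_TYPES:
        raise ValueError(f"unsupported feature: {feature}")
    if period not in PERIODS:
        raise ValueError(f"unsupported period: {period}")
    if breach_action not in BREACH_ACTIONS:
        raise ValueError(f"unsupported breach action: {breach_action}")
    return {
        "ClusterIdentifier": cluster_id,
        "FeatureType": feature,
        "LimitType": LIMIT_TYPES[feature],
        "Amount": amount,
        "Period": period,
        "BreachAction": breach_action,
    }


if __name__ == "__main__":
    # Hypothetical cluster: disable Spectrum after 10 TB scanned in a month.
    params = usage_limit_request("analytics-cluster", "spectrum", 10,
                                 period="monthly", breach_action="disable")
    print(params)
    # With AWS credentials configured, these could be passed along as:
    #   boto3.client("redshift").create_usage_limit(**params)
```

Keeping the validation separate from the AWS call makes it possible to review the limits you plan to set before anything touches a live cluster.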
In this article, we've given you an overview of how to understand Amazon Redshift pricing. The good news is that if you know what you're doing, Redshift's pricing structure lets you spend no more than is necessary to serve your data warehouse needs.
The bad news is that you really need to know what you're doing, because figuring out how much Redshift is going to cost you isn't simple.
Redshift's pricing structure is another example of the fundamental downside of using Redshift—it's a powerful tool for large, complex data warehouses, but using it requires a substantial investment of time to understand it well enough to use it effectively. So before you go down that road, you might want to ask yourself, is there a simpler solution for my business's needs?
If you're asking yourself that question, you might want to take a look at Panoply. Panoply is a modern, fully self-service data warehouse with integrated ETL that's designed from the ground up to give you a lot of power without the pain associated with more complex solutions.
What's more, Panoply's pricing structure is simple and transparent: You pay for the number of data sources you connect and the amount of data you store. That's it. You can query as much as you'd like and add as many users as you'd like at no additional cost, so you don't have to worry about your bill ballooning from month to month.
Regardless of which data warehouse you choose, having a good sense of your all-in costs is key. The reality is that the right data warehouse, in combination with a great data program, will be worth every penny.