If you're thinking about creating a data warehouse from scratch, one of the options you are probably considering is Amazon Redshift. Redshift is very powerful, but figuring out how much it'll cost can be tricky.
In this article, we'll explain how Amazon Redshift pricing works and what issues you'll need to consider to figure out how much it would cost to build your data warehouse on Redshift.
Redshift's pricing uses a lot of jargon. But only a handful of terms are critical:
You pay for nodes and clusters by the hour, with pricing depending on your location (the prices listed below are for Northern California).
If you were building a data warehouse on Redshift a year ago, you'd have to wrestle with questions such as, “do I want ‘dense compute’ or ‘dense storage’ nodes?” Understanding these options could be pretty confusing. But due to new services Amazon has rolled out, the choices are simpler.
Today, all you need to do is ask yourself, do I need more than 1 TB of storage?
If your data warehouse isn't going to be very big, odds are you'll need less than 1 TB of storage. In that case, the AWS Redshift pricing guide suggests you use DC2 nodes.
DC2 nodes are optimized for building small data warehouses that are fast and affordable. They give you a decent amount of computing power, and your data is stored on a zippy solid-state drive. DC2 nodes come in 2 sizes:
If your data warehouse is going to handle a ton of data, you'll need to use RA3 nodes. RA3 nodes offer a formidable amount of computing power. And since solid-state drives are too expensive to use for all of the data stored in a large data warehouse, RA3 nodes use a neat trick called Managed Storage, where Redshift dynamically keeps frequently used data on solid-state drives and offloads the rest of the data to Amazon S3.
With a smaller database, it makes sense to pay one rate for a preconfigured amount of computing power and storage, just like you might buy a preconfigured laptop on Amazon. But with big data warehouses, one-size-fits-all doesn't make sense. If you need gobs and gobs of storage, you shouldn't have to also buy a ton of computing power unless you actually need it. So with RA3 nodes, you pay separately for computing power and storage.
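To make the "pay separately" idea concrete, here is a minimal sketch of how an RA3 estimate splits into a compute part and a storage part. It assumes compute is billed per node-hour and storage per GB-month. The function name, the 730-hour month, and every rate in the example are placeholders for illustration, not actual AWS prices.

```python
HOURS_PER_MONTH = 730  # common approximation: 365 days * 24 hours / 12 months


def ra3_monthly_cost(node_count, compute_rate_per_node_hour,
                     storage_gb, storage_rate_per_gb_month):
    """Estimate a monthly RA3 bill as compute plus storage (hypothetical helper)."""
    # Compute scales with how many nodes you run and for how long.
    compute = node_count * compute_rate_per_node_hour * HOURS_PER_MONTH
    # Storage scales only with how much data you keep, not with node count.
    storage = storage_gb * storage_rate_per_gb_month
    return {"compute": compute, "storage": storage, "total": compute + storage}


# Placeholder rates only; look up the current prices for your region.
estimate = ra3_monthly_cost(node_count=2, compute_rate_per_node_hour=1.0,
                            storage_gb=5000, storage_rate_per_gb_month=0.02)
print(estimate)
```

Because the two terms are independent, doubling `storage_gb` changes only the storage line, which is the point of the RA3 model.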
Regardless of which type of nodes you use, you may also be charged for importing data into and exporting data from your Redshift data warehouse. If the data you are transferring is stored in Amazon S3 and that data is stored in the same Amazon Web Services region as your data warehouse, you won't be charged for data transfer. If not, you'll be charged at standard AWS transfer rates.
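The transfer rule above boils down to a single check. The sketch below encodes it as described in this article; the function name, the region strings, and the per-GB rate are hypothetical, and real AWS transfer pricing has more cases than this.

```python
def transfer_cost(gb, source_is_s3, source_region, warehouse_region, rate_per_gb):
    """Estimate a data transfer charge under the rule described above."""
    # Free only when the data sits in S3 in the same region as the warehouse.
    if source_is_s3 and source_region == warehouse_region:
        return 0.0
    # Otherwise a per-GB rate applies (placeholder for the standard AWS rate).
    return gb * rate_per_gb


# Same-region S3 load versus a cross-region one, with a made-up rate.
print(transfer_cost(500, True, "us-west-1", "us-west-1", rate_per_gb=0.02))
print(transfer_cost(500, True, "us-east-1", "us-west-1", rate_per_gb=0.02))
```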
If you have more sophisticated needs, Redshift offers a variety of optional features. Here are a few of the more frequently used options you might consider adding to your Redshift setup. They may cost a little extra, but they could save you time, hassle, or unexpected budget overages.
One of the real pains about building a data warehouse is that you have to import all of the data you're going to use even if you only infrequently use most of that data. But if you store a lot of your data on AWS, Redshift can query that data without importing it:
What if your data warehouse gets hit with a usage spike? Redshift's got you covered. With Concurrency Scaling, you can set up your data warehouse to automatically grab more resources when your needs jump dramatically, then automatically release these resources when they are no longer needed.
AWS Redshift pricing for Concurrency Scaling is a bit tricky. Every Amazon Redshift cluster earns one hour of free Concurrency Scaling for every day of normal usage, and each cluster can accumulate up to 30 hours of free Concurrency Scaling usage. If you go over your free credits, you're charged for the additional cluster(s) for every second you use them.
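The credit mechanics are easier to follow as a day-by-day simulation. This is a rough sketch under stated assumptions: each day's free hour is credited before that day's usage is counted, usage is measured in seconds, and the per-second rate is a placeholder. The function and constant names are made up for illustration, and AWS's actual accounting may differ in the details.

```python
DAILY_CREDIT_SECONDS = 3600       # one free hour earned per day of normal usage
MAX_CREDIT_SECONDS = 30 * 3600    # credits stop accumulating at 30 hours


def simulate_concurrency_scaling(daily_usage_seconds, rate_per_second):
    """Return (total_charge, remaining_credit_seconds) for a run of days."""
    credit = 0
    charge = 0.0
    for used in daily_usage_seconds:
        # Assumption: the day's credit is added first, capped at 30 hours.
        credit = min(credit + DAILY_CREDIT_SECONDS, MAX_CREDIT_SECONDS)
        # Only usage beyond the available credit is billed, per second.
        billable = max(0, used - credit)
        credit = max(0, credit - used)
        charge += billable * rate_per_second
    return charge, credit


# Two quiet days, then a spike of 3 hours 10 minutes on day three.
charge, credit_left = simulate_concurrency_scaling(
    [0, 0, 3 * 3600 + 600], rate_per_second=0.001)
print(charge, credit_left)
```

In this example the three accumulated free hours absorb most of the spike, and only the last 10 minutes are billed.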
Amazon Redshift automatically backs up your data warehouse for free. But sometimes it's also useful to take a snapshot of your data at a point in time. For clusters using RA3 nodes, you will be charged for this additional backup storage at standard Amazon S3 rates. For clusters using DC nodes, you'll be charged for any manual backup storage that takes up space beyond the amount of storage included in the rates for your DC nodes.
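The difference between the two node families comes down to how much of your manual snapshot storage is billable. Here is a small sketch of that rule as described above; the function name and the single `included_storage_gb` allowance are simplifying assumptions, not an AWS API.

```python
def billable_backup_gb(node_family, manual_backup_gb, included_storage_gb=0):
    """Return how many GB of manual snapshot storage would be billed."""
    if node_family == "RA3":
        # RA3: all manual backup storage is billed (at S3 rates).
        return manual_backup_gb
    if node_family == "DC":
        # DC: only the part beyond the storage included with the nodes is billed.
        return max(0, manual_backup_gb - included_storage_gb)
    raise ValueError(f"unknown node family: {node_family}")


# The same 400 GB of snapshots on each node family (sizes are made up).
print(billable_backup_gb("RA3", 400))
print(billable_backup_gb("DC", 400, included_storage_gb=320))
```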
In addition to offering on-demand rates, Redshift also offers Reserved Instances, which provide a significant discount if you commit to a one- or three-year term. The Amazon Redshift pricing page says that "customers typically purchase Reserved Instances after running experiments and proof-of-concepts to validate production configurations," which is a good practice to follow with any long-term data warehouse contract.
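One way to sanity-check a commitment is to compare what you'd pay over the term at a discounted reserved rate against what you'd pay on demand for the hours you actually expect to run. The sketch below assumes a reserved term is paid for in full whether or not the cluster is running. The discount, the hourly rate, and the function name are placeholders, not published figures.

```python
HOURS_PER_YEAR = 365 * 24


def compare_reserved(on_demand_hourly, discount, term_years, expected_utilization):
    """Compare reserved vs. on-demand cost over a term (hypothetical helper).

    discount and expected_utilization are fractions between 0 and 1.
    """
    term_hours = term_years * HOURS_PER_YEAR
    # Assumption: the reserved commitment covers every hour of the term.
    reserved_total = on_demand_hourly * (1 - discount) * term_hours
    # On demand, you pay only for the share of hours the cluster is running.
    on_demand_total = on_demand_hourly * term_hours * expected_utilization
    return {
        "reserved_total": reserved_total,
        "on_demand_total": on_demand_total,
        "reserved_is_cheaper": reserved_total < on_demand_total,
    }


# Made-up numbers: a 30% discount pays off at full-time use, not at half-time use.
print(compare_reserved(1.0, 0.30, term_years=1, expected_utilization=1.0))
print(compare_reserved(1.0, 0.30, term_years=1, expected_utilization=0.5))
```

Under these assumptions, a reservation only wins when your expected utilization is higher than one minus the discount, which is why validating the production configuration first matters.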
Because many components of AWS Redshift pricing are dynamic, there's always a danger that your costs will spike. This can be a particular concern if you are trying to make your Redshift data warehouse as self-service as possible. If one department gets carried away in how hard they are hitting the data warehouse, it could blow a hole in your budget.
Luckily, over the past year Amazon has added a variety of features and tools to help you keep a lid on costs and catch increases in usage before they get out of hand. Here are a few examples:
In this article, we've given you an overview of how to understand Amazon Redshift pricing. The good news is that if you know what you're doing, Redshift's pricing structure lets you spend no more than is necessary to serve your data warehouse needs.
The bad news is that you really need to know what you're doing, because figuring out how much Redshift is going to cost you isn't simple.
Redshift's pricing structure is another example of the fundamental downside of using Redshift—it's a powerful tool for large, complex data warehouses, but using it requires a substantial investment of time to understand it well enough to use it effectively. So before you go down that road, you might want to ask yourself, is there a simpler solution for my business's needs?
If you're asking yourself that question, you might want to take a look at Panoply. Panoply is a modern, fully self-service data warehouse with integrated ETL that's designed from the ground up to give you a lot of power without the pain associated with more complex solutions.
What's more, Panoply's pricing structure is simple and transparent: You pay for the number of data sources you connect and the amount of data you store. That's it. You can query as much as you'd like and add as many users as you'd like at no additional cost, so you don't have to worry about your bill ballooning from month to month.
Regardless of which data warehouse you choose, having a good sense of your all-in costs is key. The reality is that the right data warehouse, in combination with a great data program, will be worth every penny.