Your company has reached the point where the only way to analyze your data on the scale needed to help your business succeed is to build or buy a data warehouse. How do you figure out how much it's going to cost?
The unhelpful but oh-so-true answer: it depends. Implementing a data warehouse can cost anywhere from $200 to $2 million depending on the complexity and uniqueness of your data, the scale of your organization, whether you build the data warehouse from scratch versus buying a Data Warehouse As a Service (DWaaS), pricing structures, etc.
And implementing the data warehouse is only one piece of the cost. You'll also need to consider ongoing costs such as maintenance. If the costs varies so widely, how do you come up with an estimate you can have any confidence in?
In this article, we'll show you how to break down the components that go into the cost of a data warehouse, come up with a rough estimate, and get a return on your investment that will help your business thrive.
What am I paying for?
Ten years ago, the cost-benefit ratio for building a data warehouse wasn't that great.
Back then, most data warehouses were built from scratch and installed on prem. That meant a lot of money spent on hardware and software plus the people to maintain them. Even if the data warehouse was hosted elsewhere, you were still buying the hardware and software necessary for them to run.
For large organizations, it still paid off to build a data warehouse. Sure, it was very expensive and most of the money was spent on overhead. But it was the only way to obtain the critical business insights that could be derived from analyzing vast quantities of your data.
Today, the costs of a data warehouse are significantly less than they used to be. With Data Warehouse As a Service (DWaaS) providers such as Redshift, BigQuery, and Azure, infrastructure costs typically tend to be less, and some of the headaches of getting set up and maintenance have been taken off your shoulders. And with full-service DWaaS such as Panoply, setup and maintenance costs have dropped dramatically.
As a result, when you use a modern data warehouse, you spend less on overhead and more on uncovering insights hidden in your business data—insights that can make the difference between failure and success.
What are the components of a data warehouse's cost?
To estimate the cost of a data warehouse, there are several components you need to consider:
The most straightforward cost you'll pay is for your hardware and software infrastructure.
For traditional data warehouses where the hardware is either on premises or in the cloud, you'll need to pay for one or more servers—both the hardware and server software license—and the storage.
For example, most traditional data warehouses have an area for development, testing and staging, and production. This often requires one server for development and testing, another for production. For storage, in addition to ensuring your development/testing and production servers have enough disk space, you may need to add more storage to your backup systems.
If you're using a standard DWaaS platform, pricing can differ widely between vendors. And as you'll see below, making an apples to apples comparison is extremely challenging. For example, many vendors use a proprietary method for calculating how much computing power costs per unit of time. As a result, it's almost impossible to come up with a meaningful estimate of a vendor's compute costs without running some tests. So if you're going to try to decide which vendor to choose based in part on price, you're going to have to spend a significant amount of time working through each model in order to come up with estimates.
Internal headcount and external consultants
Hardware and software are only one part of the story. An equally critical component is the cost of the highly skilled people you need to develop and maintain the data warehouse.
Whether you hire staff or outsource the work to consultants, you'll need to factor in the cost of the following:
- Analysts to figure out what problems the data warehouse will be designed to solve.
- Data engineers who can model the data, optimize the data warehouse for performance, etc. Some of these skills are the same across vendors, but some (such as optimization), require data engineers who have an intimate understanding of a cloud service provider's tools and environment; having staff who are Azure-certified doesn't help much if you decide to go with Redshift.
- Managers to oversee your initial data warehouse setup. And if your data warehouse is going to require additional dedicated staff to continue to maintain and develop it, you may have to factor in the cost of increasing the number of managers who these new staff report to.
The other human capital cost you'll need to take into consideration is hiring, training, and turnover. This can be a significant cost for data engineers, who are in short supply.
Thankfully, by taking tedious data tasks like ETL pipeline management and schema setup off your team's plate, managed data warehouses like Panoply can significantly reduce the money you spend on headcount.
The last component of a data warehouse's cost is often overlooked: opportunity costs.
If it takes you a year instead of two months to build your first version of your data warehouse, that will cost you an additional 10 months when you are deprived of analytical insights into your business—time when your competitors might gain an edge.
If you're a midsize business and implementing your data warehouse sets you back hundreds of thousands of dollars, you might end up delaying the launch of a product whose cost you otherwise could have afforded.
And then there's the hard to qualify but equally critical opportunity cost of your company's focus. Traditional data warehouses, for example, are notorious for chewing up a lot of staff bandwidth—and not just IT staff but business owners and users as well.
Losing bandwidth doesn't just mean that other needs won't get taken care of. Let's suppose you're working in a rapidly growing company, where there are too many balls in the air. If a data warehouse project is eating up IT or engineering staffs’ focus, they may miss the signs that a mission-critical IT system is in serious trouble and is about to land your company in a major crisis.
How do I estimate the cost of my data warehouse?
To come up with a rough estimate of the price tag for your data warehouse, you'll need to figure out the following:
If you are using a SaaS data warehouse, you'll need to start by understanding the factors in a SaaS vendor's pricing structure and then estimate as best you can what your needs are. For example, if the vendor charges per terabyte of storage, you'll have to estimate how much storage your data warehouse will need.
If you're lucky, you either know or can reach out to people in your industry who have built a similar type of data warehouse. If not, estimating can get tricky. Usually your vendor will have some advice based on their experience. But you may also want to see if you can run a trial or pilot project that will give you a rough idea of what you'll need.
No matter which route you choose, starting with a rough guess and adding a cushion in case you end up using more resources than you expected is always a good idea.
Ongoing maintenance and engineering costs
For a managed data warehouse, maintenance costs are usually baked into the monthly fee. If you aren't using a full-service DWaaS, your best strategy is to talk to the vendors you are considering and reach out to people in your industry who own a similarly sized data warehouse.
Ongoing engineering costs tend to be pretty low for full-service DWaaS. For traditional DWaaS, engineering costs can be hard to estimate. There are some ongoing costs, such as basic data warehouse performance tuning, where your data engineers can probably give you a pretty good guess. The big unknown is how often you'll have to restructure parts of your data warehouse to accommodate changes in how staff are using it and how your business needs have evolved. Again, other people who've built similar types of projects can help give you a sense of the range of work that is reasonable to expect.
Once you have a rough idea as to how many data engineers with what mix of experience you'll need, you can use online tools to get a salary range for each type of engineer. For example, according to Payscale.com, in 2020 an experienced data engineer living in Silicon Valley would make between $93,000 and $194,000 a year, with an average salary of $140,000.
Other tool costs
The last cost you'll need to estimate are the additional tools you'll need to purchase.
Depending on which vendor you choose, you may need to buy tools that make it easy to transform your data. For many data warehouse projects, data engineers use third party ETL tools such as Stitch or Fivetran to dramatically reduce the amount of work it takes to clean up and structure your data. Alternatively, your staff may use more generic data transformation tools like Matillion or Alteryx.
Regardless of which DWaaS vendor you decide to use, you will almost certainly need to get BI tools, such as Tableau, Chartio, or Google Data Studio, to make it easy for users to analyze and visualize the data that's in your data warehouse.
Top cloud data warehouse pricing structures
While there are dozens of well-known cloud-based data warehouse vendors, the following is a brief overview of the pricing structures of the vendors you're most likely to consider. Most DWaaS pricing structures are quite complex—e.g., some cost components vary by location. These summaries are just intended to give you a feel for the pricing differences among these vendors.
To figure out how much Amazon's Redshift will cost you, start by estimating what type and how many "nodes" you'll need. If your data warehouse is likely to be less than 1 TB, Amazon suggests you choose DC2 nodes, which run from $0.25/hour to $4.80/hour. If you need lots of storage, you'll need to go with RA3 nodes with managed storage. RA3 nodes use solid-state drives for local storage and automatically offloads the rest of your data to Amazon's S3. RA3 nodes cost from $0.85/hour to $6.80/hour, and you'll pay a separate charge of $0.024/GB/month for storage. Redshift charges extra for backup storage and data transfer, and it also offers options such as concurrency scaling.
In addition to offering on-demand rates, Redshift also offers Reserve Instances, which provide a significant discount if you commit to a one-or three-year term. Amazon's pricing page says that "customers typically purchase Reserved Instances after running experiments and proof-of-concepts to validate production configurations"—a good practice to follow for any standard DWaaS data warehouse vendor who offers discounts for long-term contracts.
The main factors that go into estimating the cost of Google BigQuery are storage and compute. Storage will run you $0.02/GB/month; any data that isn't used for 90 days just automatically moved to long-term storage, which costs $0.01/GB/month. You pay an additional fee of $0.01 per 200 MB of streaming inserts.
Calculating compute usage is more complicated. You'll need to estimate the number of users per day X number of queries per users X average data usage per query; you'll pay $5 per terabyte of query usage after the first TB. If you prefer a flat fee, BigQuery offers $10,000 a month -- or $8500 a month if billed annually -- for 500 "slots." What are slots? Google unhelpfully defines them as "a unit of computational capacity required to execute SQL queries".
Microsoft used to offer a DWaaS that just covered data warehouses. Starting in the fall of 2019, they switched to offering Azure Synapse Analytics, which provides data warehouse, data lakes, and big data analytics services. Data storage is billed at $122.88 per terabyte per month, and you can also purchase disaster recovery, threat detection, and additional support.
Like BigQuery, Azure has its own method for calculating compute: Data Warehouse Units. 100 data warehouse units will cost you $1.21/hour. Azure also offers a reserved capacity option where you sign a longer-term contract in exchange for discounts.
Snowflake has an unusual pricing model for compute. Snowflake requires that you buy a Virtual Warehouse consisting of either one server or a cluster of servers. How much a Virtual Warehouse costs depends on which version you decide to choose; the cheapest is two credits/hour, which costs $2/hour. Virtual Warehouses are configured so that if your users are no longer running queries against a cluster, it'll automatically stop, then automatically resume when users start executing new queries.
Snowflake storage begins at a flat rate of $23 per terabyte per month. They also charge for a variety of administrative services. Like the other standard DWaaS vendors, you can choose between on-demand versus pre-purchased at a discount rate.
Cloud data warehouse pricing structure comparison
If all those different pricing models started to make your head spin, here's a summary of how each cloud data warehouse structures its pricing. At this point, you can probably tell that we have a bias for offering approachability, both in our product and our pricing.
Number of Nodes
Price of Nodes
Streaming Insert Quantity
Number and Complexity of User Queries
Integrated Data Sources (No additional ETL Costs)
ETL Tool Costs
Data Warehouse “Units”
ETL Tool Costs
Data warehouses are much more affordable and involve fewer headaches than they used to. But as we've seen, the process of estimating a data warehouse's cost is pretty daunting if you are considering a traditional DWaaS vendor. If you're going that route, you should plan on devoting a considerable amount of time and effort to it. And you should be prepared for the fact that until your users start to intensively use your data warehouse, you won't really know how accurate your estimate was.
Luckily, at Panoply, we make pricing out your data warehouse a snap. Instead of wasting hours coming up with an educated guess as to how many "nodes" or "slots" you need, you can just choose a pricing tier and you're good to go. And if you underestimate or overestimate your business needs, changing your tier is easy and relatively inexpensive. That means you can spend less of your time playing elaborate guessing games and more of your time figuring out how to discover insights in your data that will help your company grow.