Imagine that you are in a meeting with a potential investor, and they ask you for specific information about your company to help them decide whether to invest in your company or not. Maybe they ask you for sales records, the annual growth rate per sales representative, or the marketing channels with the highest ROI.
Could you answer those questions during the same meeting, or would you have to request answers from your IT team? How many people would have to be involved? How long would this data gathering process take?
As your business grows, finding this kind of data becomes an increasingly important and difficult job. One solution to problems like this is a data warehouse. A data warehouse is an analytical database that serves as your company’s source of truth for actionable data and insights across the organization.
When do you need a data warehouse?
Having a data warehouse becomes more critical the larger your organization grows, but even smaller companies with a lot of data can benefit from having one. Typical use cases include:
- Integrating data from disparate sources
- Simplifying the presentation and availability of data
- Creating forecasts for resourcing decisions
These use cases are common across many businesses, but setting up a data warehouse is an investment. How do you know when it’s worth it?
Here are a few situations that might indicate you need a data warehouse sooner rather than later:
1. You have several data sources to query
How do you handle queries that require data from multiple in-house and third-party services? For example, you might store some customer data in your application database, but other information might be locked away in a cloud service like Salesforce or HubSpot. Each data source stores part of the data you need to build a complete customer profile, but consolidating these sources can be a huge challenge.
A robust data warehouse will extract, transform, and load (ETL) your data to help you consolidate different sources into a central repository so you can view useful summaries or projections that no single system could provide. This is also where tools like Panoply can be really useful as they can pull in data from multiple sources, removing the need for an additional ETL tool.
2. Your data must be transformed
In an ideal world, all the information you need maintains the same structure, regardless of where it came from.
In the real world, we have legacy systems, multiple operating systems, and different programming languages that all treat common entities like dates, time zones, currency, and arrays differently.
This is where the “T” in ETL comes in. As you extract data from each source, you can transform it and save it in a common format. This allows you to standardize data types, remove corrupted data, and even apply custom business rules depending on how you plan to use the data.
3. You have a high volume of data
All of the data retrieval operations above become more expensive and difficult the more data you have. Running inefficient queries on 5 gigabytes of data isn’t really a big deal on most modern machines, but running inefficient queries on 5 petabytes of data is going to be problematic.
If a database query cannot be run or common queries need to be run in batches, you need to start investigating a data warehouse.
There’s not a magic line where you know you need a data warehouse, but I would say that if a database query cannot be run or common queries need to be run in batches, you need to start investigating it. More data is inherently hard to manage, but having a data warehouse to centralize and optimize it will make a huge difference.
4. You need user or team-specific dashboards and reports
As your company grows, different stakeholders will need access to your data for different reasons. For example, the customer service team might care about usage patterns, while the sales team might care more about customer lifetime value and upgrade patterns. The marketing team might care more about acquisition source, while the engineering team might care about response time for each customer.
When querying your data is a difficult manual operation, you’re not likely to create hundreds of queries to account for each use case, so you’ll probably try to run a single report that covers everything. This is messy, slow, and error-prone.
A data warehouse can offer different views or dashboards for different stakeholders. Because data can be replicated and flattened in multiple ways, each of these queries can be immediate and tailored to the requesting user.
5. Your data is stored in high-availability (HA) systems
Uptime is always important, but in some applications, high-availability is a contractual or legal requirement. For example, your primary application database is likely critical to your business operations, so executing a long-running query that requires a lock on the database is a non-starter.
Some companies work around this by processing requests at night or during scheduled outages, but this is limiting. These delays mean that your business teams cannot independently gather important data quickly, and the loss of agility will put you at risk if your competition is able to respond to changes faster than you are.
6. You run a lot of complex queries regularly
Think about the kinds of reports you run regularly. For example, maybe you need to know how many licenses have been sold since the beginning of the quarter? Or which customers have been your most active users? Or which are at the highest risk of churn? How many databases and tables do you have to touch to get this information?
Queries like those above are often executed by running complex SQL queries aggregated by date or customer ID. While you can index a SQL database on these fields, indexes have limits and get even more difficult when data is spread out across multiple shards. Having a SQL wizard on your team can help, but relying on a few experts to tweak and maintain these queries is a huge bottleneck.
One of the main advantages of having a data warehouse is that many of the common queries you run will be easier because the data model in your warehouse can be optimized for the kind of information you need to gather.
This problem is just as true in NoSQL databases where table scans are the fallback method when an index is unavailable. Re-indexing large collections is expensive and requires even more storage space as you grow.
One of the main advantages of having a data warehouse is that many of the common queries you run will be easier because the data model in your warehouse can be optimized for the kind of information you need to gather. Data can be flattened or replicated as needed without modifying your production data models or over-indexing your data.
7. You need to apply data mining or machine learning to your data
Finally, what happens when even dashboards and reports aren’t enough to spot trends or complex multivariate relationships?
You can use machine learning analysis to understand much more about your data: predicting buying patterns, sending personalized offers, building retention models, and more. The problem is that mining useful analysis like this almost always requires a dedicated data warehouse to centralize everything.
Once you have a warehouse, you can use tools like Weka or RapidMiner to discover insights from your data. Once you’ve built the necessary models, you can run them to respond immediately to changes or abnormalities.
Do you need a data warehouse?
While having a data warehouse is helpful for growing businesses at a certain scale, not every company needs one right away. If you are part of a small team with a single database, you may be able to get by with a read-only replica and a few SQL queries. So long as you never outgrow this solution, it might be a simpler option.
Having a data warehouse also requires you to be clear about what you want to measure, control, or learn. Without clear goals, it’s impossible to know how to structure the data in your data warehouse, so it will just add to your maintenance overhead. Building a data warehouse can be expensive and time-consuming, so it’s not an undertaking to be taken lightly.
That said, tools like Panoply make it much easier to integrate data sources and build your own cloud data warehouse. Panoply offers a ton of integrations and automated data extraction and storage processes. They offer pricing that is very manageable for mid-sized businesses and include live support with each package.
To learn more about how Panoply's making data warehousing easier for companies like yours, book a personalized demo.