Data Warehouse Design: The Good, the Bad, the Ugly

Written by Daniel Paes | Nov 17, 2021 2:44:00 PM

One of the significant challenges in handling data comes from the massive amount of data your organization is most likely generating constantly. Having the ability to react quickly and accurately turns into a differential item among competitors.

More and more companies are becoming aware of the need to handle data wisely.

They're looking for ways of getting the most out of their information. And this is probably where they first hear about data warehouses.

I want to present data warehouses and their components in this article.

To start, you'll get an overview of what a data warehouse is. Then we'll go over some considerations that I find mandatory when building a data warehouse.

To conclude, I'll do a quick recap and then share some other concerns and clarifications.

The basics of data warehousing

Data warehouse databases (DWs for short) are a decision support system. Once you understand how they can make the analysis of historical data much easier, it's easy to see why people call them warehouses.

As the name suggests, this decision support system helps in the decision-making process by using available data from within the company.

DW is a result of solving questions around specific subjects.

It has a different approach from the typical operational applications, such as billing or human resource systems. The latter focuses on operational needs such as pay and resource allocation, for example.

And due to DW systems' peculiar characteristics, such as being time-sensitive and not having detailed information such as phone numbers, they aren't a good option for operational reports.

Staying safe during data warehouse implementation

I can't stress enough the importance of having a solid security policy for each phase of your data warehouse rollout.

Simply put, the more closely you allow your users to the source systems so they can explore your data, the more they could potentially access without the correct permissions. And no one wants a data leak.

On top of your existing policies, your data warehouse must follow different requests.

DW databases can solve critical decision support tasks; however, in some cases, the only place this information exists is within sensitive data. In the best "for your eyes only" style, sometimes it's good to restrain those who don't truly need access.

The data warehouse fun begins

Now we've got our bases covered in terms of what a data warehouse is, it's time for us to define who will be its users and what subjects it'll cover.

It's always a good starting point to inform department stakeholders in the early stages of your data warehouse development. After all, they'll know who's responsible for or interested in the resulting key performance indicators, or KPIs for short.

Remember that they can help you solve problems like "Where can I find data that could help me?" or "What will we use to analyze that?"

Which DW building technique should you use?

It's also important to choose the best technique to follow as you build your DW.

The most famous one is the Kimball approach. (I'll describe its phases a bit later in this article.) Ralph Kimball is considered the father of data warehousing.
The Inmon method is the oldest approach, but it's a solid choice.
I also want to mention data vault modeling and CDP/DMP if you're involved in AdTech.

I always like to clarify that although some characteristics of a customer data platform (CDP) make it seem like a data warehouse, it isn't.

A CDP can work as a data lake for your data lakehouse, though. To find out more, check out this CDP guide.

Giving your data warehouse layers

So far, we've clarified who your team is and what your needs are. Now, let's see the layers (or phases) on how your data flows from your applications into your data warehouse.

Your OLTP layer

At this stage, you should start by defining what you'd use as a data source to create your dashboards.

The goal is to build your OLTP (online transaction processing) layer, also known as the staging area or landing zone area. It's where you define the retention plan for this data.

Keep in mind that this stage shouldn't have any data transformation, as the data remains available if some reload is needed on the next phases.

This area lets you complete concurrent processes while the source applications are running. It's a data source for the next layer, the ODS area.

Your ODS layer

OK, let's say you've got your data sources configured and pumping data into your staging area. Great! But you'll need to make some sense of your disparate systems.

Think about it: How can you certify that the tables have the same kind of information? Will you need to have a conformed format for your data?

Here's where the ODS (operational data store) layer is helpful.

The ODS puts all your data sources in order in the correct format. It also removes duplicated data.

In other words, it helps you develop the best and most unified operational database possible.

Cube models and data granularity

Once your sources are busy creating useful data for your analysts, you're ready to create your cube models. Here's where you'll use techniques such as iceberg queries and window queries.

At this point, you've advanced far beyond the operational nature of your data.

These techniques transform your data, allowing you and your team members different levels of analysis. What we call data granularity helps you gain more specific, more targeted & more useful insights from the same dataset.

Let's imagine that you want to look for an address of a retailer. That information is of high granularity. In other words, it has a specific value to look for (the store address).

But what if you want to see how many stores are in a specific county?

In that case, you'll have data of low granularity, and you'll need to count your addresses to produce this outcome.

You can find out more about granularity here.

Setting up the infrastructure, in the cloud or on-premises

The idea of business intelligence comes from a simpler era regarding data structures and interest in extracting information from the data.

Most data sources originated in structured formats (most from relational databases) or semi-structured formats from web services (such as the XML format of the good old SOA services).

In the past, people needed to rely on heavily regular expressions to extract data from a single field; therefore, it made sense to invest in big guns to store your data, as it was supposed to increase over time exponentially.

What's called the "public cloud" presented an attractive option, eliminating the need for these expensive investments.

Cloud providers, including AWS and Azure, are acquiring more evangelists on their data storage solutions.

The serverless approach is exciting and saves time. However, the configuration and management of these new services might be cumbersome to your existing environment.

For this reason, I advise you to go easy on migrating and rebuilding your existing pipelines into new ones. Instead, choose less critical ones to start with and then evolve into the cloud. (Some call this the hybrid approach.)

What we've covered so far about data warehousing

Coming up with the correct design for your data warehouse is challenging all by itself.

Even for the most experienced data architect, focusing on unclear requirements and discovering more complex ones during development makes having a simple dashboard a distant dream.

Therefore, let's consider the concerns you should have in mind.

Other approaches besides Kimball

In this article, I focused on the Kimball approach because it's the most used one. But what if you're interested in different techniques?

You can investigate another old and reliable framework—the Inmon approach.

Another approach worth finding out about is the data vault. It's an excellent option for fast-changing schemas.

Also, it's my preferred method when modeling for streaming data.

Finding the right tools for your data warehouse

Sometimes it's hard to find the right tools to handle all your data sources and their pipeline ingestions. In part because data warehouse techniques are technology agnostic by their nature.

Panoply can handle most of the bottlenecks related to data ingestion and data warehouse object management with the ease of a few clicks.

With Panoply you can empower your users with a friendly interface, making it easy to have quicker insights from unexpected sources.

You can sign up for a free demo, and there's also a helpful blog.

View full post