A data warehouse aims to make sense of a specific subject over time by analyzing historical data. This system, also called a decision support system, can answer questions as diverse as the percentage of client churn or beer consumption trends within a geographical area.
Incremental loading is one of those crucial issues you need to consider when defining your load pipelines.
In this post, I'll explain what incremental loading is and why it's so important. I'll continue by defining some key components and end with a short conclusion.
What is incremental loading, and why do we need it?
Also known as "change data capture" and "delta load," an incremental load picks up only the most recent data: the records that have changed since the last run. It's essential for keeping your data warehouse consistent and reliable.
With that said, to understand the benefits of an incremental load, you need to understand what a data warehouse is.
So I'll start with some basics on the data pipeline steps you need to take when loading your data warehouse's databases. Doing so should give you a crisp and clear understanding of why incremental loading is essential.
Put into simple terms, data warehouses are databases that support the decision-making process. They make it possible for users to analyze data over time.
Below is a quick refresher of the most-used data warehouse phases under the Kimball approach:
Data warehouse phases using Kimball
- Online Transactional Processing (OLTP / Staging area / Landing Zone): Data is extracted from its source systems into transitional storage. This is data in its raw state, as close as possible to its source.
- Operational Data Stores (also known as ODS): A database with all necessary quality processes and correlations among different sources. The data here is in a state curated for users.
- Data Warehouse (also known as DW / EDW): Here data is stored in tables called dimensions or facts. Data redundancy is not an issue, as the goal is to show how the data evolves over time.
- Online Analytical Processing cubes (also known as OLAP / Presentation Layer): This is the final data that users access for their reports, with specific metrics as needed.
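To make these phases concrete, here's a minimal Python sketch of a record flowing from staging through the ODS and warehouse into a presentation-layer metric. The table layouts, helper names, and quality rule are hypothetical, not tied to any particular tool.

```python
# An illustrative sketch of the Kimball-style flow described above:
# staging -> ODS -> DW -> OLAP. All names here are hypothetical examples.

def load_staging(raw_rows):
    """Land the data exactly as extracted from the source (staging / landing zone)."""
    return list(raw_rows)  # raw state, no transformation

def load_ods(staging_rows):
    """Apply quality rules so the data is in a state curated for users."""
    return [
        {**row, "name": row["name"].strip()}        # example cleanup rule
        for row in staging_rows
        if row.get("customer_id") is not None        # example quality rule
    ]

def load_dw(ods_rows):
    """Store facts/dimensions; keep versions so evolution over time stays visible."""
    return [{**row, "version": 1} for row in ods_rows]

def build_olap(dw_rows):
    """Aggregate into the metrics users see in the presentation layer."""
    return {"customer_count": len(dw_rows)}

raw = [{"customer_id": 1, "name": " Adam Smith "}, {"customer_id": None, "name": "John"}]
print(build_olap(load_dw(load_ods(load_staging(raw)))))  # {'customer_count': 1}
```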
You can't (or at least you shouldn't) have detailed information such as email addresses in your data warehouse.
Since the staging area isn't suited for transactional queries, you should direct this type of request at the source's transactional stage. The ODS works better for operational inquiries such as "who did what, where?"
Are you still with me? Good.
Now that we have our bases covered, we can move on to what you need to consider when designing your load structure.
How your data warehouse is loaded
Every application with meaningful data for your data warehouse is called a data source.
Each data source has its own constraints when loading data into a data warehouse, and loads fall into two major groups: streaming and batch.
- We use the streaming approach when data is available continuously. Picture your health tracker data coming from your smartwatch.
- We use batch techniques when your data source provides bulk data, such as a retailer application's point-of-sale data.
The batch approach is more common when your source system has a specific loading window to avoid contention with its internal processes.
A micro-batch is a mixture of the two, as it combines the continuous load approach with the periodicity found in the batch approach. Micro-batches are convenient when your data comes in a constant flow and can be grouped into smaller batches, typically no more than ten minutes apart.
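As a rough illustration of the micro-batch idea, here's a small Python sketch that buffers a continuous stream and flushes it in small batches on a timer or a row limit. The ten-minute interval and row limit are illustrative defaults, and `stream` and `flush` are hypothetical stand-ins for your own source and staging sink.

```python
import time

# Toy micro-batch loop: buffer a continuous stream and flush it as small
# batches, either every `interval_seconds` or once `max_rows` is reached.
def micro_batch(stream, flush, interval_seconds=600, max_rows=1000):
    buffer, last_flush = [], time.monotonic()
    for record in stream:
        buffer.append(record)
        due = time.monotonic() - last_flush >= interval_seconds
        if due or len(buffer) >= max_rows:
            flush(buffer)          # one small batch load into staging
            buffer, last_flush = [], time.monotonic()
    if buffer:
        flush(buffer)              # flush whatever is left at shutdown
```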
Now that you have a high-level understanding of how your data warehouse gets filled, we can move on to the next step: identifying the business keys.
What are business keys?
Every data source has what is called "business keys." These keys help us identify what gives each record its uniqueness from a business perspective for that specific application.
It's important to have a solid central repository where you can find all information about your data sources. A data catalog can assist here, giving you a single place to explore all of them.
Panoply can help you tame your data while you enjoy your coffee (or tea, if you prefer) without worrying about it.
You'll need business keys to identify unique records. Because of the high level of detail available, the table below has high cardinality, or "high data granularity."
Cardinality example:
| ID | Name | Profession | Address |
|----|------|------------|---------|
| 1 | Adam Smith | Fireman | 4389 Boul. Faraway App 6 |
| 2 | John Smith | Croupier | 4389 Boul. Faraway App 1 |
| 3 | Juliette Bull | Saleswoman | 64 Holest Nawy App 6 |
| 4 | Lucille | Croupier | 1234 Runaway Train |
| 5 | Samantha | Policewoman | 4389 Galway Road |
We can use a combination of "Name" and "Profession" in this example. Those are the business keys: the fields that provide unicity for business purposes.
For example, we could use the "ID" as the unique primary key, but it won't help much if the same ID references a different record in another source system.
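For illustration, here's a minimal Python sketch that derives a composite business key from the "Name" and "Profession" fields in the table above. The hashing approach and field names are just one possible way to do it, not a prescribed method.

```python
import hashlib

# Illustrative only: build a stable business key from "Name" + "Profession",
# the fields that make each record unique for the business.
def business_key(record):
    raw = f"{record['name'].strip().lower()}|{record['profession'].strip().lower()}"
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()

row = {"id": 1, "name": "Adam Smith", "profession": "Fireman"}
print(business_key(row))  # stable across systems, unlike the surrogate "ID"
```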
Once the business keys are clear, we can define the attributes and metrics upon which we'll base our understanding of the data trends in our data sources, yielding accurate insights.
How do business keys relate to incremental keys?
Once you know how to identify your data source's business keys, you must determine which field will serve as your load's incremental key. Working alongside the business keys, the incremental key is what lets you load only the new version of your data.
Based on both the business and incremental keys, you'll be able to correctly load only the latest version of the data. This version will update its existing counterpart in the staging/ODS phase.
Doing so will update your data warehouse with a new version of your record, leaving the previous version with a deprecated flag on it.
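Here's a minimal Python sketch of that behavior, assuming the source exposes an `updated_at` column that serves as the incremental key. The column names and the in-memory "warehouse" are illustrative only.

```python
# Sketch: load only rows newer than the watermark; keep old versions, flagged.
def incremental_load(source_rows, warehouse_rows, last_watermark):
    new_watermark = last_watermark
    for row in source_rows:
        if row["updated_at"] <= last_watermark:
            continue                                   # unchanged since last run
        key = (row["name"], row["profession"])         # business key
        for existing in warehouse_rows:
            if (existing["name"], existing["profession"]) == key:
                existing["deprecated"] = True          # flag the previous version
        warehouse_rows.append({**row, "deprecated": False})
        new_watermark = max(new_watermark, row["updated_at"])
    return new_watermark                               # persist for the next run
```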
In this scenario, I am not talking about slowly changing dimensions. The business and incremental keys are the ones that give us what we call the "unicity of the record," or what makes it unique, creating what is called data versioning.
Data versioning lets us differentiate between the record's current state and its previous states. It enables something called "temporal analysis" of the record. The data versioning of source systems is expected to exist in the ODS phase of your data pipeline.
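Building on the previous sketch (and its assumed field names), temporal analysis could then look like this: one helper returns the record's current version, the other its full history.

```python
# Temporal analysis over the versioned rows from the sketch above.
def current_version(rows, key):
    return next(r for r in rows
                if (r["name"], r["profession"]) == key and not r["deprecated"])

def history(rows, key):
    return sorted((r for r in rows if (r["name"], r["profession"]) == key),
                  key=lambda r: r["updated_at"])
```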
Conclusion
In this article, we covered incremental load and its importance for data integrity.
We started by presenting its other names (delta load and change data capture), reviewed what comprises a data warehouse based on the Kimball methodology, and covered the approaches for continuous and bulk data loads.
We ended by explaining business and incremental keys and distinguishing them from system keys.
Here at Panoply, we understand how burdensome all this is.
We always advise using the incremental load option whenever possible when loading your data. Doing so removes duplicated views of your data snapshot, reducing the burden of managing your data ingestion pipelines.
The incremental load is strongly recommended (even mandatory) when defining and developing your data pipelines, especially in the ODS phase. It can help you load data from your sources correctly and even scale your peak load processes by splitting your data into different pipelines.