One of the most critical steps in building a data warehouse or data lake is integrating your data sources into a single format.
Data integration can be done using Extract, Transform, Load (ETL) or Extract, Load, Transform (ELT) processes. While ETL is the traditional method, ELT has emerged as a more efficient alternative with the rise of cloud technologies. The core difference between the two is when and where data transformations take place. Here's a closer look at the architecture of ETL and ELT pipelines to help you make an informed decision.
So, which one should you use for your business? ELT or ETL?
In a previous post, I discussed ETL vs. ELT from a broader perspective, along with the advantages and disadvantages of each process. In this post, I'll go into the details of the architecture of ETL and ELT data pipelines.
To help you choose the best approach, I'll discuss what you should consider in your ELT design, best practices to follow, and challenges to overcome along the way.
To start, let's explore how data flows through an ETL pipeline. The following figure illustrates the key steps in ETL data flow:
Imagine you have a CSV file containing employee data and two relational databases with product data from two suppliers. You intend to collect this data, map it to your own format, and store it in your database or data warehouse.
In the first step, you'll gather these data sources and collect the data using an extraction tool.
Your input data can come in different formats from a variety of sources, such as flat files like CSVs, relational databases, and third-party applications and APIs.
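As a minimal sketch, here's what an extraction step could look like in Python with pandas; the file name, database, table, and columns are hypothetical placeholders for your own sources:

```python
import sqlite3

import pandas as pd

# Extract employee data from a flat file (hypothetical file name)
employees = pd.read_csv("employees.csv")

# Extract product data from a supplier's relational database,
# selecting only the columns the warehouse actually needs
# (hypothetical database file, table, and column names)
conn = sqlite3.connect("supplier_a.db")
products_a = pd.read_sql_query(
    "SELECT sku, name, price, quantity FROM products", conn
)
conn.close()
```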
However, some of your data sources can be updated after your initial data extraction.
While some sources may notify you when such changes occur, others may have no way to report modifications to their data. In this situation, you may have to re-extract the data, fully or partially, to capture the changes in your sources.
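One common way to handle this, sketched below, is to keep a high-water mark such as a last-modified timestamp and re-extract only the rows that changed since the previous run. The updated_at column and table name here are assumptions about your source schema:

```python
import sqlite3

import pandas as pd

def extract_changes(db_path: str, last_run: str) -> pd.DataFrame:
    """Re-extract only the rows modified since the previous pipeline run."""
    conn = sqlite3.connect(db_path)
    changed = pd.read_sql_query(
        "SELECT * FROM products WHERE updated_at > ?",
        conn,
        params=(last_run,),
    )
    conn.close()
    return changed

# Pull everything that changed since the last successful extraction
changed_rows = extract_changes("supplier_a.db", "2024-01-01 00:00:00")
```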
Furthermore, you might not need all the attributes in your supplier's product database. You may have to carefully plan and filter the data elements to extract from the sources.
Transformation is undoubtedly the most crucial step in the ETL process. This is where your input data is cleansed, mapped, and transformed to match the schema of your data warehouse.
First, you'll cleanse the data by removing duplicates, handling missing values, and resolving ambiguities in the data.
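A cleansing pass along those lines might look like the following in pandas; the column names are hypothetical:

```python
import pandas as pd

def cleanse(df: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicates and rows missing required fields."""
    df = df.drop_duplicates()
    # Flag rows missing key fields instead of silently loading them
    missing = df["sku"].isna() | df["price"].isna()
    if missing.any():
        print(f"Dropping {missing.sum()} rows with missing key fields")
    return df[~missing].copy()

raw = pd.DataFrame({"sku": ["A1", "A1", None], "price": [9.99, 9.99, 4.50]})
clean = cleanse(raw)  # one duplicate and one incomplete row removed
```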
Then, you'll perform transformations to map the data to your own format.
One of the most common transformations you'll come across is encoding and decoding values. For instance, your input data may use "Y" and "N" encodings, whereas your warehouse stores these values as "Yes" and "No."
Similarly, you may need to convert attributes such as gender, dates, and currency values to a consistent format.
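For example, a short mapping and normalization step like the one below covers both cases; the input columns and formats are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "in_stock": ["Y", "N", "Y"],
        "added_on": ["01/15/2024", "02/03/2024", "03/20/2024"],
        "price": ["$9.99", "$4.50", "$12.00"],
    }
)

# Decode "Y"/"N" flags into the "Yes"/"No" values the warehouse stores
df["in_stock"] = df["in_stock"].map({"Y": "Yes", "N": "No"})

# Normalize dates from MM/DD/YYYY into a single ISO format
df["added_on"] = pd.to_datetime(df["added_on"], format="%m/%d/%Y").dt.date

# Strip currency symbols so prices are stored as plain decimals
df["price"] = df["price"].str.replace("$", "", regex=False).astype(float)
```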
Another common transformation is aggregating and splitting values. You may calculate a total to aggregate several attributes, or split an address into multiple fields to make it easier to store and query.
Furthermore, you may join data from different sources to create combinations that are meaningful for your business.
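Here's a brief sketch of all three operations, again with hypothetical columns:

```python
import pandas as pd

products = pd.DataFrame(
    {"sku": ["A1", "A2"], "supplier_id": [1, 2],
     "quantity": [10, 5], "price": [9.99, 4.50]}
)
suppliers = pd.DataFrame(
    {"supplier_id": [1, 2],
     "address": ["12 Main St, Springfield", "8 Oak Ave, Shelbyville"]}
)

# Aggregate: compute the total inventory value per product
products["total_value"] = products["quantity"] * products["price"]

# Split: break a single address field into street and city columns
suppliers[["street", "city"]] = suppliers["address"].str.split(", ", expand=True)

# Join: combine product and supplier data into one business-ready table
combined = products.merge(suppliers, on="supplier_id", how="left")
```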
Once all the transformations are done, you'll move the data to a temporary staging area so that you can easily roll back in case of any issues.
In the final step, you'll load the staged data into the target data tables in the data warehouse. However, it can get more complicated in practical scenarios.
For instance, your supplier data can be frequently updated because of changes in product types and quantities.
In such scenarios, you may have to decide whether your ETL process should load all data each time or only the modified data. These approaches are known as full load and incremental load, respectively.
In addition, you may have to decide on the volume of data to load during each ETL process run. You can either load your data in batches or as streams if your data updates in real time.
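As a rough sketch, an incremental load that upserts staged rows in fixed-size batches might look like this; the table schema, key column, and connection details are hypothetical:

```python
import sqlite3

import pandas as pd

def incremental_load(df: pd.DataFrame, db_path: str, batch_size: int = 1000) -> None:
    """Upsert staged rows into the target table in fixed-size batches."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products "
        "(sku TEXT PRIMARY KEY, price REAL, quantity INTEGER)"
    )
    upsert = (
        "INSERT INTO products (sku, price, quantity) VALUES (?, ?, ?) "
        "ON CONFLICT(sku) DO UPDATE SET "
        "price = excluded.price, quantity = excluded.quantity"
    )
    rows = list(df[["sku", "price", "quantity"]].itertuples(index=False, name=None))
    for start in range(0, len(rows), batch_size):
        conn.executemany(upsert, rows[start:start + batch_size])
        conn.commit()  # commit per batch so a failed run can resume cleanly
    conn.close()
```

For a full load, you'd instead truncate the target table and reload everything on each run.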
As you may have realized, designing the ETL process can quickly become rather complex. Here are some of the challenges you may come across when laying out an ETL pipeline: detecting changes in sources that can't report them, deciding which attributes to extract, maintaining increasingly complex transformation logic, and choosing between full and incremental loads and between batch and streaming runs.
With cloud data warehouse adoption increasing and unstructured data being used more often in analyses, data integration is shifting toward ELT to manage large amounts of data.
In ELT, the transformation step happens after loading all input data to the data destination.
Your data destination can be a cloud data warehouse or a data lake. These destinations offer highly scalable processing power that lets you load and transform large amounts of data.
As the image above shows, the ELT pipeline is more straightforward than the ETL process.
You'll first extract and load your input data into staging tables in your data destination. After performing the transformations, usually written in SQL, the data is moved to the final data tables and stored for future use.
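Here's a minimal sketch of that pattern, using SQLite to stand in for a cloud warehouse; the staging and final table names, columns, and sample row are hypothetical:

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")

# Extract-and-load: raw data lands in a staging table as-is
conn.execute(
    "CREATE TABLE IF NOT EXISTS staging_products (sku TEXT, in_stock TEXT, price TEXT)"
)
conn.execute("INSERT INTO staging_products VALUES ('A1', 'Y', '$9.99')")

# Transform: the heavy lifting runs inside the destination, expressed in SQL
conn.execute("DROP TABLE IF EXISTS products")
conn.execute(
    """
    CREATE TABLE products AS
    SELECT
        sku,
        CASE in_stock WHEN 'Y' THEN 'Yes' ELSE 'No' END AS in_stock,
        CAST(REPLACE(price, '$', '') AS REAL) AS price
    FROM staging_products
    """
)
conn.commit()
conn.close()
```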
Best Practices to Follow With ELT
If you're using a cloud data warehouse and integrating data using ELT, you may still need to plan your data pipeline.
Here's a list of best practices to follow in this process. Sticking to them will go a long way toward building reliable and effective data pipelines.
At the end of the day, choosing between ELT and ETL can depend on the requirements of your business, the current system you have, and your future plans.
Whether you decide to use ELT or ETL, planning and implementing your own data pipeline can be complicated and inefficient.
If you're looking for a cloud data platform that will enable you to connect your input sources and start syncing your data within minutes, Panoply is for you.
Panoply offers automated integrations with a number of data sources and BI tools, so you can start getting insights from your data.
You can explore the Panoply platform with a 21-day free trial and see how it fits your data requirements.