
How to Improve the ETL Process Flow with Raw Data Processing

Written by Alon Brody | Feb 1, 2016 5:03:00 PM

ETL processes (extract, transform and load) might be fundamentally flawed.

When we save data, we do so for a purpose. In the context of reporting, that purpose is to enable us to answer questions about our business: what is the number of active users in our application, which platform is experiencing the fewest crashes, what is our monthly revenue, and so on. To do so, BI departments maintain databases fed by several ETL (Extract, Transform and Load) processes.

ETL processes will typically run every night to extract, transform and load all the relevant data into aggregated tables that populate the next morning's reports, answering a series of business questions we have decided to ask on a recurring basis. In theory, this sounds like an excellent solution: executives receive reports pertaining to their questions, and that data can be refreshed within seconds when needed. In reality, the solution is fundamentally flawed.
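To make that flow concrete before getting to the flaw, here is a minimal sketch of what such a nightly job might look like in Python. The sqlite3 connection stands in for whatever warehouse you actually use, and the events and daily_active_users tables (and their columns) are hypothetical names chosen for illustration.

```python
# A minimal sketch of a nightly ETL job: extract yesterday's raw events,
# transform them into a single summary row, and load it into the table the
# morning report reads. Table/column names are hypothetical; sqlite3 stands
# in for a real warehouse connection, and both tables are assumed to exist.
import sqlite3
from datetime import date, timedelta


def run_nightly_etl(conn: sqlite3.Connection, day: date) -> None:
    conn.execute(
        """
        INSERT INTO daily_active_users (day, active_users)
        SELECT date(event_time), COUNT(DISTINCT user_id)
        FROM events
        WHERE date(event_time) = :day
        GROUP BY date(event_time)
        """,
        {"day": day.isoformat()},
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect("warehouse.db")
    run_nightly_etl(conn, date.today() - timedelta(days=1))
    conn.close()
```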

Algorithms are perfect. The humans building them are not

The flaw is inherent not in the technology behind ETL but in the mental shortcuts we as human beings take when defining these processes. We can only see as far as the next fork in the road. Our understanding of what is important and what questions we should be asking is in a constant state of flux. The ETL processes we define to answer those important questions are not.

Let’s say we transform our data and maintain an aggregation that shows the number of installs per platform (iOS, Android and Desktop) per day. This is an easy transformation that will enable us to understand which platform brings the most users into the app on a daily basis. Two months down the line there is a drop in installs, and the aggregation indicates that since last Saturday, Android installs have dropped by 14% daily. We now have a piece of data that says X, and we need to figure out what it means and how to respond.
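As a rough sketch, that transformation can be a single grouped query over the raw events; the events and daily_installs tables below, and their columns, are assumptions for illustration. The second query shows the only kind of question the resulting table can answer.

```python
# Hypothetical nightly transform: collapse raw install events into one row
# per platform per day, then read the Android trend back out of the summary.
import sqlite3

AGGREGATE_INSTALLS_SQL = """
    INSERT INTO daily_installs (day, platform, installs)
    SELECT date(event_time), platform, COUNT(*)
    FROM events
    WHERE event_name = 'install' AND date(event_time) = :day
    GROUP BY date(event_time), platform
"""

# The summary table can only answer the question it was built for:
# installs per platform per day, nothing about why a number moved.
ANDROID_TREND_SQL = """
    SELECT day, installs
    FROM daily_installs
    WHERE platform = 'Android'
    ORDER BY day DESC
    LIMIT 14
"""


def android_install_trend(conn: sqlite3.Connection) -> list[tuple[str, int]]:
    return conn.execute(ANDROID_TREND_SQL).fetchall()
```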

The problem is that when we created this aggregation we eliminated variables that at the time seemed irrelevant. It could be that a competitor has outbid us on Android advertising, or that there is a new Android phone that our app doesn’t support yet. We don’t know. The aggregation we’ve been using for months is now obsolete, as it can’t answer this question.

Keep in mind that the purpose of aggregations is to save run time and disk space, and that the raw data will usually be backed up in alternate, cheaper storage. We can now change our ETL process and create a new aggregation for installs that takes into account the variables we have learned are important. Unfortunately, this aggregation table will only exist from this point onwards. Comparing trends and statistics to previous time periods is critical for context, so it’s a pity we didn’t think about this question when we first wrote our ETL process.

Raw data vs. granular aggregation

Truth be told, they are the same thing. The most granular aggregation will always be the raw data itself. Raw data contains a massive number of answers to questions we haven’t even thought to ask yet; all we need to do is figure out the right questions. Thankfully, current database technologies can handle billions of rows with very short query run times.

We no longer have to rely on ETLs that transform data with predefined queries. Instead, we should use ETLs that enrich the data; what I like to call EEL (Extract, Enrich and Load). Enrichment covers processes such as converting IP addresses to locations, translating epoch times and so on, relocating transformations to their proper place: the query side. You should therefore create aggregations only when you ask questions, as views, materialized views or complex queries.
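As a rough illustration of that EEL flow (again with sqlite3 and made-up table, column and function names standing in for your own stack), the sketch below enriches each raw row by translating an epoch time and looking up an IP address, loads it untouched, and defines the aggregation only at the moment a question is asked, as a view over the enriched raw rows.

```python
# A minimal EEL (Extract, Enrich and Load) sketch: raw rows are enriched with
# derived columns and loaded as-is; aggregation is deferred to query time.
# ip_to_country is a placeholder, not a real GeoIP API, and all table and
# column names are assumptions.
import sqlite3
from datetime import datetime, timezone


def ip_to_country(ip: str) -> str:
    # Stand-in for whatever GeoIP lookup your pipeline actually uses.
    return "unknown"


def enrich(row: dict) -> dict:
    enriched = dict(row)
    enriched["event_time"] = datetime.fromtimestamp(
        row["epoch"], tz=timezone.utc
    ).isoformat()
    enriched["country"] = ip_to_country(row["ip"])
    return enriched


def load(conn: sqlite3.Connection, raw_rows: list[dict]) -> None:
    # Load every enriched row; nothing is summarised away at this stage.
    conn.executemany(
        "INSERT INTO events (user_id, event_name, platform, event_time, country) "
        "VALUES (:user_id, :event_name, :platform, :event_time, :country)",
        [enrich(row) for row in raw_rows],
    )
    conn.commit()


def ask_installs_per_platform(conn: sqlite3.Connection):
    # The aggregation only comes into existence when the question is asked,
    # here as a plain view over the enriched raw rows.
    conn.execute(
        """
        CREATE VIEW IF NOT EXISTS daily_installs AS
        SELECT date(event_time) AS day, platform, COUNT(*) AS installs
        FROM events
        WHERE event_name = 'install'
        GROUP BY date(event_time), platform
        """
    )
    return conn.execute("SELECT * FROM daily_installs ORDER BY day").fetchall()
```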

Don’t get me wrong: I’m in favor of cutting down query times and getting answers as quickly as possible, but aggregations bind us to the same old questions, queries and reports.

There are other ways to reduce query time, such as running reports in advance for daily dashboards instead of relying on the transform process (trust me, you won’t feel the difference). Businesses want to be best in class, and to get there we can’t keep doing the same old things while expecting different results. By going raw with your data, you’ll be able to adapt to the future questions your business will need to ask without losing precious time, resources and, most importantly, data.