What is Data Transformation, Anyway?

By Peter Weinberg | August 24, 2020 | ETL/ELT

If you work with data of any real size, chances are you’ve heard of ETL before. If not (or if you just like having your memory refreshed), here’s a summary: ETL is a framework for moving data from transactional databases, where it’s collected, to analytical databases, where it can be processed for insights.

ETL, as the acronym suggests, is actually composed of three separate processes: Extract, Transform, and Load. The meaning of Extract and Load are pretty straightforward here; the data is extracted from the transactional database (or other source) and loaded into an analytical database.

The Transform step can be a bit fuzzier as a concept, especially given that its place in the order of operations has shifted a bit in the 50 years since ETL was introduced. Today, we’re going to dig into exactly what we mean when we talk about transformation, then look at why moving transformation to the end of the process—in other words, doing ELT instead of ETL—makes data processes faster and more efficient.

Data transformation, explained

So what do we mean by transformation? A lot of things, actually. Broadly, the transformation step covers anything you’ll be doing in order to prepare the data for analysis. This could be as simple as weeding out duplicate rows, re-formatting date- and timestamps, or splitting customer names into “first name” and “last name” fields, but it could also mean automatically validating records for completeness or log-scaling your data to set up for a machine learning project.

Let’s look at a concrete example. Suppose your analytics operation combines data from three different sources: a storefront API, a social media platform, and a web traffic analytics service. As an analyst, you want to be able to merge all this data seamlessly so that you can report the conversion rate from your most recent social media campaign.

Annoyingly for you, though, each of these data sources reports their date and time stamps slightly differently, and some of them don’t always return complete records. So in the transformation step, you need to delete columns that contain erroneous web analytics data, remove any incomplete or duplicate records, and convert the long-form date entries from your social media API into standard datetime format.

Traditionally, these sorts of transformations have happened prior to loading all your data into a data warehouse, so that everything is pristine, standardized, and ready for analysis before storage. But the transformation doesn’t necessarily stop after loading.

SQL-fluent analysts will often create views, which save query results for future reference, and that’s also a form of transformation. In short, transformation makes analysis easier, whether by structuring, standardizing, or just combining your data into a more useful form.

The shift to ELT

As we mentioned above, the ETL framework has been in place for around 50 years now, which is pretty impressive longevity in the tech world. ETL made sense when storage was expensive and slimming down your data meant major cost savings.

But as storage becomes ever-cheaper and even non-tech companies generate a flood of data, traditional ETL is a potential liability. Putting the transformation step in the middle of your data pipeline can create a bottleneck where data can be stuck in limbo in IT, waiting for its turn in the transformation queue, before being eventually loaded into the data warehouse.

Meanwhile, your analysts and data scientists will be twiddling their thumbs while they wait for the data they need to finish their analyses. Even worse, your analysts could get tied up doing minor transformations on data they won’t even need just to keep the data flowing, which means they’re doing busywork instead of analytics.

ELT eliminates this bottleneck by simply changing the order of operations. Instead of extracting everything, transforming everything, and then loading everything, in an ELT (Extract, Load, Transform) pipeline, the data is loaded into your warehouse just as soon as it’s extracted, essentially raw.

This means your analysts, who are capable of transforming data on the fly, will get all the data they need as soon as it’s ready, and the rest of your team won’t be left waiting for valuable insights. An added benefit of leaving transformation for last is that you won’t lose any granularity by modeling data before loading it into your warehouse, and you won’t need to retool your ETL pipelines down the line if you realize some valuable data points are being filtered out in your automated transformation step.

How to make ELT work for you

The speed and efficiency benefits of adopting an ELT approach are huge, especially for data operations on a tighter budget. It's tough to choose between expanding your data engineering team or your analytics team, or to dedicate analyst time to doing data engineering instead of reporting. Making the right decision here can be the difference between plateauing or hitting an inflection point for your growing business.

Fortunately, there are some ready-made solutions for this problem. Panoply’s combo of a data warehouse and built-in ETL takes the hassle out of data extraction and loading, eliminating engineering tasks that can eat into your analysts’ and data scientists’ time.

Plus, Panoply's Redshift architecture connects easily with your SQL workbench, making it simple to create all the transformations you need. Just connect a business intelligence tool for company-wide access to data that's always up-to-date, standardized, and ready to query.