ETL

Data Blending: What It Is And How To Do It

 

ETL, ELT, data joins, data blending. Does it all make you feel as mixed up as the ingredients in a blended protein-powder-banana-acai smoothie?

A Brief Definition

Simply put, data blending takes data from multiple data sources and combines that data into one useful dataset.

This is not a new concept. Data analysts estimate that 80% of their “analytics” time is spent creating new datasets, rather than analyzing data for insight. And we all know that valuable data sources are multiplying at a dizzying rate. So how can we combine data to mine for facts or correlations when sources are as diverse as databases, spreadsheets, social media, and pdfs?

ETL or ELT?

The creation of new datasets often requires heavy lifting. Maybe you need to write a new software program, and writing it’s time-consuming. Or maybe you need to allocate your best Excel Jedi to the job. Or maybe you have so many divergent sets of data that finding common dimensions (links) for your destination data set feels like herding cats.

Extract, transform, load (ETL) was an early approach to combining datasets. It’s been part of computing since the 1970s and is still widely used in data warehousing. The ELT process is to:

  1. Copy data from different sources (Extract).
  2. Standardize the data as needed (Transform).
  3. Store the data in a receiving data warehouse (Load).

ETL was once very attractive because it creates a single version of data that programmers and business analysts could mine.

Increasingly, however, we see a strong trend away from ETL to the data blending model of ELT. In this new system, we:

  1. Copy data from different sources (Extract).
  2. Store clean data as-is in a receiving data warehouse (Load).
  3. Combine data as needed to solve a particular use case (Transform).

Why Use Data Blending?

Data blending offers some distinct advantages over the traditional ETL model.

One option is to set up a traditional ELT data warehouse and then use joins when you want to see a new data combination. Let’s use a left join of two tables as an example. You identify two tables, one “left’ and one “right.” (Fun fact: The left table is dominant, just like the left side of your brain dominates in logic, science, and math.)

Then you run a query that - for starters - returns the entire left-table. Next, the query looks for rows in the right-table that share the dimension you’re looking for. If your shared dimension is Product ID, the join produces a new table with all the data from left-table and additional rows from the right-table. The query adds a right-table row each time it finds a Product ID that matches a Product ID in the left-table.

That means a lot of duplication. Other disadvantages of joins:

Blended data systems improve on the weaknesses of ETL systems. You can set up a blended data warehouse where data sources maintain their unique qualities (including level of detail) and business analysts run ad hoc queries without needing support from computer scientists. A wide range of software tools are readily available to support data blending, some popular tools are offered by Tableau and Alteryx (more below).

How Do You Blend Data?

Let’s take a high level view of how to set up and use blended data.

Set Up Your Blended Data Source

The basic steps are:

  1. Acquire the data - find the data you need in diverse sources, tables, social media, pdf, etc. Get access/permission.
  2. Join the data - combine and load the data to a destination (data warehouse) so you can use it.
  3. Clean the data - delete or fix bad data, and redesign the remaining data into a usable format.

Those three steps look just like ELT, don’t they? They’re easier said than done, but Alteryx and others offer self-serve data prep tools to get you going.

Analyze Your Data

Both Alteryx and Tableau provide sweet interfaces so business users can explore data for its deeper intelligence.

Much like a join, data blending views are based on a primary information source, a secondary source, and a dimension (link) field. Unlike joins, blended data self-serve tools:

As an example, let’s look at a blended data query for data sources with different levels of detail. One source contains information about quarterly sales goals. The other data set contains information about each salesperson's performance, week-by-week.

A join of the two sources would include a row for each salespersons’ weekly sales, which isn’t the information you want. That kind of duplication is a clue that you would be better off with blending. A blend of the data sources would aggregate weekly detail into quarterly totals - the results you do want.

If have use cases that require the strengths and flexibility of blended data, explore software tools like Tableau and Alteryx. They’re designed to get analytical power to end users/analysts. Also remember that Panoply can handle your entire data pipeline. It uses machine learning to fast-track data cleansing and prep, so your source data is ready for analysis in minutes. Panoply is also flexible enough to work with any other visualization tool you choose.

Get a free consultation with a data architect to see how to build a data warehouse in minutes.
Request Demo
Read more in:
ETL
Share this post:

Work smarter, better, and faster with weekly tips and how-tos.