A Brief Definition
Simply put, data blending takes data from multiple data sources and combines that data into one useful dataset.
This is not a new concept. Data analysts estimate that 80% of their “analytics” time is spent creating new datasets, rather than analyzing data for insight. And we all know that valuable data sources are multiplying at a dizzying rate. So how can we combine data to mine for facts or correlations when sources are as diverse as databases, spreadsheets, social media, and pdfs?
ETL or ELT?
The creation of new datasets often requires heavy lifting. Maybe you need to write a new software program, and writing it’s time-consuming. Or maybe you need to allocate your best Excel Jedi to the job. Or maybe you have so many divergent sets of data that finding common dimensions (links) for your destination data set feels like herding cats.
- Copy data from different sources (Extract).
- Standardize the data as needed (Transform).
- Store the data in a receiving data warehouse (Load).
ETL was once very attractive because it creates a single version of data that programmers and business analysts could mine.
- Copy data from different sources (Extract).
- Store clean data as-is in a receiving data warehouse (Load).
- Combine data as needed to solve a particular use case (Transform).
Why Use Data Blending?
Data blending offers some distinct advantages over the traditional ETL model.
One option is to set up a traditional ELT data warehouse and then use joins when you want to see a new data combination. Let’s use a left join of two tables as an example. You identify two tables, one “left’ and one “right.” (Fun fact: The left table is dominant, just like the left side of your brain dominates in logic, science, and math.)
Then you run a query that - for starters - returns the entire left-table. Next, the query looks for rows in the right-table that share the dimension you’re looking for. If your shared dimension is Product ID, the join produces a new table with all the data from left-table and additional rows from the right-table. The query adds a right-table row each time it finds a Product ID that matches a Product ID in the left-table.
That means a lot of duplication. Other disadvantages of joins:
- Results can be different, depending upon which table you decide will be “left.”
- Join complexity increases rapidly as you add more tables to your query.
- Databases might not support cross-database joins.
- Data with different levels of detail can frustrate your query.
- ETL systems are designed and maintained by an IT department, not the business analysts/users who need information. ETL systems are centrally governed and slow to change. They don’t support ad hoc queries that would require a whole new ETL process for support.
Blended data systems improve on the weaknesses of ETL systems. You can set up a blended data warehouse where data sources maintain their unique qualities (including level of detail) and business analysts run ad hoc queries without needing support from computer scientists. A wide range of software tools are readily available to support data blending, some popular tools are offered by Tableau and Alteryx (more below).
How Do You Blend Data?
Let’s take a high level view of how to set up and use blended data.
Set Up Your Blended Data Source
The basic steps are:
- Acquire the data - find the data you need in diverse sources, tables, social media, pdf, etc. Get access/permission.
- Join the data - combine and load the data to a destination (data warehouse) so you can use it.
- Clean the data - delete or fix bad data, and redesign the remaining data into a usable format.
Those three steps look just like ELT, don’t they? They’re easier said than done, but Alteryx and others offer self-serve data prep tools to get you going.
Analyze Your Data
Much like a join, data blending views are based on a primary information source, a secondary source, and a dimension (link) field. Unlike joins, blended data self-serve tools:
- Can provide automated data discovery to find a dimension that links your data sets.
- Return only a view of your combined data, leaving your data sources intact
- Are good at handling data with different degrees of granularity.
As an example, let’s look at a blended data query for data sources with different levels of detail. One source contains information about quarterly sales goals. The other data set contains information about each salesperson's performance, week-by-week.
A join of the two sources would include a row for each salespersons’ weekly sales, which isn’t the information you want. That kind of duplication is a clue that you would be better off with blending. A blend of the data sources would aggregate weekly detail into quarterly totals - the results you do want.
If have use cases that require the strengths and flexibility of blended data, explore software tools like Tableau and Alteryx. They’re designed to get analytical power to end users/analysts. Also remember that Panoply can handle your entire data pipeline. It uses machine learning to fast-track data cleansing and prep, so your source data is ready for analysis in minutes. Panoply is also flexible enough to work with any other visualization tool you choose.