ETL Databases: Extracting the New Wealth

 

Black Gold!

In big data circles, it’s become trendy to proclaim that “data is the new oil,” a reference to the days when untapped reserves of crude represented huge wealth for the taking.

OK. Maybe not exactly for the taking. First they had to find it, drill for it, transport it, and succeed at various other dangerous, expertise-heavy activities. In both the challenges and rewards, the similarities between then (exploiting the newly oil-driven economy) and now (success in our data-driven, disruptive marketplace) are unmistakable.

A 2018 survey found huge untapped reserves of information: 80% of organizations estimated that most of their data (50% to 80%) is unstructured and largely inaccessible. And in the 2017 CIO 100 research, almost 80% of Chief Information Officers said that data analytics would be a spending priority over the next two years. Welcome to the Texas oil boom of data.

ETL Can Tap into Data

Since its development in the 1970s, ETL has been the go-to method for moving data from source databases into larger, integrated structures like data warehouses. So what exactly is ETL?

Extract, transform, load (ETL) uses three steps to create usable databases or data warehouses:

Step 1: Extract

This step covers reading data from different sources. With the explosion of data available on every single aspect of modern business, the number of potential data sources has proliferated as well. The data you need might come in the form of PDFs, video, social media data, or something else entirely. As a result, finding and reading the sources you need can be a big task. And even once you've tracked down all the data you need, you'll still need to make sure it's clean and ready for storage and/or analysis. That brings us to Step 2.
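To make the step concrete, here's a minimal extract sketch in Python. The file names (crm_export.csv, orders.json) and record shapes are hypothetical; a real pipeline would have a connector for every source it touches.

```python
# A minimal extract sketch: pulling raw records from two hypothetical
# sources, a CSV export and a JSON dump. Real pipelines add a connector
# per source (APIs, logs, databases, and so on).
import csv
import json

def extract_csv(path):
    """Yield each row of a CSV file as a dict."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def extract_json(path):
    """Yield each record from a JSON file containing a list of objects."""
    with open(path) as f:
        yield from json.load(f)

# Combine both sources into one raw stream for the Transform step.
raw_records = list(extract_csv("crm_export.csv")) + list(extract_json("orders.json"))
```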

Step 2: Transform

At this point, your pipeline needs to clean and standardize the data. If you have data that doesn’t “fit” into your data schema, it will probably be discarded. And among the data that does make it through your transformation process, you might still find you're losing something essential: data that’s too granular can end up being aggregated, losing some of its original detail.
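Here's a sketch of what a transform might look like, continuing from the extract sketch above. The schema (country and amount fields) and the reference list of valid countries are assumptions for illustration; note how rows that don't fit are dropped and granular detail collapses into aggregates.

```python
# A minimal transform sketch, assuming records shaped like
# {"country": ..., "amount": ...}. Rows that don't fit the schema are
# discarded; granular per-record detail is aggregated per country.
from collections import defaultdict

VALID_COUNTRIES = {"US", "CA", "MX"}  # hypothetical reference data

def transform(records):
    totals = defaultdict(float)
    for rec in records:
        country = str(rec.get("country", "")).strip().upper()
        if country not in VALID_COUNTRIES:
            continue  # doesn't "fit" the schema, so it's dropped
        # Aggregation: individual records collapse into a country total.
        totals[country] += float(rec.get("amount", 0))
    return [{"country": c, "total_amount": t} for c, t in sorted(totals.items())]
```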

Step 3: Load

After you standardize the data to fit your database or data warehouse design, you can load and store it. The result is what’s called a single version of the truth: one pool of data the whole business can tap. That’s why businesses use ETL.
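And a matching load sketch, with SQLite standing in for the warehouse. In practice the target would be a real warehouse, but the idea is the same: the transformed rows land in a queryable table.

```python
# A minimal load sketch, using SQLite as a stand-in for the warehouse.
import sqlite3

def load(rows, db_path="warehouse.db"):
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS sales_by_country ("
        "country TEXT PRIMARY KEY, total_amount REAL)"
    )
    # Upsert the transformed rows produced by the sketch above.
    con.executemany(
        "INSERT OR REPLACE INTO sales_by_country VALUES (:country, :total_amount)",
        rows,
    )
    con.commit()
    con.close()
```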

Using an ETL Database or Warehouse

You’ve seen that the concept behind an ETL process is simple, with just three easy steps. You might be tempted to go the DIY route. Why not dedicate staff to hand-code ETL software and create your own ETL database or warehouse? It’s your data, after all, and you know what you want better than anyone.

But there’s more than one way to create an ETL system. Moreover, there can be significant costs associated with the inefficiencies and the learning curve of trying to do it yourself. Let’s take a high-level look at building an ETL system and some ETL tools to help.

Build an ETL Database or Warehouse

There are a few basic approaches to building an ETL infrastructure. The classic hand-built approach is batch processing.

Batch processing requires that you begin by setting standards for your data. As a simple example, you wouldn’t accept “Nebraska” in the “Country” field. Then you need to find data in all your diverse sources and formats, and read, standardize, validate, transform, stage, and publish it.
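As a sketch of that standards-first idea, here's what a simple validation rule might look like, using the Nebraska example above. The reference sets are abbreviated stand-ins, not real lookup tables.

```python
# A sketch of setting standards before data enters the pipeline.
# "Nebraska" is a US state, so it fails the Country rule above.
US_STATES = {"Nebraska", "Texas", "Ohio"}            # abbreviated stand-in
COUNTRIES = {"United States", "Canada", "Mexico"}    # abbreviated stand-in

def validate(record):
    """Return a list of rule violations; an empty list means it passes."""
    errors = []
    country = record.get("country")
    if country not in COUNTRIES:
        hint = " (that's a state)" if country in US_STATES else ""
        errors.append(f"invalid country: {country!r}{hint}")
    return errors

print(validate({"country": "Nebraska"}))  # ["invalid country: 'Nebraska' (that's a state)"]
print(validate({"country": "Canada"}))    # []
```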

All those processes are complex. If you mishandle them, you risk corrupting the data in your ETL pipeline. 

That's why it can be helpful to bring in pre-built ETL tools, which can shoulder some or all of the burden of building an ETL database or warehouse. ETL testing tools are also important at each step along the way, because you want your efforts to produce accurate, usable data (there's a small testing sketch after the tool list below). We gathered some of the best-known ETL tools for data warehousing below:

Stitch is an ETL service, built around the open-source Singer framework, that rapidly moves data.

Blendo and Fivetran are ETL tools that focus on making your data quickly accessible without any need for you to write code or scripts.

The Matillion ETL tool was designed to work with Google BigQuery and Amazon Redshift. If you already use a lot of Amazon products, its integrations are made for you.

Some ETL tools are free or open source, like:

Apache Camel, which helps you define routing and mediation rules, so your developers can integrate data into your ETL infrastructure.

Apache Kafka, which reads and writes data streams. It supports your developers in writing stream applications.

Apatar provides a visual interface for data integration. Its strengths are extracting, transforming and loading data.

Heka was developed by Mozilla as a Swiss Army knife of stream processing, handling tasks from loading to analyzing data. Be aware that Mozilla has since discontinued the project.

Logstash reads data from multiple sources at the same time, transforms it, and stores it in the destination of your choice, most commonly Elasticsearch.

Scriptella is a script execution ETL tool that lets you use SQL, JavaScript, JEXL, Velocity, or whatever language is appropriate to your pipeline to transform data, migrate data, and perform cross-database ETL tasks.

For-pay ETL tools include:

Informatica PowerCenter, a high-performance tool that handles the entire data pipeline from ingestion through analytics, for a top-shelf price.

Microsoft SQL Server Integration Services (SSIS) offers a user-friendly graphical interface to help less technical users through a wide variety of data migration tasks. It ships with SQL Server.

Blendo and Fivetran, covered in the tool list above.
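As promised above, here's a minimal sketch of ETL testing, written against the SQLite stand-in from the Load step. The two checks (row-count reconciliation and a null scan) are just examples; dedicated ETL testing tools go much further.

```python
# Two minimal post-load checks, assuming the sales_by_country table
# from the load sketch earlier. Real test suites also cover schema
# drift, duplicates, referential integrity, and more.
import sqlite3

def test_row_counts(expected_rows, db_path="warehouse.db"):
    """Reconcile the warehouse row count against the transformed source."""
    con = sqlite3.connect(db_path)
    (loaded,) = con.execute("SELECT COUNT(*) FROM sales_by_country").fetchone()
    con.close()
    assert loaded == len(expected_rows), f"expected {len(expected_rows)}, got {loaded}"

def test_no_null_keys(db_path="warehouse.db"):
    """Make sure the transform didn't leave gaps in key columns."""
    con = sqlite3.connect(db_path)
    (nulls,) = con.execute(
        "SELECT COUNT(*) FROM sales_by_country WHERE country IS NULL"
    ).fetchone()
    con.close()
    assert nulls == 0, f"{nulls} rows are missing a country"
```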

Tap into Your Data

Let’s say you find the right balance between DIY software and ETL tools, and do the heavy lifting to create an ETL database or warehouse. We’ll assume the best case: you didn’t load corrupt data or lose lots of data in the process. What do you have?

ETL is a decades-old vision for data processing. It may satisfy the needs of many businesses in the data warehouse space, but its disadvantages make it unsuitable for others. ETL is very IT-centric: data ownership and use are concentrated in your IT department. Another potential disadvantage is that ETL-based pipelines work best for data that doesn’t change rapidly.

If you need insights from rapidly changing data, or if you want to get data into the hands of your power users and data analysts quickly, consider making an end run around ETL. Panoply offers an alternative to ETL: the ELT (Extract, Load, Transform) approach. When you shift the time-consuming transform work to the end of the process, you can load data and make it accessible in minutes rather than days. Lightening up on the Transform step lets you make that effort only when (or if) you need to see the data.
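To show the difference, here's a minimal ELT sketch (an illustration, not Panoply's actual implementation): raw records land in the warehouse immediately, and the transform becomes an ordinary SQL query run later, only when someone needs the data. SQLite stands in for the warehouse again, assuming a build with the JSON1 functions.

```python
# A minimal ELT sketch: load raw JSON first, transform on read later.
# SQLite stands in for the warehouse; json_extract requires the JSON1
# functions, which ship with modern SQLite builds.
import json
import sqlite3

con = sqlite3.connect("warehouse.db")
con.execute("CREATE TABLE IF NOT EXISTS raw_events (payload TEXT)")

# Load: dump raw records straight in. No up-front cleaning, so it's fast.
events = [{"country": "CA", "amount": 12.5}, {"country": "US", "amount": 7.0}]
con.executemany(
    "INSERT INTO raw_events VALUES (?)", [(json.dumps(e),) for e in events]
)
con.commit()

# Transform on read: shape the data only when an analyst asks for it.
rows = con.execute(
    "SELECT json_extract(payload, '$.country') AS country, "
    "SUM(json_extract(payload, '$.amount')) AS total "
    "FROM raw_events GROUP BY country"
).fetchall()
print(rows)  # [('CA', 12.5), ('US', 7.0)]
con.close()
```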

Once you've got your data loaded, you'll find Panoply's integration with any BI visualization tool super helpful: your analysts will be able to find and share actionable insights in minutes.

Get a free consultation with a data architect to see how to build a data warehouse in minutes.