The world has changed; companies now use hybrid and multi-cloud environments to store their data.
You probably do too.
Gone are the days when you had to build a data warehouse on-site and store everything locally. We have embraced the cloud and its many benefits.
One such benefit is the decoupling of storage and computing power.
Storage is the space your data takes up, and computing power is the work your computers do to make that data usable and valuable (most likely extracting, transforming, and loading it).
This decoupling is a double-edged sword.
On the one hand, it leads to substantial cost savings, but on the other, it gives you more to manage.
You don't want your compute resources running all the time, but how do you decide when to run them and when to turn them off?
That's where data orchestration comes in.
What is data orchestration, and why do you care?
A musical conductor controls which instruments start playing and stop playing at a specific time.
Data orchestration is like the conductor for your data pipeline.
It ensures your jobs turn on and off at the correct times.
For example, first you extract data from different sources, then apply specific transformations. Once all of this is finished, you load the results into your data warehouse (unless you're using the ELT method, which loads before transforming).
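The extract-transform-load sequence above can be sketched as three chained functions. This is a toy illustration, not real pipeline code: the records, the transformation, and the in-memory "warehouse" are all hypothetical stand-ins.

```python
# Toy ETL sketch: each step runs only after the previous one finishes.
# The records, transformation, and "warehouse" below are all hypothetical.

def extract():
    # Pull raw records from some (fake) sources.
    return [{"user": "a", "spend": "10"}, {"user": "b", "spend": "25"}]

def transform(rows):
    # Apply a specific transformation: cast spend from text to an integer.
    return [{**row, "spend": int(row["spend"])} for row in rows]

def load(rows, warehouse):
    # Load only once extraction and transformation are complete.
    warehouse.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse)  # [{'user': 'a', 'spend': 10}, {'user': 'b', 'spend': 25}]
```

Notice the ordering is baked in: `load` can't run until `transform` returns, and `transform` can't run until `extract` does. That's the property a data orchestrator enforces at scale.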
But there is another aspect to data orchestration.
A conductor can tell each instrument to play at specific times, but it will sound awful if they play different pieces of music. All the instruments need to be playing the same piece. They should be reading from a single source of truth (remember this phrase; you will hear it a lot in data orchestration talks).
Let's imagine a company with data in a variety of locations.
The data is stored in different file formats, created with various data collection tools, and maybe even saved with multiple cloud storage providers (to avoid vendor lock-in). Each file format, tool, and storage provider requires different code to access the data.
Now, let's say an employee at this company wants to create a data application.
Before they've written a single line of code for the application, they will have to spend hours (maybe weeks) writing code just to access their data. It's like spending all your time making sure the dishes are washed thoroughly and not spending any time cooking your meal!
Wouldn't it be great if you had a dishwasher that automatically washes the dishes perfectly every time so that you can focus on making a great meal?
Hello, data orchestration.
A well-designed data orchestration pipeline brings together all your data sources into an easily queryable interface so that what used to take you an eternity takes just a few minutes. It ingests all the disparate data sources, loads them into your data warehouse, applies the transformations you want, and makes the results easy to analyze in your favorite BI tool.
Effective data orchestration is critical for any business that makes heavy use of analytics (which should be all of you!). But how do you actually orchestrate your data?
You need to use something called a DAG.
What is a DAG?
DAG stands for Directed Acyclic Graph.
Working backward, we can see that it is a graph (a collection of points with lines connecting them) that is acyclic (it has no cycles) and directed (its lines only go in one direction).
Here's an example of a simple graph:
All nodes (the circles labeled 1-5) are connected with lines (sometimes called edges) between them.
We have numbered them 1-5, but there is no clear start or endpoint, and you can go in any direction. Moreover, there is a cycle: 1-2-4-3-1.
Let's turn this into an example of a directed graph by adding a direction to each line:
The arrows show us which ways we are allowed to travel along the graph.
It wouldn't make much sense to start at node 5, but you could begin at any other node and reach all the others. This journey is possible thanks to the cycle 1-2-4-3-1. Because every line now has a direction, this is a directed graph (though not yet an acyclic one).
Now, let's turn it into a directed acyclic graph example by changing the direction of a couple of the arrows:
Now, all the arrows point in one direction, and there is no cycle. You can go from 1 to 2 and from 2 to 4 but cannot go back to 3.
Although there is no start point explicitly specified, number 1 looks like the most likely candidate as it's the only node you can start at that lets you reach all the other nodes.
The above example is what all of your data orchestration pipelines will look like: graphs with a clear start point that run in one direction and don't contain any cycles. Though in real life, they'll probably have more than five nodes.
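A useful consequence of these properties: every DAG admits a topological order, a sequence in which each node comes after all of the nodes pointing into it, which is exactly the order an orchestrator runs tasks in. Python's standard-library `graphlib` can compute one. The edge list below is one plausible reading of the five-node figure above (assumed edges: 1→2, 1→3, 2→4, 3→4, 4→5):

```python
from graphlib import TopologicalSorter

# The five-node DAG, written as "node: the nodes it points to".
# (A plausible reading of the figure: 1→2, 1→3, 2→4, 3→4, 4→5.)
edges = {1: {2, 3}, 2: {4}, 3: {4}, 4: {5}, 5: set()}

# TopologicalSorter wants "node: its predecessors", so invert the edges.
predecessors = {node: set() for node in edges}
for src, dests in edges.items():
    for dst in dests:
        predecessors[dst].add(src)

order = list(TopologicalSorter(predecessors).static_order())
print(order)  # starts at 1, ends at 5; every node follows its predecessors
```

Node 1 always comes first and node 5 last, matching the intuition that 1 is the natural start point.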
Why you need directed acyclic graphs
So, now you know what a DAG is, but why do you need one?
It may seem like a random selection of properties to group together, but bear with me. These properties make DAGs perfect for building ETL pipelines.
First, it makes sense to think of your data pipeline as a directed graph because you perform specific steps in a specific (i.e., directed) order.
At the highest level, you extract, transform, and then load your data. You cannot transform data before you have extracted it.
Second, you want your data pipeline to be acyclic because each step builds on the previous one. You don't want to get stuck extracting and transforming your data in an endless loop.
Once you've extracted your data, it's time to move onto the next step. You never want to return to this step (unless you are starting a brand new process).
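This "never return to a finished step" rule is exactly what a cycle check enforces. Here is a small sketch using the standard-library `graphlib` again, with hypothetical step names: the acyclic dependency map orders cleanly, while adding a back-edge makes ordering impossible and raises `CycleError`.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical pipeline steps, each mapped to the steps it depends on.
steps = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}
order = list(TopologicalSorter(steps).static_order())
print(order)  # ['extract', 'transform', 'load']

# Add a back-edge so "extract" depends on "transform": now there's a loop.
steps["extract"] = {"transform"}
try:
    list(TopologicalSorter(steps).static_order())
    cycle_detected = False
except CycleError:
    cycle_detected = True
print("cycle detected:", cycle_detected)  # cycle detected: True
```

An orchestrator built on DAGs refuses the second graph up front, instead of silently looping forever at run time.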
How to implement directed acyclic graphs
How do you write DAGs in code?
Well, you could implement them yourself with a tangle of Bash and Python scripts stitched together with cron jobs, but you would soon run into scaling problems.
Instead, you could use some workflow management/automation software. The most popular package is Airflow, and the object at its core is (surprise, surprise!) the DAG.
Using Airflow, you get all the DAG benefits mentioned above, plus some extra bits.
Let's say you've extracted some data, and you want to apply three transformations to it and then load it into your data warehouse. How do you manage this?
- What if the transformations take a varied amount of time to complete?
- How do you keep the program waiting and ensure it doesn't load the data early?
- What if one transformation fails?
- Can you automatically restart it?
- How many times should you restart it?
- If it keeps failing, how can you notify the team?
Airflow makes handling problems like these a breeze and only executes the next step in your DAG once everything beforehand has been completed successfully.
It also provides excellent failure handling, integrates with Slack, and has a handy UI to monitor all your DAGs.
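To make this concrete, here is a minimal sketch of what such a pipeline might look like as an Airflow 2.x DAG file. This is a hedged illustration, not production code: the `dag_id`, task names, retry settings, and the empty step functions are all hypothetical, and exact import paths vary between Airflow versions.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical step functions; a real pipeline would do actual work here.
def extract(): ...
def transform_a(): ...
def transform_b(): ...
def transform_c(): ...
def load(): ...

with DAG(
    dag_id="example_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={
        "retries": 3,                        # restart a failed task up to 3 times
        "retry_delay": timedelta(minutes=5), # wait between restart attempts
    },
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    transforms = [
        PythonOperator(task_id=f"transform_{name}", python_callable=fn)
        for name, fn in (("a", transform_a), ("b", transform_b), ("c", transform_c))
    ]
    t_load = PythonOperator(task_id="load", python_callable=load)

    # The three transforms run after extract; load waits for all of them,
    # however long each one takes.
    t_extract >> transforms
    transforms >> t_load
```

The `>>` operator declares the edges of the DAG, and `default_args` answers the retry questions above declaratively instead of with hand-rolled loop-and-sleep logic.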
Directed Acyclic Graph Examples
Now, this may all sound good in theory, but how does it work in practice? Let's get into some real-life directed acyclic graph examples.
Halodoc DAG Example
Halodoc is the number one healthcare app in Indonesia and connects doctors and patients through teleconsultations. Their site is in Indonesian, but their blog is in English and explains how they scaled their data orchestration pipeline with DAGs.
Initially, they used three data extraction tools (Pentaho, AWS DMS, and AWS Glue), but these were expensive and clunky. So, they switched to Airflow and its accompanying directed acyclic graphs.
Now, they use DAGs to extract data from all their disparate data sources: Amazon RDS, Amazon DynamoDB, Excel sheets, CSV files, Google Analytics, and Mixpanel.
After that, they transform everything into pandas DataFrames and load the results into an S3 data lake, from which they are copied into their Redshift data warehouse.
Here's a visual of their directed acyclic graph pipeline managed with Airflow:
Gusto DAG Example
Gusto is a modern, online people payment and management platform that helps small businesses take care of their teams. For the first few years of their existence, everyone just ran ad-hoc SQL queries on their production databases.
But as Gusto scaled, they realized they needed something more sophisticated.
Gusto now uses Airflow and DAGs to extract data from multiple sources: their application database, 3rd party API vendors, and other team-specific sources, e.g., from the Growth and Care teams. Then they load it into a raw Redshift database before applying SQL transformations and loading it into a Redshift BI data warehouse.
Finally, they use DAGs to create different SQL views for each of their teams and their respective dashboards on Looker.
Here's a visual of their data orchestration pipeline managed with Airflow (and a couple of other helper tools we left out for brevity):
How does Panoply help with data orchestration?
Now you know what directed acyclic graphs (DAGs) are and why they are vital for every business's data orchestration pipeline. However, this may sound quite complicated to you.
Maybe you just want to access your data without worrying about DAGs breaking or having to write the ETL logic yourself?
Well, then you are in luck!
The end goal of almost all data orchestration jobs is to store the data in a data lake or warehouse for future use, or to pipe it into BI tools.
DAGs are a great tool for this goal, but they can take a lot of time, effort, and money to get up and running. And that's not mentioning the engineering teams you need to hire to maintain them.
What if, instead of the complicated-looking data orchestration pipelines above, you could streamline it?
What if you could extract the data in a few clicks, apply whatever transforms you wanted, and then load it into your BI tool without having to write loads of DAGs?
If you want to ingest data from multiple sources and analyze it in your BI tool of choice within minutes, Panoply was built for you.
Panoply is a smart data warehouse that lets you connect to 50+ data sources with a few clicks, stores all your raw data in analysis-ready tables, and seamlessly updates your dashboards and BI tools.