Good analytics is no match for bad data. Bad data wins every time.
That's why we're talking about the tools to create a clean, efficient, and accurate ELT (extract, load, transform) pipeline so you can focus on making your "good analytics" great—and stop wondering about the validity of your analysis based on poorly modeled, infrequently updated, or just plain missing data.You don't need us to tell you that building your own ELT pipeline from scratch is a recipe for disaster. Companies exist whose sole aim is to create the perfect data pipeline solution for you. Using them will save you months of hard work and countless hours of engineering time. In the past, there were just a couple to choose from. Now there are a lot.
Decision fatigue anyone?
In this article, we'll take you through the top 7 ELT pipeline tools out there. We’ll start with the most complex, data engineer-centric solutions and end with the simplest analyst-centric solutions.
Built for Data Engineers
Airflow is an open-source platform created by AirBnB to programmatically author, schedule, and monitor workflows. It is probably the most famous data pipeline tool out there.
Using Airflow is similar to using a Python package. It is well written, easy to understand, and completely customizable. Your developers can create a data pipeline for you with unlimited complexity. You can work with any number of data sources, connect to any data warehouse, and use any BI tool.
What’s the downside?
This flexibility requires in-depth coding knowledge and a team of engineers to implement. Moreover, even though Airflow is designed to make handling large and complex data workflows easier, it itself can become quite large and complex as your company scales.
Airflow is entirely free to use and completely customizable. But if you are a small team, you may want a more straightforward, less code-heavy tool to get your data pipeline up and running swiftly.
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for you to prepare and load your data for analytics.
If parts of your data pipeline are already on AWS, using Glue will be straightforward. You can create an ETL job in just a few clicks because you already understand the AWS Management Console.
Moreover, because AWS Glue is integrated across the majority of AWS services, the onboarding process is a breeze. Glue natively supports data stored in Amazon S3 buckets, Amazon Redshift, Amazon Aurora, and all other Amazon RDS engines.
Not only is it quite simple, but it's also intelligent. All you have to do is point AWS Glue in the direction of your data stored on AWS, and it finds it, understands it, and stores the associated metadata such as the table definition and table schema. Then you can query, search, and perform ETL on it whenever you want.
Another great feature is that Glue automatically generates the code you need to perform transformations. It's just Python code, so you or your engineers can modify it if it doesn't precisely fit your needs.
AWS charges you an hourly rate when your jobs are running. This rate varies by region but is around $0.44/hour. Note that you must also pay for storage. However, your first million rows and million requests are free every month. Glue is an attractive data pipeline tool if you plan to put a lot of your infrastructure on AWS.
dbt - data build tool
dbt allows anyone comfortable with SQL to own the entire data pipeline from writing data transformation code to deployment and documentation.
There are two parts to dbt: the free, open-source software called dbt Core, and the paid production service called dbt Cloud.
dbt enables you to perform complex and powerful transformations on your data with all the major data warehouses - Redshift, Snowflake, and BigQuery - using software engineering best practices. dbt natively supports version control, Git integration, logging, and modularity.
Every data model in dbt is a simple SELECT statement, and dbt handles turning those into tables and views in a data warehouse. At the core is the ref function, which lets you reference one model within another and automatically build dependency graphs. Thus, when you update your data, dbt automatically updates all your materialized tables in the graph in the correct order.
dbt is free, open-source, and has a large and active online community. Moreover, it is used by hundreds of companies such as GitLab, Canva, and Simply Business.
Once you have created your dbt model, you need to run them on a schedule (rather than manually running them from the command line). There are several options for you, the one built by dbt is dbt Cloud. It comes with a well-designed browser-based IDE to work in. It lets you run your jobs on schedule, configure error notifications, view logs for any historical invocation of dbt, and automatically render project documentation. Alternatively, you can use other job schedulers such as Airflow or cron.
Using dbt Core is free, and dbt Cloud comes with a free forever option for a data team of 1. This free profile gives you unlimited daily running, job scheduling, logging and alerting, and much more. For larger teams, it costs $50/month per person and includes up to 5 concurrently running jobs and API access.
For bigger teams, there are enterprise packages available, including SSO access, on-premises deployment, and audit logging, among other things. dbt is an excellent data pipeline tool if you are comfortable with SQL and want to scale your team.
Like dbt, Dataform has a free, open-source software package, SQLX, that lets you build data transformation pipelines from the command line. Plus, they offer a paid web service, Dataform Web, to seamlessly manage pipelines, documentation, testing, and sharing data models with your team.
Every SQLX model is a simple SELECT statement. You can easily create dependencies between tables with the ref function. Tell Dataform what kind of relation you want to create, and SQLX manages all the create, insert, drop boilerplate code for you.
If you don't want to use the command line to productionize your models, Dataform Web lets you create and schedule infinitely complex tables, dependencies, and views using their intuitive browser-based IDE. Moreover, you can run, test, and inspect the results of SQL queries before committing or overriding production data. Lastly, it's deeply integrated with Git, which makes version control a breeze.
Dataform is very similar to dbt but has a smaller community and fewer companies using it. Using their open-source software is free, and their browser-based IDE is free for a data team of one. If you use Dataform Web, they manage your infrastructure for you. In this respect, it is simpler than dbt (where you must manage your own infrastructure).
If you want job scheduling and altering, isolated development environments, and a sharable data catalog, Dataform Web costs $150/month for five profiles. For more granular access controls, run caching, and staging and production environments, it's $450/month for five profiles.
Lastly, there are enterprise packages available that include SSO, provide cloud deployment, and a dedicated account manager. If you choose the paid option, Dataform is slightly cheaper and more straightforward than dbt and is a solid choice to scale your business's data pipelines.
Built for Data Analyst
Matillion ETL software is purpose-built for cloud data warehouses and is the only ETL software that natively integrates with Google BigQuery, Amazon Redshift, Snowflake, and Microsoft Azure Synapse.
Using their GUI, you can easily extract and load data from an extensive list of data sources without any coding. Sources such as on-premises or cloud databases, SaaS applications, NoSQL, and spreadsheets are all built-in. Matillion also supports custom data source integrations via their REST API.
Matillion gives you the ability to perform powerful transformations to feed it into your favorite BI tools such as Tableau, Looker, and Power BI. Plus, you can combine simple transformations such as Filter, Join, Rank, and many more to solve for complex business logic. You can also make sure your resources are maximally utilized by scheduling jobs to run when resources are available.
You can learn the Matillion tool in a matter of hours rather than the weeks that traditional ETL can take. Note, though, that there are some tools further down the list that you can learn in minutes - so don't get too excited just yet.
Prices start from $1.37/hour, and there are no commitments or upfront costs. Plus, they manage the infrastructure for you - a solid choice for your data pipeline tool.
Alteryx is a self-service data analytics platform with multiple products. All products give you drag-and-drop accessibility to ETL tools. You don't need to know any SQL or coding to create and maintain a complex data pipeline. Some of the world's biggest companies - Coca Cola, Unilever, and Audi - use Alteryx. There is a vast online community with over 200k posts, 16k solutions, and self-service learning paths, which means learning the software is quite straightforward.
The user interface lets you design repeatable workflows so that you can automate time-consuming, manual data tasks, and adjust analytical queries easily. It natively works with a large volume and variety of data sources such as spreadsheets, data warehouses, cloud applications, and more. They make it easy to share your insights with others through their centralized platform, and it can be up to 100x faster than traditional approaches.
All this incredible functionality comes with a very steep price tag; it is by far the most expensive on the list. Alteryx starts at $5,195/user per year (so $432/month), significantly more expensive than the other options listed above. This tool is probably going to be out of your budget range if you are looking to build out your first data pipeline.
Panoply is a fully integrated data management platform that makes it easier than ever to connect your data. It is the only data pipeline tool that effortlessly puts all your business data in one place, gives all employees unlimited access to it, and requires zero maintenance.
You can set up Panoply in minutes and maintain it forever without any coding knowledge or specialist expertise. What used to take engineering teams weeks to achieve, Panoply does in minutes. The 'hours' that Matillion takes to learn seems like a lifetime now, right?
Panoply supports your business-critical data integrations from APIs, files, and databases that you can set up in a few clicks. There is no need for additional data pipeline tools or a caching layer.
Panoply uses the Extract, Load, Transform (ELT) paradigm, rather than ETL, which dramatically speeds up data ingestion, giving you more time to focus on data analysis rather than fiddling with your data pipeline.
Panoply gets all your data in one place backed by industry-leading AWS and Microsoft Azure infrastructure. Thus, it's easy to analyze data with standard SQL, and you can connect virtually any tool via industry-standard ODBC/JDBC. Finally, you have full visibility of your data pipeline thanks to detailed reporting and control over your job schedules.
All of the other tools above can work with Panoply, but none can compete on pricing. Panoply starts from $200/month for unlimited users. That is less than half what Alteryx costs for a single user. If you have a small team and a tight budget, Panoply can streamline your data pipeline now and continue to serve you as your business scales due to its flexible nature and upfront pricing. Try Panoply free for 14 days.