Finding the right ETL tool for your business is essential. ETL pulls data out of a source (extract), makes changes according to requirements (transform), and then pushes the transformed data into a database, data warehouse, or BI platform.
From cost to maintenance, there are a lot of factors to consider when choosing an ETL tool. To help, we've outlined both paid and open source ETL tools so you can make a decision about what fits best for your business. We also discuss the move from ETL to “No ETL,” as ELT is quickly taking over modern data and cloud environments.
If you're the type that likes to skip around, we've separated these tools into three categories: Paid ETL Tools, Open Source Tools, and Deprecated Tools
While free tools are always tempting, in a lot of cases you get what you pay for. If you're looking for off-the-shelf connectivity and low- or no-maintenance options, a paid service is the way to go. That said, there are plenty of differences between these tools, so it's worth sweating the details.
Informatica supports multiple DBMS technologies, and is a highly-rated company for data integration power users. In general, Informatica’s data offerings are quite pricey (in the 6-figure range for licenses), but they’re one of the biggest players in the space for a reason. If you’re looking for an ETL tool for a large, well-resourced organization, Informatica may be your choice.
Informatica PowerCenter price: $2,000/month for the most basic plan
One of the main selling points for SSIS versus other solutions is its easy-to-use interface, allowing users to deploy integrated data warehousing solutions without having to get involved with writing much—or any—code. The graphical interface allows for easy drag-and-drop ETL for multiple data types and warehouse destinations, including non-MS DBs. SSIS is a great solution for a team with a mix of technical skill levels, as it’s equally effective for ETL ninjas and point-and-click types alike.
SSIS price: $931 - $15,000
Talend also has Master Data Management (MDM) functionality, which allows organizations to have a single, consistent and accurate view of key enterprise data. This can create better transparency across a business, and lead to better operational efficiency, marketing effectiveness and compliance.
Talend price: $1,170/user monthly or $12,000 annually
Under the hood, Panoply uses a flexible ELT approach (rather than traditional ETL), which makes data ingestion much faster and more dynamic, since you don’t have to wait for transformation to complete before loading your data. And since Panoply builds managed cloud data warehouses for every user, you won’t need to set up a separate destination to store all the data you pull in using Panoply’s ELT process.
Panoply price: see all pricing options; a free trial is available.
Stitch can connect to Amazon Redshift architecture, Google BigQuery architecture, Postgres architecture, and BI tools.
Stitch price: $100 - $1,000/month based on data size
Fivetran does not impose any data limit, and can be used to centralize a company’s data and integrate all sources to determine Key Performance Indicators (KPIs) across an entire organization.
Fivetran price: Available upon request
Alooma is designed for enterprise-scale operations, so if you’re a small startup with a small operating budget, Alooma probably isn’t for you. Also note that as of 2019, “Alooma is only accepting new customers that are migrating to Google Cloud Platform.”
Alooma price: $1,000 - $15,000/month
Segment’s API has native library sources for every language, and helps record customer data from sources such as websites, mobile, apps or servers. It helps optimize analytics by piping raw customer data into data warehouses for further exploration and advanced analysis.
Segment price: $120/month
Atom’s transformation code is written in Python, which helps turn raw logs into queryable fields and insights, but could be a barrier for some users. It provides a collection layer, which supports sending data from any source and in any format to arrive to the target data repository near real time.
Atom price: Pay-per-use
ETL pipelines in ADF are built in a graphical interface, allowing for low-code use, and the package has a wide variety of data connectors for easy data ingestion. At the other end of the pipe, Data Factory can generally only be used to load into Azure data warehouses, so this solution makes sense for those who want to live entirely in a Microsoft ecosystem.
Azure Data Factory price: $1 for 1,000 runs per month
With over 70 native data source integrations, as well as an optional no-code graphical interface, Matillion makes loading your data into your warehouse of choice simple and straightforward. It also automates the data transformations you'll need in order to get your data ready for analysis with your favorite BI tool. Matillion bills hourly for usage, so it could also be particularly attractive for those with a lot of ETL downtime.
Matillion price: $12,000 - 48,000 / year, depending on plan and assuming 24/7 usage
Because it's integrated with AWS, Etleap also makes it easy to scale your data warehouse up and down with the same easy-to-use interface, while at the same time managing your ETL flows on the fly. Once data has been collected using one or many of its 50+ data integrations, users can also take advantage of Etleap's graphical data wrangling interface or fire up the SQL editor for data modeling and transformation. Orchestration and scheduling features make managing all your ETL pipelines and processes as easy as the click of a button. In addition to its SaaS offering, Etleap also provides a version that can be hosted on your own VPC.
Etleap price: Available upon request
Open source ETL tools are fantastic if you want to keep costs low...and if you love being part of a community that continually works to keep the data flowing. While budget probably isn't an issue for these tools, understanding how their backend is built is key to narrowing down your options.
Apache Camel uses Uniform Resource Identifiers (URIs), a naming scheme used in Camel to refer to an endpoint that provides information such as which components are being used, the context path and the options applied against the component. There are more than 100 components used by Apache Camel, including FTP, JMX and HTTP. Apache Camel can be deployed as a standalone application in a web container such as Tomcat, a JEEE application server such as JBOSS AS, or in combination with a Spring container.
Because Airbyte connectors run as Docker containers, they can be built in the language of your choice. This ETL tool offers even more flexibility through modular components and optional feature subsets that you can It distinguishes itself by several significant choices.
As with Fivetran, Airbyte integrates with dbt for transformations, making it an ELT tool. However, contrary to Singer, Airbyte uses one single open-source repo to standardize and consolidate all developments from the community, leading to higher quality connectors. They built a compatibility layer with Singer so that Singer taps can run within Airbyte.
Kafka can be used in many different ways: for example as a message bus, a buffer for replication systems or event processing, and to decouple apps from databases for both OLTP and DWH.
Logstash is written in Ruby and provides a JSON-like structure which has a clear separation between internal objects. It has a pluggable framework featuring over 200 plugins, enabling the ability to mix, match and orchestrate facilities over different input, filters and output. This tool can be used for BI, or in data warehouses with fetch, transformation and storing event capabilities.
With a large collection of pre-built taps, the scripts that collect datapoints from their original sources, and an extensive selection of pre-built targets, the scripts that transform and load data into pre-specified destinations, Singer allows users to write concise, single-line ETL processes that can be modified on the fly by swapping taps and targets in and out.
Deprecated ETL tools
Alas, not all ETL tools stand the test of time. While we would necessarily recommend implementing a deprecated tool in your data stack, it's helpful to know how they were built and why they still command attention in parts of the data world.
Apatar is an open source data integration and ETL tool, with capabilities for extracting, transforming and loading data. Apatar comes with a visual interface that can reduce R&D costs, improve data integration efficiency and minimize the impact of system changes. Apatar is written in Java and has Unicode-compliant functionality. It can be used to integrate data across teams, populate data warehouses and data marts, and schedule and maintain little or no code when connected to other systems.
Apatar's GitHub profile indicates it was deprecated in 2012, but it remains a favorite for those looking for a drag-and-drop ETL solution.
Heka is an open source software system for high performance data gathering, analysis, monitoring and reporting. Its main component is a daemon program known as ‘hekad’ that enables the functionality of gathering, converting, evaluating, processing and delivering data.
Heka is written in the ‘Go’ programming language, and has built-in plugins for inputting, decoding, filtering, encoding and outputting data. These plugins have different functionalities and can be used together to build a complete pipeline. Heka uses Advanced Message Queuing Protocol (AMQP) or TCP to ship data from one location to another. It can be used to load and parse log files from a file system, or to perform real-time analysis, graphing and anomaly detection on any type of data flow.
Scriptella is an open source ETL and script execution tool capable of using SQL or any other scripting language to perform data transformations. Scriptella supports cross-database ETL scripts, and can work with multiple data sources in a single ETL file.
Scriptella integrates with any JDBC/ODBC compliant driver, and provides an interface that allows interoperability with non-JDBC data sources and scripting languages. It can also integrate with Java EE, Spring, JMX, JNDI and JavaMail, making it a highly flexible option.
The case for “No ETL”
ETL has been a bedrock process of data analytics and data warehousing since the beginning, but the increased pace of data usage and the nosediving price of storage mean that speed is quickly overtaking efficiency as the most important element of a data pipeline.
Because the transform step in an ETL pipeline can often be a chokepoint, some modern data warehousing companies are switching to an ELT-based approach, where the transformation step is pushed to the end of the process, or even delayed until the point of query by analysts.
Here at Panoply, we’ve fully committed to a “No ETL” approach, but we’re not the only ones. Stitch, another of the cloud ETL tools featured here, has also been singing the praises of ELT.
In a discussion with TechTarget magazine, Stitch’s CEO praised the new AWS Athena service, and described the need to move from ETL to ELT:
“With Athena, you extract the data from the sources, and then load it with no or minimal preprocessing. This style of ELT is a superior model for most use cases, because it results in a simpler architecture and gives analysts more visibility into how the raw data becomes transformed.”