Finding the right ETL tool for your business is essential. ETL pulls data out of a source (extract), makes changes according to requirements (transform), and then pushes the transformed data into a database, data warehouse, or BI platform.
From cost to maintenance, there are a lot of factors to consider when choosing an ETL tool. To help, we've outlined both paid and open source ETL tools so you can make a decision about what fits best for your business. We also discuss the move from ETL to “No ETL,” as ELT is quickly taking over modern data and cloud environments.If you're the type that likes to skip around, we've separated these tools into three categories: Paid ETL Tools, Open Source Tools, and Deprecated Tools
Paid ETL tools
While free tools are always tempting, in a lot of cases you get what you pay for. If you're looking for off-the-shelf connectivity and low- or no-maintenance options, a paid service is the way to go. That said, there are plenty of differences between these tools, so it's worth sweating the details.
Informatica’s suite of data integration software includes PowerCenter, which is known for its strong automation capabilities. PowerCenter uses a metadata-based approach to speed data ingestion and processing, and offers automated error logging and early warning systems to help identify ETL pipeline issues before they become a serious problem.
Informatica supports multiple DBMS technologies, and is a highly-rated company for data integration power users. In general, Informatica’s data offerings are quite pricey (in the 6-figure range for licenses), but they’re one of the biggest players in the space for a reason. If you’re looking for an ETL tool for a large, well-resourced organization, Informatica may be your choice.
Informatica PowerCenter price: $2,000/month for the most basic plan
Microsoft offers SSIS (SQL Server Integration Services), a graphical interface for managing ETL using MS SQL Server. SSIS ships with SQL Server, so if you’re a SQL Server user, you already have it. But even those who don’t use SQL Server have been known to buy a license just to get access to SSIS, because it’s just that powerful.
One of the main selling points for SSIS versus other solutions is its easy-to-use interface, allowing users to deploy integrated data warehousing solutions without having to get involved with writing much—or any—code. The graphical interface allows for easy drag-and-drop ETL for multiple data types and warehouse destinations, including non-MS DBs. SSIS is a great solution for a team with a mix of technical skill levels, as it’s equally effective for ETL ninjas and point-and-click types alike.
SSIS price: $931 - $15,000
Talend open source data integration software products provide software to integrate, cleanse, mask and profile data. This ETL tool offers a GUI that enables managing a large number of source systems using standard connectors.
Talend also has Master Data Management (MDM) functionality, which allows organizations to have a single, consistent and accurate view of key enterprise data. This can create better transparency across a business, and lead to better operational efficiency, marketing effectiveness and compliance.
Talend price: $1,170/user monthly or $12,000 annually
Panoply is the only cloud ETL provider and data warehouse combination. With a wide variety of data connectors, ETL and data ingestion is fast and easy, just a few clicks stand between you and your newly integrated data.
Under the hood, Panoply uses a flexible ELT approach (rather than traditional ETL), which makes data ingestion much faster and more dynamic, since you don’t have to wait for transformation to complete before loading your data. And since Panoply builds managed cloud data warehouses for every user, you won’t need to set up a separate destination to store all the data you pull in using Panoply’s ELT process.
Panoply price: see all pricing options; a free trial is available.
Stitch is a self-service ETL data pipeline solution built for developers. The Stitch API can replicate data from any source, and handle bulk and incremental data updates. Stitch also provides a replication engine that relies on multiple strategies to deliver data to users. Its REST API supports JSON or transit, which helps enable automatic detection and normalization of nested document structures into relational schemas.
Stitch can connect to Amazon Redshift architecture, Google BigQuery architecture, Postgres architecture, and BI tools.
Stitch price: $100 - $1,000/month based on data size
Fivetran is a fully-managed data pipeline with a web interface that integrates data from SaaS services and databases into a single data warehouse. It provides direct integration and sends data over a secure connection using a sophisticated caching layer that helps to move data from one point to another without ever storing a copy on the application server.
Fivetran does not impose any data limit, and can be used to centralize a company’s data and integrate all sources to determine Key Performance Indicators (KPIs) across an entire organization.
Fivetran price: Available upon request
Alooma is an enterprise data integration platform with great ETL tools built in. The company puts a strong focus on rapid pipeline construction, data quality monitoring and error handling to ensure that customers don’t lose or corrupt data in a potentially error-prone ETL process, but it also offers the flexibility to intervene and write your own scripts to monitor, clean and move your data as needed.
Alooma is designed for enterprise-scale operations, so if you’re a small startup with a small operating budget, Alooma probably isn’t for you. Also note that as of 2019, “Alooma is only accepting new customers that are migrating to Google Cloud Platform.”
Alooma price: $1,000 - $15,000/month
Segment is a platform for collecting customer data and then sending it to analytics, marketing and data warehousing services. It provides an API that supports collection and routes of customer data to over 180 different tools and database services.
Segment’s API has native library sources for every language, and helps record customer data from sources such as websites, mobile, apps or servers. It helps optimize analytics by piping raw customer data into data warehouses for further exploration and advanced analysis.
Segment price: $120/month
Atom, from ironSource, is a data pipeline management solution that enables streaming into a data warehouse. Atom enables data flow customization that helps to manage data more efficiently. Atom's ability to resume data flow without losing a single event is a both important for governance and a major differentiator from other streaming data pipelines.
Atom’s transformation code is written in Python, which helps turn raw logs into queryable fields and insights, but could be a barrier for some users. It provides a collection layer, which supports sending data from any source and in any format to arrive to the target data repository near real time.
Atom price: Pay-per-use
Azure Data Factory
In addition to SQL Server SSIS, Microsoft’s on-premise ETL solution, the company also offers Azure Data Factory (ADF), an ETL tool for their cloud-based Azure platform. Because they’re both from Microsoft, Data Factory has nice interoperability with SSIS: if you’re making the jump from an on-premise SQL Server setup to Azure, you’ll be able to deploy the SSIS packages you’ve already developed in Data Factory.
ETL pipelines in ADF are built in a graphical interface, allowing for low-code use, and the package has a wide variety of data connectors for easy data ingestion. At the other end of the pipe, Data Factory can generally only be used to load into Azure data warehouses, so this solution makes sense for those who want to live entirely in a Microsoft ecosystem.
Azure Data Factory price: $1 for 1,000 runs per month
Matillion's ETL tool is, according to its developers, purpose-built for cloud data warehouses, so it could be a particularly strong choice for users who are especially interested in loading data into Amazon Redshift, Google BigQuery or Snowflake.
With over 70 native data source integrations, as well as an optional no-code graphical interface, Matillion makes loading your data into your warehouse of choice simple and straightforward. It also automates the data transformations you'll need in order to get your data ready for analysis with your favorite BI tool. Matillion bills hourly for usage, so it could also be particularly attractive for those with a lot of ETL downtime.
Matillion price: $12,000 - 48,000 / year, depending on plan and assuming 24/7 usage
Built on AWS architecture, Etleap makes it easy to collect data from a wide range of sources and load them into your Redshift or Snowflake data warehouse. Etleap's point-and-click, no code interface makes it a good fit for data teams that want a lot of control over their ETL processes, but don't necessarily want high IT overhead.
Because it's integrated with AWS, Etleap also makes it easy to scale your data warehouse up and down with the same easy-to-use interface, while at the same time managing your ETL flows on the fly. Once data has been collected using one or many of its 50+ data integrations, users can also take advantage of Etleap's graphical data wrangling interface or fire up the SQL editor for data modeling and transformation. Orchestration and scheduling features make managing all your ETL pipelines and processes as easy as the click of a button. In addition to its SaaS offering, Etleap also provides a version that can be hosted on your own VPC.
Etleap price: Available upon request
Free and open source ETL tools
Open source ETL tools are fantastic if you want to keep costs low...and if you love being part of a community that continually works to keep the data flowing. While budget probably isn't an issue for these tools, understanding how their backend is built is key to narrowing down your options.
Apache Camel is an open source framework that enables integration of different applications using multiple protocols and technologies. To configure routing and mediation rules, it provides Java object-based implementation of Enterprise Integration Patterns (EIP), using an API or declarative Java domain-specific language. EIPs are design patterns that enable the use of enterprise application integration and message-oriented middleware.
Apache Camel uses Uniform Resource Identifiers (URIs), a naming scheme used in Camel to refer to an endpoint that provides information such as which components are being used, the context path and the options applied against the component. There are more than 100 components used by Apache Camel, including FTP, JMX and HTTP. Apache Camel can be deployed as a standalone application in a web container such as Tomcat, a JEEE application server such as JBOSS AS, or in combination with a Spring container.
Airbyte is a new open-source EL(T) platform that started in July 2020. It has a fast-growing community and list of supported connectors. Airbyte is different from other tools in that its connectors are usable out of the box through a UI and an API that enables monitoring, scheduling and orchestration.
Because Airbyte connectors run as Docker containers, they can be built in the language of your choice. This ETL tool offers even more flexibility through modular components and optional feature subsets that you can It distinguishes itself by several significant choices.
As with Fivetran, Airbyte integrates with dbt for transformations, making it an ELT tool. However, contrary to Singer, Airbyte uses one single open-source repo to standardize and consolidate all developments from the community, leading to higher quality connectors. They built a compatibility layer with Singer so that Singer taps can run within Airbyte.
Apache Kafka is an open source platform written in Scala and Java. It provides a unified, high-throughput, low-latency platform for managing real-time data. Kafka publishes and subscribes to a stream of records in a fault-tolerant way, immediately as they occur.
Kafka can be used in many different ways: for example as a message bus, a buffer for replication systems or event processing, and to decouple apps from databases for both OLTP and DWH.
Logstash is an open source data processing pipeline that ingests data from multiple sources simultaneously, transforming the source data and store events into ElasticSearch by default. Logstash is part of an ELK stack. The E stands for Elasticsearch, a JSON-based search and analytics engine, and the K stands for Kibana, which enables data visualization.
Logstash is written in Ruby and provides a JSON-like structure which has a clear separation between internal objects. It has a pluggable framework featuring over 200 plugins, enabling the ability to mix, match and orchestrate facilities over different input, filters and output. This tool can be used for BI, or in data warehouses with fetch, transformation and storing event capabilities.
Singer's open source, command line ETL tool allows users to build modular ETL pipelines using its "tap" and "target" modules. Instead of building a single, static ETL pipeline, Singer provides a backbone that allows users to connect data sources to storage destinations.
With a large collection of pre-built taps, the scripts that collect datapoints from their original sources, and an extensive selection of pre-built targets, the scripts that transform and load data into pre-specified destinations, Singer allows users to write concise, single-line ETL processes that can be modified on the fly by swapping taps and targets in and out.
Deprecated ETL tools
Alas, not all ETL tools stand the test of time. While we would necessarily recommend implementing a deprecated tool in your data stack, it's helpful to know how they were built and why they still command attention in parts of the data world.
Apatar is an open source data integration and ETL tool, with capabilities for extracting, transforming and loading data. Apatar comes with a visual interface that can reduce R&D costs, improve data integration efficiency and minimize the impact of system changes. Apatar is written in Java and has Unicode-compliant functionality. It can be used to integrate data across teams, populate data warehouses and data marts, and schedule and maintain little or no code when connected to other systems.
Apatar's GitHub profile indicates it was deprecated in 2012, but it remains a favorite for those looking for a drag-and-drop ETL solution.
Heka is an open source software system for high performance data gathering, analysis, monitoring and reporting. Its main component is a daemon program known as ‘hekad’ that enables the functionality of gathering, converting, evaluating, processing and delivering data.
Heka is written in the ‘Go’ programming language, and has built-in plugins for inputting, decoding, filtering, encoding and outputting data. These plugins have different functionalities and can be used together to build a complete pipeline. Heka uses Advanced Message Queuing Protocol (AMQP) or TCP to ship data from one location to another. It can be used to load and parse log files from a file system, or to perform real-time analysis, graphing and anomaly detection on any type of data flow.
Scriptella is an open source ETL and script execution tool capable of using SQL or any other scripting language to perform data transformations. Scriptella supports cross-database ETL scripts, and can work with multiple data sources in a single ETL file.
Scriptella integrates with any JDBC/ODBC compliant driver, and provides an interface that allows interoperability with non-JDBC data sources and scripting languages. It can also integrate with Java EE, Spring, JMX, JNDI and JavaMail, making it a highly flexible option.
The case for “No ETL”
ETL has been a bedrock process of data analytics and data warehousing since the beginning, but the increased pace of data usage and the nosediving price of storage mean that speed is quickly overtaking efficiency as the most important element of a data pipeline.
Because the transform step in an ETL pipeline can often be a chokepoint, some modern data warehousing companies are switching to an ELT-based approach, where the transformation step is pushed to the end of the process, or even delayed until the point of query by analysts.
Here at Panoply, we’ve fully committed to a “No ETL” approach, but we’re not the only ones. Stitch, another of the cloud ETL tools featured here, has also been singing the praises of ELT.
In a discussion with TechTarget magazine, Stitch’s CEO praised the new AWS Athena service, and described the need to move from ETL to ELT:
“With Athena, you extract the data from the sources, and then load it with no or minimal preprocessing. This style of ELT is a superior model for most use cases, because it results in a simpler architecture and gives analysts more visibility into how the raw data becomes transformed.”