Extract, Transform, and Load (ETL) is a data warehousing process that uses batch processing to help business users analyze and report on data relevant to their business focus. Becoming increasingly popular in a modern data warehouse architecture, the ETL process pulls data out of the source, makes changes according to requirements, and then loads the transformed data into a database or BI platform to provide better business insights. With ETL, business leaders can make data-driven business decisions. Below we list 6 open source ETL tools and 11 paid options to allow you to make your own comparisons and decide what’s best for your business. We also discuss the need to move from ETL to “No ETL”, as ELT quickly evolves to be the ultimate process in modern data and cloud environments.
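To make the three stages concrete, here is a minimal Python sketch of a batch ETL job. The source data, table name and validation rule are all hypothetical, and SQLite stands in for the destination warehouse; a real pipeline would read from production systems and load into a proper warehouse or BI platform.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real pipeline would pull from files, APIs or databases.
SOURCE_CSV = """order_id,customer,amount
1,alice,10.50
2,bob,not_a_number
3,carol,7.25
"""


def extract(text):
    """Extract: pull raw rows out of the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: validate and reshape rows before they are loaded."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # drop rows whose amount is not numeric
        cleaned.append((int(row["order_id"]), row["customer"].title(), amount))
    return cleaned


def load(rows, conn):
    """Load: write the transformed rows into the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(conn):
    """Run the three stages in order and return (row count, total amount)."""
    load(transform(extract(SOURCE_CSV)), conn)
    return conn.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone()
```

Note that the invalid row is discarded during the transform stage, so it never reaches the destination. That is the key difference from the ELT approach discussed at the end of this article.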
This list of ETL tools is separated into two parts, paid and free/open source, and there is always the case for no ETL…
17 Great ETL Tools
1. Informatica PowerCenter
Informatica’s suite of data integration software includes PowerCenter, which is known for its strong automation capabilities. PowerCenter uses a metadata-based approach to speed data ingestion and processing, and offers automated error logging and early warning systems to help identify ETL pipeline issues before they become a serious problem. Informatica supports multiple DBMS technologies, and is a highly-rated company for data integration power users. In general, Informatica’s data offerings are quite pricey (in the 6-figure range for licenses), but they’re one of the biggest players in the space for a reason. If you’re looking for an ETL tool for a large, well-resourced organization, Informatica may be your choice.
Informatica PowerCenter price: $2,000/month for the most basic plan
2. Microsoft SQL Server SSIS
Microsoft offers SSIS, a graphical interface for managing ETL using MS SQL Server. One of the main selling points for SSIS versus other solutions is its easy-to-use interface, allowing users to deploy integrated data warehousing solutions without having to get involved with writing much--or any--code. The graphical interface allows for easy drag-and-drop ETL for multiple data types and warehouse destinations, including non-MS DBs. SSIS ships with SQL Server, so if you’re a SQL Server user, you already have it. But even those who don’t use SQL Server have been known to buy a license just to get access to SSIS, because it’s just that powerful. SSIS is a great solution for a team with a mix of technical skill levels, as it’s equally effective for ETL ninjas and point-and-click types alike.
SSIS price: $931 - $15,000
3. Talend
Talend’s open source data integration products provide tools to integrate, cleanse, mask and profile data. Talend has a GUI that enables managing a large number of source systems using standard connectors. It also has Master Data Management (MDM) functionality, which allows organizations to have a single, consistent and accurate view of key enterprise data. This can create better transparency across a business, and lead to better operational efficiency, marketing effectiveness and compliance.
Talend price: $1,170/user monthly or $12,000 annually
4. Alooma
Alooma offers an enterprise-scale data integration platform with great ETL tools built in. The company puts a strong focus on rapid pipeline construction, data quality monitoring and error handling to ensure that customers don’t lose or corrupt data in a potentially error-prone ETL process, but it also offers the flexibility to intervene and write your own scripts to monitor, clean and move your data as needed. As mentioned, Alooma is designed for enterprise-scale operations, so if you’re a small startup with a small operating budget, Alooma probably isn’t for you. Also note that as of 2019, “Alooma is only accepting new customers that are migrating to Google Cloud Platform.”
Alooma price: $1,000 - $15,000/month
5. Panoply
Panoply is the only cloud ETL provider and data warehouse combination. With 100+ data connectors, ETL and data ingestion is fast and easy, with just a few clicks and a login between you and your newly integrated data. Under the hood, Panoply is actually using an ELT approach (rather than traditional ETL), which makes data ingestion much faster and more dynamic, since you don’t have to wait for transformation to complete before loading your data. And since Panoply builds managed cloud data warehouses for every user, you won’t need to set up a separate destination to store all the data you pull in using Panoply’s ELT process.
Panoply price: $349/month (includes managed Redshift cluster)
6. Blendo
Blendo offers a cloud-based ETL tool focused on letting users get their data into warehouses as quickly as possible using their suite of proprietary data connectors. Blendo’s ETL-as-a-service product makes it easy to pull data in from all sorts of data sources including S3 buckets, CSVs, and a large array of third-party data sources like Google Analytics, Mailchimp, Salesforce and many others. Once you’ve set up the incoming end of the data pipeline, you can load it into a number of different storage destinations, including Redshift, BigQuery, MS SQL Server, Panoply and Snowflake.
Blendo price: Available upon request
7. Fivetran
Fivetran is a fully-managed data pipeline with a web interface that integrates data from SaaS services and databases into a single data warehouse. It provides direct integration and sends data over a direct secure connection using a sophisticated caching layer. This caching layer helps to move data from one point to another without ever storing a copy on the application server. Fivetran does not impose any data limit, and can be used to centralize a company’s data and integrate all sources to determine Key Performance Indicators (KPIs) across an entire organization.
Fivetran price: Available upon request
8. Stitch
Stitch is a self-service ETL data pipeline solution built for developers. The Stitch API can replicate data from any source, and handle bulk and incremental data updates. Stitch also provides a replication engine that relies on multiple strategies to deliver data to users. Its REST API supports JSON or Transit, which helps enable automatic detection and normalization of nested document structures into relational schemas. Stitch can load data into Amazon Redshift, Google BigQuery and Postgres, and integrates with BI tools. Stitch is typically used to collect, transform and load Google Analytics data into its own system, to automatically surface business insights from raw data.
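As a rough illustration of what normalizing nested documents into relational schemas can look like, the Python sketch below flattens nested objects into columns and moves nested lists into a child table with a foreign key. This is not Stitch’s actual algorithm: the function names, the `__` column separator and the `id` key field are assumptions made for the example.

```python
def flatten(record, parent_key="", sep="__"):
    """Flatten nested dicts into a single-level dict of column name -> value."""
    flat = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, column, sep))
        else:
            flat[column] = value
    return flat


def normalize(record, table, key_field="id"):
    """Split one nested record into a parent row plus child rows for nested lists.

    Returns a dict mapping table name -> list of rows (hypothetical layout).
    """
    tables = {table: []}
    parent = {}
    for column, value in flatten(record).items():
        if isinstance(value, list):
            # Each list becomes a child table linked back to the parent row.
            child_rows = tables.setdefault(f"{table}__{column}", [])
            for item in value:
                child = flatten(item) if isinstance(item, dict) else {"value": item}
                child[f"{table}_{key_field}"] = record[key_field]
                child_rows.append(child)
        else:
            parent[column] = value
    tables[table].append(parent)
    return tables
```

Given an order document with a nested `customer` object and an `items` list, `normalize(order, "orders")` would produce one `orders` row with `customer__name`-style columns and one `orders__items` row per list element.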
Stitch price: $100 - $1,000/month based on data size
9. Segment
Segment is a platform for collecting customer data and then sending it to analytics, marketing and data warehousing services. It provides an API that supports collecting customer data and routing it to over 180 different tools and database services. Segment’s API has native library sources for every language, and helps record customer data from sources such as websites, mobile apps or servers. It helps optimize analytics by piping raw customer data into data warehouses for further exploration and advanced analysis. The API can be used to query data or optimize marketing performance and design.
Segment price: $120/month
10. Atom
Atom, from ironSource, is a data pipeline management solution that streams data into a data warehouse in near real time. Atom enables data flow customization based on requirements, which helps manage data more efficiently. Atom’s transformation code is written in Python, which helps turn raw logs into queryable fields and insights. It provides a collection layer that supports sending data from any source and in any format, so that it arrives at the target data repository in near real time. Atom also has pause and play options. Reliable resumption of data flow without losing a single event is an important capability of Atom for maintaining data integrity.
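To give a sense of what turning raw logs into queryable fields involves, here is a small Python sketch. It does not use Atom’s API; the log format, the field names and the `transform` function are all hypothetical.

```python
import re
from datetime import datetime

# Hypothetical log format: "<ISO timestamp> <LEVEL> user=<id> event=<name>"
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\S+) (?P<level>[A-Z]+) user=(?P<user_id>\d+) event=(?P<event>\w+)"
)


def transform(line):
    """Parse one raw log line into typed fields, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["timestamp"] = datetime.fromisoformat(fields["timestamp"])
    fields["user_id"] = int(fields["user_id"])
    return fields
```

Each parsed line becomes a dictionary with typed values, which maps naturally onto a row in a warehouse table.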
Atom price: Pay-per-use
11. Azure Data Factory
In addition to SQL Server SSIS, Microsoft’s on-premise ETL solution, the company also offers Azure Data Factory (ADF), an ETL tool for their cloud-based Azure platform. Because they’re both from Microsoft, Data Factory has nice interoperability with SSIS: if you’re making the jump from an on-premise SQL Server setup to Azure, you’ll be able to deploy the SSIS packages you’ve already developed in Data Factory. ETL pipelines in ADF are built in a graphical interface, allowing for low-code use, and the package has a wide variety of data connectors for easy data ingestion--except, strangely, support for loading Microsoft Excel files. At the other end of the pipe, Data Factory can generally only be used to load into Azure data warehouses, so this solution probably only makes sense for those who want to live entirely in a Microsoft ecosystem.
Azure Data Factory price: $1 for 1,000 runs per month
Free and open source ETL tools
12. Apache Camel
Apache Camel is an open source Java integration framework that enables integration of different applications using multiple protocols and technologies. It is a rule-based routing and mediation engine. To configure routing and mediation rules, it provides a Java object-based implementation of Enterprise Integration Patterns (EIPs), using an API or a declarative Java domain-specific language. EIPs are design patterns that enable the use of enterprise application integration and message-oriented middleware. Apache Camel uses Uniform Resource Identifiers (URIs) as its naming scheme for endpoints; each URI provides information such as which component is being used, the context path and the options applied to the component. There are more than 100 components used by Apache Camel, including FTP, JMX and HTTP. Apache Camel can be deployed as a standalone application, in a web container such as Tomcat, in a Java EE application server such as JBoss AS, or in combination with a Spring container.
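The three parts of an endpoint URI can be seen by splitting one apart. The Python sketch below is only an illustration of the naming scheme, not Camel’s own parser (Camel itself is Java); the `parse_endpoint` function and its output layout are assumptions made for the example.

```python
from urllib.parse import parse_qsl


def parse_endpoint(uri):
    """Split a Camel-style endpoint URI into component, context path and options."""
    component, _, rest = uri.partition(":")
    context_path, _, query = rest.partition("?")
    return {
        "component": component,
        "context_path": context_path,
        "options": dict(parse_qsl(query)),
    }
```

For example, `parse_endpoint("file:data/inbox?noop=true")` identifies `file` as the component, `data/inbox` as the context path and `noop=true` as an option applied to that component.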
13. Apache Kafka
Apache Kafka is an open source platform written in Scala and Java. It provides a unified, high-throughput, low-latency platform for managing real-time data. Kafka lets applications publish and subscribe to streams of records in a fault-tolerant way, as soon as the records occur. Nodes of a Kafka cluster are known as brokers; streams of records stored in a cluster are known as topics; and each record consists of a key, value and timestamp. Kafka can be used in many different ways: for example, as a message bus, as a buffer for replication systems or event processing, and to decouple apps from databases for both OLTP and DWH.
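The vocabulary above can be modeled with a toy, in-memory Python sketch. This is not a Kafka client and leaves out brokers, partitions and fault tolerance entirely; the `Record` and `Topic` classes are hypothetical, and the notion of an offset (a record’s position within the stream) is added here for illustration.

```python
import time
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    """A single record: key, value and timestamp."""
    key: str
    value: str
    timestamp: float


@dataclass
class Topic:
    """A named, append-only stream of records."""
    name: str
    records: list = field(default_factory=list)

    def publish(self, key, value, timestamp=None):
        """Append a record and return its offset (position in the stream)."""
        stamp = time.time() if timestamp is None else timestamp
        self.records.append(Record(key, value, stamp))
        return len(self.records) - 1

    def read_from(self, offset):
        """Return every record at or after the given offset."""
        return self.records[offset:]
```

Because the stream is append-only, a subscriber that remembers the last offset it read can resume from that point without missing records, which is what makes the decoupling use cases described above workable.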
14. Apatar
Apatar is an open source data integration and ETL tool, with capabilities for extracting, transforming and loading data. Apatar comes with a visual interface that can reduce R&D costs, improve data integration efficiency and minimize the impact of system changes. Apatar is written in Java and has Unicode-compliant functionality. It can be used to integrate data across teams, populate data warehouses and data marts, and schedule and maintain connections to other systems with little or no code.
15. Heka
Heka is an open source software system for high performance data gathering, analysis, monitoring and reporting. Its main component is a daemon program known as ‘hekad’ that enables the functionality of gathering, converting, evaluating, processing and delivering data. Heka is written in the ‘Go’ programming language, and has built-in plugins for inputting, decoding, filtering, encoding and outputting data. These plugins have different functionalities and can be used together to build a complete pipeline. Heka uses Advanced Message Queuing Protocol (AMQP) or TCP to ship data from one location to another. It can be used to load and parse log files from a file system, or to perform real-time analysis, graphing and anomaly detection on any type of data flow.
16. Logstash
Logstash is an open source data processing pipeline that ingests data from multiple sources simultaneously, transforms it, and stores events in Elasticsearch by default. Logstash is part of the ELK stack: the E stands for Elasticsearch, a JSON-based search and analytics engine, and the K stands for Kibana, which enables data visualization. Logstash is written in Ruby and provides a JSON-like structure which has a clear separation between internal objects. It has a pluggable framework featuring over 200 plugins, making it possible to mix, match and orchestrate different inputs, filters and outputs. This tool can be used for BI, or in data warehouses, with its capabilities for fetching, transforming and storing events.
17. Scriptella
Scriptella is an open source ETL and script execution tool capable of using SQL or any other scripting language to perform data transformations. Scriptella supports cross-database ETL scripts, and can work with multiple data sources in a single ETL file. Scriptella integrates with any JDBC/ODBC compliant driver, and provides an interface that allows interoperability with non-JDBC data sources and scripting languages. It can also integrate with Java EE, Spring, JMX, JNDI and JavaMail.
The case for “NO ETL”
ETL has been a bedrock process of data analytics and data warehousing since the beginning, but the increased pace of data usage and the nosediving price of storage mean that it’s often necessary these days to get data in front of analysts as quickly as possible. Because the Transform step in an ETL pipeline can often be a chokepoint in the data pipeline, that means that some more modern data warehousing companies are switching to an ELT-based approach, where the transformation step is pushed to the end of the process, or even delayed until the point of query by analysts. Here at Panoply, we’ve fully committed to a “NO ETL” approach, but we’re not the only ones. Stitch, another of the cloud ETL tools featured here, has also been singing the praises of ELT.
‘No ETL,’ according to him, means that the ETL process is supplanted by Extract, Load, Transform (ELT), where data transformation happens in SQL as needed for downstream use, rather than upfront during the loading stage. He admitted that it might be surprising for an ETL firm like his to promote the notion of ELT, but explained the latter’s benefits.
“With Athena, you extract the data from the sources, and then load it with no or minimal preprocessing. This style of ELT is a superior model for most use cases, because it results in a simpler architecture and gives analysts more visibility into how the raw data becomes transformed.”
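The ELT pattern described in the quote can be sketched in a few lines of Python. SQLite stands in here for the warehouse or query engine, and the table, view and sample rows are hypothetical: the raw data is loaded untouched, and the transformation lives in a SQL view that runs only when an analyst queries it.

```python
import sqlite3

# Hypothetical raw events, loaded as-is (amounts are still unvalidated text).
RAW_EVENTS = [
    ("alice", "purchase", "10.50"),
    ("bob", "purchase", "4.00"),
    ("bob", "purchase", "oops"),
    ("alice", "refund", "-2.00"),
]


def load_raw(conn, rows):
    """Extract and Load: store the source rows with no preprocessing."""
    conn.execute("CREATE TABLE raw_events (customer TEXT, event TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", rows)
    conn.commit()


def define_transform(conn):
    """Transform: expressed in SQL and evaluated at query time, not at load time."""
    conn.execute(
        """
        CREATE VIEW customer_totals AS
        SELECT customer, SUM(CAST(amount AS REAL)) AS total
        FROM raw_events
        WHERE amount GLOB '*[0-9]*'  -- crude check: keep rows containing a digit
        GROUP BY customer
        """
    )


def customer_totals(conn):
    """Query the transformed view, as an analyst would."""
    return conn.execute(
        "SELECT customer, total FROM customer_totals ORDER BY customer"
    ).fetchall()
```

Unlike in a traditional ETL job, the malformed row is still sitting in `raw_events` after loading, so analysts can see exactly what the view filters out and can change the transformation later without re-ingesting anything.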