11 Great ETL Tools, and the Case for Saying “No” to ETL

Written by Alon Brody | Jan 9, 2017 3:35:00 PM

Extract, Transform, and Load (ETL) is a data warehousing process that uses batch processing to help business users analyze and report on data relevant to their business focus. Becoming increasingly popular in a modern data warehouse architecture, the ETL process pulls data out of the source, makes changes according to requirements, and then loads the transformed data into a database or BI platform to provide better business insights. With ETL, business leaders can make data-driven business decisions.

Below we list 11, mostly open source ETL tools (by alphabetical order). However we also discuss the need to move from ETL to “No ETL”, as ELT quickly evolves to be the ultimate process in modern data and cloud environments.

11 Great ETL Tools

1. Apache Camel is an open source Java integration framework that enables integration of different applications using multiple protocols and technologies. It is a rule-based routing and mediation engine. To configure routing and mediation rules, it provides Java object-based implementation of Enterprise Integration Patterns (EIP), using an API or declarative Java domain-specific language. EIPs are design patterns that enable the use of enterprise application integration and message-oriented middleware. Apache Camel uses Uniform Resource Identifiers (URIs), a naming scheme used in Camel to refer to an endpoint that provides information such as which components are being used, the context path and the options applied against the component. There are more than 100 components used by Apache Camel, including FTP, JMX and HTTP. Apache Camel can be deployed as a standalone application in a web container such as Tomcat, a JEEE application server such as JBOSS AS, or in combination with a Spring container.

2. Apache Kafka is an open source platform written in Scala and Java. It provides a unified, high-throughput, low-latency platform for managing real-time data. Kafka publishes and subscribes to a stream of records in a fault-tolerant way, immediately as they occur.Nodes of a Kafka cluster are known as brokers; streams of records stored in a cluster are known as topics; and each record consists of a key, value and timestamp. Kafka can be used in many different ways; for example as a message bus, a buffer for replication systems or event processing, and to decouple apps from databases for both OLTP and DWH.

3. Apatar is an open source data integration and ETL tool, with capabilities for extracting, transforming and loading data. Apatar comes with a visual interface that can reduce R&D costs, improve data integration efficiency and minimize the impact of system changes. Apatar is written in Java and has Unicode-compliant functionality. It can be used to integrate data across teams, populate data warehouses and data marts, and schedule and maintain little or no code when connected to other systems.

4. Atom, from ironSource, is a data pipeline management solution that allows data streaming in near real time, into a data warehouse. Atom enables data flow customization, based on requirements that help manage data more efficiently. Atom’s transformation code is written in Python, which helps turn raw logs into queryable fields and insights. It provides a collection layer, which supports sending data from any source and in any format to arrive to the target data repository near real time. Atom also has pause and play options. A reliable resumption of data flow without losing a single event is an important capability of Atom, in terms of maintaining data integrity.

5. Fivetran is a fully-managed data pipeline that integrates data from SaaS services and databases into a single data warehouse. It provides direct integration and sends data over a direct secure connection using a sophisticated caching layer. This caching layer helps to move data from one point to another without ever storing a copy on the application server. Fivetran does not impose any data limit, and can be used to centralize a company’s data and integrate all sources to determine Key Performance Indicators (KPIs) across an entire organization.

6. Heka is an open source software system for high performance data gathering, analysis, monitoring and reporting. Its main component is a daemon program known as ‘hekad’ that enables the functionality of gathering, converting, evaluating, processing and delivering data. Heka is written in the ‘Go’ programming language, and has built-in plugins for inputting, decoding, filtering, encoding and outputting data. These plugins have different functionalities and can be used together to build a complete pipeline. Heka uses Advanced Message Queuing Protocol (AMQP) or TCP to ship data from one location to another. It can be used to load and parse log files from a file system, or to perform real-time analysis, graphing and anomaly detection on any type of data flow.

7. Logstash is an open source data processing pipeline that ingests data from multiple sources simultaneously, transforming the source data and store events into ElasticSearch by default. Logstash is part of an ELK stack. The E stands for Elasticsearch, a JSON-based search and analytics engine, and the K stands for Kibana, which enables data visualization. Logstash is written in Ruby and provides a JSON-like structure which has a clear separation between internal objects. It has a pluggable framework featuring over 200 plugins, enabling the ability to mix, match and orchestrate facilities over different input, filters and output. This tool can be used for BI, or in data warehouses with fetch, transformation and storing event capabilities.

8. Scriptella is an open source ETL and script execution tool capable of using SQL or any other scripting language to perform data transformations. Scriptella supports cross-database ETL scripts, and can work with multiple data sources in a single ETL file. Scriptella integrates with any JDBC/ODBC compliant driver, and provides an interface that allows interoperability with non-JDBC data sources and scripting languages. It can also integrate with Java EE, Spring, JMX, JNDI and JavaMail.

9. Segment is a platform for collecting customer data and then sending it to analytics, marketing and data warehousing services. It provides an API that supports collection and routes of customer data to over 180 different tools and database services. Segment’s API has native library sources for every language, and helps record customer data from sources such as websites, mobile, apps or servers. It helps optimize analytics by piping raw customer data into data warehouses for further exploration and advanced analysis. The API can be used to query data or optimize marketing performance and design.

10. Stitch is a self-service ETL data pipeline solution built for developers. The Stitch API can replicate data from any source, and handle bulk and incremental data updates. Stitch also provides a replication engine that relies on multiple strategies to deliver data to users. Its REST API supports JSON or transit, which helps enable automatic detection and normalization of nested document structures into relational schemas. Stitch can connect to Amazon Redshift architecture, Google BigQuery architecture, and Postgres architecture - and integrates with BI tools. Stitch is typically designed to collect, transform and load Google analytics data into its own system, to automatically give business insights on raw data.

11. Talend open source data integration software products provide tools to integrate, cleanse, mask and profile data. Talend has a GUI that enables managing a large number of source systems using standard connectors. It also has Master Data Management (MDM) functionality, which allows organizations to have a single, consistent and accurate view of key enterprise data. This can create better transparency across a business, and lead to better operational efficiency, marketing effectiveness and compliance.

The case for “NO ETL”

Though Stitch is an ETL service, in a discussion with TechTarget magazine, their CEO praised the new AWS Athena service, and described the need to move from ETL to ELT.

‘No ETL,’ according to him, means that the ETL process is supplanted by Extract, Load, Transform (ELT), where data transformation happens in SQL as needed for downstream use, rather than upfront during the loading stage. He admitted that it might be surprising for an ETL firm like his to promote the notion of ELT, but explained the latter’s benefits.

“With Athena, you extract the data from the sources, and then load it with no or minimal preprocessing. This style of ELT is a superior model for most use cases, because it results in a simpler architecture and gives analysts more visibility into how the raw data becomes transformed.”

With the ELT approach, after extracting is finalized, the data loading phase starts right away without waiting for a “correct” data transformation operation. The transformation can even run at query time, a major time-saver in comparison with ETL - which requires users to wait for transformation to be done. ELT provides BI users and analysts with unlimited access to the entire raw data at any time, creating greater flexibility for users to generate more insights to support the business.

Learn more, with a detailed comparison of the two approaches here: ETL vs ELT: The Difference is in the How.

Want to learn more about Redshift cluster and Redshift columnar storage? Check here: Redshift cluster, Redshift columnar storage.

View full post