Any successful data operation needs two things at its heart: a solid data warehouse and an ETL tool to move data from individual data stores to the data warehouse. Since so many application databases—and data warehouses—are built on SQL databases, and since MySQL is one of the most popular flavors of SQL, we put together a list of the top MySQL ETL tools to help you move data in and out of MySQL database systems.
We focused on free and open source MySQL ETL tools here, but several of the tools below are the “community” versions of more powerful, paid platforms that you can use if you want something with a little more oomph. Skip to the end if you’re looking for good ETL solutions for extracting data from your MySQL database into a data warehouse.
Take a look at the offerings below, and you’re bound to find a tool that works for your particular data stack.
Free and Open Source ETL Tools
- Benetl
- Talend Open Studio for Data Integration
- Apatar
- KETL
- OpenMRS
- DataExpress
- Transformalize
- Csv2db
- Pentaho Kettle
Benetl is a lightweight ETL tool that can use a MySQL database for storage. It has strong data analysis functionality, but Benetl has file type limitations: its ETL tool builds tables only from data extracted from CSV, XLS, and TXT formats. Versions earlier than 1.8 don't support XLS, and 1.8 and later convert XLS to TXT.
You can use Benetl's ETL tool to define and retrieve fields from your source files. The special operator "part" creates complex data-retrieval functions that you can test in the Benetl GUI's "part formula viewer." Benetl is a good ETL choice if you're comfortable with command-line interfaces and the extra steps MySQL users need to take to get it running.
Go past basic data analysis and storage with Talend Open Studio for Data Integration, a cloud-friendly ETL that connects to MySQL. Open Studio’s robust toolbox lets you work with code, manage files, and transform and integrate big data. It gives you graphical design and development tools and hundreds of data processing components and connectors. With Talend’s Open Studio, you can import external code, create and expand your own, and view and test it in a runtime environment. Check your final products with Open Studio’s Data Quality & Profiling and Data Preparation features. You can get the open source download on the Talend website.
If you’re working with CRM systems, Apatar is an open source ETL that connects to MySQL. It moves and synchronizes customer data between your own systems and third-party applications. Apatar can transform and integrate large, complex customer datasets. The Apatar download saves time and resources by leveraging built-in app-integration tools and reusing mapping schemas that you create. Even non-developers can work with Apatar’s user-friendly drag-and-drop UI. No programming, design or coding is required with this cost-saving but powerful data migration tool that makes CRM work easier.
KETL is an open source ETL tool that features a built-in scheduler, so you don’t have to use any third-party tools. KETL is XML-based and works with MySQL to develop and deploy complex ETL transformation projects that need scheduling. With the KETL API, you can execute more job types than just the three that basic KETL supports: SQL, XML and OS.
OpenMRS is an ETL tool that supports predictive modeling by tracking historical data across multiple platforms. It runs queries on the data you select from a MySQL database and then transmits it to an Apache Hive data warehouse, where Hive analyzes the data and returns the results. OpenMRS can help healthcare agencies and other organizations that need fast, internal analysis of complex, sensitive data. Note that the MySQL/Hive combination doesn't do ETL in real time. You can add automatic login and scheduling with the OpenMRS API.
The open source version of the DataExpress ETL was a beta that has since been discontinued. DataExpress was a lightweight Scala-based tool that supported MySQL and let you move and access files securely across databases and platforms. You can read about the paid versions—DataExpress Open Platform (DXOP) and DataExpress NonStop (DXNS)—on the DataExpress website. You can still get the package through GitHub, but the repository is no longer maintained.
Transformalize is an ETL tool that makes your work with stored relational data fast and easy. Related data is normalized (split into separate tables) for optimal storage, and then it has to be rejoined before it can be retrieved. That rejoining can slow you down when it happens at runtime. With Transformalize, you can de-normalize (pre-join) relational data and eliminate that time-consuming step before you're ready to query and retrieve it. This open source ETL tool supports complex queries in XML, JSON and C# code. MySQL works for both source files and storage. You can download the latest version of Transformalize on GitHub.
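The pre-joining idea above can be shown in a few lines. This is a minimal sketch of de-normalization, not Transformalize's actual implementation; it uses Python's built-in sqlite3 as a stand-in database, and the table and column names are invented for the example.

```python
import sqlite3

# Normalized source: customers and orders live in separate tables,
# so reads normally pay a JOIN cost at query time.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Lin');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 9.5);
""")

# De-normalize (pre-join) once, ahead of time; later reads hit one
# flat table and need no JOIN at all.
conn.execute("""
    CREATE TABLE orders_flat AS
    SELECT o.id AS order_id, c.name AS customer_name, o.total
    FROM orders o JOIN customers c ON c.id = o.customer_id
""")

rows = conn.execute(
    "SELECT customer_name, total FROM orders_flat ORDER BY order_id"
).fetchall()
print(rows)  # [('Ada', 25.0), ('Ada', 40.0), ('Lin', 9.5)]
```

The trade-off is storage for speed: the flat table duplicates customer names, but queries against it skip the runtime join entirely.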
It's easier to work with your data when it's stored in a database than when it's in CSV files. Csv2db is an open source ETL tool that loads the data from your CSV files directly into databases. You don't need to spend time cleaning or transforming it first. This data-loading tool works with plain-text, multiple, and compressed files, and the "--verbose" flag gives you detailed output. You can clone Csv2db from its GitHub repo (git clone https://github.com/csv2db/csv2db) or else download it. For Csv2db to work with MySQL in a Python workflow, you'll have to install the mysql-connector-python driver.
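To make the CSV-to-database step concrete, here is a small sketch of what a loader like Csv2db does internally: read the CSV header, turn it into a parameterized INSERT, and batch the rows for a driver such as mysql-connector-python. The function name and table name are hypothetical; this is not Csv2db's actual code.

```python
import csv
import io

def csv_to_inserts(fh, table):
    """Read a CSV file object and return (sql, rows) for a parameterized
    bulk insert. The table name is caller-supplied and assumed safe."""
    reader = csv.reader(fh)
    header = next(reader)
    cols = ", ".join(header)
    # %s placeholders are the mysql-connector-python parameter style.
    placeholders = ", ".join(["%s"] * len(header))
    sql = f"INSERT INTO {table} ({cols}) VALUES ({placeholders})"
    return sql, [tuple(row) for row in reader]

sql, rows = csv_to_inserts(io.StringIO("name,city\nAda,London\nLin,Oslo\n"),
                           "people")
print(sql)  # INSERT INTO people (name, city) VALUES (%s, %s)

# With mysql-connector-python installed, the load itself would be roughly:
#   conn = mysql.connector.connect(user=..., password=..., database=...)
#   conn.cursor().executemany(sql, rows)
```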
Pentaho’s Data Integration (PDI), or Kettle (Kettle E.T.T.L. Environment), is an open source ETL tool that uses Pentaho’s own metadata-based integration method. With Kettle, you can move and transform data, create and run jobs, load-balance data, pull data from multiple sources, and more. But you can’t sequence your transformations.
Be aware that some users report issues with the Kettle MySQL driver that include slowness, timeouts and communication link failure, but Pentaho's MySQL wiki page shows solutions and bug fixes for some of the issues. Users report issues with Spoon, Kettle's graphical designer, as well.
Paid Cloud Services for Extracting MySQL Data to a Data Warehouse
There are a number of good reasons to use a data warehouse instead of a database when it comes to data integration and analytics. And when it comes to pulling data out of a MySQL database into a data warehouse (a common use case for businesses that want to run analytics on the data in their MySQL application database), the best approach is to buy rather than build. Here are five great options for MySQL ETL to a data warehouse.
Panoply, as you can imagine, is our favorite cloud ETL provider and data warehouse combination. With tons of data connectors—including one for MySQL—data ingestion is fast and easy, with just a few clicks and a login between you and your newly integrated data.
Under the hood, Panoply uses an ELT approach instead of traditional ETL. Data ingestion is faster and more dynamic because you don’t have to wait for data transformation to complete before you load your data. Panoply builds managed cloud data warehouses for every user. So you won’t need to set up a separate destination to store all the data you pull in with Panoply’s ELT process.
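The ELT idea described above is easy to see in miniature: land the raw rows first, then transform them with SQL once they're already in the warehouse. This sketch uses Python's sqlite3 as a stand-in for a cloud warehouse, with made-up table names; it illustrates the pattern, not Panoply's implementation.

```python
import sqlite3

wh = sqlite3.connect(":memory:")  # stand-in for the cloud warehouse

# 1. LOAD: land the raw source rows untouched. No transformation step
#    blocks ingestion, which is why ELT loads are fast.
wh.execute("CREATE TABLE raw_events (user_id TEXT, amount TEXT)")
wh.executemany("INSERT INTO raw_events VALUES (?, ?)",
               [("u1", "20"), ("u2", "5"), ("u1", "3")])

# 2. TRANSFORM: clean and aggregate later, in SQL, on data that is
#    already inside the warehouse.
wh.execute("""
    CREATE TABLE revenue_by_user AS
    SELECT user_id, SUM(CAST(amount AS REAL)) AS revenue
    FROM raw_events GROUP BY user_id
""")
result = dict(wh.execute(
    "SELECT user_id, revenue FROM revenue_by_user ORDER BY user_id"))
print(result)  # {'u1': 23.0, 'u2': 5.0}
```

Because the raw table is kept, you can re-run or change the transformation later without re-ingesting anything, which is a big part of ELT's appeal.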
Blendo offers a cloud-based ETL tool focused on letting users get their data into warehouses as fast as possible by using its suite of proprietary data connectors. Blendo’s ETL-as-a-service product makes it easy to get data from many data sources including S3 buckets, CSVs, and a large array of third-party data sources like Google Analytics, Mailchimp, Salesforce and others. After you set up the incoming end of the data pipeline, you can load it into several storage destinations.
Stitch is a self-service ETL data pipeline solution built for developers. The Stitch API replicates data from any source, and it handles bulk and incremental data updates. Stitch also has a replication engine that uses multiple strategies to deliver data to users. Its REST API supports JSON or Transit to automatically detect and normalize nested document structures into relational schemas. Stitch connects to Amazon Redshift, Google BigQuery and PostgreSQL architectures, and it integrates with a massive suite of BI data analysis tools. Stitch also has some cool automation features, like automated processing of Google Analytics data: it collects, transforms and loads Google Analytics data into its own system, where it automatically produces business insights from your raw data. MySQL is just one of many data sources that work with Stitch.
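Incremental replication, mentioned above for Stitch, is commonly done with a high-watermark column such as `updated_at`: each sync picks up only rows changed since the last run, then advances the watermark. This is a generic sketch of that strategy, not Stitch's actual code; the row shape is invented for the example.

```python
from datetime import datetime, timezone

def incremental_batch(rows, watermark):
    """One incremental-replication pass: select only rows modified after
    the stored watermark, and return the new watermark to persist for
    the next sync."""
    changed = [r for r in rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in changed), default=watermark)
    return changed, new_watermark

rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2024, 1, 3, tzinfo=timezone.utc)},
]
# Last sync finished on Jan 2, so only row 2 is picked up this pass.
batch, wm = incremental_batch(rows, datetime(2024, 1, 2, tzinfo=timezone.utc))
print([r["id"] for r in batch])  # [2]
```

Against a real MySQL source, the filter would be a `WHERE updated_at > %s` clause instead of a Python list comprehension, but the bookkeeping is the same.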
FlyData replicates and moves data from MySQL to Amazon Redshift. Its automated pipelines move data in real time, and they don’t need any maintenance after they’re set up. Your Redshift data warehouse is always up to date with changes to your MySQL data because FlyData continuously syncs your MySQL data.
The FlyData ETL is safe and easy to use. Its SecureTunnel VPN can access data behind your firewall. FlyData buffers your data, so you won’t lose any if errors happen. And FlyData support people monitor the sync and replication to Redshift for you. There’s not much work needed on your part when you use this paid database replicator.
Informatica’s suite of data integration software includes PowerCenter, which is known for its strong automation capabilities. PowerCenter uses a metadata-based approach for fast data ingestion and processing. Its automated error logging and early warning systems identify ETL pipeline issues before they become serious problems. Informatica supports multiple DBMS technologies, and it’s a high-rated company for data integration power users. Informatica’s data offerings are pricey (in the six-figure range for licenses). But they’re one of the biggest players in the space for a reason. If you’re looking for an ETL tool for a large, well-resourced organization, Informatica might be your choice.
Note that MySQL doesn't automatically connect to Informatica PowerCenter; you'll have to create an ODBC connection first.
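For anyone setting up that ODBC connection, the key piece is a DSN-less connection string naming the installed MySQL ODBC driver. The helper below is a hypothetical illustration, and the driver name "MySQL ODBC 8.0 Unicode Driver" is an assumption: use whatever name the MySQL ODBC driver registered on your machine.

```python
def mysql_odbc_connstr(server, database, uid, pwd,
                       driver="MySQL ODBC 8.0 Unicode Driver"):
    """Build a DSN-less ODBC connection string for MySQL. The driver
    name must match the driver actually installed on the host; the
    default here is an assumption, not a guarantee."""
    return (f"DRIVER={{{driver}}};SERVER={server};"
            f"DATABASE={database};UID={uid};PWD={pwd}")

cs = mysql_odbc_connstr("db.example.com", "sales", "etl_user", "secret")
print(cs)
# An ODBC-speaking tool (pyodbc, for instance) would then open the
# connection with something like: pyodbc.connect(cs)
```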