ETL is the heart of any data warehousing project. Any successful data project will involve the ingestion and/or extraction of large numbers of data points, some of which may not be properly formatted for their destination database. Luckily for data professionals, the Python developer community has built a wide array of open source tools that make ETL a snap. We’ve put together a list of the top Python ETL tools to help you gather, clean and load your data into your data warehousing solution of choice. Some of these packages allow you to manage every step of an ETL process, while others are just really good at a specific step in the process. Either way, you’re bound to find something helpful below.
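To make those three steps concrete before diving into the tools, here is a minimal, dependency-free sketch of an extract-transform-load flow in plain Python. The sample CSV and the `signups` table are invented for illustration:

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for an extracted source file.
RAW_CSV = """name,signups
alice,42
bob,17
"""

def extract(text):
    """Extract: parse CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: coerce types and normalize names for the destination."""
    return [(row["name"].title(), int(row["signups"])) for row in rows]

def load(rows, conn):
    """Load: write the cleaned rows into the destination table."""
    conn.execute("CREATE TABLE IF NOT EXISTS signups (name TEXT, count INTEGER)")
    conn.executemany("INSERT INTO signups VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
total = conn.execute("SELECT SUM(count) FROM signups").fetchone()[0]
print(total)  # 59
```

Every tool below either replaces or orchestrates some version of these three functions.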
Originally developed at Airbnb, Airflow is the new open source hotness of modern data infrastructure. While it doesn’t do any of the data processing itself, Airflow can help you schedule, organize and monitor ETL processes using Python. Airflow’s core technology revolves around the construction of Directed Acyclic Graphs (DAGs), which allow its scheduler to spread your tasks across an array of workers while respecting the dependencies you define between them. It comes with a handy web-based UI for managing and editing your DAGs, and there’s also a nice set of command-line tools that make it easy to perform “DAG surgery.” Airflow is highly extensible and scalable, so consider using it if you’ve already chosen your favorite data processing package and want to take your ETL management up a notch.
As long as we’re talking about Apache tools, we should also talk about Spark! Spark isn’t technically a Python tool, but the PySpark API makes it easy to handle Spark jobs in your Python workflow. Spark has all sorts of data processing and transformation tools built in, and is designed to run computations in parallel, so even large data jobs can be run extremely quickly. It scales up nicely for truly large data operations, and working through the PySpark API allows you to write concise, readable and shareable code for your ETL jobs. Consider Spark if you need speed and size in your data operations.
petl is a Python package for ETL (hence the name ‘petl’). Similar to pandas, petl lets the user build tables in Python by extracting from a number of possible data sources (CSV, XLS, HTML, TXT, JSON, etc.) and outputting to your database or storage format of choice. petl has a lot of the same capabilities as pandas, but is designed more specifically for ETL work and doesn’t include built-in analysis features, so it might be right for you if you’re interested purely in ETL.
While Panoply is designed as a full-featured data warehousing solution, it also streamlines ETL: Panoply handles every step of the process, automating data ingestion from any data source you can think of, from CSVs to S3 buckets to Google Analytics.
If you’ve used Python to work with data, you’re probably familiar with pandas, the data manipulation and analysis toolkit. If not, you should be! pandas adds R-style dataframes to Python, which makes data manipulation, cleaning and analysis much more straightforward than it would be in raw Python. As an ETL tool, pandas can handle every step of the process, allowing you to extract data from most storage formats and manipulate your in-memory data quickly and easily. When you’re done, pandas makes it just as easy to write your data frame to CSV, Microsoft Excel or a SQL database.
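As a quick sketch of what the transform step looks like in pandas (the sample frame below is invented for illustration; in practice the data might come from `pd.read_csv()`, `pd.read_sql()` or `pd.read_excel()`):

```python
import pandas as pd

# Hypothetical in-memory source data with a missing value and string-typed
# numbers, the kind of mess a transform step usually has to clean up.
raw = pd.DataFrame({
    "name": ["alice", "bob", None],
    "amount": ["10.5", "3.2", "7.0"],
})

# Transform: drop incomplete rows, normalize text, and fix types.
clean = (
    raw.dropna(subset=["name"])
       .assign(
           name=lambda df: df["name"].str.title(),
           amount=lambda df: df["amount"].astype(float),
       )
)

print(clean["amount"].sum())
```

From there, `clean.to_csv(...)` or `clean.to_sql(...)` handles the load step.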
Bubbles is a popular Python ETL framework that makes it easy to build ETL pipelines. Bubbles is written in Python, but is actually designed to be technology agnostic. It’s set up to work with data objects--representations of the data sets being ETL’d--in order to maximize flexibility in the user’s ETL pipeline. If your ETL pipeline has a lot of nodes with format-dependent behavior, Bubbles might be the solution for you. The GitHub repository hasn’t seen active development since 2015, though, so some features may be out of date.
Bonobo is a lightweight, code-as-configuration ETL framework for Python. It has tools for building data pipelines that can process multiple data sources in parallel, and has a SQLAlchemy extension (currently in alpha) that allows you to connect your pipeline directly to SQL databases. Bonobo is designed to be simple to get up and running, with a UNIX-like atomic structure for each of its transformation processes. This library should be accessible for anyone with a basic level of skill in Python, and also includes an ETL process graph visualizer that makes it easy to track your process.
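Bonobo itself wires transformations together into a `bonobo.Graph`, but the atomic, chainable style it encourages can be sketched with plain Python generators. The example below illustrates the pattern only; it is not Bonobo’s API, and the sample data is invented:

```python
# Each stage is a small, single-purpose transformation; items flow through
# the chain one at a time, UNIX-pipe style.

def extract():
    # Stand-in for reading from a real source (file, API, database).
    yield from ["  alice ", "BOB", "carol  "]

def transform(rows):
    # Normalize whitespace and capitalization, one row at a time.
    for row in rows:
        yield row.strip().title()

def load(rows):
    # Stand-in for writing to a real destination (DB, file, queue).
    sink = []
    for row in rows:
        sink.append(row)
    return sink

result = load(transform(extract()))
print(result)  # ['Alice', 'Bob', 'Carol']
```

Because each stage is independent, stages can be reused, tested in isolation, or (in Bonobo’s case) run in parallel.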
Luigi is an open source Python package developed by Spotify. It’s designed to make the management of long-running batch processes easier, so it can handle tasks that go far beyond the scope of ETL--but it does ETL pretty well, too. Luigi comes with a web interface that allows the user to visualize tasks and process dependencies. It’s conceptually similar to GNU Make, but isn’t only for Hadoop (although it does make Hadoop jobs easier). Luigi might be your ETL tool if you have large, long-running data jobs that just need to get done.
Odo is a Python package that makes it easy to move data between different types of containers. Once you’ve got it installed, Odo provides a single function that can migrate data between in-memory structures (lists, numpy arrays, pandas dataframes, etc), storage formats (CSV, JSON, HDF5, etc) and remote databases such as Postgres and Hadoop. Odo is configured to use these SQL-based databases’ native CSV loading capabilities, which are significantly faster than approaches using pure Python. One of the developers’ benchmarks indicates that pandas is 11 times slower than the slowest native CSV-to-SQL loader. If you find yourself loading a lot of data from CSVs into SQL databases, Odo might be the ETL tool for you.
etlalchemy is a lightweight Python package that manages the migration of SQL databases. The project was conceived when the developer realized the majority of his organization’s data was stored in an Oracle 9i database, which has been unsupported since 2010. etlalchemy was designed to make migrating between relational databases with different dialects easier and faster. A word of caution, though: this package won’t work on Windows, and has trouble loading to MSSQL, which means you’ll want to look elsewhere if your workflow includes Windows and, e.g., Azure.
mETL is a Python ETL tool that will automatically generate a YAML file for extracting data from a given file and loading it into a SQL database. It’s somewhat more hands-on than some of the other packages described here, but it can work with a wide variety of data sources and targets, including standard flat files, Google Sheets and a full suite of SQL dialects (including Microsoft SQL Server). Recent updates have provided some tweaks to work around slowdowns caused by some Python SQL drivers, so this may be the package for you if you like your ETL process to taste like Python, but faster.
Open Semantic ETL
Open Semantic ETL is an open source Python framework for managing ETL, especially from large numbers of individual documents. The framework allows the user to build pipelines that can crawl entire directories of files, parse them using various add-ons (including one that can handle OCR for particularly tricky PDFs), and load them into your relational database of choice.
Mara is a Python library that combines a lightweight ETL framework with a well-developed web UI that can be popped into any Flask app. Like many of the other frameworks described here, Mara lets the user build pipelines for data extraction and migration. Mara uses PostgreSQL as a data processing engine, and takes advantage of Python’s multiprocessing package for pipeline execution. The developers describe it as “halfway between plain scripts and Apache Airflow,” so if you’re looking for something in between those two extremes, try Mara. Note: Mara cannot currently run on Windows.
While riko isn’t technically a full ETL solution, it can handle most data extraction work and includes a lot of features that make extracting streams of unstructured data easier in Python. The tool was designed to replace the now-defunct Yahoo! Pipes web app for pure Python developers, and has both synchronous and asynchronous APIs. riko has a pretty small computational footprint, native RSS/Atom support and a pure Python library, so it has some advantages over other stream processing apps like Huginn, Flink, Spark and Storm. If you find yourself processing a lot of stream data, try riko.
Carry is a Python package that combines SQLAlchemy and pandas. It’s useful for migrating between CSVs and common relational database types including Microsoft SQL Server, PostgreSQL, SQLite, Oracle and others. Using Carry, multiple tables can be migrated in parallel, and complex data conversions can be handled during the process. One of Carry’s differentiating features is that it can automatically create and store views based on migrated SQL data for the user’s future reference.
The team at Capital One Open Source Projects has developed locopy, a Python library for ETL tasks using Redshift and Snowflake that supports many Python DB drivers and adapters for Postgres. Locopy also makes uploading and downloading to/from S3 buckets fairly easy. If you’re looking specifically for a tool that makes ETL with Redshift and Snowflake easier, check out locopy.
etlpy is a Python library designed to streamline an ETL pipeline that involves web scraping and data cleaning. Most of the documentation is in Chinese, though, so it might not be your go-to tool unless you speak Chinese or are comfortable relying on Google Translate. etlpy provides a graphical interface for designing web crawlers/scrapers and data cleaning tools. Once you’ve designed your tool, you can save it as an xml file and feed it to the etlpy engine, which appears to provide a Python dictionary as output. This might be your choice if you want to extract a lot of data, use a graphical interface to do so, and speak Chinese.
pygrametl is another Python framework for building ETL processes. pygrametl allows users to construct an entire ETL flow in Python, but works with both CPython and Jython, so it may be a good choice if you have existing Java code and/or JDBC drivers in your ETL processing pipeline.
Do you have any great Python ETL tool or library recommendations? Let us know! Send your recommendations to blog [at] panoply.io.