If you're building a data warehouse, you need ETL to move data into it. To do that, you first extract data from an array of different sources. Then you apply transformations to get everything into a format you can use, and finally, you load it into your data warehouse.
There are many ways to build an ETL pipeline, and one of them is to write it yourself in Python. The Python community has created a range of tools to make your ETL life easier and give you control over the process. If you have the time, money, and patience, using Python will ensure your ETL pipeline is tailored exactly to your business needs.
With that in mind, here are the top Python ETL Tools for 2021. Some of these let you manage each step of the ETL process, while others are excellent at a specific step. They are organized into groups to make it easier for you to compare them. Let’s go!
We will first look at Python's meta-ETL tools. Workflow Management Systems (WMS) let you schedule, organize, and monitor any repetitive task in your business. Thus, you can use a WMS to set up and run your ETL workflows.
With Airflow, you build workflows as Directed Acyclic Graphs (DAGs). Then, the scheduler spreads your tasks across an array of workers for maximum efficiency. It comes with a handy web-based UI for managing and editing your DAGs, and there’s also a nice set of tools that makes it easy to perform “DAG surgery” from the command line.
Here is a simple DAG—adapted from the beginner tutorial—that runs a couple of simple bash commands each day:
from datetime import timedelta

# The object we use to initialize a DAG
from airflow import DAG

# We need this to operate!
from airflow.operators.bash_operator import BashOperator

# Dict containing default args for the DAG, excluded for brevity
default_args = dict(...)

# Initialize DAG
dag = DAG(
    'tutorial',
    default_args=default_args,
    description='A simple DAG',
    schedule_interval=timedelta(days=1))

# Initialize bash operator tasks we want to perform
t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)

t2 = BashOperator(
    task_id='sleep',
    depends_on_past=False,
    bash_command='sleep 5',
    retries=3,
    dag=dag)

# t2 depends on t1 running successfully to run
t1 >> t2
Airflow is the Ferrari of Python ETL tools. It can truly do anything. But this extensibility comes at a cost. It can be a bit complex for first-time users (despite the excellent documentation and tutorial) and might be more than you need right now. If you want to get your ETL process up and running immediately, it might be better to choose something simpler. But if you have the time and money, your imagination is the only limit when you work with Airflow.
Luigi comes with a web interface that allows the user to visualize tasks and process dependencies. It’s conceptually similar to GNU Make but isn’t only for Hadoop (although it does make Hadoop jobs easier). Furthermore, it’s quite straightforward to create workflows as they are all just Python classes.
Here is an outline of what a typical task looks like (adapted from the docs). Your ETL pipeline is made up of many such tasks chained together.
import luigi

class MyTask(luigi.Task):
    # Parameters for this task
    param = luigi.Parameter(default=42)

    # The other tasks it depends on
    def requires(self):
        return SomeOtherTask(self.param)

    # Business logic of the task
    def run(self):
        with self.output().open('w') as f:
            f.write('hello world!')

    # Where it writes output
    def output(self):
        return luigi.LocalTarget(f'/tmp/foo/bar-{self.param}.txt')

if __name__ == '__main__':
    luigi.run()
While the package is regularly updated, it is not under as much active development as Airflow, and the documentation is out of date as it is littered with Python 2 code. If you can get past that, Luigi might be your ETL tool if you have large, long-running data jobs that just need to get done.
A core part of ETL is data processing. While plenty of Python tools can handle it, some are specifically designed for that task. Let's take a look at your options:
Here’s an example where we extract data from a CSV file, apply some data transforms, and load it into a PostgreSQL database:
# Standard imports
import pandas as pd
import numpy as np

# SQLAlchemy provides the engine that pandas writes through
# (pandas' to_sql expects a SQLAlchemy engine rather than a raw DBAPI connection)
from sqlalchemy import create_engine

# Extract from csv file into a DataFrame
df = pd.read_csv('my_csv.csv')

# Sum the columns of the DataFrame (the result is a Series)
df_sum = df.apply(np.sum, axis=0)

# Load the result into a PostgreSQL database via the pg8000 driver
# (replace localhost/my_db with your own host and database name)
engine = create_engine('postgresql+pg8000://postgres:secret_word@localhost/my_db')
df_sum.to_sql(name='my_table', con=engine)
However, there is a catch. Pandas is designed primarily as a data analysis tool. Thus, it does everything in memory and can be quite slow if you are working with big data. This would be a good choice for building a proof-of-concept ETL pipeline, but if you want to put a big ETL pipeline into production, this is probably not the tool for you.
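If the dataset is too big to hold in memory but you still want to prototype in pandas, one common workaround is chunked processing. Here is a minimal sketch (the file name, chunk size, connection string, and per-chunk transform are placeholders, not part of the example above) that streams a large CSV into PostgreSQL one chunk at a time:

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string - swap in your own credentials and database
engine = create_engine('postgresql+pg8000://postgres:secret_word@localhost/my_db')

# Stream the file in 100,000-row chunks instead of loading it all at once
for chunk in pd.read_csv('my_big_csv.csv', chunksize=100_000):
    chunk = chunk.dropna()  # example transform applied per chunk
    # Append each processed chunk to the target table
    chunk.to_sql('my_table', con=engine, if_exists='append', index=False)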
Here is some example code showing you how to initialize a Spark session, read in a CSV, apply some transforms, and write out to a different CSV:
from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder.appName('foo').getOrCreate()

# Extract data from csv file
df = spark.read.csv('input_csv.csv')

# Transform using a custom function defined elsewhere
df = apply_data_transforms(df)

# Write to csv file
df.write.csv('output_csv.csv')
Obviously, Spark can do much more than just read and write to CSV files, but this gives you a flavor of its intuitive API. Consider Spark if you need speed and size in your data operations.
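To give a slightly fuller flavor of those transforms, here is a hedged sketch using the DataFrame API; the 'amount' and 'category' columns are invented for illustration and are not part of the example above:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('foo').getOrCreate()

# Read the CSV with a header row and let Spark infer column types
df = spark.read.csv('input_csv.csv', header=True, inferSchema=True)

# Example transforms: keep positive amounts, then total them per category
df = (df.filter(F.col('amount') > 0)
        .groupBy('category')
        .agg(F.sum('amount').alias('total_amount')))

df.write.csv('output_csv.csv', header=True)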
Petl is focused solely on ETL. It is more memory-efficient than pandas because it uses lazy evaluation: rows are only loaded and processed when you actually ask for output, rather than the whole dataset being held in memory. On the other hand, it doesn’t include extra features such as built-in data analysis or visualization.
Here’s an example of how to read in a couple of CSV files, concatenate them together and write to a new CSV file:
import petl

# Extract two csv files into two tables
table1 = petl.fromcsv('input_csv1.csv')
table2 = petl.fromcsv('input_csv2.csv')

# Concatenate the tables together to make a new table
table3 = petl.cat(table1, table2)

# Write new table to a csv
petl.tocsv(table3, 'output_csv.csv')
Petl is still under active development, and there is an extension library, petlx, that adds support for an array of additional data types. One caveat is that the docs are slightly out of date and contain some typos. But it has a solid user base and good functionality. If you want to focus purely on ETL, petl could be the Python tool for you.
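To see that lazy evaluation in action, here is a small sketch chaining a couple of petl's built-in transforms; the 'price' and 'quantity' fields are made up for the example, and nothing is read or computed until the final tocsv() call:

import petl

# Build up a lazy pipeline of transforms - no data is loaded at this point
table = petl.fromcsv('input_csv.csv')
table = petl.convert(table, 'price', float)                  # cast a field
table = petl.select(table, lambda row: row.quantity != '0')  # filter rows

# Only now, when writing, does petl stream rows through the pipeline
petl.tocsv(table, 'output_csv.csv')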
We’ve discussed some tools that you could combine to make a custom Python ETL solution (e.g., Airflow and Spark). But now, let’s look at the Python tools which can handle every step of the extract-transform-load process.
Bonobo is an ETL framework for building data pipelines that can process multiple data sources in parallel, and it has an SQLAlchemy extension (currently in alpha) that lets you connect your pipeline directly to SQL databases. The framework should be accessible for anyone with a basic skill level in Python and includes an ETL process graph visualizer that makes it easy to track your process. Plus, you can be up and running within 10 minutes, thanks to the excellently written tutorial.
Here is a basic Bonobo ETL pipeline adapted from the tutorial. Note how everything is just a Python function or generator.
import bonobo

def extract():
    # The data our ETL pipeline will work with
    yield 'hello'
    yield 'world!'

def transform(*args):
    # Capitalizes the inputs from the extract() function
    yield tuple(map(str.title, args))

def load(*args):
    # Prints the (now capitalized) inputs to the screen
    print(*args)
You can chain these functions together as a graph (excluded here for brevity) and run it in the command line as a simple Python file, e.g., $ python my_etl_job.py.
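As a rough sketch of what that chaining looks like (based on the pattern in the Bonobo tutorial, reusing the three functions defined above):

import bonobo

# Chain the functions into a graph: extract -> transform -> load
graph = bonobo.Graph(extract, transform, load)

if __name__ == '__main__':
    # Run the graph with Bonobo's default execution strategy
    bonobo.run(graph)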
One issue is that Bonobo is not yet at version 1.0, and its GitHub repo has not been updated since July 2019. Furthermore, the docs say Bonobo is under heavy development and may not be completely stable. As such, this could be a good framework for quickly building small-scale pipelines, but it might not be the best long-term solution, at least until version 1.0 is released.
Mara reduces the complexity of your ETL pipeline by making some assumptions. Some of them are: 1) you must have PostgreSQL as your data processing engine, 2) you use declarative Python code to define your data integration pipelines, 3) you use the command line as the main tool for interacting with your databases, and 4) you use their beautifully designed web UI (which you can pop into any Flask app) as the main tool to inspect, run, and debug your pipelines.
Here is a demo mara-pipeline that pings localhost three times:
from mara_pipelines.commands.bash import RunBash
from mara_pipelines.pipelines import Pipeline, Task

# Initialize pipeline object
pipeline = Pipeline(
    id='demo',
    description='A pipeline that pings localhost 3 times')

# Add task to the pipeline - ping localhost three times
pipeline.add(Task(
    id='ping_localhost',
    description='Pings localhost',
    commands=[RunBash('ping -c 3 localhost')]))
Note that the docs are still a work in progress and that Mara does not run natively on Windows. However, it is still under active development, so if you want something between the two extremes mentioned above, give Mara a try.
Pygrametl describes itself as “a Python framework that offers commonly used functionality to develop Extract-Transform-Load (ETL) processes.” It was first created back in 2009 and has seen constant updates since then.
Pygrametl provides object-oriented abstractions for commonly used operations such as interfacing between different data sources, running parallel data processing, or creating snowflake schemas. As it’s a framework, you can seamlessly integrate it with other Python code. Indeed, the docs say it is used in production systems in the transport, finance, and healthcare sectors.
The beginner tutorial is incredibly comprehensive and takes you through building up your own mini-data warehouse with tables containing standard Dimensions, SlowlyChangingDimensions, and SnowflakedDimensions.
In the example below, we create a FactTable for a bookstore detailing which books were sold and at what time:
from pygrametl.tables import Dimension, FactTable

# Info about the book that was sold
book_dimension = Dimension(
    name='book',
    key='bookid',
    attributes=['book', 'genre'])

# Info about when the book was sold
time_dimension = Dimension(
    name='time',
    key='timeid',
    attributes=['day', 'month', 'year'])

# Join book and time info into a fact table measuring total sales
fact_table = FactTable(
    name='facttable',
    keyrefs=['bookid', 'timeid'],
    measures=['sale'])
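To actually populate these tables, pygrametl expects you to wrap a PEP 249 database connection in its ConnectionWrapper (created before the table objects in a full script, so it becomes the default target) and then feed in rows one at a time. The row values and connection details below are made up for illustration:

import pg8000
import pygrametl

# Wrap a standard PEP 249 connection so the table objects know where to write
conn = pg8000.connect(user='postgres', password='secret_word', database='dw')
connection = pygrametl.ConnectionWrapper(connection=conn)

# One sale: ensure() looks up (or inserts) each dimension member and returns its key
row = {'book': 'Python ETL', 'genre': 'Technology',
       'day': 1, 'month': 1, 'year': 2021, 'sale': 10}
row['bookid'] = book_dimension.ensure(row)
row['timeid'] = time_dimension.ensure(row)

# The fact row, which references both dimension keys, goes into the fact table
fact_table.insert(row)

connection.commit()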
One potential downside is that this library has been around for over a decade but has yet to gain widespread popularity. This may indicate it’s not that user-friendly in practice. However, pygrametl works in both CPython and Jython, so it may be a good choice if you have existing Java code and/or JDBC drivers in your ETL processing pipeline.
The following Python ETL tools are not fully-fledged ETL solutions but are specialized to do heavy lifting at a specific part of the process. If you’re mixing a lot of tools, consider adding one of the following.
This may get the award for the best little ETL library ever. Odo has one function—odo—and one goal: to effortlessly migrate data between different containers. It works on small, in-memory containers and large, out-of-core containers too.
The function takes two arguments, odo(source, target), and converts the source into the target. So, to convert the tuple (1, 2, 3) to a list, run:
>>> from odo import odo
>>> odo((1, 2, 3), list)
[1, 2, 3]
Or to migrate between HDF5 and PostgreSQL do:
>>> odo('myfile.hdf5::/data', 'postgresql://user:pass@host::my-table')
Table('my-table', MetaData(bind=Engine(postgresql://user:****@host)), ...)
Odo works under the hood by connecting different data types via a path/network of conversions (hodos means ‘path’ in Greek), so if one path fails, there may be another way to do the conversion.
Moreover, odo uses SQL-based databases’ native CSV loading capabilities that are significantly faster than using pure Python. The docs demonstrate that Odo is 11x faster than reading your CSV file into pandas and then sending it to a database. If you find yourself loading a lot of data from CSVs into SQL databases, odo might be the ETL tool for you.
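For instance, a CSV-to-PostgreSQL load is itself a one-liner; here is a sketch in the same doctest style, with placeholder file, credential, and table names:

>>> from odo import odo
# Load a CSV straight into a PostgreSQL table using the database's native loader
>>> odo('my_big_file.csv', 'postgresql://user:pass@host::my_table')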
Note that the Github has not been updated for several years, so odo may not be completely stable. But many filesystems are backward compatible, so this may not be an issue.
This lightweight Python ETL tool lets you migrate between any two types of RDBMS in just 4 lines of code. ETLAlchemy can take you from MySQL to SQLite, from SQL Server to Postgres, or any other combination of flavors.
Here is the code in action:
from etlalchemy import ETLAlchemySource, ETLAlchemyTarget

source = ETLAlchemySource("mssql+pyodbc://username:password@DSN_NAME")
target = ETLAlchemyTarget("mysql://username:password@hostname/db_name",
                          drop_database=True)

target.addSource(source)
target.migrate()
Pretty simple, huh?
The GitHub repo was last updated in January 2019, though it says the project is still under active development. If you want to migrate between different flavors of SQL quickly, this could be the ETL tool for you.
Riko's main focus is extracting streams of unstructured data. Thanks to a host of great features such as synchronous and asynchronous APIs, a small computational footprint, and native RSS/Atom support, it is great for processing data streams. Moreover, the documentation is excellent, and the pure Python library is wonderfully designed.
Here’s an example of how you can fetch an RSS feed and inspect its contents (in this case, a stream of blog posts from https://news.ycombinator.com):
>>> from riko.modules import fetch

# stream is an iterator of dictionaries containing info from
# the RSS feed - a number of blog articles
>>> stream = fetch.pipe(conf={'url': 'https://news.ycombinator.com/rss'})

# Get the first item (blog article) in the iterator of dictionaries
>>> item = next(stream)

# Inspect information about the blog article
>>> item['title'], item['author'], item['id']
('Gravity doesn’t care about quantum spin', 'Chris Lee', 'http://arstechnica.com/?p=924009')
(You will get different results from the above, as the feed is updated several times per day.)
Riko is still under development, so if you are looking for a stream processing engine, this could be your answer.
With locopy, Capital One has created a powerful Python ETL tool that lets you easily (un)load and copy data to Redshift or Snowflake. The API is simple, straightforward, and gets the job done.
For example, here is how you can download data from Redshift to a CSV:
# Library to interface with the PostgreSQL database engine
import pg8000
import locopy

# Your profile info
my_profile = "some_profile_with_valid_tokens"

# Set up Redshift instance configuration
redshift_config = locopy.Redshift(dbapi=pg8000,
                                  config_yaml="config.yml",
                                  profile=my_profile)

with redshift_config as redshift:
    redshift.unload_and_copy(
        query="SELECT * FROM schema.table",
        s3_bucket="my_s3_bucket",
        export_path="my_output_destination.csv")
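Loading in the other direction is similarly compact. The sketch below is an assumption based on locopy's load_and_copy method (the file, bucket, and table names are placeholders, and the exact keyword arguments may vary between locopy versions); it uploads a local delimited file to S3 and then copies it into Redshift:

import pg8000
import locopy

my_profile = "some_profile_with_valid_tokens"
redshift_config = locopy.Redshift(dbapi=pg8000, config_yaml="config.yml",
                                  profile=my_profile)

with redshift_config as redshift:
    # Upload the local file to S3, then COPY it into the target table
    redshift.load_and_copy(local_file="my_input_data.csv",
                           s3_bucket="my_s3_bucket",
                           table_name="schema.table",
                           delim=",")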
It is still actively maintained, so if you’re looking specifically for a tool that makes ETL with Redshift and Snowflake easier, check out locopy.
Here is a list of tools we’ve recommended in the past but no longer look like they’re under active development. You may be able to get away with using them in the short term, but we would not advise you to build anything of size due to their inherent instability from lack of development.
Bubbles is a popular Python ETL framework that makes it easy to build ETL pipelines. Bubbles is written in Python but is designed to be technology agnostic. It’s set up to work with data objects—representations of the data sets being ETL’d—to maximize flexibility in the user’s ETL pipeline. If your ETL pipeline has many nodes with format-dependent behavior, Bubbles might be the solution for you. The Github repository hasn’t seen active development since 2015, so some features may be outdated.
mETL is a Python ETL tool that automatically generates a YAML file to extract data from a given file and load it into a SQL database. It’s somewhat more hands-on than some of the other packages described here, but can work with a wide variety of data sources and targets, including standard flat files, Google Sheets, and a full suite of SQL dialects (including Microsoft SQL Server). Recent updates have provided some tweaks to work around slowdowns caused by some Python SQL drivers, so this may be the package for you if you like your ETL process to taste like Python but faster.
Carry is a Python package that combines SQLAlchemy and Pandas. It’s useful for migrating between CSVs and common relational database types, including Microsoft SQL Server, PostgreSQL, SQLite, Oracle, and others. Using Carry, multiple tables can be migrated in parallel, and complex data conversions can be handled during the process. Carry can automatically create and store views based on migrated SQL data for the user’s future reference.
Etlpy is a Python library designed to streamline an ETL pipeline that involves web scraping and data cleaning. Most of the documentation is in Chinese, though, so it might not be your go-to tool unless you speak Chinese or are comfortable relying on Google Translate. Etlpy provides a graphical interface for designing web crawlers/scrapers and data cleaning tools. Once you’ve designed your tool, you can save it as an XML file and feed it to the etlpy engine, which appears to provide a Python dictionary as output. This might be your choice if you want to extract a lot of data, use a graphical interface to do so, and speak Chinese.
Using Python ETL tools is one way to set up your ETL infrastructure. However, as is the case with all coding projects, it can be expensive, time-consuming, and full of unexpected problems.
If you just want to sync, store, and easily access your data, Panoply is for you. Instead of spending weeks coding your ETL pipeline in Python, do it in a few minutes and mouse clicks with Panoply. It’s the only data pipeline tool that effortlessly puts all your business data in one place, gives all employees the unlimited access they need, and requires zero maintenance. Plus, Panoply has storage built-in, so you don’t have to juggle multiple vendors to get your data flowing.
If coding your ETL pipeline in Python seems too complicated and risky, try Panoply free for 14 days.