ETL With Singer: A Tutorial

Whatever your relationship with ETL processes is, you’re probably aware that they’re an inescapable part of modern data management and data warehousing. Data always seems to be somewhere else, and you will inevitably need to pull it into your data warehouse. Luckily, there are plenty of tools that can either do ETL for you or help you build and manage your own ETL pipelines. Which is great! But it’s not uncommon to find a data source that just isn’t covered by standard, commercially available ETL tools--and in that case, you’ll probably find yourself writing custom scripts that require constant maintenance and upkeep as data sources change and target locations move or are taken offline. This can be a serious headache for data engineers, and it takes valuable time away from your data team that could be spent on higher-impact work.

Singer: a modular, customizable ETL tool

Singer, developed by Stitch, is an open source tool that helps deal with the messy world of custom ETL scripts. Singer is set up to allow users to mix and match inputs and outputs, making their ETL processes much more modular, and therefore easier to run and maintain. It’s also designed to use JSON to move all data between sources and destinations, so you won’t have to worry about incompatible formats once you’ve set your ETL tasks up to run. So how does Singer do all this? Basically, it breaks traditional ETL scripts into Taps and Targets, described below:

  • Taps: Taps are data extraction scripts, designed to pull data from files, databases, and APIs
  • Targets: Targets are data loading scripts, designed to take the data extracted by Taps and move it along to its destination
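
Under the hood, a Tap writes a stream of newline-delimited JSON messages to stdout, and a Target reads them from stdin. A minimal, illustrative exchange (the `users` stream and its fields here are invented for the example) looks something like this:

```json
{"type": "SCHEMA", "stream": "users", "key_properties": ["id"], "schema": {"type": "object", "properties": {"id": {"type": "integer"}, "name": {"type": "string"}}}}
{"type": "RECORD", "stream": "users", "record": {"id": 1, "name": "Chris"}}
{"type": "STATE", "value": {"users": 1}}
```

Because every Tap and Target speaks this same protocol, any Tap can be piped into any Target.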

Pretty straightforward, right? We’ll run through how to actually use Singer for ETL in the next few sections.

Taps and Targets: how to use Singer for ETL

Requirements

  1. Python 3.5.2: Before getting started with Singer, you’re going to need to make sure you’re working with the right version of Python. Singer is designed to work with Python 3.5.2, so if you don’t have it installed, make sure to do that before proceeding.
  2. pip: This probably isn’t news to you, but pip is the package manager for Python--and since all existing Taps and Targets developed by the Singer team are available as pip-installable packages, you’ll want to make sure you have pip installed on your machine.
  3. Virtual environments: Singer recommends that you create a separate Python virtual environment for each Tap and Target, since this will help you avoid running into any conflicting dependencies when running your ETL jobs. The `virtualenv` package (installable through pip) makes this pretty easy.
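
Before moving on, you can sanity-check these requirements from the shell (this assumes pip is wired to your Python 3 install):

```bash
python3 --version        # Singer is built for Python 3.5.2
pip --version            # confirm pip is available
pip install virtualenv   # optional helper; the built-in venv module, used below, also works
```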

Getting started

For the purposes of this tutorial, we’ll assume you’re trying to take advantage of an existing Tap and Target pair, but these guidelines will be roughly the same if you’re creating your own custom Taps and Targets.

First, you’ll need to create a virtual environment for your Tap. Replace <tap> with the name of the Tap you’re trying to install in the code below:

```bash
python3 -m venv ~/.virtualenvs/<tap>
```

While you’re creating virtual environments, you might as well do the same for your Target now:

```bash
python3 -m venv ~/.virtualenvs/<target>
```

Then you’ll need to install the Tap and Target in their respective virtual environments using pip:

```bash
source ~/.virtualenvs/<tap>/bin/activate
pip install <tap>

source ~/.virtualenvs/<target>/bin/activate
pip install <target>
```
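
For a concrete version of these steps using the Tap and Target we’ll run in the example section below (the environment names singerexchange and singercsv are simply the ones we’ll reuse there):

```bash
# Create one virtual environment per script, then install each package into its own
python3 -m venv ~/.virtualenvs/singerexchange
source ~/.virtualenvs/singerexchange/bin/activate
pip install tap-exchangeratesapi
deactivate

python3 -m venv ~/.virtualenvs/singercsv
source ~/.virtualenvs/singercsv/bin/activate
pip install target-csv
deactivate
```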

Once you have the Tap installed, you’ll need to edit the Tap’s config file (tap_config.json in the examples below) in your working directory. You’ll need to do this to set a few different parameters:

  • Credentials: You’ll need to enter the credentials you’ll use for pulling from the desired data source. Typically this will be an API key, but it could be something else, like login information.
  • Start date: Use this setting on your first sync to tell Singer how far back in the records it should go. Start dates need to be in a format that conforms to the RFC 3339 specification.
  • User agent: This is where you can enter contact information in case the API provider needs to get in touch with you.

So, with that said, a typical config file will look something like this example from the Singer documentation:

```json
{
  "api_key": "ABC123ASDF5432",
  "start_date": "2017-01-01T00:00:00Z",
  "user_agent": "Stitch (+support@stitchdata.com)"
}
```

Some Taps support discovery mode, which lets users get a description of the data streams a given Tap supports. If your Tap is set up for discovery mode, you can run

```bash
~/.virtualenvs/<tap>/bin/<tap> --config tap_config.json --discover > catalog.json
```

to get the catalog: a JSON file containing an array of the data streams the Tap supports, along with their descriptions. Depending on the Tap, you might also need to edit the catalog to add metadata indicating which streams you’re interested in and whether you want the Tap to save its progress as it goes, ensuring that only new records are collected.
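
The exact shape of the catalog varies from Tap to Tap, but a pared-down entry with a stream marked as selected might look something like this (the `users` stream is again just illustrative):

```json
{
  "streams": [
    {
      "tap_stream_id": "users",
      "stream": "users",
      "schema": {
        "type": "object",
        "properties": {
          "id": {"type": "integer"},
          "name": {"type": "string"}
        }
      },
      "metadata": [
        {"breadcrumb": [], "metadata": {"selected": true}}
      ]
    }
  ]
}
```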

Running your Tap

At this point, you’re ready to run your Tap! Congratulations. You can run the Tap in sync mode by running the following:

```bash
~/.virtualenvs/<tap>/bin/<tap> --config tap_config.json --catalog catalog.json
```

But you’ll most likely want to run your Tap with the Target you installed, so take a look at the next section to see how all the pieces come together.

Running your Tap with a Target

Since you’ve already set up a virtual environment for your Target and installed it there, you’ll just need to edit the Target’s config file to add in the credentials for its destination and set up the parameters you want to use.

Once that’s done, you can run your Tap and pipe the output directly to your Target. You’ll also want to keep track of the final state of the Target, as this will allow you to set up future runs that will build on this one (instead of just repeating the full ETL process over again). To do that, run the following:

```bash
~/.virtualenvs/<tap>/bin/<tap> --config tap_config.json --catalog catalog.json | ~/.virtualenvs/<target>/bin/<target> --config target_config.json >> state.json
tail -1 state.json > state.json.tmp && mv state.json.tmp state.json
```

This will run the Tap and pipe its output to the Target, then store the Target’s output in a state file, making your progress on the job available for future runs.
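
On later runs, most Taps will accept the saved state back through a --state flag, so they only pull records that are new since the last sync. Here’s a minimal wrapper script along those lines--a sketch, with <tap> and <target> as placeholders to fill in:

```bash
#!/bin/bash
# Sketch: run a Singer sync, resuming from saved state if it exists
TAP=~/.virtualenvs/<tap>/bin/<tap>
TARGET=~/.virtualenvs/<target>/bin/<target>

STATE_ARG=""
if [ -f state.json ]; then
  STATE_ARG="--state state.json"
fi

# Pipe the Tap into the Target, appending emitted state messages to state.json
$TAP --config tap_config.json --catalog catalog.json $STATE_ARG \
  | $TARGET --config target_config.json >> state.json

# Keep only the most recent state message for the next run
tail -1 state.json > state.json.tmp && mv state.json.tmp state.json
```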

And that’s basically it! Singer makes building modular ETL processes pretty simple, and its built-in state tracking makes it easy to set up cron jobs to repeat and schedule activity on your data pipelines. And don’t worry about the relatively thin catalog of pre-built Taps and Targets that are offered through the Singer site right now: the framework makes it simple to roll your own Taps and Targets, and the documentation provides a very good guide to doing so.
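
For instance, if you saved the wrapper script above as run-sync.sh, a crontab entry like this (the path and schedule are just examples) would rerun the pipeline every night at 2 a.m.:

```
0 2 * * * cd /path/to/your/pipeline && ./run-sync.sh >> sync.log 2>&1
```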

Singer in action: example use

So, what does all this look like when you actually run a Tap with a Target? Good question! We can demonstrate the Singer flow using a free and open source API that provides structured data on exchange rates published by the European Central Bank. Singer has a pre-built Tap, `tap-exchangeratesapi`, that you can use to collect the data. We’ll output it to a CSV using `target-csv`, which is just what it sounds like: a Singer Target that outputs to a CSV file. So we’ll run:

```bash
~/.virtualenvs/singerexchange/bin/tap-exchangeratesapi | ~/.virtualenvs/singercsv/bin/target-csv > state.json
```

As the job runs, the Tap and Target will log their progress to the console.
The `target-csv` script will create a timestamped CSV file in your working directory, which you can view in the console with `cat exchange_rate-{timestamp}.csv`, where `{timestamp}` will, obviously, be whatever the actual timestamp is. You’ll also be able to open the file in your favorite spreadsheet editor, of course.
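
The exact columns depend on the date range and the currencies the API returns, but the contents will look roughly like this (the rates below are invented for illustration):

```
date,GBP,JPY,USD
2017-01-02T00:00:00Z,0.8540,117.52,1.0465
2017-01-03T00:00:00Z,0.8522,117.61,1.0385
```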

And that’s really all there is to it! One last reminder: install Taps and Targets in separate virtual environments, because they don’t always have compatible dependencies--the two used here are no exception.

ELT with Panoply: the wish-there-was-an-easier-way way

Maybe you read through the above and thought to yourself, “yeesh, that’s actually more complicated than I thought,” or maybe you’re looking for an ETL solution that makes things even easier. If that’s the case, you might be interested in Panoply’s automated ELT-based (as opposed to ETL-based) data management approach. Panoply provides a large--and growing--list of pre-built data connectors (equivalent to Singer Taps) and a central repository for storing and analyzing all your data in one place. Let’s take a look at what differentiates Panoply from Singer:

  • Data source integrations: Panoply offers a pre-built--and continuously maintained--set of data connectors that allow for push-button data ingestion. That means no fiddling with config files to set your pipeline up, and no need to go back and modify your whole process when something changes at the data source itself
  • Single destination: Panoply provides users with a central data warehouse, so you won’t have to set up and install multiple Targets to load your data
  • Automated data ingestion: Panoply automates data loading and optimizes your data structures for fast retrieval, so you won’t have to worry about predefining table schemas

If you’re interested in trying out Panoply for your data loading and integration needs, check out our free trial offer and see how it works for you. 
