You're building a new data solution for your startup, and you need an ETL tool to make slinging data more manageable. You've read a ton of articles and talked to a bunch of folks about what they use. And now you’re staring at a list of 15 possible options.
How do you go from fifteen tools to one—especially when making this decision is only one small part of your job and you can't afford to spend a lot of time mucking around with it?
In this article, I'll walk you through how to decide which ETL tool will best fit your needs and will put you on a path to spreading data goodness throughout your startup.
What is an ETL tool and why do you need one?
Before we dive in, a quick recap. Let's say your startup is using Salesforce, Shopify, Google Analytics, and an online accounting package. You've also got a database around your product(s) and your support system. You want to be able to pull data from all of these systems so you can get a 360° picture of what's going on with your company.
You know that to effectively analyze that data (and not just become a spreadsheet wrangler) you need to:
- Extract the data from all of these systems
- Transform that data so that it's cleaned up, connected, and easy to analyze
- Load the data into the data warehouse
If you’re into DIY ETL, you could painstakingly:
- Write code—lots and lots of code—by hand to perform these three steps.
- Write code to test that the three steps are working the way they are supposed to.
- Manually connect the code for these three steps to a system that lets you schedule jobs.
- Write code to create a log of what exactly happened so that, for example, when there is a hiccup with your connection to the product database on Tuesday you don't have to wait two weeks to discover you’re missing critical data.
- Rewrite code every time your needs change or one of your data sources changes.
...Or you could use a tool that handles the heavy lifting involved in doing this work.
Traditionally, these tools would extract (E) the data and put them in temporary tables, transform (T) the data in the temporary tables, then load (L) the data into wherever you are going to store it; that's why they’re called ETL tools.
Now that computers are far more powerful and storage is ever-cheaper, the industry is trending towards tools that skip the time-consuming step of using temporary tables and instead extract (E) the data, load (L) it directly into a data warehouse, and then performs the transformation (T). This approach is known as ELT. But when referring to the ecosystem of both ELT and ETL tools, they are usually known as ETL tools.
Factors in choosing an ETL tool
Regardless of what these tools are called, how do you decide between them?
ETL tool makers like to give you an insanely long laundry list of all the reasons why they're awesome. But really, there are only six factors you need to worry about:
1. Ease of use
ETL tools tend to be pretty powerful, but many ETL tools look like they were designed by super geeky data engineers for super geeky data engineers. So one factor you want to consider is how easy it is to set up a new ETL process or to tweak an existing one.
As you are assessing an ETL tool's ease-of-use, be sure to have several user scenarios in mind that cover the main issues raised by your data sources. Drag-and-drop is great, but it can get pretty messy if it's not designed to accommodate, for example, joining together the pieces of a complex data model.
When you’re scrambling to get a new data solution up and running, it's easy to forget just how critical maintenance will be to your success. So be sure to ask questions such as:
- What level of skill will you or your team need to have in order to keep the ETL system running smoothly?
- If your ETL tool didn't have a built in integration with a data source you use, how easy will it be to keep up as a data source's data model changes over time?
- How easy to use and robust are its error logs? For example, if one of your data sources is having problems, how quickly will you be able to track down the problem and fix it?
No matter how easy to use a tool is, at some point you're going to need help. And the fact that the tool you chose had a marginally better feature set isn't going to matter to your users if they can't get critical work done because your data’s failing to process and you don't have the support you need.
Some questions you should ask:
- How good is the documentation for the tool?
- Is there high quality live support for the tool?
- Is there an online community around the tool? If so, how long does it take for questions to get answered? And how helpful and knowledgeable are the responses you see on their online community?
4. Built-in integrations
Using an ETL tool to design an integration with a data source is a lot less work than writing a bunch of code to do it. You know what's even less work? Having somebody else take care of integrating most of your data sources.
These days, some ETL tools include custom integrations for commonly used data sources. If an ETL tool has custom integrations for most of the data sources you use, the amount of time it'll take to get your data warehouse up and running can drop dramatically. Given the right integrations, instead of taking weeks or months to integrate your data sources, a big chunk of your data could be ready to go within a matter of hours or even minutes.
In a lot of cases, a single ETL tool won’t have native integrations to all your data sources. That’s no big deal, so long as the tool can:
- Integrate with other ETL tools that have the connector you need or
- Connect to an S3 bucket that can act as an intermediary between the original data source and your ETL tool or
- Ingest data from a manually coded pipeline (while we wouldn't recommend building all your ETL pipelines yourself, if you’ve only got one to manage, it can be a viable option).
Finally, if you aren't planning on storing your data in a standard data warehouse—e.g., if your data storage uses noSQL—make sure that whatever ETL tool you choose integrates seamlessly both with your data source and your storage.
Cost is often a top factor when choosing between tools. Some questions you'll need to consider:
- How much do you have to pay to build out your initial pipeline?
- What annual costs do you have to pay?
- Do your costs go up if the amount of data in your data sources or the number of data sources you have increase? If so, by how much?
- Is the pricing model predictable? For that matter, how well do you understand it?
You may find that it's impossible to do an apples to apples comparison of the costs of different ETL tools. That's okay. When you’re doing your first cut of potential candidates, all you need is a very rough guess of how much they will cost.
And when you are making your final choice? If you aren't reasonably sure about the difference in cost, it might be worth reaching out to folks you know to find out if anyone uses it and what their experience has been. You might also want to do a quick proof of concept to get a better sense of how the pricing model works in your particular case.
Finally, you need to spend at least some time answering the question, can I expect the tool will be around two years from now? If you're talking about, say, Microsoft's ETL tools, you can be pretty sure they aren't going away tomorrow. But if the tool is brought to you by a new startup with a tiny customer base, you might want to think twice.
How do you vet a company's stability? Do a little googling to see how long they've been around, how big your customer base is, what kind of funding they have, etc. You don't need to spend a lot of time on it—a very rough guess is all you need to prudently reduce the odds that you will get burned.
What about open source ETL tools? On the one hand, open source tools can be tricky: they don’t have a company or a large group of contributors dedicated to building and maintaining them, so there's always the danger they’ll stop being updated. On the other hand, if an open source tool's backers or community starts to crumble, you've got time to move to another platform; if the latest and greatest online ETL tool startup goes belly up, your ETL infrastructure could disappear overnight.
Why Panoply for ETL
And now for a shameless plug: if you're trying to decide between ETL tools, you should take a serious look at Panoply. Here's why:
- Speed of deployment and ease of use. Panoply has built-in integrations with a staggering number of online systems, all designed to be up and running with just a few clicks.
- Fully managed syncing. With Panoply, you don't have to worry about staying on top of the often messy details of synchronizing data. Choose between a couple of simple options and Panoply takes it from there. As your data sources change or APIs are updated, Panoply stays on top of it for you.
- Built-in storage. While most ETL tools only sync your data, Panoply includes an automatically configured cloud data warehouse to store your data. That can save both money and hassle compared to a data stack where you have to manage multiple tools.
- First-rate support. No matter how easy the system is, there will always be a point where you could use a little help. Along with robust documentation, Panoply offers best-in-class support including 24/7 chat that helps keep your data flowing even in the wee hours of the morning.
Sound too good to be true? Try it for yourself—for free, without entering your credit card information. You can have your ETL pipeline set up and be syncing data in minutes...all without a single line of code.