“To buy or not to buy, that is the question.”
Ok, comparing the construction of a company’s ETL pipeline with Hamlet’s binary choice on whether to stick around or just get some sleep is a touch dramatic. But let’s not minimize how important choosing the best Extract, Transform, Load (ETL) process is for any organization, large or small.
This article explores why buying tools might work best for some, while hand-coding the ETL layer may work better for others. But before we look at different build-versus-buy scenarios, here’s a quick overview on what purpose the ETL process is meant to serve.
ETL process and purpose
The Extract, Transform, Load process describes the infrastructure built to pull the raw data required for analysis from a source (extract), convert and prepare it in some useful form for a business need (transform), and deliver it to a data warehouse (load). This automated process is an integral part of any company’s data pipeline. In fact, it is perhaps the integral part.
A traditional way of implementing ETL is to process the data in batches (say, once every 24 hours). With batch processing, data accumulates at the source; at a set point, collection pauses while the system “forklifts” the data over to your data warehouse. The loaded and transformed data is then available for BI tools to provide business insights at the visualization layer. Frameworks like Hadoop MapReduce, along with batch-oriented ETL services like Stitch, were designed around this model of processing large datasets.
A more modern approach, suited to cases where real-time analytics are crucial, latency matters, and extracting and transforming large batches of data in bulk runs counter to business needs, is to have the ETL pipeline perform stream processing of data. Stream processing tools, such as Apache Spark's streaming APIs or Apache Kafka, process data largely in memory, greatly decreasing the time it takes for data to become actionable.
While batch processing lets the data build up, stream processing makes the data more steadily accessible by spreading the processing over time. Batch processing moves data in bulk, validating and cleaning it, then staging it in a database before loading it into target tables within the data warehouse (an important checkpoint in the pipeline, one that makes it easier to roll back if something goes wrong). Stream processing moves data at speed, and becomes preferable, perhaps mandatory, when the value of an insight decays the longer the data sits untouched. There are some very powerful tools out there for either scenario, and it's always a good idea to familiarize yourself with ETL tool options.
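To make the batch pattern concrete, here is a minimal sketch of a hand-coded batch job in Python. The CSV source, column names, and local SQLite "warehouse" are hypothetical stand-ins for whatever sources and targets your pipeline actually uses; a streaming version would apply the same transform to records as they arrive rather than to a daily file.

```python
# Minimal batch-ETL sketch: extract a CSV, apply a simple transform,
# and load the result into a local SQLite table. The file name, column
# names, and table are placeholders for illustration only.
import csv
import sqlite3

def extract(path):
    """Read raw rows from the source file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Keep valid rows and normalize the fields the business needs."""
    clean = []
    for row in rows:
        if not row.get("order_id"):  # drop rows missing the key field
            continue
        clean.append((row["order_id"],
                      row["customer"].strip().lower(),
                      float(row["amount"])))
    return clean

def load(rows, db_path="warehouse.db"):
    """Write transformed rows into the target table in one transaction."""
    con = sqlite3.connect(db_path)
    with con:
        con.execute("CREATE TABLE IF NOT EXISTS orders "
                    "(order_id TEXT PRIMARY KEY, customer TEXT, amount REAL)")
        con.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", rows)
    con.close()

if __name__ == "__main__":
    # e.g., run once every 24 hours by a scheduler
    load(transform(extract("daily_orders.csv")))
```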
The DIY Decision Matrix
When deciding whether to build or buy, you should start at the beginning. Counterintuitive as that may be, this includes scenarios where a company realizes it's way past the starting line and galloping along an inefficient trajectory with respect to data.
1. Form a schematic that defines scope
By putting a plan together that spells out exactly what your business needs are, you are also defining the project parameters. The top of that list should be pretty obvious: make your data make sense. There is a human need for this data: the end users gleaning intel from dashboards rely on it, and they are counting on that data being delivered clean, useful, and ready for them to derive insights.
To build this plan, start with a few straightforward questions:
- What are the top business needs to be addressed? What data jobs are essential?
- What business needs do you want to address, but haven’t been able to? What data jobs would you like to incorporate, but didn’t think possible or viable?
- If building your own ETL, how much time will you need to correctly develop and test it? And—uber important—will these DIY pipelines scale as the company grows?
- How much money will be saved hand-coding versus buying a tool? Is DIY worth it? For how long?
- If using hand-coding, how responsive will the in-house team be to failure? Can you commit to long-term maintenance, not just a one-time investment in setup?
- Conversely, if using outside tools for your ETL, how agile are those tools with respect to your current, and changing, needs? This is your future growth, after all!
Each company’s requirements are different, yet no company will benefit from a pipeline that’s unnecessarily sophisticated or poorly structured. Either will all but guarantee underperformance.
2. Describe project guidelines for the desired ETL system
Those responsible for development should document everything in a simple, concise form that those who will rely on it most (DBAs, the IT team, and so on) can easily reference.
Whether the decision was made to hand-code the ETL or purchase a tool, successful implementation means everything is configured correctly. It also means that you've completed testing with representative data sources, including the intentional introduction of corrupted data at certain iterations, to ensure the system behaves as desired. Tests should be automated whenever possible, and a support system put in place, with training available as needed.
As each step is executed in constructing and implementing your company’s preferred ETL method, make sure that initial questions are answered, and that acceptance testing is treated as though it had real-world significance. You should also adopt proactive measures to avoid potential data quality issues, such as regular auditing and profiling of target tables or the system as a whole.
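As one hedged illustration of what automated testing with deliberately corrupted data can look like, here is a small pytest sketch in Python. The transform_row function and its order_id/amount fields are hypothetical; substitute the real transformation step from your own pipeline.

```python
# Sketch of automated ETL tests (pytest style). transform_row stands in for
# whatever transformation step your pipeline actually performs; the point is
# to exercise it with both clean and deliberately corrupted input.
import pytest

def transform_row(row):
    """Example transform: require an order_id and a numeric amount."""
    if not row.get("order_id"):
        raise ValueError("missing order_id")
    return {"order_id": row["order_id"], "amount": float(row["amount"])}

def test_clean_row_passes():
    assert transform_row({"order_id": "A1", "amount": "19.99"})["amount"] == 19.99

def test_missing_key_is_rejected():
    # Intentionally corrupted record: the key field is absent.
    with pytest.raises(ValueError):
        transform_row({"amount": "19.99"})

def test_garbage_amount_is_rejected():
    # Intentionally corrupted record: non-numeric amount.
    with pytest.raises(ValueError):
        transform_row({"order_id": "A2", "amount": "not-a-number"})
```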
Advantages of building
In this scenario, you know what metrics matter, how you want everything configured, and demand control over your data pipeline. If you're leaning toward a DIY solution, here are a few finer points to consider when solidifying that choice:
1. DIY systems offer a level of flexibility that a tool can’t provide
With hand-coding, you can build anything you want. In contrast, a tool-based approach may be limited by available features. A hand-coded system can be tailored to more directly manage any metadata as well, and this increased flexibility might be desirable depending on your business needs.
Before taking the leap to build, it's important to ask yourself a few key questions:
- Is there an in-house data team dedicated to writing scripts and managing workflows?
- If there isn’t a team (or lone data scientist) for whom this is their sole focus, is the task of hand-coding ETL relegated to someone’s second job?
- Can the ETL team handle maintenance tasks like testing, documentation, and optimization as well as setup?
2. Total control, total visibility, total responsibility
Control is awesome! Does running data jobs wherever and whenever you determine beat the tools you could buy (or have used in the past)? Would a custom data integration solution make any current ETL tool unnecessary overhead to be trimmed away? If so, then hand-coding looks extremely viable.
However—and as the saying goes—with great power comes great responsibility. How are you positioned to handle things when data fails to load, or only partially loads, with that failure potentially not even in the ETL layer but somewhere within source data? Are the necessary checks built in to swiftly remedy the situation? Will new data sources/connection changes be integrated seamlessly?
It's important to realize that committing to DIY data pipelines means figuring out how to manage all those messes with internal resources...even when it's inconvenient.
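One common way to build in those checks is to stage each incoming batch, validate it, and publish to the target table only when validation passes, so a partial or corrupted load rolls back instead of leaving the warehouse half-updated. The sketch below assumes a SQLite warehouse and illustrative table names; the same pattern applies to any database that supports transactions.

```python
# Guarded load sketch: stage the batch, run basic checks, and only then publish
# to the target table. The data-modifying statements run in one transaction,
# so a failed check or a partial load rolls back cleanly.
# Table and column names are illustrative placeholders.
import sqlite3

def guarded_load(rows, db_path="warehouse.db", expected_count=None):
    con = sqlite3.connect(db_path)
    try:
        with con:  # commits on success, rolls back if anything below raises
            con.execute("CREATE TABLE IF NOT EXISTS staging_orders "
                        "(order_id TEXT, customer TEXT, amount REAL)")
            con.execute("CREATE TABLE IF NOT EXISTS orders "
                        "(order_id TEXT PRIMARY KEY, customer TEXT, amount REAL)")
            con.execute("DELETE FROM staging_orders")
            con.executemany("INSERT INTO staging_orders VALUES (?, ?, ?)", rows)

            # Checks before touching the target: expected row count, no null keys.
            staged = con.execute("SELECT COUNT(*) FROM staging_orders").fetchone()[0]
            bad_keys = con.execute("SELECT COUNT(*) FROM staging_orders "
                                   "WHERE order_id IS NULL").fetchone()[0]
            if bad_keys or (expected_count is not None and staged != expected_count):
                raise ValueError(
                    f"staging checks failed: {staged} rows staged, {bad_keys} null keys")

            # Only reached when the checks pass.
            con.execute("INSERT OR REPLACE INTO orders SELECT * FROM staging_orders")
    finally:
        con.close()
```

If a check fails, nothing reaches the target table and the error surfaces for the team to investigate, which is exactly the responsibility a DIY pipeline takes on.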
3. Cost effectiveness
If there are also budgetary concerns (aren’t there always?), and your current data pipeline needs do not require everything an ETL tool has to offer, a DIY solution may indeed be advantageous.
However, along with budgeting for maintenance, it is imperative that ETL testing is taken seriously. You'll need to place checks throughout your ETL, with regular monitoring and troubleshooting to ensure optimal performance and address any data quality issues head-on. Also keep in mind the investment in the tech infrastructure necessary to build your system, including the direct costs to keep it operational. In the long run, will the money saved now by hand-coding continuously pay off?
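Those recurring checks don't have to be elaborate. Below is a sketch, using assumed table names (orders, plus a hypothetical load_audit table recording load timestamps), of a health check that could run on a schedule and flag an empty target or a stale load; in practice it would feed an alerting system rather than print to the console.

```python
# Recurring health-check sketch: flag an empty target table or a stale load.
# Table names (orders, load_audit) and the lag threshold are placeholders;
# a real version would run on a schedule and page someone instead of printing.
import sqlite3
from datetime import datetime, timedelta, timezone

def check_pipeline(db_path="warehouse.db", max_lag_hours=24):
    con = sqlite3.connect(db_path)
    try:
        target_rows = con.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
        # Hypothetical audit table holding one ISO-8601 UTC timestamp per load.
        last_load = con.execute("SELECT MAX(loaded_at) FROM load_audit").fetchone()[0]
    finally:
        con.close()

    problems = []
    if target_rows == 0:
        problems.append("target table 'orders' is empty")
    if last_load is None:
        problems.append("no load has ever been recorded")
    elif (datetime.now(timezone.utc) - datetime.fromisoformat(last_load)
          > timedelta(hours=max_lag_hours)):
        problems.append(f"last load was at {last_load}, older than {max_lag_hours}h")
    return problems  # an empty list means all checks passed

if __name__ == "__main__":
    for problem in check_pipeline():
        print("ALERT:", problem)
```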
If the type of data integration your organization requires is highly unique, hand-coding your ETL may be both the better and the cheaper option. For all other scenarios, let’s consider the perks of buying instead of building...
Advantages of buying
Why buy? Here's a typical scenario for a lot of companies thinking about ETL: Your organization wants the most robust data pipeline possible and you’re determined to create an environment that allows for autonomy and ownership when it comes to data visualization. However, you don't have the time, desire, or expertise to manage your own pipelines. Here are some reasons a pre-built solution could be the right one for you:
1. You don’t have to build anything
The tool vendor has put their energy and best people on building an ETL tool that they hope you can’t live without. For the vendor, this is what they do; for you and your organization, there’s now freedom to focus on 99 (or more) other problems, knowing that “ETL ain’t one.”
Another consideration is that teams change and people move on. In a perfect world, everything about your ETL has been documented clearly, and yesterday’s data scientists and engineers have ensured that tomorrow’s developers have everything they need for the future. Using a tool for your ETL means not having to construct all that documentation, eliminating a potential and counterproductive lack of clarity when new hires become responsible for a customized data integration solution they didn’t build.
2. Cost is completely justified
Remember: the speed at which you gain insight depends on how quickly your data becomes accessible.
Even if you and your data team are comfortable writing scripts, is it the most efficient use of company time? Development with an ETL tool will be simpler and faster, with the initial outlay of funds more than making up for itself over time, particularly for large or complex projects. In fact, the full benefit of a particular tool may only be revealed after you really start working with it.
Using a tool also relieves you of responsibility for some aspects of data security and data integrity, with data lineage documentation auto-generated from a metadata repository.
3. Data flow is managed and scalability is baked in
Using a tool also facilitates teamwork, making it much easier to share and transfer knowledge throughout an organization. Managed automation turns multiple data flows into an actionable visualization layer, allowing end users to query data without having to understand how any of it works.
Also, if opting to buy, any professional tool should have no problem scaling up as your company grows, handling additional data sources and increased data volumes without compromising performance.
Conclusions
When it comes to data integration, defining the unique business needs of the organization helps bring the build-vs.-buy decision into focus. Like other things in life, there is the convenience of buying—having it assembled for you, and by pros who know what they’re doing—versus the control inherent in building something yourself, which includes the native knowledge of how everything works, since you’re the one that oversaw construction.
Ultimately, you should choose an ETL process based on your end goal—getting data into a state where users can derive insights. Whether you're just looking to make things easier for your analysts or you're creating a culture of data democratization, having the right ETL in place is a massive step in the right direction.