“To buy or not to buy, that is the question.”
Ok; comparing the construction of a company’s ETL pipeline with Hamlet’s binary choice on whether to stick around or just get some sleep is perhaps a touch dramatic. But let’s not minimize how important the decision-making framework is that goes into deciding the best Extract, Transform, Load (ETL) process for any organization, large or small. It’s extremely significant. This article assesses some of the dominant reasons why buying tools might work best for some, while hand-coding the ETL layer may work better for others.
Before contrasting and comparing different build-versus-buy scenarios, here’s a quick overview on what purpose the ETL process is meant to serve.
ETL Process and Purpose
The Extract, Transform, Load process describes the infrastructure built to pull the raw data required for analysis (extract), convert and prepare it in some useful form for a business need (transform), and deliver it to a data warehouse (load).
This automated process — extracting the raw data from trusted sources, transforming it for business analytics, and loading it for storage — is an integral part of any company’s data pipeline; in fact, it is perhaps the integral part. Successful extraction means data has been converted into a single format for standardized processing.
A legacy or traditional form of implementing ETL is to process the data in batches (say, once every 24 hours). With batch processing, data is stored, and collection stopped at some point to “forklift” the data over to your data warehouse. The loaded and transformed data is then available for BI tools to provide business insights at the visualization layer. Powerful software frameworks like Hadoop MapReduce or Stitch have been designed for batch processing of large datasets.
A more modern approach — one where real-time analytics are crucial, latency factors are significant, and extracting and transforming large batches of data in bulk may be counterintuitive for business needs — is to have the ETL pipeline perform stream processing of data. A stream processing tool, such as Apache’s Spark or Kafka, is utilized to process data in-memory, greatly increasing the time data becomes actionable.
While batch processing lets the data build up, stream processing renders the data more steadily accessible, spreading the processing over time. Batch processing is moving data in bulk, validating and cleaning the data, and then staging it in a database before loading the data to target tables within the data warehouse (an important check in the pipeline, making it easier to roll back if something goes wrong). Stream processing is moving data at speed, and should be viewed as preferable, perhaps mandatory, when the caliber of insight and value possible from touching data immediately decreases as the time component increases. There are some very powerful tools out there for either scenario, and it’s always a good idea to familiarize yourself with ETL tool options.
The DIY Decision Matrix
Let’s start at the beginning. (While that phrase may sound redundant, this includes scenarios where a company, having grown and/or focused on a million other things, is realizing they’re way past the starting line and galloping along an inefficient trajectory with respect to data visualization.)
1) Form a schematic that defines scope
By putting a plan together to help dictate exactly what the needs are, we are also describing the project parameters. The top of that list should be pretty obvious: make your data make sense. There is a human need for this data, upon which the end users gleaning intel from dashboards are reliant. They are also counting on that data to be delivered clean and useful, and readily prepared for them to derive insights from the assembled analytics.
We can start with a few straightforward questions:
- What are the top business needs to be addressed? What data jobs are essential?
- What are the business needs you want to address, but haven’t been able to? What data jobs would you like to incorporate, but didn’t think possible or viable?
- If building your own ETL, will enough time be allotted to correctly develop and test it? And — uber important — will it scale accordingly as the company grows?
- How much money will be saved hand-coding versus buying a tool? Is it worth it? For how long?
- If using hand-coding, how responsive will the in-house team be to failure? This is also highly significant.
- Conversely, if using outside tools for your ETL, how agile are those tools with respect to your current, and changing, needs? This is your future growth, after all!
Each company’s requirements are different, yet no company will benefit from a pipeline that’s unnecessarily sophisticated or poorly structured. Doing so will all but guarantee underperformance.
2) Describe project guidelines for the desired ETL system
The above questions assist in fine-tuning the requirements of the organization. Going forward, planning decisions regarding design have been discussed and agreed upon by all parties, and the business teams (such as sales or marketing, or at the executive level) are confident that the ETL system developers are designing a data integration solution that is optimal for the company.
Those responsible for development are documenting everything with enough detail as is necessary, with the documentation simple, concise, and written accordingly for those who will rely on it most (DBAs, the IT team, and so on). Whether the decision was made to hand-code the ETL or purchase a tool, successful installation means everything is configured effectively, and testing has taken place with representative data sources — including the intentional introduction of corrupted data at certain iterations — to evaluate outcomes and ensure the system is behaving as desired. Tests should become automated whenever possible, and a support system put in place, with any training available as needed.
As each step is executed in constructing and implementing your company’s preferred ETL method, make sure that initial questions are answered, and repeatedly. Treat acceptance testing as if it had real-world significance, and adopt a proactive posture that mitigates any potential data quality issues: regular auditing and profiling of target tables or the system as a whole.
Illuminating the pros and cons of building-versus-buying ETL will address your specific BI requirements, and a more detailed, informed decision can then be made by the development team on how to best proceed forward. It’s also always a good idea to separate fact from fiction when it comes to ETL, or at least fact from really bad advice. This is important. As mentioned earlier after the mutated Shakespeare quote, the best-possible data delivery system is a very significant component for any company’s successful growth. After all the planning, cost, and hard work that goes into whichever ETL solution was decided upon, the absolute last thing we want to see is our data metaphorically strutting and fretting upon our (data pipeline) stage, “full of sound and fury, signifying nothing.”
Advantages of Building
In this scenario, you know what metrics matter, how you want everything configured, and demand control over your data pipeline. A DIY solution is where you’re headed; here are a few finer points to consider when solidifying that choice.
1) DIY systems offer a level of flexibility that a tool can’t provide
With hand-coding, you can build anything you want, write any script however you see fit. A tool-based approach may be limited by the vendor’s abilities in this regard. A hand-coded system can be tailored to more directly manage any metadata as well, and this increased flexibility might be desirable depending on business need.
It’s nonetheless important to gauge this prospect accordingly before taking that leap to build: Is there an in-house data team dedicated to writing scripts and managing workflows? If there isn’t a team (or lone data scientist) for whom this is their sole focus, is the task of hand-coding ETL relegated to someone’s second job? Is this wise? Even if the coding chops are there to build the pipeline, there may simply not be enough time to properly write the documentation needed, and with sufficient attention and follow-through for the requisite testing to ensure optimization.
2) Total control, total visibility
Control is awesome! Does the control of running data jobs wherever and whenever you determine outperform the use of tools you could buy (or have used in the past)? Does a custom data integration solution therefore render any current ETL tool part of unnecessary overhead to be trimmed away from productivity costs? If so, then hand-coding looks extremely viable.
However — and as the saying goes — with great power comes great responsibility. How are you positioned to handle things when data fails to load, or only partially loads, with that failure potentially not even in the ETL layer but somewhere within source data? Are the necessary checks built in to swiftly remedy the situation? Will new data sources / connection changes be integrated seamlessly?
3) Cost effectiveness
If there are also budgetary concerns (aren’t there always?), and your current data pipeline needs do not require or would not utilize everything an ETL tool has to offer when evaluated against the tool’s cost to implement, a DIY solution may indeed be advantageous.
However, along with budgeting for maintenance, it is imperative that ETL testing is taken seriously, and that checks are placed throughout your ETL, with regular monitoring and troubleshooting to ensure optimal performance and address any data quality issues head-on. Also keep in mind there is the investment in the tech infrastructure necessary to build your system, including the direct costs to keep it operational. In the long run, will the money saved now by hand-coding continuously pay off?
In short, do you have what it takes to ensure multiple data sources are brought together to efficiently provide your end users a 360-degree view of valuable data?
If the type of data integration your organization requires is highly unique, hand-coding your ETL may be the best option, and one that’s cheaper to construct and maintain given the parameters involved. For all other scenarios, let’s consider the perks of buying instead of building...
Advantages of Buying
Here’s a typical scenario describing why one would gravitate toward using an ETL tool: Your organization wants the most robust data pipeline possible. You’re determined to create an environment that allows for autonomy and ownership when it comes to data visualization and implementation, and further conclude that getting there means incorporating specialized ETL tools from trusted vendors, tools tailored to your specific business needs. Finally, you have neither the time, desire, nor expertise to spend building your own solution. Let’s review the pros on why this might be the preferred choice in the build-vs-buy match up.
1) You don’t have to build anything
The tool vendor has put their energy and best people on building an ETL tool that they hope you can’t live without. If the tool meets your business requirements, paying for its automation capabilities and ease-of-use allows you to leverage all the effort, time, and money the vendor team put into building something you’ll never have to. For the vendor, this is what they do; for you and your organization, there’s now freedom to focus on 99 (or more) other problems, knowing that “an ETL ain’t one.”
Another consideration is that teams change, and people move on. In a perfect world, everything about your ETL has been documented clearly, and yesterday’s data scientists and engineers have ensured that tomorrow’s developers have everything they need, and none of it is vague. Using a tool for your ETL means not having to construct one, eliminating a potential and counterproductive lack of clarity when new faces are responsible for a customized data integration solution they didn’t build.
2) Cost is completely justified
Remember: Speed of insight depends on the velocity toward accessibility.
Even if you and your data team are comfortable writing scripts, is it the most efficient use of company time? Development with an ETL tool will be simpler and faster, with the initial outlay of funds more than making up for itself over time, particularly with projects that are large or possess increasing complexity. In fact, the full scope of capabilities possible with a vendor’s tool may only be revealed after specific and disparate use cases are run.
Using a tool also transfers aspects of data security and data integrity away from being entirely your responsibility, with data lineage documentation auto-generated from a metadata repository.
3) Data flow is managed, visualization provided, scalability baked-in
Using a tool literally facilitates teamwork, making it much easier to share and transfer knowledge throughout an organization, and fosters a unified sense of working together via the ease of comprehension taking place on everyone’s dashboard. ETL jobs are easy to schedule, and built-in connectors are already optimized for efficiency when used to extract and deliver data from sources and targets. This managed automation transforms multiple data flows into an actionable visualization layer, allowing end users to query data without having to understand how any of it works.
Also, if opting to buy, any professional tool should have no problem scaling up as your company grows, handling additional data sources and increased data volumes without compromising performance.
Finally, a hand-coded ETL layer will dictate far more responsibility and oversight from the development team, and may be difficult to maintain or, worse, deliver suboptimal results due to poor planning and/or project management. These costs to the organization will eventually far outweigh the upfront expenditures associated with buying and integrating ETL tools. And — perhaps most significant — if not noticed as the source of inefficiency in the data pipeline, the DIY aspect of building your own could be seen as some externality, a misunderstood and out-of-proportion cost given the demands of business need, rather than what it has truly become: an actual culprit handicapping the delivery of clean, useful data. This alone makes a fairly strong argument for tool-based ETL.
When it comes to data integration, defining the unique business needs of the organization helps bring the build-vs-buy decision into focus. Like other things in life, there is the convenience of buying — having it assembled for you, and by pros who know what they’re doing — versus the control inherent in building something yourself, which includes the native knowledge of how everything works, since you’re the one that oversaw construction.
Oversight and proper management are key; failure to do so will certainly render the data a convoluted mess. Properly weighing your options, and being flexible and attentive enough to adjust as needed, will make for sound business decisions.
The ETL process a company chooses will show its efficacy by competently tackling business needs while simultaneously structured not to take on the unnecessary. After all, the objective — moving data from multiple locales and/or disparate systems to an organized database, rendered clean and delivered useful for insights — reinforces the goal of data democratization, encouraging independent, self-service data visualization and analysis for everyone across an organization’s ecosystem.