Managing big data is a never-ending challenge. Extract, transform, and load (ETL) processes serve as the traditional foundation for enterprise data warehousing. However, with the growing popularity of big data ETL tools, some IT experts are exploring new methods for transforming data, and that development has fueled controversy over whether ETL will survive. This article examines the ETL process, its challenges, and which ETL tools are likely to survive in the big data universe.
Briefly, extract, transform, and load (ETL) is the process of moving data from its source into a data warehouse or target database. It has long been the standard way of moving data.
Because most ETL activity occurs in batch processes, completing ETL in real time presents its own set of challenges. Specifically, the system must remain available to users creating queries, running reports, and performing other data analysis, all while it is inserting new data and updating tables. That workload simply conflicts with traditional ETL tools.
One way to mitigate this challenge is a direct trickle feed, which continuously sends data from the source and inserts it on the warehouse side as it arrives. However, this method scales poorly and may cause contention between query calls and data updates.
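The trickle-feed idea can be sketched in a few lines. This is a minimal illustration, not a production pattern: the table, column names, and the in-memory SQLite warehouse are all assumptions made for the example, and each change is committed row by row — exactly the behavior that contends with concurrent reporting queries as volume grows.

```python
import sqlite3

# Hypothetical warehouse target; real deployments would use a proper
# warehouse database, not an in-memory SQLite file.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales (order_id INTEGER PRIMARY KEY, amount REAL)"
)

def trickle_feed(change_stream):
    """Apply source changes to the warehouse continuously, one row at a time."""
    for order_id, amount in change_stream:
        # Insert new rows, or overwrite an existing row when the source
        # re-sends an updated version of it.
        warehouse.execute(
            "INSERT OR REPLACE INTO sales (order_id, amount) VALUES (?, ?)",
            (order_id, amount),
        )
        # Committing per row keeps the warehouse current but multiplies
        # write overhead - the scalability problem noted above.
        warehouse.commit()

# Simulated stream of source changes (order 1 arrives, then is updated)
trickle_feed([(1, 19.99), (2, 5.00), (1, 21.50)])
total = warehouse.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
# total is now 26.50: the updated order 1 (21.50) plus order 2 (5.00)
```

The per-row commit is what distinguishes a trickle feed from a batch load, and it is also the source of the update-versus-query contention the text describes.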
Online analytical processing (OLAP) was designed to operate on top of static data and offers no built-in solution for real-time data environments. If real-time data updating is implemented, any query executed while data is being updated may yield inconsistent results.
This challenge can be mitigated by storing a snapshot of the updated data separately from other operations and making it available for more complex query execution. The drawback is that this approach takes considerable work to set up and maintain. Additionally, users of the isolated snapshot must understand that there will be reporting inconsistencies due to the timing of updates relative to concurrent query execution.
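The snapshot workaround, and the reporting inconsistency it introduces, can be demonstrated concretely. The schema and table names below are invented for illustration; the point is that anything loaded after the snapshot is rebuilt is invisible to snapshot readers.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales_live (order_id INTEGER PRIMARY KEY, amount REAL)")
db.executemany("INSERT INTO sales_live VALUES (?, ?)", [(1, 10.0), (2, 20.0)])

def refresh_snapshot():
    """Rebuild the reporting snapshot as a point-in-time copy of the live table."""
    db.execute("DROP TABLE IF EXISTS sales_snapshot")
    db.execute("CREATE TABLE sales_snapshot AS SELECT * FROM sales_live")
    db.commit()

refresh_snapshot()

# New data keeps landing in the live table after the snapshot was taken...
db.execute("INSERT INTO sales_live VALUES (3, 30.0)")
db.commit()

# ...so live and snapshot totals now disagree - the reporting
# inconsistency that snapshot users must understand and accept.
live_total = db.execute("SELECT SUM(amount) FROM sales_live").fetchone()[0]
snap_total = db.execute("SELECT SUM(amount) FROM sales_snapshot").fetchone()[0]
# live_total is 60.0; snap_total is still 30.0
```

Complex reports run against `sales_snapshot` without blocking or being blocked by updates to `sales_live`; the cost is that their results lag the live data until the next snapshot refresh.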
OLAP scalability is a direct function of the amount of data queried and the number of users executing queries at the same time. Even when the data is static, query performance degrades roughly in proportion to the number of concurrent users; a real-time data warehouse must also absorb continuous data updates on top of that constant query load. Complex queries combined with large numbers of users accessing the warehouse for reporting will produce query contention and limit system scalability. To meet this challenge, you may want to segregate reports that require real-time data from those that can run against static data. While not ideal, this can provide relief and help keep the data flowing and users happy.
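The segregation approach amounts to a routing decision: only reports that genuinely need up-to-the-minute data touch the live warehouse, and everything else is served from the static copy. A minimal sketch, with all report and store names invented for the example:

```python
# Reports that truly require real-time data; everything else is assumed
# to tolerate the staleness of a periodic static snapshot.
LIVE_REPORTS = {"fraud_alerts", "inventory_levels"}

def route_report(report_name: str) -> str:
    """Decide which data store should serve a given report."""
    if report_name in LIVE_REPORTS:
        return "live_warehouse"      # pays the contention cost
    return "static_snapshot"         # cheap, consistent, slightly stale

# Most reporting traffic is diverted away from the live warehouse,
# easing contention between updates and queries.
print(route_report("fraud_alerts"))           # live_warehouse
print(route_report("monthly_sales_summary"))  # static_snapshot
```

The value of this scheme comes from keeping the `LIVE_REPORTS` set small: every report moved to the snapshot side is query load removed from the system that is also absorbing continuous updates.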
ETL will not die, and neither will the tools used to extract, transform, and load data from its source to a data warehouse. Informatica has reworked its offering to include new features that help customers organize big data into meaningful information. A close second is IBM InfoSphere DataStage, followed by Microsoft SSIS and Oracle ODI. Alternatives to the leaders include Talend, SAP BODS, SnapLogic, and Pentaho. All offer a way to rein in your fears and manage your data.
ETL and its tools, especially those from the leading vendors in the data warehousing space, are here to stay. The longevity of ETL-based applications in this industry will depend on their ability to let customers run the ETL process effectively and efficiently alongside concurrent query execution.