What is ETL?
ETL stands for Extract, Transform, Load. In computing, it is a type of data integration used to blend data from multiple data sources. It is a process that extracts the data from different RDBMS source systems, then transforms the data (like applying calculations, concatenations, splitting, etc.) and finally loads the data into the Data Warehouse system.
What is a Data Warehouse?
A data warehouse is a federated repository of information from many different sources within an organization from operational systems and external data sources. A core component of business intelligence, data warehouse is designed to support business decisions by allowing data consolidation, analysis and reporting at different aggregate levels.
A great example of data warehouse is Panoply. It utilizes machine learning to model and automate the standard data management activities.
In an organization with a lot of data to process, ETL tools and data warehouses are readily available out there to ease the process of management of your data. In this article, we will take a look at two of the top ETL tools and compare them to help you decide which one suits your needs.
Fivetran is going global with offices in USA, India, and Ireland. They build their product with analyst needs in mind, “Shaped by the needs of analysts, our fully managed pipelines enable data-backed decisions company-wide by delivering ready-to-query data into your cloud warehouse”.
That’s So Cool!
- Quick replication of data with minimal lag time. It definitely saves a lot of time compared to manual replication with no assurance on the amount of time you need to allot for debugging.
- A lot of users are praising their great customer service. Customer representatives are technically knowledgeable enough to solve their users’ problems regarding the usage of the tool.
- Works well when you have a lot of data in different places and want to get it all in one spot. Users can easily manage their data sources and can concentrate on more important things.
- It is fast to set-up and connect any app or data source in just a few steps.
- Incremental data syncs allow users to experience seamless data synchronization.
- The ease of the interface. It can easily be used by any user without much of a technical background. A perfect feature for any analyst who works on data warehouses so they can do more on analysis and less work on integrations and coding.
- The time it saves by having a common integration ready to go is extremely helpful. Instead of requiring an engineer to build the connection, you can leverage Fivetran for your data streaming needs.
Not So Cool!
- There's little to no progress estimate on the front end UI to see how long the refresh might take and you need to enable backend logging (i.e. Amazon) to see where it is at in the sync process and derive rough estimates of completion time.
- Needs improvement of visibility into what is happening after integration has been done such as logs of files uploaded, exact upload to column name references and allowing a little more flexibility on some of the newer integrations.
- It would be great to have more detailed logging, when there is an issue it is sometimes hard to diagnose the cause without contacting support.
- It would be great to be able to configure the query batch size (which by default is 10M rows) to allow flexibility, or to automatically adjust based on the table size (GiB and row count) for more control.
- Fivetran only offers one direction data sync which is not suitable for an organization who wants a more secure and diverse backup methodology.
Stitch was recently acquired in 2018 by Talend, a software integration vendor primarily focuses on big data. They market their product for developers, “Stitch is a cloud-first, developer-focused platform for rapidly moving data”.
That’s So Cool!
- Good at replicating relational databases. It manages the conversion of data types and schemas, and it is capable of incremental replication.
- The UI makes it easy to configure which tables should be replicated, and how they should be replicated. It’s very intuitive for general users.
- Their pricing is transparent
- Good debugging and monitoring tools that allows you to detect issues on the spot and provide needed solutions before any issues got worse.
- Data connectors not available in the options? You can build your own using the Singer integration tool, an open source standard for ETL that allows data engineers to replicate data from any source to any destination
- Stitch is a great and economical solution for small organizations who want their data and business intelligence organization to focus on building competitive advantages instead of building data pipelines.
- Stitch data is a more detailed and complex version of ETL that offers better insight into the data analytics. It stores all the user data in user’s warehouse within minutes without API maintenance, cron jobs, scripting or JSON wrangling.
- Free tier enables to sync limited number of records for free. It is an almost zero maintenance product where you set it up correctly and then it does the job.
Not So Cool!
- Cannot replicate the same source to multiple sinks, which is inconvenient if you want to replicate some of a datastore's tables to Redshift and others to Redshift Spectrum, for instance.
- You can only use one data warehouse. This could be a challenge for an enterprise with more than one data warehouse for their business and their clients.
- Stitch charges based on the volume of data imported. At some point it is probably cheaper to hire someone to build the integrations in-house. It is not suited for a big company for which ETL is a really big thing and does not want to outsource it.
- The scheduling is in UTC which might bring confusion to some users in setting schedules. Needs more timezone localization options.
- For someone who relies heavily in MongoDB, users can not customize the fields being migrated. It’s also not so good at replicating document stores like MongoDB to relational databases.
Here’s a tabular comparison of Fivetran and Stitch. Take note, this is not comprehensive.
On top of these ETL tools, you can integrate Panoply’s powerful capabilities to supercharge your organization’s performance and wow your customers. Let these tools handle your data management so you can focus more on what you should be doing best for your company’s exponential growth.