Node.js is an open source, cross-platform runtime that runs JavaScript outside the browser and is widely used to build scalable network apps and web servers. It is event-driven: a single-threaded event loop handles incoming work such as HTTP requests and sits idle when there is nothing to do. I/O in Node.js is asynchronous and non-blocking, so a slow network or disk operation doesn’t hold up the rest of your program. And because your JavaScript runs on a single thread, you rarely have to deal with the locks and deadlocks that come with multithreaded servers.
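Here is a minimal sketch of what asynchronous, non-blocking behavior looks like in practice. The `fakeRead` helper is hypothetical, and its timer only stands in for real network or disk I/O. The point is that the program keeps running while both operations are pending, and the faster one finishes first.

```javascript
// Simulates a slow I/O call with a timer; a real program would read a file or socket.
function fakeRead(label, ms) {
  return new Promise((resolve) => setTimeout(() => resolve(label), ms));
}

const order = [];

async function main() {
  order.push('start');
  // Both reads are started before either one finishes, so they overlap.
  const slow = fakeRead('slow', 50).then((value) => order.push(value));
  const fast = fakeRead('fast', 10).then((value) => order.push(value));
  order.push('still running'); // runs while both reads are still pending
  await Promise.all([slow, fast]);
  return order;
}

const done = main();
// prints: start -> still running -> fast -> slow
done.then((result) => console.log(result.join(' -> ')));
```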
For any data operations, you need an ETL (Extract, Transform and Load) process to move your data from its source to your output or data warehouse. And when you build apps with Node.js, you need an ETL tool that works with it. Even for other data work, you might want the flexibility and advantages that an ETL built on Node.js offers. In this post, we'll cover some of the top open source Node.js ETL tools and what they do best.
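To make the three ETL stages concrete, here is a minimal sketch in plain Node.js with no ETL library at all. The function names, the sample CSV and the in-memory `warehouse` array are all made up for illustration, and the CSV parsing is deliberately naive (it does not handle quoted fields).

```javascript
// Extract: parse raw CSV text into row objects (naive split; no quoted fields).
function extract(csvText) {
  const [headerLine, ...lines] = csvText.trim().split('\n');
  const headers = headerLine.split(',');
  return lines.map((line) => {
    const cells = line.split(',');
    return Object.fromEntries(headers.map((header, i) => [header, cells[i]]));
  });
}

// Transform: convert types and drop rows whose amount is not a number.
function transform(rows) {
  return rows
    .map((row) => ({ customer: row.customer.trim(), amount: Number(row.amount) }))
    .filter((row) => Number.isFinite(row.amount));
}

// Load: append rows to a destination; an array stands in for a warehouse table.
function load(rows, table) {
  table.push(...rows);
  return table.length;
}

const source = 'customer,amount\nacme,120.50\nglobex,n/a\ninitech,75';
const warehouse = [];
const loaded = load(transform(extract(source)), warehouse);
console.log(`loaded ${loaded} rows`, warehouse);
```

The tools below wrap these same three steps in connectors, scheduling and error handling, so you don't have to write them by hand for every source.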
Node.js and most open source Node.js ETL tools might be a challenge for non-programmers and beginners, but the user-friendly features of paid ETL tools can make things easier. Even pro-level developers can save time and effort with the powerful features that subscription data management tools deliver. That’s even more true if you have critical big data jobs that demand enterprise-level support. We're going to start with open source ETL options, but at the end of the post, we'll list some paid ETLs that offer power, speed, ease of use and specialized tools for your business intelligence tasks.
Open source ETLs
1. Empujar
Empujar from TaskRabbit is a Node.js-based ETL tool that pushes data and does backup and other data operations. It takes advantage of Node.js’s asynchronous nature to run data operations in series or parallel. Empujar uses a book, chapter, page format. Top-level objects are called books. They contain chapters, which are run in order, with pages that can be run in parallel up to a limit that you set.
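The series-and-parallel idea can be sketched in plain JavaScript. This is not Empujar's actual API: `runBook`, `runWithLimit` and the sample book are hypothetical names, used only to show chapters running in order while the pages inside a chapter run in parallel up to a limit.

```javascript
// Runs async tasks with at most `limit` in flight at once.
async function runWithLimit(tasks, limit) {
  const results = new Array(tasks.length);
  let next = 0;
  async function worker() {
    while (next < tasks.length) {
      const index = next++; // claim the next task; safe because JS runs on one thread
      results[index] = await tasks[index]();
    }
  }
  const workers = Array.from({ length: Math.min(limit, tasks.length) }, () => worker());
  await Promise.all(workers);
  return results;
}

// A "book" is an ordered list of chapters; each chapter holds pages (async functions).
async function runBook(book, pageLimit) {
  const log = [];
  for (const chapter of book) {
    // Chapters run one after another, while the pages inside a chapter
    // run in parallel, up to pageLimit at a time.
    const results = await runWithLimit(chapter.pages, pageLimit);
    log.push({ chapter: chapter.name, results });
  }
  return log;
}

// Hypothetical page factory: resolves with its name after `ms` milliseconds.
const page = (name, ms) => () =>
  new Promise((resolve) => setTimeout(() => resolve(name), ms));

const book = [
  { name: 'extract', pages: [page('users', 20), page('orders', 10), page('items', 5)] },
  { name: 'load', pages: [page('warehouse', 5)] },
];

const bookRun = runBook(book, 2);
bookRun.then((log) => console.log(JSON.stringify(log)));
```

With a limit of 2, the third page of the first chapter only starts once one of the first two has finished, and the second chapter waits for the whole first chapter.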
Out of the box, this open source ETL tool connects to MySQL, Amazon Redshift, Elasticsearch, FTP, and S3. Empujar’s GitHub documentation explains how to add these built-in offerings, as well as how to create your own connections.
2. Nextract
Nextract from GitHub contributor Chad Auld is an ETL tool built on Node.js streams. This open source ETL tool might be a good choice if you’re a beginning or mid-level programmer. It was designed to make developers’ work easier than some of the more popular Java-based ETL tools by giving you the flexibility of Node.js’s asynchronous runtime and the ease of using JavaScript. And you can use npm modules (JavaScript packages) to extend what Nextract can do.
This well-documented ETL tool supports several databases: Postgres, MSSQL, MySQL, MariaDB, SQLite3, and Oracle. Nextract can extract and output CSV and JSON data, and it also extracts data from database queries and outputs it to tables. With its built-in plugins, you can perform even more ETL operations like sorting, filtering, and basic math. One current limitation is that Nextract runs on the resources of a single machine, so it doesn’t work well with big data.
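To give a feel for stream-based ETL, here is a small sketch that uses only Node.js's built-in `stream` module, not Nextract's own plugins. The step names (`filterRows`, `addTotal`, `collectInto`) and the sample rows are hypothetical. It filters rows and does some basic math as the data flows through.

```javascript
const { Readable, Transform, Writable } = require('node:stream');
const { pipeline } = require('node:stream/promises');

// Transform step: keep only rows that pass the predicate.
const filterRows = (predicate) =>
  new Transform({
    objectMode: true,
    transform(row, _encoding, callback) {
      if (predicate(row)) this.push(row);
      callback();
    },
  });

// Transform step: add a computed column (basic math on existing fields).
const addTotal = () =>
  new Transform({
    objectMode: true,
    transform(row, _encoding, callback) {
      callback(null, { ...row, total: row.price * row.quantity });
    },
  });

// Load step: collect rows in memory; a real job would write to a table or file.
const collectInto = (target) =>
  new Writable({
    objectMode: true,
    write(row, _encoding, callback) {
      target.push(row);
      callback();
    },
  });

const input = [
  { sku: 'a', price: 2, quantity: 3 },
  { sku: 'b', price: 5, quantity: 0 },
  { sku: 'c', price: 4, quantity: 2 },
];
const output = [];

// pipeline() wires the steps together and rejects if any step fails.
const streamRun = pipeline(
  Readable.from(input),
  filterRows((row) => row.quantity > 0),
  addTotal(),
  collectInto(output)
).then(() => output);

streamRun.then((rows) => console.log(rows));
```

Because each row moves through the steps one at a time, memory use stays low even for large inputs, though everything still runs on a single machine.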
3. Extraload
Extraload from GitHub contributor Alayton Norgard is a lightweight ETL tool for Node.js that moves data from files into databases and between databases. It makes ETL work fast and easy with JavaScript coding and the time-saving advantage of Node.js’s non-blocking programming. Extraload also updates search platform indexes such as Apache Solr, which is built on Apache Lucene.
This open source ETL package has API reference pages that explain how to do tasks like writing scripts and creating drivers. Extraload’s GitHub documentation also guides you on installing the drivers for CSV, MySQL, XML and XPath.
4. Datapumps
Datapumps from agmen.hu is a basic ETL tool for Node.js that uses "pumps" to read input and write output. A simple example is exporting data from MongoDB to Excel. For complex ETL work, you can create a group of pumps. Some features of this open source ETL tool are data transformation, encapsulation, error handling and debugging.
There’s some additional setup needed with Datapumps because it doesn’t do all the ETL work on its own. The Datapumps components only pass data in a controlled flow. It relies on 10 different mixins to import, export and transfer your data. Each time you add a new mixin, you’ll have to fork Datapumps and create a pull request in GitHub.
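If the pump-plus-mixin idea is new to you, this toy sketch may help. It is not the Datapumps library or its API: the `Pump` class, its methods and `csvLineMixin` are hypothetical and written only for this example. The pump itself just passes data along, and a mixin plugs in the behavior that does the real work.

```javascript
// A bare "pump": moves items from an input list to an output list, one at a time.
class Pump {
  constructor() {
    this.input = [];
    this.output = [];
    this.process = (item) => item; // default: pass data through unchanged
  }

  from(items) {
    this.input = items;
    return this;
  }

  // A mixin is a function that adds capabilities to this pump instance.
  mixin(extend) {
    extend(this);
    return this;
  }

  run() {
    for (const item of this.input) {
      const result = this.process(item);
      if (result !== undefined) this.output.push(result);
    }
    return this.output;
  }
}

// Hypothetical mixin: adds a toCsvLine() helper and uses it as the processing step.
const csvLineMixin = (columns) => (pump) => {
  pump.toCsvLine = (row) => columns.map((column) => String(row[column])).join(',');
  pump.process = pump.toCsvLine;
};

const lines = new Pump()
  .from([{ id: 1, name: 'ada' }, { id: 2, name: 'grace' }])
  .mixin(csvLineMixin(['id', 'name']))
  .run();

console.log(lines);
```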
5. proc-that
proc-that is an extendable ETL tool from smartive. It’s written in TypeScript (a typed superset of JavaScript), but it also supports JavaScript coding. proc-that combines the asynchronous job streaming of Node.js with TypeScript’s static typing, which helps you catch mistakes before a job runs.
proc-that’s GitHub page shows you how to import its ETL tool and then add its built-in transformers and loaders. If you want to implement your own extractors, transformers, and loaders, the creators of proc-that invite you to contribute them to their list in the proc-that GitHub repo. As of this writing, proc-that has a build-status failing badge, so you might want to check that before you get started.
A not-so-documented open source Node.js ETL tool
If you’re curious to see what some other open source ETL tools can do, and you’re comfortable with figuring things out on your own, you might try this Node.js-based ETL—with not much documentation. Even more ETLs are in progress on GitHub, so check back later to see what’s new.
Paid ETLs for Node.js
1. Eventn
Eventn is a subscription ETL manager that builds, deploys and scales RESTful Node.js microservices, the small independent processes that together form a complex app. Its secure platform creates database-backed RESTful services for the web, including services for IoT. With user-friendly Eventn, you can deploy a public-facing microservice with just one click. It embeds an ETL process into your microservice that collects structured and unstructured data, including API data from multiple sources, in real time.
This serverless enterprise-level BI tool runs anywhere. You build Eventn microservices with JavaScript code and npm modules in a sandboxed Node.js environment, so you get the advantages of JavaScript and the Node.js runtime, including non-blocking I/O. You can edit and test your code with the developer tools in Eventn’s UI. In addition to ETL, Eventn gives you natural language processing (NLP), event processing and some extra utilities.
2. Panoply
Panoply is easy for non-programmers, but it also delivers the unbeatable speed and support that professional engineers need for big and small data ops. This automated ETL data platform pulls data from any source, simplifies it, and stores it all in one place. Panoply continuously streams data in real time to your output. It’s the only service that combines a fully integrated ETL with a cloud data warehouse that it builds and manages for you.
The user-friendly Panoply BI platform has one-click connectors to many data sources that are commonly used with Node.js, including MongoDB, MySQL and PostgreSQL. Panoply uploads your raw data in minutes and automatically sorts and models it during the upload, so you can start querying all your data right away. You can try Panoply out for free or get a personalized demo.