The R Project is an open source programming environment that supports statistical computing and graphics. For R work, as for most data operations, you need an ETL (extract, transform, load) tool to move data from its source into your output database or data warehouse. In some cases, R on its own can act as an ETL tool, and it can also be used to build apps that perform specific ETL tasks. In this post, we've rounded up some of the top open source ETL tools for R and shown what they do best.
R and most open source R ETL processes might be a challenge for non-programmers and beginners, but the user-friendly features of paid ETL tools can make R work easier. Even pro-level developers can save time and effort with the powerful features that subscription BI tools deliver. That’s even more true if you have critical big data jobs that demand enterprise-level support. At the end of this post, we list some paid ETLs that offer power, speed, ease of use and specialized tools for your business intelligence tasks.
Since Spark excels at extracting data, running transformations, and loading the resulting data, you might consider using it as an ETL tool for R. Spark is an open source tool with all sorts of data processing and transformation functionality built in. It's designed to run computations in parallel, so even large data jobs run fast: 100 times faster than Hadoop, according to the Spark website. It scales up for big data operations and can run algorithms on streaming data. Spark has tools for fast data streaming, machine learning and graph processing that output to storage or live dashboards. Spark's own SparkR package provides a frontend that lets you run Spark from R and RStudio, making it easy to integrate into an R-based workflow.
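To get a feel for the R side of that workflow, here's a minimal sketch of an extract-transform-load pass with SparkR. It assumes a local Spark installation, and the file name and column names are hypothetical:

```r
library(SparkR)

# Start a local Spark session (assumes Spark is installed and SPARK_HOME is set)
sparkR.session(master = "local[*]", appName = "r-etl-sketch")

# Extract: read raw CSV data into a distributed Spark DataFrame
events <- read.df("events.csv", source = "csv", header = "true", inferSchema = "true")

# Transform: filter and aggregate in parallel on the Spark side
ok    <- filter(events, events$status == "ok")
daily <- agg(groupBy(ok, "event_date"), total = sum(ok$amount))

# Load: write the result to Parquet and pull a sample back into R
write.df(daily, path = "daily_totals.parquet", source = "parquet", mode = "overwrite")
head(collect(daily))

sparkR.session.stop()
```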
Spark is supported by the community. If you need help, try its mailing lists, in-person groups and issue tracker.
Pentaho Data Integration (PDI), also known as Kettle (the Kettle E.T.T.L. Environment), is an open source ETL tool that uses Pentaho's own metadata-based integration method. The Kettle documentation includes an R script executor step, which adds R's statistical analysis and graphical modeling functions to Kettle's user-friendly, multi-source ETL tool. However, the extra work to install and run the R script executor might be a challenge for beginning programmers.
With Kettle, you can move and transform data, create and run jobs, load-balance data, pull data from multiple sources, and more. You can't sequence the steps within a single transformation, though; ordering is handled by jobs. You design both in Spoon, the GUI that works with Kettle's command-line tools: Pan runs data transformations, and Kitchen runs your jobs. Also be aware that Spoon has some reported issues.
etl, from GitHub contributor Ben Baumer, is an R package that makes your ETL data ops easier. This open source ETL is designed specifically for working with medium data and loading it into a SQL database. All programming work for etl is done in R, and you can output to either local or remote storage and then analyze your tabular data. A useful feature of etl is that you can also use it as a framework for building your own ETL packages.
To get the stable version of etl, install the package from the Comprehensive R Archive Network (CRAN). etl is well documented and frequently downloaded.
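Here's a minimal sketch of that workflow using the package's built-in mtcars example and the default local SQLite database; the working directory is a hypothetical placeholder:

```r
# install.packages("etl")   # stable version from CRAN
library(etl)
library(dplyr)

# Create an etl object; with no db argument it falls back to a local SQLite database
cars <- etl("mtcars", dir = "~/etl-demo")

# Run the extract -> transform -> load pipeline
cars %>%
  etl_extract() %>%
  etl_transform() %>%
  etl_load()

# Analyze the loaded table with dplyr
cars %>%
  tbl("mtcars") %>%
  group_by(cyl) %>%
  summarize(n = n())
```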
uptasticsearch is an R ETL tool from Uptake that moves data from Elasticsearch into R tables. This open source ETL queries Elasticsearch and returns the parsed results as data tables. uptasticsearch supports many of Elasticsearch's built-in aggregations that create summarized data views. A few examples are "date_histogram", "percentiles" and "significant_terms". Support for more aggregations is in the works.
The uptasticsearch ETL tool for Elasticsearch is less documented than elastic. You can get its stable version on CRAN. The dev version, which currently has a build error, is on GitHub.
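If you want to see the basic shape of an uptasticsearch pull, here's a minimal sketch built around its es_search() function; the host, index, and query body are hypothetical:

```r
library(uptasticsearch)

# Hypothetical cluster, index, and Elasticsearch query DSL body;
# es_search() returns the parsed hits as a data.table
orders <- es_search(
  es_host    = "http://localhost:9200",
  es_index   = "orders",
  query_body = '{"query": {"range": {"order_date": {"gte": "2023-01-01"}}}}',
  max_hits   = 10000
)

head(orders)
```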
ETLUtils from GitHub contributor jwijffels is a package of ETL utilities that loads big data from databases into ff objects, using CRAN's R-based ff package. ff stores large datasets on disk while letting you work with them at speeds and capacities similar to data held in RAM. ETLUtils pulls data from SQL databases: MySQL, Oracle, PostgreSQL and Apache Hive.
You can get the stable version of ETLUtils on CRAN, and this open source ETL tool is also listed in the R Project's package index. ETLUtils is only lightly documented on GitHub, but there's more guidance in its reference manual on CRAN.
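As a sketch of how ETLUtils is typically used, the snippet below pulls a query result from MySQL into an on-disk ffdf with read.dbi.ffdf(); the credentials and table are hypothetical, and it assumes the RMySQL driver is installed:

```r
library(ETLUtils)

# read.dbi.ffdf() fetches the query result in batches and stores it in an
# ffdf on disk instead of an in-memory data.frame
orders <- read.dbi.ffdf(
  query = "SELECT order_id, customer_id, amount FROM orders",
  dbConnect.args = list(
    drv      = RMySQL::MySQL(),
    dbname   = "sales",      # hypothetical database
    user     = "analyst",
    password = "secret",
    host     = "localhost"
  ),
  VERBOSE = TRUE
)

dim(orders)
```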
If you're curious to see what some other open source ETL packages can do, and you're comfortable figuring things out on your own, you might try the lightly documented package below. Even more R ETLs are in progress on GitHub, so check back later to see what's new.
RETL, from GitHub contributor Václav Hausenblas, is an open source R package without much documentation. It's licensed, and as of this writing the creator is actively working on it, so it might be an up-and-coming R ETL tool worth checking out.
Panoply solves the "I need data in easy-to-use tables for analysis" problem without any code or engineering work, and you can connect Panoply to RStudio with a standard ODBC connection. The user-friendly Panoply BI tool also has one-click connectors to SQL data sources that support work in the R environment, such as MySQL and PostgreSQL.
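As an illustration of that ODBC connection, here's a minimal sketch using the DBI and odbc packages; the driver name, host, and credentials are hypothetical placeholders for the details shown in your Panoply connection settings:

```r
library(DBI)
library(odbc)

# Hypothetical connection details; Panoply exposes a standard ODBC endpoint,
# so the usual DBI/odbc workflow applies
con <- dbConnect(
  odbc::odbc(),
  Driver   = "PostgreSQL Unicode",    # an ODBC driver installed on your machine
  Server   = "db.panoply.example",    # placeholder host
  Port     = 5439,                    # placeholder port
  Database = "analytics",
  UID      = "analyst",
  PWD      = Sys.getenv("PANOPLY_PASSWORD")
)

monthly <- dbGetQuery(con, "SELECT month, SUM(revenue) AS revenue FROM sales GROUP BY month")
dbDisconnect(con)
```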
Panoply is easy for non-programmers, but it also delivers the speed and support that professional developers need for big and small data ops. This automated ETL data platform pulls data from any source, simplifies it, and stores it all in one place. Panoply continuously streams incoming data to your warehouse in real time. It's the only service that combines a fully integrated ETL with a cloud data warehouse that it builds and manages for you. You can try Panoply out for free or get a personalized demo.
With Blendo’s cloud-based ETL tool, users can get their data into warehouses as quickly as possible by using its suite of proprietary data connectors. This paid ETL-as-a-service platform makes it easy to pull data from many data sources including CSV files and third-party sources like Amazon S3 buckets, Google Analytics, Mailchimp, Salesforce and others.
The Blendo ETL process is a fast and safe way to load data from e-commerce platforms into your data warehouse. After you set up the incoming end of the data pipeline, you can load your data into several storage destinations, including PostgreSQL, which plays nice with R. Blendo's blog has a great guide that shows how to connect to Google BigQuery with R and then run queries in the R environment, and this blog post explains how to get your data from Amazon Redshift and PostgreSQL into R.
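For example, once your pipeline has loaded data into BigQuery, querying it from R typically looks something like the sketch below, using the bigrquery package; the project, dataset, and table names are hypothetical:

```r
library(bigrquery)

# bq_auth() will prompt for Google credentials on first use
project <- "my-analytics-project"   # hypothetical GCP project
sql <- "SELECT channel, COUNT(*) AS sessions
        FROM `my-analytics-project.marketing.sessions`
        GROUP BY channel"

# Run the query in BigQuery, then download the result as an R data frame
tb <- bq_project_query(project, sql)
sessions <- bq_table_download(tb)
head(sessions)
```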
Stitch is a self-service ETL data pipeline solution built for developers. The Stitch API replicates data from any source, and it handles bulk and incremental data updates. Stitch also has a replication engine that uses multiple strategies to deliver data to users. Its REST API accepts JSON or Transit and automatically detects and normalizes nested document structures into relational schemas.
Stitch connects to Amazon Redshift, Google BigQuery and PostgreSQL architectures, and it integrates with a massive suite of data analysis tools. Stitch collects, transforms and loads Google Analytics data into its own system, where it automatically produces business insights on your raw data. There's a built-in integration for R, one of many analysis tools that work with Stitch.
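Because Stitch lands data in warehouses like PostgreSQL, pulling it into R is usually just a DBI query. Here's a minimal sketch with the RPostgres driver; the host, database, credentials, and table are hypothetical:

```r
library(DBI)
library(RPostgres)

# Hypothetical PostgreSQL warehouse that the pipeline loads into
con <- dbConnect(
  RPostgres::Postgres(),
  host     = "warehouse.example.com",
  dbname   = "analytics",
  user     = "analyst",
  password = Sys.getenv("WAREHOUSE_PASSWORD"),
  port     = 5432
)

# Query a replicated table and pull it into an R data frame
sessions <- dbGetQuery(con, "SELECT date, sessions FROM ga_sessions LIMIT 1000")
dbDisconnect(con)
```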