Elasticsearch is the search and analytics engine that powers the Elastic Stack, a suite of products from the Elastic company. This popular search and analytics engine searches websites, apps, enterprises, maps, logs, IoT data sources, and more, and provides a distributed document store for your collected data.
Elasticsearch processes massive numbers of events in seconds and performs complex analytics functions like anomaly detection and machine learning that reveal trends and patterns. With extensions like Elasticsearch SQL, you can get fast answers to your Elasticsearch queries using familiar SQL syntax.
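For example, Elasticsearch SQL exposes a REST endpoint that accepts standard SQL and returns results in tabular form. A minimal request sketch (the `logs` index and its fields are illustrative):

```
POST /_sql?format=txt
{
  "query": "SELECT client_ip, status FROM logs WHERE status >= 500 LIMIT 10"
}
```

The `format` parameter controls the response shape; `txt` returns a plain-text table, while `json` or `csv` are better suited to downstream tooling.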
Enterprises that use Elasticsearch as a data source often need to extract that data for analysis in other business analytics platforms. And if they use Elasticsearch for backend storage, they need a way to put data pulled from other sources into their Elasticsearch data warehouse.
All data operations use ETL (extract, transform, and load) processes to move data into and out of storage. There is a wide range of ETL tools capable of working with Elasticsearch, and we’ve put together a list of the best of them below, organized into “pull” and “put” categories. If you’re looking for something that can help you pull data you’ve already stored out of Elasticsearch, check out the first section. If you’re looking for a tool to help you load data into Elasticsearch, check out the second section.
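Whatever tool you choose, the “pull” side usually boils down to running a query and flattening the JSON hits into rows. A minimal Python sketch of that step, with a hard-coded sample response standing in for a live cluster (the index and field names are illustrative):

```python
# Flatten an Elasticsearch search response into tabular rows, the core of
# any "pull" ETL step. The sample response below stands in for the JSON a
# live cluster would return; field names are illustrative.

sample_response = {
    "hits": {
        "total": {"value": 2},
        "hits": [
            {"_index": "logs", "_id": "1",
             "_source": {"status": 200, "path": "/home"}},
            {"_index": "logs", "_id": "2",
             "_source": {"status": 500, "path": "/checkout"}},
        ],
    }
}

def flatten_hits(response):
    """Turn a search response into a list of flat dicts (one per document)."""
    rows = []
    for hit in response["hits"]["hits"]:
        row = {"_index": hit["_index"], "_id": hit["_id"]}
        row.update(hit["_source"])  # promote document fields to columns
        rows.append(row)
    return rows

rows = flatten_hits(sample_response)
print(rows[1]["path"])  # -> /checkout
```

From here the rows can be handed to a CSV writer, a dataframe library, or a warehouse loader.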
Tools to pull data from Elasticsearch
1. Logstash
Logstash is an open-source data pipeline that can pull and blend data from diverse sources in real time. Logstash is also a product of the Elastic company, and it was built to be compatible with Elasticsearch. Designed to collect data from logs, Logstash easily extracts all types of log data, including web and app logs, as well as events from networks and firewalls in both cloud and on-premises environments. This ETL tool has an input plugin that pulls Elasticsearch query results into Logstash, and its out-of-the-box elasticsearch filter plugin lets you query Elasticsearch for data related to your log events.
Logstash is designed to work with Elasticsearch, but you need to install, verify, run, and maintain it in a development environment. If you’re not a programmer, or if you’re looking for an easy-to-use ETL tool, Logstash probably isn’t your best choice.
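To give a sense of what that configuration looks like, here is a minimal pipeline sketch using Logstash’s elasticsearch input plugin; the host, index pattern, and query are placeholders:

```
input {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "logs-*"
    query => '{ "query": { "match": { "status": "500" } } }'
  }
}
output {
  stdout { codec => rubydebug }
}
```

In a real pipeline you would swap the stdout output for a destination such as a file, a message queue, or another data store.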
2. Panoply (cloud ETL + data warehouse)
Panoply makes it fast and easy for both developers and non-programmers to automatically pull data out of Elasticsearch and query it with SQL. With a few clicks, you can connect Elasticsearch to Panoply’s user-friendly UI.
With Panoply, your Elasticsearch data is automatically extracted and loaded, ready for analysis in popular BI and visualization tools. Under the hood, Panoply uses an ELT (Extract, Load, Transform) approach instead of traditional ETL, which means that data ingestion is faster and more dynamic because you don’t have to wait for transformation to finish before you load your data. And Panoply's managed cloud data warehouse means you won’t need to set up a separate destination to store all the data you pull from Elasticsearch with Panoply’s ELT process.
3. Apache NiFi
The Apache NiFi platform has its own expression language for writing Elasticsearch queries in many of its Elasticsearch connection properties. NiFi lets users build high-performing data pipelines for database ingestion from SQL Server, MySQL, PostgreSQL, and other popular data stores. NiFi’s QueryElasticsearchHttp processor queries Elasticsearch and routes the results as FlowFiles. You can also write SQL queries in NiFi that process extracted Elasticsearch data in-stream. These queries run locally, not in the cloud.
Using NiFi to write Elasticsearch queries or to create ETL pipelines requires a high level of technical knowledge, control, and work in the development environment. This ETL tool probably isn’t the right choice for beginners or non-programmers. But NiFi might meet the specific needs of complex Elasticsearch data analytics and design projects.
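The query you hand to a processor like QueryElasticsearchHttp is ordinary Elasticsearch query DSL. A small illustrative example that pulls the last hour of events (the `@timestamp` field name is an assumption about your mapping):

```
{
  "query": {
    "range": { "@timestamp": { "gte": "now-1h" } }
  }
}
```

NiFi’s expression language can also be used to parameterize properties like this, so the same flow can be reused across indexes or time windows.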
4. Transporter
Transporter from Compose is a data pipeline that performs transformations and moves data from many sources to data warehouses. This open-source ETL tool has an adaptor that supports some versions of Elasticsearch. After data is extracted, the Transporter adaptor converts it into a message format and sends the messages to sinks, where they’re converted to files that are written to Elasticsearch. The adaptor also tracks changes to source data while transmission is in progress and refreshes the data sinks with the latest updates. With Transporter’s goja-based JavaScript functions, you can apply custom processing and transformations to in-stream data.
Transporter is another developer-level ETL tool that runs in a development environment and requires knowledge of Git commands. It’s a great tool for those comfortable with a more technical, code-heavy approach.
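Transporter pipelines are defined in a JavaScript file. A sketch based on the pattern in the project’s documentation, with Elasticsearch as the source and stdout as the sink; the URI and namespace filters are placeholders:

```
// pipeline.js -- illustrative sketch; URI and namespaces are placeholders.
var source = elasticsearch({
  "uri": "https://user:password@example.com:9200/source_index"
})
var sink = file({
  "uri": "stdout://"
})
t.Source("source", source, "/.*/").Save("sink", sink, "/.*/")
```

Custom goja transform functions can be chained between the source and sink to reshape documents in flight.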
5. Dremio
The Dremio self-service platform pulls data from multiple data stores, including Elasticsearch. This ETL tool connects extracted data to any BI tool, as well as Python, R, SQL, and other data analytics platforms, and provides fast results. Dremio is designed for the cloud and is built on Apache Arrow. Its Data Reflections feature uses relational algebra to make data movement and queries run up to 1,000 times faster. Dremio’s website offers a tutorial library, resources, and support to get you started and train you on how to use the platform. Its many Elasticsearch tutorials explain how to use SQL to query Elasticsearch indexes, how to connect Elasticsearch to Dremio, and more.
Open-source Dremio scales to fit the needs of your enterprise, and its user-friendly UI makes it easy even for non-programmers to get answers and create shareable visualizations from their Elasticsearch data.
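Because Dremio exposes Elasticsearch indexes as tables, extraction can be as simple as a SQL query. An illustrative sketch, where `elastic` is the name given to the Elasticsearch source in Dremio and `logs` is a hypothetical index:

```
SELECT status, COUNT(*) AS hits
FROM elastic.logs
GROUP BY status
ORDER BY hits DESC
```

Dremio pushes as much of the query as it can down into Elasticsearch, so aggregations like this need not pull every document across the wire.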
6. elastic
elastic from rOpenSci is a multipurpose R interface for Elasticsearch. Part of the rOpenSci project, this open-source ETL tool lets you apply R’s complex data analytics capabilities to data sets pulled from Elasticsearch. With elastic, you can search, retrieve multiple documents, and parse your Elasticsearch data using R. The library has connectors for a variety of sample data sets, including Shakespeare plays, the Public Library of Science, and the Global Biodiversity Information Facility, all designed to help you familiarize yourself with the workflow. If you want to play around, you can find even more predefined data sets on rOpenSci’s elastic data set page.
You can download the stable version of elastic on CRAN. If you’re curious, you can try the latest development package on GitHub, which is currently passing its build. Note that elastic works best with older versions of Elasticsearch.
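A small R sketch of the workflow; the connection details are placeholders, and the `shakespeare` index assumes the sample data set mentioned above has been loaded into your cluster:

```
library(elastic)
# Connect to a local cluster; host and port are placeholders.
conn <- connect(host = "127.0.0.1", port = 9200)
# Query the Shakespeare sample data and inspect the result.
res <- Search(conn, index = "shakespeare", q = "speaker:HAMLET", size = 5)
res$hits$total
```

The result is a parsed R list, so the hits can be fed straight into data frames or other R analysis code.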
7. uptasticsearch
uptasticsearch is an R ETL tool from Uptake that moves data from Elasticsearch into R data tables. This open-source ETL tool queries Elasticsearch and returns the parsed results as data tables. uptasticsearch supports many of Elasticsearch’s built-in aggregations for creating summarized views of your data; a few examples are "date_histogram", "percentiles", and "significant_terms". Support for more aggregations is in the works.
The uptasticsearch ETL tool for Elasticsearch is less thoroughly documented than elastic. Its stable version is on CRAN, and the development version is on GitHub.
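An illustrative R sketch of a "date_histogram" aggregation pulled through uptasticsearch; the host, index, and timestamp field are placeholders, and the interval syntax depends on your Elasticsearch version:

```
library(uptasticsearch)
# Host, index, and field names are placeholders; the interval keyword
# varies by Elasticsearch version ("interval" vs. "calendar_interval").
out <- es_search(
  es_host    = "http://localhost:9200",
  es_index   = "logs",
  query_body = '{"aggs": {"per_day": {"date_histogram":
                 {"field": "@timestamp", "interval": "day"}}}}'
)
```

The returned object is a data.table, ready for downstream summarization or plotting in R.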
Tools to put data into Elasticsearch
1. Logstash
Elastic’s related product Logstash loads data into dozens of cloud and on-premises data warehouses, including Elasticsearch. With its elasticsearch output plugin, Logstash can easily store logs in Elasticsearch. This ETL tool is a real-time data pipeline that can extract data, logs, and events from many other sources in addition to Elasticsearch, transform them, and then store them all in an Elasticsearch data warehouse. Logstash’s conventional use is log collection, but its capabilities extend to complex data processing, enrichment, analysis, management, and much more.
Logstash’s advanced features make it a good choice for developers who want to move data into Elasticsearch for analytics. But Logstash needs to be configured and run in a development environment, so it’s not the right choice for non-programmers or for people who want to save time with an ETL tool that has a user-friendly UI.
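For the “put” direction, the pipeline simply uses the elasticsearch output plugin. A minimal sketch reading a web server log and indexing it by day; the file path, host, and index pattern are placeholders:

```
input {
  file { path => "/var/log/nginx/access.log" }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "weblogs-%{+YYYY.MM.dd}"
  }
}
```

The `%{+YYYY.MM.dd}` pattern creates one index per day, a common convention that keeps time-series indexes small and easy to expire.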
2. Apache NiFi
For data distribution to Elasticsearch, NiFi’s PutElasticsearch processor writes and updates the contents of FlowFiles, such as those pulled from multiple sources with NiFi’s QueryElasticsearchHttp, loading the data into an Elasticsearch warehouse. Secure connections to Elasticsearch clusters are supported. Only FlowFiles that are successfully written to Elasticsearch are routed to the “success” relationship. NiFi’s Data Provenance feature tracks flow through the data pipeline from upload to backend storage, recording each step of the process, including where data came from, how it was transformed, and where it went. This means you can pinpoint where mistakes happened if you end up with unexpected results in your data warehouse. After making fixes, you can replay the data flow with one click.
As with using NiFi for extraction, this ETL tool’s data storage process requires complex configuration that’s probably only achievable by those with solid programming skills.
3. Transporter
Transporter from Compose is an open-source ETL tool with an adaptor that sends data to some versions of Elasticsearch. This data pipeline’s adaptor can transform and move many types and sources of data into Elasticsearch data warehouses. Transporter pulls data from PostgreSQL and MongoDB cloud databases, and its beta version has a feature that resumes stopped operations; the feature must be enabled in Transporter before it will run.
In order to run Transporter, you’ll need to configure it in a development environment, and you’ll need to manually map a PUT request to Elasticsearch. If that sounds intimidating, you might want to consider another option—this designer-level ETL tool won’t work for people who need fast and easy data loading.