The Java platform is a free software download that many of today’s websites and apps can’t run without. Java is practically a requirement for most internal and cloud applications. Developers use its object-oriented programming language to build desktop and mobile apps. You can write complex ETL (extract, transform and load) processes in Java that go beyond what’s available out of the box in most ETL tools.
If you use Java to write code for data transformations or other ETL functions, you also need an ETL tool that supports Java work. Java is one of the most popular and powerful programming languages. And there’s an abundance of open source and paid ETLs to choose from that work with Java code. You won’t have any trouble finding one that meets your specific data project needs.
This blog gives you information on some of the best open source ETLs for Java. Some ETLs that used to be open source have become paid services. At the end of the blog, we also list some paid ETLs that might meet your needs for big BI data projects that need pro-level support.
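To make the idea of hand-written Java ETL concrete, here is a minimal sketch that uses only the JDK. The class name, the "name,amount" input format and the in-memory target are assumptions made for illustration; a real job would read from and write to actual data stores.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Plain-JDK sketch of extract, transform and load; not tied to any ETL tool.
class SimpleEtl {

    // Extract: split raw "name,amount" lines into fields, skipping blank lines.
    static List<String[]> extract(List<String> rawLines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : rawLines) {
            if (!line.trim().isEmpty()) {
                rows.add(line.split(","));
            }
        }
        return rows;
    }

    // Transform: normalize each name to lowercase and sum the amounts per name.
    static Map<String, Integer> transform(List<String[]> rows) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (String[] row : rows) {
            String name = row[0].trim().toLowerCase(Locale.ROOT);
            int amount = Integer.parseInt(row[1].trim());
            totals.merge(name, amount, Integer::sum);
        }
        return totals;
    }

    // Load: the "warehouse" here is just a list of output lines.
    static List<String> load(Map<String, Integer> totals) {
        List<String> target = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : totals.entrySet()) {
            target.add(entry.getKey() + "=" + entry.getValue());
        }
        return target;
    }

    public static void main(String[] args) {
        List<String> source = Arrays.asList("Alice, 10", "bob, 5", "", "ALICE, 7");
        System.out.println(load(transform(extract(source)))); // prints [alice=17, bob=5]
    }
}
```

The ETL tools below take over the parts this sketch leaves out, such as connectors, scheduling, error handling and scale.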
Free and open source Java ETLs
1. Apache Spark
Apache Spark is an open source data processing engine from the Apache Software Foundation, and you can use it as an ETL that works with Java. The Spark quickstart shows you how to write a self-contained app in Java. You can get even more functionality with one of Spark’s many Java API packages.
This open source ETL has all sorts of data processing and transformation tools built in. It’s designed to run computations in parallel, so even large data jobs run fast: up to 100 times faster than Hadoop, according to the Spark website. And it scales up for big data operations and can run algorithms on streaming data. Spark has tools for fast data streaming, machine learning and graph processing that output to storage or live dashboards.
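If you want a feel for what running computations in parallel means, the sketch below counts words with the JDK’s parallel streams. This is not the Spark API. It only spreads work across the cores of one machine, whereas Spark distributes work across a cluster. The class and method names are made up for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Plain-JDK illustration of data-parallel processing; this is NOT the Spark API.
class ParallelWordCount {

    // Splits each line into words and counts them, letting the JDK spread the work across cores.
    static Map<String, Long> countWords(List<String> lines) {
        return lines.parallelStream()
                .flatMap(line -> Arrays.stream(line.toLowerCase(Locale.ROOT).split("\\s+")))
                .filter(word -> !word.isEmpty())
                // TreeMap keeps the result sorted by word, so the output order is stable.
                .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("extract transform load", "load the data", "transform the data");
        System.out.println(countWords(lines)); // prints {data=2, extract=1, load=2, the=2, transform=2}
    }
}
```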
2. Jaspersoft ETL
Jaspersoft ETL is a free platform that works with Java code. With this open source ETL, you can embed dynamic reports and print-quality files into your Java apps and websites. It extracts report data from any data source and exports to 10 formats.
If you’re a developer, Jaspersoft ETL is an easy-to-use choice for data integration projects. You can download the community edition for free. The open source version is recommended for small work groups. For larger enterprises and professional-level support, you might opt for the enterprise edition. Documentation and tutorial links on the community page tend to take you to info on the paid version.
3. Scriptella
Scriptella is an open source ETL tool that was written in Java. It was created for programmers to simplify data transformation work. You don’t have to learn a new scripting language because Scriptella works with just about any scripting language you already know, including Java.
Scriptella supports cross-database ETL scripts, and it works with multiple data sources in a single ETL file. This ETL tool is a good choice to use with Java when you’ve got source data in different database formats that needs to be run in a combined transformation.
4. Apatar
If you work with CRM systems, Apatar, a Java-based open source ETL, might be a good choice. It moves and synchronizes customer data between your own systems and third-party applications. Apatar can transform and integrate large, complex customer datasets. You can customize this free tool with the Java source code that’s included in the package.
The Apatar download saves time and resources by leveraging built-in app integration tools and reusing mapping schemas that you create. Even non-developers can work with Apatar’s user-friendly drag-and-drop UI. No programming, design or coding is required with this cost-saving but powerful data migration tool that makes CRM work easier.
5. Pentaho Data Integration (Kettle)
Pentaho’s Data Integration (PDI), or Kettle (Kettle E.T.T.L. Environment), is an open source ETL tool that uses Pentaho’s own metadata-based integration method. Kettle documentation includes Java API examples. And its wiki tells you how to run Kettle transformations with Java.
With Kettle, you can move and transform data, create and run jobs, load balance data, pull data from multiple sources, and more. But you can’t sequence your transformations. You’ll need Spoon, the GUI for designing jobs and transformations that work with Kettle’s tools: Pan does data transformation, and Kitchen runs your jobs. However, Spoon has some reported issues.
6. Talend Open Studio
Go past basic data analysis and storage with Talend Open Studio for Data Integration, a cloud-friendly ETL that can embed Java code libraries. Open Studio’s robust toolbox lets you work with code, manage files, and transform and integrate big data. It gives you graphical design and development tools and hundreds of data processing components and connectors.
With Talend’s Open Studio, you can import external code, create and expand your own, and view and test it in a runtime environment. Check your final products with Open Studio’s Data Quality & Profiling and Data Preparation features. You can get the open source download on the Talend website.
7. Spring Batch
Spring Batch is a full-service ETL with extensive documentation and training resources. This lightweight, easy-to-use tool delivers robust ETL for batch applications. With Spring Batch, you can build batch apps, process simple or complex batch jobs, and scale up for high-volume data processing. It has reusable functions and advanced technical features like transaction management, chunk-based processing, a web-based admin interface and more.
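Chunk-based processing is easier to picture with a small example. The sketch below uses only the JDK and none of Spring Batch’s classes. It reads items one at a time, processes each, and hands them to the writer one chunk at a time. The names ChunkSketch and runInChunks are hypothetical, and the returned list of chunks stands in for a real writer that would commit once per chunk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Plain-JDK sketch of chunk-oriented processing; it does not use Spring Batch classes.
class ChunkSketch {

    // Processes items one by one and collects the results into chunks of chunkSize.
    static <I, O> List<List<O>> runInChunks(List<I> items, Function<I, O> processor, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<O>> writtenChunks = new ArrayList<>();
        List<O> chunk = new ArrayList<>();
        for (I item : items) {
            chunk.add(processor.apply(item));
            if (chunk.size() == chunkSize) {
                writtenChunks.add(chunk);   // stands in for "write the chunk, then commit"
                chunk = new ArrayList<>();
            }
        }
        if (!chunk.isEmpty()) {
            writtenChunks.add(chunk);       // final, partially filled chunk
        }
        return writtenChunks;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("ann", "bob", "cy", "dee", "eve");
        // prints [[ANN, BOB], [CY, DEE], [EVE]]
        System.out.println(runInChunks(names, s -> s.toUpperCase(Locale.ROOT), 2));
    }
}
```

Writing per chunk rather than per item means a failure only affects the chunk in progress, which is the usual reason to pair chunking with transaction management.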
8. Easy Batch
The Easy Batch framework uses Java to make batch processing easier. This open source ETL reads, filters and maps your source data in sequence. It processes your job in a pipeline, writes your output in batches to your data warehouse, and gives you a job report. With Easy Batch’s APIs, you can process different source data types consistently. With your Java code, the Easy Batch ETL tool transforms your source data into usable data for reporting, testing and analysis.
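The read, filter, map, write and report flow described above can be sketched in plain Java. This example does not use Easy Batch’s classes. RecordPipeline and JobReport are hypothetical names, and the report holds only three counters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Plain-JDK sketch of a read -> filter -> map -> write pipeline with a job report.
// It mirrors the flow described above but does not use Easy Batch classes.
class RecordPipeline {

    // Hypothetical report type: simple counters collected while the job runs.
    static class JobReport {
        int read;
        int filtered;
        int written;

        @Override
        public String toString() {
            return "read=" + read + ", filtered=" + filtered + ", written=" + written;
        }
    }

    static <T, R> JobReport run(List<T> source, Predicate<T> keep, Function<T, R> mapper, List<R> sink) {
        JobReport report = new JobReport();
        for (T record : source) {           // records are handled in source order
            report.read++;
            if (!keep.test(record)) {
                report.filtered++;          // rejected by the filter, never reaches the sink
                continue;
            }
            sink.add(mapper.apply(record)); // map, then write
            report.written++;
        }
        return report;
    }

    public static void main(String[] args) {
        List<String> source = Arrays.asList("3", "", "14", "15");
        List<Integer> sink = new ArrayList<>();
        JobReport report = run(source, s -> !s.isEmpty(), Integer::parseInt, sink);
        System.out.println(sink + " " + report); // prints [3, 14, 15] read=4, filtered=1, written=3
    }
}
```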
9. Apache Camel
Apache Camel is an open source Java framework that integrates different apps by using multiple protocols and technologies. It’s a small ETL library with only one API to learn. To configure routing and mediation rules, Apache Camel provides a Java object-based implementation of Enterprise Integration Patterns (EIPs) using an API or a declarative Java domain-specific language. EIPs are design patterns that enable enterprise application integration and message-oriented middleware.
Apache Camel uses Uniform Resource Identifiers (URIs) to refer to endpoints. A URI tells Camel which component is used, the context path and the options applied against the component. This ETL tool has more than 100 components, including FTP, JMX and HTTP. It runs as a standalone application, in a web container like Apache Tomcat, in a Java EE application server like WildFly, or combined with a Spring container.
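To see the three parts of an endpoint URI, you can pull one apart with the JDK’s java.net.URI class. This sketch doesn’t use Camel, which has its own URI handling. The URI itself and the class name EndpointUriParts are invented for illustration.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Uses java.net.URI (JDK only) to split an endpoint-style URI into its parts.
// The URI in main() is a made-up example; Camel does its own URI handling.
class EndpointUriParts {

    // The scheme names the component, for example "ftp".
    static String component(URI uri) {
        return uri.getScheme();
    }

    // Host plus path is treated as the context path in this sketch.
    static String contextPath(URI uri) {
        return uri.getHost() + uri.getPath();
    }

    // Splits "a=1&b=2" into an ordered map of option names to values.
    static Map<String, String> options(URI uri) {
        Map<String, String> opts = new LinkedHashMap<>();
        String query = uri.getQuery();
        if (query == null || query.isEmpty()) {
            return opts;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                opts.put(pair, "");
            } else {
                opts.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return opts;
    }

    public static void main(String[] args) {
        URI uri = URI.create("ftp://example.com/orders?binary=true&delay=5000");
        System.out.println(component(uri));   // prints ftp
        System.out.println(contextPath(uri)); // prints example.com/orders
        System.out.println(options(uri));     // prints {binary=true, delay=5000}
    }
}
```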
You can read more about Apache Camel on its GitHub repo.
10. Bender
Amazon’s AWS Lambda runs serverless code and does basic ETL, but you might need something more. Bender is a Java-based framework designed to build ETL modules in Lambda. For example, this open source ETL appends GeoIP info to your log data, so you can create data-driven geographic dashboards in Kibana. Out of the box, it reads, writes and transforms input from Amazon Kinesis Streams and Amazon S3.
Bender is a robust, strongly documented and supported ETL tool that enhances your data operations. It gives you multiple operations, handlers, deserializers and serializers, transporters and reporters that go beyond what’s available in Lambda.
11. Smooks
If you’re OK with using an ETL tool that’s no longer being developed, you might try Smooks. This open source ETL uses Java to build apps for processing data, including Java code. Although Smooks isn’t supported, it has useful functions beyond basic ETL. For example, it can populate Java and virtual object models from source data. Smooks also transforms and transmits multi-gigabyte messages to your data warehouse or output destination. From there, Smooks can enrich messages with data from your data sources.
12. Metl
Metl, from JumpMind, is a lightweight ETL tool that’s built on and runs on the JDK. This web-based, open source ETL was designed to make programmers’ data work easier. Although it’s a hands-on ETL tool, you don’t need to write custom code with Metl. But you can write your own components if you need to. It runs in the cloud or internally. Metl generates a WAR file that you can run either on a server like Tomcat or as a standalone app.
See the JumpMind Metl page for support, documentation and training resources.
13. PocketETL
PocketETL is an extendable Java library for batched ETL using Java code. This is another hands-on open source ETL that was designed for programmers. To make your data pipeline faster, it processes large batches in parallel instead of in series. With PocketETL, you can customize streams and split and reuse EtlStream objects as components in other streams. PocketETL can reduce the time it takes to call external APIs, and it merges multiple EtlStreams into a single loader. If the output is more than 128 MiB, the S3FastLoader splits it into part files.
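Splitting output into part files by size can be pictured with a small JDK-only sketch. This is not PocketETL’s implementation and doesn’t touch S3. PartSplitter is a hypothetical class that groups lines into parts so that each part stays under a byte limit. The demo uses a tiny limit to keep the output readable.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-JDK sketch of size-based part splitting; not PocketETL's actual implementation.
class PartSplitter {

    // Groups lines into parts so that no part exceeds maxBytes (UTF-8, plus one newline per line).
    // A single line larger than maxBytes still gets a part of its own.
    static List<List<String>> split(List<String> lines, long maxBytes) {
        List<List<String>> parts = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentBytes = 0;
        for (String line : lines) {
            long lineBytes = line.getBytes(StandardCharsets.UTF_8).length + 1;
            if (!current.isEmpty() && currentBytes + lineBytes > maxBytes) {
                parts.add(current);          // close the current part and start a new one
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(line);
            currentBytes += lineBytes;
        }
        if (!current.isEmpty()) {
            parts.add(current);
        }
        return parts;
    }

    public static void main(String[] args) {
        // A 10-byte limit keeps the demo small; the limit mentioned above is 128 MiB.
        System.out.println(split(Arrays.asList("aaaa", "bbbb", "cccc", "dd"), 10));
        // prints [[aaaa, bbbb], [cccc, dd]]
    }
}
```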
If you need an ETL tool that saves time, PocketETL might be a good choice. It comes with a host of publicly available adapters that include extractors, transformers and loaders. So, you can get your ETL work started right away with just a bit of coding. PocketETL’s user documentation isn’t extensive, but it has an issue support page.
Java ETL from a GitHub contributor
If you’re curious to see what some other open source ETLs can do, and you’re comfortable with figuring things out on your own, you might try this Java-based ETL—with only light documentation. Even more ETLs are in progress on GitHub, so check back later to see what’s new.
GETL’s documentation is limited, but if you’re OK with that, it might be a good ETL to try.
Best Paid Java ETLs
Panoply is a great all-in-one data platform that covers data warehouse automation and ETL for over 100 data sources. Examples are Amazon S3, Google Analytics and MySQL. It’s the only cloud service that combines an ETL with a data warehouse. Panoply builds and manages cloud data warehouses for you. It saves time and effort because you don’t have to wait for transformations to complete before you load your data. And you won’t need to set up a separate destination to store the data you pull with Panoply’s ETL.
The paid version of TIBCO JasperSoft works either as a standalone or embeds in your BI apps. With JasperSoft, you can create dynamic BI content for websites and apps as well as print-quality files. It partners with Talend Open Studio, and per the JasperSoft news page, it’s the world’s most popular open source BI tool. Its reporting engine, JasperReports, received the Duke’s Choice Award for best Java technology.
Although Cascading is a free tool, it only works as an ETL with Driven. Xplenty acquired Driven, and it’s now a subscription-based resource. So, we grouped this ETL tool in the paid section of the blog.
Cascading is an open source API created for Java developers and engineers. With Cascading, you can build complex apps and perform high-level data operations that require coding in Java. It uses pipelines and filters to stream and transform data from its source to the data warehouse. You can add custom functions by using Cascading’s Java APIs.
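The pipes-and-filters idea can be sketched with the JDK’s Function type. This example doesn’t use the Cascading API. PipeSketch and its three stages are invented for illustration, and each stage is a small filter that andThen() connects to the next one.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.stream.Collectors;

// Plain-JDK sketch of the pipes-and-filters idea; it does not use the Cascading API.
class PipeSketch {

    // Each stage takes a list of records and returns a new list; andThen() joins the stages.
    static Function<List<String>, List<String>> buildPipe() {
        Function<List<String>, List<String>> trim =
                rows -> rows.stream().map(String::trim).collect(Collectors.toList());
        Function<List<String>, List<String>> dropEmpty =
                rows -> rows.stream().filter(r -> !r.isEmpty()).collect(Collectors.toList());
        Function<List<String>, List<String>> upper =
                rows -> rows.stream().map(r -> r.toUpperCase(Locale.ROOT)).collect(Collectors.toList());
        return trim.andThen(dropEmpty).andThen(upper);
    }

    public static void main(String[] args) {
        System.out.println(buildPipe().apply(Arrays.asList(" east ", "", "west"))); // prints [EAST, WEST]
    }
}
```

Because every stage has the same input and output type, you can reorder, remove or add stages without touching the others.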
CloverDX is a paid ETL that was formerly available as open source CloverETL. It has its own data transformation platform and language (CTL), but CTL isn’t a replacement for standard-use coding languages. To run more complex data transformations from your existing Java code libraries, you can extend CloverDX with your own custom Java functions.
CloverDX has a code debugger, and this ETL tool lets you write hackable code and generate code transformations. With its open architecture, developers can collaborate and integrate transformations in a DevOps-style workspace.