ETL is the process of extracting, transforming, and loading data from various sources into a central data repository, and it's an essential step in turning data into insights. When data is extracted from a relational database, mapping the fields to their destination is a relatively easy task. With non-tabular data, transforming the data and loading it into a table optimized for analytical workloads is less straightforward.
For example, take a NoSQL database like MongoDB. It supports the storage of non-tabular data in the form of semi-structured documents. MongoDB is a convenient and popular way to store data whose structure changes rapidly. NoSQL’s adoption was mainly driven by software engineers who embedded these databases’ developer-friendly APIs within their applications.
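To see why the mapping step is hard, consider what happens when a tool flattens documents whose fields vary. Here's a minimal sketch using pandas (the collection and field names are made up for illustration):

```python
# Flattening semi-structured MongoDB documents with pandas. Field names,
# nesting, and even the presence of fields can vary between documents,
# which is exactly what makes the mapping step hard.
import pandas as pd

# Two documents from a hypothetical "customers" collection; the second one
# carries fields the first one lacks.
docs = [
    {"_id": 1, "name": "Ada", "address": {"city": "London"}},
    {"_id": 2, "name": "Grace", "tags": ["vip"],
     "address": {"city": "Arlington", "zip": "22201"}},
]

# json_normalize expands nested objects into flat columns; documents that
# lack a field simply get empty cells (NaN) in the resulting table.
df = pd.json_normalize(docs, sep="_")
print(sorted(df.columns))
# ['_id', 'address_city', 'address_zip', 'name', 'tags']
```

Every nested object becomes extra columns, and documents that lack a field produce empty cells, decisions a relational extraction never has to make.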
As a first mover, MongoDB commercialized the NoSQL database and successfully went public in 2017 with a mission to disrupt the database market. Despite the strengths of its document model, MongoDB recognizes that modern business intelligence practices center around tabular data. For this reason, it has developed best practices for ETL tool vendors that would like to interact with MongoDB databases and, in turn, receive certification.
Among the conceptual expectations MongoDB sets are the ability to infer schema information, support projections, and map semi-structured documents to tabular destinations.
Because there is a good chance that some of these capabilities already exist within your tool stack, we’ve created a list of 17 tools that don’t necessarily support all of them out of the box. This should allow you to select a tool without capability overlap in your existing data stack.
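To make “projections” concrete: a projection lets the extraction tool ask the MongoDB server for only the fields it needs instead of whole documents. A minimal pymongo sketch (the connection string, database, collection, and field names are illustrative):

```python
# Extracting only the fields needed for the destination table using a
# MongoDB projection with pymongo. Connection string, database, collection,
# and field names are illustrative.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

# The second argument to find() is the projection: return only these
# fields and drop the rest server-side; _id is excluded explicitly.
cursor = orders.find({}, {"customer_id": 1, "total": 1, "created_at": 1, "_id": 0})
for doc in cursor:
    print(doc)  # e.g. {'customer_id': 42, 'total': 99.5, 'created_at': ...}
```

Pushing the field selection to the server reduces network traffic and keeps irrelevant nested fields out of the pipeline altogether.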
But first, let’s go over some key considerations for choosing a data extraction tool with a MongoDB connector.
Tools that extract, transform, and load data from a MongoDB database come in multiple flavors.
Weighing open source is all about understanding how much of the burden of updates and maintenance you’re willing to carry. The business model of data integration vendors is centered around guaranteeing compatibility with a broad set of data sources (and destinations).
In other words: zero downtime for your data pipeline and compatibility with MongoDB are the priorities of a proprietary vendor. With a self-hosted open source solution, some downtime is all but inevitable, not to mention that a lot of open source ETL tools lack a graphical user interface (more on that below).
Mapping JSON documents to a tabular format is a tedious task. There are two ways this process can be structured: transform the data before loading it into the destination (ETL) or after (ELT).
For over a decade, ETL was the default process. However, cloud computing and cheap storage are driving organizations toward ELT processes.
Depending on the complexity and depth of the source data, extracting data from NoSQL databases within an ELT context might leave you with a vast number of tables in the sink. Without proper management, your sink can quickly deteriorate into a data swamp.
Within an ETL context, you can rest assured that this won’t happen since you’re only storing the data in its final form. The other side of the coin is that it can be hard to get a sense of the breadth of the semi-structured data you’re about to extract.
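To make that fan-out concrete, here's a minimal sketch of how a single collection with an embedded array typically becomes a parent table plus a child table (collection and field names are made up for illustration):

```python
# One collection, two destination tables: embedded arrays are typically
# split out into child tables keyed back to the parent document.
import pandas as pd

orders = [
    {"_id": 1, "customer": "Ada",
     "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]},
    {"_id": 2, "customer": "Grace",
     "items": [{"sku": "A1", "qty": 5}]},
]

# Parent table: one row per order, with the embedded array dropped.
orders_table = pd.json_normalize(orders).drop(columns=["items"])

# Child table: one row per line item, with the parent key carried along
# as a foreign key via the `meta` argument.
items_table = pd.json_normalize(orders, record_path="items", meta=["_id"])

print(orders_table)  # columns: _id, customer
print(items_table)   # columns: sku, qty, _id
```

Multiply this by every embedded array and every level of nesting, and it's easy to see how an unmanaged sink turns into a swamp.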
When deciding on the proper way to extract data from MongoDB, you’ll have to take some considerations into account concerning your existing tool stack. What is the destination of the data? Does your organization already have a (cloud) data warehouse? Do you have the proper infrastructure in place to support an ETL or ELT workflow?
The answers to these questions should determine if you need a tool with the sole capability to extract data or something that can also store it and serve various use cases.
Who will be using the data? If extracting semi-structured data serves the use case of a single user, a locally installed desktop tool could be the thing your organization needs.
If the responsibility of the data pipeline resides solely with the IT department or a dedicated engineering team, extracting the data from a command-line interface or via an API might suffice.
However, if business users own the data pipeline, don’t assume they’re comfortable with command lines. Very likely, a graphical user interface is their deal-breaking functional requirement.
Panoply is an end-to-end cloud data platform. It’s a SaaS tool that simplifies data pipelines to make the process accessible to business users with limited technical knowledge. Data is sunk into Panoply’s managed data warehouse. Analytics engineers can manipulate the extracted data using SQL-based stored procedures and load it into the final tables to serve business users and data scientists.
In that sense, Panoply is the only solution in this list supporting MongoDB-to-DWH data pipelines within an ELT workflow from a single tool.
Transporter provides a selection of “adaptors” to databases (both structured and unstructured). On top of that, it supports transformations written in JavaScript. Although this open source tool has been around for years, there haven’t been new commits to this project in three years. Don’t expect much support.
Extraction of data is truly at the core of Airbyte, a fairly new solution that comes in multiple flavors. There are no transformation features, but using dbt, k8s, or Airflow is actively promoted. It comes packed in a Docker container that can run locally or in a self-hosted cloud environment. As a commercial open source solution, there’s also a fully managed standard and enterprise solution.
Stitch is a tool for extracting data from a wide range of data sources and sending it to your data lake, warehouse, or lakehouse. Stitch is closely related to Singer, the open source framework it’s built on, whose community-built “taps” extend its source coverage. But for this use case there is no need to write your own Singer tap since MongoDB is already fully supported. With the recent acquisition of Stitch by Talend, it’s safe to assume that combining the two products will result in a full-blown ELT or ETL solution.
With over 150 connectors that require no configuration, Fivetran has become a popular solution for business users in both startups and large enterprises. It has a large community and plays well with other popular components in the ELT space; many consider Fivetran, Snowflake, and dbt the analytics engineer’s ultimate trio.
Although Infoworks has a strong enterprise focus, it still acts as pure data integration middleware for migrating data warehouses and Hadoop clusters and extracting data from databases such as MongoDB. Infoworks comes packed with various data governance and lineage features and has a lot of synergies with Databricks.
Another tool with a strong enterprise focus is Safe’s Feature Manipulation Engine (FME). Its MongoDB reader/writer is supported on its local, server, and cloud distributions. Without a doubt, FME is the odd-one-out in this list. Safe started as a software company that helped forestry companies share spatial data with provincial governments. Handling spatial data is still its USP.
Just like Fivetran, Hevo automatically detects the schema from the data source and destination. The most significant difference is that Hevo supports multiple ways of transforming data within an ETL context, while Fivetran does not. Business users get drag-and-drop transformation blocks, while technical users can write Python-based transformation scripts. Good to know: Hevo also supports MongoDB’s Change Streams functionality.
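Change Streams let a pipeline subscribe to inserts, updates, and deletes as they happen instead of repeatedly re-reading whole collections. A minimal pymongo sketch of the underlying mechanism (connection details and names are illustrative, and Hevo's actual implementation may differ):

```python
# Tailing a MongoDB Change Stream with pymongo: watch() yields one event
# per insert/update/delete, so a pipeline can sync increments instead of
# re-scanning whole collections. Requires a replica set.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

# An aggregation pipeline filters events server-side; here: inserts only.
pipeline = [{"$match": {"operationType": "insert"}}]
with orders.watch(pipeline) as stream:
    for change in stream:
        # For inserts, 'fullDocument' contains the newly added document.
        print(change["fullDocument"])
```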
If you need more of a sandbox, Keboola might be your thing. At the core is a data integration tool (with excellent support for MongoDB). Still, if you look at all its features, Keboola is a complete suite of tools that serve various use cases, from data science to data governance.
One might wonder why KNIME is included in this list. It’s mostly known for providing enterprise data science and advanced analytics solutions, but it supports reading from and writing to many (relational and semi-structured) databases, including NoSQL. If you’re looking for an enterprise data science tool that also provides data-wrangling capabilities, KNIME might be right up your alley.
Matillion offers a data integration tool in two editions, Data Loader and ETL, with a pay-per-use pricing plan. Best of all, if you don’t need to do any transformations, or if you do them after loading the data, Data Loader is entirely free. Beyond that, Matillion provides a thorough mapping interface to flatten semi-structured data and map it to its tabular destination table.
The people at GitLab embraced the (previously mentioned) Singer standard and built a CLI-first ELT tool on top of it. The result: Meltano. It comes with a MongoDB extractor that follows the Singer standard and is maintained by Stitch.
Precog is what happens when AI meets data engineering: all data structures are treated identically. Its Multidimensional Relational Algebra (MRA) engine gives Precog users the same experience whether they’re working with semi-structured or structured data. Precog can even replace MongoDB’s native BI Connector.
Rivery sits somewhere between Panoply and Fivetran. Like Fivetran, it supports transferring data to and from various data warehouses and SaaS tools. Unlike Panoply, Rivery does not offer a native data warehouse. But it comes with the same SQL-driven data transformation logic that can be used to transform and map the loaded NoSQL data into the final tabular format. Finally, Rivery has excellent data management and lineage capabilities.
The '00s called: they want their data preparation tool back. Toad Data Point might be an ugly tool that looks stuck in the Windows XP era, but it comes packed with data exploration and mapping capabilities that will turn your MongoDB extraction jobs into a walk in the park. Toad Data Point started as a straightforward database management tool with support for relational and NoSQL databases and is now part of Quest’s impressive product ecosystem.
Tray.io has a strong focus on business users eager to manage their own data pipelines. That’s why it has one of the most intuitive interfaces to extract and transform data without writing a single line of code. But that’s also a key weakness: Tray might not offer the most robust capabilities for exploring and mapping data from complex MongoDB databases to a tabular format.
Trifacta’s Data Engineering Cloud is a one-stop-shop solution to extract data from various sources and transform it with a robust enterprise-level GUI. It integrates natively with all three major cloud providers (AWS, Azure, GCP) and supports Snowflake and Databricks. Trifacta comes with the necessary tools to flatten, filter, and reference nested objects extracted from a MongoDB database.
ETL can be complex and frustrating, especially when working with non-tabular data. For rapidly changing data, using a NoSQL database management system like MongoDB is a no-brainer, and pairing it with one of the plethora of data extraction tools available is the next step.
You now know about 17 of them that might fit in your existing data tool stack, but if you couldn’t find the tool you were looking for, it doesn’t end here. There are still plenty of other tools out there that might fit your specific use case if you dig deep into the MongoDB community.
However, if you’re ready for a SaaS that makes your data pipeline accessible to anyone in your business who needs it, check out Panoply.