PostgreSQL, aka Postgres, is recognized as the “the world's most advanced open source relational database." Postgres is used all over the world to power complex applications such as Skype, IMDB, and Etsy. But you may be less familiar with the relational databases use as a data warehouse. In fact Amazon Redshift’s design was largely based on Postgres.
Organizations often find themselves using Postgres as an ETL data source and data sink. Sometimes the source data is needed to be pulled out and stream to BI tools for data analytics and other business data work. Sometimes the data needs to be loaded from other sources and formats into their Postgres data warehouse. As in any data operation, an ETL (extract, transform, and load) is how data gets moved from one place to another. There are many ETLs that can work with PostgreSQL on-premises or on the cloud. Some ETLs pull data into PostgreSQL, some move data out, and some can do both those functions.
This blog lists the features of the best ETL tools designed to make your specific data work easier when PostgreSQL is your data source or your data warehouse.
Extract data from Postgres
pglogical is an open-source PostgreSQL extension from 2ndQuadrant that copies and inserts the Postgres datasets you choose into other systems. This data pipeline connects Postgres to Postgres-XL subscribers and other non-Postgres subscribers including Kafka and RabbitMQ. pglogical’s automation seamlessly replicates any changes to the structure of your source database. You can select root tables for seamless replication of their associated partitioned tables that are assigned to specific data sites. Or you can transparently define your own data partition settings. pglogical’s data prep features include row and column filtering during replication.
GitHub-recommended pglogical is a solid ETL choice if you need to connect Postgres databases to one of pglogical’s supported systems.
2. Panoply (cloud ETL + data warehouse)
Panoply makes it fast and easy for both developers and non-programmers to automatically pull data out of PostgreSQL. The Panoply all-in-one data pipeline is the only cloud ETL provider and data warehouse combination. Panoply doesn’t limit you to a few choices of BI tools to process your data. It connects to tons of popular tools.
Automated Postgres data extraction, prep, and load to your tool of choice happens with just a few clicks and a login. Under the hood, Panoply uses an ELT approach instead of traditional ETL. Data ingestion is faster and more dynamic because you don’t have to wait for transformation to finish before you load your data. And Panoply builds managed cloud data warehouses for every user. So you won’t need to set up a separate destination to store all the data you pull from Postgres with Panoply’s ELT process.
Panoply is an enterprise-level ETL that continuously uploads, processes, and streams your latest PostgreSQL data to your connected services. So your analytics and output always reflect the most relevant metrics.
The Stitch PostgreSQL integration is an ETL that copies data from a PostgreSQL database to another PostgreSQL data warehouse. You can configure Stitch to synch the data extraction at defined frequencies, or you can do it manually. You can also select the tables and fields of the data you want to replicate. Stitch stores the data it pulls from PostgreSQL databases in PostgreSQL warehouses that your team can access in just minutes. With the Stitch self-service ETL, developers can set up a data pipeline solution in about a day. After that, there’s no need to write or revise any code to maintain the ETL.
Stitch handles bulk and incremental data updates. Its replication engine uses multiple strategies to deliver data to users. Stitch connects to PostgreSQL architecture, and it also integrates with a massive suite of data analysis tools. Stitch collects, transforms, and loads Postgres data into its own system, where it automatically produces business insights on your raw data. PostgreSQL is just one of many data warehouses that work with Stitch.
The Fivetran data pipeline extracts PostgreSQL source data and connects it to cloud and on-premises warehouses that include Panoply, Amazon Redshift, MySQL, Mailchimp, and many more. If you want to try it out, but you don’t have a warehouse, Fivetran provides one during the trial period and then helps you set up something permanent if you become a subscriber. This user-friendly ETL tool uploads and preps all your current and historical data, but you can select data to exclude from replication. To minimize load on your systems, Fivetran reads logs of your changes instead of querying your live data. Then it streams the updates in near real time to your output and data warehouse.
Both developers and non-programmers can set up a Fivetran pipeline with just 10 clicks. There’s no coding or development work needed after that because Fivetran continuously tests and maintains your automated ETL. And its built-in schemas designed for analytics save your tech team even more time.
Blendo is a self-service ETL tool that quickly extracts PostgreSQL cloud data and transmits it to your data warehouse. Blendo’s PostgreSQL data connector automatically syncs your source data as soon as you sign in. After you set up the incoming end of the data pipeline, you can load it into these cloud storage destinations: Panoply, Redshift, Google BigQuery, MS SQL Server, PostgreSQL, and Snowflake. This data pipeline also connects to seven popular BI tools including Chartio, looker, and Power BI. So it’s easy to get business insights on the data Blendo uploads from Postgres.
The Blendo ETL is a fast and safe way to pull and analyze your Postgres data and securely store it in your cloud data warehouse. But it might not be the right choice if you need continuous updates to your output. Blendo’s data extraction isn’t automatic. You need to set a schedule for data load, and your data analytics tools, output, and warehouse get updated on that schedule, not in real time.
Load data into Postgres
1. Target Postgres (Python)
Target Postgres from datamill-co is an open source app that works with the Singer ETL scripting tool. With Singer’s data extraction scripts, called taps, you can create custom ETLs that upload data from non-standard sources, formats, and APIs. You write target scripts to take your extracted data from taps and load it to data warehouses. Target Postgres was designed to be a target for Singer ETLs that move data into PostgreSQL data warehouses. Target Postgres saves development time because your target script is prewritten. It creates and processes SQL tables from your extracted data and replicates the data in full stream to your Postgres warehouse.
To work with Target Postgres, and therefore Singer, you currently need to install Python 3.5.2 and pip, Python’s package manager. Singer recommends that you work with each tap and target in separate virtual environments. Learn how to work with Singer taps and targets in this Singer tutorial.
The lightweight bellboy ETL might be a good choice if you need to store data in PostgreSQL, and you want a very hands-on role in the ETL process.
The Stitch ETL tool automatically scales to meet the needs of enterprise-level data volumes. And Stitch offers its subscribers SLAs that guarantee it will work as and when expected.
As a data loader, the Fivetran ETL pipeline can extract data from apps, databases, events, and files and store it to PostgreSQL data warehouses including PostgreSQL on Amazon RDL on the cloud. Fivetran scales to fit enterprise-level data flow to your cloud warehouse. It optimizes loading by running most of the ETL process on its own servers. And Fivetran keeps your data warehouse clean by vacuuming and deleting overwritten and upserted data.
If you write your own custom functions to extract data, Fivetran can load your extracted data and then call and run your functions as frequently as every five minutes. You need to host the functions on a serverless cloud platform like Google Cloud Functions for this feature to work.
The Blendo cloud-based ETL tool quickly pulls many types of source data into PostgreSQL data warehouses by using its large suite of data connectors. Blendo’s ETL-as-a-service product makes it easy to pull data in from dozens of sources including S3 buckets, CSVs, and other third-party data cloud services like Google Analytics, Mailchimp, and Salesforce. To load data into PostgreSQL, all you need to do is connect it to Blendo as a destination and then connect to your data sources. That setup takes about three minutes. Then the Blendo data pipeline transmits your source data to your PostgreSQL data warehouse.
The Blendo ETL is a fast and safe way to load data from e-commerce and business analytics platforms into PostgreSQL. Some advantages of having Blendo store your data in Postgres are that you can combine and process all your data in one place, and you can easily use SQL to process your raw data.
Ready to start extracting your data from Postgres for data analysis and visualization? Try Panoply free now!