Panoply.io Labs

Panoply PGproxy: Smart Routing of Your Data Warehouse Queries

Written by Roi Avinoam|January 02, 2017

A smart, proactive PostgreSQL connection pooler, we are proud to announce our PGproxy offers the advantages of query routing and rewriting, making it optimal for data engineers who need to query their databases, and removing the concern of inconsistencies that can cause a query break. Instead of connecting to the database server, users can connect and send requests to PGproxy.

Similar to the open source tools - PGbouncer or PGpool - PGproxy offers a way to split your database load across multiple servers and clusters, while avoiding the headaches that come with maintaining infrastructure solely dedicated to query routing and load balancing. PGproxy eliminates the need to constantly alter the configuration and restart the routing service every time you want to make a change, as you do when using a self-hosted tool such as PGbouncer.

PGproxy’s three main advantages are its ability to rewrite queries, replicate data across multiple clusters, and materialize views into tables. Below, we examine those benefits in greater detail.

Data Warehouse too complex to manage? Learn how to leverage AWS

Query Routing and Rewrites

PGproxy is useful in large and complicated Amazon Redshift data warehouse deployments, where data loads need to be split among multiple clusters - creating challenges when it comes to, for example, adjusting and rewriting queries accordingly. For instance, assume you have 200 views, and then decide the next day you want to materialize 50 of them. To complete that action, you’re required to create a new table with a new name for each. Subsequently, you have to change the existing queries to reflect the name change. Every change to any view necessitates updating all other views and their queries. Attempting to resolve this issue manually, or by running an ad-hoc script, will lead to inconsistencies and likely break queries.

PGproxy is useful in large and complicated Amazon Redshift data warehouse deployments, where data loads need to be split among multiple clusters

Instead, PGproxy implements the change on a lower level. All users continue to access the view, but behind the scenes PGproxy rewrites the query and renames it to maintain backward compatibility. So queries continue running as a view without having to deal with the question of whether or not it has been materialized to a table or if the name change has been made. PGproxy handles this automatically, without the need for a specialist dedicated to this task.

Replicating Data Across Multiple Clusters

Data replication is a known challenge in heavy data cloud environments, and there is no one-size-fits-all solution. With PGproxy, data architects can create small Redshift clusters to support multiple teams in the organization, in order to avoid database bottlenecks. This is a known issue that can be solved manually; however, PGproxy perfectly complements Amazon Redshift and facilitates the creation and synchronization of these multiple clusters. Once created, every update database operation requires replication over these multiple clusters.

For instance, an enterprise may have one team for which its cluster runs on a Solid State Disk (SSD), on which they save a portion of data for rapid queries; while another cluster runs on a Hard Disk Drive (HDD), with data used by analysts for examination and reporting, for example. In a typical infrastructure-heavy environment, query routing between these two types of disks create real cost savings-without compromising on their ability to have real time analytics.

Above and Beyond, with Materialization of Views

Data has to be re-generated every time a query is run against a view. As a result, if there are multiple queries hitting the views, overall performance will suffer. To overcome this, using PGproxy, Panoply.io analyzes usage and automatically materializes specific views when new data is entered, saving the results into a new table with an optimized structure.

Panoply.io analyzes query performance and frequencies, so that only actively-used transformations are materialized. This is instead of pre-processing everything, as in many cases not all transformations are accessed and used as frequently. In addition, the generated tables are pre-optimized in terms of indexing, partitioning, and compression to provide your queries with the best possible performance.

The end result is the same: you’re still able to build any transformation logic imaginable, and you’re still getting pre-processed tables that are optimized to maintain your queries’ performance. But you can achieve this without spending a lot of time up front for data processing, and without being locked into a manual, rigid process that takes days to iterate.

Data Warehouse too complex to manage? Learn how to leverage AWS

Final Note

The cloud-dynamic environment and the pace of innovation of players such as AWS requires moving from manual operations to intelligent automated processes that you get from a turnkey solution provider, such as Panoply.io. PGproxy offers your enterprise not only a hassle-free way to bypass a common limitation in cloud-based data warehouse environments, but also reduces overhead costs, while improving query performance. Panoply.io utilizes machine learning and natural language processing (NLP) to learn system behavior and automate standard data management activities – saving thousands of code lines, and countless hours of database operations manual debugging and research. Contact us to learn more.

From raw data to analysis in under 10 minutes.

Sign up now for a demo or a free trail of the Panoply.io platform.

Learn more about platform features