Optimizing your SQL code is an important part of maintaining the performance and efficiency of your data pipeline. There are essentially no hard limits on the logical complexity of a single SQL query, and most modern database solutions allow a large number of tables to be referenced in one statement; Google BigQuery, for example, allows up to 1,000 tables to be referenced in a single query. This relative lack of constraints, however, can lead your SQL developers to write inefficient queries that: (1) perform well while the source tables are small, but degrade as those tables grow with your organization, or (2) turn into "spaghetti code" that is difficult to maintain and modify, because new developers struggle to understand exactly which tables are used for what purposes in the query.
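One common way to keep a many-table query maintainable is to break it into named common table expressions (CTEs), so each intermediate result has a clear, documented purpose. The sketch below is illustrative only; the orders, payments, and customers tables and their columns are hypothetical:

WITH paid_orders AS (
  -- one row per order that has at least one successful payment
  SELECT o.order_id, o.customer_id, o.order_total
  FROM orders o
  JOIN payments p ON p.order_id = o.order_id
  WHERE p.status = 'succeeded'
),
customer_totals AS (
  -- aggregate once, in a single clearly named step
  SELECT customer_id, SUM(order_total) AS lifetime_value
  FROM paid_orders
  GROUP BY customer_id
)
SELECT c.customer_id, c.customer_name, t.lifetime_value
FROM customers c
JOIN customer_totals t ON t.customer_id = c.customer_id;

Each CTE can be read, tested, and modified on its own, which is far harder to do with deeply nested subqueries.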
In addition to maintainability, faster query performance benefits everyone in the organization, even in ways that are not immediately obvious.
When it comes to OLTP (online transaction processing) databases, for example, data primarily comes in the form of huge tables, each documenting a specific type of transaction or event, such as customer payments. These OLTP databases generally lie earlier in the data pipeline, representing the source from which data is "E"xtracted as part of the ELT process. In this way, the data in OLTP databases is more "raw" and closer to the front-end (customer-facing) components that produce it. When analytical SQL queries further down the data pipeline are written cleanly and efficiently, front-end errors that have propagated down the pipeline can be identified more quickly. Additionally, the painstaking task of debugging backwards - "unbaking the cookie" to determine which ingredient was rotten - becomes easier as well.
OLAP (online analytical processing) databases, further down the pipeline, are built for data analysis and often serve as the source for business dashboards presented to an organization-wide audience. Such databases are often called data warehouses, to emphasize that business stakeholders can pull a wide variety of data that has been optimized for analysis compared to its form in the OLTP database - it has been "T"ransformed. These data warehouses typically present the biggest opportunity to optimize your SQL queries, either: (1) during the transformation step, (2) while writing the analytical queries themselves, or both, using knowledge of how the data was transformed to optimize your analytics.
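As one illustration of transform-time optimization, a wide, row-level table can be pre-aggregated into a summary table during the "T" step so that downstream dashboard queries scan far less data. This is only a sketch; the raw_payments source table, its columns, and the daily_payment_totals target are hypothetical:

-- build a small summary table once, during transformation
CREATE TABLE daily_payment_totals AS
SELECT
  payment_date,
  customer_id,
  COUNT(*)    AS payment_count,
  SUM(amount) AS total_amount
FROM raw_payments
GROUP BY payment_date, customer_id;

-- downstream analytical queries then read the small summary table
SELECT payment_date, SUM(total_amount) AS revenue
FROM daily_payment_totals
GROUP BY payment_date;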
In addition to maintainability and ease of debugging, another obvious benefit of SQL optimization is reduced resource usage. In databases priced on a fixed-cost basis, such as Amazon Redshift, where you rent a cluster with a specified storage and compute capacity for a fixed monthly cost, inefficient SQL queries can easily overwhelm the cluster and prevent other queries from running quickly, or at all. They can even block BI tools from successfully connecting to the cluster as CPU utilization is pushed to 100%. In databases priced per unit of computation, such as Google BigQuery, the benefit is even more direct: more efficient queries mean lower costs for your organization. The effect is amplified when BigQuery can serve cached results for a query that duplicates one previously run.
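In BigQuery, for example, on-demand cost is driven largely by the number of bytes a query scans, so selecting only the columns you need and filtering on a partitioned column both reduce the bill directly. The sketch below assumes a hypothetical analytics.events table partitioned by event_date:

-- scans every column in every partition: maximum bytes billed
SELECT *
FROM analytics.events;

-- scans only two columns over a narrow partition range: far fewer bytes billed
SELECT user_id, event_type
FROM analytics.events
WHERE event_date BETWEEN DATE '2024-01-01' AND DATE '2024-01-31';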
One inefficient pattern worth watching for is the correlated subquery, in which the inner query references a column of the outer table:

SELECT table1.x, table1.y, table1.z
FROM table1
WHERE table1.x IN
  (SELECT table2.a
   FROM table2
   WHERE table1.x < table2.a + table2.b);
Such logic is only suited to cases where the subquery genuinely must be evaluated for each row of the outer table, and it is that per-row evaluation that makes it a computationally expensive operation.
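Often the same filter can be rewritten without the correlation, so the inner query runs only once. In this particular example, the IN clause already forces table2.a to equal table1.x, so (assuming non-null numeric columns) the predicate simplifies and the subquery no longer needs to reference the outer table; this is a sketch of the idea rather than a universal rule:

-- Uncorrelated rewrite: because the IN clause forces table2.a = table1.x,
-- the predicate table1.x < table2.a + table2.b reduces to table2.b > 0,
-- so the subquery can be evaluated once rather than once per outer row.
SELECT table1.x, table1.y, table1.z
FROM table1
WHERE table1.x IN
  (SELECT table2.a
   FROM table2
   WHERE table2.b > 0);

Most engines can then plan this as a single semi-join against table2 instead of re-running the subquery for every row of table1.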
While Panoply offers a no-code solution so you can get set up in minutes, you can still dive deeper into the platform with your own SQL queries!
Looking to find out more? Contact us for a free demo.