Optimizing your SQL code is an important part of maintaining the performance and efficiency of your data pipeline. There is essentially no hard limit on the logical complexity of a single SQL query, and most modern database solutions allow a large number of tables to be referenced in a single query. Google BigQuery, for example, allows you to reference up to 1,000 tables in a single query. This relative lack of constraints, however, can lead your SQL developers to write inefficient queries that: (1) perform well while the source tables are small, but degrade as those tables grow with your organization, or (2) turn into "spaghetti code" that is difficult to maintain and modify, as new developers struggle to understand exactly which tables are used for which purposes in the query.
In addition to maintainability, faster query performance benefits everyone in the organization, even in ways that are not immediately obvious.
OLTP databases
When it comes to OLTP (online transaction processing) databases, for example, data primarily comes in the form of huge tables, each documenting a specific type of transaction or event, such as customer payments. These OLTP databases generally lie earlier in the data pipeline, representing the source from which data is "E"xtracted as part of the ELT process. In this way, data in OLTP databases is more "raw" and closer to the front-end (customer-facing) components. When analytical SQL queries further down the data pipeline are written cleanly and efficiently, front-end errors that had a chance to propagate down the pipeline can be identified more quickly. The painstaking task of debugging backwards - "unbaking the cookie" to determine which ingredient was rotten - becomes easier as well.
OLAP databases (data warehouses)
OLAP (online analytical processing) databases, further down the pipeline, are built for data analysis and often serve as the source for business dashboards presented to an organization-wide audience. Such databases are often called data warehouses, to emphasize the fact that business stakeholders can pull a wide variety of data that has been optimized for analysis compared to its form in the OLTP database - it has been "T"ransformed. Data warehouses typically present the biggest opportunity to optimize your SQL queries, either (1) during the transformation step, (2) while writing analytical queries, or both, using knowledge of how the data was transformed to optimize your analytics.
Reducing resource usage
In addition to maintainability and ease of debugging, another obvious benefit of SQL optimization is reduced resource usage. In databases priced on a fixed-cost basis, such as Amazon Redshift, where you rent a cluster with a specified storage and CPU capacity for a fixed monthly cost, inefficient SQL queries can easily overwhelm the cluster and prevent other queries from running quickly or at all. They can even block BI connections from being made to the cluster as CPU utilization is pushed to 100%. In databases priced per unit of computation, such as Google BigQuery, the benefit is even more direct: more efficient queries mean lower costs for your organization. The effect is amplified when BigQuery can serve cached results for queries that duplicate ones run previously.
High-level optimization tasks to consider:
- Before implementing any of these tactics, define your stakeholders, business requirements, and expected outcomes. What is your optimal tradeoff between person-hours spent optimizing code, on one hand, and the agility associated with rapid deployment of makeshift SQL on the other hand?
- Define what optimization means for your team and company - what are the pain points you’re trying to solve?
- Define the techniques best suited to improving SQL query performance, based on your system resources and on performance metrics such as the time to refresh a dashboard or the average number of rows scanned by a query.
A table-level optimization task to consider:
- Indexing and partitioning: implemented in different ways on different databases, these techniques let you customize how data is physically laid out on disk (or distributed across servers) so that it can be retrieved efficiently later. As an example, database entries corresponding to books in a library can be indexed or partitioned by genre. Behind the scenes, the database could physically store books of the same genre in neighboring locations on disk, or it could maintain a separate structure that keeps track of entries of the same genre for quicker access in the future, speeding up your analytical queries. A short sketch follows.
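As a minimal sketch of the library example, assuming a hypothetical books table with a genre column (all names and the dataset are illustrative only): a traditional relational database would typically use an index, while a warehouse such as BigQuery expresses a similar idea with clustering.

-- Index: a separate structure that tracks rows by genre (e.g., PostgreSQL, MySQL, SQL Server)
CREATE INDEX idx_books_genre ON books (genre);

-- Clustering in BigQuery: stores rows of the same genre near each other on disk
CREATE TABLE library.books_clustered
CLUSTER BY genre
AS SELECT * FROM library.books;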
Lower-level optimization tasks to consider, depending on your database solution (short sketches of several of these follow the list):
- Specify fields to retrieve by name in the SELECT clause rather than using SELECT *.
- Use TOP or LIMIT to sample query results rather than retrieving the entire result set.
- Use wildcards at the end of a search phrase only. Wildcards serve as a placeholder for one or more characters and, while they can be added at both the beginning and end of a search phrase, qualifying or disqualifying a candidate entry based on its leading characters is much cheaper: a trailing wildcard still allows an index on the column to be used, whereas a leading wildcard forces the database to scan every value.
- Pay attention to the number and type of your JOIN clauses. Joining tables is by no means inherently bad, and often, joining dozens of tables is the only way to arrive at a complete view of the data needed for a useful analytical query. However, if you run into scenarios like (1) a single table being referenced in multiple JOIN clauses, (2) many OUTER JOINs producing rows full of NULLs that have to be filtered out of the final result set, or (3) join conditions that are logically complex beyond simple text or numeric equality, then it may be worth reorganizing your data according to principles of database normalization such as Boyce-Codd normal form.
- Run the query during off-peak hours and at times when source tables are not being updated.
- When writing SQL transactions, avoid running queries in a loop. Instead, use bulk insert and update operations depending on the situation.
- Avoid correlated subqueries where you can. Such subqueries reference columns from the outer query, usually in the form of a subquery that generates the list of values used in the outer query's WHERE condition; for example,
SELECT table1.x, table1.y, table1.z
FROM table1
WHERE table1.x IN
    (SELECT table2.a
     FROM table2
     WHERE table1.x < table2.a + table2.b);
Such logic is suited to cases where the subquery genuinely must be evaluated for each row of the outer table, and that per-row evaluation is exactly what makes it a computationally expensive operation.
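In this particular example, though, the correlation can be removed: the IN clause already forces table1.x to equal table2.a, so the inner condition table1.x < table2.a + table2.b reduces to table2.b > 0 and the subquery only needs to be evaluated once.

SELECT table1.x, table1.y, table1.z
FROM table1
WHERE table1.x IN
    -- No reference to table1 inside the subquery, so it runs once rather than per outer row
    (SELECT table2.a
     FROM table2
     WHERE table2.b > 0);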
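A sketch of the first two points in the list above, assuming a hypothetical orders table: naming the columns you need and limiting the result set keeps the database from reading and returning more data than necessary.

-- Retrieves only three named columns and at most 100 rows
SELECT order_id, customer_id, order_total
FROM orders
LIMIT 100;

-- By contrast, SELECT * FROM orders; reads every column of every row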
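For the wildcard point, a trailing wildcard lets the database narrow candidates using the leading characters (and any index on the column), while a leading wildcard forces it to check every value (hypothetical customers table):

-- Can use an index on last_name: only values starting with 'Smi' are considered
SELECT customer_id, last_name
FROM customers
WHERE last_name LIKE 'Smi%';

-- Cannot use the index: every last_name must be scanned
-- WHERE last_name LIKE '%son'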
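For the point about loops, a single set-based statement is generally far cheaper than issuing one INSERT per iteration of an application-side loop; a sketch assuming hypothetical orders and daily_totals tables:

-- One statement inserts a summary row per date, instead of one INSERT per loop iteration
INSERT INTO daily_totals (order_date, total_amount)
SELECT order_date, SUM(order_total)
FROM orders
GROUP BY order_date;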
While Panoply offers a no-code solution so you can get set up in minutes, you can still dive deeper into the platform with your own SQL queries!
Looking to find out more? Contact us for a free demo.