Setting up a Redshift cluster that hangs on some number of query executions is always a hassle. When users run queries in Amazon Redshift, the queries are routed to query queues. Usually the hangups could be mitigated in advance with a good Redshift query queues setup.
Amazon Redshift has implemented mechanism with which we can modify the queues to our advantage. In this article you will learn the challenges and some best practices on how to modify query queues and execution of queries to maintain an optimized query runtime.
Before we go into the challenges, let’s start with discussing about key components of Redshift:
Workload Manager (WLM)
Amazon Redshift workload manager is a tool for managing user defined query queues in a flexible manner. We use Redshifts Workload Management console to define new user defined queues and to define or modify their parameters. We can also use it to define the parameters of existing default queues.
3 Queue Types
By default, Amazon Redshift has three queues types: for super users, default queue and user defined queues.
The super user queue is reserved for running commands related to the system, trouble-shooting or for some emergency manual operations. This queue cannot be configured and can only process one query at a time.
Every Redshift cluster has a default queue. The default queue comes with the default setting of concurrency level of 5. For default queues you can change the concurrency, timeout and memory allocation. Any queries that are not routed to other queues run in the default queue.
User defined queues
Besides the default queue you can add other user defined queues. For user defined queues besides parameters listed in the default queues you can change user groups parameters and query groups parameters.
5 Queue Parameters
Smart use of queues parameters allows users to optimize time and execution cost of a query. We will look at parameters of queues:
Specifies the number of queries that run concurrently within a particular queue.
Executing a query by a member of a user group runs the query inside the queue assigned to that user group.
Query group is a simple label. Users can assign queries to a particular queue on the fly using this label.
You have the option of changing the percentage of memory assigned to each queue by setting WLM memory percent parameter. The rate for all the queues adds up to 100%.
With this parameter you specify the amount of time, in milliseconds, that the Redshift waits for a query to execute before canceling the query. Note that the timeout is based on query execution time which doesn’t include time spent waiting in a queue.
Now that we know what are the main points… let’s move to the challenges.
Unlike transactional systems which have queries of uniform size and execution cost, data warehouse queries vary greatly in execution cost, time and result-set. Optimal execution of these queries necessitates a balanced structure of execution queues configurations dedicated to different query size and/or priority. We want to make sure that the slow running queries are not blocking fast running queries that execute in manner of minutes or seconds.Arriving at an optimal queues setting for the Redshift cluster is a challenge and needs to take into account the needs of the specific implementation of user requirements.
The WLM configuration properties are either dynamic or static. If you change any of the dynamic properties, you don’t need to reboot your cluster for the changes to take effect unlike the change of the static properties.
The following WLM properties are static:
- User groups
- User group wild-card
- Query groups
- Query group wild-card
With defined queue parameter a dynamic execution parameters can be set for specific queries that impact their performance.
The following WLM properties are dynamic:
- Percent of memory to use
As mentioned above user can change dynamic property without restarting the Redshift cluster. It allows dynamic memory management when needed, we will look at some examples in the tips section.
An example of a setup of the WLM configuration that handles a solid DWH/BI configuration looks something like this :
We defined the fast_etl_execution query with the user group called etl. This user group handles ETL executions. Another group is for BI related queries. In this configuration ad-hoc queries are handled by the default queue. It is important to define ETL and BI user groups beforehand or you will have to restart your Redshift cluster as these parameters are static.
Assign all your ETL users to the etl user group :
create group etl with user etl_execution;
Now when the user etl_execution executes a etl job, if it takes more than 2 minutes (3000000 milliseconds), the timeout parameter of the first defined queue of the user (fast_etl_execution) will cancel the execution in that queue and route it to the long_etl_execution queue. The slow_etl_execution queue has more memory and lower concurrency level so each query has more power to finish the job.
We can check the memory allocation of our queues with the statement :
, num_query_tasks as slots
, query_working_mem as memory
The result shows the memory and the available slots for different “Service class #x” queues, where x denotes a queue mapped to the Redshift console “Query X” queue.
You can also see the internal query queues which are not accessible to users, service_class 1-4, and a super user query queue, service_class 5 .
Queues management tips
Let’s look at some general tips on working with query ques.
Having only default execution queue can cause bottlenecks. If a large time-consuming query blocks the only default queue small, fast queries have to wait. Make sure you create at least one user defined query besides the Redshift default query queue. I recommend creating a separate query queue for fast and slow queries, in our example fast_etl_execution. Mind the level of concurrent processes that run across all the query queues in Redshift. Another recommendation is having level of concurrency of at least 2 in particular queues.
Dynamic management for loads
We can modify the dynamic properties to tune the execution of particular queries that execute within queue via memory allocation. The main way you control this is with wlm_query_slot_count parameter. You should change dynamically wlm_query_slot_count when you perform resource intensive statements like:
VACUUM which reclaims space and resorts rows in either a specified table or all tables in the current database
ANALYZE which gathers table statistics for Redshifts optimizer
COPY which transfers data into Redshift
You should set the statement to use all the available resources of the query queue. In this case where the concurrency setting of the queue is 10 we set the slot count to 10 (meaning the following query will use all the available slots of the queue):
set wlm_query_slot_count to 10;
set wlm_query_slot_count to 1;
After the statement finishes, and VACUUM will take some time if you have a large database, you reset the session to use the normal slot count of one.
It is wise to increase the query slot count for COPY statements when ingesting data into your Redshift cluster. COPY works best with maximal parallelism enabled so that Redshift can route all the data into the nodes simultaneously.
Superuser queue for housekeeping
You can switch the query group of a query to the superuser queue for housekeeping activities like analyzing of even killing a query.
Changing the queue is done with the set query_group command. Command to gather statistics with the superuser queue is:
set query_group to 'superuser';
As usually there is no one universal setup to cover all the query setups of the Redshift cluster, it heavily depends on user requirements that you are implementing. Working with query queues without a predefined tool is more of an art-form than science. You will come to your optimal configuration after some trial and error.
We covered some rules that get you to a great Redshift cluster setup. More often than not you will setup a different a separate user defined queue besides the default one. You will set the level of concurrency of at least 2 for a query queue. Finally you will tune execution of your more demanding statements to use all the resources available in the query queue.