One of the main goals of this blog is to help developers and data architects, just like us, with their Amazon Redshift operations. Starting from a full comparison with Google BigQuery, explaining how to load Google analytics data into it and perform analysis using Tableau, all the way to providing you with tools such as PGProxy, which can help you load balance queries between multiple Redshift clusters, this article may extend your know-how on Redshift and help you do your job much better.
When building a new system, our urge is to do the magic, make it work, and gain the user appreciation for it as fast as we can. Security is not the first consideration. However, as the system grows, and especially as the amount of data we store grows, the realization that there’s an asset that needs protecting becomes abundantly clear.
This guide will help you continually encrypt your data loads to Redshift and make sure your information is safe.
Before getting into the step-by-step details, note that we are going to use one of Amazon’s main encryption tools, the AWS Key Management Service(KMS). This service allows for the creation and management of encryption keys. It is based on the envelope algorithm concept, so it requires two keys for encrypt/decrypt operations –– a data key to encrypt the data and a master key to encrypt the data key.
This Amazon Redshift utility helps to migrate data between Redshift clusters or databases. Using it we can also move the data from Redshift to S3. Then we will use KMS to encrypt the data and return it to Redshift. (Image source: GitHub)
This is planned for use as a scheduled activity for instances running a data pipeline shell activity.
The output will resemble the following:
Now, we need to copy this value and insert it into our configuration file.
$ python redshift-unload-copy.py s3://redshift-copy-cluster/config.json us-west-2
To access the full script, click here.
This solution builds an automatic pipeline that creates a KMS master key, uploads encrypted data to S3, and copies the encrypted data back to Redshift. Next, we’ll use Lambda to continuously encrypt newly incoming data.
At the initial stage, Lambda receives an S3 notification. Based on the file prefix, Lambda receives the bucket and the key, then builds the copy command that will run in the destination Redshift cluster. To make the Lambda function idempotent, it verifies the file has not already been copied before executing the ‘COPY’ command.
To implement this solution, we need to do the following things:
1- First of all, download the source file from this location.
2- After downloading the zip file, unzip it, and edit the parameters of the ‘copy.py’ script’ file according to these requirements:
3- Once the ‘copy.py’ file is edited, zip the whole folder again.
4- Now, in the AWS console, go to AWS Lambda. and create a function.
5- Click Create a Lambda Function to display the next page.
6- Select Blank Function to configure the triggers.
7- Click Next.
8- On this page, specify the Name, Description and Runtime environment. For our use case, the Runtime environment will be Python 2.7. From the Code entry type drop down list, select Upload a ZIP file and select the zip file we created earlier).
9- Provide the handler, role and existing role information, and click Next.
10- Click Create Function to create the function in Lambda.
The function runs the copy query to load the data into the Destination Redshift Cluster upon receiving any encrypted data in the S3 bucket.
This solution can be used to replicate real-time data from one Redshift cluster to another. In that case, we need to configure an event trigger in the S3 bucket. For details on configuring an event trigger with S3 bucket, we can follow this link.
Using own KMS customer-managed keys allow us to protect the Amazon Redshift data and give full control over who can use these keys to access the cluster data. It is worth mentioning that with AWS services KMS and S3, encryption of your data loads is provided without any additional charge, though KMS has a limited free tier offer up to 20,000 requests per month.
In the past, we had to meet our security professionals and ask them to integrate security into our data repositories, in large and complex environments we still do. However, we can also follow such guidelines which enable us, the data architects, to ensure the bare minimum when it comes to running our data solutions on the cloud. I hope this solution will help you do your work in a safe and secure data environment.