Data Management

How to Move Your MySQL to Redshift

Written by Roi Avinoam|November 21, 2016
 

Data analytics is a requirement for virtually every business today. However building an analytical data warehouse requires vast computing power to maintain performance and quick retrieval of data and results. In addition, the organization’s IT teams need to acquire new skills and infrastructure know-how to establish and maintain their BI systems.

To fill this infrastructure gap, Amazon Redshift brings a fully managed cloud solution for modern data warehouses. It delivers a massive parallel processing architecture and uses a columnar data store that can aggregate and analyze large amounts of data. It enables faster performance - no matter how much data. That’s why Amazon Redshift has become the solution of choice for companies like: Nokia that moved its high data usage activity to Redshift and now runs queries two times faster. And, VivaKi that loaded 10 terabytes of data in only six hours with Redshift.

So, for real-time data analytics on high volumes of data, Redshift has distinct benefits which obviously MySQL, PostgreSQL, or other common databases cannot handle at scale. However, though we see great success stories like Vivaki, moving data from your database to Redshift isn’t straight-forward.

In this article, we’ll provide a step-by-step sample procedure for how to migrate your MySQL to Redshift.

Amazon Redshift to the Rescue

Amazon Redshift has provided a very good solution for today’s issues and beyond. As you’ve probably experienced, MySQL only takes you so far. It’s not designed to cope with your data scaling, data consistency, query performance, or analytics on large amounts of data. Redshift is built to handle large scale data analytics. Plus, it’s cost effective, especially compared to other analytics-specific databases like Hive and Impala.

Data Warehouse too complex to manage? Learn how to leverage AWS

Before jumping into action, let’s understand these Redshift differences from MySQL:

Handling Database Workloads with OLAP vs. OLTP

One of the biggest differences between MySQL and Redshift is how the way each handles database workloads. MySQL uses OLTP system where the purpose is to control and run fundamental business tasks. Its purpose and design is for fast transactional random access, not analytics.

By using OLAP, Redshift gears its system for analytics and helps with planning and problem solving. With the OLAP columnar design, Redshift improves on performance. The data is organized in a way which allows for better compression and easier sequential reads. Therefore, large amounts of data scanning is done quickly.

In addition, as mentioned OLTP only gives output of ongoing business processes, whereas OLAP gives multi-dimensional views of various kinds of business activities. OLAP queries are often very complex and involve aggregations, versus OLTP queries that are often simple and return relatively few records.

Using Two Types of Compute Nodes

There are two types of compute nodes available for RedShift. The first is DW2 which is very a dense compute running on SSD. DW2 are very fast solid state memory drives, that support the database’s I/O performance needs. They cost about $1.50/Tb per hour. The second node type is DW1 or so-called dense storage nodes, which in comparison to DW2 run on traditional storage disks. They cost around $0.50/Tb per hour.

Step-by-Step

Now that you’re familiar with Amazon Redshift, we’ll show you how to load data to it.

Some preliminaries: There are several ways to perform the load. The COPY command is the most efficient way to load a table, as it can load data in parallel from multiple files and take advantage of the load distribution between nodes in the Redshift cluster. It supports loading data in CSV (or TSV), JSON, character-delimited, and fixed width formats.

After initial data load, you can add, modify, or delete a significant amount of data, follow up by running a VACUUM command to reorganize data and reclaim space after deletes.

Using individual INSERT statements to populate a table might be prohibitively slow. Alternatively, if data already exists in other Amazon Redshift database tables, use SELECT INTO … INSERT or CREATE TABLE AS to improve performance.

We’ll discuss the best ways to use each of these commands by example.

So, let’s start - here are the 5 steps for loading data into Redshift:

  1. Create a Redshift cluster
  2. Export a MySQL database and split it into multiple files
  3. Upload the files to Amazon S3
  4. Run a COPY command to load the table to Redshift
  5. Verify that the data was loaded correctly

1. Create a Redshift cluster

Perform the following steps to create a cluster:

  • Enter your AWS account Console > Services > Redshift > Launch Cluster to start the RedShift cluster creation wizard.
  • Enter the required cluster details as shown below:

  • Configure the Node Type, Cluster Type and the number of Compute Nodes on the Node Configuration page:

Choose the default value for the rest of the options and you are good to go.

  • The next thing to configure is the Security Groups to allow hosts access:
  • Go to Redshift > Security > Security Groups > Connection Type > CIDR/IP and add the IP address of host that you want to access.

2. Export MySQL Data to Amazon S3

Let’s assume that we have several tables to export to Redshift.

In following example, we run a statement to select all sales transaction for the past 5 years from tbl_sales and split them over to 1 million rows per file:

mysql-to-amazon-redshift_table_1.png
Then, call the procedure:
mysql-to-amazon-redshift_table_2.png

Move the generated CSV files into a directory called s3-redshift:

mysql-to-amazon-redshift_table_3.png

3. Upload the load files to Amazon S3

Install s3cmd, a command line tool to manage Amazon S3 remotely which is available in epel-testing repository:

mysql-to-amazon-redshift_table_4.png

NOTE: If you are on Debian/Ubuntu, see: http://s3tools.org/repo/deb-all/stable/

Configure S3 credentials:

mysql-to-amazon-redshift_table_5.png

Create a new bucket for Redshift:

mysql-to-amazon-redshift_table_6.png

Start syncing the CSV data directory into S3 bucket:

mysql-to-amazon-redshift_table_7.png
Ensure the data files exist in S3’s bucket as shown in the following list:

data files in AWS s3s buckets image

In the next steps, we prepare our Redshift:

  1. Redshift runs on PostgreSQL 8.X version as the backend. We can use the standard PostgreSQL client to access the Redshift cluster with provided endpoint and credentials.

To install PostgreSQL command we can execute the following commands:

mysql-to-amazon-redshift_table_8.png

  1. Use psql program to access the Redshift cluster:
mysql-to-amazon-redshift_table_9.png
  1. Create a table similar to the one we have in MySQL
mysql-to-amazon-redshift_table_10.png
NOTE: Redshift only supports certain data types as listed here. You may need to alter some of your data types to fit the Redshift environment.

4. Run a COPY command to load the table

Next, run the COPY command to access the CSV files in our S3 bucket and parallel load them into the table:

mysql-to-amazon-redshift_table_11.png
After the initial data load from the S3 bucket, we need to run a VACCUM command to reorganize our data and ‘analyze’ commands to update the table statistics.
mysql-to-amazon-redshift_table_12.png
 

Now the data is available in the Redshift cluster and ready forquery processing.

5. Verify the Data

From the psql client console, verify the table structure:

mysql-to-amazon-redshift_table_13.png
Column TYPE Modifiers
id bigint NOT NULL
trx_id character varying(16) NOT NULL
p_id INTEGER NOT NULL
created TIMESTAMP without TIME ZONE  
quantity INTEGER NOT NULL
price numeric(13,2) DEFAULT 0
subtotal numeric(13,2) DEFAULT 0
discount double precision DEFAULT 0
total numeric(13,2) DEFAULT 0

To verify that each table loaded correctly, execute the following commands:

mysql-to-amazon-redshift_table_14.png

All executed queries on the cluster can be seen on the Redshift Queries tab as well:

Redshift queries tab image

Click on the Query ID to get in-depth details on the query plan and status:

Redshift query ID image

That’s it. Your data is now in Redshift!

Conclusion

Migrating data to Amazon Redshift is relatively easy when you have access to the right procedure. But, sometimes moving the data is sometimes not all you need to do. Be aware of issues like: Handling the metadata and things like stored procedures, triggers, custom functions that can be problematic and add complexity to the move.

Data Warehouse too complex to manage? Learn how to leverage AWS

Also, pay attention to issues that may arise while adapting all existing queries and dashboards to the new query language (PostgresSQL instead of MySQL’s SQL). Redshift also helps to solve your issues for adapting all existing queries and dashboards to their new query language by automatically sampling data into the most appropriate compression scheme.

With Amazon Redshift, your data is organized in a better way. Query processing and sequential storage gives your enterprise an edge with improved performance as the data warehouse grows. Therefore, migrating to Redshift can be a crucial step to enabling big data analytics in your organization.

Check our our detailed comparisons between MySQL and MongoDB, and MySQL vs MariaDB.

From raw data to analysis in under 10 minutes.

Sign up now for a demo or a free trail of the Panoply.io platform.

Learn more about platform features