Data analytics is a requirement for virtually every business today. However, building an analytical data warehouse requires vast computing power to maintain performance and fast retrieval of data and results. In addition, an organization's IT team needs to acquire new skills and infrastructure know-how to establish and maintain its BI systems.
To fill this infrastructure gap, Amazon Redshift brings a fully managed cloud solution for modern data warehouses. It delivers a massively parallel processing (MPP) architecture and uses a columnar data store that can aggregate and analyze large amounts of data. It enables faster performance, no matter how much data you have. That's why Amazon Redshift has become the solution of choice for companies like Nokia, which moved its high-volume data activity to Redshift and now runs queries twice as fast, and VivaKi, which loaded 10 terabytes of data in only six hours with Redshift.
So, for real-time data analytics on high volumes of data, Redshift has distinct benefits that MySQL, PostgreSQL, and other common databases cannot match at scale. However, despite success stories like VivaKi's, moving data from your database to Redshift isn't straightforward.
In this article, we’ll provide a step-by-step sample procedure for how to migrate your MySQL to Redshift.
Amazon Redshift to the Rescue
Amazon Redshift provides a very good solution for today's issues and beyond. As you've probably experienced, MySQL only takes you so far. It's not designed to cope with scaling data volumes, data consistency, query performance, or analytics on large amounts of data. Redshift is built to handle large-scale data analytics. Plus, it's cost effective, especially compared to other analytics-oriented platforms like Hive and Impala.
Before jumping into action, let’s understand these Redshift differences from MySQL:
Handling Database Workloads with OLAP vs. OLTP
One of the biggest differences between MySQL and Redshift is how each handles database workloads. MySQL is an OLTP (online transaction processing) system, whose purpose is to control and run fundamental business tasks. It is designed for fast, transactional, random access, not analytics.
By using OLAP (online analytical processing), Redshift gears its system toward analytics and helps with planning and problem solving. With its OLAP columnar design, Redshift improves performance: the data is organized in a way that allows for better compression and easier sequential reads, so scanning large amounts of data is fast.
In addition, as mentioned, OLTP systems serve ongoing business processes, whereas OLAP gives multi-dimensional views of various kinds of business activities. OLAP queries are often complex and involve aggregations, versus OLTP queries, which are typically simple and return relatively few records.
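To make the contrast concrete, here is a sketch of the two query styles side by side. The table and column names (`tbl_sales`, `trx_id`, `created`, `amount`, `discount`) are illustrative, not from a real schema:

```sql
-- Typical OLTP query (MySQL): fetch one record by key
SELECT * FROM tbl_sales WHERE trx_id = 'TX0001';

-- Typical OLAP query (Redshift): aggregate across millions of rows
SELECT DATE_TRUNC('month', created) AS month,
       SUM(amount - discount)       AS net_revenue
FROM   tbl_sales
GROUP  BY 1
ORDER  BY 1;
```

The first query touches a handful of rows via an index; the second scans and aggregates entire columns, which is exactly the access pattern a columnar store is built for.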
Using Two Types of Compute Nodes
There are two types of compute nodes available for Redshift. The first is DW2, a dense compute node type that runs on SSDs. These fast solid-state drives support the database's I/O performance needs and cost about $1.50/TB per hour. The second node type is DW1, the so-called dense storage nodes, which in comparison to DW2 run on traditional spinning disks. They cost around $0.50/TB per hour.
Now that you’re familiar with Amazon Redshift, we’ll show you how to load data to it.
Some preliminaries: There are several ways to perform the load. The COPY command is the most efficient way to load a table, as it can load data in parallel from multiple files and take advantage of the load distribution between nodes in the Redshift cluster. It supports loading data in CSV (or TSV), JSON, character-delimited, and fixed width formats.
After the initial data load, if you add, modify, or delete a significant amount of data, follow up by running a VACUUM command to reorganize the data and reclaim space after deletes.
Using individual INSERT statements to populate a table can be prohibitively slow. Alternatively, if the data already exists in other Amazon Redshift tables, use INSERT INTO … SELECT or CREATE TABLE AS to improve performance.
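For example, moving rows between Redshift tables in bulk might look like this (table names and the filter are illustrative):

```sql
-- Bulk-copy rows from an existing Redshift table
INSERT INTO tbl_sales_recent
SELECT * FROM tbl_sales WHERE created >= '2015-01-01';

-- Or create and populate the target table in a single step
CREATE TABLE tbl_sales_recent AS
SELECT * FROM tbl_sales WHERE created >= '2015-01-01';
```

Both forms let Redshift move data set-at-a-time instead of row-at-a-time, which is why they vastly outperform individual INSERTs.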
We’ll discuss the best ways to use each of these commands by example.
So, let’s start - here are the 5 steps for loading data into Redshift:
- Create a Redshift cluster
- Export a MySQL database and split it into multiple files
- Upload the files to Amazon S3
- Run a COPY command to load the table to Redshift
- Verify that the data was loaded correctly
1. Create a Redshift cluster
Perform the following steps to create a cluster:
- Enter your AWS account Console > Services > Redshift > Launch Cluster to start the Redshift cluster creation wizard.
- Enter the required cluster details as shown below:
- Configure the Node Type, Cluster Type and the number of Compute Nodes on the Node Configuration page:
Choose the default value for the rest of the options and you are good to go.
- The next thing to configure is the Security Groups to allow hosts access:
- Go to Redshift > Security > Security Groups > Connection Type > CIDR/IP and add the IP address of the host that you want to grant access.
2. Export MySQL Data to Amazon S3
Let’s assume that we have several tables to export to Redshift.
In the following example, we run a statement to select all sales transactions for the past 5 years from tbl_sales and split the result into files of 1 million rows each:
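A stored procedure along these lines could do the chunked export. This is a sketch only: the procedure name, file paths, and the assumption that `tbl_sales` has `trx_id` and `created` columns are all illustrative, and `SELECT ... INTO OUTFILE` requires the MySQL server to have write access to the target directory:

```sql
DELIMITER $$
-- Hypothetical export procedure: dump five years of sales from tbl_sales
-- into CSV chunks of 1 million rows each under /tmp.
CREATE PROCEDURE export_sales()
BEGIN
  DECLARE v_offset BIGINT DEFAULT 0;
  DECLARE v_total  BIGINT;

  SELECT COUNT(*) INTO v_total
  FROM tbl_sales
  WHERE created >= NOW() - INTERVAL 5 YEAR;

  WHILE v_offset < v_total DO
    -- Build each chunk's SELECT ... INTO OUTFILE statement dynamically
    SET @stmt = CONCAT(
      'SELECT * INTO OUTFILE ''/tmp/tbl_sales_', v_offset, '.csv'' ',
      'FIELDS TERMINATED BY '','' LINES TERMINATED BY ''\\n'' ',
      'FROM tbl_sales WHERE created >= NOW() - INTERVAL 5 YEAR ',
      'ORDER BY trx_id LIMIT 1000000 OFFSET ', v_offset);
    PREPARE stmt FROM @stmt;
    EXECUTE stmt;
    DEALLOCATE PREPARE stmt;
    SET v_offset = v_offset + 1000000;
  END WHILE;
END$$
DELIMITER ;
```

The ORDER BY keeps the LIMIT/OFFSET windows stable between iterations; without it, chunks could overlap or skip rows.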
Then, call the procedure:
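Assuming the export procedure was created under the (hypothetical) name `export_sales`, invoking it is a single statement:

```sql
CALL export_sales();
```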
Move the generated CSV files into a directory called s3-redshift:
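Something like the following would collect the chunks (the paths match the illustrative export above; adjust them to wherever your files landed):

```bash
# Gather the generated CSV chunks into one staging directory
mkdir -p ~/s3-redshift
mv /tmp/tbl_sales_*.csv ~/s3-redshift/
```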
3. Upload the load files to Amazon S3
Install s3cmd, a command line tool to manage Amazon S3 remotely which is available in epel-testing repository:
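On RHEL/CentOS, the install is one command (requires root and the EPEL testing repository to be reachable):

```bash
sudo yum --enablerepo=epel-testing install s3cmd
```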
NOTE: If you are on Debian/Ubuntu, see: http://s3tools.org/repo/deb-all/stable/
Configure S3 credentials:
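The configuration is an interactive wizard that prompts for your AWS access key and secret key and stores them in `~/.s3cfg`:

```bash
s3cmd --configure
```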
Create a new bucket for Redshift:
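The bucket name below is illustrative; S3 bucket names must be globally unique, so pick your own:

```bash
s3cmd mb s3://mysql-redshift-load
```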
Start syncing the CSV data directory into S3 bucket:
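Using the staging directory and bucket names from the earlier (illustrative) steps:

```bash
s3cmd sync ~/s3-redshift/ s3://mysql-redshift-load/
```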
Ensure the data files exist in S3’s bucket as shown in the following list:
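A quick listing confirms the upload (bucket name is the illustrative one used above):

```bash
s3cmd ls s3://mysql-redshift-load/
```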
In the next steps, we prepare our Redshift:
- Redshift runs on a PostgreSQL 8.x backend. We can use the standard PostgreSQL client to access the Redshift cluster with the provided endpoint and credentials.
To install the PostgreSQL client, we can execute the following commands:
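For example, on RHEL/CentOS (on Debian/Ubuntu, `sudo apt-get install postgresql-client` is the equivalent):

```bash
sudo yum install postgresql
```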
- Use psql program to access the Redshift cluster:
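The endpoint, port, user, and database below are placeholders; copy the real values from your cluster's details page in the console (Redshift listens on port 5439 by default):

```bash
psql -h mycluster.abc123xyz.us-east-1.redshift.amazonaws.com -p 5439 -U admin -d mydb
```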
- Create a table similar to the one we have in MySQL
NOTE: Redshift only supports certain data types as listed here. You may need to alter some of your data types to fit the Redshift environment.
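An illustrative DDL for a table matching the MySQL source might look like this; the column list is a guess at the example schema, so adjust names and types to your own table:

```sql
-- Hypothetical target table; tune types per the note on supported data types
CREATE TABLE tbl_sales (
  trx_id   VARCHAR(16) NOT NULL,
  created  TIMESTAMP,
  amount   DOUBLE PRECISION,
  discount DOUBLE PRECISION DEFAULT 0
);
```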
4. Run a COPY command to load the table
Next, run the COPY command to access the CSV files in our S3 bucket and parallel load them into the table:
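With a key prefix, COPY picks up every matching file in the bucket and loads them in parallel across the cluster's slices. The bucket, prefix, and credentials below are placeholders:

```sql
COPY tbl_sales
FROM 's3://mysql-redshift-load/tbl_sales_'
CREDENTIALS 'aws_access_key_id=<access-key>;aws_secret_access_key=<secret-key>'
CSV;
```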
After the initial data load from the S3 bucket, we need to run a VACUUM command to reorganize our data and an ANALYZE command to update the table statistics.
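Both are one-liners against the freshly loaded table:

```sql
VACUUM tbl_sales;
ANALYZE tbl_sales;
```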
Now the data is available in the Redshift cluster and ready for query processing.
5. Verify the Data
From the psql client console, verify the table structure:
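psql's describe meta-command does this (using the example table name):

```
\d tbl_sales
```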
| Column   | Type                        | Modifiers |
|----------|-----------------------------|-----------|
| trx_id   | character varying(16)       | NOT NULL  |
| created  | timestamp without time zone |           |
| discount | double precision            | DEFAULT 0 |
To verify that each table loaded correctly, execute the following commands:
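A simple row count per table, compared against the count exported from MySQL, is usually enough:

```sql
-- Should match the number of rows exported from the MySQL source
SELECT COUNT(*) FROM tbl_sales;
```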
All executed queries on the cluster can be seen on the Redshift Queries tab as well:
Click on the Query ID to get in-depth details on the query plan and status:
That’s it. Your data is now in Redshift!
Migrating data to Amazon Redshift is relatively easy when you have access to the right procedure. But moving the data is sometimes not all you need to do. Be aware that metadata and objects like stored procedures, triggers, and custom functions can be problematic and add complexity to the move.
Also, pay attention to issues that may arise while adapting existing queries and dashboards to the new query language (Redshift's PostgreSQL-based SQL instead of MySQL's dialect). On the storage side, Redshift does ease the transition by automatically sampling your data and applying the most appropriate compression scheme during the initial load.
With Amazon Redshift, your data is organized in a better way. Query processing and sequential storage gives your enterprise an edge with improved performance as the data warehouse grows. Therefore, migrating to Redshift can be a crucial step to enabling big data analytics in your organization.