Setting Up A Cloud Data Warehouse - A Schema To SQL Tutorial

Cloud-based data warehouses are quickly replacing their on-premise counterparts, allowing organizations with big data needs to potentially save millions in costs. Amazon Web Services’ Redshift data warehouse platform is one of the most popular, offering relatively easy setup from a web-based console and significantly lower costs than building a data warehouse on-premise. But if you’re trying to get from data to insight as quickly as possible, or just don’t want to involve yourself in setting up security groups and IAM roles, setting up Redshift can still slow things down considerably. If you’re here after reading our previous post, “Setting up Redshift the Easy Way,” then you’re already intimately familiar with the detailed process of setting up a Redshift data warehouse cluster.

Setting up Redshift the even easier way

At Panoply, we’ve built a cloud data warehouse solution that makes everything about setting up Redshift easier, faster and more streamlined. Panoply lives on top of Redshift and automates the annoying parts of data ingestion, Redshift cluster configuration and warehouse maintenance. In this tutorial, we’ll run through setting up a Panoply cloud data warehouse, loading your data, and querying your data using our built-in SQL editor.

GETTING STARTED

If you want to follow along with this tutorial, you’ll need to set up a Panoply account first. Setting up a trial account is free (no credit card required) and comes with a 21-day trial period, so there’s plenty of time to experiment.

LOADING DATA

Once you’ve set up your Panoply cloud data warehouse, you can start loading your data. Head to the Data Sources pane on the left side of your Panoply dashboard. Then select Add Data Source in the upper right. Next, you’ll see a page listing a number of different potential data sources:

You can connect an existing SQL or NoSQL database, connect other cloud data stores like an S3 bucket, pull in analytics data from Facebook, Instagram, Google Ads, or Bing Ads, or even just upload your data directly as a flat file. Let’s pull in some data from our Google Analytics account for the next part of the tutorial. In the data sources frame, select Google Analytics. This will open a login pane where you can enter your Google credentials:

If you head back to your Data Sources pane, you should be able to confirm that Google Analytics was successfully connected. Now, you’ll need to select the specific data you want to collect from your source. Select Google Analytics from the Data Sources pane, and you’ll see a new pane where you’ll be able to select exactly which metrics and dimensions to collect from your analytics account. In this case, we are using the data from googleappscripting.com.

We’ll select users, newUsers, sessions, and bounces as our metrics, then select deviceCategory, country, city, landingPagePath and dateHour as our dimensions. If you’re following along and have a Google Analytics account, you can make these same selections or choose whatever data you’d like to analyze on your end. Next, enter a date range for your selection and click Collect.

Once the data collection has finished, head to the Tables tab on your Panoply dashboard. You should see a list of tables containing the data you’ve pulled into your warehouse so far, including one named googleappscripting. Success!

If you select googleappscripting from your list of tables, you’ll get a preview of the automatically generated table schema, including a sample of the data laid out as a dataframe right there in the Panoply dashboard:
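As a rough picture of what such a schema might contain, here is a minimal sketch that rebuilds a table with the metrics and dimensions selected earlier, using Python's built-in sqlite3 module as a local stand-in. The column types are assumptions (the preview in your own dashboard is the authoritative source), and Panoply may name or type the columns differently.

```python
import sqlite3

# Hypothetical column types: the tutorial names the fields but not their types.
SCHEMA = """
CREATE TABLE googleappscripting (
    dateHour        TEXT,     -- dimension; Google Analytics formats it as YYYYMMDDHH
    deviceCategory  TEXT,     -- dimension
    country         TEXT,     -- dimension
    city            TEXT,     -- dimension
    landingPagePath TEXT,     -- dimension
    users           INTEGER,  -- metric
    newUsers        INTEGER,  -- metric
    sessions        INTEGER,  -- metric
    bounces         INTEGER   -- metric
)
"""


def describe_table(conn, table):
    """Return (column_name, declared_type) pairs for a table."""
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # Each row is (cid, name, type, notnull, default_value, pk).
    return [(row[1], row[2]) for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
for name, col_type in describe_table(conn, "googleappscripting"):
    print(f"{name:<16} {col_type}")
```

Each row in the table corresponds to one combination of dimension values, with the metrics stored as numeric columns alongside it.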

QUERYING YOUR DATA WITH PANOPLY’S BROWSER-BASED SQL EDITOR

Now that you’ve got your data loaded in, you can start exploring. If you’re still in the googleappscripting preview pane, you can just select Query in the upper right corner to open Panoply’s SQL editor. If you’ve wandered off into another section of your Panoply dashboard but are ready to get back to your Google Analytics data, you can get to the editor directly by heading to the Analyze tab. Here, you’ll see a SQL editor up top and a space below to show the results of your queries.

First, let’s just look at the first 10 entries in the googleappscripting table again. In the SQL editor, enter:

SELECT *
FROM googleappscripting
LIMIT 10

You should see something like this (especially if you scrolled over to the right hand side of the table):

Let’s go a little further and see what we can learn about users in different countries. Which country has the highest number of users in our dataset? Enter:

SELECT country, SUM(users) AS num_users
FROM googleappscripting
GROUP BY country
ORDER BY num_users DESC

This should return one row per country, sorted from the highest number of users to the lowest.
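If you'd like to check how the query behaves before running it against your own warehouse, the sketch below runs the same SQL against a tiny in-memory SQLite table. The sample rows are made up for illustration and are not real Google Analytics data, and SQLite is only a local stand-in for the Panoply editor.

```python
import sqlite3

# Made-up sample rows (country, users); not real Google Analytics data.
SAMPLE_ROWS = [
    ("United States", 12),
    ("India", 7),
    ("United States", 5),
    ("Germany", 3),
    ("India", 9),
]

# The same query as in the tutorial.
QUERY = """
SELECT country, SUM(users) AS num_users
FROM googleappscripting
GROUP BY country
ORDER BY num_users DESC
"""


def users_by_country(rows):
    """Load rows into a throwaway table and return total users per country."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE googleappscripting (country TEXT, users INTEGER)")
    conn.executemany("INSERT INTO googleappscripting VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


for country, num_users in users_by_country(SAMPLE_ROWS):
    print(country, num_users)
```

With these sample rows, the United States comes first with 17 users, followed by India with 16 and Germany with 3. GROUP BY collapses the rows for each country into one, and SUM(users) adds up their user counts.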

And there you go! You’ve gone from account setup to data ingestion to insight with just a few clicks. Compare this to the standard AWS Redshift process, which would have required you to:

  • Create an IAM role for your Redshift instance
  • Attach that role to your cluster
  • Determine the current and potential future size requirements for your Redshift cluster
  • Launch your cluster
  • Launch an S3 bucket with your data
  • Pre-configure SQL table schema on your Redshift cluster
  • Load data from your S3 bucket to your Redshift cluster using the COPY command
  • Install and configure a SQL client (if necessary)
  • Query your data
  • Output the results of your queries to a BI tool like Tableau
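To give a sense of the manual loading step in the list above, here is a minimal sketch of a helper that assembles a Redshift COPY statement for a CSV file with a header row. The bucket path, account ID and role name are placeholders invented for illustration, and the helper only builds the statement text. It does not connect to a cluster or escape its inputs.

```python
def build_copy_statement(table, s3_uri, iam_role_arn):
    """Build a Redshift COPY statement for a CSV file with a header row."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS CSV\n"
        "IGNOREHEADER 1;"
    )


# Placeholder values: substitute your own bucket, file and IAM role.
statement = build_copy_statement(
    table="googleappscripting",
    s3_uri="s3://example-bucket/analytics/export.csv",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
print(statement)
```

On a self-managed cluster, you would run the resulting statement from your SQL client once the table schema had been created and the IAM role attached.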

SUMMARY

Panoply makes connecting with most common data sources much simpler than the standard Redshift process. Your Panoply cloud data warehouse will scale automatically according to the size of your data, and maintenance functions (such as VACUUM and ANALYZE) will be performed automatically, in the background, while your data is still accessible.
