Setting Up A Cloud Data Warehouse: A Schema To SQL Tutorial

Written by Peter Weinberg | Jan 17, 2019 6:24:00 PM

Cloud-based data warehouses are quickly replacing their on-premise counterparts, allowing organizations with big data needs to potentially save millions in costs. Amazon Web Services’ Redshift data warehouse platform is one of the most popular, offering relatively easy set up from a web-based console and significantly lower costs than building a data warehouse on-premise.

But if you’re trying to get as quickly as possible from data to insight—or just don’t want to involve yourself in setting up security groups and IAM roles—setting up Redshift can still slow things down considerably. If you’re here after reading our previous post, “Setting up Redshift the Easy Way,” then you’re already intimately familiar with the detailed process of setting up a Redshift data warehouse cluster.

Setting up Redshift the even easier way

At Panoply, we’ve built a cloud data platform that makes setting up your storage and connecting your data sources incredibly streamlined. In this tutorial, we’ll run through setting up a Panoply cloud data warehouse, loading your data, and querying your data using our built-in SQL editor.

GETTING STARTED

If you want to follow along with this tutorial, you’ll need to set up a Panoply account first. Setting up a trial account is free (no credit card required) and comes with a 14 day trial period, so there’s plenty of time to experiment. Once you’ve set up your Panoply cloud data warehouse, you can start loading your data.

LOADING DATA

Once you’ve set up your Panoply cloud data warehouse, you can start loading your data. Head to the Data Sources pane on the left side of your Panoply dashboard. Then select Add Data Source in the upper right. Next, you’ll see a page listing a number of different potential data sources:

You can connect an existing SQL/NoSQL database, other cloud data stores like an S3 bucket, pull in analytics data from Facebook, Instagram, Google Ads, or Bing Ads or even just upload your data directly as a flat file. Let’s pull in some data from our Google Analytics account for the next part of the tutorial. In the data sources frame, select Google Analytics. This will open a login pane where you can enter your Google credentials:

If you head back to your Data Sources pane, you should be able to confirm that Google Analytics was successfully connected. Now, you’ll need to select the specific data you want to collect from your source. Select Google Analytics from the Data Sources pane, and you’ll see a new pane where you’ll be able to select exactly which metrics and dimensions to collect from your analytics account. In this case, we are using the data from googleappscripting.com.

We’ll select users, newUsers, sessions, and bounces as our metrics, then select deviceCategory, country, city, landingPagePath and dateHour as our dimensions. If you’re following along and have a Google Analytics account, you can make these same selections or choose whatever data you’d like to analyze on your end. Next, enter a date range for your selection and click Collect.

Once the data collection has finished, head to the Tables tab on your Panoply dashboard. You should see a list of tables containing the data you’ve pulled into your warehouse so far, including one named googleappscripting. Success!

If you select googleappscripting from your list of tables, you’ll get a preview of the automatically-generated table schema, including a sample of the data laid out as a dataframe right there in the Panoply dashboard:

QUERYING YOUR DATA WITH PANOPLY’S BROWSER-BASED SQL EDITOR

Now that you’ve got your data loaded in, you can start exploring. If you’re still in the googleappscripting preview pane, you can just select Query in the upper right corner to open Panoply’s SQL editor. If you’ve wandered off into another section of your Panoply dashboard but are ready to get back to your Google Analytics data, you can get to the editor directly by heading to the Analyze tab. Here, you’ll see a SQL editor up top and a space below to show the results of your queries.

First, let’s just look at the first 10 entries in googleappscripting table again. In the SQL editor, enter:

SELECT * 
FROM googleappscripting 
LIMIT 10

Let’s go a little further and see what we can learn about users in different countries. Which country has the highest number of users in our dataset? Enter:

SELECT 
  country, 
  SUM(users) as num_users 
FROM googleappscripting 
GROUP BY country 
ORDER BY num_users DESC

And there you go! You’ve gone from account setup to data ingestion to insight with just a few clicks. Compare this to AWS’s redshift process, which would have required you to:

Create an IAM role for your Redshift instance
Attach that role to your cluster
Determine the current and potential future size requirements for your Redshift cluster
Launch your cluster
Launch an S3 bucket with your data
Pre-configure SQL table schema on your Redshift cluster
Load data from your S3 bucket to your Redshift cluster using the COPY command
Install and configure a SQL client (if necessary)
Query your data
Output the results of your queries to a BI tool like Tableau

SUMMARY

Panoply makes spinning up a data warehouse and connecting all your data sources click-and-go simple. Plus, our best-in-class support means there's someone on the other end of the line when you need it. To learn more, schedule a personalized demo today.

View full post