You need a data warehouse, but should you take the traditional ETL route or opt for a modern ELT approach? When it comes to usability, there's no question: ELT data warehousing unlocks the power of views—a tool that will revolutionize the way you work.
Transform Your Data Using Views
Modern ELT data warehouses will have built-in integrations for the data sources you're most likely to need. So once you've loaded in your data either via a data pipeline or another tool, it's time to transform it so it's cleaned up and easy to use for building reports.
In the past, you've probably done something similar with Excel. You used VLOOKUP to join together sheets of data. You added formulas to create new columns. You copied and pasted data from one sheet to another. Perhaps you did a bunch of searching and replacing to clean up some of the data. As you did all this work, you probably created new sheets and/or tables to store the cleaned up data.
And then the next time you needed an updated report, you had to do it all over again.
With a modern data warehouse, that tedious manual work is in your past. Instead, you're going to accomplish the same goal using a feature of SQL called views.
Views can be a little tricky to understand in the abstract. The easiest way to learn about views is to see one in action.
Let's suppose you need to analyze some Wordpress data you've imported. You want to be able to analyze published blog posts, comments, and forum posts, and you want to be able to break them down by date and author. You need to pull data from two tables:
wp_users, which contains info about your WordPress account's users.
wp_posts, which contains info about website pages, blog posts, forum posts, and comments.
In this article, the author created the following query to pull together the relevant data from both tables:
SELECT posts.id, posts.post_name, posts.post_title, users.id author_id, users.display_name author_display_name, posts.post_type, posts.post_status, posts.post_parent, posts.comment_status, posts.comment_count, posts.post_date_gmt, posts.post_modified_gmt, posts.guid FROM wordpress_wp_posts posts JOIN wordpress_wp_users users ON posts.post_author = users.id WHERE post_type IN ('page', 'post', 'forum', 'topic') AND post_status = 'publish'
While this query isn't super complicated, it's a mouthful. Imagine if you had several reports that were analyzing number of comments per user's blog posts, forum traffic, etc. That's an awful lot of SQL you have to write or copy and paste over and over.
By turning that SQL into a view, you can create a kind of query-ready virtual table. The best part? All you have to do to create a view is add the first and last lines below to your SQL query:
CREATE VIEW wordpress_posts AS ( SELECT posts.id, posts.post_name, posts.post_title, users.id author_id, users.display_name author_display_name, posts.post_type, posts.post_status, posts.post_parent, posts.comment_status, posts.comment_count, posts.post_date_gmt, posts.post_modified_gmt, posts.guid FROM wordpress_wp_posts posts JOIN wordpress_wp_users users ON posts.post_author = users.id WHERE post_type IN ('page', 'post', 'forum', 'topic') AND post_status = 'publish' );
Now to produce the same result as in the first query, the only SQL you'd need is:
SELECT * FROM wordpress_posts;
What's going on here?
SQL is treating the view
wordpress_posts just like it was a table in the data warehouse.
If views didn't exist and you wanted to create a cleaned up version of the Wordpress data, you’d have to write that logic into all your SQL queries. That's an error-prone process that can lead to lots of discrepancies across your reports and dashboards.
The alternative is to take the above query and dump the results into a new table. And every time you wanted up-to-date data in the table? You'd have to delete all of the existing data in the table, rerun the query, and dump the new results into the table. Every. Single. Time.
But if you use a view, all that work goes away. You create the view using a SQL query, and from now on you get clean, up-to-the-minute data in this virtual table.
Best of all, you can always redefine your view to and it will apply to all the downstream queries consistently without any need to update the queries themselves!
Using Views Will Save You Lots of Work
Once you get comfortable with views, they will end up saving you an enormous amount of work. Here's why:
1. Views Streamline Your Work
Suppose you need to create a bunch of reports that use data from your company's SaaS platform. To do just about any analysis of this data, you need grab columns from 6 related tables. So for every report, you're going to have to join those 6 tables. That eats up your time and focus. It creates more opportunities to make mistakes. And it makes it more risky for you to allow end-users to create reports.
If you use a view that takes care of the JOINs for you, you figure out the SQL once and then you're set.
2. Views Hide Complexity
Views are also a great way of hiding calculations, business logic, and other complexities so creating reports is much easier and cleaner.
Suppose your business has two types of customers that everyone is obsessed with: whales, who use your service at least 1,000 times a month, and minnows, who use it 1-50 times a month. (Why? Your founder went to a talk, and ever since it's been nonstop whales and minnows.)
You could bake the business logic for whales and minnows into every report that needs these categories. Or you could take care of it once, in a view.
If you build the categories in a view, you'll also be very happy two months from now, when your company decides it makes more sense to expand the whales category so it includes everyone who uses your service at least 850 times a month. You don't have to rewrite a bunch of reports or a bunch of SQL. Change the view and you're set.
And in six months, when your company switches to using sophisticated machine learning models to categorize whales and minnows as well as adding a new predictive category called "potential whales"? Set up Panoply to regularly import the model's output table that contains customer ID and category, modify your view so it grabs the customer category from that table, and you're good to go.
3. It's Easy to Create Multiple Views of the Same Tables
Sometimes when you're writing a report, you want to be able to choose from a wide range of data. For example, you might want to be able to see all of the columns available in your SaaS platform's data so you can pick exactly the ones you need for a unique report.
But you also have a bunch of reports that only need the same subset of your SaaS platform's data. And it's a real pain to have to wade through a bazillion columns when creating these reports.
So, you create two views: one that contains all of your SaaS platform columns, another with the subset you regularly use. Now writing most of your reports is straightforward, but you've still got easy access to the full set of columns when you need them.
A Final Note: Views, ELT and ETL
If you've ever read an article about traditional data warehouses, it probably uses the acronym ETL.
When creating a modern data warehouse, you switch up the order for ingesting your data. Instead of extracting (E), loading (L), and transforming (T) your data, you extract (E), transform (T), then load (L) it.
The reason for this switch? Traditional data warehouses aren't speedy enough to handle transformations using views.
As a result, for each transformation you need to temporarily store the results in one or more tables. Once you've finished with all the transformations required to get the data into the format you need, you load the cleaned up data into your data warehouse.
If you're creating a traditional data warehouse, you're likely to have an enormous amount of data to transform. If you have to repeatedly modify the data in one or more tables, that's going to chew up a lot of time. If you're lucky, completing an ETL process takes several hours—and you have to run that process all over again every time you get a new batch of data.
But it's common for ETL processes to run for half a day or more. So in addition to figuring out which transformations you'll need to do, you also need to determine how to optimize your SQL and other factors to ensure one day's ETL process finishes before the next day's process begins.
And because the process takes so long, you have to worry about whether the process will choke and die while it's running. At a minimum, that means you have to rerun the process all over again. If you're pulling down live data, the process' failure could also mean you'd lose data you could never get back.
That's what the process is like if your business' needs never change. But of course they do. Now you'll have to make changes to how the tables are structured, change the SQL to accommodate the new structure, and reoptimize the process so it finishes before the next day.
Is it any wonder that you need highly skilled data engineers to build a traditional data warehouse?
Luckily, since you're using a modern data warehouse you'll never have to worry about any of this. With ELT, raw data is loaded into the warehouse and transformations are applied afterward, making the process that much faster and easier.
Working with a platform like Panoply makes working with your data a snap. With ease of use and automated optimization baked in, you can set up a data warehouse in minutes and easily integrate your company's BI tool in just a few clicks.