You need a data warehouse, but a data warehouse is worth nothing if it doesn't contain data! The first step to start seeing benefits from your data warehouse is to ingest data, and there are two main data ingestion methods: the traditional ETL and the modern ELT. When it comes to usability, there's no question: ELT data ingestion is what unlocks the power of the data warehouse. It does that by allowing the data warehouse to use its own compute power for complex, on-demand transformations. In many cases, you'll want to use the same transformation in multiple different queries.
For that, you'll need views. Views aren't new in the data world, but with the increased compute power of modern data warehouses, they can become the complex, organized transformation layer in your analytics pipeline.
With a modern data warehouse, the tedious manual work of copying transformation logic into every query is in the past. Instead, you'll accomplish the same goal using the views feature of SQL.
Views can be a little tricky to understand in the abstract, so the easiest way to learn about views is to see one in action.
Let's suppose you need to analyze some WordPress data you've imported. You want to be able to analyze published blog posts, comments, and forum posts, and to break them down by date and author. You need to pull data from two tables: wordpress_wp_posts and wordpress_wp_users.
Here's a query that pulls together the relevant data from both tables:
```sql
SELECT posts.id,
       posts.post_name,
       posts.post_title,
       users.id AS author_id,
       users.display_name AS author_display_name,
       posts.post_type,
       posts.post_status,
       posts.post_parent,
       posts.comment_status,
       posts.comment_count,
       posts.post_date_gmt,
       posts.post_modified_gmt,
       posts.guid
FROM wordpress_wp_posts posts
JOIN wordpress_wp_users users
  ON posts.post_author = users.id
WHERE post_type IN ('page', 'post', 'forum', 'topic')
  AND post_status = 'publish'
```
While this query isn't super complicated, it's a mouthful. Imagine if you had several reports that were analyzing the number of comments per user's blog posts, forum traffic, etc. That's an awful lot of SQL to write or copy and paste over and over.
By turning that SQL into a view, you can create a kind of query-ready virtual table. The best part? All you have to do to create a view is add the first and last lines below to your SQL query:
```sql
CREATE VIEW wordpress_posts AS (
    SELECT posts.id,
           posts.post_name,
           posts.post_title,
           users.id AS author_id,
           users.display_name AS author_display_name,
           posts.post_type,
           posts.post_status,
           posts.post_parent,
           posts.comment_status,
           posts.comment_count,
           posts.post_date_gmt,
           posts.post_modified_gmt,
           posts.guid
    FROM wordpress_wp_posts posts
    JOIN wordpress_wp_users users
      ON posts.post_author = users.id
    WHERE post_type IN ('page', 'post', 'forum', 'topic')
      AND post_status = 'publish'
);
```
Now to produce the same result as in the first query, the only SQL you'd need is:
```sql
SELECT * FROM wordpress_posts;
```
What's going on here?
SQL treats the view wordpress_posts just as if it were a table in the data warehouse.
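Because the warehouse treats the view like a table, anything you'd do with a table works here too. For example, here's a quick (hypothetical) report that counts published items per author using the view defined above:

```sql
-- Query the view like any table: filter, group, and sort its columns
SELECT author_display_name,
       COUNT(*) AS published_items
FROM wordpress_posts
GROUP BY author_display_name
ORDER BY published_items DESC;
```

No joins, no WHERE clause boilerplate; the view already took care of all of that.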
If views didn't exist and you wanted to create a cleaned-up version of the WordPress data, you'd have to write that logic into every one of your SQL queries. That's an error-prone process that can lead to discrepancies across your reports and dashboards.
The alternative is to take the above query and dump the results into a new table. And every time you wanted up-to-date data in the table? You'd have to delete all of the existing data in the table, rerun the query, and dump the new results into the table. Every. Single. Time.
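Concretely, each manual refresh would look something like this sketch, assuming you'd dumped the results into a hypothetical wordpress_posts_snapshot table (column list abbreviated for readability):

```sql
-- Every refresh: wipe the old rows, then rerun the whole query
DELETE FROM wordpress_posts_snapshot;

INSERT INTO wordpress_posts_snapshot
SELECT posts.id,
       posts.post_name,
       posts.post_title,
       users.id AS author_id,
       users.display_name AS author_display_name,
       posts.post_type,
       posts.post_status,
       posts.post_date_gmt
FROM wordpress_wp_posts posts
JOIN wordpress_wp_users users
  ON posts.post_author = users.id
WHERE post_type IN ('page', 'post', 'forum', 'topic')
  AND post_status = 'publish';
```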
But if you use a view, all that work goes away. You create the view using a SQL query, and from now on you get clean, up-to-the-minute data in this virtual table.
Best of all, you can always redefine your view and it will apply to all the downstream queries consistently without any need to update the queries themselves!
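Most modern warehouses (PostgreSQL, Redshift, Snowflake, BigQuery) let you redefine a view in place with CREATE OR REPLACE VIEW. Here's a sketch, with an abbreviated column list, that hypothetically widens the filter to include forum replies; every query against wordpress_posts picks up the change automatically:

```sql
-- Redefine the view; no downstream query needs to change
CREATE OR REPLACE VIEW wordpress_posts AS (
    SELECT posts.id,
           posts.post_name,
           posts.post_title,
           users.id AS author_id,
           users.display_name AS author_display_name,
           posts.post_type,
           posts.post_status,
           posts.post_date_gmt
    FROM wordpress_wp_posts posts
    JOIN wordpress_wp_users users
      ON posts.post_author = users.id
    WHERE post_type IN ('page', 'post', 'forum', 'topic', 'reply')  -- hypothetically now includes replies
      AND post_status = 'publish'
);
```

Note that some warehouses restrict how much a replaced view's column list can change, so check your platform's documentation before reshaping a view that's already in use.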
Once you get comfortable with views, they'll end up saving you an enormous amount of work. Here's why:
Suppose you need to create a bunch of reports that use data from your company's SaaS platform. To do just about any analysis of this data, you need to grab columns from six related tables. So for every report, you'll have to join those six tables. That eats up your time and focus, creates more opportunities for mistakes, and makes it riskier to let end users create their own reports.
If you use a view that takes care of the joins for you, you figure out the SQL once and then you're set.
Views are also a great way of hiding calculations, business logic, and other complexities so creating reports is much easier and cleaner.
Say your business has two types of customers everyone is obsessed with: whales, who use your service at least 1,000 times a month, and minnows, who use it 1-50 times a month. (Why? Your founder went to a talk, and ever since it's been nonstop whales and minnows.)
You could bake the business logic for whales and minnows into every report that needs these categories. Or you could take care of it once, in a view.
If you build the categories in a view, you'll also be very happy two months from now, when your company decides it makes more sense to expand the whales category so it includes everyone who uses your service at least 850 times a month. You don't have to rewrite a bunch of reports or a bunch of SQL. Change the view and it’s one and done.
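Here's a minimal sketch of that view, assuming a hypothetical customer_usage table with customer_id and monthly_uses columns:

```sql
-- Define the segments once; change the thresholds here and every report follows
CREATE VIEW customer_segments AS (
    SELECT customer_id,
           monthly_uses,
           CASE
               WHEN monthly_uses >= 1000 THEN 'whale'
               WHEN monthly_uses BETWEEN 1 AND 50 THEN 'minnow'
               ELSE 'other'
           END AS segment
    FROM customer_usage
);
```

When the whale threshold drops to 850, you edit one number in one CASE expression instead of hunting through every report.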
Sometimes when you're writing a report, you need to be able to choose from a wide range of data. For example, you might want to see all of the columns available in your SaaS platform's data so you can pick exactly the ones you need for a unique report.
But you also have a bunch of reports that only need the same subset of your SaaS platform's data. And it's a real pain to have to wade through a bazillion columns when creating these reports.
So, you create two views: one that contains all of your SaaS platform columns, another with the subset you regularly use. Now writing most of your reports is straightforward, but you've still got easy access to the full set of columns when you need them.
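A sketch of that pair of views, assuming a hypothetical saas_platform_events table and column names:

```sql
-- The wide view: every column, for ad hoc exploration
CREATE VIEW saas_events_full AS (
    SELECT * FROM saas_platform_events
);

-- The narrow view: just the columns most reports need
CREATE VIEW saas_events_basic AS (
    SELECT event_id,
           customer_id,
           event_type,
           event_time
    FROM saas_platform_events
);
```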
If you've ever read an article about traditional data ingestion, it probably uses the acronym ETL.
With modern ELT ingestion, you switch up the order of the steps. Instead of extracting (E), transforming (T), and loading (L) your data, you extract (E), load (L), and then transform (T) it.
The reason for this switch? Traditional data warehouses aren't speedy enough to handle real-time transformations using views.
As a result, for each transformation you need to temporarily store the results in one or more tables. Once you've finished with all the transformations required to get the data into the format you need, you load the cleaned up data into your data warehouse.
If you're creating a traditional data warehouse, you're likely to have an enormous amount of data to transform. If you have to repeatedly modify the data in one or more tables, that's going to chew up a lot of time. If you're lucky, completing an ETL process takes several hours—and you have to run that process all over again every time you get a new batch of data.
But it's common for ETL processes to run for half a day or more. So in addition to figuring out which transformations you'll need to do, you also need to determine how to optimize your SQL and other factors to ensure one day's ETL process finishes before the next day's process begins.
And because the process takes so long, you have to worry about whether the process will choke and die while it's running. At a minimum, that means you have to rerun the process all over again. If you're pulling down live data, the process' failure could also mean you'd lose data you could never get back.
That's what the process is like if your business' needs never change. But of course they do. Now you'll have to make changes to how the tables are structured, change the SQL to accommodate the new structure, and reoptimize the process so it finishes before the next day. In many cases, it will also require a complete recalculation of all the tables you already have in your data warehouse.
Is it any wonder that you need highly skilled data engineers to build a traditional data warehouse?
Luckily, since you're using a modern data warehouse you'll never have to worry about any of this. With ELT, raw data is loaded into the warehouse and transformations are applied afterward, making the process that much faster and easier.
A platform like Panoply makes working with your data a snap. With ease of use and automated optimization baked in, you can set up a data warehouse in minutes and integrate your company's BI tool in just a few clicks.
Interested in learning more about how you can combine the best of ELT and data warehousing? Book a demo with one of our experts!