Yesterday at AWS San Francisco Summit, Amazon announced a powerful new feature - Redshift Spectrum. Spectrum offers a set of new capabilities that allow Redshift columnar storage users to seamlessly query arbitrary files stored in S3 as though they were normal Redshift tables, delivering on the long-awaited requests for separation of storage and compute within Redshift. The redshift query option opens up a ton of new use-cases that were either impossible or prohibitively costly before.
UPDATE: Initially this text claimed that Spectrum is an integration between Redshift and Athena. After speaking with the Redshift team and learning more, we’ve learned it’s inaccurate as Redshift loads the data and queries it directly from S3.
In essence Spectrum is a powerful new feature that provides Amazon Redshift customers the following features:
This is simple, but very powerful. Amazon just made Redshift MUCH bigger, without compromising on performance or other database semantics.
See what other changes Redshift architecture has gone through.
So, how does it all work? It starts by defining external tables. One limitation this setup currently has is that you can’t split a single table between Redshift and S3. One use-case that we cover in Panoply where such separation would be necessary is when you have a massive table (think click stream time series), but only want the most recent events, like 3-months, to reside in Redshift, as that covers most of your queries. Then you might want to have the rest of the data in S3 and have the capability to seamlessly query this table. While this is not yet part of the new Redshift features, I hope that it will be something that Redshift team will consider in the future. In the meantime, Panoply’s auto-archiving feature provides an (almost) similar result for our customers.
This means that every table can either reside on Redshift normally, or be marked as an external table. “External Table” is a term from the realm of data lakes and query engines, like Apache Presto, to indicate that the data in the table is stored externally - either with an S3 bucket, or Hive metastore. Effectively the table is virtual. It’s only a link with some metadata. Let’s consider the following table definition:
CREATE EXTERNAL TABLE external_schema.click_stream (
See: SQL Reference for CREATE EXTERNAL TABLE.
Basically what we’ve told Redshift is to create a new external table - read only table that contains the specified columns and has its data located in the provided S3 path as text files. That’s it. We can start querying it as if it had all of the data pre-inserted into Redshift via normal COPY commands.
This has several important implications:
There’s one technical detail I’ve skipped: external schemas. As you might’ve noticed, in no place did we provide Redshift with the relevant credentials for accessing the S3 file. Quitel cleverly, instead of having to define it on every table (like we do for every COPY command), these details are provided once by creating an External Schema, and then assigning all tables to that schema. I will not elaborate on it here, as it’s just a one-time technical setup step, but you can read more about it here.
It’s a common misconception that Spectrum uses Athena under the hood to query the S3 data files. But it’s not true. While the two looks similar, Redshift actually loads and queries that data on it’s own, directly from S3. But in order to do that, Redshift needs to parse the raw data files into a tabular format. In other words, it needs to know ahead of time how the data is structured, is it a Parquet file? a CSV or TSV file?
That’s where the aforementioned “STORED AS” clause comes in. When we initially create the external table, we let Redshift know how the data files are structured.
Now that the table is defined. We can query it just like any other Redshift table. But more importantly, we can join it with other non-external tables. So if we have our massive click stream external table and we want to join it with a smaller & faster users table that resides on Redshift, we can issue a query like:
SELECT clicks.time, clicks.user_id, users.user_name
FROM external_schema.click_stream as clicks
Redshift will construct a query plan that joins these two tables, like so:
Basically what happens is that the users table is scanned normally within Redshift by distributing the work among all nodes in the cluster. In parallel, Redshift will ask S3 to retrieve the relevant files for the clicks stream, and will parse it. Finally the data is collected from both scans, joined and returned.
Naturally, queries running against S3 are bound to be a bit slower. It’s still interactively fast, as the power of Redshift allows great parallelism, but it’s not going to be as fast as having your data pre-compressed, pre-analyzed data stored within Redshift. That’s not just because of S3 I/O speed compared to EBS or local disk reads, but also due to the lack of caching, ad-hoc parsing on query-time and the fact that there are no sort-keys.
But that’s fine. External tables cover a different use-case. You use them for data your need to query infrequently, or as part of an ELT process that generates views and aggregations. In fact, in Panoply we’ve simulated these use-cases in the past similarly - we would take raw arbitrary data from S3 and periodically aggregate/transform it into small, well-optimized materialized views within a cloud based data warehouse architecture.
As for the cost - this is a tricky one. On the get-go, external tables cost nothing (beyond the S3 storage cost), as they don’t actually store or manipulate data in anyway. It’s just a bunch of Metadata. But as you start querying, you’re basically using query-based cost model of paying per scanned data size. So if, for example, you run a query that needs to process 1TB of data, you’d be billed for $5 for that query. This model isn’t unique, as is quite convenient when you indeed query these external tables infrequently, but can become problematic and unpredictable when your team query it often.
Finally, using a columnar data format, like Parquet, can improve both performance and cost tremendously, as Redshift wouldn’t need to read and parse the whole table, but only the specific columns that are part of the query. This saves the costs of I/O, due to file size, especially when compressed, but also the cost of parsing.
See what other changes Redshift has gone through.
It’s clear that the world of data analysis is undergoing a revolution. It started out with Presto, which was arguably the first tool to allow interactive queries on arbitrary data lakes. Then Google’s Big Query provided a similar solution except with automatic scaling. And finally AWS Athena and now AWS Spectrum brings these same capabilities to AWS.
But here at Panoply we still believe the best is yet to come. These new awesome technologies illustrate the possibilities, but the performance is still a bit off, compared to classic data warehouses like Redshift and Vertica that had decades to evolve and perfect. Technically, there’s little reason for these new systems to not provide competitive query performance, despite their limitations and differences from the standpoint of classic data warehouses. Understanding the data warehouse concepts under the hood helps you develop an understanding of expected behavior. (Yeah, I said it. More on this topic to come...)
In any case, we’ve been already simulating some of these features for our customers internally for the past year and a half. Having these new capabilities baked into Redshift makes it easier for us to deliver more value - like auto archiving - faster and easier. In the near future, we can expect to see teams learn more from their data and utilize it better than ever before - by using capabilities that, until very recently, were outside of their reach.
Click here for a detailed comparison of Athena and Redshift