So. You run a business that lives on data. You’ve got a SQL-style relational database or two up and running to store your data, but your data keeps growing and you’re thinking you might want to set up a data warehouse to centralize everything and make life easier for your data analytics team. You’re familiar with PostgreSQL and like the idea of sticking with what you’re used to, but you’ve also heard good things about Amazon’s Redshift data warehouse platform and want to figure out how to weigh these two options. Sound about right? Well, lucky for you, we’ve put together a side-by-side comparison of Redshift and Postgres to help you figure out what best meets your needs.
Redshift isn’t just Postgres
You may have heard that Redshift is built on Postgres, so they’re basically the same thing. But there are, in fact, several important differences. While it’s true that Redshift is based on PostgreSQL (specifically PostgreSQL 8.0.2), Redshift’s original developers made several significant changes to Postgres when building their implementation, and these changes are important to keep in mind when thinking about how to build out your data warehousing operations. Basically, the differences boil down to a few key characteristics:
- Data architecture
- Speed & query optimization
- Distributed processing
- View materialization
- Language and data type differences
Data structures: columns vs rows
One of the most important distinctions between Redshift and traditional PostgreSQL comes down to the way data is stored and structured in the databases created by the two approaches. While PostgreSQL uses a row-ordered approach to building tables, Redshift’s architecture is set up such that all tables are ordered by column, with each column represented by its own file. This might sound like a minor difference, but it has major effects on the way the two database styles perform with bigger datasets.
Postgres: rows first
In Postgres, the fact that data is ordered by rows means that it is also queried by row. In practice, this means that if you have a very wide table (one with a large number of columns), even querying just a handful of rows requires pulling up every single column in the table. This isn’t necessarily a bad thing. Because Postgres is optimized for this setup, its performance when querying a large number of columns isn’t significantly different from its performance when fetching fewer. This is great for exploratory data analysis, where analysts might want to get a bird’s-eye view of the data by looking at all columns at once to see if any patterns jump out. However, it can become a problem when you start to work with tables with lots of rows, especially if you are only interested in the data from one column. To, say, sum a single column in a million-row table, Postgres needs to read the entire table to complete the query. In practice, this can make for queries that are quite a bit slower than if they were performed in Redshift.
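To make that concrete, here’s a minimal sketch of the kind of query we mean (the `events` table and its `amount` column are hypothetical names, not from any particular schema):

```sql
-- Sum one column of a wide, million-row table.
-- A row-oriented store like Postgres still has to read every
-- column of every row from disk to answer this.
SELECT SUM(amount)
FROM events;
```

The query touches a single column logically, but because rows are stored whole, the physical I/O covers the entire table.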
Redshift: columns first
In contrast, Redshift’s architecture puts columns first, which means that more straightforward, single- or few-column business queries don’t require reading the full table before a query can be completed. So if you want to see sales numbers in region A, Redshift can just go directly to those columns and load in the relevant rows. Redshift’s columnar organization also allows it to compress individual columns, which makes them easier and faster to read into memory for the purposes of processing queries. All this adds up to give Redshift a big speed boost for most standard, BI-type queries. Where it can fall down is in more exploratory queries, where an analyst might want to load a huge number of columns just to see what’s going on. This is especially likely to come up with new Redshift users who are familiar with Postgres and don’t understand how their queries will perform at scale on a Redshift instance.
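The region example above might look like this, assuming an illustrative `sales` table with `region` and `amount` columns:

```sql
-- In a columnar store, only the `region` and `amount` column files
-- are read from disk; all other columns in the table are skipped.
SELECT SUM(amount)
FROM sales
WHERE region = 'A';
```

The fewer columns a query references, the bigger the I/O savings relative to a row store.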
Read more about Redshift architecture.
Distributed processing: the Redshift advantage
Aside from its columnar architecture, Redshift also has the advantage of being designed to run across multiple nodes. This makes for a couple more key distinctions between Redshift and standard PostgreSQL.
Postgres isn’t designed to use more than one core per query. The upside is that Postgres is much better at handling many concurrent queries, but if you have a high-powered, multicore Postgres instance and you want to put all your resources into a single query, you won’t be able to. This might not sound like much of a problem, but it limits your options when you’re trying to make a single big query run faster.
Redshift, on the other hand, chops stored tables up into blocks that can be read in parallel, which means that you have more flexibility in the way you use your processing power. You can run multiple queries in parallel, but you can also throw all your resources at a single massive query if you want.
As mentioned, Redshift is designed to operate across multiple nodes, rather than on a single server instance. Like everything else, this comes with both advantages and disadvantages. Bad news first: the cluster-based architecture of Redshift instances means there’s effectively a minimum size (and, therefore, price) for setting up a Redshift data warehouse. For a relatively small data operation, this could force you to buy more power than you really need.
But the benefits of this distributed architecture pretty quickly become apparent when it comes to performance. Because Redshift instances spread the work of querying and maintaining tables across multiple worker nodes, the speed of any particular operation isn’t limited by the speed of accessing a single disk. This means a Redshift instance can blow a single Postgres box out of the water in terms of the number of input/output operations per second (IOPS) that it’s capable of performing, which is another key factor in Redshift’s speed optimization.
Views: storing queries for later use
Another key difference between Redshift and PostgreSQL is the way that views are handled. If you’re a Postgres user, you may be used to setting up and storing views, including materialized views, for later use. This is one place where Redshift falls down compared to standard Postgres, as materialized views aren’t possible in vanilla Redshift. If this doesn’t sound familiar, it works like so: creating a view based on a SQL query is a great way to save yourself the hassle of re-running a frequently used query over and over again. If the query underlying that view takes a long time to run, though, you’re better off creating a materialized view, which loads the query results into storage at the time it’s run and keeps them there for later reference. But because of the way Redshift is architected, it’s not optimized for storing data in views, so trying to create standard views in Redshift will likely ding your performance. And like we said above, even if you are fine with slow views, you won’t be able to create and store them as materialized views.
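In standard PostgreSQL, the pattern looks like this (the `daily_sales` view and the underlying `sales` table are illustrative names):

```sql
-- PostgreSQL: persist the results of a slow aggregation so
-- later reads hit stored data instead of re-running the query.
CREATE MATERIALIZED VIEW daily_sales AS
SELECT sale_date, SUM(amount) AS total
FROM sales
GROUP BY sale_date;

-- The stored results go stale as `sales` changes;
-- refresh them on demand.
REFRESH MATERIALIZED VIEW daily_sales;
```

It’s this `CREATE MATERIALIZED VIEW` / `REFRESH` workflow that vanilla Redshift doesn’t give you.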
Luckily, Panoply adds view materialization back into Redshift, and even automates the process for you. You won’t even need to go about tracking and materializing your views on your own, as Panoply’s algorithm will track your queries and automatically materialize them for you based on how many times you’ve run them, how many rows they return and how long they take to run. In effect, you get the ease of setup and performance benefits of Redshift without losing the handy features of Postgres that you may already be using to make your life easier as an analyst.
Language and data types
If you’ve used Postgres or other SQL implementations before, you might also be curious about differences between PostgreSQL and Redshift at the language level. If you didn’t come here for a fine dissection of SQL dialects, feel free to skip this section.
Postgres and Redshift are different SQL dialects
It’s true, they are. The difference isn’t as stark as, say, Spanish and German, but it’s not quite as close as New York English and San Francisco English. It boils down to a few key differences, which you can read more about in the Redshift documentation if you’re curious:
- CREATE TABLE: Redshift does not support tablespaces and table partitioning. It also doesn’t support inheritance and certain other constraints. Redshift’s version of CREATE TABLE allows the user to define the sort and distribution algorithms for tables, which helps optimize data structures stored in Redshift for fast, parallel processing.
- ALTER TABLE: In Redshift, you also won’t be able to perform ALTER COLUMN-type actions, and ADD COLUMN is only possible for one column in each ALTER TABLE statement.
- COPY: because Redshift is an Amazon Web Services product, it’s optimized for use with other AWS products. That means that the COPY command is highly specialized for use with Amazon’s S3 buckets and Amazon DynamoDB tables, which makes it easier for Redshift to perform automatic compression.
- INSERT, UPDATE and DELETE: Redshift doesn’t support WITH clauses in INSERT, UPDATE and DELETE statements, so if that’s a familiar part of your workflow, check the documentation for best practices in writing these queries.
- VACUUM: VACUUM is one of the biggest points of difference between Redshift and standard PostgreSQL. Redshift defaults to VACUUM FULL, which re-sorts all rows as it reclaims disk space. Compare this to standard PostgreSQL, in which VACUUM only reclaims disk space to make it available for re-use. There are several other differences listed in the documentation (linked above), so if you’re coming from Postgres and are getting ready to run some VACUUM operations, be sure to check the documentation to see what else is different.
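A couple of these differences are easiest to see in DDL. Here’s a hedged sketch of a Redshift CREATE TABLE using the sort and distribution controls mentioned above (the table and column names are illustrative):

```sql
-- Redshift-specific DDL: distribution and sort keys are the main
-- physical-layout controls, in place of tablespaces/partitioning.
CREATE TABLE sales (
    sale_id   BIGINT,
    region    VARCHAR(16),
    sale_date DATE,
    amount    DECIMAL(12,2)
)
DISTKEY (region)      -- co-locate rows with the same region on a node
SORTKEY (sale_date);  -- store rows sorted by date for range scans

-- In Redshift this defaults to VACUUM FULL: it reclaims disk space
-- AND re-sorts rows, unlike the Postgres default.
VACUUM sales;
```

Choosing the distribution key to match common join columns and the sort key to match common filter columns is what lets the parallel, columnar machinery described above pay off.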
In building out from Postgres, Redshift’s developers also eliminated support for several data types that are supported in standard PostgreSQL. Per Amazon’s documentation, in Redshift, you won’t be able to work with:
- BIT, BIT VARYING
- Composite Types
- Date/Time Types
- Enumerated Types
- Geometric Types
- Network Address Types
- Numeric Types
- SERIAL, BIGSERIAL, SMALLSERIAL
- Object Identifier Types
- Range Types
- Text Search Types
In some cases, operations based on using these data types will throw errors, but others might be able to run on the leader node (but not the compute nodes) in a Redshift instance. Amazon also provides a better picture of how all that shakes out in their own documentation, so we won’t get too far into the weeds here.
Pricing: Redshift vs PostgreSQL
We mentioned this above, but you’re probably also looking for a more detailed breakdown of how Redshift and Postgres compare in terms of pricing. Postgres is a free and open source database package, so Redshift can’t really beat it there, as it’s a paid service. But if you want to set up a Postgres data warehouse, you’ll still have to pay for the hardware to run it on, whether you’re setting up an on-premises data warehouse or building something in the cloud. And if you want to set up a Postgres-based data warehouse that can beat a Redshift setup in terms of performance, you may even end up paying more in server costs than you would just going through AWS.
Now, you may also have picked up on the fact that Redshift is best suited for somewhat larger data operations, and Postgres might seem like the best choice for you if your data needs aren’t extensive. But there’s another way to get the speed and efficiency advantages of Redshift without having to pay top dollar for a data warehouse that’s bigger than you need. Panoply, which is built on top of Redshift, allows you to take advantage of the speed and efficiency benefits of a multi-node Redshift instance while paying a low, predictable monthly price. Panoply also takes care of all data warehouse maintenance tasks, and makes ETL and data integration a simple point-and-click affair.