One of the first big challenges in my career was migrating an application from Microsoft SQL Server to Oracle.
A coworker and I were tasked with the migration, which included the translation of views, functions, and stored procedures from one database system to the other.
I'm sure there were easier solutions for many of the problems we faced, but we were fresh out of college and didn't have that much experience.
Why am I taking a trip down memory lane? Because database migration is hard work!
It gets even more complicated when the databases involved have numerous differences. For instance, performing a MySQL to MariaDB migration is much simpler than a SQL Server to Oracle one.
Below, I'll share tips on how to do a proper PostgreSQL to BigQuery migration and cover some of the challenges you'll face. Let's get started!
Why a PostgreSQL to BigQuery SQL migration is so challenging
For starters, let's make sure we're on the same page regarding terminology.
(Feel free to skip this section if some of the parts feel too basic for you.)
SQL stands for "structured query language."
One of the challenges in working with the SQL language is that it has many dialects.
Even though SQL is standardized, each database system can tweak the language in some way, adding unique features or functions.
Different data types
The data types themselves differ from database to database. You can run into situations where multiple data types from the source database translate to a single data type on the destination database—or vice versa.
This scenario might result in problems such as loss of precision after migrating.
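To make this concrete, here's a sketch of what precision loss can look like. PostgreSQL's numeric type allows very high, user-defined precision, while BigQuery's NUMERIC is capped at 38 digits of precision and 9 digits of scale, so extra fractional digits are rounded away on conversion:

```sql
-- BigQuery's NUMERIC holds at most 9 fractional digits, so a value
-- that fit comfortably in a PostgreSQL numeric column gets rounded:
SELECT CAST('1.123456789123' AS NUMERIC) AS rounded_value;
-- rounded_value = 1.123456789
```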
Missing data types
Another possible issue is when one or more data types from the source database simply don't exist on the other. In such a case, you'd typically have to improvise and use alternative data types at your disposal.
You have to be ready to run into those challenges when you're migrating from PostgreSQL to BigQuery.
BigQuery SQL syntax: standard vs. legacy
BigQuery SQL offers two types of syntax: the standard syntax and the legacy syntax. You can choose between the two dialects every time you run a BigQuery query.
While both syntaxes work, it's advisable to use the standard syntax; Google recommends it, and legacy SQL is kept around mainly for backward compatibility.
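The most visible difference between the two dialects is how you reference tables. The two queries below, run against BigQuery's public shakespeare sample dataset, are equivalent; note the square brackets and colon in legacy syntax versus backticks and dots in standard syntax (each query is run separately, with the prefix comment selecting the dialect):

```sql
#legacySQL
SELECT word FROM [bigquery-public-data:samples.shakespeare] LIMIT 5;

-- The same query in standard syntax:
#standardSQL
SELECT word FROM `bigquery-public-data.samples.shakespeare` LIMIT 5;
```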
How do data types differ in BigQuery vs. PostgreSQL?
As you've seen, one of the challenges when undertaking a database migration is the difference in data types from one system to another.
The following table features some of the main data types in PostgreSQL with their closest equivalent in BigQuery:

- smallint, integer, bigint → INT64
- real, double precision → FLOAT64
- numeric, decimal → NUMERIC (or BIGNUMERIC when more precision is needed)
- boolean → BOOL
- char, varchar, text → STRING
- bytea → BYTES
- date → DATE
- timestamp → TIMESTAMP (or DATETIME, when no time zone is involved)
- json, jsonb → JSON
- money, uuid, enum → no direct equivalent
You've likely noticed some interesting facts from the table above.
Multiple data type equivalents
First, there are several cases in which more than one type in PostgreSQL translates to a single type in BigQuery. For instance, in PostgreSQL, the following types represent an integer number:
- integer: a 32-bit signed integer
- smallint: a 16-bit signed integer
- bigint: a 64-bit signed integer
In BigQuery, there's only one type to represent integers: the INT64 type, which, unsurprisingly, is a 64-bit integer. You might encounter names like INT, SMALLINT, INTEGER, BIGINT, TINYINT, and BYTEINT. But these aren't different types; instead, they're simply aliases of INT64.
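Nothing stops you from using those aliases in your DDL, though. In the sketch below (the table name is just an illustration), all three columns end up with exactly the same type:

```sql
-- SMALLINT, INTEGER, and BIGINT are all aliases of INT64 in BigQuery,
-- so these three columns have the same 64-bit integer type:
CREATE TEMP TABLE numbers (
  a SMALLINT,
  b INTEGER,
  c BIGINT
);
```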
Lack of data type equivalent
Secondly, there are data types from PostgreSQL that don't have a BigQuery equivalent.
Missing data types are a common challenge during these types of migrations. Most of the time, you can make do by substituting a similar data type for the missing one.
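As a sketch of what that substitution might look like (the dataset, table, and column names here are hypothetical), PostgreSQL's uuid and money types could be mapped like this:

```sql
-- Hypothetical replacements for PostgreSQL types with no BigQuery equivalent:
--   uuid  -> STRING  (BigQuery can generate values with GENERATE_UUID())
--   money -> NUMERIC (an exact decimal type, avoiding float rounding)
CREATE TABLE mydataset.orders (
  order_id STRING DEFAULT GENERATE_UUID(),  -- was: uuid
  amount   NUMERIC                          -- was: money
);
```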
Artifacts in BigQuery standard syntax
Now let's cover how to create two types of artifacts in Google BigQuery using the standard syntax for SQL, namely user-defined functions and subqueries.
A user-defined function can be either temporary or persistent.
A temporary UDF exists only during a single query, while a persistent UDF spans multiple queries and can get reused.
A UDF is a pure function: it can't talk to external dependencies or mutate data. That makes UDFs safe to call, even when shared between owners.
The following code example shows the creation and calling of a temporary UDF:
CREATE TEMP FUNCTION AddAndDouble(x INT64, y INT64)
RETURNS INT64
AS ((x + y) * 2);

SELECT AddAndDouble(8, 2);  -- returns 20
Now, let's see the code to create the same function, but as a persistent function:
CREATE FUNCTION myfirstdataset.AddAndDouble(x INT64, y INT64)
RETURNS INT64
AS ((x + y) * 2);
As you can see, the declaration drops the TEMP keyword, and you need to specify a dataset for the persistent function to live in.
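Once created, the persistent function is called like any other, qualified with its dataset:

```sql
-- Calling the persistent UDF, qualified with the dataset it lives in:
SELECT myfirstdataset.AddAndDouble(8, 2) AS result;
-- result = 20
```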
BigQuery offers four types of expression subqueries (i.e., subqueries that can be used wherever expressions can be used): scalar, ARRAY, IN, and EXISTS. Let's look at each of them.
Scalar subqueries return a single value. The inner SELECT must produce exactly one column (more than one is an error) and at most one row (more than one fails at runtime).
Let's look at an example:
SELECT
  title,
  (SELECT description FROM Categories WHERE Titles.category = id) AS product_category
FROM Titles;
An array subquery returns an array.
Consider the following example:
SELECT
  id,
  ARRAY(SELECT description FROM db.class WHERE class.teacherId = teacher.id) AS classes
FROM db.teacher;
The syntax of an array subquery is similar to that of PostgreSQL. To avoid errors, make sure the SELECT inside the subquery has exactly one column.
IN subqueries return TRUE when the value exists in the list returned by the subquery, and FALSE otherwise.
The syntax is as follows:
value [ NOT ] IN ( subquery )
The SELECT in the subquery must have a single column, and its type must be compatible with the type of the value to be checked.
SELECT id, title, name
FROM db.teacher
WHERE id IN (SELECT teacherId FROM db.class);
The query above will retrieve teachers who are assigned to teach at least one class.
Last but not least, we have the EXISTS subqueries.
The usage and syntax of this type of subquery are simple and very similar to PostgreSQL's:
SELECT EXISTS(SELECT name FROM Teachers WHERE salary >= 1500) AS result;
An EXISTS subquery returns TRUE when there's at least one result, and FALSE otherwise.
Wrapping things up
In this post, I covered some tips on how to convert your PostgreSQL queries to Google BigQuery.
As you've seen, BigQuery supports two types of SQL syntax: the standard syntax and the legacy one. Though both work, it's advisable to switch to the standard syntax if you aren't using it already.