One of the first big challenges in my career was migrating an application from Microsoft SQL Server to Oracle.
A coworker and I were tasked with the migration, which included the translation of views, functions, and stored procedures from one database system to the other.
I'm sure there were easier solutions for many of the problems we faced, but we were fresh out of college and didn't have that much experience.
Why am I taking a trip down memory lane? Because database migration is hard work!
It gets even more complicated when the databases involved have numerous differences. For instance, performing a MySQL to MariaDB migration is much simpler than a SQL Server to Oracle one.
Below, I'll share tips on how to do a proper PostgreSQL to BigQuery migration and cover some of the challenges you'll face. Let's get started!
Why a PostgreSQL to BigQuery SQL migration is so challenging
For starters, let's make sure we're on the same page regarding terminology.
(Feel free to skip this section if some of the parts feel too basic for you.)
SQL stands for "structured query language."
One of the challenges in working with the SQL language is that it has many dialects.
Even though SQL is standardized, each database system can tweak the language in some way, adding unique features or functions.
Different data types
The data types themselves differ from database to database. You can run into situations where multiple data types from the source database translate to a single data type on the destination database—or vice versa.
This scenario might result in problems such as loss of precision after migrating.
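To make this concrete, here's a sketch of what precision loss can look like. PostgreSQL's numeric type allows very high, user-defined precision, while BigQuery's NUMERIC is capped at 38 digits of precision and 9 digits of scale, so extra fractional digits are rounded away on conversion:

```sql
-- BigQuery's NUMERIC holds at most 9 fractional digits, so a value
-- that fit comfortably in a PostgreSQL numeric column gets rounded:
SELECT CAST('1.123456789123' AS NUMERIC) AS rounded_value;
-- rounded_value = 1.123456789
```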
Missing data types
Another possible issue is when one or more data types from the source database simply don't exist on the other. In such a case, you'd typically have to improvise and use alternative data types at your disposal.
You have to be ready to run into those challenges when you're migrating from PostgreSQL to BigQuery.
BigQuery SQL syntax: standard vs. legacy
BigQuery SQL offers two types of syntax: the standard syntax and the legacy syntax. You can choose between the two dialects every time you run a BigQuery query.
While both syntaxes work, it's advisable to use the standard syntax; Google recommends it, and legacy SQL is kept around mainly for backward compatibility.
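The most visible difference between the two dialects is how you reference tables. The two queries below, run against BigQuery's public shakespeare sample dataset, are equivalent; note the square brackets and colon in legacy syntax versus backticks and dots in standard syntax (each query is run separately, with the prefix comment selecting the dialect):

```sql
#legacySQL
SELECT word FROM [bigquery-public-data:samples.shakespeare] LIMIT 5;

-- The same query in standard syntax:
#standardSQL
SELECT word FROM `bigquery-public-data.samples.shakespeare` LIMIT 5;
```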
How do data types differ in BigQuery vs. PostgreSQL?
As you've seen, one of the challenges when undertaking a database migration is the difference in data types from one system to another.
The following table features some of the main data types in PostgreSQL with their closest equivalent in BigQuery:

- smallint, integer, bigint → INT64
- real, double precision → FLOAT64
- numeric, decimal → NUMERIC (or BIGNUMERIC when more precision is needed)
- boolean → BOOL
- char, varchar, text → STRING
- bytea → BYTES
- date → DATE
- timestamp → TIMESTAMP (or DATETIME, when no time zone is involved)
- json, jsonb → JSON
- money, uuid, enum → no direct equivalent
You've likely noticed some interesting facts from the table above.
Multiple data type equivalents
First, there are several cases in which more than one type in PostgreSQL translates to a single type in BigQuery. For instance, in PostgreSQL, the following types represent an integer number:
- integer: a 32-bit signed integer
- smallint: a 16-bit signed integer
- bigint: a 64-bit signed integer
In BigQuery, there's only one type to represent integers: the INT64 type, which, unsurprisingly, is a 64-bit integer. You might encounter names like INT, SMALLINT, INTEGER, BIGINT, TINYINT, and BYTEINT. But these aren't different types; instead, they're simply aliases of INT64.
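Nothing stops you from using those aliases in your DDL, though. In the sketch below (the table name is just an illustration), all three columns end up with exactly the same type:

```sql
-- SMALLINT, INTEGER, and BIGINT are all aliases of INT64 in BigQuery,
-- so these three columns have the same 64-bit integer type:
CREATE TEMP TABLE numbers (
  a SMALLINT,
  b INTEGER,
  c BIGINT
);
```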
Lack of data type equivalent
Secondly, there are data types from PostgreSQL that don't have a BigQuery equivalent.
Missing data types are a common challenge during these types of migrations. Most of the time, you can make do by substituting a similar data type for the missing one.
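As a sketch of what that substitution might look like (the dataset, table, and column names here are hypothetical), PostgreSQL's uuid and money types could be mapped like this:

```sql
-- Hypothetical replacements for PostgreSQL types with no BigQuery equivalent:
--   uuid  -> STRING  (BigQuery can generate values with GENERATE_UUID())
--   money -> NUMERIC (an exact decimal type, avoiding float rounding)
CREATE TABLE mydataset.orders (
  order_id STRING DEFAULT GENERATE_UUID(),  -- was: uuid
  amount   NUMERIC                          -- was: money
);
```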
Artifacts in BigQuery standard syntax
Now let's cover how to create two types of artifacts in Google BigQuery using the standard syntax for SQL, namely user-defined functions and subqueries.
A user-defined function can be either temporary or persistent.
A temporary UDF exists only during a single query, while a persistent UDF spans multiple queries and can get reused.
A UDF is a pure function: it can't talk to external dependencies or mutate data. That makes UDFs safe to call, even when shared between owners.
The following code example shows the creation and calling of a temporary UDF:
CREATE TEMP FUNCTION AddAndDouble(x INT64, y INT64)
RETURNS INT64
AS ((x + y) * 2);

SELECT AddAndDouble(8, 2);  -- returns 20
Now, let's see the code to create the same function, but as a persistent function:
CREATE FUNCTION myfirstdataset.AddAndDouble(x INT64, y INT64)
RETURNS INT64
AS ((x + y) * 2);
As you can see, the declaration drops the TEMP keyword, and you need to specify a dataset for the persistent function to live in.
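Once created, the persistent function is called like any other, qualified with its dataset:

```sql
-- Calling the persistent UDF, qualified with the dataset it lives in:
SELECT myfirstdataset.AddAndDouble(8, 2) AS result;
-- result = 20
```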
BigQuery offers four types of expression subqueries (i.e., subqueries that can be used wherever expressions can be used): scalar, ARRAY, IN, and EXISTS. Let's look at each of them.
Scalar subqueries return a single value. The inner SELECT must produce exactly one column (more than one is an error) and at most one row (more than one fails at runtime).
Let's look at an example:
SELECT
  title,
  (SELECT description FROM Categories WHERE Titles.category = id) AS product_category
FROM Titles;
An array subquery returns an array.
Consider the following example:
SELECT
  id,
  ARRAY(SELECT description FROM db.class WHERE class.teacherId = teacher.id) AS classes
FROM db.teacher;
The syntax of an array subquery is similar to that of PostgreSQL. To avoid errors, make sure the SELECT inside the subquery has exactly one column.
IN subqueries return TRUE when the value exists in the list returned by the subquery, and FALSE otherwise.
The syntax is as follows:
value [ NOT ] IN ( subquery )
The SELECT in the subquery must have a single column, and its type must be compatible with the type of the value to be checked.
SELECT id, title, name
FROM db.teacher
WHERE id IN (SELECT teacherId FROM db.class);
The query above will retrieve teachers who are assigned to teach at least one class.
Last but not least, we have the EXISTS subqueries.
The usage and syntax of this type of subquery are simple and very similar to PostgreSQL's:
SELECT EXISTS(SELECT name FROM Teachers WHERE salary >= 1500) AS result;
An EXISTS subquery returns TRUE when there's at least one result, and FALSE otherwise.
Wrapping things up
In this post, I covered some tips on how to convert your PostgreSQL queries to Google BigQuery.
As you've seen, BigQuery supports two types of SQL syntax: the standard syntax and the legacy one. Though both work, it's advisable to switch to the standard syntax if you aren't using it already.