
Introducing Panoply.io’s Upsert Mechanism

Written by Alon Brody | March 28, 2016

One of the most critical issues in data warehouse management is avoiding duplications. When you periodically load data into your data warehouse, you will want to skip the rows that have already been inserted. For example, in a user data table, you don’t want to insert the same users over and over again. Luckily, there is a simple and efficient way to deal with duplications, one that is included in most databases: id assignment.

The id column must be unique, and the definition of each id in each table may be simple (such as incremental numbering) or complex (based on values in other fields). While this solution helps us avoid duplications, it also gives birth to a whole new issue: now we need to insert the new rows into our data warehouse and update the old rows.
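
For illustration only, here is a rough Redshift sketch of the two id styles. The table and column names, and the raw_events source table, are made up for this example and are not from the original post:

    -- Simple: an auto-incrementing numeric id
    create table events_simple (
        id        bigint identity(1, 1),
        user_name varchar(256),
        event_ts  timestamp
    );

    -- Complex: an id derived from values in other fields (here, a hash of name + timestamp)
    create table events_derived as
    select
        md5(user_name || '|' || event_ts::varchar) as id,
        user_name,
        event_ts
    from raw_events;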


This process is called upsert (update and insert), and the logic guiding it is to first update the relevant rows and then insert the rest. It is well known in many databases, including Redshift, which is what Panoply.io is built on. The upsert process the Redshift documentation suggests using is:

[Image: the upsert process suggested by the Redshift documentation (table_introducing_panoply_upsert.png)]
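
The original post shows that process only as an image. As a rough sketch of the staging-table merge pattern it describes, assuming a target table named users and a source/staging table named users_staging (both names are placeholders; "time" is quoted only because it is also a SQL type name):

    begin transaction;

    -- 1. Update the target rows that already exist, using the staging data
    update users
    set name = s.name, "time" = s."time"
    from users_staging s
    where users.id = s.id;

    -- 2. Delete the staging rows that were just used for the update
    delete from users_staging
    using users
    where users_staging.id = users.id;

    -- 3. Insert whatever is left in staging (the genuinely new rows)
    insert into users (id, name, "time")
    select id, name, "time"
    from users_staging;

    end transaction;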

While this process does the job, we’ve encountered a few recurring issues which are definitely not edge cases. Here are two of them, along with how we solved them:

  1. Source table duplications
  2. Slow and disk-consuming processes

Duplications in the Source Table

We frequently encounter source duplications, both by mistake and by design. As long as an id is being used, the end result should be one row per id. Let’s look at an example.

Target Table:

id   name   time
1    Joe    2015-07-16 15:23:40
2    Dave   2015-07-01 03:51:12
3    Bill   2015-08-05 22:30:34

Source Table:

id   name   time
4    Ken    2015-08-09 10:03:41
1    Joel   2015-08-09 10:22:18
4    Ben    2015-08-09 10:03:50

In this case it’s easy to understand which rows should be updated with which data. The target table we expect to receive after the upsert is:

id   name   time
1    Joel   2015-08-09 10:22:18
2    Dave   2015-07-01 03:51:12
3    Bill   2015-08-05 22:30:34
4    Ben    2015-08-09 10:03:50

However, if we had used the upsert process shown above, the result would have been:

id   name   time
1    Joel   2015-08-09 10:22:18
2    Dave   2015-07-01 03:51:12
3    Bill   2015-08-05 22:30:34
4    Ken    2015-08-09 10:03:41
4    Ben    2015-08-09 10:03:50

As you can see, the update worked fine, but when we got to the insert, two new rows with id = 4 were recognized and both were inserted. This type of faulty result came up on multiple occasions with most of our customers. The solution we came up with at Panoply.io was to use not only the id but also a rank (named idrank). We assign a rank to each occurrence of an id in the source data, and then use the idrank to decide which data we should insert/update in every step. Our main rule for the upsert mechanism is that there will never be two identical ids in one batch. After the addition of idrank, the source table will look like this:

id   name   time                  idrank
4    Ken    2015-08-09 10:03:41   1
1    Joel   2015-08-09 10:22:18   1
4    Ben    2015-08-09 10:03:50   2

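The post doesn’t show how the rank is computed, but one natural way to produce it (our assumption here, not necessarily Panoply.io’s exact query) is Redshift’s ROW_NUMBER() window function over the staging table sketched earlier:

    -- Number duplicate ids in the staging data, e.g. by their timestamp
    create table users_staging_ranked as
    select
        id,
        name,
        "time",
        row_number() over (partition by id order by "time") as idrank
    from users_staging;
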
Now we need to run the upsert mechanism X times, where X = max(idrank), once for each idrank we have defined. By adding idrank = X to each query in the upsert process, we are able to handle each upsert separately and safely. This add-on splits the source table into X different “sources”. There are two sources for the given example:

id   name   time                  idrank
4    Ken    2015-08-09 10:03:41   1
1    Joel   2015-08-09 10:22:18   1

id   name   time                  idrank
4    Ben    2015-08-09 10:03:50   2

Each table is handled by the upsert mechanism separately because of its different idrank. The following script will run X times (X = max(idrank) from the source table):


[Image: the rank-aware upsert script (table_introducing_panoply_upsert_mechanism_2.png)]
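
The script itself is shown as an image in the original post; one pass of it would look roughly like the following, written against the ranked staging table sketched above, with idrank = 1 hard-coded (the same block is repeated with 2, 3, ... up to max(idrank)):

    begin transaction;

    -- Update target rows from staging rows of the current rank only
    update users
    set name = s.name, "time" = s."time"
    from users_staging_ranked s
    where users.id = s.id
      and s.idrank = 1;

    -- Delete the staging rows of the current rank that were just used for the update
    delete from users_staging_ranked
    using users
    where users_staging_ranked.id = users.id
      and users_staging_ranked.idrank = 1;

    -- Insert the remaining rows of the current rank
    insert into users (id, name, "time")
    select id, name, "time"
    from users_staging_ranked
    where idrank = 1;

    end transaction;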

Once this solution is implemented, the source table will be upserted twice. In the first upsert there will be one update of id=1 (name = Joel) and one insert of id=4 (name = Ken). In the second upsert there will be only one update of id=4 (name = Ben), which will be the final value of the name field for id=4.

Slow and Disk-Consuming Process

Once we worked that out, we thought we had conquered the upsert mechanism. Little did we know that the delete part of the process was extremely slow and disk consuming on large target tables (hundreds of millions of rows). So we did what everyone does: we searched the web and forums for solutions. Some suggested adding the id field to the distribution key, vacuuming the data before each upsert, or using smaller batches in each upsert process. None worked, because the main issue was the size of the target table (which was already vacuumed on a daily basis), and we couldn’t use the id as a distribution key since it was already in use for query optimization.

We noticed that updating a table is much faster than deleting rows from it: a matter of seconds or minutes for updating, compared to 20+ minutes for deleting. On top of that, the delete part of the upsert sometimes crashed the Redshift cluster by consuming disk space. So instead of deleting the updated rows, we decided to try updating the source table and marking the updated rows. This is a simple update query on the source table:

[Image: the update query that marks already-upserted source rows (table_introducing_panoply_upsert_mechanism_3.png)]
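
The query appears as an image in the original post; a sketch of the idea, reusing the placeholder names from above and the __deleted flag the post refers to later, could look like this (we assume the flag column is added to the staging table up front):

    -- Add the flag once, when the staging table is created or loaded
    alter table users_staging_ranked add column __deleted int default 0;

    -- Instead of deleting, mark the staging rows whose id already exists in the target
    -- (inside the rank-aware loop this would also filter on the current idrank)
    update users_staging_ranked
    set __deleted = 1
    from users
    where users_staging_ranked.id = users.id;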

This fix worked but still incurred high disk space usage. The actual run time decreased, but our Redshift disk space was at risk. At least we had an inkling that we were on the right path, so we decided to compare three alternative update methods:

[Image: comparison of the three alternative update methods (table_introducing_panoply_upsert_mechanism_4.png)]

Obviously we found a winner 😉 We managed to reduce 13+ minutes of processing time (plus the ‘bonus’ of the occasional failed process) to less than a minute, and along the way we resolved the disk space issue.

The only thing left to do was to adjust the insert part of the process to look for rows with __deleted <> 1. Our upsert process is now ready for high scale and resistant to duplications. The final upsert is:

[Image: the final upsert process (table_introducing_panoply_upsert_mechanism_5.png)]
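
Sketched once more with the placeholder names used above, one pass of the final, delete-free process might look like this (idrank = 1 shown; repeat per rank up to max(idrank)):

    begin transaction;

    -- 1. Update target rows from staging rows of the current rank
    update users
    set name = s.name, "time" = s."time"
    from users_staging_ranked s
    where users.id = s.id
      and s.idrank = 1;

    -- 2. Mark (rather than delete) the staging rows that were just used for the update
    update users_staging_ranked
    set __deleted = 1
    from users
    where users_staging_ranked.id = users.id
      and users_staging_ranked.idrank = 1;

    -- 3. Insert only the unmarked rows of the current rank
    insert into users (id, name, "time")
    select id, name, "time"
    from users_staging_ranked
    where __deleted <> 1
      and idrank = 1;

    end transaction;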


With these two add-ons to the upsert process, we managed to increase both speed and efficiency. In summary, this is how you go from a slow, simple upsert process that cannot handle duplications to a blazing-fast process that knows how to adapt when it encounters them.

Stay tuned for more of the ongoing procedures and adjustments we’ve developed to improve our upsert process. If you have any upsert issues or improvements you’d like us to take a look at or share, now’s the time!

From raw data to analysis in under 10 minutes.

Sign up now for a demo or a free trial of the Panoply.io platform.
