Manipulating Time Series/Datetime Data with SQL in Redshift

This article will cover how to work with time series/datetime data in Redshift.

Time series data holds the date and/or time information and is different from string or numeric data since it has components like day, month, year, hours, etc., that we might want to access/compare/modify.

For example, we might want sales in the month of May or to find the difference in two dates as the number of months.

Redshift is a database provided as a part of AWS (Amazon Web Services), based on the Postgres database, but with some differences. Like most other relational databases such as MySQLSQL ServerOracleDB2, etc., Redshift can handle and manipulate time series data.

Let's see how.

Datetime types

Firstly, let's look at which data types for time series are supported by Redshift and what kind of data they can hold.

  • DATE: Dates without the time part
  • TIME: Only the time part
  • TIMEZ: Time with time zone info (since time in New York is behind that in Tokyo, for instance, time zones reflect this difference)
  • TIMESTAMP: Includes both the date and time
  • TIMESTAMPZ: Timestamp along with the time zone info

For more information, check out the AWS documentation.

Fetching datetime results

Now we know the different data types you can get for time series. But how do we get the results of those types of queries? Does it always have to come from data stored in columns? 

No, we can get datetime results in two ways:

  • Selecting a datetime column in the query
  • Using a function to create a datetime from some other input

Let's see an example query.

Current datetime example

select current_date today, current_timestamp now, current_time time_now, to_date( '28-10-2010', 'DD-MM-YYYY')

today_from_str, current_date + 1 tomo

current datetime example

We've used Redshift's built-in functions to get today's date and time in the example above.

We used TO_DATE to create a date from a string, and we used date arithmetic to get tomorrow's date from today. Note how the "now" column has both the date and time. 

SYSDATE and GETDATE can also be used for the current timestamp, with the caveat that SYSDATE returns the timestamp for the current transaction.

Working with time zones

We saw that the type TIMESTAMPZ above that can hold time zone info, but what is that, and why is it needed?

Because of the earth's rotation, X a.m./p.m. of any given day can be different in different locations around the world; we have time zones to capture this info.

For example, when it's sunrise in Japan, England is still in the dark, waiting for the earth's rotation to bring it to face the sun. Thus Japan and England are in different time zones.

The TIMESTAMPZ data type captures the time zone along with the timestamp. If the time zone is not specified in a TIMESTAMPZ type, it's assumed to be UTC.

So first, how do we get a TIMESTAMPZ value?

Either we populate a TIMESTAMPZ column with a string literal containing the time zone (e.g., '2021-10-31 13:00:00 IST'), or we use the TIMEZONE function to create a TIMESTAMPZ value from a timestamp. In the below example, we have specified EST as the time zone.

select TIMEZONE( 'EST', TIMESTAMP '2021-10-31 10:00:00')

timezone

The result is being displayed as UTC. UTC is ahead of EST by five hours, so the UTC time is 10+5=15.

Let's say my boss in the US wants me to attend an online meeting at 7 p.m. on Friday, October 31, 2021. What time will it be in India?

The CONVERT_TIMEZONE function converts from one time zone to another, and here we want to convert from source EST to target IST. Unfortunately, it seems that Redshift is confusing IST with Israel time, so I'll also have to use the Asia string.


select CONVERT_TIMEZONE( 'EST', 'Asia/Kolkata', TIMESTAMP '2021-10-31 07:00:00 PM')

convert_timezone

Oh no! I have to get up at 5:30 a.m. on a Saturday? :(

Creating sample data and examples

Next, let's now create some data for running more queries.

In this scenario, we'll say we have a business, and we're looking at how many of our products get sold within certain time frames.

Let's assume for the examples below that we have a table "sales," which has columns titled as follows: product_name, sale_date(timestamp), and amount(decimal). Then, we can create the sample data as shown below:

create table sales (id integer, product_name varchar(100), purchase_date timestamp, sale_date timestamp, 
amount integer);

insert into sales values (1, 'Kellogs', null, to_timestamp('2021-10-29 07:00:00', 'YYYY-MM-DD HH:MI:SS') 
- interval '12 weeks', 20);
insert into sales values (2, 'Kellogs', null, to_timestamp('2021-10-29 08:00:00', 'YYYY-MM-DD HH:MI:SS') 
+ interval '2 hours', 10);
insert into sales values (3, 'Kellogs', null, to_timestamp('2021-10-29 08:00:00', 'YYYY-MM-DD HH:MI:SS'), 10);
insert into sales values (4, 'Milk', null, to_timestamp('2021-10-29 08:00:00', 'YYYY-MM-DD HH:MI:SS'), 5);

update sales set purchase_date = dateadd( months, -id, sale_date);

 

We've used TO_TIMESTAMP to parse the date from the string. HH24 should be used for hours when using the 24-hour time format.

Note the use of the INTERVAL keyword to perform date arithmetic using specific time intervals.

Redshift may not support month, year in intervals in some contexts. In addition, we've populated the purchase_date column by subtracting the ID number of months from the sale_date, using the DATEADD function.

It's also possible to generate such dates by adding the result of the ROW_NUMBER function to some initial date.

Sales in the morning example

Using the sample data we created above, let's learn some more information about the products.

For example, to find out which products are sold in the morning hours between 8 a.m. and 10 a.m., we could write the following:

select product_name, sum(amount)

from sales

where date_part( 'hour', sale_date) between 8 and 10

group by product_name

order by 2 desc;

 

Here we're using the DATE_PART function to extract the hour part. Note that using a function on an indexed column means the index cannot be used. Instead, a separate function index will have to be built on the expression.

So, in the above example, even if the sale_date column has an index on it, it won't be used, as we are not using sale_date directly.

Sales in the current quarter example

To query for sales in the current quarter, run the following:

select *

from sales

where date_part( 'quarter', sale_date) = date_part( 'quarter', current_date)

and date_part( 'year', sale_date) = date_part( 'year', current_date)

 

Here, we compare both the quarter and the year of the sale date with those of the current date.

First day of the month example

To get sales from the first day of the month, we can use the DATE_TRUNC function.

It will truncate the date/time parts that are less significant than the specified time part. Here, years are the most significant, and microseconds are the least significant.

In the below case, if we truncate till months, then the days, hours, and seconds parts that are less significant than the month will all be cleared/truncated. Similarly, we can get the first day of the year or the first day of the week.

There is a separate LAST_DAY function that returns the last day of the month for a given date.

select date_trunc('month', sale_date ) first_day, last_day(sale_date) from sales;

Date addition

We've seen some ways of working with dates. For example, adding two dates usually doesn't make sense (e.g., add 01-Jan-19 to 02-Feb-20). So usually, we add a date and a time interval (e.g., add a day or week or month or year to a given date).

We've already seen the use of the INTERVAL keyword above to add/subtract time intervals to/from dates. You can also use the DATEADD function for this purpose. The advantage it has over the INTERVAL keyword is that the value of the interval to be added can be a dynamic expression involving columns, etc. DATEADD can be used for subtraction also, by using negative time intervals.

Generating all days of the month example

The below query uses DATEADD and ROW_NUMBER to generate all dates of a given month (in this case, November).

For this purpose, we need a table that is sure to have at least 31 rows for the max of 31 days a month can have. So we use a system table PG_ATTRIBUTE_INFO, which contains info about all columns and has many rows for system table attributes.

Here we have used EXTRACT as an alternative to DATE_PART. We have used common table expressions to be able to reuse the queries. DATE_TRUNC is used to find the first day of the month.

with mdays as ( select top 31 row_number() over () rnum from pg_attribute_info),

ldt as (select last_day( to_date('01-11-2021','DD-MM-YYYY')) lday )

select rnum, dateadd(day, rnum -1, date_trunc('month', lday))

from   ldt, mdays

where  rnum <= extract( days from lday)

order  by rnum;

 

An alternative approach using the GENERATE_SERIES function from Postgres is given below.

Though it works on Redshift, it's not officially supported per the documentation.

with ldt as (select last_day( to_date('01-11-2021','DD-MM-YYYY')) lday )

select dateadd(day, generate_series-1, date_trunc('month', lday) )

from generate_series(1,(select convert( int,date_part(day,lday)) from ldt),1), ldt;

Date subtraction

After the addition of dates, how about subtraction? If we subtract one date from another, the result is a time interval.

While it's possible to simply subtract date1 from date2, we usually want to know the difference between dates in a particular time interval (e.g., days, months, or years). You can use the DATEDIFF function for this purpose.

On the other hand, if we subtract an interval from a date, the result is a date. For this purpose, we can use the DATEADD function.

Sales for the last 3 months example

select sum(amount) from sales where sale_date > current_date - interval '3 month'

 

Note that we could also use the DATEADD or the ADD_MONTHS functions instead of INTERVAL.

Sale is 2 months later than the purchase example

select * from sales where datediff( month, purchase_date, sale_date ) > 2

 

OR

 

select * from sales where sale_date > dateadd( month, 2, purchase_date )

Summary

We have explored various ways of working with datetime data in Redshift.

  • Date arithmetic
  • The INTERVAL keyword
  • The DATE_PARTDATE_TRUNCDATEADD, and DATEDIFF functions

And suppose you're looking for ways to manage this data in Redshift. In that case, the Panoply can load data from Amazon Redshift, amongst numerous other options to integrate with data sources and BI tools. So you can bring together and analyze data from various sources.

It can also perform transformations on that data with its low-code ETL tool. As they put it, "Everyone wins when data flows freely."

You can even request a free demo!

Get a free consultation with a data architect to see how to build a data warehouse in minutes.
Request Demo
Read more in:
Share this post:

Contents

Work smarter, better, and faster with monthly tips and how-tos.