This article will cover how to work with time series/datetime data in Redshift.
Time series data holds the date and/or time information and is different from string or numeric data since it has components like day, month, year, hours, etc., that we might want to access/compare/modify.
For example, we might want sales in the month of May or to find the difference in two dates as the number of months.
Redshift is a database provided as a part of AWS (Amazon Web Services), based on the Postgres database, but with some differences. Like most other relational databases such as MySQL, SQL Server, Oracle, DB2, etc., Redshift can handle and manipulate time series data.
Let's see how.
Firstly, let's look at which data types for time series are supported by Redshift and what kind of data they can hold.
For more information, check out the AWS documentation.
Now we know the different data types you can get for time series. But how do we get the results of those types of queries? Does it always have to come from data stored in columns?
No, we can get datetime results in two ways:
Let's see an example query.
select current_date today, current_timestamp now, current_time time_now, to_date( '28-10-2010', 'DD-MM-YYYY') today_from_str, current_date + 1 tomo
We've used Redshift's built-in functions to get today's date and time in the example above.
We used TO_DATE
to create a date from a string, and we used date arithmetic to get tomorrow's date from today. Note how the "now" column has both the date and time.
SYSDATE
and GETDATE
can also be used for the current timestamp, with the caveat that SYSDATE
returns the timestamp for the current transaction.
We saw that the type TIMESTAMPZ
above that can hold time zone info, but what is that, and why is it needed?
Because of the earth's rotation, X a.m./p.m. of any given day can be different in different locations around the world; we have time zones to capture this info.
For example, when it's sunrise in Japan, England is still in the dark, waiting for the earth's rotation to bring it to face the sun. Thus Japan and England are in different time zones.
The TIMESTAMPZ
data type captures the time zone along with the timestamp. If the time zone is not specified in a TIMESTAMPZ
type, it's assumed to be UTC.
So first, how do we get a TIMESTAMPZ
value?
Either we populate a TIMESTAMPZ
column with a string literal containing the time zone (e.g., '2021-10-31 13:00:00 IST'), or we use the TIMEZONE
function to create a TIMESTAMPZ
value from a timestamp. In the below example, we have specified EST as the time zone.
select TIMEZONE( 'EST', TIMESTAMP '2021-10-31 10:00:00')
The result is being displayed as UTC. UTC is ahead of EST by five hours, so the UTC time is 10+5=15.
Let's say my boss in the US wants me to attend an online meeting at 7 p.m. on Friday, October 31, 2021. What time will it be in India?
The CONVERT_TIMEZONE
function converts from one time zone to another, and here we want to convert from source EST to target IST. Unfortunately, it seems that Redshift is confusing IST with Israel time, so I'll also have to use the Asia string.
select CONVERT_TIMEZONE( 'EST', 'Asia/Kolkata', TIMESTAMP '2021-10-31 07:00:00 PM')
Oh no! I have to get up at 5:30 a.m. on a Saturday? :(
Next, let's now create some data for running more queries.
In this scenario, we'll say we have a business, and we're looking at how many of our products get sold within certain time frames.
Let's assume for the examples below that we have a table "sales," which has columns titled as follows: product_name, sale_date(timestamp), and amount(decimal). Then, we can create the sample data as shown below:
create table sales (id integer, product_name varchar(100), purchase_date timestamp, sale_date timestamp, amount integer); insert into sales values (1, 'Kellogs', null, to_timestamp('2021-10-29 07:00:00', 'YYYY-MM-DD HH:MI:SS') - interval '12 weeks', 20); insert into sales values (2, 'Kellogs', null, to_timestamp('2021-10-29 08:00:00', 'YYYY-MM-DD HH:MI:SS') + interval '2 hours', 10); insert into sales values (3, 'Kellogs', null, to_timestamp('2021-10-29 08:00:00', 'YYYY-MM-DD HH:MI:SS'), 10); insert into sales values (4, 'Milk', null, to_timestamp('2021-10-29 08:00:00', 'YYYY-MM-DD HH:MI:SS'), 5); update sales set purchase_date = dateadd( months, -id, sale_date);
We've used TO_TIMESTAMP
to parse the date from the string. HH24 should be used for hours when using the 24-hour time format.
Note the use of the INTERVAL
keyword to perform date arithmetic using specific time intervals.
Redshift may not support month, year in intervals in some contexts. In addition, we've populated the purchase_date column by subtracting the ID number of months from the sale_date, using the DATEADD
function.
It's also possible to generate such dates by adding the result of the ROW_NUMBER
function to some initial date.
Using the sample data we created above, let's learn some more information about the products.
For example, to find out which products are sold in the morning hours between 8 a.m. and 10 a.m., we could write the following:
select product_name, sum(amount) from sales where date_part( 'hour', sale_date) between 8 and 10 group by product_name order by 2 desc;
Here we're using the DATE_PART
function to extract the hour part. Note that using a function on an indexed column means the index cannot be used. Instead, a separate function index will have to be built on the expression.
So, in the above example, even if the sale_date
column has an index on it, it won't be used, as we are not using sale_date
directly.
To query for sales in the current quarter, run the following:
select * from sales where date_part( 'quarter', sale_date) = date_part( 'quarter', current_date) and date_part( 'year', sale_date) = date_part( 'year', current_date)
Here, we compare both the quarter and the year of the sale date with those of the current date.
To get sales from the first day of the month, we can use the DATE_TRUNC
function.
It will truncate the date/time parts that are less significant than the specified time part. Here, years are the most significant, and microseconds are the least significant.
In the below case, if we truncate till months, then the days, hours, and seconds parts that are less significant than the month will all be cleared/truncated. Similarly, we can get the first day of the year or the first day of the week.
There is a separate LAST_DAY
function that returns the last day of the month for a given date.
select date_trunc('month', sale_date ) first_day, last_day(sale_date) from sales;
We've seen some ways of working with dates. For example, adding two dates usually doesn't make sense (e.g., add 01-Jan-19 to 02-Feb-20). So usually, we add a date and a time interval (e.g., add a day or week or month or year to a given date).
We've already seen the use of the INTERVAL
keyword above to add/subtract time intervals to/from dates. You can also use the DATEADD
function for this purpose. The advantage it has over the INTERVAL
keyword is that the value of the interval to be added can be a dynamic expression involving columns, etc. DATEADD
can be used for subtraction also, by using negative time intervals.
The below query uses DATEADD
and ROW_NUMBER
to generate all dates of a given month (in this case, November).
For this purpose, we need a table that is sure to have at least 31 rows for the max of 31 days a month can have. So we use a system table PG_ATTRIBUTE_INFO
, which contains info about all columns and has many rows for system table attributes.
Here we have used EXTRACT
as an alternative to DATE_PART
. We have used common table expressions to be able to reuse the queries. DATE_TRUNC
is used to find the first day of the month.
with mdays as ( select top 31 row_number() over () rnum from pg_attribute_info), ldt as (select last_day( to_date('01-11-2021','DD-MM-YYYY')) lday ) select rnum, dateadd(day, rnum -1, date_trunc('month', lday)) from ldt, mdays where rnum <= extract( days from lday) order by rnum;
An alternative approach using the GENERATE_SERIES
function from Postgres is given below.
Though it works on Redshift, it's not officially supported per the documentation.
with ldt as (select last_day( to_date('01-11-2021','DD-MM-YYYY')) lday ) select dateadd(day, generate_series-1, date_trunc('month', lday) ) from generate_series(1,(select convert( int,date_part(day,lday)) from ldt),1), ldt;
After the addition of dates, how about subtraction? If we subtract one date from another, the result is a time interval.
While it's possible to simply subtract date1 from date2, we usually want to know the difference between dates in a particular time interval (e.g., days, months, or years). You can use the DATEDIFF
function for this purpose.
On the other hand, if we subtract an interval from a date, the result is a date. For this purpose, we can use the DATEADD
function.
select sum(amount) from sales where sale_date > current_date - interval '3 month'
Note that we could also use the DATEADD
or the ADD_MONTHS
functions instead of INTERVAL
.
select * from sales where datediff( month, purchase_date, sale_date ) > 2
OR
select * from sales where sale_date > dateadd( month, 2, purchase_date )
We have explored various ways of working with datetime data in Redshift.
INTERVAL
keywordDATE_PART
, DATE_TRUNC
, DATEADD
, and DATEDIFF
functionsAnd suppose you're looking for ways to manage this data in Redshift. In that case, the Panoply can load data from Amazon Redshift, amongst numerous other options to integrate with data sources and BI tools. So you can bring together and analyze data from various sources.
It can also perform transformations on that data with its low-code ETL tool. As they put it, "Everyone wins when data flows freely."
You can even request a free demo!