Creating Elapsed Time Histograms in SQL Using DATEDIFF

DATEDIFF is a super useful function that returns the difference between date parts (or dateparts) of two date or time expressions. For this post we will use data from: Hubspot Lead Analytics in SQL, Beyond CRM Reporting and will use DATEDIFF to answer the question: How much time does it take for a “contact” to convert to a “marketing qualified lead” (MQL)? To take this one step further, we’ll ask, what is the distribution of the amount of time it takes for a “contact” to become a “MQL”? We’ll answer this question using a SQL query and use the resulting dataset to create a quick visualization that will clearly communicate our findings.

DATEDIFF: All the things you need to know

Syntax

DATEDIFF ( datepart, {date|timestamp}, {date|timestamp} )

Arguments

DATEDIFF takes 3 arguments: a datepart, and 2 date or time expressions. The function subtracts the first date/time value from the second date/time value. So if the first argument is prior to the second, you’ll get a positive value. If the first value is more recent than the second (larger), the result will be negative.

Datepart

The datepart is the part of the date or time expression that the function operates on (for example, year, month, day or hour). You can find a more comprehensive list of dateparts here.

 

An important thing to note here is that DATEDIFF pays attention to datepart boundaries. If the datepart is ‘year’, it will only look at year and will ignore months and days. The difference between December 31, 2019 and January 1, 2020 is 1 year, if ‘year’ is the datepart.

Date/Timestamp

Both date or time expressions must contain the specified datepart in the first argument. In other words, both date/time expressions must have an hour value if you are to put 'hour' as the datepart.

Return Type

DATEDIFF returns an integer (BIGINT) data type.

DATEDIFF to calculate the difference between createdate and MQL conversion date

SELECT 
  DATEDIFF('day', “createdate”, “hs_lifecyclestage_marketingqualifiedlead_date”)

 

We can use the column names “createdate” and “hs_lifecyclestage_marketingqualifiedlead_date” because the values are formatted as dates and contain the date part 'day'. This part of the query subtracts “createdate” from “hs_lifecyclestage_marketingqualifiedlead_date” and provides the difference in unit 'day'.

Put it all together and GROUP BY date_diff

This query (see Hubspot Lead Analytics in SQL, Beyond CRM Reporting to see how and why we came up with the nested part) will give us the difference between “createdate” and “hs_lifecyclestage_marketingqualifiedlead_date” and also aggregates the data by the difference between these dates. For each date_diff value (in days), the results will display a count of observations (rows) with distinct hubspot_contacts_id (contacts) from the nested table. This is how we can see the distribution of the amount of time it takes for contacts to become marketing qualified leads.

 

The nitty gritty of nested queries is beyond the scope of this post, but you can learn more about them here.

 

SELECT 
  DATEDIFF(
    'day', "createdate", "hs_lifecyclestage_marketingqualifiedlead_date"
  ) mql_delay, 
  count(distinct hubspot_contacts_id) contact_count 
FROM 
  (
    SELECT 
      p.hubspot_contacts_id, 
      MIN(
        CASE WHEN "key" = 'createdate' THEN timestamp 'epoch' + p.value / 1000 
* interval '1 second' ELSE NULL END ) createdate, MIN( CASE WHEN "key" = 'hs_lifecyclestage_marketingqualifiedlead_date' THEN
timestamp 'epoch' + p.value / 1000 * interval '1 second' ELSE NULL END ) hs_lifecyclestage_marketingqualifiedlead_date FROM hubspot_contacts_properties p GROUP BY 1 ORDER BY 3 ) contact_dates --Filter out null values (optional) WHERE createdate IS NOT NULL AND hs_lifecyclestage_marketingqualifiedlead_date IS NOT NULL --Group results by mql_delay GROUP BY mql_delay
--Order results by mql_delay in ascending order
ORDER BY mql_delay

The results demonstrate the distribution of the time difference between contact creation and MQL conversion. The first column is in unit ‘days’ and the second is a count of ‘contacts’.

Elapsed Time Histogram

Using this SQL query, you can create a histogram in BI tools like Mode, Metabase or Chartio. The histogram below shows the number of MQL conversions by the amount of time (days) elapsed between contact creation and MQL conversion.You can quickly and clearly see that most MQL conversions occur on the same day as contact creation (difference = 0).

 

Note: A bar at the end of your chart indicates null values. You can choose to keep those or filter them out in the WHERE clause.

 

Additionally, you may want to create bins using a CASE statement. The query below creates bins that group together MQL conversions at days_diff intervals of the first day, the rest of the first week, the rest of the first four weeks, and after four weeks.

 

SELECT 
  CASE WHEN DATEDIFF(
'day', "createdate", "hs_lifecyclestage_marketingqualifiedlead_date") = 0
THEN 'Day 1'
WHEN DATEDIFF(
'day', "createdate", "hs_lifecyclestage_marketingqualifiedlead_date")
BETWEEN 1 AND 6 THEN 'Day 2-7'
WHEN DATEDIFF( 'day', "createdate", "hs_lifecyclestage_marketingqualifiedlead_date")
BETWEEN 6 AND 27 THEN 'Day 8-28'
ELSE 'Days 28+' END AS bins_days_diff, COUNT(distinct hubspot_contacts_id) FROM ( SELECT p.hubspot_contacts_id, MIN( CASE WHEN "key" = 'createdate' THEN timestamp 'epoch' + p.value / 1000
* interval '1 second' ELSE NULL END ) createdate, MIN( CASE WHEN "key" = 'hs_lifecyclestage_marketingqualifiedlead_date' THEN
timestamp 'epoch' + p.value / 1000 * interval '1 second' ELSE NULL END ) hs_lifecyclestage_marketingqualifiedlead_date FROM hubspot_contacts_properties p GROUP BY 1 ORDER BY 3 ) contact_dates

--Filter out null values (optional) WHERE createdate IS NOT NULL AND hs_lifecyclestage_marketingqualifiedlead_date IS NOT NULL AND bins_days_diff IS NOT NULL GROUP by bins_days_diff ORDER BY bins_days_diff

Here is a visualization of this dataset

Conclusion

DATEDIFF is a snazzy function with a variety of use cases, including creating histograms! Once I wrapped my mind around dateparts and the idea of datepart boundaries, it all started to make sense. (Remember that DATEDIFF ignores everything except for the datepart you enter in the first argument; It has tunnel vision. If you’re looking at ‘years’, it totally ignores ‘months’ and ‘days’.)

 

On the topic of tunnel vision, I would like to leave you with some parting wisdom. Always remember to start your queries with SELECT 🙈. See you next time.

Get a free consultation with a data architect to see how to build a data warehouse in minutes.
Request Demo
Read more in:
Share this post:

Contents

Work smarter, better, and faster with monthly tips and how-tos.