Data Collection - How? What? When?

 

Too many data projects fail at the very first step: data collection.


Data collection is not too hard to do right, but there are quite a few pitfalls and many open questions. So let's go through the most important ones.

Author:

Tomi Mester is a data analyst and researcher.

He has worked for Prezi, iZettle and several smaller companies as an analyst/consultant.

He’s the author of the Data36 blog, where he writes posts and tutorials on a weekly basis about data science, AB-testing, online research and data coding. Plus, he creates video courses about coding and data strategy.

He’s an O’Reilly author and presenter at TEDxYouth, Barcelona E-commerce Summit and Data Conference.

Your number one issue: the quality of the data you collect

"Garbage in, garbage out"

Surprisingly many businesses collect incomplete data, build unreliable databases, and run analyses on skewed datasets (without even knowing that there is a problem).

Needless to say, there is nothing more dangerous for an online business than being certain about a decision that is actually certainly wrong. But drawing conclusions from incomplete datasets will cause exactly that.

 

 

A few years ago I managed a data science project at a huge multinational company as a consultant. We ran an important and complex A/B test. The test should have run for 30 days. The only problem was that a freshly hired junior developer accidentally removed our tracking code from one of the subpages in the middle of the experiment. We realized the tracking issue only at the end of the A/B test. Removing that one code snippet caused an estimated 5-10% data loss. And that was more than enough to mess up all our A/B test results. We had to start the whole experiment over again. We lost 30 days…

It was a hard lesson to learn.

But imagine, if this could have happened so easily, what other - even worse - data quality issues might come up at any company.

The worst thing about a tracking issue is that once you have one, you can't trust your data anymore. You will be suspicious when something performs really well or really badly. Is it really an outstanding result - or is it just the quality of the data again?

The quality of your collected data is a top priority in every data project, so you should have someone who's responsible for it, maintaining it and checking it (at least) monthly.
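To make the idea of a regular check more concrete, here is a minimal Python sketch of an automated "did our tracking just break?" test. It assumes you can export daily event counts per page from your analytics tool. The function name, the sample numbers, the 7-day window and the 20% threshold are all illustrative assumptions, not a recommendation from any specific tool.

```python
from statistics import median

def find_tracking_drops(daily_counts, threshold=0.2, window=7):
    """Return (page, day_index, count, baseline) for every day whose count
    falls more than `threshold` below the median of the previous `window` days."""
    alerts = []
    for page, counts in daily_counts.items():
        for i in range(window, len(counts)):
            baseline = median(counts[i - window:i])
            if baseline > 0 and counts[i] < baseline * (1 - threshold):
                alerts.append((page, i, counts[i], baseline))
    return alerts

# Hypothetical daily pageview counts; tracking on "/checkout" breaks on day 8.
daily_counts = {
    "/home": [1000, 980, 1020, 1010, 990, 1005, 995, 1000, 1010, 1002],
    "/checkout": [200, 210, 190, 205, 195, 200, 198, 202, 0, 0],
}

alerts = find_tracking_drops(daily_counts)
for page, day, count, baseline in alerts:
    print(f"{page}: day {day} had {count} events, expected about {baseline}")
```

A check like this would probably not catch a subtle 5-10% loss spread across the whole site, but it can flag a single page that suddenly goes quiet, which is the kind of issue described in the story above.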

What to collect and why?

For e-commerce businesses and digital applications, you can literally track everything. Every website pageview, every user interaction, even every mouse movement. The sky's the limit.

There are certain limitations that you should consider though:

  • ethical and legal limitations - These are the two most important limitations, and they should be considered with guidance from your data governance strategy. As of 2018, be especially mindful of GDPR and of any default settings associated with your data collection services.
  • implementation time - Tracking should be implemented carefully, so it will take substantial developer time. The more things you track, the more engineering time you will have to allocate to it. (And don't forget that this is not just a one-time implementation. Custom tracking code will need to be maintained and updated as the application it tracks changes.)
  • site speed (websites) - Data collection can be implemented very efficiently in terms of site speed, but complex tracking scripts can still slow down your page load time. We are talking about milliseconds here, but this delay can easily add up.

Business Requirements

From a purely business perspective, the general rule of thumb is: when in doubt, collect the data.

It is not uncommon for 90% of a company’s data to go unused but here comes the tricky part: you never know when you will have to use a tiny part of that 90% in some of your analytics projects.

I have a story about that.

My very first workplace was a startup called Prezi, a cloud-based presentation software. It's a really nice service that makes presentations look much cooler with the power of zooming and rotating the screen.

Prezi’s user interface is browser-based, so we were able to collect data about a big proportion of the user interactions. One thing we never collected, though, was when users rotated elements on their canvas. We thought it was a tiny detail and that we would never use that information in any data analysis. (Part of that 90%, right?)

Except, two years later we heard from our users that certain prezis were packed with way too many rotations (created by excited first-time presenters). These over-rotated prezis were not looking cool anymore - instead they were causing motion sickness.

We wanted to identify these over-rotated prezis in our database…

But since we hadn't collected the data about rotation, we couldn't find them.

Sad but true.

Try to find the right balance between the above-mentioned limitations and the “when in doubt, collect data” principle. And don't be afraid to collect things that you think you will never use… In 2 years, you might thank yourself.

How to decide what to collect?

Before your developers implement your tracking scripts, you should come up with a data collection specification where you'll list all the things you want to track.

There are no general rules or best practices for what you should or shouldn't collect. It always depends on the given business and on the given use case.

But there are three simple methods that will help you to come up with the right data collection strategy. I use these all the time.

  1. List all the features of your product!

    (Note: If it's a website, go through all different page types.)

    Yes, I mean it: go through your product and make a list about every feature in it. This is a boring task that could literally take days... But this is the only way to get to know your product 100% before you set up a proper data collection specification.

    When you create this list, you will learn more about the typical user workflows, the relationships between the different features and also what's important and what's not.
    Plus, remember: the developer who will implement the tracking for these features will spend at least twice as much time on implementation as you do on this specification. So feel sorry for her, not for yourself. ;-)
  2. Work Backwards

    List all the stakeholders who have concerns with the functionality of your app or website. This certainly includes product, engineering, and marketing, but customer service, business development, and sales may also need to understand how certain features or functionalities are working. Find out each stakeholder's KPIs that relate to the application or website. Ask these people, “Do we have all the data we need to understand how we can improve this KPI?” Does any feature need additional visibility to understand if it can be optimized?


    Often, less data-savvy business stakeholders will need inspiration and hand holding to understand what they don’t know about data collection. Once you communicate with them what the possibilities are for data collection, they may be inspired to ask for additional critical data-points.

  3. Run a workshop!

    When you are fully aware of every aspect of your product, bring developers, product people, marketers, managers, etc… into one room and run a data collection workshop.
    When I run these workshops these are the typical steps:
    • Clarifying the goal of the workshop (which is to get ideas and learn company needs for the data collection project)
    • Brainstorming: everyone can add her own idea about what we should collect, even if it seems impossible. (This is important, there is no bad idea at this point!) I usually record everything on a huge whiteboard.
    • Removing irrelevant things: It's time to rationalize. We can remove ideas that are not important or that would be technically really hard (or impossible) to track.
    • Organizing: we try to structure the different proposed data-points and build a schema from it.

Following these three simple methods will help you a lot in not leaving out any important data points. When you are done with this, you will just have to write your data collection specification - and wait for developers to actually implement it.

 

Figure: V0.1 draft of a database schema from a data collection specification

Data Collection Specification (Tracking Plan)

A data collection specification (or tracking plan) is like Google Translate for turning business requirements into metrics. It lists each event that should be tracked, how it should be collected (often using demo tracking code snippets), how it should be displayed in the analytics tool being used, and the associated data dictionary of possible values.

A good data collection spec should include tests that the engineers can check against to ensure that the data is being collected in the right format, with the right values, and at the right frequency.
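As a rough illustration of what such a test could look like, here is a small Python sketch that checks an incoming event against one tracking plan entry. The event name, its properties and the rule format are all hypothetical. A real tracking plan would define its own data dictionary, and this sketch only covers format and allowed values, not frequency.

```python
from datetime import datetime

# Hypothetical tracking plan entry: event name -> required properties and rules.
TRACKING_PLAN = {
    "signup_completed": {
        "user_id": {"type": str},
        "plan": {"type": str, "allowed": {"free", "pro"}},
        "timestamp": {"type": str, "format": "iso8601"},
    },
}

def validate_event(event, plan=TRACKING_PLAN):
    """Return a list of problems; an empty list means the event passes."""
    name = event.get("event")
    if name not in plan:
        return [f"unknown event: {name!r}"]
    problems = []
    props = event.get("properties", {})
    for key, rule in plan[name].items():
        if key not in props:
            problems.append(f"missing property: {key}")
            continue
        value = props[key]
        if not isinstance(value, rule["type"]):
            problems.append(f"wrong type for {key}: {type(value).__name__}")
            continue
        if "allowed" in rule and value not in rule["allowed"]:
            problems.append(f"unexpected value for {key}: {value!r}")
        if rule.get("format") == "iso8601":
            try:
                datetime.fromisoformat(value)
            except ValueError:
                problems.append(f"bad timestamp for {key}: {value!r}")
    return problems

good_event = {
    "event": "signup_completed",
    "properties": {
        "user_id": "u-123",
        "plan": "pro",
        "timestamp": "2018-06-01T12:00:00+00:00",
    },
}
bad_event = {
    "event": "signup_completed",
    "properties": {"user_id": 123, "plan": "enterprise"},
}

print(validate_event(good_event))
print(validate_event(bad_event))
```

Engineers could run checks like this against a sample of freshly collected events after every release, so a broken or renamed property shows up within hours rather than at the end of an experiment.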

What tools to use?

Depending on the size of your company (and mostly on your resources), there are different tools that you should consider for the data collection layer of your data stack.

At smaller companies, third-party tools are usually the best option.

One of the best-known third-party tracking services is Google Analytics. It's free, easy to implement and it doesn't just collect the data, it actually creates automatic reports. So it's a nice all-in-one tool.

Note: Google Analytics was created for websites and mobile applications. If you have something more complex than that, or if you don't like Google Analytics, you can easily find alternative solutions. There are plenty of those out there, many of them specialized for different needs and niches.

Let me quickly introduce how Google Analytics does data collection. That will help you to understand how most of the advanced tools (or in certain cases even your own tracking) could be implemented in the future.

It's simple.

  1. You have to add a JavaScript code snippet to your website's code base (in the <head> section of each page).
    Note: People will often refer to this as a “pixel” or “beacon” though these references are not entirely accurate.
  2. Whenever your website is opened in a user's browser, it sends a "signal" to Google Analytics.
    This signal contains quite a few pieces of information about the user:
    1. when the user opened the website
    2. the user's unique ID (a randomly generated number that's then stored in a cookie in the user's browser)
    3. which exact page she is visiting
    4. what device she is using
    5. which country and city she is visiting from
    6. and many other dimensions.
  3. Every time this user visits another page of your website, a new "signal" is sent to Google Analytics (but with the same unique user ID).
  4. This is done for every pageview of every website visitor. And that draws out the typical user journey, which Google Analytics then turns into reports.

With other special JavaScript code snippets, you can also track other events on your webpage. E.g. you can record where your website visitors clicked, how far they scrolled, etc.
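The pageview steps above can be sketched in a few lines of Python. This is only a simplified model of the idea, not Google Analytics' actual payload format or API: the function name, the field names and the "visitor_id" cookie key are assumptions made for the illustration, and location data is left out.

```python
import time
import uuid

def build_pageview_signal(cookies, page_path, user_agent, now=None):
    """Build one pageview "signal" and return it with the updated cookies."""
    cookies = dict(cookies)
    # Reuse the visitor ID stored in the cookie, or generate a random one
    # on the first visit.
    if "visitor_id" not in cookies:
        cookies["visitor_id"] = uuid.uuid4().hex
    signal = {
        "type": "pageview",
        "timestamp": now if now is not None else time.time(),
        "visitor_id": cookies["visitor_id"],
        "page": page_path,
        "user_agent": user_agent,
    }
    return signal, cookies

# Simulate one visitor opening two pages in the same browser.
cookies = {}
first, cookies = build_pageview_signal(cookies, "/home", "ExampleBrowser/1.0")
second, cookies = build_pageview_signal(cookies, "/pricing", "ExampleBrowser/1.0")

# Both signals share the same visitor ID, so they can be joined into a journey.
journey = [first["page"], second["page"]]
print(first["visitor_id"] == second["visitor_id"], journey)
```

Because the ID lives in the browser's cookie, every later signal from the same browser can be tied back to the same (anonymous) visitor. Click or scroll events would follow the same pattern with a different "type" value and a few extra fields.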

Note: if you like Google Analytics, you will like Hotjar and Google Optimize, as well. I highly recommend checking them out.

Note 2: There are other popular 3rd party analytics services that are worth taking a look at. A few (but not all) of them: Amplitude, Mixpanel, Crazyegg and Optimizely.

What tools to use? (Next level)

There are many good solutions when you want to take your data collection to the next level. You can build your own database, or you can use advanced third-party tools. Or you can combine the two.

Either way, the technical background will be pretty much the same as with Google Analytics. Only this time, you don't send the "signal" to Google's data servers but to the servers of those advanced data collection tools, or to your own.
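If you go the "own database" route, the receiving side can start out very simple. Here is a minimal Python sketch, assuming the signals arrive as JSON strings and are stored in SQLite. The table layout and the function names are illustrative assumptions, and a production setup would need a real web endpoint, validation and a database sized for your traffic.

```python
import json
import sqlite3

def create_event_store(path=":memory:"):
    """Open (or create) a database with a single append-only events table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "received_at TEXT DEFAULT CURRENT_TIMESTAMP, "
        "visitor_id TEXT NOT NULL, "
        "event_type TEXT NOT NULL, "
        "payload TEXT NOT NULL)"
    )
    return conn

def store_signal(conn, raw_body):
    """Parse one JSON signal and append it to the events table."""
    signal = json.loads(raw_body)
    conn.execute(
        "INSERT INTO events (visitor_id, event_type, payload) VALUES (?, ?, ?)",
        (signal["visitor_id"], signal["type"], json.dumps(signal)),
    )
    conn.commit()

conn = create_event_store()
store_signal(conn, '{"visitor_id": "abc", "type": "pageview", "page": "/home"}')
store_signal(conn, '{"visitor_id": "abc", "type": "click", "target": "signup"}')

# Count stored events per type, as a first sanity check on the collected data.
rows = conn.execute(
    "SELECT event_type, COUNT(*) FROM events GROUP BY event_type ORDER BY event_type"
).fetchall()
print(rows)
```

Keeping the full raw payload next to a few indexed columns follows the "when in doubt, collect the data" principle: fields you don't need today stay available for later analyses.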

Conclusion

Since proper data collection is the foundation of all your further analytics efforts, you have to take it seriously.

This article was meant to give you a high-level overview of the biggest questions and problems. I hope this also helped you to see the typical pitfalls - and not just to see them, but to avoid or prevent them at your company.

If you liked this article, you will love my comprehensive overview about business data science. Go and check it out here: Data Science for Business.
