You’ve done the hard work. You’ve now gotten to the point of data visualization, but can you trust your results?
We’ll look at some tips and strategies for avoiding common fallacies and pitfalls in data analysis.
Anyone analyzing data will come across multiple challenges when it comes to drawing meaningful conclusions. Some of these have to do with the data itself – often it is incomplete, inconsistent, and sometimes not available at all.
Assuming these challenges have been overcome, there are some other common pitfalls of data analysis to avoid.
1. Texas Sharpshooter Fallacy
One of the best-known data mining myths and blunders, this bias is named after a fictitious sharpshooter who took aim at a big barn door. Once he’d let off a whole lot of shots, he walked up to the door, found a grouping of hits, drew a target around them and showed this off to all of his friends, calling himself a “sharpshooter” – or so the story goes.
This mistake is typically made when people look at a large amount of data, identify small patterns, and then – incorrectly – derive a conclusion based on these patterns. Don’t make big data mistakes like this!
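The statistical version of this blunder is the multiple-comparisons problem: test enough hypotheses against random data and some "pattern" will look impressive purely by chance. A minimal sketch (pure noise, no real signal, all numbers invented):

```python
import random
from math import sqrt

random.seed(7)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# 50 series of pure noise: any "relationship" found below is an accident.
series = [[random.gauss(0, 1) for _ in range(20)] for _ in range(50)]

# Scan all 1,225 pairs and keep the most correlated-looking one --
# exactly like drawing the target around the tightest cluster of hits.
best = max(
    ((i, j, pearson(series[i], series[j]))
     for i in range(50) for j in range(i + 1, 50)),
    key=lambda t: abs(t[2]),
)
print(f"series {best[0]} vs {best[1]}: r = {best[2]:+.2f}")
```

Even though every series is random noise, the winning pair will typically show a strong-looking correlation – a "pattern" manufactured purely by searching.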
2. Gambler's Fallacy
Consider a coin toss with a friend. After an improbable run of 50 “heads” results, you’re deciding what to call on the next flip. Even though you may feel that after so many identical results the pattern must surely end, the chance of the next flip landing on “tails” is no different from any of the flips before it. In stock market parlance, “past results are no indication of future performance”.
Be careful not to fall for the mistaken belief that just because something has happened more frequently than usual, it is now less likely to happen in future (and the other way around).
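A quick simulation (a minimal sketch, assuming a fair coin) shows that a long streak changes nothing about the next flip:

```python
import random

random.seed(42)

# Simulate a large number of fair coin flips, then check what follows
# each run of 5 consecutive heads. If the gambler's fallacy were true,
# tails would be "due" and would appear more than half the time.
flips = [random.random() < 0.5 for _ in range(1_000_000)]  # True = heads

after_streak = []
for i in range(5, len(flips)):
    if all(flips[i - 5:i]):          # the previous 5 flips were all heads
        after_streak.append(flips[i])

tails_fraction = 1 - sum(after_streak) / len(after_streak)
print(f"P(tails after 5 heads) ~ {tails_fraction:.3f}")  # close to 0.5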
3. Regression Toward the Mean
When looking at data, extreme or out-of-the-ordinary results tend to be followed by results closer to the average. The pitfall – one of the most common data analysis problems – is inventing a causal story to explain what is a purely statistical effect.
Nobel Prize-winning economist Daniel Kahneman, in his book “Thinking, Fast and Slow”, describes this fallacy using an example from watching the men’s ski jump.
The commentator announced, “Norway had a great first jump; he will be tense, hoping to protect his lead and will probably do worse” and “Sweden had a bad first jump and now he knows he has nothing to lose and will be relaxed, which should help him do better.”
The commentator had relied on this regression to the mean and had come up with a story for it, without any evidence.
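Kahneman's point can be reproduced with a toy model: give each jumper a fixed skill plus random luck on the day, and the best first-round jumpers will, on average, do worse in the second round – with no psychology involved. The numbers below are invented for illustration:

```python
import random
from statistics import mean

random.seed(0)

# Each jumper's score = fixed skill + random luck on the day.
# Second jumps are independent luck draws: no tension, no relaxation.
n = 10_000
skills = [random.gauss(100, 5) for _ in range(n)]
jump1 = [s + random.gauss(0, 10) for s in skills]
jump2 = [s + random.gauss(0, 10) for s in skills]

# Take the top 10% of first jumps and see how their second jumps compare.
top = sorted(range(n), key=lambda i: jump1[i], reverse=True)[: n // 10]
print(f"best first jumps:   {mean(jump1[i] for i in top):.1f}")
print(f"their second jumps: {mean(jump2[i] for i in top):.1f}")
```

The top first-round group scores noticeably closer to the overall average on the second jump, because much of what put them on top the first time was luck – regression toward the mean, with no story required.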
4. Simpson's Paradox
Simpson's paradox (also known as the Yule–Simpson effect) is a phenomenon in probability and statistics in which a trend appears in several different groups of data, but disappears or reverses when those groups are combined.
One of the most famous examples of this is the UC Berkeley gender-bias case. When the admissions statistics for men and women were presented, it seemed that the university had a bias towards men (44% of male applicants accepted, compared to 35% of female applicants). However, upon closer inspection it was noted that 6 out of 85 departments appeared biased against men, while only 4 appeared significantly biased against women – and that overall, the corrected data showed a “small but statistically significant bias in favor of women”.
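The reversal is easy to reproduce with made-up numbers (these are illustrative, not Berkeley's actual figures): women are admitted at a higher rate in each department, yet at a lower rate overall.

```python
# Hypothetical admissions data: (applicants, admitted) per department
# and gender. Women apply mostly to the more selective department.
data = {
    "Dept A": {"men": (800, 500), "women": (100, 80)},
    "Dept B": {"men": (200, 20),  "women": (600, 120)},
}

def rate(applied, admitted):
    return admitted / applied

for dept, groups in data.items():
    m = rate(*groups["men"])
    w = rate(*groups["women"])
    print(f"{dept}: men {m:.0%}, women {w:.0%}")  # women higher in each dept

# Combine the departments and the trend reverses:
men_total   = [sum(x) for x in zip(*(g["men"] for g in data.values()))]
women_total = [sum(x) for x in zip(*(g["women"] for g in data.values()))]
print(f"Overall: men {rate(*men_total):.0%}, women {rate(*women_total):.0%}")
```

The reversal happens because the groups differ in size and in which department they apply to – which is why it matters whether you analyze data aggregated or split by group.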
5. McNamara Fallacy
Robert McNamara was the United States Secretary of Defense from 1961 to 1968, a period that included the Vietnam War. McNamara became notorious for relying almost exclusively on quantitative metrics to measure success – in this case, enemy body count – to the detriment of other factors.
One of the big problems with data analysis, this common mistake happens when data analysts rely solely on metrics in complex situations and lose sight of the bigger picture.
6. Hawthorne Effect
The Hawthorne effect refers to the tendency of subjects to behave differently (often to work harder, or perform better) when they know they are being observed as participants in a study. The Hawthorne effect was first described by Henry A. Landsberger, and is named after the site of the original experiments: Western Electric’s Hawthorne Works factory in Cicero, Illinois.
7. Correlation does not imply causation
Did you know that there is a direct correlation between math doctorates awarded and uranium stored at US power plants? Or that the age of Miss America closely correlates with the number of murders by steam, hot vapors, and hot objects?
A good data scientist will ask, “so what?” One of the biggest mistakes people make when looking at data is confusing correlation with causation – in other words, similarity between two statistics or trends does not imply that one caused the other.
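Part of the reason such spurious correlations are so easy to find is that any two quantities that merely trend in the same direction over time will correlate strongly. A small sketch with invented numbers:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Two made-up series that simply both drift upward over six years.
# They have nothing to do with each other.
series_a = [10, 12, 13, 15, 16, 18]
series_b = [100, 108, 110, 118, 121, 130]
print(f"r = {pearson(series_a, series_b):.3f}")  # near 1.0, no causal link
```

A shared upward trend (here, just the passage of time) is enough to produce an almost perfect correlation, which is why a correlation coefficient on its own proves nothing about cause.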
8. Sampling Bias
Sampling bias is when a sample is collected in such a way that some members of the intended population are less likely to be included than others, skewing results.
This often happens, for example, with pre-screened trial participants, or when volunteers are recruited from within certain groups or areas. In polling, this actually happened in the 1930s, when a sample population was drawn from a list of car owners. Car owners tended to be wealthier people with particular political views, and the actual result turned out to be the polar opposite of what the poll predicted.
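A simulation of that 1930s-style poll (with invented proportions) shows how the sampling frame alone can flip the result:

```python
import random

random.seed(1)

# Hypothetical electorate: 30% own cars and favor candidate A 70/30;
# the other 70% favor candidate B roughly 2-to-1.
population = []
for _ in range(100_000):
    owns_car = random.random() < 0.3
    prefers_a = random.random() < (0.7 if owns_car else 0.33)
    population.append((owns_car, prefers_a))

def support_a(sample):
    return sum(p for _, p in sample) / len(sample)

biased = [v for v in population if v[0]][:1000]   # poll drawn from car owners
fair   = random.sample(population, 1000)          # simple random sample
print(f"car-owner poll: {support_a(biased):.0%} for A")
print(f"random sample:  {support_a(fair):.0%} for A")
```

The car-owner poll confidently calls the election for A, while a random sample of the same electorate shows B winning – the sample frame, not the voters, produced the headline.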
When it comes to data analysis, and drawing the correct conclusions from your data, keep these common pitfalls in mind so you can make the best decisions to drive your organization forward.
As mentioned at the beginning of this piece, the first, critical step is to make sure that the data you’re basing your analysis on is valid, accurate, and complete. Having the right data partner is critical in this respect, and removes a huge part of the headache when it comes to data analysis.
Panoply is your ideal data partner, providing an autonomous data warehouse built for analytics professionals, by analytics professionals.