Do vast datasets equal validity?

Richard Shotton and Richard Clay

According to Google Trends, interest in big data on the internet has grown by 70 per cent over the past two years. Is it really worth the hype? One obvious strength of big data is the increased sample size it serves up for analysis. When scrutinising search terms or tweets, we’re monitoring the behaviour of millions.

This contrasts with previous methodologies, which relied on surveys or psychological experiments drawn from smaller groups of respondents. The problem is that these samples tend to be atypical – they rely on people who repeatedly fill in surveys for a small incentive. It’s a stretch to assume this group represents target consumers.

A second strength is that big data is often a by-product of other behaviours, removing potential sources of bias. When consumers use their Clubcard, search on Google or click on a banner, they are oblivious to being in a research project. These “natural” actions are preferable to claimed data, which relies on consumers knowing their motivations and being prepared to verbalise them. This is an unrealistic assumption.

These strengths illustrate the potential for big data analysis to improve the consumer insight which underpins marketing. Big data advocate Chris Anderson, in an influential Wired article in June 2008, proclaimed “The End of Theory”. He suggested explanatory hypotheses were no longer necessary when we have such large datasets that can be mined for correlations. In his words, “petabytes of data allow us to say: ‘correlation is enough’… With enough data, the numbers speak for themselves.”

However, big data is not a panacea. An unthinking reliance on correlation can result in the multiple comparison problem: if you test enough combinations, you will be inundated with fluke discoveries. The belief in the power of correlation alone has been debunked by Tyler Vigen, who specialises in finding spurious data relationships. He has found genuine correlations between the divorce rate in Maine and margarine sales, and between web searches for Justin Bieber and tonsillitis. These are insight-free. It’s hard to trust correlation alone when it blames a pop star for outbreaks of a disease.
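To see how easily flukes appear, here is a minimal sketch in Python. Every number and library below (numpy, scipy, the 50 observations, the 1,000 metrics) is an assumption chosen for illustration, not anything Vigen or Anderson actually used: it tests one random “outcome” series against 1,000 equally random “predictor” series, and roughly five per cent of them clear the conventional p < 0.05 bar purely by chance.

```python
# Toy demonstration of the multiple comparison problem: one random
# "outcome" series is tested against 1,000 random "predictor" series.
# None are genuinely related, yet ~5% look "significant" at p < 0.05.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(seed=42)

n_observations = 50    # e.g. 50 weeks of data (illustrative number)
n_predictors = 1000    # e.g. 1,000 unrelated metrics mined from a warehouse

outcome = rng.normal(size=n_observations)                      # the metric we "explain"
predictors = rng.normal(size=(n_predictors, n_observations))   # pure noise

false_discoveries = sum(
    1 for series in predictors if pearsonr(outcome, series)[1] < 0.05
)

print(f"{false_discoveries} of {n_predictors} unrelated series correlate "
      f"'significantly' with the outcome (expected ~{0.05 * n_predictors:.0f})")
```

Scale that up to the millions of metrics sitting in a real big-data warehouse and spurious “insights” are all but guaranteed.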

The wealth of data at advertisers’ disposal means they can find evidence to back up any existing hypothesis, regardless of its validity. An experiment by Charles Lord at Stanford University in 1979 suggested that people selectively analyse evidence to back up their beliefs. He gave an evenly balanced piece of text about the death penalty to people who were either pro or anti capital punishment. After reading the texts, both groups came out more certain of their beliefs.

A second problem is the belief that, by expanding the sample, bias can be avoided altogether. If n=all, this might be true, but this state is rarely, if ever, reached. The example of the Street Bump app is illuminating. It uses smartphone accelerometers to record when a car hits a pothole. The data is then automatically sent to the authorities. It’s an ingenious cost-saving solution which removes the need for council staff to scour the streets looking for potholes. However, potholes are more likely to be reported in upmarket areas populated by sophisticated smartphone users. Areas of deprivation are left unreported and full of potholes – an unhelpful combination.
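The bias can be made concrete with a back-of-the-envelope simulation in Python. Every figure below – the pothole rates, the app penetration – is invented purely for illustration; the real Street Bump data will differ.

```python
# Invented numbers for illustration: the deprived estate has four times as
# many potholes per km as the affluent suburb, but far fewer drivers with
# the reporting app, so the reported figures invert the true picture.
import numpy as np

rng = np.random.default_rng(seed=0)

areas = {
    # area: (true potholes per km of road, share of drivers with the app)
    "affluent suburb": (5, 0.60),
    "deprived estate": (20, 0.10),
}

km_of_road = 100
for name, (potholes_per_km, app_share) in areas.items():
    true_potholes = potholes_per_km * km_of_road
    # A pothole only enters the dataset if an app user happens to hit it.
    reported = rng.binomial(true_potholes, app_share)
    print(f"{name}: {true_potholes} real potholes, ~{reported} reported")
```

However large this dataset grows, the sampling gap doesn’t close; it simply produces an ever more precise estimate of the wrong thing.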

These criticisms don’t mean we should disregard big data. Expecting any methodology to be perfect is to burden it with unreasonable expectations. Instead, we need to be aware of possible sources of bias and counter-balance them. This way we can harness its potential and put its power to proper use. Otherwise it’s easy to be misled by larger datasets and draw the wrong conclusions.

 
