Joining Facebook is quick and easy - that is why it has been able to grow so rapidly and dominate the world of social networks. What is not easy for the company is fixing the data quality problems it has allowed to build up, as David Reed finds out.
Facebook gained a lot of media coverage in September for reaching a significant milestone - the social network had registered one billion users. The figure means one seventh of the world’s population has created an online account. It is an astonishing growth rate, given that the network only launched in 2004 and had 500 million users just two years ago.
What would make the figure even more impressive would be if it were accurate. Despite being widely reported as a simple fact, the reality is more complicated - Facebook might have reached the billion mark some months ago, or it may still need another 87 million users in order to hit that target.
In its financial filing to American authorities in June, the social network made a surprising admission, namely that as many as 8.7 per cent of accounts could be duplicates, fakes or mis-categorised (see panel). Given the pressure on Facebook to monetise its users by selling more of them to advertisers, this looked like an admission of wastage at a level that would have been unacceptable to media buyers even in the glory days of broadcast television.
Even more surprising was that Facebook is in effect just guessing about the accuracy of its figures. On close reading, the filing notes that it used “an internal review of a limited sample of accounts” to come up with this risk factor in its statement. So the problem could be bigger (or smaller) than stated.
“To give them due credit, I respect their honesty,” is how Nigel Turner, VP information management strategy at Trillium Software, a division of Harte Hanks, responded to the statement. “I know all too well how many companies hide their data quality issues from public gaze.”
This could be taken as in line with statements by Mark Zuckerberg, founder of Facebook, that privacy is no longer the social norm. More likely is that in the wake of its flotation, the company is now having to address some of the internal measures and controls that longer-established businesses already have in place.
At the time of its filing, the network was reporting 901 million monthly active users. The scale of its adoption by consumers had undoubtedly helped it to achieve a $66.5 billion market capitalisation at launch. But it is also why Turner believes Facebook may have more serious questions to answer. “That market cap is predicated on the numbers of active users, which could have been inflated because it has got suspect accounts. So the company may not be worth what investors are paying for it,” he says.
Investors already appear to have been spooked by the numbers coming out of the business, leading to a $25 billion fall in its market value. Much of that relates to the challenges for the network in maintaining its usage and growth rates while also maximising ad revenues. Analysts may not yet have made the link with data quality, but it may not be long until they do.
“I don’t think they really know how many users they have got,” notes Colin Rickard, enterprise and channel sales director at Experian QAS. “There are things they should do, like having an internal change of culture so that data gets attention at a senior level. There is an underlying need for better data management. That is a business maturity issue.”
It is fair to say that Facebook has achieved phenomenal growth for a company founded by people in their twenties who are barely out of them now. With the flotation will have come a raft of more conventional management processes (and older people to implement them). Even so, social networks are at the leading edge of the digital revolution and are unlikely to have embedded “old economy” activities like data validation and correction.
Rickard suggests that fixing the problem at Facebook could involve adopting techniques that are themselves at the cutting edge of data quality. “It has got a lot of unstructured data about its users and there are companies which can help it to understand the links between them. If you look at what people do on the network, there are some very interesting things you can use for validation, for example the way teenagers behave will have a pattern,” he says.
Facebook’s data quality problems start at the point of registration. To create an account, users provide a name, email address and date of birth. While DOB is an extremely powerful piece of information for matching records, it can only be used in conjunction with other elements that help to filter out multiple or inaccurate entries. The absence of a real-world address at sign-up significantly hampers this, since it precludes instant validation at point of capture and subsequent batch cleaning and deduplication.
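To see why DOB needs supporting fields, consider a minimal matching sketch. The records, field names and matching rule below are hypothetical (the article does not describe Facebook's actual matching logic); the point is simply that a composite key built from the data captured at sign-up can surface likely duplicates, while DOB on its own cannot.

```python
from collections import defaultdict

# Hypothetical records of the kind captured at sign-up:
# name, email address and date of birth -- no postal address.
accounts = [
    {"id": 1, "name": "Ann Smith", "email": "ann@example.com", "dob": "1990-04-01"},
    {"id": 2, "name": "ann smith", "email": "asmith@example.com", "dob": "1990-04-01"},
    {"id": 3, "name": "Bob Jones", "email": "bob@example.com", "dob": "1985-12-09"},
]

def match_key(acc):
    # DOB alone collides constantly at this scale (roughly tens of
    # thousands of people per birth date in a billion-user base), so
    # combine it with a normalised name as a composite matching key.
    return (acc["name"].lower().strip(), acc["dob"])

groups = defaultdict(list)
for acc in accounts:
    groups[match_key(acc)].append(acc["id"])

duplicates = {k: ids for k, ids in groups.items() if len(ids) > 1}
print(duplicates)  # accounts 1 and 2 look like the same person
```

A real deduplication run would add fuzzy name comparison and further evidence before merging, but even this crude key shows how the email address alone would have kept the two "Ann Smith" records apart.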
There also appears to be an absence of conventional data analysis and database management. “Relying on an internal sample is not a very good way for a business with a large number of customers to act. They seem to have no ongoing measurement of data quality in place,” says Turner.
Extrapolating error rates from a sample is fraught with problems, not least that the larger the target figure, the bigger any mistakes become. So when calculating problems across a population of 1 billion, starting with a sample that could be flawed in itself rapidly escalates the difficulties. As Turner points out, the 8.7 per cent rate found by Facebook in its then 901 million users amounts to more than 78 million problematic accounts - more than the entire population of the United Kingdom.
This approach also gives the appearance of managing ad-hoc in an era when businesses are trying to be more evidence-based in their decision making. If this could be an issue now, when Facebook is earning revenues from advertisers, imagine the challenge if users start to make payments via the social network. “There doesn’t seem to be a structured approach to identity verification or to deal with potential fraud,” says Turner.
As long as the social network is just a way for individuals to link up and share with each other, there may not seem to be much harm. But the UK government, for one, has announced plans to use Facebook identities as part of its Identity Assurance programme which will give access to services online. Fake accounts could lead to a host of data security and fraud issues.
As the network continues to grow, Turner says it needs to start to change. “Best practice would be to put controls and consolidation in at source. When people create accounts, they need a process to validate if that person is a real entity or to pull up an existing account. Don’t leave that to the back end where it is expensive and difficult,” he says.
This front-end fix need not be expensive (although funding is unlikely to be a problem for Facebook), and it would ensure that new accounts are genuine to within, say, a 1 per cent tolerance. As the network grows, Facebook would then at least be certain that its new users are authentic. Robust and proven technology exists to support this process.
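The kind of control Turner describes can be sketched simply. The rules below are illustrative assumptions, not Facebook's actual checks or any vendor's product - real point-of-capture services layer syntax checks, mailbox verification and duplicate lookups on top of heuristics like these:

```python
import re
from datetime import date

# Minimal illustrative point-of-capture checks (hypothetical rules only).
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_signup(name: str, email: str, dob: date, today: date) -> list[str]:
    """Return a list of reasons to reject or flag a new registration."""
    problems = []
    if not name.strip() or not any(c.isalpha() for c in name):
        problems.append("name looks empty or non-alphabetic")
    if not EMAIL_RE.match(email):
        problems.append("email fails basic syntax check")
    age = (today - dob).days / 365.25
    if age < 13:
        problems.append("under minimum age")
    if age > 120:
        problems.append("implausible date of birth")
    return problems

print(validate_signup("Ann Smith", "ann@example.com",
                      date(1990, 4, 1), date(2012, 10, 1)))  # []
print(validate_signup("!!!", "not-an-email",
                      date(2005, 1, 1), date(2012, 10, 1)))
```

Rejecting or flagging registrations at this point is cheap; as Turner notes, chasing the same problems through the back end of a billion-record database is not.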
Sorting out the problems that have already been created is more difficult. Most companies with customer databases undertake regular data cleansing, deduplication and verification. Trying to run these processes on a global database of a billion users has almost certainly never been tried and would potentially cost hundreds of millions of dollars. Plenty of consultancies and vendors would be happy to make a play for the job. But could any of them actually achieve a good outcome?
“This is a good example of a very large Big Data challenge,” points out Rickard. “It is big both in terms of the numbers of users and the volume of unstructured data. It would require bringing together a number of leading edge solutions to tackle it, as well as the management challenge.”
Remarkable as the scale of Facebook’s data problems might be, the principle that caused them is not. “It is far from atypical - look at the banks. We know what happened there even though Basel II has been in place for a long time. Why didn’t that foretell the problems? Because the numbers banks were working on weren’t accurate,” says Rickard.
He notes that the insurance industry relies on the “expert judgement” of actuaries to determine whether risk profiles look right or not. In its financial filing, Facebook is essentially following the same path - it has considered the problem and come up with a value for the risk which may or may not be accurate. If nobody within the company can be certain about it, it is just as clear that nobody outside - investors, analysts, media - can be either.
For now, the continued growth and apparent commercial potential of Facebook is protecting it from pressure to change its processes. That may not last forever. And Turner for one is not surprised that the social network has found itself with this built-in problem: “It is typical of start-ups that expand in the way they have. There is probably nobody there familiar with data management - they were geeks with a brilliant idea.”
DataIQ is a trading name of IQ Data Group Limited