privacy

Are you sure you sanitized the data enough to remove all private information and make it anonymous?

In our work as security professionals, we often face decisions about what data to reveal and what to hide because of privacy concerns. I am sure that when Netflix decided to release its data as part of the challenge to come up with a better recommendation algorithm, it thought it had sanitized the data enough. But it seems that Arvind Narayanan and Vitaly Shmatikov, researchers at the University of Texas at Austin, were able to extract enough information from the data to identify a few sampled users. More at this FAQ.

They used data mining techniques to correlate the Netflix data with ratings that users had posted on IMDB. Since the IMDB data is publicly available and tied to usernames, they were able to match the two data sets and identify which Netflix records belonged to a particular user on IMDB. In effect, they removed the anonymity of the Netflix data for that user.
It makes perfect sense once you realize that, with so many choices in this world, our habits are highly individual, whether they are movie-watching habits, eating habits, buying habits, etc.
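The linking step described above can be sketched roughly as follows. This is only a toy illustration, not the algorithm Narayanan and Shmatikov actually used: the overlap score, the `tolerance` and `min_gap` parameters, the function names, and all of the sample ratings and usernames are made up for the example.

```python
from typing import Dict, Optional

Ratings = Dict[str, int]  # movie title -> star rating (1-5)


def similarity(anon: Ratings, public: Ratings, tolerance: int = 1) -> int:
    """Count movies present in both records whose ratings differ by at most `tolerance`."""
    return sum(
        1
        for movie, stars in anon.items()
        if movie in public and abs(public[movie] - stars) <= tolerance
    )


def link_record(anon: Ratings, profiles: Dict[str, Ratings], min_gap: int = 2) -> Optional[str]:
    """Return the username whose public ratings best match the anonymous record.

    A match is reported only if the best score beats the runner-up by at
    least `min_gap`; otherwise the result is treated as ambiguous (None).
    """
    scores = sorted(
        ((similarity(anon, ratings), name) for name, ratings in profiles.items()),
        reverse=True,
    )
    if not scores:
        return None
    best_score, best_name = scores[0]
    runner_up = scores[1][0] if len(scores) > 1 else 0
    return best_name if best_score - runner_up >= min_gap else None


# Hypothetical "anonymized" record; the rating for "Pi" is off by one star
# compared with the public profile, to mimic deliberately inserted noise.
anonymous_record = {"Brazil": 5, "Pi": 4, "Primer": 5, "Gattaca": 3}

# Hypothetical public profiles keyed by username.
public_profiles = {
    "alice": {"Brazil": 5, "Pi": 5, "Primer": 5, "Heat": 4},
    "bob": {"Titanic": 4, "Gattaca": 3, "Heat": 2},
    "carol": {"Titanic": 5, "Shrek": 4},
}

print(link_record(anonymous_record, public_profiles))
```

Even in this tiny example, three overlapping ratings are enough to single out one profile, and the off-by-one rating does not prevent the match.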
So, how do we anonymize the data? By inserting errors, removing random pieces, and so on? It seems none of that works, because very little data is needed to uniquely identify you.
This quote from this article explains it very clearly:
"Other research reaches the same conclusion. Using public anonymous data from the 1990 census, Latanya Sweeney found that 87 percent of the population in the United States, 216 million of 248 million, could likely be uniquely identified by their five-digit ZIP code, combined with their gender and date of birth. About half of the U.S. population is likely identifiable by gender, date of birth and the city, town or municipality in which the person resides. Expanding the geographic scope to an entire county reduces that to a still-significant 18 percent. 'In general,' the researchers wrote, 'few characteristics are needed to uniquely identify a person.'"
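To get a feel for the kind of measurement behind those percentages, here is a minimal sketch that computes what fraction of records is unique under a given combination of attributes. The five rows, the field names, and the `unique_fraction` helper are all invented for illustration; they are not census data and not Sweeney's method.

```python
from collections import Counter
from typing import Dict, Sequence


def unique_fraction(records: Sequence[Dict[str, str]], keys: Sequence[str]) -> float:
    """Fraction of records whose combination of `keys` values appears exactly once."""
    if not records:
        return 0.0
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)


# Made-up rows standing in for a "de-identified" data set (names already removed).
records = [
    {"zip": "78701", "gender": "F", "dob": "1970-01-02", "county": "Travis"},
    {"zip": "78701", "gender": "F", "dob": "1970-01-02", "county": "Travis"},
    {"zip": "78701", "gender": "M", "dob": "1970-01-02", "county": "Travis"},
    {"zip": "78702", "gender": "F", "dob": "1970-01-02", "county": "Travis"},
    {"zip": "78703", "gender": "M", "dob": "1970-01-02", "county": "Travis"},
]

# Narrow geography (ZIP code) leaves more people unique than broad geography (county).
print(unique_fraction(records, ["zip", "gender", "dob"]))
print(unique_fraction(records, ["county", "gender", "dob"]))
```

The same pattern shows up as in the quote: the finer the geographic attribute, the larger the share of records that can be pinned to a single person.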

So, as the power of computers and the amount of public data increase, can we pretty much say goodbye to anonymity? Only the future will tell, but the signs don't look good. Still, attacking anonymous data is exactly how we will learn to defend anonymity. Narayanan and Shmatikov are currently working on algorithms and techniques that enable the secure release of anonymous datasets like Netflix's. We need a lot more researchers like them to learn how to do anonymity right.
