Hacker News

> So if I publish a study (with backing data) that says that 38 year old males are likely to commit adultery, I've "violated the privacy" of all 38 year old males?

Yes. The difference between 'John Elks of 7 Arborview is a rapist' and 'there is a 0.5 correlation between living in Pleasantville and being a rapist' or '38 year old males commit 10% more rapes than average' etc. is solely one of degree and not kind.

Suppose I have a set of datapoints like that. And let's say each datapoint applies to only half the population. How many datapoints before I have broken your privacy and linked you to the furry porn you like to rent? Well, I'm guessing you're an American male. The US population is ~300 million, and roughly half of that is male, so 150 million. The first datapoint pins you down to within 75 million. The second, down to 37.5 million. The third, down to ~19 million, the fourth, ~9 million, the fifth ~4.7 million, the sixth ~2.3 million, the seventh ~1.2 million, the eighth ~590,000 (starting to feel nervous yet?), the ninth ~290,000, the tenth ~150,000, the eleventh ~73,000, the twelfth ~37k, etc. until the 27th or 28th specifies just 1 - you.

Now, tell me: Where in this slippery slope did it suddenly flip from not being a privacy violation of some degree, to being a privacy violation?

Was it at the 5th bit of information? Are you damaged at the 12th bit of information? Or did it take until the 26th or 27th bit of information before it magically flips from being good science to bad privacy violation?

Is it fine just so long as it might also be your neighbor down the street, even though most people would shun you based on far less than a 50-50 chance of things like being a child rapist? (An employer on the bench might regard a 10% chance of you being objectionable as being too much; that only requires, what, 24 bits of information?)
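
For anyone who wants to check the halving arithmetic, here is a minimal Python sketch. It assumes the 150 million starting pool from the comment and that every datapoint is independent and splits the remaining pool exactly in half, which real attributes will not do. The function names are made up for illustration.

```python
import math

# Assumption taken from the comment: roughly half of ~300 million people.
MALE_POPULATION = 150_000_000


def candidates_after(bits: int, population: int = MALE_POPULATION) -> float:
    """People still matching after `bits` independent, perfectly halving datapoints."""
    return population / 2 ** bits


def bits_for_probability(p: float, population: int = MALE_POPULATION) -> float:
    """Bits needed before one person is the match with probability p,
    i.e. before the pool has shrunk to 1/p equally likely candidates."""
    return math.log2(population * p)


for n in (1, 2, 5, 8, 12, 24, 27, 28):
    print(f"{n:2d} bits -> {candidates_after(n):,.1f} candidates")

print(f"bits to single out one person: {bits_for_probability(1.0):.1f}")
print(f"bits for a 10% chance:         {bits_for_probability(0.1):.1f}")
```

Under these assumptions it takes a little over 27 bits to single out one person, and a little under 24 bits to get to a 1-in-10 chance.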

Predictions embody a great deal of information. That's how Bayesian statistics and statistics in general work, after all.



The difference between poking someone and shooting them is also "one of degree rather than kind", but we choose to make a categorical distinction between the two cases nonetheless.


I'm very curious about what bits of information you think exist that so precisely bisect the population.


The Electronic Frontier Foundation does it to browsers pretty easily.

http://panopticlick.eff.org/

And according to another article by them, all you need is zip code, gender and birthday to identify someone with a high degree of certainty.

http://www.eff.org/deeplinks/2009/09/what-information-person...


Of course, zip code and birthday are a pretty huge amount of information. With the simplifying assumptions of roughly uniform distributions, knowing a birthday confers ~8.5 bits and knowing a zip code is worth ~15.3 bits. That's ~23.8 in total.

It's not surprising that those two pieces of information can pretty easily narrow down an identity.

  log2(3*10^8) = 28.2
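
Those figures can be reproduced with a short Python sketch. It assumes birthdays are a uniformly distributed day of the year and that there are about 42,000 equally populated US ZIP codes; the ZIP code count is my rough assumption, not a number from the thread, and real ZIP codes vary enormously in population.

```python
import math

US_POPULATION = 3e8   # figure used in the thread
DAYS_IN_YEAR = 365    # birthday as day of year, ignoring leap days
ZIP_CODES = 42_000    # assumption: rough count of US ZIP codes

birthday_bits = math.log2(DAYS_IN_YEAR)
zip_bits = math.log2(ZIP_CODES)
total_bits = birthday_bits + zip_bits

# Bits needed to single out one person in the whole population.
needed_bits = math.log2(US_POPULATION)

# Expected number of people sharing one (birthday, ZIP code) pair.
remaining = US_POPULATION / (DAYS_IN_YEAR * ZIP_CODES)

print(f"birthday: {birthday_bits:.2f} bits")
print(f"zip code: {zip_bits:.2f} bits")
print(f"total:    {total_bits:.2f} of {needed_bits:.2f} bits needed")
print(f"people left per (birthday, zip) pair: {remaining:.1f}")
```

With these assumptions the two attributes leave a pool of roughly 20 people, and one more bit such as gender halves that again.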


Birthday encodes a lot more than one bit of information. The above example was a series of factors that reduced the identified population in half.


Alright; would you prefer me to redo it with pieces of information each worth 7 bits?

'The first piece cuts it down to 2.3 million people...'
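
The 7-bit version of the walk-through can be sketched the same way. This assumes the full ~300 million population and that each piece is independent and worth exactly 7 bits; `pool_sizes` is a made-up helper name.

```python
US_POPULATION = 300_000_000  # figure used in the thread
BITS_PER_PIECE = 7           # each piece shrinks the pool by a factor of 2**7 = 128


def pool_sizes(population: int, bits_per_piece: int):
    """Yield the remaining pool after each piece, stopping once it is one person or fewer."""
    remaining = float(population)
    while remaining > 1:
        remaining /= 2 ** bits_per_piece
        yield remaining


sizes = list(pool_sizes(US_POPULATION, BITS_PER_PIECE))
for i, size in enumerate(sizes, start=1):
    print(f"piece {i}: {size:,.1f} people")
```

The first piece leaves about 2.3 million people, and under these assumptions the fifth piece takes the pool below one person.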



