ursaminortaur wrote: onthemove wrote: ursaminortaur wrote:
Are you sure they weren't just referring to triangulation of your mobile phone's location by detecting the signal from your phone from three cell phone towers at the same time?
I've just dug out the article, and I stand corrected, it's referring to 4 data points rather than 3... https://www.bbc.co.uk/news/science-environment-21923360
"Scientists say it is remarkably easy to identify a mobile phone user from just a few pieces of location information. ...
... But a study in Scientific Reports warns that human mobility patterns are so predictable it is possible to identify a user from only four data points.
The growing ubiquity of mobile phones and smartphone applications has ushered in an era in which tremendous amounts of user data have become available to the companies that operate and distribute them - sometimes released publicly as "anonymised" or aggregated data sets.
...
Recent work has increasingly shown that humans' patterns of movement, however random and unpredictable they seem to be, are actually very limited in scope and can in fact act as a kind of fingerprint for who is doing the moving.
(...)"
They analysed their database of activity tracks for individual phone users and then took samples at different times to determine the minimum number of such samples needed to uniquely distinguish between tracks. That isn't the same as being able to uniquely identify someone from 4 randomly selected samples. The fact that you only need about 4 should not be that surprising since you should be able to distinguish most such tracks from one another with just two data points - one corresponding to them being at home and another corresponding to their being at work.
There's a reason I put 'identification' in inverted commas.
The theme of my post was raising the question as to what does 'identification' mean.
I did go on to use online usernames as an example in that post, but figured my post was long enough so deleted that bit. But perhaps it's worth introducing it.
Most of us post with a pseudonym (made up username) in the belief we are anonymous. Yet anyone posting regularly becomes known (on here) principally by their pseudonym.
But we probably interact with more people on here than in real life (or at least more people probably see us here than in real life). More people will probably 'know' the pseudonym character rather than the person in real life.
Which then raises the question of which is your 'real' 'identity'.
If readers of this board were to bump into me in real life and not know who I am, which would be more 'informative' in identifying me? Me telling them my 'real' name or me telling them my lemonfool username?
I suspect that people familiar with lemonfool would find my lemonfool username more informative than my real name, which would likely mean nothing to them.
My point? As far as lemonfool readers are concerned, my 'anonymous' lemonfool username is actually my identity. They can associate a personality with that. They can relate that username to my posts that they have read.
So far from being 'anonymous', all that's happened is that I have a new or alternative identity that people recognise. Is that any more or less real than my (supposed) 'real' identity?
My overall point? That identity isn't an absolute, given thing. It's not your post code. It's not even your real name - there are multiple people with the same names, as that Argentinian TV presenter has just discovered when he thought _the_ William Shakespeare had died the other day.
'Identity' is simply whatever means other people or systems use to distinguish you from others.
When it comes to the question in the OP about our data being made public, what 'identification' matters?
I mean, if you have a rare illness, that in itself could be 'identifying' if people who you know, know you have it. If they were to see a dataset with an individual in that dataset having that rare illness, they would quite likely be able to identify it as you.
But does that matter? They already know you have that illness.
"The fact that you only need about 4 should not be that surprising since you should be able to distinguish most such tracks from one another with just two data points - one corresponding to them being at home and another corresponding to their being at work."
That is a form of identity. You have identified a track, distinct from other tracks. It's something you recognise, distinct from other tracks, and can refer to it - i.e. it is something you can identify.
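To make the home-and-work point concrete, here is a minimal sketch of the kind of check being described: take a few sample times and count how many tracks are distinct from all the others at those times. The tracks, the area labels and the function name `unique_fraction` are all made up for illustration; this is not the study's data or code.

```python
from collections import Counter

# Hypothetical toy tracks: user -> {hour of day: area label}.
# Hour 3 stands in for "at home", hour 11 for "at work".
tracks = {
    "u1": {3: "A", 11: "X", 19: "P"},
    "u2": {3: "A", 11: "Y", 19: "P"},
    "u3": {3: "B", 11: "X", 19: "Q"},
    "u4": {3: "B", 11: "Y", 19: "Q"},
    "u5": {3: "C", 11: "X", 19: "P"},
}


def unique_fraction(tracks, hours):
    """Fraction of users whose locations at the given hours match nobody else's."""
    signatures = {
        user: tuple(track[h] for h in hours) for user, track in tracks.items()
    }
    counts = Counter(signatures.values())
    unique = sum(1 for sig in signatures.values() if counts[sig] == 1)
    return unique / len(tracks)


print(unique_fraction(tracks, (3,)))     # home only
print(unique_fraction(tracks, (3, 11)))  # home and work
```

In this toy data the home location alone singles out only one user in five, while home plus work separates all five. Note that the result is only "this track is distinct from the others"; putting a name to it needs some outside knowledge.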
Likewise with your medical record, your particular combination of illnesses is also likely to form an identifiable 'fingerprint'. And there could be quite a number of ways that someone using that data could make a link outside of the dataset.
True anonymity just doesn't exist.
The kinds of patterns that researchers will be looking for, the correlations between illnesses and potential causal factors, etc, are just the sort of patterns that will also provide a unique identification of individuals within a dataset, just the same as that mobile phone data research that I mentioned above.
I mean, you're not going to get very far if the dataset only allows a count of the number of people with each individual illness. To be useful, researchers will want to look at correlations, such as whether people with illness A also have illness B, etc. That inherently means each individual needs to be 'distinguished' in the dataset, so that you know each of the illnesses each individual had. But that, in turn, is likely to provide a unique enough 'fingerprint' to allow that record to be connected back to the individual in the real world.
That combination, like the data points from the mobile phones, is going to act as an identity for each individual in the dataset.
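As a rough illustration of that tension, the sketch below uses a handful of invented records with no names attached. The field names and the helpers `co_occurrence` and `unique_rows` are my own assumptions, purely for the example. The same per-person rows that let a researcher ask "how many people have both A and B?" also show which rows have a combination nobody else shares.

```python
from collections import Counter

# Hypothetical 'anonymised' records: one row per person, no names.
records = [
    {"row": 1, "illnesses": frozenset({"asthma"})},
    {"row": 2, "illnesses": frozenset({"asthma"})},
    {"row": 3, "illnesses": frozenset({"asthma", "diabetes"})},
    {"row": 4, "illnesses": frozenset({"diabetes", "gout"})},
    {"row": 5, "illnesses": frozenset({"asthma", "diabetes", "gout"})},
]


def co_occurrence(records, a, b):
    """The researcher's question: how many people have both illness a and b?"""
    return sum(1 for r in records if a in r["illnesses"] and b in r["illnesses"])


def unique_rows(records):
    """Rows whose exact combination of illnesses is shared with nobody else."""
    counts = Counter(r["illnesses"] for r in records)
    return [r["row"] for r in records if counts[r["illnesses"]] == 1]


print(co_occurrence(records, "asthma", "diabetes"))
print(unique_rows(records))
```

Here two people have both asthma and diabetes, and three of the five rows (3, 4 and 5) carry a combination that is unique within the dataset. A table of plain per-illness counts would hide those fingerprints, but it couldn't answer the co-occurrence question either.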
And some of that data might very well 'key' with data in other datasets, or what people know about you, enabling those who identify the key to 'join' the datasets (to use the database terminology). In essence, that chips away at the supposed anonymity - i.e. recognising that an 'identified' individual in one dataset is actually one and the same person as an 'identified' individual in another dataset (or someone you know in real life).
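A minimal sketch of such a 'join', assuming two invented datasets that happen to share a birth year and a district. Every field name, every value and the `link` function are hypothetical; real linkage would be messier, but the principle is the same.

```python
# Hypothetical research extract: no names, but some shared attributes.
research = [
    {"row": 1, "birth_year": 1961, "district": "AB1", "illness": "rare condition"},
    {"row": 2, "birth_year": 1975, "district": "AB1", "illness": "asthma"},
    {"row": 3, "birth_year": 1975, "district": "CD2", "illness": "diabetes"},
]

# Hypothetical outside knowledge: another dataset, or just people you know.
known_people = [
    {"name": "Person A", "birth_year": 1961, "district": "AB1"},
    {"name": "Person B", "birth_year": 1975, "district": "CD2"},
    {"name": "Person C", "birth_year": 1975, "district": "CD2"},
]

# The attributes assumed to act as the shared 'key'.
KEY = ("birth_year", "district")


def link(research, known_people, key=KEY):
    """Join on the key; keep only research rows matching exactly one person."""
    by_key = {}
    for person in known_people:
        by_key.setdefault(tuple(person[k] for k in key), []).append(person["name"])
    linked = {}
    for record in research:
        names = by_key.get(tuple(record[k] for k in key), [])
        if len(names) == 1:  # unambiguous match only
            linked[record["row"]] = names[0]
    return linked


print(link(research, known_people))
```

Only row 1 links to a single name here. Row 2 matches nobody in the outside data, and row 3 matches two people, so it stays ambiguous. The more attributes the two datasets share, the fewer ambiguous cases remain.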
In practice, it is going to have to end up like the NHS already.
When you go for an operation, you're not anonymous. The reception staff recognise your face. The nurses see and recognise your face and know what illness you're in there for, and so on. You don't go into hospital wearing a bag over your head so that people in the hospital can't identify you.
In practice, what matters is not that so many people know about you and your illness. That's unavoidable.
What matters is that the law protects you from them publicly disclosing that information.
And that works in the NHS right now. There's no reason why the same principle shouldn't apply to the provision of your medical data for research.
As long as the NHS imposes the requirement - backed up by legal protections - that those buying NHS data cannot make public any individually identifiable information, then that is the most pragmatic solution.
Trying to protect 'identity' by worrying about what is, or isn't, allowed in the dataset is just a rabbit hole with no defined end.
Do you wear a paper bag over your head when in the doctor's waiting room so that other patients don't recognise you? No, most people don't. Even though others in the waiting room will clearly be able to 'identify' you from your appearance, and may know you by name, and your presence there is giving them information that you need to see the doctor.
Life is always about finding a reasonable balance.