
...

The possibility of singling out an individual arises when the anonymised dataset contains unique identifiers, such as an IP address, a device ID or a combination of quasi-identifiers. In the latter case, however, identifying the individual takes additional steps, because several datasets containing the same person have to be merged.

Example

If the dataset has only one entry about a person who is male, aged between 31 and 40, has higher education, works in sub-unit Y of institution X and has ten years of experience, he is identifiable as an individual. In such a case, he could be identified merely from a public list of the staff members of institution X that includes their photos and brief CVs. He is also likely to be identifiable by all the employees of the same institution.

The main method to prevent the identification of an individual is k-anonymity, which requires that each combination of quasi-identifiers occurring in the dataset matches at least k different records. The value of k has to be chosen by the researchers themselves, depending on the sensitivity of the data and the specifics of the dataset.
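The k-anonymity requirement can be checked mechanically by counting how many records share each combination of quasi-identifiers. The following is a minimal sketch in Python, not a complete anonymisation tool; the function names, the column names (`sex`, `age`, `unit`) and the sample records are assumptions made up for illustration.

```python
from collections import Counter


def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group, i.e. the k the dataset achieves."""
    sizes = group_sizes(records, quasi_identifiers)
    return min(sizes.values()) if sizes else 0


def violating_groups(records, quasi_identifiers, k):
    """Return the combinations that match fewer than k records."""
    sizes = group_sizes(records, quasi_identifiers)
    return {combo: n for combo, n in sizes.items() if n < k}


# Hypothetical sample data: one man is alone in his group.
records = [
    {"sex": "M", "age": "31-40", "unit": "Y"},
    {"sex": "F", "age": "31-40", "unit": "Y"},
    {"sex": "F", "age": "31-40", "unit": "Y"},
    {"sex": "F", "age": "41-50", "unit": "Z"},
    {"sex": "F", "age": "41-50", "unit": "Z"},
]
QI = ["sex", "age", "unit"]

print(k_anonymity(records, QI))          # 1
print(violating_groups(records, QI, 2))  # {('M', '31-40', 'Y'): 1}
```

Groups reported by `violating_groups` would need further treatment, for example coarser categories or suppression of the affected records, before the dataset reaches the chosen k.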

The possibility of linkability arises when two datasets can be matched based on some characteristics (e.g. the same quasi-identifiers). In such a case, linking the two datasets may reveal that both contain the same unique combination of quasi-identifiers, which makes it possible to obtain additional information about some individuals and to identify them. Merging datasets has been the main way in which data initially considered anonymous have nevertheless been used to identify individuals.

Read more:

Linking genealogical databases and anonymous DNA donor data: Bohannon, J. (2013). Genealogy Databases Enable Naming of Anonymous DNA Donors. Science, 339(6117), 262
Identification of Netflix users based on movie ratings data thought to be anonymous: Narayanan, A.; Shmatikov, V. (2008). Robust De-anonymization of Large Sparse Datasets. IEEE Symposium on Security and Privacy, 111–125
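The linkage risk described above can be illustrated with a small sketch: a dataset without names is joined to a dataset with names on their shared quasi-identifiers, and every combination that is unique in both yields a re-identified record. This is a simplified illustration in Python under assumed data, not a reconstruction of the attacks in the cited papers; the function name, the field names and all sample rows are hypothetical.

```python
from collections import defaultdict


def link_unique(anonymous, identified, keys):
    """Join two datasets on `keys`, keeping only combinations unique in both."""

    def index(rows):
        idx = defaultdict(list)
        for row in rows:
            idx[tuple(row[k] for k in keys)].append(row)
        return idx

    anon_idx, id_idx = index(anonymous), index(identified)
    linked = []
    for combo, anon_rows in anon_idx.items():
        id_rows = id_idx.get(combo, [])
        # A one-to-one match ties the "anonymous" record to a named person.
        if len(anon_rows) == 1 and len(id_rows) == 1:
            linked.append({**id_rows[0], **anon_rows[0]})
    return linked


# Hypothetical survey data without names.
survey = [
    {"sex": "M", "age": "31-40", "unit": "Y", "answer": "dissatisfied"},
    {"sex": "F", "age": "31-40", "unit": "Y", "answer": "satisfied"},
    {"sex": "F", "age": "31-40", "unit": "Y", "answer": "neutral"},
]
# Hypothetical public staff list with names.
staff = [
    {"name": "Person A", "sex": "M", "age": "31-40", "unit": "Y"},
    {"name": "Person B", "sex": "F", "age": "31-40", "unit": "Y"},
    {"name": "Person C", "sex": "F", "age": "31-40", "unit": "Y"},
]

linked = link_unique(survey, staff, ["sex", "age", "unit"])
print(linked)  # only Person A is linked to a survey answer
```

The two women share the same combination, so neither can be linked with certainty, whereas the single man is re-identified together with his answer.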

Inference is possible if additional information is known about the person in the dataset. For example, people who work or study together know more about each other and can recognise each other in datasets without direct identifiers. The additional information may simply be the knowledge that a person you know took part in the survey, so one of the records must be about them. It is also possible to recognise a person by their voice or by their characteristic choice of words. A special case of inference is when a person recognises themselves in the data.

It is quite difficult to avoid inference, as the amount of possible background knowledge is unlimited and varies from person to person. It should also be kept in mind that k-anonymity may not protect against inference if the sensitive characteristics within a group of matching records are homogeneous.

Example

The dataset has at least five matches (k = 5) for the combination of four characteristics: female, 30–40 years old, from Tartu, employment status: on parental leave. One needs to know only three of these characteristics to obtain additional information on the fourth or to identify the person. In such a case, l-diversity should also be considered: it requires that each group of records sharing the same quasi-identifiers contains at least l different values of each sensitive characteristic. For example, l = 2 would require that among these five 30–40-year-old women from Tartu, employment status takes at least two values: some on parental leave, others actively employed, unemployed, etc.
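The check in this example can be sketched in the same way as a k-anonymity check: group the records by their quasi-identifiers and count the distinct values of the sensitive characteristic in each group. The sketch below implements only this simplest "distinct values" reading of l-diversity and treats employment status as the sensitive characteristic; the function name, field names and sample records are assumptions for illustration.

```python
from collections import defaultdict


def l_diversity(records, quasi_identifiers, sensitive):
    """Return the smallest number of distinct sensitive values in any group."""
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[q] for q in quasi_identifiers)].add(r[sensitive])
    return min(len(values) for values in groups.values()) if groups else 0


QI = ["sex", "age", "place"]


def make_group(statuses):
    """Build hypothetical records for women aged 30-40 from Tartu."""
    return [
        {"sex": "F", "age": "30-40", "place": "Tartu", "employment": s}
        for s in statuses
    ]


# Five matching records (k = 5), all with the same employment status.
homogeneous = make_group(["parental leave"] * 5)
# Five matching records with three different employment statuses.
diverse = make_group(
    ["parental leave", "parental leave", "employed", "parental leave", "unemployed"]
)

print(l_diversity(homogeneous, QI, "employment"))  # 1
print(l_diversity(diverse, QI, "employment"))      # 3
```

Both groups satisfy k = 5, but only the second satisfies l = 2: in the first, knowing that a woman belongs to the group already reveals her employment status.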

At some point, due to advances in technology or merging with new datasets, it may become possible to identify anonymised individuals, especially if the data are stored for decades. In this case, the risk of identification must be assessed, and it must be taken into account that if the data become identifiable, the data protection principles will apply again. The data controller must then assess reasonable identifiability and demonstrate that the data can indeed be considered anonymous.

...
