Page History

...

In its Guidelines 04/2020 on the use of location data and contact tracing tools in the context of the COVID-19 outbreak, the European Data Protection Board has stated that only entire datasets can be anonymised, not single data patterns. From a legal perspective, it is unclear to what level the dataset must be processed to be considered anonymous. Anonymisation methods offer varying degrees of protection and often depend on the specific dataset.

3.

...

4.1. Causes and timing of data anonymisation

The anonymisation of personal data helps to protect people’s privacy and supports the principle of minimisation: if research objectives can be achieved with anonymised data, anonymisation should be preferred in all cases.

...

Data can also be collected anonymously from the start, but if unique identifiers are stored in the process (e.g. computer IP address), post-processing is necessary to exclude the possibility of indirect identification of individuals. Therefore, it is important to carefully assess whether the planned method allows for collecting the data anonymously from the start or whether it is necessary to anonymise the data after the data collection or the completion of the study.

3.

...

4.2. Anonymisation entities

The University of Tartu is responsible for anonymising personal data, but the university researcher who has the necessary knowledge, skills and resources is responsible for the specific anonymisation activities. Anonymisation may also be carried out by persons not directly involved in the research, provided that the data subjects have been informed of that in advance and that the lawfulness and compliance with data protection principles of such anonymisation are ensured.

Where secondary data are used, they may be anonymised by the institution issuing the data.

3.

...

4.3. Methods of data anonymisation

The means of anonymisation largely depend on the nature and amount of personal data. Therefore, it is necessary to assess to what extent the chosen method prevents the association of the data with the person and whether this result is irreversible.

...

To increase transparency, the method of anonymisation should be precisely described to the data owner so that they can assess whether and to what extent they consider such processing to be adequate. This is particularly necessary when anonymised data are published as open scientific data.

3.

...

4.4. Avoiding the linking of data and persons

To reduce the possibility of attributing data to an individual, it is necessary to look at the characteristics of the dataset, such as the structure, type or amount of data. For example, surveys with a very narrow sample, which collect very precise values for many social characteristics or contain voluminous free-text responses, reduce anonymity. The European Data Protection Board’s Guidelines 04/2020 on the use of location data and contact tracing tools in the context of the COVID-19 outbreak addresses cases where data can be linked to an individual after anonymisation. To avoid this, it is important to be aware of the weaknesses of anonymisation.

The possibility of singling out an individual arises when the anonymised dataset contains unique identifiers, such as IP address, device ID or a combination of quasi-identifiers. In the latter case, however, additional steps are needed to identify the individual, as several datasets for the same person need to be merged.

Example

If the dataset has only one entry about a person who is male, aged between 31 and 40, has a higher education, works in sub-unit Y of institution X and has ten years of experience, he is identifiable as an individual. In such a case, he could be identified merely based on a public list of staff members of institution X, together with their photos and brief CVs. He is also likely to be identifiable by all the employees of the same institution.

The main method to avoid the identification of an individual is k-anonymity, which requires that for each combination of quasi-identifiers, there are at least k different matches in the dataset. The value of k-anonymity has to be chosen by the researchers themselves, depending on the sensitivity of the data and the specificities of the dataset.

The possibility of linkability arises when two datasets can be matched based on some characteristics (e.g. the same quasi-identifiers). In such a case, linking two datasets may reveal that they both contain a similar unique combination of quasi-identifiers, which allows to obtain additional information about some individuals and to identify them. Merging the datasets has been the main way in which data that were initially considered anonymous have nevertheless been used to identify individuals.

Read more:

Linking genealogical databases and anonymous DNA donor data: Bohannon, J. (2013). Genealogy Databases Enable Naming of Anonymous DNA Donors. Science, 339(6117), 262
Identification of Netflix users based on movie ratings data thought to be anonymous: Narayanan, A.; Shmatikov, V. (2008). Robust De-anonymisation of Large Sparse Datasets. IEEE Symposium on Security and Privacy, 111–125

Inference is possible if additional information is known about the person in the dataset. For example, people who work or study together know more about each other and can recognise each other from datasets without direct identifiers. Additional information may simply be the knowledge that a person you know took part in the survey – hence one of the data lines is about them. It is also possible to recognise a person by their voice or by the use of words characteristic of them. A special case of inference is when a person recognises him or herself from the data.

It is quite difficult to avoid inference, as the amount of possible background knowledge is indefinite and depends on the individual. It should also be kept in mind that k-anonymity may not protect against inferred knowledge if the protected characteristics are homogeneous.

Example

The dataset has at least five (k = 5) matches for the combination of four characteristics: female, 30–40 years old, from Tartu, employment status: on parental leave. One needs to know only three of the characteristics to obtain additional information on the fourth characteristic or to identify the person. In such a case, the l-diversity indicator should be considered, which assumes that there are also different values for each sensitive characteristic. For example, l-diversity = 2 would assume that for these five 30–40-year-old women from Tartu, the employment status should have at least two values: some on parental leave, some actively employed, unemployed, etc.

At some point, due to advances in technology or merging with new datasets, it may become possible to identify anonymised individuals, especially if the data are stored for decades. In this case, the risk of identification must be assessed, and it must be taken into account that if the data become identifiable, the data protection principles will apply again. The data controller must then assess reasonable identifiability and demonstrate that the data can indeed be considered anonymous.

3.

...

4.5. How to conduct an anonymous survey?

An anonymous survey collects responses in such a form and manner that respondents cannot be identified in any way.

...

Page tree

Versions Compared

Old Version 2

New Version Current

Key

3.

4.1. Causes and timing of data anonymisation

3.

4.2. Anonymisation entities

3.

4.3. Methods of data anonymisation

3.

4.4. Avoiding the linking of data and persons

3.

4.5. How to conduct an anonymous survey?