According to Opinion 05/2014 on Anonymisation Techniques of the Article 29 Data Protection Working Party, anonymisation is the irreversible processing of data, i.e. processing after which it is no longer possible to identify individuals by any means reasonably likely to be used. As a result, anonymised data are not vulnerable to attacks: even if the entire dataset fell into the hands of an attacker, the individuals behind it could not be re-identified. Anonymised data are therefore no longer considered personal data.
In its Guidelines 04/2020 on the use of location data and contact tracing tools in the context of the COVID-19 outbreak, the European Data Protection Board has stated that only entire datasets can be anonymised, not single data patterns. From a legal perspective, it is unclear to what level the dataset must be processed to be considered anonymous. Anonymisation methods offer varying degrees of protection and often depend on the specific dataset.
The anonymisation of personal data helps to protect people’s privacy and supports the principle of minimisation: if research objectives can be achieved with anonymised data, anonymisation should be preferred in all cases.
As anonymised data are no longer considered personal data, they can be used and shared more freely. They can be forwarded to partners in a research project, stored as open data in repositories or sent to other persons and institutions interested in them.
Anonymised data also make it easier to ensure the security of data processing. The only remaining risk, which should be reassessed periodically, is that individuals in an anonymised dataset may become re-identifiable as technology evolves and new datasets become available.
Anonymisation almost always reduces the availability of the data. If the data are voluminous, multivariate or qualitative, anonymisation may prevent their use or render them useless by distorting the data. For example, anonymising qualitative data from social sciences (interview transcripts, texts) may reduce the possibility of reusing them. Moreover, anonymised material does not allow the replication of scientific analyses based on personal data.
Data can also be collected anonymously from the start, but if unique identifiers are stored in the process (e.g. the computer's IP address), post-processing is necessary to exclude the possibility of indirect identification of individuals. Therefore, it is important to carefully assess whether the planned method allows for collecting the data anonymously from the start or whether it is necessary to anonymise the data after the data collection or the completion of the study.
The University of Tartu is responsible for anonymising personal data, but the university researcher who has the necessary knowledge, skills and resources is responsible for the specific anonymisation activities. Anonymisation may also be carried out by persons not directly involved in the research, provided that the data subjects have been informed of that in advance and that the lawfulness and compliance with data protection principles of such anonymisation are ensured.
Where secondary data are used, they may be anonymised by the institution issuing the data.
The means of anonymisation largely depend on the nature and amount of personal data. Therefore, it is necessary to assess to what extent the chosen method prevents the association of the data with the person and whether this result is irreversible.
The three most common methods of data anonymisation:
In addition, depending on the data to be anonymised, some specific cases can be distinguished.
As anonymisation must be irreversible, there must be no copy of the original data that could be recombined with the anonymised dataset. However, it is possible to make anonymised extracts of the dataset for public disclosure so that the original data are preserved. Once an extract has been made, it must no longer be possible to link it to the original data.
If previously pseudonymised data are anonymised, the secret key must be deleted. In addition, the adequacy of the pseudonymisation should be assessed: if only the direct identifiers are replaced by the pseudonym and not the data values, the dataset may contain unique quasi-identifier combinations that facilitate the identification of individuals. In this case, in addition to deleting the key, the data should be further processed – e.g. generalised – to exclude the possibility of indirect identification. However, if the data have been correctly pseudonymised, permanent deletion of the key may be sufficient.
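As a minimal sketch of the generalisation mentioned above, the following Python function replaces an exact age with a ten-year band so that the value no longer acts as a precise quasi-identifier (the field names and the record are hypothetical examples, not part of any university system):

```python
def generalise_age(age: int) -> str:
    """Replace an exact age with a ten-year band, coarsening a quasi-identifier."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

# Hypothetical record after the pseudonymisation key has been deleted:
record = {"age": 34, "unit": "Y", "experience_years": 10}
record["age"] = generalise_age(record["age"])  # "30-39"
```

Generalisation of this kind must still be checked against the whole dataset: a banded value that remains unique in its combination with other attributes does not prevent indirect identification.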
To increase transparency, the method of anonymisation should be precisely described to the data owner so that they can assess whether and to what extent they consider such processing to be adequate. This is particularly necessary when anonymised data are published as open scientific data.
To reduce the possibility of attributing data to an individual, it is necessary to look at the characteristics of the dataset, such as the structure, type or amount of data. For example, a survey with a very narrow sample that collects very precise values for many social characteristics, or that contains voluminous free-text responses, reduces the respondents' anonymity. The European Data Protection Board's Guidelines 04/2020 on the use of location data and contact tracing tools in the context of the COVID-19 outbreak address cases where data can be linked to an individual after anonymisation. To avoid this, it is important to be aware of the weaknesses of anonymisation.
Example: If the dataset has only one entry about a person who is male, aged between 31 and 40, has a higher education, works in sub-unit Y of institution X and has ten years of experience, he is identifiable as an individual. In such a case, he could be identified merely based on a public list of staff members of institution X, together with their photos and brief CVs. He is also likely to be identifiable by all the employees of the same institution.
The main method to avoid the identification of an individual is k-anonymity, which requires that for each combination of quasi-identifiers, there are at least k matching records in the dataset. The value of k has to be chosen by the researchers themselves, depending on the sensitivity of the data and the specificities of the dataset.
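The k-anonymity requirement can be checked mechanically. The sketch below (field names and records are hypothetical) computes the smallest group size over all quasi-identifier combinations; the dataset satisfies k-anonymity only if that minimum is at least k:

```python
from collections import Counter

def min_group_size(records, quasi_identifiers):
    """Smallest number of records sharing one combination of quasi-identifiers.

    A dataset is k-anonymous if and only if this value is >= k.
    """
    counts = Counter(
        tuple(r[q] for q in quasi_identifiers) for r in records
    )
    return min(counts.values())

# Hypothetical records mirroring the example above:
records = [
    {"gender": "M", "age": "31-40", "edu": "higher", "unit": "Y"},
    {"gender": "M", "age": "31-40", "edu": "higher", "unit": "Y"},
    {"gender": "F", "age": "31-40", "edu": "higher", "unit": "Y"},
]
qi = ["gender", "age", "edu", "unit"]

k = min_group_size(records, qi)  # 1: the single female record is unique
```

A result of 1 signals exactly the situation in the example above: one record is unique in its quasi-identifier combination, so the dataset is not anonymous and must be generalised or suppressed further.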
It is quite difficult to avoid inference, as the amount of possible background knowledge is indefinite and depends on the individual. It should also be kept in mind that k-anonymity may not protect against inference if the sensitive values within a group are homogeneous.
Example: The dataset has at least five (k = 5) matches for the combination of four characteristics: female, 30–40 years old, from Tartu, employment status: on parental leave. If all five share the same employment status, one needs to know only three of the characteristics to infer the fourth or to learn something new about the person. In such a case, the l-diversity indicator should be considered, which requires that each group also contains several different values of the sensitive characteristic. For example, l-diversity = 2 would require that for these five 30–40-year-old women from Tartu, the employment status has at least two values: some on parental leave, some actively employed, unemployed, etc.
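Like k-anonymity, l-diversity can be verified directly. The sketch below (hypothetical field names and records) counts the distinct sensitive values within each quasi-identifier group and reports the smallest such count, which must be at least l:

```python
from collections import defaultdict

def min_l_diversity(records, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values within any quasi-identifier group."""
    groups = defaultdict(set)
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        groups[key].add(r[sensitive])
    return min(len(values) for values in groups.values())

# Hypothetical group of five 30-40-year-old women from Tartu:
records = [
    {"gender": "F", "age": "30-40", "city": "Tartu", "status": "parental leave"},
    {"gender": "F", "age": "30-40", "city": "Tartu", "status": "parental leave"},
    {"gender": "F", "age": "30-40", "city": "Tartu", "status": "employed"},
    {"gender": "F", "age": "30-40", "city": "Tartu", "status": "unemployed"},
    {"gender": "F", "age": "30-40", "city": "Tartu", "status": "employed"},
]

l = min_l_diversity(records, ["gender", "age", "city"], "status")  # 3
```

Here the group contains three distinct employment statuses, so knowing a person's gender, age band and city no longer reveals her employment status.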
An anonymous survey collects responses in such a form and manner that respondents cannot be identified in any way.
When collecting data from people in an online survey, it should be borne in mind that IP addresses are also personal data (see also 1.3.2) and, when stored, may render the individuals identifiable. In this case, the survey is not anonymous but collects personal data. However, anonymisation is possible if the data are post-processed – for example, if IP addresses are permanently deleted after the data collection. Participants in the survey must be clearly informed of both the collection of personal data and their subsequent anonymisation.
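The post-processing described above can be as simple as stripping the identifying field from every stored response. A minimal sketch, assuming the survey tool exports responses as records with a hypothetical "ip" field:

```python
def anonymise_responses(responses):
    """Return survey responses with the stored IP addresses removed.

    The original list must then be permanently deleted, so that no copy
    containing the IP addresses remains.
    """
    return [{k: v for k, v in r.items() if k != "ip"} for r in responses]

raw = [
    {"ip": "192.0.2.10", "q1": "yes"},
    {"ip": "192.0.2.11", "q1": "no"},
]
clean = anonymise_responses(raw)  # [{'q1': 'yes'}, {'q1': 'no'}]
```

Note that removing the IP address alone is sufficient only if the remaining answers do not themselves allow indirect identification; otherwise the checks described in the previous section (k-anonymity, l-diversity) still apply.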
Some survey environments also allow you to configure what additional data are collected and stored in the survey. If it is possible to turn off the collection of IP addresses and other such data, the data collection can be considered anonymous. However, even in a carefully configured survey, the responses themselves can make a person identifiable – for example, if the survey asks for contact details.
At the university, using the LimeSurvey environment is recommended, as it offers additional options to ensure anonymity, including turning off the automatic recording of the respondent's IP address. If the researcher uses LimeSurvey or another environment recognised by the university, support is available from the IT helpdesk (arvutiabi@ut.ee). When using environments not recognised by the university, the IT helpdesk cannot assist the researcher in case of problems.