3.4. Why and how to anonymise personal data?

According to Opinion 05/2014 on Anonymisation Techniques of the Article 29 Data Protection Working Party, anonymisation is the irreversible processing of data, after which it is no longer possible to identify individuals by any reasonable and likely method. As a result, anonymised data are not vulnerable to attacks: even if all the data fell into the hands of an attacker, they could not be linked back to individuals. Therefore, anonymised data are not considered personal data.

In its Guidelines 04/2020 on the use of location data and contact tracing tools in the context of the COVID-19 outbreak, the European Data Protection Board has stated that only entire datasets can be anonymised, not single data patterns. From a legal perspective, it is unclear to what level the dataset must be processed to be considered anonymous. Anonymisation methods offer varying degrees of protection and often depend on the specific dataset.

3.4.1. Causes and timing of data anonymisation

The anonymisation of personal data helps to protect people’s privacy and supports the principle of data minimisation: if research objectives can be achieved with anonymised data, anonymisation should be preferred in all cases.

As anonymised data are no longer considered personal data, they can be used and shared more freely. They can be forwarded to partners in a research project, stored as open data in repositories or sent to other persons and institutions interested in them.

Anonymised data also make it easier to ensure the security of data processing. The only risk to bear in mind and assess from time to time is the possibility that individuals in anonymised datasets may become re-identifiable as technology evolves and new datasets are added.

Anonymisation almost always reduces the usability of the data. If the data are voluminous, multivariate or qualitative, anonymisation may distort them to the point where they can no longer be used. For example, anonymising qualitative data from social sciences (interview transcripts, texts) may reduce the possibility of reusing them. Moreover, anonymised material does not allow the replication of scientific analyses based on personal data.

Data can also be collected anonymously from the start, but if unique identifiers are stored in the process (e.g. computer IP address), post-processing is necessary to exclude the possibility of indirect identification of individuals. Therefore, it is important to carefully assess whether the planned method allows for collecting the data anonymously from the start or whether it is necessary to anonymise the data after the data collection or the completion of the study.

3.4.2. Anonymisation entities

The University of Tartu is responsible for anonymising personal data, but the university researcher who has the necessary knowledge, skills and resources is responsible for the specific anonymisation activities. Anonymisation may also be carried out by persons not directly involved in the research, provided that the data subjects have been informed of that in advance and that such anonymisation is lawful and complies with the data protection principles.

Where secondary data are used, they may be anonymised by the institution issuing the data.

3.4.3. Methods of data anonymisation

The means of anonymisation largely depend on the nature and amount of personal data. Therefore, it is necessary to assess to what extent the chosen method prevents the association of the data with the person and whether this result is irreversible.

The three most common methods of data anonymisation:

Removal involves deleting or permanently replacing all directly identifiable features (name, personal identification code). Removing direct identifiers does not immediately guarantee anonymity, as a person can also be identified from other data: for example, they can be distinguished by a unique combination of identifiers or when different datasets are combined;
Randomisation involves the random distortion of data based on certain values or characteristics. As the data get distorted, randomisation may not be suitable for the publication of scientific data. On the other hand, randomisation is used to protect large public datasets against re-identification;
Generalisation involves grouping values by characteristics. For example, birth years can be grouped into age ranges, wage amounts into wage ranges, etc. Generalisation helps to ensure that an individual is not identifiable but has the disadvantage of reducing the degree of precision of the value.
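
The three methods above can be illustrated with a minimal Python sketch. The record, the field names (name, personal_code, birth_year, wage) and the noise range are made up for illustration; a real dataset needs a method chosen and assessed for that dataset. Note that simple bounded noise, as used here, offers no formal privacy guarantee.

```python
import random


def remove_direct_identifiers(record, direct_identifiers=("name", "personal_code")):
    """Removal: drop the fields that identify a person directly."""
    return {k: v for k, v in record.items() if k not in direct_identifiers}


def generalise_birth_year(birth_year, reference_year, width=10):
    """Generalisation: replace an exact birth year with an age range."""
    age = reference_year - birth_year
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def randomise_wage(wage, rng, max_noise=100):
    """Randomisation: distort a value with bounded random noise."""
    return wage + rng.randint(-max_noise, max_noise)


# Hypothetical record, for illustration only.
record = {
    "name": "Example Person",
    "personal_code": "00000000000",
    "birth_year": 1988,
    "wage": 2150,
}

rng = random.Random(42)  # fixed seed only to make the sketch reproducible
anonymised = remove_direct_identifiers(record)
anonymised["age_range"] = generalise_birth_year(
    anonymised.pop("birth_year"), reference_year=2024
)
anonymised["wage"] = randomise_wage(anonymised["wage"], rng)
```

After these steps, the record no longer contains the name, personal identification code or exact birth year, and the wage is only approximately correct. Whether the result is actually anonymous still depends on the rest of the dataset.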

In addition, depending on the data to be anonymised, some specific cases can be distinguished.

Anonymisation of an extract of a dataset

As anonymisation must be irreversible, there must be no copy of the original data that could be recombined with the anonymised dataset. However, it is possible to make anonymised extracts of the dataset for public disclosure so that the original data are preserved. Once an extract has been made, it must no longer be possible to link it to the original data.

Anonymisation of pseudonymised data

If previously pseudonymised data are anonymised, the secret key must be deleted. In addition, the adequacy of the pseudonymisation should be assessed: if only the direct identifiers are replaced by the pseudonym and not the data values, the dataset may contain unique quasi-identifier combinations that facilitate the identification of individuals. In this case, in addition to deleting the key, the data should be further processed – e.g. generalised – to exclude the possibility of indirect identification. However, if the data have been correctly pseudonymised, permanent deletion of the key may be sufficient.

To increase transparency, the method of anonymisation should be precisely described to the data owner so that they can assess whether and to what extent they consider such processing to be adequate. This is particularly necessary when anonymised data are published as open scientific data.

3.4.4. Avoiding the linking of data and persons

To reduce the possibility of attributing data to an individual, it is necessary to look at the characteristics of the dataset, such as the structure, type or amount of data. For example, surveys that have a very small sample, collect very precise values for many social characteristics or contain voluminous free-text responses reduce anonymity. The European Data Protection Board’s Guidelines 04/2020 on the use of location data and contact tracing tools in the context of the COVID-19 outbreak address cases where data can be linked to an individual after anonymisation. To avoid this, it is important to be aware of the weaknesses of anonymisation.

The possibility of singling out an individual arises when the anonymised dataset contains unique identifiers, such as an IP address or device ID, or a unique combination of quasi-identifiers. In the latter case, however, additional steps are needed to identify the individual, as several datasets for the same person need to be merged.

Example

If the dataset has only one entry about a person who is male, aged between 31 and 40, has a higher education, works in sub-unit Y of institution X and has ten years of experience, he is identifiable as an individual. In such a case, he could be identified merely based on a public list of staff members of institution X, together with their photos and brief CVs. He is also likely to be identifiable by all the employees of the same institution.

The main method to avoid the identification of an individual is k-anonymity, which requires that for each combination of quasi-identifiers, there are at least k matching records in the dataset. The value of k has to be chosen by the researchers themselves, depending on the sensitivity of the data and the specificities of the dataset.
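
A k-anonymity check can be sketched in a few lines of Python. The function names, field names and records below are hypothetical and serve only to show the counting logic: the k of a dataset is the size of its smallest group of records sharing the same quasi-identifier values.

```python
from collections import Counter


def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group, i.e. the k the dataset achieves."""
    return min(group_sizes(records, quasi_identifiers).values(), default=0)


def violating_groups(records, quasi_identifiers, k):
    """Return the combinations that occur fewer than k times."""
    sizes = group_sizes(records, quasi_identifiers)
    return {combo: n for combo, n in sizes.items() if n < k}


# Made-up records for illustration.
records = [
    {"sex": "M", "age_range": "31-40", "education": "higher"},
    {"sex": "M", "age_range": "31-40", "education": "higher"},
    {"sex": "F", "age_range": "31-40", "education": "higher"},
    {"sex": "F", "age_range": "31-40", "education": "higher"},
    {"sex": "F", "age_range": "41-50", "education": "secondary"},
]
quasi_identifiers = ["sex", "age_range", "education"]

dataset_k = k_anonymity(records, quasi_identifiers)            # 1
too_small = violating_groups(records, quasi_identifiers, k=2)  # one unique record
```

Here the last record is the only one with its combination of values, so the dataset is only 1-anonymous. To reach k = 2, that record would have to be generalised further or removed.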

Linkability arises when two datasets can be matched based on some characteristics (e.g. the same quasi-identifiers). In such a case, linking two datasets may reveal that they both contain a similar unique combination of quasi-identifiers, which makes it possible to obtain additional information about some individuals and to identify them. Merging datasets has been the main way in which data that were initially considered anonymous have nevertheless been used to identify individuals.

Read more:

Linking genealogical databases and anonymous DNA donor data: Bohannon, J. (2013). Genealogy Databases Enable Naming of Anonymous DNA Donors. Science, 339(6117), 262
Identification of Netflix users based on movie ratings data thought to be anonymous: Narayanan, A.; Shmatikov, V. (2008). Robust De-anonymization of Large Sparse Datasets. IEEE Symposium on Security and Privacy, 111–125
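
The linking risk described above can be illustrated with a small Python sketch. Both datasets, all field names and the placeholder names ("Person A" etc.) are invented; the sketch only shows that a record is re-identified when exactly one person in a public list shares its quasi-identifiers.

```python
def link(anonymised, public, quasi_identifiers):
    """Return {record index: name} for records that match exactly one public entry."""
    # Index the public list by quasi-identifier combination.
    index = {}
    for person in public:
        key = tuple(person[q] for q in quasi_identifiers)
        index.setdefault(key, []).append(person["name"])

    matches = {}
    for i, rec in enumerate(anonymised):
        key = tuple(rec[q] for q in quasi_identifiers)
        candidates = index.get(key, [])
        if len(candidates) == 1:  # a unique match re-identifies the record
            matches[i] = candidates[0]
    return matches


# "Anonymous" survey responses without direct identifiers (made up).
survey = [
    {"sex": "M", "age_range": "31-40", "unit": "Y", "answer": "dissatisfied"},
    {"sex": "F", "age_range": "31-40", "unit": "Y", "answer": "satisfied"},
]

# Public staff list of the same institution (made up).
staff = [
    {"name": "Person A", "sex": "M", "age_range": "31-40", "unit": "Y"},
    {"name": "Person B", "sex": "F", "age_range": "31-40", "unit": "Y"},
    {"name": "Person C", "sex": "F", "age_range": "31-40", "unit": "Y"},
]

reidentified = link(survey, staff, ["sex", "age_range", "unit"])
```

The first survey response matches only one staff member and is therefore attributable to that person, together with the answer given. The second response matches two staff members and cannot be attributed with certainty from this list alone.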

Inference is possible if additional information is known about the person in the dataset. For example, people who work or study together know more about each other and can recognise each other from datasets without direct identifiers. Additional information may simply be the knowledge that a person you know took part in the survey, which means that one of the records is about them. It is also possible to recognise a person by their voice or by the use of words characteristic of them. A special case of inference is when a person recognises themselves from the data.

It is quite difficult to avoid inference, as the amount of possible background knowledge is unlimited and depends on the individual. It should also be kept in mind that k-anonymity may not protect against inferred knowledge if the sensitive characteristics within a group are homogeneous.

Example

The dataset has at least five (k = 5) matches for the combination of four characteristics: female, 30–40 years old, from Tartu, employment status: on parental leave. If all women of that age from Tartu in the dataset share the same employment status, one needs to know only three of the characteristics to obtain additional information on the fourth characteristic or to identify the person. In such a case, the l-diversity indicator should be considered, which requires that each group also contains at least l different values of each sensitive characteristic. For example, l = 2 would require that these five 30–40-year-old women from Tartu have at least two different employment statuses: some on parental leave, others actively employed, unemployed, etc.
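
The example can be checked with a short Python sketch. The function name, field names and records are hypothetical; employment status is treated as the sensitive characteristic and the other three characteristics as quasi-identifiers. The l of a dataset is the smallest number of distinct sensitive values found in any group.

```python
from collections import defaultdict


def l_diversity(records, quasi_identifiers, sensitive):
    """Return the smallest number of distinct sensitive values in any group."""
    values = defaultdict(set)
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        values[key].add(r[sensitive])
    return min((len(v) for v in values.values()), default=0)


def make_group(statuses):
    """Build made-up records for women aged 30-40 from Tartu."""
    return [
        {"sex": "F", "age_range": "30-40", "city": "Tartu", "employment": s}
        for s in statuses
    ]


quasi_identifiers = ["sex", "age_range", "city"]

# All five women are on parental leave: k = 5, but l = 1.
homogeneous = make_group(["parental leave"] * 5)
l_homogeneous = l_diversity(homogeneous, quasi_identifiers, "employment")

# Two of the five are employed: k = 5 and l = 2.
diverse = make_group(["parental leave"] * 3 + ["employed"] * 2)
l_diverse = l_diversity(diverse, quasi_identifiers, "employment")
```

In the first group, knowing that a woman aged 30–40 from Tartu took part is enough to conclude that she is on parental leave, even though the group satisfies k = 5. In the second group, the same background knowledge no longer reveals her employment status with certainty.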

At some point, due to advances in technology or merging with new datasets, it may become possible to identify anonymised individuals, especially if the data are stored for decades. In this case, the risk of identification must be assessed, and it must be taken into account that if the data become identifiable, the data protection principles will apply again. The data controller must then assess reasonable identifiability and demonstrate that the data can indeed be considered anonymous.

3.4.5. How to conduct an anonymous survey?

An anonymous survey collects responses in such a form and manner that respondents cannot be identified in any way.

When collecting data from people in an online survey, it should be borne in mind that IP addresses are also personal data (see also 1.3.2) and, when stored, may render the individuals identifiable. In this case, the survey is not anonymous but collects personal data. However, anonymisation is possible if the data are post-processed – for example, if IP addresses are permanently deleted after the data collection. Participants in the survey must be clearly informed of both the collection of personal data and their subsequent anonymisation.
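
The post-processing step mentioned above could look roughly like the following Python sketch. The field name ip_address and the response structure are assumptions, as the actual export format depends on the survey environment; the addresses come from ranges reserved for documentation.

```python
def drop_fields(responses, fields=("ip_address",)):
    """Return copies of the responses without the listed technical fields."""
    return [{k: v for k, v in r.items() if k not in fields} for r in responses]


# Made-up raw export of an online survey.
raw_responses = [
    {"ip_address": "192.0.2.10", "q1": "yes", "q2": 4},
    {"ip_address": "198.51.100.7", "q1": "no", "q2": 2},
]

cleaned = drop_fields(raw_responses)
```

The function only produces a cleaned copy. For the deletion to be permanent, the raw export and any copies of it that still contain the IP addresses must also be deleted.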

Some survey environments also allow you to configure what additional data are collected and stored in the survey. If it is possible to turn off the collection of IP addresses and other data, the data collection can be considered anonymous. However, it is important to be aware that even in a very carefully configured survey, the responses themselves can make a person identifiable, for example, if the survey asks for contact details.

At the university, the LimeSurvey environment is recommended, as it offers additional options to ensure anonymity, including turning off the automatic recording of the respondent’s IP address. If the researcher uses LimeSurvey or another environment recognised by the university, support is available from the IT helpdesk (arvutiabi@ut.ee) if questions arise. When using environments not recognised by the university, the IT helpdesk cannot assist the researcher in case of problems.
