According to Article 4(5) of the GDPR, pseudonymisation means the “processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person”. Thus, pseudonymisation replaces all the data or identifiers that allow a person’s direct or indirect identification with a pseudonym, after which the person can no longer be identified without the additional information.

However, pseudonymisation is reversible. The additional information referred to in the GDPR, such as a key, a code or any other identifying information, can be used to re-establish the original link between the data and the person. In its simplest form, this additional information can be, for example, a table of identifiable data and the pseudonyms assigned to replace them. To ensure security, this information must be carefully protected.
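As a minimal sketch of such a table, the Python code below splits a small dataset into a pseudonymised part and a lookup table that would be stored separately. The field names (`name`, `personal_code`, `diagnosis`), the sample records and the function names are invented for illustration and do not come from the GDPR or any specific tool.

```python
import secrets

def pseudonymise(records, id_fields):
    """Split records into pseudonymised rows and a separate lookup table."""
    lookup = {}   # pseudonym -> identifying fields; must be stored separately
    seen = {}     # identifying values -> pseudonym, so one person keeps one pseudonym
    pseudonymised = []
    for record in records:
        identity = tuple(record[f] for f in id_fields)
        if identity not in seen:
            pseudonym = secrets.token_hex(8)
            while pseudonym in lookup:   # redraw in the unlikely case of a clash
                pseudonym = secrets.token_hex(8)
            seen[identity] = pseudonym
            lookup[pseudonym] = dict(zip(id_fields, identity))
        row = {k: v for k, v in record.items() if k not in id_fields}
        row["pseudonym"] = seen[identity]
        pseudonymised.append(row)
    return pseudonymised, lookup

def reidentify(row, lookup):
    """Restore the identifying fields; only possible with the lookup table."""
    restored = dict(lookup[row["pseudonym"]])
    restored.update({k: v for k, v in row.items() if k != "pseudonym"})
    return restored

# Invented sample data: the first and third record belong to the same person.
records = [
    {"name": "Mari Maasikas", "personal_code": "00000000001", "diagnosis": "A"},
    {"name": "Jaan Tamm", "personal_code": "00000000002", "diagnosis": "B"},
    {"name": "Mari Maasikas", "personal_code": "00000000001", "diagnosis": "C"},
]
data, table = pseudonymise(records, ["name", "personal_code"])
```

Here `data` can be passed on for analysis, while `table` is the additional information that has to be kept apart from it. Calling `reidentify(data[1], table)` restores the original record, which is exactly why the table needs careful protection.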

Thus, pseudonymisation is a generic term for all data processing methods that allow both de-identification and re-identification of a person. It should not be forgotten that pseudonymised data are still personal data. Even if researchers cannot identify an individual by looking at the data, the data must be treated the same way as identifiable data, and all data protection principles must be respected.

There are different understandings of pseudonymised data in Estonian law. For example, the Data Protection Inspectorate has drawn attention to the need to amend section 7 of the Human Genes Research Act. It reads: “The provisions regulating the processing of personal data do not apply to the processing of pseudonymised tissue samples, pseudonymised descriptions of DNA and pseudonymised descriptions of state of health if such tissue samples, descriptions of DNA and descriptions of state of health are processed as a set of data and on the condition that the set of data to be processed contains DNA samples, descriptions of DNA or descriptions of state of health of at least five gene donors at a time.” However, Recital 26 of the GDPR states that pseudonymised personal data should be considered information on an identifiable natural person. Thus, pseudonymised genetic or health data cannot be classified as non-personal data to which neither the GDPR nor the Personal Data Protection Act apply.

3.2.1.      Causes and timing of data pseudonymisation

According to the GDPR, pseudonymisation enhances the security of personal data processing and supports data protection by design. The principle of data minimisation must be respected: if the processing does not require the identification of the data subject, processing the data in identifiable form is not justified. Pseudonymisation therefore matters not only when personal data are transferred but also within a research institution or research project, where it reduces the number of researchers who can identify individuals from the data.

The more sensitive the data, the more necessary pseudonymisation may be. It should also be considered when data are transferred to third parties.

Personal data should be pseudonymised as soon as possible. For example, in a research project with several partners abroad, this should be done immediately after data collection and before starting the analysis or transferring the data to project partners.

3.2.2.      Pseudonymisation entities

As stated in the 2019 guidelines “Pseudonymisation techniques and best practices” by the European Union Agency for Cybersecurity (ENISA), a pseudonymisation entity can be a data controller, a data processor or a trusted third party. However, the responsibility for the security of data processing always rests with the controller.

Pseudonymisation is certainly necessary in the case of a joint study between several institutions. For example, two or more partners may be joint controllers, but they must agree that the research institution collecting the personal data pseudonymises the data before transferring them to the joint controllers. In this way, the principles of minimisation and security in data processing are respected. Similarly, the personal data may be pseudonymised by the processor (e.g. the survey company) before transferring the data to the research institution.

3.2.3.      Methods of data pseudonymisation

An institution- or project-based pseudonymisation policy can be built on the ENISA guidelines on pseudonymisation techniques and best practices, which recommend a risk-based approach to choosing the pseudonymisation method. The risks considered include potential attacks on pseudonymised datasets, the sensitivity of the data, the availability of the data and the need to protect the data.

In most cases, pseudonymisation is not merely replacing a person’s name and personal identification code with a pseudonym but needs to consider all data that can be easily associated with that person. The type of data is also important – for example, pseudonymisation of identifiers is not appropriate for images and pictorial data, but it is useful if the file names of images or metadata of images contain identifiers that may allow the identification of individuals. The processor of pseudonymised data must be unable to identify the persons behind the data.

There are different ways to replace identifiers with pseudonyms:

  • Counter uses numbers generated based on a predefined sequence to replace identifiers. The advantage of this method is simplicity and the fact that the number assigned by the counter has no direct relationship to the identifier to be replaced;
  • Random number generator also replaces identifiers with numbers but with random ones. Random numbers are safer than the counter because pseudonyms are not generated sequentially. However, the disadvantage is that two identifiers can be associated with the same pseudonym. The likelihood of that can be reduced by generating longer numbers;
  • Cryptographic hash function maps identifiers of arbitrary length to a fixed-length code. The hash function is a one-way method, meaning it is extremely difficult to compute the original value from the hash code. It is also collision-resistant, i.e., it is practically infeasible to find two identifiers that result in the same hash code. However, since the same input always results in the same hash, anyone who knows the hash function and can guess or list the possible identifiers can depseudonymise the data by hashing the candidates and comparing the results;
  • Message authentication code (keyed hash function) uses a secret key in addition to the hash function to generate the pseudonym. Without the key, it is not possible to recompute a pseudonym from a known or guessed identifier, which gives additional assurance that the identifier cannot be recovered from the hash code;
  • Symmetric encryption uses a single secret key for encryption and decryption;
  • For smaller datasets, pseudonyms can also be created manually, e.g. by replacing a person’s identifiers and quasi-identifiers[1] with a generic name such as interviewee A or subject M45. However, replacing names with initials or pseudonymising only some quasi-identifiers may not provide very strong protection against identification. Random pseudonyms are almost always safer than systematic ones.
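The first four methods in the list above can be sketched with the Python standard library alone. This is an illustration under assumptions, not a recommended implementation: the function names, the sample identifiers, the choice of SHA-256 and the 32-byte key length are all choices made for this example. Symmetric encryption is left out because it would need a third-party library.

```python
import hashlib
import hmac
import itertools
import secrets

def counter_pseudonyms(identifiers, start=1):
    """Counter: assign consecutive numbers in order of first appearance."""
    counter = itertools.count(start)
    mapping = {}
    for identifier in identifiers:
        if identifier not in mapping:
            mapping[identifier] = next(counter)
    return mapping

def random_pseudonyms(identifiers, n_bytes=8):
    """Random numbers: draw from a secure generator, redrawing on a clash."""
    mapping, used = {}, set()
    for identifier in identifiers:
        if identifier in mapping:
            continue
        value = secrets.randbits(8 * n_bytes)
        while value in used:   # two identifiers must not share a pseudonym
            value = secrets.randbits(8 * n_bytes)
        used.add(value)
        mapping[identifier] = value
    return mapping

def hash_pseudonym(identifier):
    """Unkeyed cryptographic hash (SHA-256 chosen for this sketch)."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()

def keyed_hash_pseudonym(identifier, key):
    """Message authentication code (HMAC-SHA-256) with a secret key."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

ids = ["00000000001", "00000000002", "00000000001"]   # invented identifiers
key = secrets.token_bytes(32)   # the pseudonymisation secret

by_counter = counter_pseudonyms(ids)
by_random = random_pseudonyms(ids)

# Anyone who can guess an identifier can recompute its unkeyed hash ...
guess_matches = hash_pseudonym("00000000001") == hash_pseudonym(ids[0])
# ... but recomputing a keyed hash with a different key gives a different value.
wrong_key_matches = hmac.compare_digest(
    keyed_hash_pseudonym(ids[0], secrets.token_bytes(32)),
    keyed_hash_pseudonym(ids[0], key),
)
```

With the counter and the random generator, the mapping itself is the pseudonymisation secret and has to be stored. With the keyed hash, only the key has to be stored, because the same key and identifier always reproduce the same pseudonym.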

Regardless of the method, it is important to protect the pseudonymisation secret – the key, code, method or other data that allow the pseudonym to be associated with an individual. If this secret is leaked due to a cyberattack, it is possible to identify people in all the datasets compiled in the pseudonymised form. Such an attack is even more dangerous if the same pseudonymisation method is used all the time. In this case, a very serious privacy violation can occur.
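One possible way to limit the damage of a leaked secret, shown here only as a sketch, is to give each dataset its own key so that the same person receives unrelated pseudonyms in different datasets. The dataset names, the helper functions and the truncation to 16 hexadecimal characters are assumptions made for this example.

```python
import hashlib
import hmac
import secrets

def new_dataset_key():
    """Generate an independent 32-byte secret for one dataset."""
    return secrets.token_bytes(32)

def pseudonym(identifier, key):
    """Keyed hash of the identifier, shortened for readability."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]

# Hypothetical datasets, each with its own pseudonymisation secret.
keys = {
    "survey_2023": new_dataset_key(),
    "interviews_2024": new_dataset_key(),
}

person = "00000000001"   # invented identifier
p_survey = pseudonym(person, keys["survey_2023"])
p_interviews = pseudonym(person, keys["interviews_2024"])
```

Because the two keys are independent, `p_survey` and `p_interviews` differ, and a leak of one key does not reveal who is behind the pseudonyms in the other dataset.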

To maintain secrecy, as few people as possible should have access to the depseudonymisation information, but it is good to have more than one such person in case something happens to the owner of the information or if the person leaves the job.


[1] A quasi-identifier is a characteristic such as gender, age or nationality that, on its own, cannot uniquely identify a person but can be combined with other characteristics to point to a specific person.