
Guidelines for Anonymization & Pseudonymization

The New School is required by privacy and data protection laws and regulations to protect the Personal Data it Processes from inappropriate disclosure or use. Anonymization and Pseudonymization can help achieve this objective when datasets containing elements of Personal Data must be Processed by parties who should not have access to the values of those elements.

Anonymization

Anonymization is a de-identification technique that involves the complete and irreversible removal of any information from a dataset that could lead to an individual being identified, either from the removed information itself or by combining the removed information with other data held by the university or a third party. Anonymization renders data permanently anonymous; the process cannot be reversed to re-identify individuals.

To anonymize any dataset, sufficient elements must be removed from the dataset such that it can no longer be used by the Data Controller or a third party to identify a Data Subject by using “all the means reasonably likely to be used.”

Fully anonymized information is not Personal Data and therefore is not covered by privacy and data protection laws and regulations.

Risks of re-identification

When considering “all the means reasonably likely to be used,” it is important to consider the specific anonymization technique being used, the current state of technology, and the following three risks that anonymization must address:

  1. Singling out, the possibility of isolating some or all records that identify an individual in the dataset.
  2. Linkability, the ability to link multiple records concerning the same Data Subject or a group of Data Subjects (either in the same dataset or multiple different datasets). If an attacker can establish (e.g., by means of correlation analysis) that two records are assigned to the same group of individuals but cannot single out individuals in this group, the technique provides resistance against singling out but not against linkability.
  3. Inference, the possibility of deducing, with significant probability of correctness, the value of an attribute from the values of a set of other attributes.

An effective anonymization solution prevents all parties from singling out an individual in a dataset, from linking two records within a dataset (or between two separate datasets), and from inferring any information in the dataset. This represents an extremely high bar and can be very difficult to achieve in practice, because it is hard to anticipate the capabilities of modern "big data" processing systems.

Example

One well-known study showed that 87% of the U.S. population could be personally identified using just their gender, date of birth, and five-digit ZIP code. Even though none of these attributes identifies an individual on its own, linking them together makes it possible to uniquely identify an individual.
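
To make the linkage concrete, the following minimal Python sketch (over a hypothetical toy dataset) counts how many records are unique on the quasi-identifier combination of gender, date of birth, and ZIP code; each unique combination corresponds to an individual who could be singled out.

    # Hypothetical toy records: how many are uniquely identified by the
    # quasi-identifier combination (gender, date of birth, ZIP code)?
    from collections import Counter

    records = [
        {"gender": "F", "dob": "1990-04-12", "zip": "10011", "gpa": 3.6},
        {"gender": "F", "dob": "1990-04-12", "zip": "10011", "gpa": 3.1},
        {"gender": "M", "dob": "1985-07-30", "zip": "10003", "gpa": 2.9},
        {"gender": "F", "dob": "1992-11-02", "zip": "11215", "gpa": 3.8},
    ]

    def quasi_id(r):
        # The combination of attributes an attacker could link with outside data.
        return (r["gender"], r["dob"], r["zip"])

    counts = Counter(quasi_id(r) for r in records)
    unique = [r for r in records if counts[quasi_id(r)] == 1]
    print(f"{len(unique)} of {len(records)} records are unique on (gender, dob, zip)")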

Approaches to anonymization

Broadly speaking, there are two different approaches to anonymization:

  • Randomization is a family of techniques that alters the veracity of the data in order to remove the strong link between the data and the individual. If the data are sufficiently uncertain then they can no longer be associated with a specific individual. Randomization by itself will not reduce the singularity of each record as each record will still be derived from a single data subject, but it may protect against inference attacks/risks and can be combined with generalization techniques to provide stronger privacy guarantees. Additional techniques may be required to ensure that a record cannot identify a single individual. Randomization techniques include:
    • Noise addition
    • Permutation
    • Differential privacy
  • Generalization is the process of generalizing, or diluting, the attributes of individuals by modifying the respective scale or order of magnitude (e.g., a region rather than a city, a month rather than a week). While generalization can be effective to prevent singling out, it does not allow effective anonymization in all cases; in particular, it requires specific and sophisticated quantitative approaches to prevent linkability and inference. Generalization techniques include:
    • Aggregation
    • K-anonymity
    • L-diversity
    • T-closeness

In most cases, a single method from the list above will not be sufficient to fully anonymize a dataset and will have to be combined with other methods.
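
As a rough illustration of how two of the techniques listed above might be combined, the following Python sketch applies noise addition to a hypothetical numeric attribute and generalization to a hypothetical location attribute. The noise scale and the city-to-region mapping are illustrative assumptions, not prescribed values.

    # Noise addition (randomization) plus generalization, on hypothetical attributes.
    import random

    CITY_TO_REGION = {"New York": "Northeast", "Boston": "Northeast", "Chicago": "Midwest"}

    def add_noise(age, scale=2.0):
        # Perturb the value with zero-mean Gaussian noise sized to the attribute's scale.
        return max(0, round(age + random.gauss(0, scale)))

    def generalize(city):
        # Replace a precise value (city) with a coarser one (region).
        return CITY_TO_REGION.get(city, "Other")

    record = {"age": 19, "city": "New York"}
    transformed = {"age": add_noise(record["age"]), "city": generalize(record["city"])}
    print(transformed)   # e.g. {'age': 21, 'city': 'Northeast'}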

Partial anonymization

Full anonymization is often difficult to attain and, for research purposes, often not desirable. In most cases the information can only be partially anonymized and therefore will still be subject to data protection laws and regulations. If information cannot be fully anonymized, it is still good practice to partially anonymize it, as this limits the ability to identify people.

Example

If people’s names are removed from a dataset about New School students but their N-numbers are left intact, the information has not been fully anonymized, as it is still possible to identify the people concerned. However, it will be more difficult for the people working with the dataset to identify them.
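ctio
A minimal sketch of this kind of partial anonymization, with hypothetical field names: the direct identifier (name) is dropped, but the N-number is retained, so the result is still Personal Data.

    # Remove the "name" field but keep "n_number"; the data remain re-identifiable.
    students = [
        {"name": "A. Example", "n_number": "N00012345", "program": "BA Psychology"},
        {"name": "B. Example", "n_number": "N00067890", "program": "BFA Design"},
    ]

    partially_anonymized = [
        {k: v for k, v in row.items() if k != "name"} for row in students
    ]
    print(partially_anonymized)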

Pseudonymization

Pseudonymization is a privacy-enhancing technique that renders data neither completely anonymous nor directly identifying. Direct identifiers in the dataset are replaced with artificial identifiers, or pseudonyms, so that linkage to the original identity of a Data Subject is no longer possible without knowledge of the mapping between the direct identifiers and the pseudonyms. This mapping is normally held separately and not shared with the people working with the dataset. Pseudonymized data can be re-identified by applying the mapping in reverse, so pseudonymization does not result in an anonymous dataset.

To pseudonymize any dataset, sufficient elements must be replaced with pseudonyms such that the dataset can no longer be used by the Data Controller or a third party to identify a Data Subject by using “all the means reasonably likely to be used.”
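
The following Python sketch illustrates the basic mechanism, with hypothetical field names: direct identifiers are replaced by random pseudonyms, and the mapping is returned separately so it can be stored apart from the dataset and withheld from the people working with it. Re-identification remains possible for whoever holds the mapping.

    # Replace a direct identifier with a random pseudonym; keep the mapping separately.
    import secrets

    def pseudonymize(records, id_field):
        mapping = {}          # original identifier -> pseudonym (held by the Data Controller)
        output = []
        for row in records:
            original = row[id_field]
            if original not in mapping:
                mapping[original] = secrets.token_hex(8)   # random pseudonym
            output.append({**row, id_field: mapping[original]})
        return output, mapping

    data = [{"n_number": "N00012345", "course": "PSYC 2001", "grade": "A"},
            {"n_number": "N00012345", "course": "MATH 1010", "grade": "B+"}]
    shareable, key_table = pseudonymize(data, "n_number")
    print(shareable)    # dataset with pseudonyms, given to the people working with it
    print(key_table)    # mapping, stored separately and not shared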

Preventing re-identification

When considering “all the means reasonably likely to be used,” it is important to consider the specific pseudonymization technique being used, the current state of technology, and the three risks identified previously:

  1. Singling out: It is still possible to single out individuals’ records, as each individual is still identified by the unique attribute resulting from the pseudonymization function (i.e., the pseudonymized attribute).
  2. Linkability: Linkability will still be trivial between records using the same pseudonymized attribute to refer to the same individual. Even if different pseudonymized attributes are used for the same Data Subject, linkability may still be possible by means of other attributes. Only if no other attribute in the dataset can be used to identify the Data Subject and if every link between the original attribute and the pseudonymized attribute has been eliminated (including by deletion of the original data), will there be no obvious cross-reference between two datasets using different pseudonymized attributes.
  3. Inference: Inference attacks on the real identity of a Data Subject are possible within the dataset or across different databases that use the same pseudonymized attribute for an individual, or if pseudonyms are self-explanatory and do not properly mask the original identity of the Data Subject.
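
A minimal sketch of the linkability risk just described, using hypothetical records: two separately released datasets that reuse the same pseudonym for a Data Subject can be joined trivially on that attribute.

    # Two releases that reuse the same pseudonym can be linked directly.
    enrollment = [{"pseudonym": "a3f9", "program": "BA Psychology", "year": 2019}]
    health_survey = [{"pseudonym": "a3f9", "reported_condition": "asthma"}]

    linked = [
        {**e, **h}
        for e in enrollment
        for h in health_survey
        if e["pseudonym"] == h["pseudonym"]
    ]
    print(linked)   # the two records now describe the same individual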

Pseudonymization, on its own, is not usually sufficient to anonymize a dataset. In many cases it can be as easy to identify a Data Subject in a pseudonymized dataset as in the original data. Extra steps should be taken before the dataset can be considered anonymized, including removing and generalizing attributes, or deleting the original data or at least bringing it to a highly aggregated level.

Approaches to pseudonymization

The most commonly used pseudonymization techniques are as follows:

  • Encryption with secret key: in this case, the holder of the key can trivially re-identify each Data Subject through decryption of the dataset because the Personal Data is still contained in the dataset, albeit in an encrypted form. Assuming that a state-of-the-art encryption scheme was applied, decryption can only be possible with the knowledge of the key.
  • Hash function: a function that returns a fixed-size output from an input of any size (the input may be a single attribute or a set of attributes) and cannot be reversed; this means that the reversal risk seen with encryption no longer exists. However, if the range of input values to the hash function is known, those values can be replayed through the hash function in order to derive the correct value for a particular record. For instance, if a dataset was pseudonymized by hashing the Social Security number, the original number can be derived simply by hashing all possible input values and comparing the results with the values in the dataset. Hash functions are usually designed to be relatively fast to compute and are subject to brute-force attacks. Pre-computed tables can also be created to allow for the bulk reversal of a large set of hash values. The use of a salted-hash function (where a random value, known as the “salt,” is added to the attribute being hashed) can reduce the likelihood of deriving the input value; nevertheless, calculating the original attribute value hidden behind the result of a salted-hash function may still be feasible with reasonable means (see the sketch after this list).
  • Keyed-hash function with stored key: a particular hash function that uses a secret key as an additional input (this differs from a salted hash function as the salt is commonly not secret). A Data Controller can replay the function on the attribute using the secret key, but it is much more difficult for an attacker to replay the function without knowing the key as the number of possibilities to be tested is sufficiently large as to be impractical.
  • Deterministic encryption or keyed-hash function with deletion of the key: this technique may be equated to selecting a random number as a pseudonym for each attribute in the database and then deleting the correspondence table. This solution allows diminishing the risk of linkability between the personal data in the dataset and those relating to the same individual in another dataset where a different pseudonym is used. Considering a state-of-the-art algorithm, it will be computationally difficult for an attacker to decrypt or replay the function, as it would imply testing every possible key, given that the key is not available.
  • Tokenization: this technique is typically applied in (even if it is not limited to) the financial sector to replace card ID numbers with values that have reduced usefulness for an attacker. It is derived from the previous techniques, typically being based on the application of one-way encryption mechanisms or on the assignment, through an index function, of a sequence number or a randomly generated number that is not mathematically derived from the original data.
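
The following Python sketch contrasts the plain-hash, salted-hash, and keyed-hash approaches described above, applied to a hypothetical N-number; the replay loop at the end shows why an unkeyed hash over a small, known input space can be reversed with reasonable means.

    # Plain hash vs. salted hash vs. keyed hash (HMAC) of a hypothetical N-number.
    import hashlib, hmac, secrets

    n_number = "N00012345"

    # 1. Plain hash: reversible by replaying every possible input value.
    plain = hashlib.sha256(n_number.encode()).hexdigest()

    # 2. Salted hash: a random salt hinders bulk precomputation, but if the salt is
    #    stored with the data, replay over a small input space is still feasible.
    salt = secrets.token_bytes(16)
    salted = hashlib.sha256(salt + n_number.encode()).hexdigest()

    # 3. Keyed hash (HMAC) with a secret key: replay requires knowledge of the key.
    key = secrets.token_bytes(32)
    keyed = hmac.new(key, n_number.encode(), hashlib.sha256).hexdigest()

    # Attacker replaying the plain hash over a (truncated, illustrative) input space:
    candidates = (f"N{str(i).zfill(8)}" for i in range(20000))
    recovered = next(c for c in candidates
                     if hashlib.sha256(c.encode()).hexdigest() == plain)
    print(recovered == n_number)   # True: the plain hash alone did not protect the value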

Strengths and weaknesses of different techniques

The table below provides an overview of the strengths and weaknesses of the techniques considered, in terms of the three basic requirements:

Method                          Singling out still a risk?   Linkability still a risk?   Inference still a risk?
Anonymization
  Noise addition                Yes                          Yes                         Maybe not
  Permutation                   Yes                          Maybe not                   Maybe not
  Differential privacy          Maybe not                    Maybe not                   Maybe not
  Aggregation and K-anonymity   No                           Yes                         Yes
  L-diversity and T-closeness   No                           Yes                         Maybe not
Pseudonymization                Yes                          Yes                         Yes

Good practices

To reduce Data Subject identification risks, the following good practices should be taken into account:

In general

  • Do not rely on the “release and forget” approach. Given the residual risk of identification, Data Controllers should:
    1. Identify new risks and re-evaluate the residual risk(s) regularly;
    2. Assess whether the controls for identified risks suffice and adjust accordingly; and
    3. Monitor and control the risks.
  • As part of such residual risks, take into account the identification potential of the non-anonymized portion of a dataset (if any), especially when combined with the anonymized portion, as well as possible correlations between attributes (e.g., between geographical location and wealth level data).

Contextual elements

  • The purposes to be achieved by using the anonymized dataset should be clearly defined as they play a key role in determining the identification risk.
  • This goes hand in hand with the consideration of all the relevant contextual elements—e.g., nature of the original data, control mechanisms in place (including security measures to restrict access to the datasets), sample size (quantitative features), availability of public information resources (to be relied upon by the recipients), envisaged release of data to third parties (limited, unlimited/public, etc.).
  • Consideration should be given to possible attackers by taking account of the appeal of the data for targeted attacks (again, sensitivity of the information and nature of the data will be key factors in this regard).

Technical elements

  • Data Controllers should disclose the anonymization or pseudonymization technique(s) being implemented, especially if they plan to release the anonymized dataset.
  • Obvious (e.g., rare) attributes / quasi-identifiers should be removed from the dataset.
  • If noise addition techniques are used (in randomization), the noise level added to the records should be determined as a function of the value of an attribute (that is, no out-of-scale noise should be injected), the impact for Data Subjects of the attributes to be protected, and/or the sparseness of the dataset.
  • When relying on differential privacy (in randomization), account should be taken of the need to keep track of queries so as to detect privacy-intrusive queries, as the intrusiveness of queries is cumulative.
  • If generalization techniques are implemented, it is fundamental for the Data Controller not to limit themselves to one generalization criterion even for the same attribute; that is to say, different location granularities or different time intervals should be selected. The selection of the criterion to be applied must be driven by the distribution of the attribute values in the given population. Not all distributions lend themselves to being generalized—i.e., no one-size-fits-all approach can be followed in generalization. Variability within equivalence classes should be ensured; for instance, a specific threshold should be selected depending on the “contextual elements” mentioned above (sample size, etc.) and if that threshold is not reached, then the specific sample should be discarded (or a different generalization criterion should be set).
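
A minimal sketch of the thresholding advice above (the field names and the threshold are illustrative assumptions): records are generalized, grouped into equivalence classes on the quasi-identifiers, and classes smaller than the threshold are withheld rather than released.

    # Generalize quasi-identifiers, then discard equivalence classes below a threshold.
    from collections import defaultdict

    K = 3   # minimum equivalence-class size (a k-anonymity-style threshold)

    def generalize(row):
        # Coarsen the quasi-identifiers: full ZIP -> 3-digit prefix, exact age -> decade.
        return {"zip3": row["zip"][:3], "age_band": f"{(row['age'] // 10) * 10}s"}

    records = [
        {"zip": "10011", "age": 21}, {"zip": "10014", "age": 24},
        {"zip": "10003", "age": 27}, {"zip": "10025", "age": 47},
    ]

    classes = defaultdict(list)
    for row in records:
        key = tuple(sorted(generalize(row).items()))
        classes[key].append(row)

    released = [rows for rows in classes.values() if len(rows) >= K]
    discarded = [rows for rows in classes.values() if len(rows) < K]
    print(f"released {sum(map(len, released))} records, discarded {sum(map(len, discarded))}")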

References

Document history

Date       Author     Description
Jul 2020   D. Curry   Initial publication

Parts of this guideline are adapted from the University of Edinburgh’s guidance on the anonymization of personal data, the contents of which are used with permission.

Parts of this guideline are adapted from Opinion 05/2014 on Anonymisation Techniques by the Article 29 Data Protection Working Party (the predecessor of the European Data Protection Board).