Data Anonymisation and disclosure risk measurement

KoboloadeR integrate the SdcMicro package for disclosure risk measurement.

The basis behind anonymisation

Anonymisation

This method should be used whenever Kobo or ODK forms are used as data collection tools and personal data is being collected. Even when personal data is not being collected it still may be appropriate to apply the methodology since quasi-identifiable data or other sensitive data could lead to personal identification or should not be shared.

Type	Description
Direct identifiers	Can be directly used to identify an individual. E.g. Name, Address, Date of birth, Telephone number, GPS location
Quasi- identifiers	Can be used to identify individuals when it is joined with other information. E.g. Age, Salary, Next of kin, School name, Place of work
Sensitive information	& Community identifiable information Might not identify an individual but could put an individual or group at risk. E.g. Gender, Ethnicity, Religious belief

The following are different anonymisation actions that can be performed on sensitive fields. The type of anonymisation should be dictated by the desired use of the data. A good approach to follow is to start from the minimum data required, and then to identify if any of those fields should be obscured.

The methods below can be referenced in the dedicated column within xlsform (cf above)

How to reference the anonymisation plan in the xlsfrom.

The anonymise column is used to reference the anonymisation plan. it can take the following values.

Method	Description
remove	Variable is removed entirely from the data set. The Variable is preserved in the original file.
reference	Variable is removed entirely from the data set and is copied into a reference file. A random unique identifier field is added to the reference file and the data set so that they can be joined together in future. The reference file is never shared and the Variable is also preserved in the original file.
key	Variable to be consisdered for k-anonymity & individual disclosure risk analysis (i.e. quasi-identifiers)
outlier	Variable is removed entirely
sensitve	Variable is removed entirely

Potential Good practices

Data about who, where and how the data is collected is often stored separately to the main data and can be used identify individuals (i.e. metadata will be removed)

Text variables likely to be removed as well…

The Variable values are replaced with meaningless values but the categories are preserved. A reference file is created to link the original value with the meaningless value. Typically applied to categorical Variable . For example, Town names could be masked with random combinations of letters. It would still be possible to perform statisitical analysis on the Variable but the person running the analysis would not be able to identify the original values, they would only become meaningful when replaced with the original values. The reference file is never shared and the data is also preserved in the original file.

Generalise

Continuous Variable is turned into categorical or ordinal Variable by summarising it into ranges. For example, Age could be turned into age ranges, Weight could be turned into ranges. It can also apply to categorical Variable where parent groups are created. For example, illness is grouped into illness type. Generalised Variable can also be masked for extra anonymisation. The Variable is preserved in the original file.

Edouard Legoupil

2021-02-04

Anonymisation

How to reference the anonymisation plan in the xlsfrom.

Potential Good practices