What emphasis should be placed on publishing de-identified data?


Personal data should not be published unless it is legal to do so.

Personal data can be published if individuals cannot be identified in the data. People’s identities can be hidden by various techniques, for example:

  • removing columns such as name and address
  • making data less precise (changing an age of 13 to 10-19 years old)
  • removing outlying data
  • and many more sophisticated methods.
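As a rough illustration of the first three techniques in the list, here is a minimal Python sketch. The field names (`name`, `address`, `age`, `suburb`), the sample records and the `max_age` cut-off are all made up for the example; a real dataset would need its own rules, and these steps alone do not guarantee that nobody can be identified.

```python
def age_band(age, width=10):
    """Generalise an exact age into a band, e.g. 13 -> '10-19'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def deidentify(records, drop=("name", "address"), max_age=89):
    """Apply three simple steps to a list of dict records.

    1. Skip outlying records (here: ages above a hypothetical max_age).
    2. Remove directly identifying columns.
    3. Replace the exact age with a 10-year band.
    """
    result = []
    for record in records:
        if record["age"] > max_age:
            continue  # outlier: very old people are rare, so easy to pick out
        cleaned = {k: v for k, v in record.items() if k not in drop}
        cleaned["age"] = age_band(record["age"])
        result.append(cleaned)
    return result


# Hypothetical sample data, invented for this sketch.
records = [
    {"name": "Alex", "address": "1 Example St", "age": 13, "suburb": "North"},
    {"name": "Sam", "address": "2 Example St", "age": 34, "suburb": "South"},
    {"name": "Kim", "address": "3 Example St", "age": 104, "suburb": "South"},
]

published = deidentify(records)
print(published)
```

With this sample input the 104-year-old is dropped entirely, and the remaining two records keep only a suburb and an age band.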

De-identification takes effort; however, the data can be extremely valuable to researchers, policy makers and others.

What effort should publishers put into de-identifying data versus releasing other non-sensitive data or making other improvements?


This is a tough area: many techniques for de-identifying data aren’t perfect for every dataset, and the ability to cross-reference datasets can, in some cases, defeat at least part of the de-identification process.
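To make the cross-referencing risk concrete, here is a small Python sketch of a linkage check. Everything in it is hypothetical: the `reidentify` function, the field names and both datasets are invented for illustration. It assumes an attacker has a second, public dataset that shares a few fields with the released one.

```python
def reidentify(released, public, keys):
    """Return (released_row, name) pairs for rows that match exactly one
    person in the public dataset on the shared key fields."""
    # Index the public dataset by the combination of shared fields.
    index = {}
    for person in public:
        index.setdefault(tuple(person[k] for k in keys), []).append(person)

    matches = []
    for row in released:
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match points to one individual
            matches.append((row, candidates[0]["name"]))
    return matches


# Hypothetical "de-identified" release: no names, ages already banded.
released = [
    {"age_band": "10-19", "suburb": "North", "condition": "asthma"},
    {"age_band": "30-39", "suburb": "South", "condition": "diabetes"},
]

# Hypothetical public dataset (e.g. a membership list) with names.
public = [
    {"name": "Alex", "age_band": "10-19", "suburb": "North"},
    {"name": "Sam", "age_band": "30-39", "suburb": "South"},
    {"name": "Kim", "age_band": "30-39", "suburb": "South"},
]

matches = reidentify(released, public, keys=("age_band", "suburb"))
for row, name in matches:
    print(name, "->", row["condition"])
```

In this made-up data the first released row matches only one person, so it is effectively re-identified, while the second row is shared by two people and stays ambiguous. Whether a real release is exposed this way depends on what other datasets exist, which is part of why it is hard to judge in general.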

It really needs expert consideration on a case-by-case basis.

I’d suggest that agencies work with the people interested in this data (researchers, policy makers, etc.) to scope the de-identification requirements in each case. Leverage the group that wants to use the data - they have the most interest, and often the most experience, in figuring out how to effectively de-identify the datasets they want.


I find it amusing that, to test your de-identification, you may have to try to re-identify the data, which may itself become illegal.