Data anonymisation and manipulation

Within business analytics, data anonymisation and manipulation are critical to the provision of operational intelligence.

Data Anonymisation

The Information Commissioner’s Office (ICO) view on data anonymisation is:

Anonymisation is the process of turning data into a form which does not identify individuals and where identification is not likely to take place. This allows for a much wider use of the information. The Data Protection Act controls how organisations use ‘personal data’ – that is, information which allows individuals to be identified.

Organisations are increasingly reliant on anonymisation techniques to enable wider use of personal data. The code of practice explains the issues surrounding the anonymisation of personal data, and the disclosure of data once it has been anonymised. The code describes the steps an organisation can take to ensure that anonymisation is conducted effectively, while retaining useful data.

The code of practice can be found at: https://ico.org.uk/for-organisations/guide-to-data-protection/anonymisation/?q=

Some organisations, such as those in health, policing and education, release data sets into the public domain and to research organisations, both for information purposes and to facilitate structured research programmes. It is essential that such data cannot be reconstructed to identify any individual. There are various methods and techniques for accomplishing anonymisation, and some are more effective than others; cases in which identities have been successfully reconstructed have attracted a great deal of publicity.

As part of our research and development within the EU-funded VALCRI project, a UK Police Force has provided us with data sets to anonymise so that they can be used within the project and then released to the research community. They include:

  • Crime reports – 3 years
  • Person records – 3 years
  • Incident reports – 3 years
  • Custody records – 1 year
  • Stop & Search records – 1 year
  • ANPR records – months

We have devised a 2-stage method of anonymisation:

  1. Each individual data set is anonymised and returned to the Force, which verifies that the set cannot be reconstructed. Once verification is received, the set is released into the project.
  2. A further level of anonymisation will be applied before the data sets are combined with one another and with open source data in an attempt to reconstruct the original data. This will be undertaken by at least 3 teams of people unconnected with the anonymisation process.
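
To make the reconstruction risk in the second stage more concrete, the sketch below counts how many records share the same combination of indirectly identifying fields. A record that is unique on those fields is the kind most exposed to being matched against open source data. This is an illustration only, not the method used in the project: the field names (age_band, district, offence), the sample records and both helper functions are hypothetical.

```python
from collections import Counter

def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)

def records_at_risk(records, quasi_identifiers, k=2):
    """Return records whose quasi-identifier combination occurs fewer than k times."""
    sizes = group_sizes(records, quasi_identifiers)
    return [r for r in records
            if sizes[tuple(r[q] for q in quasi_identifiers)] < k]

# Hypothetical, already generalised records (no names, banded ages).
records = [
    {"age_band": "20-29", "district": "North", "offence": "theft"},
    {"age_band": "20-29", "district": "North", "offence": "burglary"},
    {"age_band": "30-39", "district": "South", "offence": "theft"},
]

quasi_identifiers = ("age_band", "district")
smallest_group = min(group_sizes(records, quasi_identifiers).values())
at_risk = records_at_risk(records, quasi_identifiers, k=2)
```

Here the single record for the 30-39 age band in the South district is unique on the chosen fields, so it would be a candidate for further generalisation or suppression before release.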

We have produced a White Paper on our data anonymisation work (VALCRI-WP-2017-007 Research Data) and will publish the results of the second stage when it is complete.

Data Manipulation

In an analytical process that follows a structured methodology such as CRISP-DM (http://www.a-esolutions.com/authority-miner/), the greatest amount of time is consumed in manipulating data.

The ability to “play with the data” is a critical capability in business intelligence. Data manipulation is the process of re-sorting, rearranging and otherwise moving your research data without fundamentally changing it. It is used either as a preparatory technique – i.e. as a precursor to some other activity – or as an analytic tool in its own right, as a means of exploring the data. Manipulating the data helps us to identify patterns that may not otherwise be apparent. In fact, it is almost certain that most patterns won’t be visible at first glance.

The aims of manipulating data include:

  • Searching for patterns/trends
  • Making the data easier to understand
  • Organising the data better
  • Combining data items to add further information, data or metadata
  • Identifying outliers
  • Dealing with missing and/or incorrect data
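
Several of these aims can be shown on a small example. The sketch below uses only the Python standard library; the incident records, the field names and the outlier rule (more than three times the median) are assumptions made for illustration, not part of any particular methodology. Each step builds a new structure, so the original records are left unchanged.

```python
from statistics import median

# Hypothetical incident records; None marks a missing value.
incidents = [
    {"id": 1, "district": "North", "response_mins": 12},
    {"id": 2, "district": "South", "response_mins": None},
    {"id": 3, "district": "North", "response_mins": 15},
    {"id": 4, "district": "South", "response_mins": 14},
    {"id": 5, "district": "North", "response_mins": 95},
]

# Deal with missing data: fill gaps with the median of the known values.
known = [r["response_mins"] for r in incidents if r["response_mins"] is not None]
fill = median(known)
cleaned = [
    {**r, "response_mins": fill if r["response_mins"] is None else r["response_mins"]}
    for r in incidents
]

# Re-sort: slowest responses first, to make the data easier to scan.
by_time = sorted(cleaned, key=lambda r: r["response_mins"], reverse=True)

# Combine data items: group by district and derive a mean per group.
grouped = {}
for r in cleaned:
    grouped.setdefault(r["district"], []).append(r["response_mins"])
mean_by_district = {d: sum(v) / len(v) for d, v in grouped.items()}

# Identify outliers: here, anything above three times the median (an assumed rule).
threshold = 3 * median(r["response_mins"] for r in cleaned)
outliers = [r["id"] for r in cleaned if r["response_mins"] > threshold]
```

In this sample the record with id 5 stands out both at the top of the sorted list and in the outlier check, which is the kind of pattern that is hard to spot in the unsorted records.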