Anonymize Your Data

Ever needed to anonymize some data, altering identities while preserving its demographics? Me too.

After yet another tiresome session manually anonymizing some data, shlepping data about, building in some reasonable obfuscation instead of the naive case of just dumping random characters into spreadsheet cells, I'd had enough.

So I built a Ruby tool to do it, and incorporated it into Tableau Tools, available via the tableautools Ruby gem on RubyGems.org.

Anonymizer

Anonymizer is a simple tool that scans a CSV file, anonymizes the values in named fields, and saves the changed data to another CSV file.

It ensures that the fields' original values map consistently to anonymous values so that the distribution of the records among the values remains the same. Analyses on the anonymized data will have the same record-volume structure as with the original data. Sorting will be different.
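One way to check that claim for yourself is to compare how the rows group together before and after. This is a minimal sketch, not part of the gem: the method names `row_groups` and `same_distribution?` are made up for illustration, and the two columns are small illustrative samples.

```ruby
# Group row indexes by the value found in each row, e.g.
# ["a", "b", "a"] -> [[0, 2], [1]]. The values themselves are discarded.
def row_groups(values)
  values.each_with_index
        .group_by { |value, _index| value }
        .values
        .map { |pairs| pairs.map { |_value, index| index } }
        .sort
end

# True when both columns split the rows into exactly the same groups,
# i.e. the anonymized column has the same record distribution.
def same_distribution?(original, anonymized)
  row_groups(original) == row_groups(anonymized)
end

original   = ['rascal', 'varmint', 'rascal', nil, 'abc']
anonymized = ['Ut id viverra', 'dignissim euismod', 'Ut id viverra', nil, 'laoreet dolor']

puts same_distribution?(original, anonymized) # prints true
```

Comparing groups of row indexes rather than plain counts also catches the case where two different original values end up with the same replacement.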

Anonymizer also creates a log file of the substitutions it makes, recording the changes as CSV data records, each of which contains the record's #, the field name, the field's original value, and the substituted value. This file makes it simple to see how Anonymizer has changed the data.

How to use it

Preparation

  • Ruby is installed – Anonymizer was built with version 2.4.2
  • the gem is installed – it's simple, just issue this command (from the command line):
    > gem install tableautools
  • the CSV file you want to anonymize:
    • is available – it's best but not required that the file is in the local/working directory
    • has a header record containing the field names – Anonymizer needs this to know which fields to anonymize
    • contains comma-separated values – some files actually contain tab-separated values, even though they have '.csv' as a suffix
  • you know which field or fields you want anonymized
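The file checks in that list can be scripted before running Anonymizer. The following is only a sketch using Ruby's standard CSV library; `preflight_problems` is a hypothetical helper of my own naming, not something the gem provides, and the tab-versus-comma test is a rough heuristic on the header line.

```ruby
require 'csv'

# Returns a list of problems found in the header line of CSV text;
# an empty list means the file looks ready to anonymize.
def preflight_problems(csv_text, fields)
  header_line = csv_text.lines.first.to_s.chomp
  problems = []
  if header_line.count("\t") > header_line.count(',')
    problems << 'header looks tab-separated, not comma-separated'
  end
  header  = CSV.parse_line(header_line) || []
  missing = fields - header
  problems << "fields not in header: #{missing.join(', ')}" unless missing.empty?
  problems
end

sample = "Name,value,ID for something\nabc,1,maybe me\n"
p preflight_problems(sample, ['Name', 'ID for something'])
p preflight_problems(sample, ['Name', 'Age'])
```

For a real file, pass in `File.read('input.csv')` along with the fields you intend to anonymize.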

Making it happen

Anonymizer is simple to use; it takes only a few commands, e.g.:


> require 'tableautools'
> anon = TableauTools::Anonymizer.new
> anon.anonymize( 'input.csv', ['Field1','FieldX','Another Field'], 'output.csv' )

The statements:

> require 'tableautools'
load the gem so its classes are available

> anon = TableauTools::Anonymizer.new
get an Anonymizer, call it 'anon'
- note the case of TableauTools: it's the proper name of the module, whereas gem names are lower case by convention

> anon.anonymize( 'input.csv', ['Field1','FieldX','Another Field'], 'output.csv' )
tell anon to:
- read the data in 'input.csv' (or other CSV file),
- anonymize the named fields, and
- save the anonymized data into 'output.csv' (or other CSV file)

Running the code - manual

There are two main ways to manually run the Ruby code.
 
From a file named, say, 'anon.rb', on the command line like so:
> ruby anon.rb

In irb, the interactive Ruby environment, with the sequence:
> irb --noecho
$ require 'tableautools'
$ anon = TableauTools::Anonymizer.new
$ anon.anonymize( 'input.csv', ['Field1','FieldX','Another Field'],'output.csv' )

Running the code - programmatic

As Anonymizer is a standard Ruby class, it can be used anywhere Ruby can be used, as long as the tableautools gem has been installed.

This makes it simple to integrate it into a wide variety of scripting and data processing tools and environments.
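As an illustration of that kind of integration, here is a small sketch that runs several files through one Anonymizer. The helper `anonymize_all` and the 'anon_' output prefix are my own inventions for the example; the only thing taken from the gem is the anonymize( input, fields, output ) call shown earlier. It also assumes one Anonymizer can be reused across files; if that turns out not to hold, create a new one per file.

```ruby
# Anonymize several CSV files with one Anonymizer-like object.
# `anon` is anything that responds to anonymize(input, fields, output);
# the "anon_" output prefix is a made-up naming choice.
def anonymize_all(anon, inputs, fields)
  inputs.map do |input|
    output = File.join(File.dirname(input), "anon_#{File.basename(input)}")
    anon.anonymize(input, fields, output)
    output
  end
end

# With the gem installed, usage would look like:
#   require 'tableautools'
#   anon = TableauTools::Anonymizer.new
#   anonymize_all(anon, Dir.glob('data/*.csv'), ['Name'])
```

Passing the Anonymizer in as an argument keeps the helper easy to try out with a stand-in object before pointing it at real data.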

What it does – details

The Anonymizer opens the input file and extracts the field names from the header record.

It then iterates through the input file's CSV records.

lorem ipsum

For each record, the contents of the named field(s) are replaced with two to four words chosen at random from a lorem ipsum sample.

Each distinct value in the named field(s) is replaced with the same lorem ipsum content. This preserves the structure of the data insofar as the distribution of records for the named fields' values goes.

Anonymizer will not provide content for a null input field - fields without values in the original data will have no value in the anonymized data.

The anonymized record is output to the named CSV file.

Anonymizer keeps track of the anonymized fields in the log file 'Anonymized.csv', as CSV records identifying the input file, output file, field name, original value, and anonymized value. This makes it easy to check the anonymization if desired.
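To make the steps above concrete, here is a rough sketch of the same idea using Ruby's standard CSV library. It is not the gem's source code: the word list, the method names, the per-field mapping, the guard against reusing a phrase, and the shape of the log rows are all assumptions made for the illustration. It works on CSV text rather than files to keep it short.

```ruby
require 'csv'

# Stand-in word list; the gem draws from its own lorem ipsum sample.
WORDS = %w[lorem ipsum dolor sit amet laoreet orci urna nisl lacus].freeze

# Build a phrase of two to four random words that is not already taken.
def fresh_phrase(rng, taken)
  loop do
    phrase = Array.new(rng.rand(2..4)) { WORDS[rng.rand(WORDS.size)] }.join(' ')
    return phrase unless taken.include?(phrase)
  end
end

# Anonymize the named fields of CSV text (header row required).
# Returns [anonymized_csv_text, log_rows].
def anonymize_csv(csv_text, fields, rng = Random.new)
  table = CSV.parse(csv_text, headers: true)
  # field => { original value => anonymous value }
  replacements = Hash.new { |hash, key| hash[key] = {} }
  log = []

  table.each do |row|
    fields.each do |field|
      original = row[field]
      next if original.nil? || original.empty? # empty fields stay empty
      known = replacements[field]
      anon = (known[original] ||= fresh_phrase(rng, known.values))
      row[field] = anon
      log << [log.size + 1, field, original, anon] # running entry number first
    end
  end
  [table.to_csv, log]
end

text = "Name,value\nrascal,23\n,117\nrascal,10023\n"
anonymized, log = anonymize_csv(text, ['Name'])
puts anonymized
log.each { |entry| p entry }
```

The hash is what keeps the mapping consistent: the first time a value is seen it gets a new phrase, and every later occurrence reuses it. Because the phrases come from an unseeded random generator, each run produces different replacements.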

Example

A CSV file -- 'input.csv' -- to be anonymized:

--
Name,value,ID for something
abc,1,maybe me
scoundrel,11
rascal,23,not me
,117,I'm OK
rascal,10023,another
varmint,23,not me
--
notes:
- the 2nd record has no value for 'ID for something'
- the 4th record has no value for 'Name'

In use: a terminal session to anonymize input.csv

OS commands and responses (at the '>' prompt) are there to show the CSV files involved
Ruby commands and responses appear at the irb prompt

> ls *put.csv
input.csv

> wc *put.csv
 7 13 131 input.csv

> cat input.csv
Name,value,ID for something
abc,1,maybe me
scoundrel,11
rascal,23,not me
,117,I'm OK
rascal,10023,another
varmint,23,not me

> irb
irb(main):001:0> require 'tableautools'
=> true

irb(main):002:0> anon = TableauTools::Anonymizer.new
=> {"input file"=>nil, "output file"=>nil, "fields"=>nil}

irb(main):003:0> anon.anonymize( 'input.csv', ['Name','ID for something'], 'output.csv')
Anonymizing input.csv to output.csv - ["Name", "ID for something"]
=> "10 fields in 6 rows anonymized"

irb(main):004:0> exit

> ls *put.csv
input.csv output.csv

> wc *put.csv
7 13 131 input.csv
7 29 241 output.csv
14 42 372 total

> cat output.csv
Name,value,ID for something
laoreet dolor,1,orci dui
amet facilisis risus malesuada,11,
Ut id viverra,23,et urna bibendum
,117,nisl Nam sagittis nisl
Ut id viverra,10023,luctus nisi lacus laoreet
dignissim euismod,23,et urna bibendum

> ls Anonymized.csv
Anonymized.csv

> wc Anonymized.csv
11 54 682 Anonymized.csv

> cat Anonymized.csv
File - original,File - anon,Record #,Field,Value - original,Value - anon
input.csv,output.csv,1,Name,abc,laoreet dolor
input.csv,output.csv,2,ID for something,maybe me,orci dui
input.csv,output.csv,3,Name,scoundrel,amet facilisis risus malesuada
input.csv,output.csv,4,Name,rascal,Ut id viverra
input.csv,output.csv,5,ID for something,not me,et urna bibendum
input.csv,output.csv,6,ID for something,I'm OK,nisl Nam sagittis nisl
input.csv,output.csv,7,Name,rascal,Ut id viverra
input.csv,output.csv,8,ID for something,another,luctus nisi lacus laoreet
input.csv,output.csv,9,Name,varmint,dignissim euismod
input.csv,output.csv,10,ID for something,not me,et urna bibendum

Here's a Tableau dashboard showing the original and anonymized files. It shows that the fields have been anonymized, and that the record distribution has been preserved.

Here's a Tableau dashboard showing the contents of the Anonymized.csv file:

We can clearly see that the anonymous values are consistent with the original field values, and which records in the CSV files the values occur in.
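If you'd rather check that consistency without a dashboard, a few lines of Ruby over the log will do. This is a sketch of my own, not a feature of the gem: `inconsistent_mappings` is a made-up helper, and the log text is copied from the Anonymized.csv listing above.

```ruby
require 'csv'

# Log contents copied from the Anonymized.csv example above.
LOG_TEXT = <<~'LOG'
  File - original,File - anon,Record #,Field,Value - original,Value - anon
  input.csv,output.csv,1,Name,abc,laoreet dolor
  input.csv,output.csv,2,ID for something,maybe me,orci dui
  input.csv,output.csv,3,Name,scoundrel,amet facilisis risus malesuada
  input.csv,output.csv,4,Name,rascal,Ut id viverra
  input.csv,output.csv,5,ID for something,not me,et urna bibendum
  input.csv,output.csv,6,ID for something,I'm OK,nisl Nam sagittis nisl
  input.csv,output.csv,7,Name,rascal,Ut id viverra
  input.csv,output.csv,8,ID for something,another,luctus nisi lacus laoreet
  input.csv,output.csv,9,Name,varmint,dignissim euismod
  input.csv,output.csv,10,ID for something,not me,et urna bibendum
LOG

# Return the [field, original value] pairs that were given more than
# one anonymous value; an empty list means the mapping is consistent.
def inconsistent_mappings(log_text)
  seen = Hash.new { |hash, key| hash[key] = [] }
  CSV.parse(log_text, headers: true).each do |row|
    seen[[row['Field'], row['Value - original']]] << row['Value - anon']
  end
  seen.select { |_key, anon_values| anon_values.uniq.size > 1 }.keys
end

p inconsistent_mappings(LOG_TEXT) # prints []
```

To run it against a real log, replace LOG_TEXT with `File.read('Anonymized.csv')`.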

Note: the anonymizing process restarts for every invocation, therefore each anonymized file will have its own replacement field values. The main consequence of this is that if the same input file is anonymized more than once, each time different anonymous values will be provided for the same original field values. For the example above, if input.csv was anonymized again, the values shown as 'Value - anon' would be different than shown.

 

 
