> ruby anon.rb
Running the code - programmatic
> irb --noecho
$ require 'tableautools'
$ anon = TableauTools::Anonymizer.new
$ anon.anonymize( 'input.csv', ['Field1','FieldX','Another Field'],'output.csv' )
What it does – details
The Anonymizer opens the input file and extracts the field names from the header record.
It then iterates through the input file's CSV records.
For each record, the contents of the named field(s) are replaced with two to four words chosen at random from a lorem ipsum sample generated here.
Each distinct value in the named field(s) is always replaced with the same lorem ipsum content. This preserves the structure of the data, in that the distribution of records across the named fields' values is unchanged.
The Anonymizer does not generate content for null input fields: a field with no value in the original data has no value in the anonymized data.
The anonymized record is output to the named CSV file.
The Anonymizer keeps track of the anonymized fields in the log file 'Anonymized.csv', as CSV records identifying the input file, output file, record number, field name, original value, and anonymized value. This makes it easy to check the anonymization if desired.
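The steps above can be sketched in a few lines of Ruby. This is an illustration only, not the actual TableauTools::Anonymizer source: the method name sketch_anonymize, the WORDS list, and the choice to key replacements on the field name plus the original value are all assumptions made for the example.

```ruby
require 'csv'

# Hypothetical stand-in for the lorem ipsum sample; the real word list is not shown in this post.
WORDS = %w[lorem ipsum dolor sit amet consectetur adipiscing elit sed do
           eiusmod tempor incididunt ut labore et dolore magna aliqua].freeze

# Illustrative sketch only (assumed name and signature, not the gem's API).
# Takes CSV text and the field names to anonymize; returns [output_csv_text, log_rows].
def sketch_anonymize(input_text, fields, rng: Random.new)
  # One replacement per distinct (field, original value) pair, created on first use.
  replacements = Hash.new do |hash, key|
    hash[key] = Array.new(rng.rand(2..4)) { WORDS.sample(random: rng) }.join(' ')
  end
  log = []
  table = CSV.parse(input_text, headers: true)

  output = CSV.generate do |csv|
    csv << table.headers
    table.each do |row|
      fields.each do |field|
        original = row[field]
        next if original.nil? || original.empty? # null fields stay null
        row[field] = replacements[[field, original]]
        log << [field, original, row[field]]
      end
      csv << row.fields
    end
  end

  [output, log]
end
```

Calling sketch_anonymize(File.read('input.csv'), ['Name', 'ID for something']) would return the anonymized CSV text together with one log entry per replaced field. The sketch does not guard against two distinct original values happening to draw the same random phrase, which a real tool would presumably want to avoid.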
Example
A CSV file -- 'input.csv' -- to be anonymized:
--
Name,value,ID for something
abc,1,maybe me
scoundrel,11
rascal,23,not me
,117,I'm OK
rascal,10023,another
varmint,23,not me
--
notes:
- the 2nd record has no value for 'ID for something'
- the 4th record has no value for 'Name'
In use: a terminal session to Anonymize input.csv
OS commands and responses in black are there to show the CSV files involved
Ruby commands and responses are in blue
> ls *put.csv
input.csv
> wc *put.csv
7 13 131 input.csv
> cat input.csv
Name,value,ID for something
abc,1,maybe me
scoundrel,11
rascal,23,not me
,117,I'm OK
rascal,10023,another
varmint,23,not me
> irb
irb(main):001:0> require 'tableautools'
=> true
irb(main):003:0> anon = TableauTools::Anonymizer.new
=> {"input file"=>nil, "output file"=>nil, "fields"=>nil}
irb(main):003:0> anon.anonymize( 'input.csv', ['Name','ID for something'], 'output.csv')
Anonymizing input.csv to output.csv - ["Name", "ID for something"]
=> "10 fields in 6 rows anonymized"
irb(main):004:0> exit
> ls *put.csv
input.csv output.csv
> wc *put.csv
7 13 131 input.csv
7 29 241 output.csv
14 42 372 total
> cat output.csv
Name,value,ID for something
laoreet dolor,1,orci dui
amet facilisis risus malesuada,11,
Ut id viverra,23,et urna bibendum
,117,nisl Nam sagittis nisl
Ut id viverra,10023,luctus nisi lacus laoreet
dignissim euismod,23,et urna bibendum
> ls Anonymized.csv
Anonymized.csv
> wc Anonymized.csv
11 54 682 Anonymized.csv
> cat Anonymized.csv
File - original,File - anon,Record #,Field,Value - original,Value - anon
input.csv,output.csv,1,Name,abc,laoreet dolor
input.csv,output.csv,2,ID for something,maybe me,orci dui
input.csv,output.csv,3,Name,scoundrel,amet facilisis risus malesuada
input.csv,output.csv,4,Name,rascal,Ut id viverra
input.csv,output.csv,5,ID for something,not me,et urna bibendum
input.csv,output.csv,6,ID for something,I'm OK,nisl Nam sagittis nisl
input.csv,output.csv,7,Name,rascal,Ut id viverra
input.csv,output.csv,8,ID for something,another,luctus nisi lacus laoreet
input.csv,output.csv,9,Name,varmint,dignissim euismod
input.csv,output.csv,10,ID for something,not me,et urna bibendum
Here's a Tableau dashboard showing the original and anonymized files. It shows that the fields have been anonymized, and that the record distribution has been preserved.
Here's a Tableau dashboard showing the contents of the Anonymized.csv file:
We can clearly see that the anonymous values are consistent with the original field values, and which records in the CSV files the values occur in.
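The same consistency check can be run without Tableau, straight from the log file. The following is a minimal sketch that assumes the log holds a single anonymization run and uses the column names shown in the Anonymized.csv listing; the method name inconsistent_mappings is made up for this example.

```ruby
require 'csv'

# Hypothetical helper: returns the (field, original value) pairs that map to
# more than one anonymized value. An empty hash means the mapping is consistent.
def inconsistent_mappings(log_text)
  seen = Hash.new { |hash, key| hash[key] = [] }
  CSV.parse(log_text, headers: true).each do |row|
    key  = [row['Field'], row['Value - original']]
    anon = row['Value - anon']
    seen[key] << anon unless seen[key].include?(anon)
  end
  seen.select { |_key, anons| anons.size > 1 }
end
```

For the example above, inconsistent_mappings(File.read('Anonymized.csv')) should come back empty, since 'rascal' and 'not me' each map to a single anonymized value.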
Note: the anonymizing process starts afresh on every invocation, so each anonymized file gets its own replacement values. The main consequence is that anonymizing the same input file more than once yields different anonymous values for the same original field values each time. In the example above, if input.csv were anonymized again, the values shown as 'Value - anon' would differ from those shown.

