The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.
One way to reduce identifiability of a data set is by converting a categorical variable to have a more aggregated taxonomy (i.e. a many-to-one mapping). Here we refer to such a method as a ‘blur’ as it causes features to be joined together in such a way to hide the underlying information.
As an example, consider the ShiftsWorked
data:
library(deident)
head(ShiftsWorked)
#> # A tibble: 6 × 7
#> `Record ID` Employee Date Shift `Shift Start` `Shift End` `Daily Pay`
#> <int> <chr> <date> <chr> <chr> <chr> <dbl>
#> 1 1 Maria Cook 2015-01-01 Night 17:01 00:01 78.1
#> 2 2 Stephen Cox 2015-01-01 Day 08:01 16:01 155.
#> 3 3 Kimberly O… 2015-01-01 Day 08:01 16:01 77.8
#> 4 4 Nathan Alv… 2015-01-01 Day 08:01 15:01 203.
#> 5 5 Samuel Par… 2015-01-01 Night 16:01 23:01 211.
#> 6 6 Scott Morr… 2015-01-01 Night 17:01 00:01 142.
A simple ‘blur’ might be to change the taxonomy of ‘Shift’ e.g. combine ‘Day’ and ‘Night’ into a new group ‘Working’ and ignore the ‘Rest’ shifts. To do this we define the values we wish to change as a vector, build a pipeline and apply it to the data:
shift_blur <- c("Day" = "Working", "Night" = "Working")
blur_pipe <- ShiftsWorked |>
add_blur(Shift, blur=shift_blur)
apply_deident(ShiftsWorked, blur_pipe)
#> # A tibble: 3,100 × 7
#> `Record ID` Employee Date Shift `Shift Start` `Shift End` `Daily Pay`
#> <int> <chr> <date> <chr> <chr> <chr> <dbl>
#> 1 1 Maria Cook 2015-01-01 Work… 17:01 00:01 78.1
#> 2 2 Stephen C… 2015-01-01 Work… 08:01 16:01 155.
#> 3 3 Kimberly … 2015-01-01 Work… 08:01 16:01 77.8
#> 4 4 Nathan Al… 2015-01-01 Work… 08:01 15:01 203.
#> 5 5 Samuel Pa… 2015-01-01 Work… 16:01 23:01 211.
#> 6 6 Scott Mor… 2015-01-01 Work… 17:01 00:01 142.
#> 7 7 Nathan Sa… 2015-01-01 Rest <NA> <NA> 0
#> 8 8 Jose Lopez 2015-01-01 Work… 17:01 00:01 213.
#> 9 9 Donna Bro… 2015-01-01 Work… 16:01 00:01 219.
#> 10 10 George Ki… 2015-01-01 Work… 16:01 00:01 242.
#> # ℹ 3,090 more rows
category_blur
utilityApplying the blur is relatively simple, but constructing it can be
complex. Consider the starwars
data set supplied by
dplyr:
starwars <- dplyr::starwars
head(starwars)
#> # A tibble: 6 × 14
#> name height mass hair_color skin_color eye_color birth_year sex gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 Luke Sky… 172 77 blond fair blue 19 male mascu…
#> 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
#> 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
#> 4 Darth Va… 202 136 none white yellow 41.9 male mascu…
#> 5 Leia Org… 150 49 brown light brown 19 fema… femin…
#> 6 Owen Lars 178 120 brown, gr… light blue 52 male mascu…
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
And notably the species
variable:
table(starwars$species)
#>
#> Aleena Besalisk Cerean Chagrian Clawdite
#> 1 1 1 1 1
#> Droid Dug Ewok Geonosian Gungan
#> 6 1 1 1 3
#> Human Hutt Iktotchi Kaleesh Kaminoan
#> 35 1 1 1 2
#> Kel Dor Mirialan Mon Calamari Muun Nautolan
#> 1 2 1 1 1
#> Neimodian Pau'an Quermian Rodian Skakoan
#> 1 1 1 1 1
#> Sullustan Tholothian Togruta Toong Toydarian
#> 1 1 1 1 1
#> Trandoshan Twi'lek Vulptereen Wookiee Xexto
#> 1 2 1 2 1
#> Yoda's species Zabrak
#> 1 2
Imagine we wanted to reduce identifiability by aggregating the data
into Human vs Non-Human. We could code the vector by hand, but human
error can lead to mistakes. To aid in designing complex blurs we supply
the category_blur
utility which uses regex to define the
groups.
human_blur <- category_blur(
starwars$species,
"NotHuman" = "^(?!Human)" # Doesn't start with "Human"
)
And the vector returned can be passed into a new pipeline as before.
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.