A common set up is the following: x
is a dataset of names with mistakes and abbreviations, while y
is a dataset of id and their possible names - the “true” or “master dataset”. The goal is to find the id
of each observation in x
.
The package includes two functions that can be applied to the “true” dataset y
before using fuzzy_join
. They allow to control both type I
and type II
mistakes when matching on names.
The function count_combinations
returns a data.frame with four columns. Within each pair (id
, name
), it computes all permutation of length lower than n
, and returns the number of occurences of each permutation within and across groups:
id <- c(1, 1, 2, 2)
name <- c("coca cola company", "coca cola incorporated", "apple incorporated", "apple corp")
count_combinations(name, id = id)
#. id name count_within count_across
#> 1 1 coca 2 1
#> 2 1 cola 2 1
#> 3 1 company 1 1
#> 4 1 incorporated 1 2
#> 5 2 apple 2 1
#> 6 2 corp 1 1
#> 7 2 incorporated 1 2
Words with high count_within
and low count_across
are good identifiers, since they are specific to some id
. On the other hand, words with low count_within
and high count_across
are not good identifiers, and one may want to delete these words from x
and y
.
The function compute_distance
returns a data.frame with three columns. Within each pair (id
, name
), it computes all permutations of length lower than n
and returns the minimum string distance of the name with respect to other ids.
id <- c(1, 1, 2, 2)
name <- c("coca cola company", "coca cola incorporated", "apple incorporated", "apple corp")
compute_distance(name, id = id, n = 0)
#> id name distance
#> 1 1 coca cola company 0.5087146
#> 2 1 coca cola incorporated 0.2727273
#> 3 2 apple corp 0.4701299
#> 4 2 apple incorporated 0.2727273
compute_distance(name, id = id, n = 1)
#> id name distance
#> 1 1 coca 0.2666667
#> 2 1 cola 0.2666667
#> 3 1 company 0.2190476
#> 4 1 incorporated 0.0000000
#> 5 2 apple 0.5166667
#> 6 2 corp 0.2190476
#> 7 2 incorporated 0.0000000
Again, words with low distance should be discarded while words with high distance could be added to the “master” dataset y
.
For each row in x
, fuzzy_join
finds the closest row(s) in y
for a specific metric. The distance is a weighted average of a string distance over multiple columns. Both the weights and the string distance can be specified by the user. By default, fuzzy_join
uses the jaro-winkler distance with a winkler adjustment of 0.1 (which gives a higher score to common prefixes).
x <- data.table(a = c("france", "franc"), b = c("arras", "dijon"))
y <- data.table(a = c("franc", "france"), b = c("arvars", "dijjon"))
fuzzy_join(x, y, fuzzy = c("a", "b"), w = c(0.1, 0.9))
#> distance a.x b.x a.y b.y
#> 1: 0.09133333 france arras franc arvars
#> 2: 0.03833333 franc dijon france dijjon
fuzzy_join(x, y, exact = "a", fuzzy = "b")
#> distance a.x b.x a.y b.y
#> 0 franc dijon franc arvars
#> 0 france arras france dijjon
The function corresponds roughly to the Stata command reclink
. The type of distance between strings can be arbitrarly specified thanks to the package stringdist
.