Consolidating information from multiple sources is often the first
step in these investigations. This vignette gives a brief introduction
to the basics of record linkage as implemented by diyar.
Let’s begin by reviewing missing_staff_id, a sample dataset containing incomplete staff information.
library(diyar)
data(missing_staff_id)
missing_staff_id
#> r_id staff_id age initials hair_colour branch_office source_1 source_2
#> 1 1 NA 30 G.D. Brown Republic of Ghana A 3
#> 2 2 NA 30 B.G. Teal France A 1
#> 3 3 NA 30 X.P. <NA> <NA> A 1
#> 4 4 NA 30 X.P. Green <NA> B 1
#> 5 5 NA 30 <NA> Green France A 1
#> 6 6 2 30 G.D. Dark brown Ghana A 1
#> 7 7 2 30 G.D. Brown Republic of Ghana B 2
A unique identifier that distinguishes one entity (staff) from another is often unavailable or incomplete, as is the case with staff_id in this example. links() can be used to create one. The identifier is created as an S4 class (pid) with useful information about each group in its slots.
The simplest strategy would be to select one attribute as a distinguishing characteristic for each entity. This is the simple deterministic approach to record linkage.
In the example below, we use initials and hair_colour, each on its own, as distinguishing characteristics.
missing_staff_id$p1 <- links(criteria = missing_staff_id$initials)
missing_staff_id$p2 <- links(criteria = missing_staff_id$hair_colour)
missing_staff_id[c("initials", "hair_colour", "p1", "p2")]
#> initials hair_colour p1 p2
#> 1 G.D. Brown P.1 (CRI 001) P.1 (CRI 001)
#> 2 B.G. Teal P.2 (No hits) P.2 (No hits)
#> 3 X.P. <NA> P.3 (CRI 001) P.3 (No hits)
#> 4 X.P. Green P.3 (CRI 001) P.4 (CRI 001)
#> 5 <NA> Green P.5 (No hits) P.4 (CRI 001)
#> 6 G.D. Dark brown P.1 (CRI 001) P.6 (No hits)
#> 7 G.D. Brown P.1 (CRI 001) P.1 (CRI 001)
Unsurprisingly, the uniqueness of identifiers p1 and p2 corresponds to the uniqueness of initials and hair_colour respectively. The two identifiers represent different outcomes: p1 identifies records 3 and 4 as the same person, while p2 identifies records 4 and 5 as the same person.
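The single-attribute grouping above can be approximated in base R. This is only a rough sketch of the idea, not how links() is implemented; simple_group is a hypothetical helper, and the two vectors are copied from the dataset printed earlier.

```r
# Values copied from the missing_staff_id printout above
initials    <- c("G.D.", "B.G.", "X.P.", "X.P.", NA, "G.D.", "G.D.")
hair_colour <- c("Brown", "Teal", NA, "Green", "Green", "Dark brown", "Brown")

# Hypothetical helper: records sharing a value get the index of the
# first record with that value; missing values never match anything
simple_group <- function(x) {
  id <- match(x, x)
  id[is.na(x)] <- which(is.na(x))
  id
}

g1 <- simple_group(initials)     # groups records 1, 6, 7 and records 3, 4
g2 <- simple_group(hair_colour)  # groups records 1, 7 and records 4, 5
data.frame(initials, hair_colour, g1, g2)
```

The group numbers line up with the P.x labels in p1 and p2, which illustrates why each identifier is only as discriminating as the attribute behind it.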
To maximise coverage, links() can implement an ordered multistage deterministic approach to record linkage. For example, we can say that records with matching initials should be linked to each other first, and that other records with a matching hair_colour should then be added to each group. This is referred to as group expansion.
missing_staff_id$p3 <- links(
criteria = as.list(missing_staff_id[c("initials", "hair_colour")])
)
missing_staff_id[c("initials", "hair_colour", "p1", "p2", "p3")]
#> initials hair_colour p1 p2 p3
#> 1 G.D. Brown P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)
#> 2 B.G. Teal P.2 (No hits) P.2 (No hits) P.2 (No hits)
#> 3 X.P. <NA> P.3 (CRI 001) P.3 (No hits) P.3 (CRI 001)
#> 4 X.P. Green P.3 (CRI 001) P.4 (CRI 001) P.3 (CRI 001)
#> 5 <NA> Green P.5 (No hits) P.4 (CRI 001) P.3 (CRI 002)
#> 6 G.D. Dark brown P.1 (CRI 001) P.6 (No hits) P.1 (CRI 001)
#> 7 G.D. Brown P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)
We see that p3 now identifies records 3, 4 and 5 as the same person. The logic here is that since record 4 has the same initials as record 3 and the same hair_colour as record 5, all three are linked as part of the same entity. Note that records 3 and 5 have only been linked due to their shared link with record 4. If record 4 is removed from this dataset and the analysis repeated, records 3 and 5 will not be linked and will therefore remain separate entities.
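The chaining effect described above can be sketched in base R. This is a simplified illustration that merges any groups sharing a non-missing value, stage by stage; it does not reproduce the rules links() applies during group expansion, and expand_groups is a hypothetical helper.

```r
# Values copied from the missing_staff_id printout above
initials    <- c("G.D.", "B.G.", "X.P.", "X.P.", NA, "G.D.", "G.D.")
hair_colour <- c("Brown", "Teal", NA, "Green", "Green", "Dark brown", "Brown")

# Hypothetical helper: merge all groups that share a non-missing value
expand_groups <- function(group, attr) {
  for (value in unique(attr[!is.na(attr)])) {
    hits <- which(!is.na(attr) & attr == value)
    # every group touched by this value collapses into the lowest group id
    touched <- unique(group[hits])
    group[group %in% touched] <- min(touched)
  }
  group
}

group <- seq_along(initials)                 # each record starts on its own
group <- expand_groups(group, initials)      # stage 1: matching initials
group <- expand_groups(group, hair_colour)   # stage 2: matching hair_colour
group

# Repeat without record 4: records 3 and 5 no longer share a link
keep <- -4
group_no4 <- seq_along(initials[keep])
group_no4 <- expand_groups(group_no4, initials[keep])
group_no4 <- expand_groups(group_no4, hair_colour[keep])
group_no4
```

In the second run the positions shift by one after the removed record, so the original records 3 and 5 sit at positions 3 and 4 and end up in different groups.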
During group expansion the following rules are applied.
tie_sort argument.
At each stage, additional match criteria can be specified. This is done through a sub_criteria object. This is an S3 class containing attributes to be compared and functions for the comparisons. A sub_criteria object is used for evaluated, fuzzy and/or nested matches.
For example, we can compare hair_colour and branch_office without any order (priority) to them. This is the equivalent of saying matching hair colour OR branch office (scri_1), or matching hair colour AND branch office (scri_2).
scri_1 <- sub_criteria(
missing_staff_id$hair_colour,
missing_staff_id$branch_office,
operator = "or"
)
scri_2 <- sub_criteria(
missing_staff_id$hair_colour,
missing_staff_id$branch_office,
operator = "and"
)
missing_staff_id$p4 <- links(
criteria = "place_holder",
sub_criteria = list(cr1 = scri_1),
recursive = TRUE
)
missing_staff_id$p5 <- links(
criteria = "place_holder",
sub_criteria = list(cr1 = scri_2),
recursive = TRUE
)
missing_staff_id[c("hair_colour", "branch_office", "p4", "p5")]
#> hair_colour branch_office p4 p5
#> 1 Brown Republic of Ghana P.1 (CRI 001) P.1 (CRI 001)
#> 2 Teal France P.2 (CRI 001) P.2 (No hits)
#> 3 <NA> <NA> P.3 (No hits) P.3 (No hits)
#> 4 Green <NA> P.4 (No hits) P.4 (No hits)
#> 5 Green France P.2 (CRI 001) P.5 (No hits)
#> 6 Dark brown Ghana P.6 (No hits) P.6 (No hits)
#> 7 Brown Republic of Ghana P.1 (CRI 001) P.1 (CRI 001)
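To make the difference between the two operators concrete, the comparison for a single record-pair can be written out in base R. This is only an illustration of the logic for records 2 and 5, using values copied from the table above; it does not call sub_criteria() or links().

```r
# Records 2 and 5 from the table above
hair_2   <- "Teal";   hair_5   <- "Green"
branch_2 <- "France"; branch_5 <- "France"

hair_match   <- hair_2 == hair_5       # hair colours differ
branch_match <- branch_2 == branch_5   # branch offices agree

or_match  <- hair_match | branch_match   # operator = "or"
and_match <- hair_match & branch_match   # operator = "and"
c(or = or_match, and = and_match)
```

This agrees with the output above, where records 2 and 5 share a group in p4 but not in p5.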
There is no limit to the number of sub_criteria that can be specified but each sub_criteria must be paired to a criteria. Any unpaired sub_criteria will be ignored.
As mentioned, a sub_criteria can be nested. For example, scri_3 below is the equivalent of saying (scri_1; matching hair colour OR branch office) AND (matching initials OR branch office).
scri_3 <- sub_criteria(
scri_1,
sub_criteria(
missing_staff_id$initials,
missing_staff_id$branch_office,
operator = "or"),
operator = "and"
)
missing_staff_id$p6 <- links(
criteria = "place_holder",
sub_criteria = list(cr1 = scri_3),
recursive = TRUE
)
missing_staff_id[c("hair_colour", "branch_office", "p4", "p5", "p6")]
#> hair_colour branch_office p4 p5 p6
#> 1 Brown Republic of Ghana P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)
#> 2 Teal France P.2 (CRI 001) P.2 (No hits) P.2 (No hits)
#> 3 <NA> <NA> P.3 (No hits) P.3 (No hits) P.3 (No hits)
#> 4 Green <NA> P.4 (No hits) P.4 (No hits) P.4 (No hits)
#> 5 Green France P.2 (CRI 001) P.5 (No hits) P.5 (No hits)
#> 6 Dark brown Ghana P.6 (No hits) P.6 (No hits) P.6 (No hits)
#> 7 Brown Republic of Ghana P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)
Evaluated matches can be implemented with user-defined functions. The only requirements are that they take two arguments, x and y, where y is the value for one observation being compared against the value of all other observations (x), and that they return TRUE or FALSE.
For example, there are variations of the same hair_colour and branch_office values in missing_staff_id. A quick look shows that comparing only the last word of each value will improve the linkage result. We can create a function that makes this comparison and pass it to the sub_criteria object. After doing this below (p7), we see that record 6 has now been linked with records 1 and 7, which was not the case earlier.
# A function to extract the last word in a string
last_word_wf <- function(x) tolower(gsub("^.* ", "", x))
# A logical test using `last_word_wf`.
last_word_cmp <- function(x, y) last_word_wf(x) == last_word_wf(y)
scri_4 <- sub_criteria(
missing_staff_id$hair_colour,
missing_staff_id$branch_office,
match_funcs = c(last_word_cmp, last_word_cmp),
operator = "or"
)
missing_staff_id$p7 <- links(
criteria = "place_holder",
sub_criteria = list(cr1 = scri_4),
recursive = TRUE
)
missing_staff_id[c("hair_colour", "branch_office", "p4", "p5", "p6", "p7")]
#> hair_colour branch_office p4 p5 p6
#> 1 Brown Republic of Ghana P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)
#> 2 Teal France P.2 (CRI 001) P.2 (No hits) P.2 (No hits)
#> 3 <NA> <NA> P.3 (No hits) P.3 (No hits) P.3 (No hits)
#> 4 Green <NA> P.4 (No hits) P.4 (No hits) P.4 (No hits)
#> 5 Green France P.2 (CRI 001) P.5 (No hits) P.5 (No hits)
#> 6 Dark brown Ghana P.6 (No hits) P.6 (No hits) P.6 (No hits)
#> 7 Brown Republic of Ghana P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)
#> p7
#> 1 P.1 (CRI 001)
#> 2 P.2 (CRI 001)
#> 3 P.3 (No hits)
#> 4 P.4 (No hits)
#> 5 P.2 (CRI 001)
#> 6 P.1 (CRI 001)
#> 7 P.1 (CRI 001)
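To see why record 6 now joins records 1 and 7, the two helper functions can be tried on a few values in isolation. The definitions are repeated from above so that the snippet runs on its own; the input strings are taken from the dataset.

```r
# Same definitions as above
last_word_wf  <- function(x) tolower(gsub("^.* ", "", x))
last_word_cmp <- function(x, y) last_word_wf(x) == last_word_wf(y)

# Everything up to the last space is dropped, then the result is lower-cased
last_word_wf(c("Dark brown", "Brown", "Republic of Ghana", "Ghana"))

# Record 6 against record 1 on each attribute
hair_cmp   <- last_word_cmp("Dark brown", "Brown")
branch_cmp <- last_word_cmp("Ghana", "Republic of Ghana")
c(hair_colour = hair_cmp, branch_office = branch_cmp)
```

Both comparisons come out TRUE, whereas an exact comparison of the full strings would treat the values as different.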
A sub_criteria can provide a lot of flexibility in terms of how attributes are compared; however, it comes at the cost of processing time. This is because links() is an iterative function, comparing batches of record-pairs in iterations. This generally leads to a lower maximum memory usage but longer run times needed to analyse the multiple batches. There are three modes of a batched analysis with links(): "yes", "semi" and "no". These help manage the maximum memory usage or the maximum number of iterations needed to complete the analysis.
For instance, below is a match criterion for a rolling match of records within two days of each other. With print(), we can see the record-pair batches compared at each iteration.
dfr <- data.frame(x = 1:5)
roll_window_funx <- function(x, y){
match <- abs(x - y) <= 2
print(data.frame(y, x, match))
cat("\n")
return(match)
}
roll_window_scri <- sub_criteria(
dfr$x,
match_funcs = roll_window_funx
)
With the "yes" option, the linkage takes 5 iterations (run time) but only 5 record-pairs (max memory usage) are compared at each iteration.
dfr$b.p1 <- links(
criteria = "place_holder",
sub_criteria = list(cr1 = roll_window_scri),
batched = "yes",
recursive = TRUE
)
#> y x match
#> 1 1 1 TRUE
#> 2 1 2 TRUE
#> 3 1 3 TRUE
#> 4 1 4 FALSE
#> 5 1 5 FALSE
#>
#> y x match
#> 1 2 2 TRUE
#> 2 2 3 TRUE
#> 3 2 4 TRUE
#> 4 2 5 FALSE
#> 5 2 1 TRUE
#>
#> y x match
#> 1 3 3 TRUE
#> 2 3 4 TRUE
#> 3 3 5 TRUE
#> 4 3 1 TRUE
#> 5 3 2 TRUE
#>
#> y x match
#> 1 4 4 TRUE
#> 2 4 5 TRUE
#> 3 4 1 FALSE
#> 4 4 2 TRUE
#> 5 4 3 TRUE
#>
#> y x match
#> 1 5 5 TRUE
#> 2 5 1 FALSE
#> 3 5 2 FALSE
#> 4 5 3 TRUE
#> 5 5 4 TRUE
Conversely, the "no" option completes the linkage in 1 iteration but creates 15 record-pairs in that single iteration.
dfr$b.p2 <- links(
criteria = "place_holder",
sub_criteria = list(cr1 = roll_window_scri),
batched = "no"
)
#> y x match
#> 1 1 1 TRUE
#> 2 1 2 TRUE
#> 3 1 3 TRUE
#> 4 1 4 FALSE
#> 5 1 5 FALSE
#> 6 2 2 TRUE
#> 7 2 3 TRUE
#> 8 2 4 TRUE
#> 9 2 5 FALSE
#> 10 3 3 TRUE
#> 11 3 4 TRUE
#> 12 3 5 TRUE
#> 13 4 4 TRUE
#> 14 4 5 TRUE
#> 15 5 5 TRUE
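The count of 15 record-pairs can be checked with a small base R sketch. It enumerates every pair of the five values once, including each value paired with itself, which is what the printout above shows; how links() builds these pairs internally is not covered here.

```r
x <- 1:5

# All combinations, keeping each unordered pair (and each self-pair) once
pairs <- expand.grid(x = x, y = x)
pairs <- pairs[pairs$x >= pairs$y, ]

# Same comparison as roll_window_funx, without the printing
pairs$match <- abs(pairs$x - pairs$y) <= 2

n_pairs   <- nrow(pairs)       # 5 self-pairs + choose(5, 2) = 15
n_matches <- sum(pairs$match)  # pairs at most 2 apart
c(pairs = n_pairs, matches = n_matches)
```

In general a single unbatched pass over n records compared this way holds n * (n + 1) / 2 pairs in memory at once, which is why the other modes trade extra iterations for smaller batches.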
The "semi" option is a balance between the "yes" and "no" options. The number of record-pairs increases as matches are identified. This generally leads to a lower maximum memory usage compared to the "no" option and fewer iterations compared to the "yes" option.
dfr$b.p3 <- links(
criteria = "place_holder",
sub_criteria = list(cr1 = roll_window_scri),
batched = "semi",
recursive = TRUE
)
#> y x match
#> 1 1 1 TRUE
#> 2 1 2 TRUE
#> 3 1 3 TRUE
#> 4 1 4 FALSE
#> 5 1 5 FALSE
#>
#> y x match
#> 1 2 2 TRUE
#> 2 2 3 TRUE
#> 3 2 4 TRUE
#> 4 2 5 FALSE
#> 5 2 1 TRUE
#> 6 3 3 TRUE
#> 7 3 4 TRUE
#> 8 3 5 TRUE
#> 9 3 1 TRUE
#>
#> y x match
#> 1 4 4 TRUE
#> 2 4 5 TRUE
#> 3 4 1 FALSE
#> 4 4 2 TRUE
#> 5 4 3 TRUE
#> 6 5 5 TRUE
#> 7 5 1 FALSE
#> 8 5 2 FALSE
#> 9 5 3 TRUE
dfr
#> x b.p1 b.p2 b.p3
#> 1 1 P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)
#> 2 2 P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)
#> 3 3 P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)
#> 4 4 P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)
#> 5 5 P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)
There are variations of links() such as links_wf_probabilistic(), links_af_probabilistic() and links_wf_episodes() for specific use cases such as probabilistic record linkage and grouping temporal events.
The implementation of probabilistic record linkage is based on the Fellegi and Sunter (1969) model for deciding if two records belong to the same entity. In summary, m_probabilities and u_probabilities, which are the probabilities of a true and false match respectively, are used to calculate a final match score for each record-pair. Record-pairs with a score below or above a certain score_threshold are considered non-matches or matches respectively. See help(links_wf_probabilistic) for a more detailed explanation of the method. Below we see the same analysis as above but as a probabilistic record linkage.
missing_staff_id$p9 <- links_wf_probabilistic(
attribute = list(missing_staff_id$hair_colour,
missing_staff_id$branch_office),
cmp_func = c(last_word_cmp, last_word_cmp),
probabilistic = TRUE
)
missing_staff_id[c("hair_colour", "branch_office", "p7", "p9")]
#> hair_colour branch_office p7 p9
#> 1 Brown Republic of Ghana P.1 (CRI 001) P.1 (CRI 001)
#> 2 Teal France P.2 (CRI 001) P.2 (No hits)
#> 3 <NA> <NA> P.3 (No hits) P.3 (No hits)
#> 4 Green <NA> P.4 (No hits) P.4 (No hits)
#> 5 Green France P.2 (CRI 001) P.5 (No hits)
#> 6 Dark brown Ghana P.1 (CRI 001) P.1 (CRI 001)
#> 7 Brown Republic of Ghana P.1 (CRI 001) P.1 (CRI 001)
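The scoring idea behind the Fellegi and Sunter model can be sketched in base R using the textbook log2 agreement and disagreement weights. The m and u values, the threshold of 0 and the fs_score helper are all made up for illustration; the exact calculation and defaults used by links_wf_probabilistic() may differ, so see its help page for the actual method.

```r
# Assumed probabilities, for illustration only
m <- c(hair_colour = 0.95, branch_office = 0.90)  # P(agree | same entity)
u <- c(hair_colour = 0.20, branch_office = 0.10)  # P(agree | different entities)

agree_wt    <- log2(m / u)              # added when an attribute agrees
disagree_wt <- log2((1 - m) / (1 - u))  # added when it disagrees

# Hypothetical helper: total score for one record-pair
fs_score <- function(agrees) sum(ifelse(agrees, agree_wt, disagree_wt))

score_both <- fs_score(c(TRUE, TRUE))    # both attributes agree
score_none <- fs_score(c(FALSE, FALSE))  # neither attribute agrees

score_threshold <- 0  # assumed cut-off
c(both = score_both >= score_threshold, none = score_none >= score_threshold)
```

Under these assumed values a pair that agrees on both attributes scores above the threshold and is treated as a match, while a pair that agrees on neither scores below it.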