Introduction to record linkage with diyar


12 November 2023

Consolidating information from multiple sources is often the first step in an investigation. This vignette gives a brief introduction to the basics of record linkage as implemented by diyar.

Let’s begin by reviewing missing_staff_id - a sample dataset containing incomplete staff information.

library(diyar)
data(missing_staff_id)
missing_staff_id
#>   r_id staff_id age initials hair_colour     branch_office source_1 source_2
#> 1    1       NA  30     G.D.       Brown Republic of Ghana        A        3
#> 2    2       NA  30     B.G.        Teal            France        A        1
#> 3    3       NA  30     X.P.        <NA>              <NA>        A        1
#> 4    4       NA  30     X.P.       Green              <NA>        B        1
#> 5    5       NA  30     <NA>       Green            France        A        1
#> 6    6        2  30     G.D.  Dark brown             Ghana        A        1
#> 7    7        2  30     G.D.       Brown Republic of Ghana        B        2

A unique identifier that distinguishes one entity (staff) from another is often unavailable or incomplete, as is the case with staff_id in this example. links() can be used to create one. The identifier is created as an S4 class (pid) with useful information about each group stored in its slots.

The simplest strategy would be to select one attribute as a distinguishing characteristic for each entity. This is the simple deterministic approach to record linkage.

In the example below, we use initials and hair_colour, each on its own, as the distinguishing characteristic.

missing_staff_id$p1 <- links(criteria = missing_staff_id$initials)
missing_staff_id$p2 <- links(criteria = missing_staff_id$hair_colour)
missing_staff_id[c("initials", "hair_colour", "p1", "p2")]
#>   initials hair_colour            p1            p2
#> 1     G.D.       Brown P.1 (CRI 001) P.1 (CRI 001)
#> 2     B.G.        Teal P.2 (No hits) P.2 (No hits)
#> 3     X.P.        <NA> P.3 (CRI 001) P.3 (No hits)
#> 4     X.P.       Green P.3 (CRI 001) P.4 (CRI 001)
#> 5     <NA>       Green P.5 (No hits) P.4 (CRI 001)
#> 6     G.D.  Dark brown P.1 (CRI 001) P.6 (No hits)
#> 7     G.D.       Brown P.1 (CRI 001) P.1 (CRI 001)

Unsurprisingly, the uniqueness of identifiers p1 and p2 corresponds to the uniqueness of initials and hair_colour respectively. The two identifiers give different outcomes: p1 identifies records 3 and 4 as the same person, while p2 identifies records 4 and 5 as the same person.
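As a rough illustration of single-attribute deterministic linkage, the same grouping can be approximated in base R with match(). This is only a sketch of the idea and not how links() is implemented. The helper name simple_link is hypothetical, and missing values are assumed never to match.

```r
# Hypothetical helper: group records that share the same non-missing value.
simple_link <- function(x) {
  # match() gives the position of the first record holding each value
  group_id <- match(x, x)
  # Assumption: a missing value never matches, so it keeps its own record number
  group_id[is.na(x)] <- which(is.na(x))
  group_id
}

initials    <- c("G.D.", "B.G.", "X.P.", "X.P.", NA, "G.D.", "G.D.")
hair_colour <- c("Brown", "Teal", NA, "Green", "Green", "Dark brown", "Brown")

simple_link(initials)     # 1 2 3 3 5 1 1
simple_link(hair_colour)  # 1 2 3 4 4 6 1
```

The group numbers agree with the P.n labels of p1 and p2 shown above.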

To maximise coverage, links() can implement an ordered multistage deterministic approach to record linkage. For example, we can say that records with matching initials should be linked to each other first, and that other records with a matching hair_colour should then be added to each group. This is referred to as group expansion.

missing_staff_id$p3 <- links(
  criteria = as.list(missing_staff_id[c("initials", "hair_colour")])
  )
missing_staff_id[c("initials", "hair_colour", "p1", "p2", "p3")]
#>   initials hair_colour            p1            p2            p3
#> 1     G.D.       Brown P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)
#> 2     B.G.        Teal P.2 (No hits) P.2 (No hits) P.2 (No hits)
#> 3     X.P.        <NA> P.3 (CRI 001) P.3 (No hits) P.3 (CRI 001)
#> 4     X.P.       Green P.3 (CRI 001) P.4 (CRI 001) P.3 (CRI 001)
#> 5     <NA>       Green P.5 (No hits) P.4 (CRI 001) P.3 (CRI 002)
#> 6     G.D.  Dark brown P.1 (CRI 001) P.6 (No hits) P.1 (CRI 001)
#> 7     G.D.       Brown P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)

We see that p3 now identifies records 3, 4 and 5 as the same person. The logic here is that since record 4 has the same initials as record 3 and the same hair_colour as record 5, all three are linked as part of the same entity. Note that records 3 and 5 have only been linked due to their shared link with record 4. If record 4 is removed from this dataset and the analysis repeated, records 3 and 5 will no longer be linked and will remain separate entities.
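The effect of group expansion, and of removing record 4, can be sketched in base R by merging groups stage by stage. This is a simplified illustration under stated assumptions (missing values never match, and any match merges two groups). It is not the algorithm used by links(), and link_in_stages is a hypothetical helper.

```r
# Hypothetical helper: link records stage by stage, merging groups on each match.
link_in_stages <- function(...) {
  stages <- list(...)
  n <- length(stages[[1]])
  group_id <- seq_len(n)
  for (values in stages) {
    for (i in seq_len(n - 1)) {
      for (j in (i + 1):n) {
        # Assumption: missing values never match
        if (!is.na(values[i]) && !is.na(values[j]) && values[i] == values[j]) {
          # Merge the two groups, keeping the smaller group number
          keep   <- min(group_id[i], group_id[j])
          merged <- max(group_id[i], group_id[j])
          group_id[group_id == merged] <- keep
        }
      }
    }
  }
  group_id
}

initials    <- c("G.D.", "B.G.", "X.P.", "X.P.", NA, "G.D.", "G.D.")
hair_colour <- c("Brown", "Teal", NA, "Green", "Green", "Dark brown", "Brown")

# Stage 1: initials, stage 2: hair_colour
link_in_stages(initials, hair_colour)            # 1 2 3 3 3 1 1

# Without record 4, the former records 3 and 5 stay in separate groups
link_in_stages(initials[-4], hair_colour[-4])    # 1 2 3 4 1 1
```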

At each stage, additional match criteria can be specified. This is done through a sub_criteria object. This is an S3 class containing attributes to be compared and functions for the comparisons. A sub_criteria object is used for evaluated, fuzzy and/or nested matches.

For example, we can compare hair_colour and branch_office without any order (priority) to them. This is the equivalent of requiring a matching hair colour OR a matching branch office (scri_1), or a matching hair colour AND a matching branch office (scri_2).

scri_1 <- sub_criteria(
  missing_staff_id$hair_colour, 
  missing_staff_id$branch_office, 
  operator = "or"
  )
scri_2 <- sub_criteria(
  missing_staff_id$hair_colour, 
  missing_staff_id$branch_office, 
  operator = "and"
  )
missing_staff_id$p4 <- links(
  criteria = "place_holder", 
  sub_criteria = list(cr1 = scri_1), 
  recursive = TRUE
  )
missing_staff_id$p5 <- links(
  criteria = "place_holder", 
  sub_criteria = list(cr1 = scri_2), 
  recursive = TRUE
  )
missing_staff_id[c("hair_colour", "branch_office", "p4", "p5")]
#>   hair_colour     branch_office            p4            p5
#> 1       Brown Republic of Ghana P.1 (CRI 001) P.1 (CRI 001)
#> 2        Teal            France P.2 (CRI 001) P.2 (No hits)
#> 3        <NA>              <NA> P.3 (No hits) P.3 (No hits)
#> 4       Green              <NA> P.4 (No hits) P.4 (No hits)
#> 5       Green            France P.2 (CRI 001) P.5 (No hits)
#> 6  Dark brown             Ghana P.6 (No hits) P.6 (No hits)
#> 7       Brown Republic of Ghana P.1 (CRI 001) P.1 (CRI 001)

There is no limit to the number of sub_criteria that can be specified, but each sub_criteria must be paired with a criteria. Any unpaired sub_criteria will be ignored.

As mentioned, a sub_criteria can be nested. For example, scri_3 below is the equivalent of saying (scri_1; matching hair colour OR branch office) AND (matching initials OR branch office).

scri_3 <- sub_criteria(
  scri_1, 
  sub_criteria(
    missing_staff_id$initials, 
    missing_staff_id$branch_office,
    operator = "or"),
  operator = "and"
  )
missing_staff_id$p6 <- links(
  criteria = "place_holder", 
  sub_criteria = list(cr1 = scri_3), 
  recursive = TRUE
  )
missing_staff_id[c("hair_colour", "branch_office", "p4", "p5", "p6")]
#>   hair_colour     branch_office            p4            p5            p6
#> 1       Brown Republic of Ghana P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)
#> 2        Teal            France P.2 (CRI 001) P.2 (No hits) P.2 (No hits)
#> 3        <NA>              <NA> P.3 (No hits) P.3 (No hits) P.3 (No hits)
#> 4       Green              <NA> P.4 (No hits) P.4 (No hits) P.4 (No hits)
#> 5       Green            France P.2 (CRI 001) P.5 (No hits) P.5 (No hits)
#> 6  Dark brown             Ghana P.6 (No hits) P.6 (No hits) P.6 (No hits)
#> 7       Brown Republic of Ghana P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)

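Evaluated matches can be implemented with user-defined functions. Going by the examples that follow, such a function should accept two arguments, x and y, holding the values being compared, and return a logical (TRUE or FALSE) result for each record-pair.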

For example, there are variations of the same hair_colour and branch_office values in missing_staff_id. A quick look shows that comparing only the last word of each value will improve the linkage result. We can create a function that makes this comparison and pass it to the sub_criteria object. After doing this below (p7), we see that record 6 has now been linked with records 1 and 7, which was not the case earlier.

# A function to extract the last word in a string
last_word_wf <- function(x) tolower(gsub("^.* ", "", x))
# A logical test using `last_word_wf`.
last_word_cmp <- function(x, y) last_word_wf(x) == last_word_wf(y)

scri_4 <- sub_criteria(
  missing_staff_id$hair_colour, 
  missing_staff_id$branch_office,
  match_funcs = c(last_word_cmp, last_word_cmp),
  operator = "or"
  )
missing_staff_id$p7 <- links(
  criteria = "place_holder", 
  sub_criteria = list(cr1 = scri_4), 
  recursive = TRUE
  )
missing_staff_id[c("hair_colour", "branch_office", "p4", "p5", "p6", "p7")]
#>   hair_colour     branch_office            p4            p5            p6
#> 1       Brown Republic of Ghana P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)
#> 2        Teal            France P.2 (CRI 001) P.2 (No hits) P.2 (No hits)
#> 3        <NA>              <NA> P.3 (No hits) P.3 (No hits) P.3 (No hits)
#> 4       Green              <NA> P.4 (No hits) P.4 (No hits) P.4 (No hits)
#> 5       Green            France P.2 (CRI 001) P.5 (No hits) P.5 (No hits)
#> 6  Dark brown             Ghana P.6 (No hits) P.6 (No hits) P.6 (No hits)
#> 7       Brown Republic of Ghana P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)
#>              p7
#> 1 P.1 (CRI 001)
#> 2 P.2 (CRI 001)
#> 3 P.3 (No hits)
#> 4 P.4 (No hits)
#> 5 P.2 (CRI 001)
#> 6 P.1 (CRI 001)
#> 7 P.1 (CRI 001)
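To see why record 6 is now linked, the two helper functions can be applied directly to a few of the values. The block below simply repeats the definitions used above so that it runs on its own.

```r
# Same helpers as above, repeated so this block is self-contained
last_word_wf  <- function(x) tolower(gsub("^.* ", "", x))
last_word_cmp <- function(x, y) last_word_wf(x) == last_word_wf(y)

last_word_wf(c("Brown", "Dark brown", "Republic of Ghana", "Ghana"))
# "brown" "brown" "ghana" "ghana"

last_word_cmp("Dark brown", "Brown")          # TRUE
last_word_cmp("Ghana", "Republic of Ghana")   # TRUE
last_word_cmp("Teal", "Green")                # FALSE
```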

A sub_criteria can provide a lot of flexibility in terms of how attributes are compared; however, this comes at the cost of processing time. This is because links() is an iterative function, comparing batches of record-pairs in iterations. This generally leads to a lower maximum memory usage but longer run times, since multiple batches have to be analysed. There are three modes of a batched analysis with links(): "yes", "semi" and "no". These help manage the maximum memory usage or the maximum number of iterations needed to complete the analysis.

For instance, below is a match criterion for a rolling match of records within two units (for example, days) of each other. With print(), we can see the record-pair batches compared at each iteration.

dfr <- data.frame(x = 1:5)
roll_window_funx <- function(x, y){
  match <- abs(x - y) <= 2
  print(data.frame(y, x, match))
  cat("\n")
  return(match)
  }
roll_window_scri <- sub_criteria(
  dfr$x,
  match_funcs = roll_window_funx
  )

With the "yes" option, the linkage takes 5 iterations (run time) but only 5 record-pairs (max memory usage) are compared at each iteration.

dfr$b.p1 <- links(
  criteria = "place_holder",
  sub_criteria = list(cr1 = roll_window_scri),
  batched = "yes",
  recursive = TRUE
  )
#>   y x match
#> 1 1 1  TRUE
#> 2 1 2  TRUE
#> 3 1 3  TRUE
#> 4 1 4 FALSE
#> 5 1 5 FALSE
#> 
#>   y x match
#> 1 2 2  TRUE
#> 2 2 3  TRUE
#> 3 2 4  TRUE
#> 4 2 5 FALSE
#> 5 2 1  TRUE
#> 
#>   y x match
#> 1 3 3  TRUE
#> 2 3 4  TRUE
#> 3 3 5  TRUE
#> 4 3 1  TRUE
#> 5 3 2  TRUE
#> 
#>   y x match
#> 1 4 4  TRUE
#> 2 4 5  TRUE
#> 3 4 1 FALSE
#> 4 4 2  TRUE
#> 5 4 3  TRUE
#> 
#>   y x match
#> 1 5 5  TRUE
#> 2 5 1 FALSE
#> 3 5 2 FALSE
#> 4 5 3  TRUE
#> 5 5 4  TRUE

Conversely, the "no" option completes the linkage in 1 iteration but creates 15 record-pairs in that single iteration.

dfr$b.p2 <- links(
  criteria = "place_holder",
  sub_criteria = list(cr1 = roll_window_scri),
  batched = "no"
  )
#>    y x match
#> 1  1 1  TRUE
#> 2  1 2  TRUE
#> 3  1 3  TRUE
#> 4  1 4 FALSE
#> 5  1 5 FALSE
#> 6  2 2  TRUE
#> 7  2 3  TRUE
#> 8  2 4  TRUE
#> 9  2 5 FALSE
#> 10 3 3  TRUE
#> 11 3 4  TRUE
#> 12 3 5  TRUE
#> 13 4 4  TRUE
#> 14 4 5  TRUE
#> 15 5 5  TRUE
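The pair counts in the two outputs above can be reproduced with a short base R sketch. It only mimics the bookkeeping (which pairs exist and how many match). It makes no claim about how links() builds its batches internally, and the object names are illustrative.

```r
x <- 1:5
n <- length(x)

# All unordered record-pairs, self-pairs included, built in one go
pos <- expand.grid(x_pos = seq_len(n), y_pos = seq_len(n))
pos <- pos[pos$x_pos >= pos$y_pos, ]
pairs <- data.frame(y = x[pos$y_pos], x = x[pos$x_pos])
pairs$match <- abs(pairs$x - pairs$y) <= 2

nrow(pairs)        # 15, i.e. n * (n + 1) / 2
sum(pairs$match)   # 12 of the 15 pairs are within 2 units of each other
n * n              # 25 comparisons in total when 5 pairs are checked in each of 5 iterations
```

The trade-off is visible in these numbers: one batch of 15 pairs held in memory at once, against five smaller batches of 5 pairs that add up to more comparisons overall.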

The "semi" option is a balance between the "yes" and "no" options. The number of record-pairs increases as matches are identified. This generally leads to a lower maximum memory usage compared to the "no" option and fewer iterations compared to the "yes" option.

dfr$b.p3 <- links(
  criteria = "place_holder",
  sub_criteria = list(cr1 = roll_window_scri),
  batched = "semi",
  recursive = TRUE
  )
#>   y x match
#> 1 1 1  TRUE
#> 2 1 2  TRUE
#> 3 1 3  TRUE
#> 4 1 4 FALSE
#> 5 1 5 FALSE
#> 
#>   y x match
#> 1 2 2  TRUE
#> 2 2 3  TRUE
#> 3 2 4  TRUE
#> 4 2 5 FALSE
#> 5 2 1  TRUE
#> 6 3 3  TRUE
#> 7 3 4  TRUE
#> 8 3 5  TRUE
#> 9 3 1  TRUE
#> 
#>   y x match
#> 1 4 4  TRUE
#> 2 4 5  TRUE
#> 3 4 1 FALSE
#> 4 4 2  TRUE
#> 5 4 3  TRUE
#> 6 5 5  TRUE
#> 7 5 1 FALSE
#> 8 5 2 FALSE
#> 9 5 3  TRUE
dfr
#>   x          b.p1          b.p2          b.p3
#> 1 1 P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)
#> 2 2 P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)
#> 3 3 P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)
#> 4 4 P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)
#> 5 5 P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)
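All five records end up in one group even though records 1 and 5 are more than two units apart. This is because matches are chained: record 1 matches record 3, which in turn matches record 5. A minimal base R sketch of that chaining, offered as an illustration of the idea and not as the procedure links() follows, is:

```r
x <- 1:5

# Records 1 and 5 do not match each other directly
abs(1 - 5) <= 2   # FALSE

# Start from the first record and repeatedly add any record that is within
# 2 units of a record already in the group, until nothing new is added.
in_group <- x == 1
repeat {
  reachable <- sapply(x, function(v) any(abs(v - x[in_group]) <= 2))
  if (identical(reachable, in_group)) break
  in_group <- reachable
}
in_group   # TRUE TRUE TRUE TRUE TRUE
```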

Variations of links(), including links_wf_probabilistic(), links_af_probabilistic() and links_wf_episodes(), cover specific use cases such as probabilistic record linkage and grouping temporal events.

The implementation of probabilistic record linkage is based on the Fellegi and Sunter (1969) model for deciding if two records belong to the same entity. In summary, m_probabilities and u_probabilities, which are the probabilities of a true and false match respectively, are used to calculate a final match score for each record-pair. Record-pairs with a score above or below a certain score_threshold are considered matches or non-matches respectively. See help(links_wf_probabilistic) for a more detailed explanation of the method. Below we see the same analysis as above but as a probabilistic record linkage.

missing_staff_id$p9 <- links_wf_probabilistic(
  attribute = list(missing_staff_id$hair_colour, 
                   missing_staff_id$branch_office), 
  cmp_func = c(last_word_cmp, last_word_cmp), 
  probabilistic = TRUE
  )
missing_staff_id[c("hair_colour", "branch_office", "p7", "p9")]
#>   hair_colour     branch_office            p7            p9
#> 1       Brown Republic of Ghana P.1 (CRI 001) P.1 (CRI 001)
#> 2        Teal            France P.2 (CRI 001) P.2 (No hits)
#> 3        <NA>              <NA> P.3 (No hits) P.3 (No hits)
#> 4       Green              <NA> P.4 (No hits) P.4 (No hits)
#> 5       Green            France P.2 (CRI 001) P.5 (No hits)
#> 6  Dark brown             Ghana P.1 (CRI 001) P.1 (CRI 001)
#> 7       Brown Republic of Ghana P.1 (CRI 001) P.1 (CRI 001)
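The scoring idea behind the Fellegi and Sunter model can be sketched in base R using the standard log2 agreement and disagreement weights. The m- and u-probabilities and the threshold below are made-up values chosen for illustration, and pair_score is a hypothetical helper. How links_wf_probabilistic() derives its own defaults is described in its help page and is not reproduced here.

```r
# Hypothetical m- and u-probabilities for the two attributes
m <- c(hair_colour = 0.9, branch_office = 0.9)
u <- c(hair_colour = 0.1, branch_office = 0.1)

# Standard Fellegi-Sunter weights for agreement and disagreement
agreement_weight    <- log2(m / u)
disagreement_weight <- log2((1 - m) / (1 - u))

# Score a record-pair from a logical vector saying which attributes agree
pair_score <- function(agrees) {
  sum(ifelse(agrees, agreement_weight, disagreement_weight))
}

score_threshold <- 0  # assumed cut-off, for illustration only

pair_score(c(TRUE, TRUE))   >= score_threshold   # TRUE: treated as a match
pair_score(c(FALSE, FALSE)) >= score_threshold   # FALSE: treated as a non-match
```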
