Linking survey data with SGICs (Subject Generated Identification-Codes)? Awesome! Just remember, you need to validate those IDs. That’s how you get clean data and make sure the link-up goes smoothly.
This vignette shows you:
How to perform plausibility checks on different SGIC components.
How to perform plausibility checks on non-SGIC variables that may serve as additional identifiers.
How to detect duplicate cases using a combination of variables as unique identifiers.
To check the plausibility of ID-related variables in a dataset,
trustmebro provides several functions beginning with the
prefix inspect. Every inspect-function returns a
boolean value, indicating whether a value has passed or failed the
plausibility check.
We’ll start by loading trustmebro and dplyr:
library(trustmebro)
library(dplyr)
The survey data we use is the
trustmebro::sailor_students dataset. It contains fictional
assessment data for students from the Sailor Moon universe.
sailor_students
#> # A tibble: 12 × 6
#>    sgic             school class   gender  testscore_langauge testscore_calculus
#>    <chr>            <chr>  <chr>   <chr>                <dbl>              <dbl>
#>  1 "MUC__0308"      54321  "3-B "  "Male"                 425                394
#>  2 "HÄT 2701"       22345  "2-A"   "???"                 4596                123
#>  3 "MUK3801"        22345  "  2-B" "Femal…               2456               9485
#>  4 "SAM10"          22345  "3-B"   "Femal…               2345                  3
#>  5 "T0601"          65432  "1-C"   "Femal…               1234                 NA
#>  6 "      UIT3006 " 12345  "3-3"    <NA>                  123                394
#>  7 "@@@@@@"         <NA>   "3_2  " "Femal…                 56               2938
#>  8  <NA>            12345  "3@41"  "   Fe…                986               3948
#>  9 " "              unkown  <NA>   "Femal…                284                205
#> 10 "MOA2210"        12345  " "     "Femal…                105                 21
#> 11 "MUK3801"        22345  "2-B"   "Femal…               9586                934
#> 12 "T0601"          65432  "1-C"   "Femal…                 NA                764
The variable sgic stores SGICs created by students. Each
SGIC is a seven-character string created according to the following
instructions:
Characters 1-3 (letters):
First letter of given name (1st character)
Last letter of given name (2nd character)
First letter of family name (3rd character)
Characters 4-7 (digits):
Day of birth (4th and 5th characters)
Month of birth (6th and 7th characters)
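To make the construction rule concrete, here is a minimal sketch in base R. The helper make_sgic is hypothetical and not part of trustmebro; it only illustrates how the seven characters are assembled from a name and a day and month of birth.

```r
# Hypothetical helper (not part of trustmebro): build an SGIC from its parts.
make_sgic <- function(given, family, day, month) {
  given  <- toupper(given)
  family <- toupper(family)
  paste0(
    substr(given, 1, 1),                        # first letter of given name
    substr(given, nchar(given), nchar(given)),  # last letter of given name
    substr(family, 1, 1),                       # first letter of family name
    sprintf("%02d%02d", as.integer(day), as.integer(month))  # zero-padded DDMM
  )
}

example_sgic <- make_sgic("Usagi", "Tsukino", 30, 6)
example_sgic
#> [1] "UIT3006"
```

Zero-padding matters here: a student born on 6 January should write 0601, not 61, otherwise the code is shorter than seven characters.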
We can use trustmebro::inspect_characterid to check if
the provided SGICs adhere to the expected pattern of three letters
followed by four digits. The expected structure can be defined using the
regular expression "^[A-Za-z]{3}[0-9]{4}$", which we can
then pass to the function using the pattern = argument. For
seamless integration into your data workflow, this function can be
conveniently combined with dplyr::mutate:
sailor_students %>% 
  mutate(structure_check = 
           inspect_characterid(
             sgic, pattern = "^[A-Za-z]{3}[0-9]{4}$")) %>%
  select(sgic, structure_check)
#> # A tibble: 12 × 2
#>    sgic             structure_check
#>    <chr>            <lgl>          
#>  1 "MUC__0308"      FALSE          
#>  2 "HÄT 2701"       FALSE          
#>  3 "MUK3801"        TRUE           
#>  4 "SAM10"          FALSE          
#>  5 "T0601"          FALSE          
#>  6 "      UIT3006 " FALSE          
#>  7 "@@@@@@"         FALSE          
#>  8  <NA>            FALSE          
#>  9 " "              FALSE          
#> 10 "MOA2210"        TRUE           
#> 11 "MUK3801"        TRUE           
#> 12 "T0601"          FALSE
We created trustmebro::inspect_characterid with SGICs in
mind, but of course, any other non-SGIC strings can also be checked
using a specified regular expression.
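If you want to try out a regular expression before running it on the whole dataset, base R's grepl is a quick way to do so. The following is only a sketch of the idea and makes no claim about how inspect_characterid works internally; the vector ids is made up from values that appear in sailor_students.

```r
# Pattern from above: three letters followed by four digits, nothing else.
sgic_pattern <- "^[A-Za-z]{3}[0-9]{4}$"

# A few values taken from sailor_students, plus a missing value.
ids <- c("MUK3801", "SAM10", "      UIT3006 ", "MUC__0308", NA)

# grepl() returns FALSE for missing values.
raw_check <- grepl(sgic_pattern, ids)
raw_check
#> [1]  TRUE FALSE FALSE FALSE FALSE

# Trimming surrounding whitespace first rescues the third ID.
trimmed_check <- grepl(sgic_pattern, trimws(ids))
trimmed_check
#> [1]  TRUE FALSE  TRUE FALSE FALSE
```

Leading or trailing whitespace alone can make an otherwise valid SGIC fail the check, so it may be worth trimming the IDs before inspecting them.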
Since the SGIC should end with the day and month of birth, you can verify the
plausibility of these digits using
trustmebro::inspect_birthdaymonth. This function checks if
a string contains exactly four digits representing a valid day and month of
birth. As before, you can combine
trustmebro::inspect_birthdaymonth with
dplyr::mutate to generate a plausibility check
variable:
sailor_students %>% 
  mutate(birthdate_check = 
           inspect_birthdaymonth(sgic)) %>%
  select(sgic, birthdate_check)
#> # A tibble: 12 × 2
#>    sgic             birthdate_check
#>    <chr>            <lgl>          
#>  1 "MUC__0308"      TRUE           
#>  2 "HÄT 2701"       TRUE           
#>  3 "MUK3801"        FALSE          
#>  4 "SAM10"          FALSE          
#>  5 "T0601"          TRUE           
#>  6 "      UIT3006 " TRUE           
#>  7 "@@@@@@"         FALSE          
#>  8  <NA>            FALSE          
#>  9 " "              FALSE          
#> 10 "MOA2210"        TRUE           
#> 11 "MUK3801"        FALSE          
#> 12 "T0601"          TRUE
Some SGICs contain only the day or only the month of birth. In
this case, you can use trustmebro::inspect_birthday or
trustmebro::inspect_birthmonth accordingly.
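The logic behind a day-and-month check can be sketched in base R. The helper check_daymonth below is hypothetical and not necessarily how trustmebro implements its functions. It assumes that all non-digit characters are dropped first, that exactly four digits must remain, and that the leap year 2000 is used as reference so that 29 February counts as valid.

```r
# Hypothetical helper (not part of trustmebro): is DDMM a valid day and month?
check_daymonth <- function(x) {
  digits <- gsub("[^0-9]", "", x)                 # keep digits only; NA stays NA
  has_four <- !is.na(digits) & nchar(digits) == 4
  day   <- substr(digits, 1, 2)
  month <- substr(digits, 3, 4)
  # Assumption: parse against the leap year 2000; impossible dates become NA.
  parsed <- as.Date(paste0("2000-", month, "-", day), format = "%Y-%m-%d")
  has_four & !is.na(parsed)
}

ids <- c("MUC__0308", "MUK3801", "SAM10", "T0601", NA)
daymonth_check <- check_daymonth(ids)
daymonth_check
#> [1]  TRUE FALSE FALSE  TRUE FALSE
```

Note that "MUC__0308" and "T0601" pass here although they failed the structure check: the two checks look at different aspects of the ID, so it makes sense to run both.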
Besides an SGIC, other variables in a given dataset might be used to
identify cases. As mentioned above,
trustmebro::inspect_characterid can be used for any string
that should follow a specific pattern. Furthermore, this package also
provides functions for checking other data types beyond strings.
We can use trustmebro::inspect_numberid to check if a
number matches an expected length. In our dataset, school
should be a five-digit number. Combined with dplyr::mutate,
we can add a plausibility variable for the school number, just as we did
before:
sailor_students %>% 
  mutate(school_check = 
           inspect_numberid(school, 5)) %>%
  select(school, school_check)
#> # A tibble: 12 × 2
#>    school school_check
#>    <chr>  <lgl>       
#>  1 54321  TRUE        
#>  2 22345  TRUE        
#>  3 22345  TRUE        
#>  4 22345  TRUE        
#>  5 65432  TRUE        
#>  6 12345  TRUE        
#>  7 <NA>   FALSE       
#>  8 12345  TRUE        
#>  9 unkown FALSE       
#> 10 12345  TRUE        
#> 11 22345  TRUE        
#> 12 65432  TRUE
In the process of using non-SGIC variables as identifiers,
categorical data is often recoded to ensure consistency within a
workflow. We can use trustmebro::inspect_valinvec to check
if a value exists in a recode map. The recode map should be a named
vector, where the names represent the keys. In our dataset, we want to
inspect if all values in gender conform to this recode
map:
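The definition of recode_gender is not shown here, so the following is a minimal sketch under assumptions: the keys "Male" and "Female" are consistent with the check results shown further down, while the codes "M" and "F" are hypothetical placeholders. The base R comparison with %in% only illustrates the idea of a key lookup and is not the implementation of inspect_valinvec.

```r
# Assumed recode map: names are the keys, values are hypothetical codes.
recode_gender <- c(Male = "M", Female = "F")

# A few values as they appear in sailor_students.
gender <- c("Male", "Female  ", "???", NA, "Female")

# A value passes only if it matches a key exactly, including whitespace.
key_check <- gender %in% names(recode_gender)
key_check
#> [1]  TRUE FALSE FALSE FALSE  TRUE
```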
The function checks if a value is present as a key. Combine it with
dplyr::mutate to add a variable that contains the check
results:
sailor_students %>% 
  mutate(gender_check = 
           inspect_valinvec(gender, recode_gender)) %>%
  select(gender, gender_check)
#> # A tibble: 12 × 2
#>    gender      gender_check
#>    <chr>       <lgl>       
#>  1 "Male"      TRUE        
#>  2 "???"       FALSE       
#>  3 "Female"    TRUE        
#>  4 "Female  "  FALSE       
#>  5 "Female"    TRUE        
#>  6  <NA>       FALSE       
#>  7 "Female"    TRUE        
#>  8 "   Female" FALSE       
#>  9 "Female"    TRUE        
#> 10 "Female"    TRUE        
#> 11 "Female"    TRUE        
#> 12 "Female"    TRUE
So far, we’ve checked if SGIC, school and
gender contain plausible values. Last, we want to ensure
that these variables, when used together as identifiers, uniquely
identify a single case and that there are no duplicate entries based on
these variables. trustmebro::find_dupes checks whether the
combination of identifiers is unique by adding a has_dupes variable to
the dataset. To find duplicates in your data, use it like this:
sailor_students %>% 
  find_dupes(school, sgic, gender) %>%
  select(school, sgic, gender, has_dupes)
#> # A tibble: 12 × 4
#>    school sgic             gender      has_dupes
#>    <chr>  <chr>            <chr>       <lgl>    
#>  1 54321  "MUC__0308"      "Male"      FALSE    
#>  2 22345  "HÄT 2701"       "???"       FALSE    
#>  3 22345  "MUK3801"        "Female"    TRUE     
#>  4 22345  "SAM10"          "Female  "  FALSE    
#>  5 65432  "T0601"          "Female"    TRUE     
#>  6 12345  "      UIT3006 "  <NA>       FALSE    
#>  7 <NA>   "@@@@@@"         "Female"    FALSE    
#>  8 12345   <NA>            "   Female" FALSE    
#>  9 unkown " "              "Female"    FALSE    
#> 10 12345  "MOA2210"        "Female"    FALSE    
#> 11 22345  "MUK3801"        "Female"    TRUE     
#> 12 65432  "T0601"          "Female"    TRUE
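The idea of flagging every row that shares its identifier combination with another row can be sketched in base R with duplicated(). This is an illustration under assumptions, not the implementation of find_dupes; the small data frame students is made up from values in sailor_students.

```r
# Made-up excerpt with the three identifier variables.
students <- data.frame(
  school = c("22345", "22345", "65432", "12345"),
  sgic   = c("MUK3801", "MUK3801", "T0601", "MOA2210"),
  gender = c("Female", "Female", "Female", "Female"),
  stringsAsFactors = FALSE
)

keys <- students[c("school", "sgic", "gender")]

# duplicated() flags only later occurrences; scanning from both ends
# marks every member of a duplicated group.
students$has_dupes <- duplicated(keys) | duplicated(keys, fromLast = TRUE)
students$has_dupes
#> [1]  TRUE  TRUE FALSE FALSE
```

Keep in mind that duplicates can only be detected among identical values. Stray whitespace such as "Female  " or an untrimmed SGIC can hide a duplicate, which is one more reason to run the plausibility checks first.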