You can use scrutiny to analyze duplicate values in data. Checking for duplications can go a long way toward assessing the reliability of published research.
This vignette walks you through scrutiny’s tools for finding,
counting, and summarizing duplications. It uses the pigs4
dataset as a simple example:
pigs4
#> # A tibble: 5 × 3
#> snout tail wings
#> <chr> <chr> <chr>
#> 1 4.73 6.88 6.09
#> 2 8.13 7.33 8.27
#> 3 4.22 5.17 4.40
#> 4 4.22 7.57 5.92
#> 5 5.17 8.13 5.17
duplicate_count()
A good first step is to create a frequency table. To do so, use duplicate_count():
pigs4 %>%
  duplicate_count()
#> # A tibble: 11 × 4
#> value count locations locations_n
#> <chr> <int> <chr> <int>
#> 1 5.17 3 snout, tail, wings 3
#> 2 4.22 2 snout 1
#> 3 8.13 2 snout, tail 2
#> 4 4.73 1 snout 1
#> 5 6.88 1 tail 1
#> 6 7.33 1 tail 1
#> 7 7.57 1 tail 1
#> 8 4.40 1 wings 1
#> 9 5.92 1 wings 1
#> 10 6.09 1 wings 1
#> 11 8.27 1 wings 1
It returns a tibble (data frame) that lists all unique values in its value column. The tibble is ordered by count, the number of times each value appears in the input data frame, so the values that appear most often are at the top. The locations are the names of the column or columns in which a given value appears. They are counted by locations_n.
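To make these columns concrete, here is a rough base R sketch of the same kind of tally. It is only an illustration, not how scrutiny implements duplicate_count(). The data frame pigs4_copy is a hand-typed stand-in for pigs4 so that the snippet runs without scrutiny, and value_locations() is a hypothetical helper written for this example.

```r
# Hand-typed stand-in for pigs4 (an assumption for illustration);
# values are kept as strings so trailing zeros such as "4.40" survive.
pigs4_copy <- data.frame(
  snout = c("4.73", "8.13", "4.22", "4.22", "5.17"),
  tail  = c("6.88", "7.33", "5.17", "7.57", "8.13"),
  wings = c("6.09", "8.27", "4.40", "5.92", "5.17"),
  stringsAsFactors = FALSE
)

# Pool all cells, then count how often each distinct value occurs.
all_values <- unlist(pigs4_copy, use.names = FALSE)
freq <- sort(table(all_values), decreasing = TRUE)

# Hypothetical helper: names of the columns in which a value occurs.
value_locations <- function(value, data) {
  names(data)[vapply(data, function(col) value %in% col, logical(1))]
}

head(freq, 3)
value_locations("8.13", pigs4_copy)
```

Here, freq plays the role of the count column, and the length of the vector returned by value_locations() corresponds to locations_n.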
For larger datasets, summary statistics can be helpful. Just run audit() after duplicate_count().
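Chaining audit() after duplicate_count() would look roughly like this. The sketch assumes that scrutiny and magrittr are installed; magrittr is attached explicitly only as a precaution, to make sure the %>% pipe is available. The name count_summary is chosen for this example, and the output is not shown here.

```r
library(scrutiny)
library(magrittr)  # for the %>% pipe (attached explicitly as a precaution)

# Build the frequency table first, then summarize it.
count_summary <- pigs4 %>%
  duplicate_count() %>%
  audit()

count_summary
```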
duplicate_count_colpair()
Sometimes, a sequence of data may be repeated in multiple columns. duplicate_count_colpair() helps find such cases:
pigs4 %>%
  duplicate_count_colpair()
#> # A tibble: 3 × 7
#> x y count total_x total_y rate_x rate_y
#> <chr> <chr> <int> <int> <int> <dbl> <dbl>
#> 1 snout tail 2 5 5 0.4 0.4
#> 2 snout wings 1 5 5 0.2 0.2
#> 3 tail wings 1 5 5 0.2 0.2
x and y represent all combinations of columns in pigs4. The count is the number of values that appear in both respective columns. This is different from duplicate_count(), where count displays total frequencies.
snout and tail are the column pair with the most overlap: 2 out of 5 values are the same, a rate of 0.4. If there are no missing values, total_x and total_y are the same. The same applies to rate_x and rate_y.
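The rates follow from the counts: for snout and tail, 2 shared values out of 5 gives 2 / 5 = 0.4. The base R sketch below reproduces the counts in the table above by checking, for each column pair, how many entries of the first column also occur in the second. This is an illustration under that assumption, not scrutiny's actual implementation, and pigs4_copy is a hand-typed stand-in for pigs4.

```r
# Hand-typed stand-in for pigs4 (an assumption for illustration).
pigs4_copy <- data.frame(
  snout = c("4.73", "8.13", "4.22", "4.22", "5.17"),
  tail  = c("6.88", "7.33", "5.17", "7.57", "8.13"),
  wings = c("6.09", "8.27", "4.40", "5.92", "5.17"),
  stringsAsFactors = FALSE
)

# All unordered pairs of column names: snout/tail, snout/wings, tail/wings.
pairs <- combn(names(pigs4_copy), 2, simplify = FALSE)

overlap <- do.call(rbind, lapply(pairs, function(p) {
  x <- pigs4_copy[[p[1]]]
  y <- pigs4_copy[[p[2]]]
  count <- sum(x %in% y)  # entries of x that also occur in y
  data.frame(
    x = p[1], y = p[2], count = count,
    rate_x = count / length(x),
    stringsAsFactors = FALSE
  )
}))

overlap
```

Because the stand-in data has no missing values, every column has 5 entries, which is why each rate here is simply the count divided by 5.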
Again, you can get summary statistics with audit():
pigs4 %>%
  duplicate_count_colpair() %>%
  audit()
#> # A tibble: 5 × 8
#> term mean sd median min max na_count na_rate
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 count 1.33 0.577 1 1 2 0 0
#> 2 total_x 5 0 5 5 5 0 0
#> 3 total_y 5 0 5 5 5 0 0
#> 4 rate_x 0.267 0.115 0.2 0.2 0.4 0 0
#> 5 rate_y 0.267 0.115 0.2 0.2 0.4 0 0
duplicate_tally()
Unlike the other two functions, duplicate_tally() preserves the structure of the original data frame. It adds an _n column next to each original column. The newly added columns count how often each value appears in the data frame as a whole:
pigs4 %>%
  duplicate_tally()
#> # A tibble: 5 × 6
#> snout snout_n tail tail_n wings wings_n
#> <chr> <int> <chr> <int> <chr> <int>
#> 1 4.73 1 6.88 1 6.09 1
#> 2 8.13 2 7.33 1 8.27 1
#> 3 4.22 2 5.17 3 4.40 1
#> 4 4.22 2 7.57 1 5.92 1
#> 5 5.17 3 8.13 2 5.17 3
In snout, for example, 4.22 appears twice, so its entries in snout_n are 2. Likewise, 8.13 appears once in snout and once in tail, so both observations are marked 2 in their respective _n columns.
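A rough base R equivalent may help clarify what the _n columns contain: each one looks up its column's values in a frequency table built from the whole data frame. This is a sketch for illustration, not scrutiny's implementation, and pigs4_copy is a hand-typed stand-in for pigs4.

```r
# Hand-typed stand-in for pigs4 (an assumption for illustration).
pigs4_copy <- data.frame(
  snout = c("4.73", "8.13", "4.22", "4.22", "5.17"),
  tail  = c("6.88", "7.33", "5.17", "7.57", "8.13"),
  wings = c("6.09", "8.27", "4.40", "5.92", "5.17"),
  stringsAsFactors = FALSE
)

# Frequency of every value across the whole data frame.
freq <- table(unlist(pigs4_copy, use.names = FALSE))

# For each column, look up how often each of its values occurs overall.
tallies <- lapply(pigs4_copy, function(col) as.integer(freq[col]))
names(tallies) <- paste0(names(pigs4_copy), "_n")

tallies$snout_n
tallies$tail_n
```

The lookup uses the pooled frequency table rather than per-column counts, which is why 8.13 is tallied as 2 even though it occurs only once in each of snout and tail.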
When you follow duplicate_tally() up with audit(), the output shows summary statistics for each _n column. The last row summarizes all of these columns together.
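As a sketch, that pipeline could be written as follows. It assumes that scrutiny and magrittr are installed; magrittr is attached explicitly only as a precaution, to make sure the %>% pipe is available. The name tally_summary is chosen for this example, and the output is not shown here.

```r
library(scrutiny)
library(magrittr)  # for the %>% pipe (attached explicitly as a precaution)

# Tally each value in place, then summarize the resulting _n columns.
tally_summary <- pigs4 %>%
  duplicate_tally() %>%
  audit()

tally_summary
```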