The first versions of the rhoR package calculated rho using a two-column matrix, referred to as a codeset. Even the methods that accepted a contingency table actually converted that table to a codeset and off-loaded the work to the functions accepting a codeset. This was all well and good until the codesets became large: a critical step in the calculation of rho is sampling, and then re-sampling, a large number of times (default: 800), and repeatedly sampling from a matrix of, say, 10,000 rows becomes a burden.
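To make that cost concrete, here is a minimal sketch in plain R (not rhoR's actual code) of what repeated sampling against a large codeset involves:

# A codeset is just a two-column 0/1 matrix: one column per rater.
codeset = matrix(rbinom(10000 * 2, 1, 0.3), ncol = 2,
                 dimnames = list(NULL, c("gold", "silver")))

# Each of the (default) 800 iterations draws a fresh handset of rows,
# so the full 10,000-row matrix is indexed over and over.
handsets = replicate(800, codeset[sample(nrow(codeset), 20), ],
                     simplify = FALSE)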

The alternative, introduced in version 1.3.0 of the rhoR package, is to convert sets to contingency tables and sample from those. The motivation is that no matter how large the set grows, the contingency table still consists of only four numbers:

rhoR::random_contingency_table(setLength = 100, baserate = 0.3, kappaMin = 0.4, kappaMax = 0.8)
##      [,1] [,2]
## [1,]   22    8
## [2,]    1   69
rhoR::random_contingency_table(setLength = 1000, baserate = 0.3, kappaMin = 0.4, kappaMax = 0.8)
##      [,1] [,2]
## [1,]  183  117
## [2,]   16  684
rhoR::random_contingency_table(setLength = 10000, baserate = 0.3, kappaMin = 0.4, kappaMax = 0.8)
##      [,1] [,2]
## [1,] 2411  589
## [2,]  281 6719

The important piece to note is that the codeset corresponding to each of these contingency tables is a setLength x 2 matrix, so its size grows linearly with setLength while the table always holds exactly four cells.
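A quick illustration of that difference in plain R (sizes are approximate):

codeset = matrix(rbinom(10000 * 2, 1, 0.3), ncol = 2)
ct = table(codeset[, 1], codeset[, 2])

object.size(codeset)  # roughly 80 KB for a 10,000 x 2 integer matrix
object.size(ct)       # a few hundred bytes, regardless of setLength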

So with regard to the size of the matrices, it's clear the contingency tables are much smaller, but the tricky piece comes into play when updating the rho operations themselves. A main component of rho is generating samples that contain a minimum number of positive matches from the first rater and kappa agreement within a desired range. Doing so with a codeset, we are able to sample directly from the corresponding rows of the codeset that meet certain criteria (e.g., both raters agree), as sketched below.
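For example, a minimal sketch in plain R of sampling only from rows on which both raters agree:

codeset = matrix(rbinom(200, 1, 0.3), ncol = 2)

# Rows where both raters assigned the same code...
agree.rows = which(codeset[, 1] == codeset[, 2])

# ...can be drawn from directly by index.
sampled = codeset[sample(agree.rows, 10), ]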

baserate = 0.3
kappaMin = 0.4
kappaMax = 0.8
precisionMin = 0
precisionMax = 1

set = rhoR::createSimulatedCodeSet(length = 10, baserate, kappaMin, kappaMax, precisionMin, precisionMax)
kappa.precision.combos = rhoR::generateKPs(numNeeded = 800, baserate, kappaMin, kappaMax, precisionMin, precisionMax)
##       gold silver
##  [1,]    1      1
##  [2,]    1      1
##  [3,]    1      1
##  [4,]    0      1
##  [5,]    0      1
##  [6,]    0      0
##  [7,]    0      0
##  [8,]    0      0
##  [9,]    0      0
## [10,]    0      0
## $precision
## [1] 0.8486849
## 
## $kappa
## [1] 0.6926069
curr.kappa = kappa.precision.combos[[1]]$kappa
curr.precision = kappa.precision.combos[[1]]$precision

curr.recall = getR(curr.kappa, baserate, curr.precision)
simulated.set = prset(precision = curr.precision, recall = curr.recall, length = 10000, baserate = baserate)

Given a simulated set, we now get to the point where we generate a sampled handset, matching the minimum desired baserate, and calculate the kappa agreement. NOTE: this step, along with the previous one, happens 800 times by default, and it is where the significant slowdowns appear when running on larger codesets.

handset = getHandSet(set = simulated.set, handSetLength = 20, handSetBaserate = baserate, returnSet = TRUE)
##       gold silver
##  [1,]    1      1
##  [2,]    1      1
##  [3,]    1      1
##  [4,]    1      1
##  [5,]    1      1
##  [6,]    1      0
##  [7,]    0      0
##  [8,]    0      0
##  [9,]    0      0
## [10,]    0      0
## [11,]    0      0
## [12,]    0      0
## [13,]    1      1
## [14,]    0      0
## [15,]    1      1
## [16,]    0      0
## [17,]    0      0
## [18,]    0      1
## [19,]    0      0
## [20,]    0      0
calcKappa(handset)
## [1] 0.7916667
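The same value can be recovered from the handset's 2x2 contingency table alone, which is what makes the contingency-table approach below viable. A minimal sketch of Cohen's kappa computed from the four cell counts (plain R, not rhoR's implementation):

ct = table(gold = handset[, 1], silver = handset[, 2])
n = sum(ct)

po = sum(diag(ct)) / n                     # observed agreement
pe = sum(rowSums(ct) * colSums(ct)) / n^2  # agreement expected by chance
(po - pe) / (1 - pe)
## [1] 0.7916667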

Where is it slow?

library(microbenchmark)

microbenchmark(
  kps.r = rhoR::generateKPs(numNeeded = 800, baserate, kappaMin, kappaMax, precisionMin, precisionMax),
  kps.c = rhoR::generateKPs_c(numNeeded = 800, baserate, kappaMin, kappaMax, precisionMin, precisionMax),
  times = 10, unit = "ms"
)
## Unit: milliseconds
##   expr      min       lq     mean   median       uq      max neval cld
##  kps.r 45.06487 46.67495 49.50887 48.73819 52.40435 54.52137    10   b
##  kps.c 17.36404 18.18673 18.36917 18.37064 18.71605 19.22127    10  a
For drawing a single kappa/precision combination, by contrast, the R and C implementations are statistically indistinguishable:

## Unit: milliseconds
##  expr      min       lq      mean    median       uq      max neval cld
##  kp.r 0.051149 0.058769 0.0888845 0.0801445 0.097031 0.173652    10   a
##  kp.c 0.069832 0.080798 0.1705617 0.0931820 0.185334 0.701289    10   a

Create Sets From a Precision and Kappa

# kps.df holds the generated combinations: precision in column 1, kappa in column 2
kp.r = rhoR::genPKcombo(
  kappaDistribution = as.numeric(kps.df[, 2]), kappaProbability = NULL,
  precisionDistribution = as.numeric(kps.df[, 1]), precisionProbability = NULL,
  baserate = baserate
)
kappa = kp.r$kappa
precision = kp.r$precision
recall = getR(kappa, baserate, precision)

microbenchmark(
  fullSet.r = prset(precision = precision, recall = recall, length = 10000, baserate = baserate),
  fullSet.c = contingency_table(precision = precision, recall = recall, length = 10000, baserate = baserate),
  times = 10, unit = "ms"
)
## Unit: milliseconds
##       expr      min       lq      mean   median       uq      max neval cld
##  fullSet.r 0.325200 0.336136 0.3509523 0.343360 0.362609 0.386388    10   b
##  fullSet.c 0.002617 0.003009 0.0090903 0.004205 0.007240 0.030534    10  a
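The gap is unsurprising: prset has to materialize a 10,000-row matrix, while the four cells of a contingency table follow from precision, recall, and baserate by simple arithmetic. A sketch of that derivation (a hypothetical helper, not necessarily how contingency_table is implemented):

pr_to_ct = function(precision, recall, length, baserate) {
  tp = length * baserate * recall  # gold 1, silver 1
  fn = length * baserate - tp      # gold 1, silver 0
  fp = tp / precision - tp         # gold 0, silver 1
  tn = length - tp - fn - fp       # gold 0, silver 0
  # Rows: gold = 1, 0; columns: silver = 1, 0 -- matching the tables above.
  matrix(round(c(tp, fn, fp, tn)), nrow = 2, byrow = TRUE)
}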

Calculate Kappa on Codeset vs. Contingency Table

codeset = prset(precision = precision, recall = recall, length = 10000, baserate = baserate)
ctset = contingency_table(precision = precision, recall = recall, length = 10000, baserate = baserate)

microbenchmark(
  kappa.cs = kappaSet(codeset),
  kappa.ct = kappa_ct(ctset),
  times = 100, unit = "ms"
)
## Unit: milliseconds
##      expr      min        lq       mean   median        uq      max neval cld
##  kappa.cs 1.131216 1.2241695 1.65177405 1.293007 1.3845420 8.657673   100   b
##  kappa.ct 0.001201 0.0015815 0.00539746 0.002398 0.0088465 0.017602   100  a
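The remaining operation is sampling a handset from a contingency table. Drawing rows without replacement from a set summarized by four cell counts is a multivariate hypergeometric draw, which base R can express as successive rhyper calls. A sketch (an illustration, not necessarily rhoR's implementation, and ignoring the handSetBaserate constraint for brevity):

sample_ct = function(ct, handSetLength) {
  cells = as.vector(t(ct))  # tp, fn, fp, tn in row-major order
  n = sum(cells)
  # Condition each draw on what the earlier draws left behind.
  tp = rhyper(1, cells[1], n - cells[1], handSetLength)
  fn = rhyper(1, cells[2], n - cells[1] - cells[2], handSetLength - tp)
  fp = rhyper(1, cells[3], cells[4], handSetLength - tp - fn)
  matrix(c(tp, fn, fp, handSetLength - tp - fn - fp), nrow = 2, byrow = TRUE)
}

Each call returns a 2x2 table of handset counts, so the kappa calculation can run on it directly without ever rebuilding a codeset.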

Benchmarks

# run.bench(0.8, 0.3, 20, 1000)
# run.bench(0.8, 0.3, 200, 10000)
# run.bench(0.8, 0.3, 20, 100000, times = 10)