The first versions of the rhoR package calculated rho using a two-column matrix, referred to as a codeset. Even the methods that accepted a contingency table actually converted that table to a codeset and off-loaded the work to the codeset functions. This worked well until the codesets became large. A critical step in the calculation of rho is sampling, and then re-sampling, a large number of times (800 by default). At that point, repeatedly sampling from a matrix of, say, 10,000 rows becomes a burden.
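To make that cost concrete, here is a minimal base-R sketch (not rhoR's internal code) of what the resampling loop amounts to: each draw indexes rows out of a large two-column matrix, over and over.

set.seed(42)
big.codeset = matrix(rbinom(10000 * 2, 1, 0.3), ncol = 2)  # a 10,000-row codeset
system.time({
  for (i in 1:800) {
    rows = sample(nrow(big.codeset), 20)    # draw a handset of 20 rows
    hs = big.codeset[rows, , drop = FALSE]  # each draw copies rows out of the matrix
  }
})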
The alternative, introduced in version 1.3.0 of the rhoR package, is to convert sets to contingency tables and sample from those. The motivation is that no matter how large the set grows, the contingency table still consists of only four numbers:
rhoR::random_contingency_table(setLength = 100, baserate = 0.3, kappaMin = 0.4, kappaMax = 0.8)
## [,1] [,2]
## [1,] 22 8
## [2,] 1 69
rhoR::random_contingency_table(setLength = 1000, baserate = 0.3, kappaMin = 0.4, kappaMax = 0.8)
## [,1] [,2]
## [1,] 183 117
## [2,] 16 684
rhoR::random_contingency_table(setLength = 10000, baserate = 0.3, kappaMin = 0.4, kappaMax = 0.8)
## [,1] [,2]
## [1,] 2411 589
## [2,] 281 6719
The important point is that for each of these contingency tables, the corresponding codeset is a matrix of setLength rows by 2 columns, i.e. setLength * 2 values, while the contingency table itself never grows beyond four cells.
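To see the relationship, here is a hypothetical helper (not part of rhoR; the cell layout, with the first rater on rows, is assumed from the tables above) that expands a 2x2 contingency table back into its codeset:

ct_to_codeset = function(ct) {
  rbind(
    matrix(rep(c(1, 1), ct[1, 1]), ncol = 2, byrow = TRUE),  # both raters coded 1
    matrix(rep(c(1, 0), ct[1, 2]), ncol = 2, byrow = TRUE),  # rater 1 coded 1, rater 2 coded 0
    matrix(rep(c(0, 1), ct[2, 1]), ncol = 2, byrow = TRUE),  # rater 1 coded 0, rater 2 coded 1
    matrix(rep(c(0, 0), ct[2, 2]), ncol = 2, byrow = TRUE)   # both raters coded 0
  )
}
ct = rhoR::random_contingency_table(setLength = 100, baserate = 0.3, kappaMin = 0.4, kappaMax = 0.8)
dim(ct_to_codeset(ct))
## [1] 100   2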
With regard to the size of the matrices, then, the contingency tables are clearly much smaller; the tricky part comes in updating the rho operations themselves. A main component of rho is generating samples that contain a minimum number of positive matches from the first rater and kappa agreement within a desired range. With a codeset, we can sample directly from the rows that meet a given criterion (e.g., both raters agree).
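For example (a base-R illustration, not rhoR's internal implementation):

example.codeset = matrix(c(1, 1, 1, 0, 0, 1, 1, 1, 0, 0), ncol = 2, byrow = TRUE)
agree.rows = which(example.codeset[, 1] == 1 & example.codeset[, 2] == 1)  # rows where both raters coded 1
example.codeset[sample(agree.rows, 1), ]  # draw one such row directly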
baserate = 0.3
kappaMin = 0.4
kappaMax = 0.8
precisionMin = 0
precisionMax = 1
set = rhoR::createSimulatedCodeSet(length = 10, baserate, kappaMin, kappaMax, precisionMin, precisionMax)
set
## gold silver
## [1,] 1 1
## [2,] 1 1
## [3,] 1 1
## [4,] 0 1
## [5,] 0 1
## [6,] 0 0
## [7,] 0 0
## [8,] 0 0
## [9,] 0 0
## [10,] 0 0
kappa.precision.combos = rhoR::generateKPs(numNeeded = 800, baserate, kappaMin, kappaMax, precisionMin, precisionMax)
kappa.precision.combos[[1]]
## $precision
## [1] 0.8486849
##
## $kappa
## [1] 0.6926069
curr.kappa = kappa.precision.combos[[1]]$kappa
curr.precision = kappa.precision.combos[[1]]$precision
curr.recall = getR(curr.kappa, baserate, curr.precision)  # recall implied by the kappa, baserate, and precision
simulated.set = prset(precision = curr.precision, recall = curr.recall, length = 10000, baserate = baserate)
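The contingency-table analogue of prset only has to produce four cell counts, and those follow directly from the definitions of precision, recall, and baserate. Here is a sketch of that arithmetic (an illustration of the definitions, not rhoR's contingency_table source; first rater on rows, as above):

pr_to_ct = function(precision, recall, length, baserate) {
  TP = round(length * baserate * recall)        # recall = TP / (TP + FN)
  FN = round(length * baserate) - TP            # baserate = (TP + FN) / length
  FP = round(TP * (1 - precision) / precision)  # precision = TP / (TP + FP)
  TN = length - TP - FN - FP
  matrix(c(TP, FN, FP, TN), nrow = 2, byrow = TRUE)
}
pr_to_ct(curr.precision, curr.recall, 10000, baserate)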
Given a simulated set, we can now generate a sampled handset matching the minimum desired baserate and calculate its kappa agreement. NOTE: this step, along with the previous one, is repeated 800 times by default, which is where the significant slowdowns appear when running on larger codesets.
handset = getHandSet(set = simulated.set, handSetLength = 20, handSetBaserate = baserate, returnSet = TRUE)
handset
## gold silver
## [1,] 1 1
## [2,] 1 1
## [3,] 1 1
## [4,] 1 1
## [5,] 1 1
## [6,] 1 0
## [7,] 0 0
## [8,] 0 0
## [9,] 0 0
## [10,] 0 0
## [11,] 0 0
## [12,] 0 0
## [13,] 1 1
## [14,] 0 0
## [15,] 1 1
## [16,] 0 0
## [17,] 0 0
## [18,] 0 1
## [19,] 0 0
## [20,] 0 0
calcKappa(handset)
## [1] 0.7916667
With both representations in place, we can benchmark the original implementations against their new counterparts, starting with the generation of kappa/precision combinations:
library(microbenchmark)
microbenchmark(
kps.r = rhoR::generateKPs(numNeeded = 800, baserate, kappaMin, kappaMax, precisionMin, precisionMax),
kps.c = rhoR::generateKPs_c(numNeeded = 800, baserate, kappaMin, kappaMax, precisionMin, precisionMax),
times = 10, unit = "ms"
)
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## kps.r 45.06487 46.67495 49.50887 48.73819 52.40435 54.52137 10 b
## kps.c 17.36404 18.18673 18.36917 18.37064 18.71605 19.22127 10 a
Drawing a single kappa/precision combination from those distributions is quick in either implementation:
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## kp.r 0.051149 0.058769 0.0888845 0.0801445 0.097031 0.173652 10 a
## kp.c 0.069832 0.080798 0.1705617 0.0931820 0.185334 0.701289 10 a
kp.r = rhoR::genPKcombo(kappaDistribution = as.numeric(kps.df[, 2]), kappaProbability = NULL, precisionDistribution = as.numeric(kps.df[, 1]), precisionProbability = NULL, baserate = baserate)
kappa = kp.r$kappa
precision = kp.r$precision
recall = getR(kappa, baserate, precision)
With a combination selected, we can compare building a full simulated codeset (prset) with building the equivalent contingency table:
microbenchmark(
fullSet.r = prset(precision = precision, recall = recall, length = 10000, baserate = baserate),
fullSet.c = contingency_table(precision = precision, recall = recall, length = 10000, baserate = baserate),
times = 10, unit = "ms"
)
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## fullSet.r 0.325200 0.336136 0.3509523 0.343360 0.362609 0.386388 10 b
## fullSet.c 0.002617 0.003009 0.0090903 0.004205 0.007240 0.030534 10 a
Finally, we compare calculating kappa on a full codeset with calculating it directly from a contingency table:
codeset = prset(precision = precision, recall = recall, length = 10000, baserate = baserate)
ctset = contingency_table(precision = precision, recall = recall, length = 10000, baserate = baserate)
microbenchmark(
kappa.cs = kappaSet(codeset),
kappa.ct = kappa_ct(ctset),
times = 100, unit = "ms"
)
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## kappa.cs 1.131216 1.2241695 1.65177405 1.293007 1.3845420 8.657673 100 b
## kappa.ct 0.001201 0.0015815 0.00539746 0.002398 0.0088465 0.017602 100 a
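The gap is unsurprising: kappa from a 2x2 contingency table needs only the four cell counts. A minimal sketch of that calculation (the standard Cohen's kappa formula, not rhoR's kappa_ct source, and assuming ctset behaves as a plain 2x2 matrix with the first rater on rows):

kappa_from_ct = function(ct) {
  n = sum(ct)
  po = (ct[1, 1] + ct[2, 2]) / n            # observed agreement
  pe = (sum(ct[1, ]) * sum(ct[, 1]) +
        sum(ct[2, ]) * sum(ct[, 2])) / n^2  # agreement expected by chance
  (po - pe) / (1 - pe)
}
kappa_from_ct(ctset)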
# run.bench(0.8, 0.3, 20, 1000)
# run.bench(0.8, 0.3, 200, 10000)
# run.bench(0.8, 0.3, 20, 100000, times = 10)
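The run.bench helper invoked above is not shown in this excerpt. A plausible sketch, built only from the functions used earlier (the argument order — precision, baserate, handset length, set length — and the kappa value are assumptions):

run.bench = function(precision, baserate, handSetLength, setLength, times = 100) {
  recall = getR(0.65, baserate, precision)  # arbitrary kappa, for illustration only
  cs = prset(precision = precision, recall = recall, length = setLength, baserate = baserate)
  ct = contingency_table(precision = precision, recall = recall, length = setLength, baserate = baserate)
  microbenchmark(
    handset.kappa = calcKappa(getHandSet(set = cs, handSetLength = handSetLength,
                                         handSetBaserate = baserate, returnSet = TRUE)),
    ct.kappa = kappa_ct(ct),
    times = times
  )
}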