
ONEST

Gang Han, Baihong Guo

2021-07-26

1 General Information

The Observers Needed to Evaluate Subjective Tests (ONEST) software implements the statistical method of Reisenbichler et al. (2020¹) for determining the minimum number of evaluators needed to estimate agreement among a large number of raters. Regulatory agencies, such as the FDA, could use this method when evaluating the agreement level of a newly proposed subjective laboratory test. Input to the program should be binary (1/0) pathology data, where “0” may stand for negative and “1” for positive. The example datasets in this software are from Rimm et al. (2017²) (the SP142 assay) and Reisenbichler et al. (2020). This program runs in R version 3.5.0 and above.

2 Model and Inference

We briefly introduce the statistical model and inference implemented by this program. Let $p^*$ denote the proportion of concordant (i.e., identical) reads among a group of raters, where the group size can be two or more. Let $p^+$ denote the proportion of tissue cases that will always be evaluated positive by all the raters, and $p^-$ the proportion that will always be evaluated negative. Among the remaining proportion $1-p^+-p^-$ of cases that could be rated either positive or negative, each case has probability $p$ of being rated positive by any pathologist. Then the proportion of consistent reads among $k$ pathologists can be written as $p^{*(k)} = p^+ + p^- + (1-p^+-p^-)[p^k+(1-p)^k]$.
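As a rough illustration, the agreement formula above can be evaluated directly in R. The helper name agreement_prop and the parameter values are assumptions made for this sketch; they are not part of the ONEST package and are not estimated from any dataset.

```r
# Expected proportion of cases on which k raters agree:
# p*(k) = p_plus + p_minus + (1 - p_plus - p_minus) * (p^k + (1 - p)^k).
# `agreement_prop` is a hypothetical helper, not an ONEST function.
agreement_prop <- function(k, p, p_plus, p_minus) {
  stopifnot(k >= 2, p >= 0, p <= 1, p_plus + p_minus <= 1)
  p_plus + p_minus + (1 - p_plus - p_minus) * (p^k + (1 - p)^k)
}

# Illustrative (assumed) parameter values for 2 to 5 raters
agreement <- agreement_prop(k = 2:5, p = 0.5, p_plus = 0.3, p_minus = 0.1)
print(agreement)
```

The values decrease with k and level off at p_plus + p_minus, which is the behaviour the model is built to capture.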

Let $I$ denote the minimal sufficient number of pathologists, in the sense that $I$ is the minimum integer $i$ satisfying $p^{*(i)} - p^{*(i+1)} < p^{\delta}$ with a large probability (e.g., 95%), where $p^{\delta}$ is a threshold on the change in the percentage agreement due to including one additional pathologist. Let $p_c = p^+ + p^-$.

The statistical inference is based on the joint likelihood function of the parameters $p^+$, $p^-$, and $p$. For $n$ cases and $k$ pathologists, we have the data $\{y_{ij};\ i=1,\dots,n,\ j=1,\dots,k\}$. Each observation $y_{ij}$ is binary, where $y_{ij}=1$ if the read is positive and $y_{ij}=0$ if the read is negative. The probabilities of $y_{ij}=1$ and $y_{ij}=0$ can be written as $P(y_{ij}=1) = p^+ + p(1-p^+-p^-)$ and $P(y_{ij}=0) = p^- + (1-p)(1-p^+-p^-)$, respectively. We assume all $\{y_{ij}\}$ are independently and identically distributed. The likelihood function can be written as $L(p, p^+, p^- \mid \{y_{ij}\}) = [p^+ + p(1-p^+-p^-)]^{T}\,[p^- + (1-p)(1-p^+-p^-)]^{nk-T}$, where $T$ is the total number of reads equal to 1 among all $nk$ reads. With $k$ pathologists, we let $n_c$ denote the number of consistent reads among $n$ cases, so $n_c \sim \mathrm{Bin}(n, p_c)$. Similarly, we have $n^+ \sim \mathrm{Bin}(n, p^+)$ and $n^- \sim \mathrm{Bin}(n, p^-)$, where $n^+$ and $n^-$ denote the numbers of cases that all pathologists read positive and negative, respectively.
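The likelihood above can be written as a short R function on the log scale. This is only a sketch under the stated model: the name onest_loglik and the toy matrix are assumptions for illustration, and the code assumes a complete matrix with no missing reads.

```r
# Log-likelihood of (p, p_plus, p_minus) for an n-by-k binary matrix y:
# log L = T * log(q) + (nk - T) * log(1 - q),
# where q = P(y_ij = 1) = p_plus + p * (1 - p_plus - p_minus).
# `onest_loglik` is a hypothetical name, not an ONEST function.
onest_loglik <- function(y, p, p_plus, p_minus) {
  q <- p_plus + p * (1 - p_plus - p_minus)
  total_pos   <- sum(y == 1)   # T, the number of positive reads
  total_reads <- length(y)     # nk (assumes no missing values)
  total_pos * log(q) + (total_reads - total_pos) * log(1 - q)
}

# Toy data: 4 cases (rows) read by 3 raters (columns)
y <- matrix(c(1, 1, 0,
              1, 0, 0,
              1, 1, 1,
              0, 0, 0), nrow = 4, byrow = TRUE)
ll <- onest_loglik(y, p = 0.5, p_plus = 0.25, p_minus = 0.25)
```

Note that 1 - q equals p_minus + (1 - p)(1 - p_plus - p_minus), so the second term matches P(y_ij = 0) as given in the text.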

Based on the binomial maximum likelihood estimation, the estimates are $p^+ = n^+/n$, $p^- = n^-/n$, $p^+ + p(1-p^+-p^-) = T/(nk)$, and $p = [T/(nk) - p^+]/(1-p^+-p^-)$. We then estimate $p^*$ by plugging the estimates of $\{p_c, p\}$ into the equation $p^{*(k)} = p_c + (1-p^+-p^-)[p^k+(1-p)^k]$. We define the objective function as $D^{(i)} = p^{*(i)} - p^{*(i+1)} = (1-p^+-p^-)[p^i(1-p) + p(1-p)^i]$. The estimate of $p$ depends on the product of $n$ and $k$, and the estimate of $p_c$ is $n_c/n$. We use 95% as the probability threshold. Based on the central limit theorem, the asymptotic 95% lower bound of $p_c$ is $n_c/n - 1.645\,[n_c(n-n_c)/n^3]^{1/2}$. By plugging in this lower bound of $p_c$, we can compute the upper bound of $D^{(i)}$ at the 95% confidence level. If the upper bound of $D^{(i)}$ is less than $p^{\delta}$, we conclude that $i$ is the sufficient number of pathologists.
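The estimation steps above can be sketched in base R as follows. The helper names onest_plugin and diff_upper, the toy data, and the threshold of 0.01 are all assumptions made for illustration; the sketch assumes a complete binary matrix with at least one discordant case, and it is not the package's own code.

```r
# Sketch only: helper names are hypothetical, not ONEST package functions.
# Plug-in estimates of p_plus, p_minus and p from an n-by-k binary matrix.
onest_plugin <- function(y) {
  n <- nrow(y)
  k <- ncol(y)
  row_pos <- rowSums(y)            # positive reads per case
  p_plus  <- mean(row_pos == k)    # n_plus / n
  p_minus <- mean(row_pos == 0)    # n_minus / n
  p <- (sum(y) / (n * k) - p_plus) / (1 - p_plus - p_minus)
  list(n = n, k = k, p = p, p_plus = p_plus, p_minus = p_minus)
}

# 95% upper bound of D(i), obtained from the asymptotic lower bound of p_c
diff_upper <- function(i, est) {
  n_c <- est$n * (est$p_plus + est$p_minus)
  p_c_low <- n_c / est$n - 1.645 * sqrt(n_c * (est$n - n_c) / est$n^3)
  (1 - p_c_low) * (est$p^i * (1 - est$p) + est$p * (1 - est$p)^i)
}

# Toy data: 4 cases (rows) read by 3 raters (columns)
y <- matrix(c(1, 1, 0,
              1, 0, 0,
              1, 1, 1,
              0, 0, 0), nrow = 4, byrow = TRUE)
est <- onest_plugin(y)

# Smallest i whose upper bound falls below an assumed threshold of 0.01
i_grid <- 2:20
threshold <- 0.01
sufficient_i <- i_grid[which(diff_upper(i_grid, est) < threshold)[1]]
```

With only four cases the lower bound of p_c is very wide, so the toy result mainly shows the mechanics; a real analysis would rely on ONEST_main.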

3 Inputs and Outputs

3.1 Inputs

This software has one driver function, ONEST_main. The input to ONEST_main is:

‘data’ = dataset, a matrix containing the binary pathology data. Each row is the data from one case, and each column is the data from one rater. Missing values are allowed and can be denoted as NA or left blank. If there are n cases and k raters, the input ‘data’ is a matrix with dimension n by k.
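For illustration, an input of the shape described above could be assembled as follows. The matrix values and the rater column names are made up for this sketch; only the n by k layout, the 0/1 coding, and the use of NA for missing reads come from the description above.

```r
# Hypothetical input: 4 cases (rows) by 3 raters (columns); NA is a missing read.
reads <- matrix(c(1,  1, 0,
                  1, NA, 0,
                  1,  1, 1,
                  0,  0, 0),
                nrow = 4, byrow = TRUE,
                dimnames = list(NULL, c("rater1", "rater2", "rater3")))

# Basic sanity checks before analysis: matrix type and 0/1/NA entries only
stopifnot(is.matrix(reads), all(reads %in% c(0, 1, NA)))
dim(reads)

# The matrix would then be passed to the driver function:
# ONEST_main(reads)
```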

3.2 Outputs

Meanings of the output values are listed below.

consist_p: a vector of length k-1, giving the proportion of identical reads among a set of pathologists. For example, the first element of “consist_p” is the estimate of agreement percentage for 2 raters. The (k-1)-th element is the estimate of agreement percentage for k raters.
consist_low: a vector of length k-1, giving the lower bound of the agreement percentage at the 95% confidence level corresponding to “consist_p”.
diff_consist: a vector of length k-2, giving the differences between consecutive elements of “consist_p”. For example, the first element of “diff_consist” is the estimated difference of agreement percentage after increasing from 2 to 3 raters. The (k-2)-th element is the difference of agreement percentage after increasing from k-1 to k raters.
diff_high: a vector of length k-2, giving the upper bound of the change of agreement percentage corresponding to “diff_consist” at the 95% confidence level.
size_case: number of cases n.
size_rater: number of raters k.
p: the probability of being rated positive among the proportion of ‘1-p_plus-p_minus’ cases.
p_plus: proportion of the cases rated positive by all raters.
p_minus: proportion of the cases rated negative by all raters.
empirical: a matrix of dimension k-1 by 3, including the empirical estimate of the agreement percentage, and the empirical 95% confidence intervals (CI) of the agreement percentage with equal tail probabilities on the two sides. The empirical estimate and CI were calculated by permuting the raters with 1000 random permutations, and using the mean, 2.5th percentile, and 97.5th percentile.
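The permutation-based summary described for “empirical” can be sketched in base R. This is an assumption-laden illustration rather than the package's implementation: the function name empirical_agreement is hypothetical, the data are a toy matrix, and missing values are not handled.

```r
# Sketch of a permutation-based agreement summary (hypothetical helper).
# For each random ordering of the raters, compute the proportion of cases
# with identical reads among the first j raters, for j = 2, ..., k.
empirical_agreement <- function(y, n_perm = 1000) {
  k <- ncol(y)
  agree <- replicate(n_perm, {
    yp <- y[, sample(k), drop = FALSE]   # shuffle the rater order
    sapply(2:k, function(j) {
      s <- rowSums(yp[, 1:j, drop = FALSE])
      mean(s == 0 | s == j)              # all negative or all positive
    })
  })
  agree <- matrix(agree, nrow = k - 1)   # rows: j = 2..k, cols: permutations
  cbind(lower_bound = apply(agree, 1, quantile, probs = 0.025),
        mean        = rowMeans(agree),
        upper_bound = apply(agree, 1, quantile, probs = 0.975))
}

set.seed(1)
y <- matrix(c(1, 1, 0,
              1, 0, 0,
              1, 1, 1,
              0, 0, 0), nrow = 4, byrow = TRUE)
emp <- empirical_agreement(y, n_perm = 200)
```

The last row uses all k raters, so it does not depend on the ordering and its three columns coincide, as in the final row of the $empirical output in Section 4.2.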

All the outputs are returned in the following structure.

consistency: This output includes “consist_p” and “consist_low”, where the data are used to plot figure(5).
difference: This output includes “diff_consist” and “diff_high”, where the data are used to plot figure(6) that can be used to determine the minimum number of evaluators needed to estimate agreement.
estimates: This output includes the ONEST estimates “size_case”, “size_rater”, “p”, “p_plus”, and “p_minus”.
empirical: This output has the empirical estimation data for plotting figure(3). The first and third columns are the 2.5% and 97.5% lower and upper bounds of the empirical CI, respectively. The second column is the estimated agreement percentage using the empirical mean.
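As a hedged example of how the “difference” component might be used, the sketch below reads “diff_high” and finds the first entry under a chosen threshold. The list res is a mock object that mimics only the $difference element; its values are the first six diff_high entries printed for sp142_bin in Section 4.2, and the threshold of 0.01 is an assumption.

```r
# Mock result holding only the $difference component (diff_high column).
# Values are copied from the sp142_bin output shown in Section 4.2.
res <- list(difference = cbind(
  diff_high = c(1.786456e-01, 8.932368e-02, 4.466273e-02,
                2.233203e-02, 1.116646e-02, 5.583506e-03)))

threshold <- 0.01  # assumed threshold on the change in agreement

# Row j describes the change from j + 1 to j + 2 raters,
# so the first row below the threshold corresponds to i = j + 1 raters.
first_below <- which(res$difference[, "diff_high"] < threshold)[1]
n_raters <- first_below + 1
```

A different threshold would give a different answer, so the choice of 0.01 should be justified for the application at hand.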

4 Example with dataset sp142_bin

The dataset “sp142_bin” is a pathology dataset of triple negative breast cancer from Reisenbichler et al. (2020), stored as a 68 by 18 matrix (68 cases, 18 raters). An element in position (i, j) with a value of 0 means that the j-th rater evaluated the i-th case as negative, and a value of 1 means a positive evaluation.

Details about other datasets in the package can be found in the reference manual.

4.1 Load data

library(ONEST)
data("sp142_bin")

4.2 Plot the data and get the outputs

The following code produces the same output as ONEST_main(sp142_bin), but it applies only to the example dataset sp142_bin and exists to reduce the time needed to build the vignettes. In practice, please use the ONEST_main function instead.

# figure(1): Plot of the agreement percentage in the order of columns in the inputs;
# figure(2): Plot of the 100 randomly chosen permutations;
# figure(3): Plot of the empirical confidence interval;
# figure(4): Barchart: the x axis is the case number and the Y axis is the number of pathologists that called that case positive, sorted from lowest to highest on the y axis;
# figure(5): Plot of the proportion of identical reads among a set of pathologists;
# figure(6): Plot of the difference between the proportion of identical reads among a set of pathologists;

# ONEST_main(sp142_bin)
data('empirical')
ONEST_vignettes(sp142_bin,empirical)

#> $consistency
#>       consist_p consist_low
#>  [1,] 0.6911795   0.6427088
#>  [2,] 0.5367693   0.4640632
#>  [3,] 0.4595634   0.3747395
#>  [4,] 0.4209597   0.3300768
#>  [5,] 0.4016573   0.3077448
#>  [6,] 0.3920057   0.2965783
#>  [7,] 0.3871797   0.2909948
#>  [8,] 0.3847665   0.2882029
#>  [9,] 0.3835598   0.2868068
#> [10,] 0.3829564   0.2861087
#> [11,] 0.3826547   0.2857597
#> [12,] 0.3825039   0.2855851
#> [13,] 0.3824284   0.2854978
#> [14,] 0.3823907   0.2854542
#> [15,] 0.3823718   0.2854324
#> [16,] 0.3823624   0.2854214
#> [17,] 0.3823577   0.2854160
#> 
#> $difference
#>        diff_consist    diff_high
#>  [1,] -1.544102e-01 1.786456e-01
#>  [2,] -7.720588e-02 8.932368e-02
#>  [3,] -3.860371e-02 4.466273e-02
#>  [4,] -1.930243e-02 2.233203e-02
#>  [5,] -9.651598e-03 1.116646e-02
#>  [6,] -4.826038e-03 5.583506e-03
#>  [7,] -2.413163e-03 2.791919e-03
#>  [8,] -1.206665e-03 1.396057e-03
#>  [9,] -6.033806e-04 6.980838e-04
#> [10,] -3.017172e-04 3.490731e-04
#> [11,] -1.508736e-04 1.745539e-04
#> [12,] -7.544503e-05 8.728646e-05
#> [13,] -3.772701e-05 4.364843e-05
#> [14,] -1.886594e-05 2.182703e-05
#> [15,] -9.434279e-06 1.091503e-05
#> [16,] -4.717841e-06 5.458327e-06
#> 
#> $estimates
#>      size_case size_rater         p    p_plus   p_minus
#> [1,]        68         18 0.4984245 0.2794118 0.1029412
#> 
#> $empirical
#>       lower_bound      mean upper_bound
#>  [1,]   0.6029412 0.7898235   0.9264706
#>  [2,]   0.5294118 0.6951176   0.8529412
#>  [3,]   0.4558824 0.6306912   0.7941176
#>  [4,]   0.4264706 0.5833529   0.7352941
#>  [5,]   0.3970588 0.5447941   0.6911765
#>  [6,]   0.3823529 0.5124412   0.6617647
#>  [7,]   0.3676471 0.4878088   0.6176471
#>  [8,]   0.3676471 0.4642941   0.5882353
#>  [9,]   0.3529412 0.4468235   0.5735294
#> [10,]   0.3529412 0.4298824   0.5441176
#> [11,]   0.3529412 0.4145588   0.5147059
#> [12,]   0.3529412 0.4013088   0.5000000
#> [13,]   0.3529412 0.3902059   0.4852941
#> [14,]   0.3529412 0.3786765   0.4705882
#> [15,]   0.3529412 0.3684853   0.4558824
#> [16,]   0.3529412 0.3608382   0.4411765
#> [17,]   0.3529412 0.3529412   0.3529412

4.3 The ONEST score test

A small p-value from this score test indicates significant evidence that the observers’ agreement will converge to a non-zero proportion.

data("sp142_bin")
ONEST_inflation_test(sp142_bin)
#> p_value 
#>       0

4.4 Code to run other examples

# (1) With example dataset sp263_bin:
# data("sp263_bin")
# ONEST_main(sp263_bin)
# ONEST_inflation_test(sp263_bin)

# (2) With example dataset NCCN_sp142:
# data("NCCN_sp142")
# ONEST_main(NCCN_sp142)
# ONEST_inflation_test(NCCN_sp142)

# (3) With example dataset NCCN_sp142_t:
# data("NCCN_sp142_t")
# ONEST_main(NCCN_sp142_t)
# ONEST_inflation_test(NCCN_sp142_t)

# (4) With example dataset NCCN_22c3_t:
# data("NCCN_22c3_t")
# ONEST_main(NCCN_22c3_t)
# ONEST_inflation_test(NCCN_22c3_t)

¹ Reisenbichler, E. S., Han, G., Bellizzi, A., Bossuyt, V., Brock, J., Cole, K., Fadare, O., Hameed, O., Hanley, K., Harrison, B. T., Kuba, M. G., Ly, A., Miller, D., Podoll, M., Roden, A. C., Singh, K., Sanders, M. A., Wei, S., Wen, H., Pelekanou, V., Yaghoobi, V., Ahmed, F., Pusztai, L., and Rimm, D. L. (2020) “Prospective multi-institutional evaluation of pathologist assessment of PD-L1 assays for patient selection in triple negative breast cancer,” Mod Pathol, DOI: 10.1038/s41379-020-0544-x, PMID: 32300181.
² Rimm, D. L., Han, G., Taube, J. M., Yi, E. S., Bridge, J. A., Flieder, D. B., Homer, R., West, W. W., Wu, H., Roden, A. C., Fujimoto, J., Yu, H., Anders, R., Kowalewski, A., Rivard, C., Rehman, J., Batenchuk, C., Burns, V., Hirsch, F. R., and Wistuba, I. I. (2017) “A Prospective, Multi-institutional, Pathologist-Based Assessment of 4 Immunohistochemistry Assays for PD-L1 Expression in Non-Small Cell Lung Cancer,” JAMA Oncol, 3(8), 1051-1058, DOI: 10.1001/jamaoncol.2017.0013, PMID: 28278348.
