Classification performance metrics and indices

Luciana Nieto & Adrian Correndo

2026-03-18

Description

The metrica package compiles more than 80 functions to assess regression (continuous) and classification (categorical) prediction performance from multiple perspectives.

For classification (binomial and multinomial) tasks, it includes a function to visualize the confusion matrix using ggplot2, and 27 functions that compute prediction scores, including: accuracy, error rate, precision, recall, specificity, balanced accuracy (balacc), F-score (fscore), adjusted F-score (agf), G-mean (gmean), Bookmaker Informedness (bmi, a.k.a. Youden’s J-index), Markedness (deltap), Matthews Correlation Coefficient (mcc), Fowlkes-Mallows Index (fmi), Cohen’s Kappa (khat), negative predictive value (npv), positive and negative likelihood ratios (posLr, negLr), diagnostic odds ratio (dor), prevalence (preval), prevalence threshold (preval_t), critical success index (csi, a.k.a. threat score), false positive rate (FPR), false negative rate (FNR), false detection rate (FDR), false omission rate (FOR), area under the ROC curve (AUC_roc), and the P4-metric (p4).

For supervised models, always keep the concept of “cross-validation” in mind: predicted values should ideally come from out-of-bag samples (data not seen during training) to avoid overestimating prediction performance.
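
As an illustration of this point, the following is a minimal base R sketch that collects out-of-fold predictions with 5-fold cross-validation. It uses k-fold splitting rather than bagging, and the simulated data, the logistic model, and the 0.5 cut-off are assumptions made only for the example; none of it comes from the metrica package itself.

```r
# Sketch: out-of-fold predictions with 5-fold cross-validation (base R only).
set.seed(42)
n <- 200
x <- rnorm(n)
# Simulated binary outcome whose probability of being 1 increases with x
y <- rbinom(n, size = 1, prob = plogis(1.5 * x))
df <- data.frame(x = x, obs = y)

k <- 5
fold <- sample(rep(seq_len(k), length.out = n))  # random fold assignment
df$pred <- NA_integer_

for (i in seq_len(k)) {
  train_set <- df[fold != i, ]
  test_set  <- df[fold == i, ]
  # Fit on k - 1 folds, predict on the held-out fold only
  fit <- glm(obs ~ x, data = train_set, family = binomial())
  p <- predict(fit, newdata = test_set, type = "response")
  df$pred[fold == i] <- as.integer(p > 0.5)
}

# Each prediction comes from a model that never saw that row
oof_accuracy <- mean(df$pred == df$obs)
```

The resulting `obs` and `pred` columns are the kind of paired vectors that a classification metric expects.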

Using the functions

There are two basic arguments common to all metrica functions: (i) obs (Oi; observed, a.k.a. actual, measured, truth, target, label), and (ii) pred (Pi; predicted, a.k.a. simulated, fitted, modeled, estimate) values.

Optional arguments include data, which allows calling an existing data frame that contains both the observed and predicted vectors, and tidy, which controls whether the output is returned as a list (tidy = FALSE) or as a data.frame (tidy = TRUE).
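
To make the obs/pred layout concrete, here is a small base R sketch with a hypothetical data frame. It builds the confusion matrix with `table()` and derives accuracy by hand. The commented call at the end only mirrors the argument names described above and is not run here.

```r
# Hypothetical binary example: 1 = positive, 0 = negative
df <- data.frame(
  obs  = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0),
  pred = c(1, 1, 1, 0, 0, 0, 0, 0, 1, 1)
)

# Confusion matrix: rows = predicted class, columns = observed class
cm <- table(pred = df$pred, obs = df$obs)

TP <- cm["1", "1"]  # predicted 1, observed 1
TN <- cm["0", "0"]  # predicted 0, observed 0
FP <- cm["1", "0"]  # predicted 1, observed 0
FN <- cm["0", "1"]  # predicted 0, observed 1

accuracy_manual <- (TP + TN) / sum(cm)  # 7 / 10 = 0.7

# With metrica loaded, the matching call would use the arguments described
# in the text (not run here):
# accuracy(data = df, obs = obs, pred = pred, tidy = TRUE)
```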

For binary classification (two classes), users also need to check the pos_level argument, which indicates the position of the “positive level” in alphanumeric order. The most common binary denominations are c(0,1), c(“Negative”, “Positive”), and c(“FALSE”, “TRUE”), so the default is pos_level = 2 (1, “Positive”, “TRUE”). Other cases are also possible, such as c(“Crop”, “NoCrop”), for which the user needs to specify pos_level = 1.
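
A quick way to check which position the positive class occupies is to sort the class labels, as in this base R sketch. The label vectors are hypothetical, and the assumption that sorting the unique labels reproduces the ordering used internally by the package is ours.

```r
# Hypothetical labels where the positive class is "Crop"
labels <- c("NoCrop", "Crop", "Crop", "NoCrop")
lev <- sort(unique(labels))             # alphanumeric order: "Crop", "NoCrop"
pos_level_crop <- which(lev == "Crop")  # 1, so pos_level = 1 is needed

# With 0/1 coding the positive class (1) sorts second
lev01 <- sort(unique(c(1, 0, 1)))       # 0, 1
pos_level_one <- which(lev01 == 1)      # 2, which matches the default
```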

For multiclass classification tasks, some functions offer the atom argument (logical, TRUE / FALSE), which controls whether the output is an overall average estimate across all classes or a class-wise estimate. For example, a user might be interested in obtaining estimates of precision and recall for each possible class of the prediction.
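
The difference between class-wise and averaged estimates can be sketched in base R with a hypothetical three-class example. The unweighted (macro) mean used here is an assumption for illustration and may not be the exact averaging scheme the package applies.

```r
# Hypothetical three-class example
obs  <- c("a", "a", "a", "b", "b", "b", "b", "c", "c", "c")
pred <- c("a", "a", "b", "b", "b", "b", "c", "c", "c", "a")

lev <- sort(unique(c(obs, pred)))
# Rows = predicted class, columns = observed class
cm <- table(pred = factor(pred, levels = lev),
            obs  = factor(obs,  levels = lev))

tp <- diag(cm)                                       # correct cases per class
precision_by_class <- as.numeric(tp / rowSums(cm))   # TP / predicted as class
recall_by_class    <- as.numeric(tp / colSums(cm))   # TP / observed in class

# Overall estimates as unweighted (macro) averages across classes
precision_macro <- mean(precision_by_class)
recall_macro    <- mean(recall_by_class)
```

Class-wise values show which classes drive the overall estimate, which a single average hides.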

List of classification metrics* (categorical variables)

Note: All classification functions automatically recognize the number of classes and adjust estimations for binary or multiclass cases. However, for binary classification tasks, the user needs to check the alphanumeric order of the level considered as positive. By default, “pos_level = 2”, based on the most common denominations being c(0,1), c(“Negative”,“Positive”), c(“FALSE”, “TRUE”).

#	Metric	Definition	Details	Formula
1	`accuracy`	Accuracy	It is the most commonly used metric to evaluate classification quality. It represents the number of correctly classified cases with respect to all cases. However, be aware that this metric does not cover all aspects of classification quality. When classes are uneven in number, it may not be a reliable metric.	\(accuracy = \frac{TP+TN}{TP+FP+TN+FN}\)
2	`error_rate`	Error Rate	It represents the complement of accuracy. It varies between 0 and 1, with 0 being the best and 1 the worst.	\(error~rate = \frac{FP+FN}{TP+FP+TN+FN}\)
3	`precision`, `ppv`	Precision	Also known as positive predictive value (ppv), it represents the proportion of correctly classified cases with respect to the total number of cases predicted as a given class (multinomial) or as the positive class (binomial)	\(precision = \frac{TP}{TP + FP}\)
4	`recall`, `sensitivity`, `TPR`, `hitrate`	Recall	Also known as sensitivity, hit rate, or true positive rate (TPR) for binary cases. It represents the proportion of well predicted cases with respect to the total number of observed cases for a given class (multinomial) or the positive class (binomial)	\(recall = \frac{TP}{P} = 1 - FNR\)
5	`specificity`, `selectivity`, `TNR`	Specificity	Also known as selectivity or true negative rate (TNR). It represents the proportion of well classified negative values with respect to the total number of actual negatives	\(specificity = \frac{TN}{N} = 1 - FPR\)
6	`balacc`	Balanced Accuracy	This metric is especially useful when the number of observations across classes is imbalanced	\(b.accuracy = \frac{recall + specificity}{2}\)
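7	`fscore`	F-score	Also known as the F-measure, or as the F1-score when B = 1. It represents the weighted harmonic mean of precision and recall	\(fscore = \frac{(1 + B ^ 2) * precision * recall}{(B ^ 2 * precision) + recall}\)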
8	`agf`	Adjusted F-score	The agf adjusts the fscore for datasets with imbalanced classes	\(agf = \sqrt{F_2 * invF_{0.5}}\), where \(F_2 = 5 * \frac{precision * recall}{(4 * precision) + recall}\), and \(invF_{0.5} = (\frac{5}{4}) * \frac{npv * specificity}{(0.5^2 * npv) + specificity}\)
9	`gmean`	G-mean	The Geometric Mean (gmean) is a measure that considers a balance between the performance of both majority and minority classes. The higher the value the lower the risk of over-fitting of negative and under-fitting of positive classes	\(gmean = \sqrt{recall~*~specificity}\)
10	`khat`	K-hat or Cohen’s Kappa Coefficient	The khat is considered a more robust metric than the classic `accuracy`. It normalizes the accuracy by the possibility of agreement by chance. Its maximum is 1 (perfect agreement); 0 indicates agreement no better than chance, and negative values indicate agreement worse than chance. The closer to 1, the better the classification quality	\(khat = \frac{2 * (TP * TN - FN * FP)}{(TP+FP) * (FP+TN) + (TP+FN) * (FN + TN)}\)
11	`mcc`, `phi_coef`	Matthews Correlation Coefficient	Also known as phi-coefficient. It is particularly useful when the number of observations belonging to each class is uneven. It varies between -1 and 1, with -1 indicating total disagreement, 0 a prediction no better than random, and 1 a perfect prediction. Currently, the mcc estimation is only available for binary cases (two classes)	\(mcc = \frac{TP * TN - FP * FN}{\sqrt{(TP+FP) * (TP+FN) * (TN+FP) * (TN+FN)}}\)
12	`fmi`	Fowlkes-Mallows Index	The fmi is a metric that measures the similarity between two clusterings (predicted and observed). It is equivalent to the square root of the product between precision (PPV) and recall (TPR). It varies between 0 and 1, with 0 being the worst and 1 the best.	\(fmi = \sqrt{precision * recall} = \sqrt{PPV * TPR}\)
13	`bmi`, `jindex`	Informedness	Also known as the Bookmaker Informedness, or as Youden’s J-index. It is a suitable metric when the number of cases for each class is uneven. It varies between -1 and 1, with 0 indicating a prediction no better than chance and 1 the best	\(bmi = recall + specificity - 1 = TPR + TNR - 1\)
14	`posLr`	Positive Likelihood Ratio	The posLr, also known as LR(+), represents the ratio between the probability of a positive prediction for actual positives and the probability of a positive prediction for actual negatives.	\(posLr = \frac{recall}{1-specificity} = \frac{TPR}{FPR}\)
15	`negLr`	Negative Likelihood Ratio	The negLr, also known as LR(-) indicates the odds of obtaining a negative prediction for actual positives (or non-negatives in multiclass) relative to the probability of actual negatives of obtaining a negative prediction	\(negLr = \frac{1-recall}{specificity} = \frac{FNR}{TNR}\)
16	`dor`	Diagnostic Odds Ratio	The dor is a metric summarizing the effectiveness of classification. It represents the odds of a positive case obtaining a positive prediction result with respect to the odds of actual negatives obtaining a positive result	\(dor = \frac{posLr}{negLr}\)
17	`npv`	Negative Predictive Value	It represents the proportion of correctly classified negative cases with respect to the total number of cases predicted as negative. It varies between 0 and 1, with 0 being the worst and 1 the best	\(npv = \frac{TN}{PN} = \frac{TN}{TN + FN}\)
18	`FPR`	False Positive Rate	It represents the complement of `specificity`. It could vary between 0 and 1. The lower the better.	\(FPR = 1 - specificity = 1 - TNR = \frac{FP}{N}\)
19	`FNR`	False Negative Rate	It represents the complement of `recall`. It could vary between 0 and 1. The lower the better.	\(FNR = 1 - recall = 1 - TPR = \frac{FN}{P}\)
20	`FDR`	False Detection Rate	It represents the complement of `precision` (or positive predictive value -`ppv`-). It could vary between 0 and 1, being 0 the best and 1 the worst	\(FDR = 1 - precision = \frac{FP}{PP} = \frac{FP}{TP + FP}\)
21	`FOR`	False Omission Rate	It represents the complement of the `npv`. It could vary between 0 and 1, being 0 the best and 1 the worst	\(FOR = 1 - npv = \frac{FN}{PN} = \frac{FN}{TN + FN}\)
22	`preval`	Prevalence	It represents the proportion of actual positive cases with respect to the total number of cases. It varies between 0 and 1	\(preval = \frac{P}{P + N} = \frac{TP+FN}{TP+FP+TN+FN}\)
23	`preval_t`	Prevalence Threshold	It represents the prevalence level below which the positive predictive value declines sharply (Balayla, 2020). It varies between 0 and 1	\(preval_t = \frac{\sqrt{TPR * FPR} - FPR}{TPR - FPR}\)
24	`csi`, `jaccardindex`	Critical Success Index	The `csi` is also known as the threat score (TS) or Jaccard’s Index. It varies between 0 and 1, with 0 being the worst and 1 the best	\(csi = \frac{TP}{TP+FP+FN}\)
25	`deltap`, `mk`	Markedness or deltap	The `deltap` (a.k.a. Markedness -`mk`-) is a metric that quantifies the probability that a condition is marked by the predictor with respect to a random chance	\(deltap = precision+npv-1 = PPV + NPV -1\)
26	`AUC_roc`	Area Under the Curve	The `AUC_roc` estimates the area under the receiver operating characteristic (ROC) curve following the trapezoid approach. It is bounded between 0 and 1. The closer to 1 the better. AUC_roc = 0.5 means the model's predictions are no better than those of a random classifier.	\(AUC_{roc} = \sum_{i=1}^{n-1} (FPR_{i+1} - FPR_{i}) * \frac{TPR_{i} + TPR_{i+1}}{2}\), where \(i\) indexes the \(n\) points of the ROC curve sorted by increasing FPR
27	`p4`	P4-metric	The `p4` estimates the P-4 following Sitarz (2023) as an extension of the `F-score`. It is bounded between 0 and 1. The closer to 1 the better. It integrates four metrics: `precision`, `recall`, `specificity`, and `npv`.	\(p4 = \frac{4} {\frac{1}{precision} + \frac{1}{recall} + \frac{1}{specificity} + \frac{1}{npv} }\)
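
As a worked illustration of several binary formulas from the table, the base R sketch below starts from a hypothetical set of confusion-matrix counts. It does not call the metrica functions; the variable names simply mirror the metric names, and the counts are invented for the example.

```r
# Hypothetical binary confusion-matrix counts
TP <- 40; FN <- 10; FP <- 5; TN <- 45
P <- TP + FN   # actual positives
N <- TN + FP   # actual negatives

# Building blocks
precision   <- TP / (TP + FP)
recall      <- TP / P
specificity <- TN / N
npv         <- TN / (TN + FN)

# Metrics derived from the counts and building blocks
accuracy <- (TP + TN) / (P + N)
balacc   <- (recall + specificity) / 2
gmean    <- sqrt(recall * specificity)
bmi      <- recall + specificity - 1
deltap   <- precision + npv - 1
mcc      <- (TP * TN - FP * FN) /
  sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
khat     <- 2 * (TP * TN - FN * FP) /
  ((TP + FP) * (FP + TN) + (TP + FN) * (FN + TN))
posLr    <- recall / (1 - specificity)
negLr    <- (1 - recall) / specificity
dor      <- posLr / negLr
csi      <- TP / (TP + FP + FN)
p4       <- 4 / (1 / precision + 1 / recall + 1 / specificity + 1 / npv)
```

With these counts, accuracy is 0.85, khat is 0.7, and dor is 36. For binary cases the squared mcc equals the product of bmi and deltap, which is a useful consistency check.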

List of additional abbreviations:

P = actual positive cases (TP + FN)

N = actual negative cases (TN + FP)

PP = predicted positive cases (TP + FP)

PN = predicted negative cases (TN + FN)

TP = true positive

TN = true negative

FP = false positive

FN = false negative

TPR = true positive rate

TNR = true negative rate

FPR = false positive rate

FNR = false negative rate

ppv = positive predictive value

npv = negative predictive value

B = coefficient B (a.k.a. beta) indicating the weight to be applied to the estimation of fscore (as \(B^2\)).
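
The effect of B on the F-score can be sketched with a small helper in base R. The function name `fbeta_sketch` and the precision and recall values are hypothetical and chosen only for illustration; the body follows the fscore formula from the table.

```r
# Hypothetical helper implementing the fscore formula for a given B
fbeta_sketch <- function(precision, recall, B = 1) {
  (1 + B^2) * precision * recall / ((B^2 * precision) + recall)
}

precision <- 0.5
recall    <- 0.8

f1   <- fbeta_sketch(precision, recall)           # B = 1: harmonic mean
f2   <- fbeta_sketch(precision, recall, B = 2)    # B > 1: weights recall more
f0_5 <- fbeta_sketch(precision, recall, B = 0.5)  # B < 1: weights precision more
```

Because recall is higher than precision in this example, the score increases with B (f0_5 < f1 < f2).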

References:

Ting K.M. (2017). Confusion Matrix. In: Sammut C., Webb G.I. (eds) Encyclopedia of Machine Learning and Data Mining. Springer, Boston, MA.
Accuracy. (2017). In: Sammut C., Webb G.I. (eds) Encyclopedia of Machine Learning and Data Mining. Springer, Boston, MA.
García, V., Mollineda, R.A., Sánchez, J.S. (2009). Index of Balanced Accuracy: A Performance Measure for Skewed Class Distributions. In: Araujo, H., Mendonça, A.M., Pinho, A.J., Torres, M.I. (eds) Pattern Recognition and Image Analysis. IbPRIA 2009. Lecture Notes in Computer Science, vol 5524. Springer-Verlag Berlin Heidelberg.
Ting K.M. (2017). Precision and Recall. In: Sammut C., Webb G.I. (eds) Encyclopedia of Machine Learning and Data Mining. Springer, Boston, MA.
Sensitivity. (2017). In: Sammut C., Webb G.I. (eds) Encyclopedia of Machine Learning and Data Mining. Springer, Boston, MA.
Ting K.M. (2017). Sensitivity and Specificity. In: Sammut C., Webb G.I. (eds) Encyclopedia of Machine Learning and Data Mining. Springer, Boston, MA.
Trevethan, R. (2017). Sensitivity, Specificity, and Predictive Values: Foundations, Pliabilities, and Pitfalls in Research and Practice. Front. Public Health 5:307
Goutte, C., Gaussier, E. (2005). A Probabilistic Interpretation of Precision, Recall and F-Score, with Implication for Evaluation. In: D.E. Losada and J.M. Fernandez-Luna (Eds.): ECIR 2005. Advances in Information Retrieval LNCS 3408, pp. 345–359, 2. Springer-Verlag Berlin Heidelberg.
Maratea, A., Petrosino, A., Manzo, M. (2014). Adjusted-F measure and kernel scaling for imbalanced data learning. Inf. Sci. 257: 331-341.
De Diego, I.M., Redondo, A.R., Fernández, R.R., Navarro, J., Moguerza, J.M. (2022). General Performance Score for classification problems. Appl. Intell. (2022).
Fowlkes, Edward B; Mallows, Colin L (1983). A method for comparing two hierarchical clusterings. Journal of the American Statistical Association. 78 (383): 553–569.
Chicco, D., Jurman, G. (2020). The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 21, 6 (2020).
Youden, W.J. (1950). Index for rating diagnostic tests. Cancer 3: 32-35.
Powers, D.M.W. (2011). Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation. Journal of Machine Learning Technologies 2(1): 37–63.
Chicco, D., Tötsch, N., Jurman, G. (2021). The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation. BioData Min 14(1): 13.
Glas, A.S., Lijmer, J.G., Prins, M.H., Bonsel, G.J., Bossuyt, P.M.M. (2003). The diagnostic odds ratio: a single indicator of test performance. Journal of Clinical Epidemiology 56(11): 1129-1135.
Wang H., Zheng H. (2013). Negative Predictive Value. In: Dubitzky W., Wolkenhauer O., Cho KH., Yokota H. (eds) Encyclopedia of Systems Biology. Springer, New York, NY.
Freeman, E.A., Moisen, G.G. (2008). A comparison of the performance of threshold criteria for binary classification in terms of predicted prevalence and kappa. Ecol. Modell. 217(1-2): 45-58.
Balayla, J. (2020). Prevalence threshold (φe) and the geometry of screening curves. Plos one, 15(10):e0240215.
Schaefer, J.T. (1990). The critical success index as an indicator of warning skill. Weather and Forecasting 5(4): 570-575.
Hanley, J.A., McNeil, B.J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1): 29-36
Hand, D.J., Till, R.J. (2001). A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning 45: 171-186
Mandrekar, J.N. (2010). Receiver operating characteristic curve in diagnostic test assessment. J. Thoracic Oncology 5(9): 1315-1316
Sitarz, M. (2023). Extending F1 metric, probabilistic approach. Adv. Artif. Intell. Mach. Learn., 3 (2):1025-1038.
