Using melt

Eunseop Kim

This vignette demonstrates the basic usage of the melt package, including replication code from a paper by Kim, MacEachern, and Peruggia (2024) published in the Journal of Statistical Software. For more details on the package and its applications, readers are encouraged to refer to the paper.

Model building

For a simple illustration of building a model, we apply el_mean() to the synthetic classification problem data synth.tr from the MASS package. The synth.tr object is a data.frame with 250 rows and three columns. We select two columns xs and ys, the \(x\) and \(y\) coordinates, to build an EL model with a two-dimensional mean parameter.

library(MASS)
library(dplyr)
library(melt)
data("synth.tr", package = "MASS")
data <- dplyr::select(synth.tr, c(xs, ys))

We specify c(0, 0.5) as par in el_mean() and build an EL object with the data.

fit_mean <- el_mean(data, par = c(0, 0.5))

The data object is implicitly coerced into a matrix since el_mean() takes a numeric matrix as an input for the data. Basic print() and show() methods display relevant information about an EL object.

fit_mean
#> 
#>  Empirical Likelihood
#> 
#> Model: mean 
#> 
#> Maximum EL estimates:
#>       xs       ys 
#> -0.07276  0.50436 
#> 
#> Chisq: 6.158, df: 2, Pr(>Chisq): 0.04601
#> EL evaluation: converged

The asymptotic chi-square statistic is displayed, along with the associated degrees of freedom and the \(p\) value.
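
As a quick check of the printed test, the \(p\) value can be recomputed from the statistic and the degrees of freedom with base R. This is only a sketch that copies the rounded statistic from the output above, so the result agrees with the printed value up to rounding.

```r
# Statistic and degrees of freedom copied from the printed output above.
chisq <- 6.158
df <- 2

# Upper-tail probability of the chi-square distribution (about 0.046).
p_value <- pchisq(chisq, df = df, lower.tail = FALSE)

# With two degrees of freedom the tail probability has a closed form.
p_closed <- exp(-chisq / 2)

c(pchisq = p_value, closed_form = p_closed)
```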

Next, we consider an infeasible parameter value c(1, 0.5) outside the convex hull of the data to show how el_control() interacts with the model fitting functions through the control argument. The evaluation algorithm continues until the iteration reaches maxit_l or the negative empirical log-likelihood ratio exceeds th. Setting a large th for the infeasible value, we observe that the algorithm hits maxit_l with each element of lambda diverging quickly.

ctrl <- el_control(maxit_l = 50, th = 10000)
fit2_mean <- el_mean(data, par = c(1, 0.5), control = ctrl)
logL(fit2_mean)
#> [1] -10001.14
logLR(fit2_mean)
#> [1] -8620.776
getOptim(fit2_mean)
#> $par
#>  xs  ys 
#> 1.0 0.5 
#> 
#> $lambda
#> [1] -9.908531e+14  2.757135e+14
#> 
#> $iterations
#> [1] 50
#> 
#> $convergence
#> [1] FALSE
#> 
#> $cstr
#> [1] FALSE

In addition, melt contains another function el_eval() to perform the EL evaluation for other general estimating functions.

mu <- 0
sigma <- 1
set.seed(123526)
x <- rnorm(100)
g <- matrix(c(x - mu, (x - mu)^2 - sigma^2), ncol = 2)
fit_eval <- el_eval(g)
fit_eval$pval
#> [1] 0.4645579
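
To make the idea of an EL evaluation concrete, here is a minimal base R sketch for a single scalar estimating function. It is not the algorithm used by melt: el_scalar() is a hypothetical helper that finds the Lagrange multiplier with uniroot(), under the assumption that zero lies strictly inside the range of the estimating function values.

```r
# Hypothetical helper (not part of melt): EL evaluation for a numeric vector
# `g` of scalar estimating function values, e.g. g = x - mu for a mean.
el_scalar <- function(g) {
  # Zero must lie strictly inside the range of g (the convex hull constraint).
  stopifnot(is.numeric(g), min(g) < 0, max(g) > 0)
  n <- length(g)
  # The multiplier lambda solves sum(g / (1 + lambda * g)) = 0.
  score <- function(lambda) sum(g / (1 + lambda * g))
  # These bounds keep every weight 1 / (n * (1 + lambda * g_i)) in (0, 1].
  lower <- (1 / n - 1) / max(g)
  upper <- (1 / n - 1) / min(g)
  lambda <- uniroot(score, lower = lower, upper = upper, tol = 1e-12)$root
  # Minus twice the empirical log-likelihood ratio.
  statistic <- 2 * sum(log1p(lambda * g))
  list(
    lambda = lambda,
    statistic = statistic,
    pval = pchisq(statistic, df = 1, lower.tail = FALSE)
  )
}

set.seed(1)
x <- rnorm(100)
res <- el_scalar(x - 0.2)  # test whether the mean equals 0.2
res$statistic
res$pval
```

The statistic is calibrated here with one degree of freedom because there is a single estimating function; the matrix g passed to el_eval() above has two columns, which corresponds to two degrees of freedom.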

Linear regression analysis

A similar process applies to the other model fitting functions, except that el_lm() and el_glm() require a formula object for model specification. We illustrate the use of el_lm() for regression analysis with the crime rates data UScrime available in MASS. Here we update the control parameters for significance tests of the coefficients.

data("UScrime", package = "MASS")
ctrl <- el_control(maxit = 1000, nthreads = 2)
(fit_lm <- el_lm(y ~ Pop + Ineq, data = UScrime, control = ctrl))
#> 
#>  Empirical Likelihood
#> 
#> Model: lm 
#> 
#> Maximum EL estimates:
#> (Intercept)         Pop        Ineq 
#>    1046.749       3.251      -1.344 
#> 
#> Chisq: 13.95, df: 2, Pr(>Chisq): 0.0009332
#> Constrained EL: converged

The print() method also applies and shows the maximum EL estimates (MELE), the overall model test result, and the convergence status. The estimates are obtained from lm.fit(). The hypothesis for the overall test is that all the parameters except the intercept are zero. The convergence status shows that a constrained optimization is performed in testing the hypothesis. If the model does not include an intercept, the test and the convergence status refer to an EL evaluation instead. The large chi-square value above implies that the data do not support the hypothesis, regardless of the convergence.

Note that failure to converge does not necessarily indicate unreliable test results. Most commonly, the algorithm fails to converge if the additional constraint imposed by a hypothesis is incompatible with the convex hull constraint. The control parameters affect the test results as well. The summary() method reports more details, such as the results of significance tests, where each test involves solving a constrained EL problem.

summary(fit_lm)
#> 
#>  Empirical Likelihood
#> 
#> Model: lm 
#> 
#> Call:
#> el_lm(formula = y ~ Pop + Ineq, data = UScrime, control = ctrl)
#> 
#> Number of observations: 47 
#> Number of parameters: 3 
#> 
#> Parameter values under the null hypothesis:
#> (Intercept)         Pop        Ineq 
#>        1047           0           0 
#> 
#> Lagrange multipliers:
#> [1]  3.504e-03  1.420e-05 -2.618e-05
#> 
#> Maximum EL estimates:
#> (Intercept)         Pop        Ineq 
#>    1046.749       3.251      -1.344 
#> 
#> logL: -187.9 , logLR: -6.977 
#> Chisq: 13.95, df: 2, Pr(>Chisq): 0.0009332
#> Constrained EL: converged 
#> 
#> Coefficients:
#>             Estimate   Chisq Pr(>Chisq)    
#> (Intercept) 1046.749 447.645    < 2e-16 ***
#> Pop            3.251   4.925    0.02647 *  
#> Ineq          -1.344  13.654    0.00022 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
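
The logL, logLR, and Chisq entries in the summary are tied together by two identities: the statistic is minus twice the empirical log-likelihood ratio, and the empirical log-likelihood equals the log-likelihood ratio minus \(n \log n\). The sketch below checks both with the rounded numbers copied from the output above, so agreement is only up to rounding.

```r
# Values copied from the summary output above.
n <- 47          # number of observations
logLR <- -6.977  # empirical log-likelihood ratio

# The chi-square statistic is minus twice the log-likelihood ratio.
chisq <- -2 * logLR

# logLR = sum(log(n * p_i)) and logL = sum(log(p_i)), so they differ by n * log(n).
logL <- logLR - n * log(n)

c(chisq = chisq, logL = logL)
```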

These tests are all asymptotically pivotal without explicit studentization. As a result, the output does not have standard errors.

By iteratively solving constrained EL problems for a grid of parameter values, confidence intervals for the parameters can be calculated with confint(). The chi-square calibration is the default, but the user can optionally specify a critical value through the cv argument. Below we calculate asymptotic 95% confidence intervals.

confint(fit_lm)
#>                   lower       upper
#> (Intercept) 579.7584201 1698.919267
#> Pop           0.3491718    6.352967
#> Ineq         -1.9453327   -0.687159
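
These intervals and the significance tests reported by summary(fit_lm) are two views of the same chi-square calibration: a 95% interval should exclude zero when the corresponding statistic exceeds the 0.95 quantile of the chi-square distribution with one degree of freedom. The sketch below checks this with numbers copied from the two outputs above; it does not call melt.

```r
# Default cutoff under the chi-square calibration (about 3.84).
cutoff <- qchisq(0.95, df = 1)

# Statistics copied from the coefficient table of summary(fit_lm).
chisq <- c("(Intercept)" = 447.645, Pop = 4.925, Ineq = 13.654)

# Interval endpoints copied from confint(fit_lm).
lower <- c(579.7584201, 0.3491718, -1.9453327)
upper <- c(1698.919267, 6.352967, -0.687159)

rejects_zero <- chisq > cutoff           # test rejects "coefficient = 0"
excludes_zero <- lower > 0 | upper < 0   # interval does not cover 0
data.frame(rejects_zero, excludes_zero)
```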

Similarly, we obtain confidence regions for two parameters with confreg().

Hypothesis testing

Now we consider elt() for hypothesis testing, where the arguments rhs and lhs define a linear hypothesis. At least one of them must be provided. The argument lhs takes a numeric matrix or a vector. Alternatively, a character vector can be supplied to symbolically specify a hypothesis, which is convenient when there are many variables. When lhs is NULL, it performs the EL evaluation at rhs. When rhs is NULL, on the other hand, rhs is set to the zero vector automatically, and the EL optimization is performed with lhs. Technically, elt() can reproduce the test results from fit_mean and fit_lm. Note the equivalence between the optimization results.

elt_mean <- elt(fit_mean, rhs = c(0, 0.5))
all.equal(getOptim(elt_mean), getOptim(fit_mean))
#> [1] TRUE
elt_lm <- elt(fit_lm, lhs = c("Pop", "Ineq"))
all.equal(getOptim(elt_lm), getOptim(fit_lm))
#> [1] TRUE
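
The character vector c("Pop", "Ineq") is shorthand for a numeric matrix in which each row picks out one coefficient. The sketch below builds that matrix by hand, assuming the coefficient order shown in the printed output ((Intercept), Pop, Ineq), and applies it to the rounded estimates to show which linear combinations the hypothesis sets to zero.

```r
# One row per constraint; columns follow the order (Intercept), Pop, Ineq.
lhs <- rbind(
  c(0, 1, 0),  # Pop  = 0
  c(0, 0, 1)   # Ineq = 0
)

# Rounded estimates copied from the printed output of fit_lm.
est <- c(1046.749, 3.251, -1.344)

# Linear combinations of the coefficients constrained by the hypothesis.
combos <- drop(lhs %*% est)
combos

# With melt loaded and fit_lm available, the matching call would be
# elt(fit_lm, lhs = lhs), with rhs defaulting to the zero vector.
```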

In addition to specifying an arbitrary linear hypothesis through rhs and lhs, extra arguments alpha and calibrate expand options for testing. The argument alpha controls the significance level determining the critical value, and calibrate chooses the calibration method. We apply the \({F}\) and bootstrap calibrations to fit_mean at a significance level of 0.05. The number of threads is increased to four with 100000 bootstrap replicates in el_control().

ctrl <- el_control(
  maxit = 10000, tol = 1e-04, nthreads = 4, b = 100000, step = 1e-05
)
(elt_mean_f <- elt(fit_mean,
  rhs = c(0, 0.5), calibrate = "F", control = ctrl
))
#> 
#>  Empirical Likelihood Test
#> 
#> Hypothesis:
#> xs = 0.0
#> ys = 0.5
#> 
#> Significance level: 0.05, Calibration: F 
#> 
#> Statistic: 6.158, Critical value: 6.089
#> p-value: 0.04835 
#> EL evaluation: converged
(elt_mean_boot <- elt(fit_mean,
  rhs = c(0, 0.5), calibrate = "boot", control = ctrl
))
#> 
#>  Empirical Likelihood Test
#> 
#> Hypothesis:
#> xs = 0.0
#> ys = 0.5
#> 
#> Significance level: 0.05, Calibration: Bootstrap 
#> 
#> Statistic: 6.158, Critical value: 6.064
#> p-value: 0.04756 
#> EL evaluation: converged
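
The \({F}\) calibration output can be reproduced in base R under one assumption: that the critical value is the scaled \(F\) quantile \(p(n-1)/(n-p) \, F_{p, n-p}\), with \(n = 250\) observations and \(p = 2\) parameters. This is a sketch based on that assumed formula rather than on the package source; it matches the printed critical value and \(p\) value up to rounding.

```r
n <- 250           # rows of synth.tr
p <- 2             # dimension of the mean parameter
alpha <- 0.05
statistic <- 6.158 # copied from the printed output

# Assumed scaling of the F distribution for the F calibration.
scale <- p * (n - 1) / (n - p)

cv_f <- scale * qf(1 - alpha, df1 = p, df2 = n - p)
pval_f <- pf(statistic / scale, df1 = p, df2 = n - p, lower.tail = FALSE)

# Chi-square critical value for comparison (about 5.99).
cv_chisq <- qchisq(1 - alpha, df = p)

c(cv_f = cv_f, pval_f = pval_f, cv_chisq = cv_chisq)
```

Both alternative calibrations give a slightly larger critical value than the chi-square calibration, so the test rejects at the 0.05 level by a narrower margin.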

Multiple testing

We illustrate performing multiple comparisons and constructing simultaneous confidence intervals with the thiamethoxam data, a data.frame with 165 observations and 11 variables. We fit a quasi-Poisson regression model with a log link function using el_glm() to obtain a QGLM model object.

data("thiamethoxam")
fit_glm <- el_glm(visit ~ trt + var + fruit + defoliation,
  family = quasipoisson(link = "log"), data = thiamethoxam,
  control = ctrl
)
print(summary(fit_glm), width.cutoff = 50)
#> 
#>  Empirical Likelihood
#> 
#> Model: glm (quasipoisson family with log link)
#> 
#> Call:
#> el_glm(formula = visit ~ trt + var + fruit + defoliation, 
#>     family = quasipoisson(link = "log"), data = thiamethoxam, 
#>     control = ctrl)
#> 
#> Number of observations: 165 
#> Number of parameters: 7 
#> 
#> Parameter values under the null hypothesis:
#> (Intercept)    trtSpray   trtFurrow     trtSeed       varGZ       fruit 
#>       1.972       0.000       0.000       0.000       0.000       0.000 
#> defoliation         phi 
#>       0.000       1.726 
#> 
#> Lagrange multipliers:
#> [1] -0.20319 -0.18634  0.01835  0.14497 -0.17456  0.10961 -0.04870 -0.08773
#> 
#> Maximum EL estimates:
#> (Intercept)    trtSpray   trtFurrow     trtSeed       varGZ       fruit 
#>     1.97228    -0.11281     0.08001     0.31794    -0.21088     0.05142 
#> defoliation 
#>    -0.02044 
#> 
#> logL: -909.6 , logLR: -67.16 
#> Chisq: 134.3, df: 6, Pr(>Chisq): < 2.2e-16
#> Constrained EL: converged 
#> 
#> Coefficients:
#>             Estimate   Chisq Pr(>Chisq)    
#> (Intercept)  1.97228 421.866    < 2e-16 ***
#> trtSpray    -0.11281   1.680   0.194885    
#> trtFurrow    0.08001   1.014   0.314039    
#> trtSeed      0.31794  11.951   0.000546 ***
#> varGZ       -0.21088   9.498   0.002057 ** 
#> fruit        0.05142  14.470   0.000142 ***
#> defoliation -0.02044  27.147   1.89e-07 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Dispersion for quasipoisson family: 1.726303

We assess the significance of trt by testing whether its three coefficients are all zero. The output of summary() reports a small \({p}\) value, with a constrained solution that differs from that of the overall model test.

elt_glm <- elt(fit_glm, lhs = c("trtSpray", "trtFurrow", "trtSeed"))
summary(elt_glm)
#> 
#>  Empirical Likelihood Test
#> 
#> Hypothesis:
#> trtSpray = 0
#> trtFurrow = 0
#> trtSeed = 0
#> 
#> Significance level: 0.05, Calibration: Chi-square 
#> 
#> Parameter values under the null hypothesis:
#> (Intercept)    trtSpray   trtFurrow     trtSeed       varGZ       fruit 
#>     1.97324     0.00000     0.00000     0.00000    -0.21019     0.05958 
#> defoliation         phi 
#>    -0.02535     1.72700 
#> 
#> Lagrange multipliers:
#> [1] -0.097865 -0.158722  0.123355  0.251704  0.009850 -0.002071  0.007687
#> [8]  0.020678
#> 
#> logL: -849.8, logLR: -7.34
#> Statistic: 14.68, Critical value: 7.815
#> p-value: 0.002112 
#> Constrained EL: converged

Finally, we extend the framework to multiple testing with elmt(), which can be directly applied to the fitted model object. Its syntax is similar to elt(), where rhs and lhs now specify multiple hypotheses. For general hypotheses involving separate matrices, elmt() accepts list objects for rhs and lhs. The elmt() function employs a multivariate chi-square calibration technique based on Monte Carlo simulations to determine the common critical value. Details of multiple testing procedures are provided in Kim, MacEachern, and Peruggia (2023). Continuing from the previous test result, we compare each treatment with the control at an overall significance level of 0.05.

elmt_glm <- elmt(fit_glm, lhs = list("trtSpray", "trtFurrow", "trtSeed"))
summary(elmt_glm)
#> 
#>  Empirical Likelihood Multiple Tests
#> 
#> Overall significance level: 0.05 
#> 
#> Calibration: Multivariate chi-square 
#> 
#> Hypotheses:
#>               Estimate  Chisq Df   p.adj   
#> trtSpray = 0  -0.11281  1.680  1 0.46470   
#> trtFurrow = 0  0.08001  1.014  1 0.66341   
#> trtSeed = 0    0.31794 11.951  1 0.00171 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Common critical value: 5.646

Note the use of a list for lhs in elmt(). While a character vector lhs acts as a single hypothesis for elt(), elements of lhs in elmt() define distinct hypotheses for convenience. The Df column shows the marginal chi-square degrees of freedom for each hypothesis. For an object of class ELMT, confint() uses the common critical value computed by elmt().

confint(elmt_glm)
#>                    lower      upper
#> trtSpray = 0  -0.3686207 0.08064878
#> trtFurrow = 0 -0.1138335 0.26505233
#> trtSeed = 0    0.1043046 0.51360075
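
To put the common critical value in context, it can be compared with two simple chi-square cutoffs for three hypotheses with one degree of freedom each: the unadjusted cutoff and the Bonferroni cutoff. The sketch below uses only base R and the critical value copied from the output above; it does not reproduce the Monte Carlo calibration performed by elmt().

```r
m <- 3             # number of hypotheses
alpha <- 0.05      # overall significance level
common_cv <- 5.646 # copied from summary(elmt_glm)

# Cutoff that ignores multiplicity (about 3.84).
unadjusted <- qchisq(1 - alpha, df = 1)

# Bonferroni cutoff, splitting alpha evenly across hypotheses (about 5.73).
bonferroni <- qchisq(1 - alpha / m, df = 1)

c(unadjusted = unadjusted, common = common_cv, bonferroni = bonferroni)
```

The common critical value lies between the two, so the simultaneous intervals above are wider than unadjusted intervals would be but slightly narrower than Bonferroni-adjusted ones.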

References

Kim, Eunseop, Steven N. MacEachern, and Mario Peruggia. 2023. “Empirical Likelihood for the Analysis of Experimental Designs.” Journal of Nonparametric Statistics 35 (4): 709–32. https://doi.org/10.1080/10485252.2023.2206919.
———. 2024. “melt: Multiple Empirical Likelihood Tests in R.” Journal of Statistical Software 108 (5): 1–33. https://doi.org/10.18637/jss.v108.i05.