The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.
The csurvey package extends the well-known survey package by allowing users to impose order constraints on domain means (or on the systematic component for GLMs). When such constraints are reasonable (e.g., cholesterol increases with age), they can:
Reduce variance and shorten confidence intervals
Allow estimation and inference in small or even empty domains
Provide one-sided hypothesis tests and a cone information criterion (CIC) for model selection
This vignette gives a brief workflow using the examples that are implemented with the package data sets:
nhdat: This is a subset of the 2009 to 2010 NHANES data. It includes participants aged 21 to 45 years, selected for illustrating the estimation of the probability of high cholesterol using order constrained survey methods (binary outcome).
nhdat2: This is a subset of the 2009 to 2010 NHANES data. It includes participants aged 21 to 45 years, selected to demonstrate estimation of domain means using order constrained methods.
For more complete details and additional examples (including NSCG data), see the accompanying R Journal article on csurvey.
The csurvey package relies on the functions in the
survey package such as svydesign and
svrepdesign, which allow users to specify the survey
design. This function produces an object that contains information about
the sampling design, allowing the user to specify strata, sampling
weights, etc.
The csurvey package provides users with symbolic
functions, such as incr, decr, etc, to impose
ordering constraints.
The National Health and Nutrition Examination Survey (NHANES) combines in-person interviews and physical examinations to produce a comprehensive data set from a probability sample of residents of the U.S. The data are made available to the public at this CDC NHANES website.
The subset used in for the first two examples is derived from the
2009–2010 NHANES cycle and is provided in the csurvey
package as the nhdat2 data set. We consider the task of
estimating the average total cholesterol level (mg/dL), originally
recorded as LBXTC in the raw NHANES data, by age (in
years), corresponding to the original variable RIDAGEYR.
Focusing on young adults aged 21 to 45, we analyze a sample of \(n = 1,933\) participants and specify the
survey design using the associated weights and strata available in the
data.
library(csurvey)
data(nhdat2, package = "csurvey")
dstrat <- svydesign(ids = ~id, strata = ~str, data = nhdat2, weight = ~wt)Then, to get the proposed constrained domain mean estimate, we use
the csvy function in the csurvey package. In
this function, incr is a symbolic function used to define
that the population domain means of chol are increasing
with respect to the predictor age.
The n.mix parameter controls the number of simulations
used to estimate the mixture covariance matrix, with a default of
n.mix = 100. To speed up computation, users can set
n.mix to be a smaller number, e.g., 10.
We can extract from ans the CIC value for the
constrained estimator as
and the CIC value for the unconstrained estimator as
We see that for this example, the constrained estimator has a smaller CIC value and this implies it is a better fit.
The csvy function produces both the constrained fit and
the corresponding unconstrained fit using methods from the
survey package. A visual comparison between the two
fits can be easily obtained by applying the plot method to
the resulting csvy object with specifying the argument
type = "both".
For illustration, we display the estimated domain means along with \(95\%\) confidence intervals by the following code:
The confint function can be used to extract the
confidence interval for domain mean. When the response is not Gaussian,
then type="link" produces the confidence interval for
average systematic component over domains, while
type="response" produces the confidence interval for the
domain mean.
For this nhdat2 data set, the sample sizes for each of
the twenty-five ages range from 54 to 99, so that none of the domains is
“small.” To demonstrate csvy in the case of small domains,
we next provide domain mean estimates and confidence intervals for \(400\) domains, arranged in a grid with 25
ages (age), four waist-size categories (wcat),
and four income categories (icat).
The domain sample sizes average only 4.8 observations, and there are
16 empty domains. The sample size for each domain can be checked by
ans$nd. To estimate the population means we assume average
cholesterol level is increasing in both age and waist size, but no
ordering is imposed on income categories:
To extract estimates and confidence intervals for specific domains
defined by the model, the user can use the predict function
as follows:
domains <- data.frame(age = c(24, 35), wcat = c(2, 4), icat = c(2, 3))
pans <- predict(ans, newdata = domains, se.fit = TRUE)
cat("Predicted values, confidence intervals and standard errors for specified domains:\n")
#> Predicted values, confidence intervals and standard errors for specified domains:
print (pans)
#> $fit
#> [1] 162.9599 202.0061
#>
#> $lwr
#> [1] 148.4113 183.8585
#>
#> $upp
#> [1] 177.8921 219.4409
#>
#> $se.fit
#> [1] 7.520757 9.077306The plot function displays the domain mean estimates
along with \(95\%\) confidence
intervals, highlighting in red the two domains specified in the
predict function. The code to create this plot is as below.
The control argument is used let the user adjust aesthetics
of the plot.
ctl <- list(x1lab = "waist", x2lab = "income", subtitle.size = 8)
plot(ans, x1 = "wcat", x2 = "icat", control = ctl, domains = domains)The user can visualize the unconstrained fit by specifying
type = "unconstrained" in the plot
function.
plot(ans, x1 = "wcat", x2 = "icat", control = ctl, type="unconstrained")
#> Warning: Removed 2 rows containing missing values or values outside the scale range
#> (`geom_line()`).
#> Warning: Removed 16 rows containing missing values or values outside the scale range
#> (`geom_point()`).We use another subset of the NHANES 2009–2010 data to demonstrate how
our method applies when the outcome is binary. This subset is included
in the csurvey package as nhdat. It contains
\(n = 1,680\) observations with
complete records on total cholesterol, age, height, and waist
circumference for adults aged 21–40. The binary outcome indicates
whether an individual has high total cholesterol, coded as 1 if total
cholesterol exceeds 200 mg/dL, and 0 otherwise. We estimate the
population proportion with high cholesterol by age, waist, and gender (1
= male, 2 = female). The waist variable, denoted as wcat,
is a 4-level categorized ordinal variable representing waist-to-height
ratios.
It is reasonable to assume that, on average, the proportion of individuals with high cholesterol increases with both age and waist. The model is specified using the following code:
data(nhdat, package = "csurvey")
dstrat <- svydesign(ids = ~ id, strata = ~ str, data = nhdat, weight = ~ wt)
set.seed(1)
ans <- csvy(chol ~ incr(age) * incr(wcat) * gender, design = dstrat,
family = binomial(link = "logit"), test = TRUE)The CIC of the constrained estimator is smaller than that of the unconstrained estimator, and the one-sided hypothesis test has a \(p\)-value close to zero.
summary(ans)
#> Call:
#> csvy(formula = chol ~ incr(age) * incr(wcat) * gender, design = dstrat,
#> family = binomial(link = "logit"), test = TRUE)
#>
#> Null deviance: 2054.947 on 159 degrees of freedom
#> Residual deviance: 1906.762 on 126 degrees of freedom
#>
#> Approximate significance of constrained fit:
#> edf mixture.of.Beta p.value
#> incr(age):incr(wcat):gender 34 0.5174 < 2.2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> CIC (constrained estimator): 0.3199
#> CIC (unconstrained estimator): 1.2778The combination of age, waist and gender gives 160 domains. This implies that the average sample size for each domain is only around 10. Due to the small sample sizes, the unconstrained estimator shows unlikely jumps as age increases within each waist category. On the other hand, the constrained estimator is more stable and tend to have smaller confidence intervals compared with the unconstrained Hájek estimator.
ctl <- list(angle = 0, x1size = 2, x2size = 2, x1lab = "waist", x2_labels = c("male", "female"),
subtitle.size=6)
plot(ans, x1 = "wcat", x2 = "gender", type="both", control = ctl)The csurvey paper with the R Journal also includes more complex examples based on the 2019 National Survey of College Graduates (NSCG). These data sets are too large to include in the CRAN package, so they are provided separately in a GitHub data repository:
GitHub repo: xliaosdsu/csurvey-data
Files: nscg19.rda, nscg19_2.rda
To reproduce the NSCG examples on your own machine, you can download the data as follows:
url1 <- "[https://github.com/xliaosdsu/csurvey-data/raw/main/nscg19.rda](https://github.com/xliaosdsu/csurvey-data/raw/main/nscg19.rda)"
url2 <- "[https://github.com/xliaosdsu/csurvey-data/raw/main/nscg19_2.rda](https://github.com/xliaosdsu/csurvey-data/raw/main/nscg19_2.rda)"
dest1 <- tempfile(fileext = ".rda")
dest2 <- tempfile(fileext = ".rda")
download.file(url1, destfile = dest1, mode = "wb")
download.file(url2, destfile = dest2, mode = "wb")
load(dest1) # creates object nscg19
load(dest2) # creates object nscg19_2You can then follow the detailed NSCG analysis shown in the accompanying the R Journal article for csurvey, which includes:
Block orderings (e.g., STEM vs non-STEM fields)
One-sided tests for salary increasing with father’s education
Plots using plot() and plotpersp() over
multiple predictors
Because CRAN checks do not allow network access and the NSCG data are large, the NSCG examples are not executed in this vignette, but the paper and GitHub repo together provide a reproducible workflow for interested users.
X. Liao and and M. C. Meyer. csurvey: Implementing Order Constraints in Survey Data Analysis. (accepted by The R Journal) 2025.
T. Lumley. Analysis of complex survey samples. Journal of Statistical Software, 9(8):1–19, 2004. doi: 10.18637/jss.v009.i08. [p1, 2]
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.