The Statistical Analysis of Misreporting on Sensitive Survey Questions

Using the misreport package in R

Gregory Eady (December 04, 2016)

1. Introduction

This document provides a brief introduction to the R package misreport. The package implements the method introduced in Eady (2016) to permit researchers to statistically examine the predictors of misreporting on sensitive survey questions. In brief, the goal of the method is to model whether survey respondents provide one response to a sensitive item in a list experiment — a measurement technique designed to elicit a truthful response — but answer otherwise when asked to reveal that response openly on a direct question.1 misreport is made available through the Comprehensive R Archive Network (CRAN).
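
Since the package is distributed through CRAN, it can be installed in the usual way:

# Install the package from CRAN (required only once)
install.packages("misreport")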

2. Covariates

Before turning to the data-generating process that characterizes the setup for the list experiment and direct question, we first simulate survey respondents and their characteristics. These respondent characteristics will later be used as predictors of the control items, sensitive belief, and misreporting.

To begin, let’s say that we have \(10000\) respondents \(i = 1, \ldots, 10000\) in a data.frame A:

n <- 10000
A <- data.frame(i = 1:n)

For concreteness, we generate respondent characteristics that represent age, gender, and education:

A$age <- round(runif(n, 18, 90))
A$gender <- sample(c("Man", "Woman"), n, replace = TRUE, prob = c(0.52, 0.48))
A$education <- sample(c("Below high school",
                        "High school",
                        "College"),
                      n, replace = TRUE, prob = c(0.25, 0.35, 0.4))
A$education <- factor(A$education, levels = c("Below high school",
                                              "High school",
                                              "College"))

The data now appear as follows:

# Display first 10 rows of the data
print(A[1:10, ], row.names = FALSE)
  i age gender         education
  1  50    Man           College
  2  58  Woman Below high school
  3  75    Man Below high school
  4  23    Man       High school
  5  62  Woman       High school
  6  19  Woman           College
  7  66    Man Below high school
  8  63  Woman           College
  9  61  Woman           College
 10  72  Woman           College

3. Data-generating process

We now simulate responses to a list experiment and a direct question. In this vignette, we'll assume that answering the sensitive item affirmatively means giving the socially unacceptable response. In other words, \(Z_i = 1\) or \(D_i = 1\) indicates providing the socially unacceptable response.

To begin, we’ll first assign each respondent at random to the treatment group or the control group:

# Assign to treatment with probability 0.5
A$treatment <- rbinom(n, 1, 0.5)

Now we’ll define the population parameters in the sensitive-item sub-model and simulate responses to the sensitive item:

param_sensitive <- c("Intercept" = -0.5,
                     "gender (Woman)" = 0.25,
                     "age" = 0.01,
                     "education (High school)" = -0.3,
                     "education (College)" = -0.5)

lin_pred <- cbind(1,
                  A$gender == "Woman",
                  A$age,
                  A$education == "High school",
                  A$education == "College") %*% param_sensitive

A$true_belief <- rbinom(n, 1, prob = plogis(lin_pred))

The proportion of respondents in our sample who hold the sensitive belief is as follows:

prop.table(table(A$true_belief))

     0      1 
0.5358 0.4642 

We’ll now simulate whether each respondent who holds the sensitive belief misreports it when asked directly. For the purpose of this vignette, we’ll set the effect of treatment assignment on misreporting to 0. In other words, respondents who receive the treatment list are neither more nor less likely to misreport on the direct question (this constraint could also be relaxed).

param_misreport <- c("Intercept" = -0.5,
                     "gender (Woman)" = -0.3,
                     "age" = -0.01,
                     "education (High school)" = 0.3,
                     "education (College)" = 0.5,
                     "treatment" = 0)

lin_pred <- cbind(1,
                  A$gender == "Woman",
                  A$age,
                  A$education == "High school",
                  A$education == "College",
                  A$treatment) %*% param_misreport

A$misreport <- rbinom(n, 1, prob = plogis(lin_pred))

# By the monotonicity assumption, only those who hold the sensitive
# belief can misreport it: if true_belief = 0, then misreport = 0
A$misreport[A$true_belief == 0] <- 0

Lastly, for the control-items sub-model, we'll set the number of control items to \(J = 4\). We'll also set the coefficient on \(U\) in the control-items sub-model to 0. In other words, those who misreport do not respond to the control items differently from those who do not (this too could be relaxed).

J <- 4

param_control <- c("Intercept" = -0.25,
                   "gender (Woman)" = 0.25,
                   "age" = 0.01,
                   "education (High school)" = -0.25,
                   "education (College)" = -0.5,
                   "U" = 0,
                   "Z" = 0.25)

lin_pred <- cbind(1,
                  A$gender == "Woman",
                  A$age,
                  A$education == "High school",
                  A$education == "College",
                  A$misreport,
                  A$true_belief) %*% param_control

# Response to the control items
A$y_star <- rbinom(n, J, prob = plogis(lin_pred))

Putting it all together, we can calculate respondents’ answers to the list experiment and direct question:

# List experiment response
A$y <- A$y_star + A$true_belief * A$treatment

# Direct question response
A$direct <- A$true_belief
A$direct[A$misreport == 1] <- 0

Our data now appear as follows:

# Display first 10 rows of the data
print(A[1:10, ], row.names = FALSE)
  i age gender         education treatment true_belief misreport y_star y direct
  1  50    Man           College         1           0         0      2 2      0
  2  58  Woman Below high school         0           1         1      3 3      0
  3  75    Man Below high school         1           1         0      1 2      1
  4  23    Man       High school         0           1         0      2 2      1
  5  62  Woman       High school         0           0         0      3 3      0
  6  19  Woman           College         0           1         0      1 1      1
  7  66    Man Below high school         1           1         1      2 3      0
  8  63  Woman           College         1           0         0      1 1      0
  9  61  Woman           College         0           0         0      2 2      0
 10  72  Woman           College         0           1         0      4 4      1

As we can see in these data, the first three respondents happen to represent the three types of respondents that we are aiming to model: respondent 1 does not hold the sensitive belief (true_belief = 0); respondent 2 holds the sensitive belief but misreports it when asked directly (true_belief = 1, misreport = 1); and respondent 3 holds the sensitive belief and reports it truthfully when asked directly (true_belief = 1, misreport = 0).
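
As a quick check (not part of the model-fitting workflow), we can cross-tabulate the simulated belief and misreporting indicators to confirm that only these three types occur:

# Cross-tabulate belief and misreporting; by monotonicity the
# (true_belief = 0, misreport = 1) cell is empty
table(true_belief = A$true_belief, misreport = A$misreport)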

4. Analysis

To model the data, we use the function listExperiment(). This function is the workhorse of the misreport package and aims to model (1) the response to the control items, (2) the response to the sensitive item, and (3) whether the response to the direct question is equivalent to a respondent’s true belief.

We run the function as follows:

library(misreport)

model.1 <- listExperiment(y ~ 1 + gender + age + education,
                          data = A, J = J,
                          treatment = "treatment",
                          direct = "direct",
                          sensitive.response = 1,
                          control.constraint = "partial",
                          misreport.treatment = FALSE)

There are two notable arguments here. First, control.constraint is set to "partial". This corresponds to the simulation data being set up such that those who misreport do not respond to the control items differently from those who do not misreport. To test this assumption, we could set the argument to "none", thereby including a parameter for \(U\) in the control-items sub-model. Alternatively, setting control.constraint to "full" would remove both \(U\) and \(Z\) from the control-items sub-model, which amounts to assuming that neither misreporting nor holding the sensitive belief predicts responses to the control items.
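
For example, a minimal sketch of the unconstrained fit (model.2 is an arbitrary name; the call is identical to the one above except for control.constraint):

# Include a parameter for U in the control-items sub-model
model.2 <- listExperiment(y ~ 1 + gender + age + education,
                          data = A, J = J,
                          treatment = "treatment",
                          direct = "direct",
                          sensitive.response = 1,
                          control.constraint = "none",
                          misreport.treatment = FALSE)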

Second, misreport.treatment is set to FALSE. This corresponds to our data-generating process, in which treatment assignment does not affect whether respondents misreport. Theoretically, such a relationship might exist because respondents in the treatment group recall their response to the sensitive item in the list experiment and provide the same response to the direct question, whether for reasons of cognitive ease or to be consistent on principle. Researchers are therefore advised, when possible, to place the list experiment and the direct question far apart in a survey to help avoid this possibility. If treatment assignment is nevertheless thought to affect misreporting, setting misreport.treatment to TRUE adds a predictor representing treatment status to the misreport sub-model.
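
As a sketch, such a model could be fit as follows (model.3 is an arbitrary name; only misreport.treatment changes relative to model.1):

# Add treatment status as a predictor in the misreport sub-model
model.3 <- listExperiment(y ~ 1 + gender + age + education,
                          data = A, J = J,
                          treatment = "treatment",
                          direct = "direct",
                          sensitive.response = 1,
                          control.constraint = "partial",
                          misreport.treatment = TRUE)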

After model fitting, summary output from the model can be obtained using summary() as follows:

# Show 3 significant digits
summary(model.1, digits = 3)

List experiment sub-models

Call: listExperiment(formula = y ~ 1 + gender + age + education, data = A, 
    treatment = "treatment", J = J, direct = "direct", sensitive.response = 1, 
    control.constraint = "partial", misreport.treatment = FALSE)

CONTROL ITEMS Pr(Y* = y)
                       est.    se       z     p
(Intercept)          -0.283 0.044  -6.377 0.000
genderWoman           0.268 0.025  10.869 0.000
age                   0.011 0.001  18.830 0.000
educationHigh school -0.301 0.032  -9.436 0.000
educationCollege     -0.550 0.031 -17.729 0.000
Z                     0.250 0.025   9.850 0.000
---

SENSITIVE ITEM Pr(Z* = 1)
                       est.    se      z     p
(Intercept)          -0.508 0.154 -3.295 0.005
genderWoman           0.254 0.087  2.934 0.010
age                   0.008 0.002  3.715 0.002
educationHigh school -0.191 0.105 -1.821 0.087
educationCollege     -0.461 0.104 -4.443 0.000
---

MISREPORT Pr(U* = 1)
                       est.    se      z     p
(Intercept)          -0.236 0.302 -0.781 0.446
genderWoman          -0.464 0.180 -2.580 0.020
age                  -0.018 0.004 -4.113 0.001
educationHigh school  0.327 0.230  1.422 0.174
educationCollege      0.601 0.222  2.702 0.016
---

Observations: 10,000 (0 of 10,000 observations removed due to missingness)
Log-likelihood -20266.67
# Recalling the population parameters:
Control:   -0.25  0.25  0.01 -0.25 -0.50  0.25 
Sensitive: -0.50  0.25  0.01 -0.30 -0.50 
Misreport: -0.50 -0.30 -0.01  0.30  0.50

5. Useful quantities of interest

5.1. Predicted probabilities

It will often be the case that researchers want to summarize model output by generating predicted probabilities. The simplest way to do this is with predict(). To demonstrate this function, we'll focus on the variable gender and calculate the mean difference in the predicted probability of misreporting between men and women:

# Predicted probabilities of misreporting
# (other covariates held at their observed values)
pred.woman <- predict(model.1,
                      newdata = data.frame(A[, names(A) != "gender"],
                                           gender = "Woman"))$u.hat

pred.man <- predict(model.1,
                    newdata = data.frame(A[, names(A) != "gender"],
                                         gender = "Man"))$u.hat
# Mean predicted probability of misreporting (women)
mean(pred.woman)
[1] 0.2209041
# Mean predicted probability of misreporting (men)
mean(pred.man)
[1] 0.3072176
# Mean difference in predicted probabilities of misreporting between women and men
mean(pred.woman - pred.man)
[1] -0.08631352

Because misreporting, by assumption, occurs only among those who hold the sensitive belief, we might prefer to calculate differences in predicted probabilities only among the sub-sample of respondents who hold the sensitive belief. By the nature of the list experiment, we don’t know who among the sample truly holds the sensitive belief, but we can instead weight by the posterior predicted probabilities that each respondent holds that belief. These probabilities fall out naturally from the EM procedure and are available in the fitted model object as the component w:

# Display first 10 rows of posterior predicted probabilities
model.1$w[1:10, ]
      Misreport sensitive Truthful sensitive Non-sensitive
 [1,]          0.12833926                  0     0.8716607
 [2,]          0.16345291                  0     0.8365471
 [3,]          0.00000000                  1     0.0000000
 [4,]          0.00000000                  1     0.0000000
 [5,]          0.17927092                  0     0.8207291
 [6,]          0.00000000                  1     0.0000000
 [7,]          0.13951683                  0     0.8604832
 [8,]          0.02192377                  0     0.9780762
 [9,]          0.14740616                  0     0.8525938
[10,]          0.00000000                  1     0.0000000

If we sum across the first two columns of each row, we have the probabilities that each respondent holds the sensitive belief.2 We can therefore weight the predicted probabilities of misreporting as follows:

weighted.mean(pred.man - pred.woman, rowSums(model.1$w[, 1:2]))
[1] 0.08406952

Note further that whether a respondent is a man or a woman also affects the response to the sensitive item itself. We could therefore instead calculate the predicted probability that each respondent both holds the sensitive belief and misreports it.
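
A minimal sketch of that calculation, assuming that the coefficient order in model.1$par.sens matches the sensitive-item summary output above:

# Pr(holds the sensitive belief AND misreports it)
X <- model.matrix(~ 1 + gender + age + education, data = A)
z.hat <- plogis(drop(X %*% model.1$par.sens))  # Pr(Z* = 1)
u.hat <- predict(model.1)$u.hat                # Pr(U* = 1 | Z* = 1)
mean(z.hat * u.hat)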

Lastly, we can also compare the predicted proportion of the sample who misreports to the true proportion in the sample:

# Mean predicted probability of misreporting
# among sub-sample holding the sensitive belief
weighted.mean(predict(model.1)$u.hat, rowSums(model.1$w[, 1:2]))
[1] 0.2518653
# Proportion of those who misreport in the simulated sample
mean(A$misreport[A$true_belief == 1])
[1] 0.277682

5.2. Simulating predicted probabilities

To quantify uncertainty in the predicted probabilities, we can turn to simulation (King, Tomz, and Wittenberg 2000). First, we’ll simulate 500 batches of model parameters:

n_sims <- 500

# Simulate model parameters
coefs <- c(model.1$par.control, model.1$par.sens, model.1$par.misreport)
par_sim <- mvtnorm::rmvnorm(n_sims, coefs, model.1$vcov.mle)

# The matrix of simulated parameters for the misreport sub-model
par_sim_misreport <- par_sim[, (length(coefs) - length(model.1$par.misreport) + 1):length(coefs)]

We now have 500 batches of parameters for the misreport sub-model, the first 10 rows of which appear as follows:

par_sim_misreport[1:10, ]
      (Intercept) genderWoman         age educationHigh school educationCollege
 [1,] -0.24831962  -0.4568064 -0.01424072          0.194078523        0.3760071
 [2,]  0.23180059  -0.2754316 -0.02439730          0.155030225        0.5514485
 [3,]  0.29086779  -0.5903968 -0.02349122          0.079863161        0.3236970
 [4,] -0.30660578  -0.1629836 -0.02269328          0.343485861        0.9181253
 [5,]  0.07974329  -0.8445375 -0.02107409          0.401709755        0.7968517
 [6,] -0.02850963  -0.5043731 -0.02370183          0.704224214        0.6895712
 [7,] -0.46006380  -0.1715734 -0.02218951          0.592626713        0.8415346
 [8,] -0.13773697  -0.2991940 -0.01777390          0.175176477        0.3368902
 [9,] -0.12687926  -0.3666494 -0.01668053          0.008915554        0.1979390
[10,]  0.07765489  -0.9983039 -0.01969899          0.404203518        0.5402056

We can calculate the mean difference in predicted probabilities for each batch of parameters using predict(), setting gender first to "Woman" and then to "Man":

pp_diff <- rep(NA, n_sims)

# For each row of parameters, calculate the mean difference in predicted probabilities
for(i in 1:n_sims) {

  pp_woman <- predict(model.1,
                      newdata = data.frame(A[, names(A) != "gender"],
                                           gender = "Woman"),
                      par.misreport = par_sim_misreport[i, ])$u.hat

  pp_man <- predict(model.1,
                    newdata = data.frame(A[, names(A) != "gender"],
                                         gender = "Man"),
                    par.misreport = par_sim_misreport[i, ])$u.hat

  pp_diff[i] <- mean(pp_woman - pp_man)

}

Alternatively, we could use weighted.mean() instead of mean() to weight these differences by the posterior predicted probabilities that each respondent holds the sensitive belief (or by probabilities calculated from the parameters of the sensitive-item sub-model).
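
For example, the final line of the loop above could be replaced with a sketch like the following:

# Weight each respondent's difference by the posterior
# probability of holding the sensitive belief
pp_diff[i] <- weighted.mean(pp_woman - pp_man,
                            rowSums(model.1$w[, 1:2]))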

Finally, we use these simulated differences in predicted probabilities to quantify our uncertainty in the predicted difference in misreporting between men and women:

mean(pp_diff)
[1] -0.09017887
quantile(pp_diff, c(0.05, 0.95)) # 90%
         5%         95% 
-0.15057799 -0.03164911 

References


Blair, Graeme, Kosuke Imai, Bethany Park, Alexander Coppock, and Winston Chou. 2016. “list: Statistical Methods for the Item Count Technique and List Experiment.” The Comprehensive R Archive Network (CRAN).

Eady, Gregory. 2016. “The Statistical Analysis of Misreporting on Sensitive Survey Questions,” 1–22.

King, Gary, Michael Tomz, and Jason Wittenberg. 2000. “Making the Most of Statistical Analyses: Improving Interpretation and Presentation.” American Journal of Political Science 44 (2): 341–55.


  1. Another useful vignette on the analysis of a list experiment and direct question is provided by the authors of the list package (Blair et al. 2016).

  2. Note that the "Truthful sensitive" column contains only probabilities of 0 or 1 because respondents' answers to the direct question directly reveal whether they openly admit to holding the sensitive belief. By the monotonicity assumption, those who openly admit to holding the sensitive belief are assumed not to be misreporting.