The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.

Imputation Method based on Iterative EM PCA

Birgit Karlhuber

2024-07-08

This vignette showcases the function impPCA(), which can be used to impute missing values with the help of a greedy algorithm for EM-PCA including robust methods.

Data

The following example demonstrates the functionality of impPCA() using a subset of sleep. The columns have been selected deliberately to include some interactions between the missing values

library(VIM)

data(iris)
df <- iris[1:30,c(1,4)]
colnames(df) <- c("S.Length", "P.Width") # select two numerical variables
df_na <- df

# randomly produce some missing values in the column P.Width
set.seed(1)
nbr_missing <- 10
index_na <- sample(nrow(df_na),size = nbr_missing,replace = T)
index_na <- index_na[!duplicated(index_na)]
df_na[index_na,2] <- NA

w <- is.na(df_na$`P.Width`)

aggr(df_na)

Imputation

By setting method to “mcd” the robust estimation is used (instead of the default “classical”). With boot=FALSE imputed data set would be a data.frame else (boot=TRUE) it is a list where each list element contains a data.frame.

imputed <- impPCA(df_na, method = "mcd", boot=TRUE)[[1]]
aggr(imputed)

The plot shows that all missing values of the variable P.Width were imputed by the impPCA() function.

Performance of method

In the next plot the non-missing data points, the ones set to missing and the associated imputed values are visualized. Furthermore the MAPE short for Mean Absolute Percentage Error and the NRMSE short for Normalized Root Mean Squared Error are visualized in the plot.

# create plot
plot(`P.Width` ~ `S.Length`, data = df, type = "n", ylab = "P.Width", xlab="S.Length")
mtext(text = "impPCA robust", side = 3)
points(df$`S.Length`[!w], df$`P.Width`[!w])
points(df$`S.Length`[w], df$`P.Width`[w], col = "grey", pch = 17)
points(imputed$`S.Length`[w], imputed$`P.Width`[w], col = "red", pch = 20, cex = 1.4)
segments(x0 = df$`S.Length`[w], x1 = imputed$`S.Length`[w], y0 = df$`P.Width`[w],
y1 = imputed$`P.Width`[w], lty = 2, col = "grey")
legend("topleft", legend = c("non-missings", "set to missing", "imputed values"),
pch = c(1,17,20), col = c("black","grey","red"), cex = 0.7)
mape <- round(100* 1/sum(is.na(df_na$`P.Width`)) * sum(abs((df$`P.Width` -
imputed$`P.Width`) / df$`P.Width`)), 2)
s2 <- var(df$`P.Width`)
nrmse <- round(sqrt(1/sum(is.na(df_na$`P.Width`)) * sum(abs((df$`P.Width` -
imputed$`P.Width`) / s2))), 2)
text(x = 5.6, y = 0.16, labels = paste("MAPE =", mape))
text(x = 5.6, y = 0.12, labels = paste("NRMSE =", nrmse))

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.