A major challenge in big data statistical analysis is the demand for computing resources. For example, when fitting a logistic regression model to a binary response variable with an \(N \times d\) covariate matrix, the computational complexity of estimating the coefficients using the iteratively reweighted least squares (IRLS) algorithm is \(O(\zeta N d^2)\), where \(\zeta\) is the number of iterations. When \(N\) is large, the cost can be prohibitive, especially if high-performance computing resources are unavailable. Subsampling has become a widely used technique to balance the trade-off between computational efficiency and statistical efficiency.
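To make the \(O(\zeta N d^2)\) cost concrete, here is a minimal IRLS sketch in base R. It is an illustration only, not the routine used by `glm()` or by this package; the function name `irls_logistic` and the simulated data are assumptions introduced for the example. The dominant cost in each iteration is forming the weighted cross-product \(X^\top W X\), which takes on the order of \(N d^2\) operations.

```r
# Minimal IRLS for logistic regression (illustrative sketch only).
# X: N x d design matrix (including an intercept column); y: 0/1 response.
irls_logistic <- function(X, y, max_iter = 25, tol = 1e-8) {
  beta <- rep(0, ncol(X))
  for (iter in seq_len(max_iter)) {
    eta <- drop(X %*% beta)          # linear predictor, O(N d)
    mu  <- 1 / (1 + exp(-eta))       # fitted probabilities
    w   <- mu * (1 - mu)             # IRLS working weights
    XtWX  <- crossprod(X, X * w)     # X' W X: O(N d^2), the dominant cost
    score <- crossprod(X, y - mu)    # gradient of the log-likelihood
    step  <- drop(solve(XtWX, score))  # O(d^3), small when N >> d
    beta  <- beta + step
    if (max(abs(step)) < tol) break
  }
  list(coefficients = beta, iterations = iter)
}

# Simulated data (assumed values, for illustration only).
set.seed(42)
N <- 2000
X <- cbind(1, matrix(rnorm(N * 3), N, 3))
y <- rbinom(N, 1, 1 / (1 + exp(-drop(X %*% c(-0.5, 0.5, -0.5, 0.5)))))

fit <- irls_logistic(X, y)
ref <- glm.fit(X, y, family = binomial())

# Largest absolute difference from the base R fit.
max_diff <- max(abs(fit$coefficients - ref$coefficients))
print(max_diff)
print(fit$iterations)
```

Each pass through the loop touches all \(N\) rows, so the total work scales with the number of iterations times \(N d^2\). Fitting on a subsample of size \(n \ll N\) replaces \(N\) with \(n\) in that product.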
The R package subsampling
provides optimal subsampling
methods for various statistical models such as generalized linear models
(GLM), softmax (multinomial) regression, rare event logistic regression,
and quantile regression. Specialized subsampling techniques are
provided to address specific challenges across different models and
datasets.
You can install the development version of subsampling from GitHub with:
# install.packages("devtools")
devtools::install_github("dqksnow/subsampling")
The online documentation provides a quick-start guide.
The following example applies a subsampling method to logistic regression:
library(subsampling)
set.seed(1)

# Simulate N observations with d equicorrelated Gaussian covariates
N <- 1e4
beta0 <- rep(-0.5, 7)            # true coefficients (intercept first)
d <- length(beta0) - 1
corr <- 0.5
sigmax <- matrix(corr, d, d) + diag(1 - corr, d)
X <- MASS::mvrnorm(N, rep(0, d), sigmax)
colnames(X) <- paste("V", 1:ncol(X), sep = "")

# Binary response from the logistic model
P <- 1 - 1 / (1 + exp(beta0[1] + X %*% beta0[-1]))
Y <- rbinom(N, 1, P)
data <- as.data.frame(cbind(Y, X))
formula <- Y ~ .

# Pilot sample size and expected subsample size
n.plt <- 200
n.ssp <- 600
ssp.results <- ssp.glm(formula = formula,
                       data = data,
                       n.plt = n.plt,
                       n.ssp = n.ssp,
                       family = "quasibinomial",
                       criterion = "optL",
                       sampling.method = "poisson",
                       likelihood = "weighted"
                       )
summary(ssp.results)
#> Model Summary
#>
#> Call:
#>
#> ssp.glm(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp,
#> family = "quasibinomial", criterion = "optL", sampling.method = "poisson",
#> likelihood = "weighted")
#>
#> Subsample Size:
#>
#> 1       Total Sample Size 10000
#> 2 Expected Subsample Size   600
#> 3   Actual Subsample Size   635
#> 4   Unique Subsample Size   635
#> 5  Expected Subample Rate    6%
#> 6    Actual Subample Rate 6.35%
#> 7    Unique Subample Rate 6.35%
#>
#> Coefficients:
#>
#>           Estimate Std. Error z value Pr(>|z|)
#> Intercept  -0.4149     0.0803 -5.1694  <0.0001
#> V1         -0.5874     0.0958 -6.1286  <0.0001
#> V2         -0.4723     0.1086 -4.3499  <0.0001
#> V3         -0.5492     0.1014 -5.4164  <0.0001
#> V4         -0.4044     0.1012 -3.9950  <0.0001
#> V5         -0.3725     0.1045 -3.5649   0.0004
#> V6         -0.6703     0.0973 -6.8859  <0.0001
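In the summary above, the actual subsample size (635) differs from the expected size (600). Under Poisson sampling in general, each observation is included independently with its own probability, so the realized size is random and only its expectation is fixed. The base R sketch below illustrates that idea together with inverse-probability weighting. It is not the package's internal implementation: the sampling probabilities (`prob`, built from a made-up `score`) are hypothetical stand-ins for the optimal probabilities, and all variable names are assumptions for the example.

```r
set.seed(1)

# Simulated full data (assumed values, for illustration only).
N <- 1e4
X <- matrix(rnorm(N * 2), N, 2)
colnames(X) <- c("V1", "V2")
Y <- rbinom(N, 1, plogis(-0.5 - 0.5 * X[, 1] - 0.5 * X[, 2]))
dat <- data.frame(Y = Y, X)

n_target <- 600   # expected subsample size

# Hypothetical non-uniform sampling probabilities that sum to 1.
score <- 1 + abs(X[, 1])
prob  <- score / sum(score)

# Poisson sampling: independent Bernoulli draw for each observation.
incl <- pmin(1, n_target * prob)   # inclusion probabilities
keep <- runif(N) <= incl

sub   <- dat[keep, ]
sub$w <- 1 / incl[keep]            # inverse-probability weights

expected_size <- sum(incl)         # fixed by construction
actual_size   <- nrow(sub)         # random; varies with the seed

# Weighted fit on the subsample versus the full-data fit.
fit_sub  <- glm(Y ~ V1 + V2, data = sub, family = quasibinomial(), weights = w)
fit_full <- glm(Y ~ V1 + V2, data = dat, family = binomial())

print(c(expected = expected_size, actual = actual_size))
print(rbind(subsample = coef(fit_sub), full = coef(fit_full)))
```

The weights correct for the unequal inclusion probabilities, so the weighted subsample estimate targets the same coefficients as the full-data fit, at the cost of larger standard errors.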