Introduction

The wsrf package is a parallel implementation of the Weighted Subspace Random Forest algorithm (wsrf) of Xu et al. (2012). A novel variable weighting method is used for variable subspace selection in place of the traditional approach of random variable sampling. This new approach is particularly useful in building models for high dimensional data — often consisting of thousands of variables. Parallel computation is used to take advantage of multi-core machines and clusters of machines to build random forest models from high dimensional data with reduced elapsed times.
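
To illustrate the idea, the sketch below contrasts uniform variable sampling with weighted subspace selection. It is a conceptual sketch only, not the package's internal code: the uniform random weights stand in for the informativeness scores that wsrf computes from the training data as described by Xu et al. (2012).

# Conceptual sketch only: wsrf derives its own variable weights internally.
set.seed(42)
p       <- 1000                  # number of predictor variables
weights <- runif(p)              # hypothetical informativeness scores
nvars   <- floor(log2(p)) + 1    # a typical subspace size

# Traditional random forest: every candidate variable is equally likely.
subspace.rf   <- sample(p, nvars)

# Weighted subspace selection: informative variables are more likely chosen.
subspace.wsrf <- sample(p, nvars, prob=weights/sum(weights))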

Requirements and Installation Notes

Currently, wsrf requires R (>= 3.0.0) and Rcpp (>= 0.10.2) (Eddelbuettel and François 2011; Eddelbuettel 2013). For multi-threading, either a C++ compiler supporting the C++11 threading facilities or the Boost C++ library (version 1.54 or later) is required. The choice is made at installation time, depending on what is available to the user. To install the latest version of the package, from within R run:

install.packages("wsrf")

By default, multi-threading functionality is not enabled. It can be switched on at installation time through the configure.args= argument:

install.packages("wsrf",
                 type="source",
                 configure.args="--enable-c11=yes")

We recommend the C++11 standard library for multi-threading, and this will be the main focus of future development. Although compiling C++11 code in packages is not yet fully supported by the current release of R, the package has been found to compile when the latest versions of GCC and its C++ standard library are installed[1].

Besides the default C++11 installation, we also provide an alternative implementation of wsrf that uses Boost for parallelism.

The choice of version is made at installation time. The version without parallelism is required when neither C++11 nor Boost is available; it is the recommended and only possible choice on the Microsoft Windows platform with the current version of R (3.2.0), and corresponds to the first installation method above:

install.packages("wsrf",
                 configure.args="--enable-c11=no")

Finally, the version using Boost for multi-threading can be installed with the appropriate configuration options. This is suitable when the available C++ compiler does not support C++11.

install.packages("wsrf",
                 type="source",
                 configure.args="--with-boost-include=<Boost include path>
                                 --with-boost-lib=<Boost lib path>")

Usage

This section demonstrates how to use wsrf, especially on a cluster of machines.

The example uses the small weather dataset from rattle (Williams 2011). See its help page in R (?weather) for more details. Below is some basic information about it.

library("rattle")
ds <- weather
dim(ds)
## [1] 366  24
names(ds)
##  [1] "Date"          "Location"      "MinTemp"       "MaxTemp"      
##  [5] "Rainfall"      "Evaporation"   "Sunshine"      "WindGustDir"  
##  [9] "WindGustSpeed" "WindDir9am"    "WindDir3pm"    "WindSpeed9am" 
## [13] "WindSpeed3pm"  "Humidity9am"   "Humidity3pm"   "Pressure9am"  
## [17] "Pressure3pm"   "Cloud9am"      "Cloud3pm"      "Temp9am"      
## [21] "Temp3pm"       "RainToday"     "RISK_MM"       "RainTomorrow"

Before building the model we need to prepare the training dataset. First we note the various roles played by the different variables, including identifying the irrelevant variables.

target <- "RainTomorrow"
ignore <- c("Date", "Location", "RISK_MM")
(vars <- setdiff(names(ds), ignore))
##  [1] "MinTemp"       "MaxTemp"       "Rainfall"      "Evaporation"  
##  [5] "Sunshine"      "WindGustDir"   "WindGustSpeed" "WindDir9am"   
##  [9] "WindDir3pm"    "WindSpeed9am"  "WindSpeed3pm"  "Humidity9am"  
## [13] "Humidity3pm"   "Pressure9am"   "Pressure3pm"   "Cloud9am"     
## [17] "Cloud3pm"      "Temp9am"       "Temp3pm"       "RainToday"    
## [21] "RainTomorrow"
dim(ds[vars])
## [1] 366  21

Next we deal with missing values, using na.roughfix() from randomForest to take care of them.

library("randomForest")
if (sum(is.na(ds[vars]))) ds[vars] <- na.roughfix(ds[vars])
ds[target] <- as.factor(ds[[target]])
(tt <- table(ds[target]))
## 
##  No Yes 
## 300  66

We construct the formula that describes the model which will predict the target based on all other variables.

(form <- as.formula(paste(target, "~ .")))
## RainTomorrow ~ .

Finally we create the randomly selected training and test datasets, setting a seed so that the results can be exactly replicated.

seed <- 42
set.seed(seed)
length(train <- sample(nrow(ds), 0.7*nrow(ds)))
## [1] 256
length(test <- setdiff(seq_len(nrow(ds)), train))
## [1] 110

The signature of the function to build a weighted random forest model in wsrf is:

wsrf(formula, 
     data, 
     ntrees=500, 
     nvars=NULL,
     weights=TRUE, 
     parallel=TRUE)

We use the training dataset to build a random forest model. All parameters except formula and data use their default values: ntrees=500 gives the number of trees to build; weights=TRUE selects the weighted subspace approach rather than traditional random variable sampling; and parallel=TRUE uses multi-threading on the local machine.

library("wsrf")
model.wsrf.1 <- wsrf(form, data=ds[train, vars])
print(model.wsrf.1)
## A Weighted Subspace Random Forest model with 500 trees.
## 
##   No. of variables tried at each split: 5
##                  Out-of-Bag Error Rate: 0.15
##                               Strength: 0.62
##                            Correlation: 0.19
## 
## Confusion matrix:
##      No Yes class.error
## No  209   6        0.03
## Yes  32   9        0.78
print(model.wsrf.1, 1)  # Print tree 1.
## Tree 1 has 20 tests (internal nodes), with OOB error rate 0.1932:
## 
##  1) Pressure3pm <= 1010.4
##  .. 2) MinTemp <= 15
##  .. .. 3) MinTemp <= 3.2   [Yes] (0 1) *
##  .. .. 3) MinTemp >  3.2
##  .. .. .. 4) Cloud9am <= 1   [No] (1 0) *
##  .. .. .. 4) Cloud9am >  1
##  .. .. .. .. 5) WindSpeed3pm <= 17
##  .. .. .. .. .. 6) WindGustSpeed <= 35   [No] (1 0) *
##  .. .. .. .. .. 6) WindGustSpeed >  35   [Yes] (0 1) *
##  .. .. .. .. 5) WindSpeed3pm >  17   [No] (1 0) *
##  .. 2) MinTemp >  15   [Yes] (0 1) *
##  1) Pressure3pm >  1010.4
##  .. 7) Sunshine <= 8.8
##  .. .. 8) Cloud3pm <= 7
##  .. .. .. 9) Temp3pm <= 14.6   [No] (1 0) *
##  .. .. .. 9) Temp3pm >  14.6
##  .. .. .. .. 10) WindGustSpeed <= 46
##  .. .. .. .. .. 11) Sunshine <= 8.6
##  .. .. .. .. .. .. 12) Evaporation <= 1.6
##  .. .. .. .. .. .. .. 13) Evaporation <= 1.2   [No] (0.67 0.33) *
##  .. .. .. .. .. .. .. 13) Evaporation >  1.2   [Yes] (0 1) *
##  .. .. .. .. .. .. 12) Evaporation >  1.6
##  .. .. .. .. .. .. .. 14) Pressure3pm <= 1015.8
##  .. .. .. .. .. .. .. .. 15) Pressure9am <= 1018.1   [No] (1 0) *
##  .. .. .. .. .. .. .. .. 15) Pressure9am >  1018.1   [Yes] (0 1) *
##  .. .. .. .. .. .. .. 14) Pressure3pm >  1015.8   [No] (1 0) *
##  .. .. .. .. .. 11) Sunshine >  8.6   [Yes] (0 1) *
##  .. .. .. .. 10) WindGustSpeed >  46
##  .. .. .. .. .. 16) WindDir3pm == N   [Yes] (0.14 0.86) *
##  .. .. .. .. .. 16) WindDir3pm == NNE   [Yes] (0.14 0.86) *
##  .. .. .. .. .. 16) WindDir3pm == NE   [Yes] (0.14 0.86) *
##  .. .. .. .. .. 16) WindDir3pm == ENE   [Yes] (0 1) *
##  .. .. .. .. .. 16) WindDir3pm == E   [Yes] (0.14 0.86) *
##  .. .. .. .. .. 16) WindDir3pm == ESE   [No] (1 0) *
##  .. .. .. .. .. 16) WindDir3pm == SE   [Yes] (0.14 0.86) *
##  .. .. .. .. .. 16) WindDir3pm == SSE   [Yes] (0.14 0.86) *
##  .. .. .. .. .. 16) WindDir3pm == S   [Yes] (0.14 0.86) *
##  .. .. .. .. .. 16) WindDir3pm == SSW   [Yes] (0.14 0.86) *
##  .. .. .. .. .. 16) WindDir3pm == SW   [Yes] (0.14 0.86) *
##  .. .. .. .. .. 16) WindDir3pm == WSW   [Yes] (0.14 0.86) *
##  .. .. .. .. .. 16) WindDir3pm == W   [Yes] (0.14 0.86) *
##  .. .. .. .. .. 16) WindDir3pm == WNW   [Yes] (0.14 0.86) *
##  .. .. .. .. .. 16) WindDir3pm == NW   [Yes] (0.14 0.86) *
##  .. .. .. .. .. 16) WindDir3pm == NNW   [Yes] (0 1) *
##  .. .. 8) Cloud3pm >  7
##  .. .. .. 17) Pressure9am <= 1017.4   [No] (1 0) *
##  .. .. .. 17) Pressure9am >  1017.4
##  .. .. .. .. 18) WindDir9am == N   [Yes] (0.1 0.9) *
##  .. .. .. .. 18) WindDir9am == NNE   [Yes] (0.1 0.9) *
##  .. .. .. .. 18) WindDir9am == NE   [Yes] (0 1) *
##  .. .. .. .. 18) WindDir9am == ENE   [Yes] (0.1 0.9) *
##  .. .. .. .. 18) WindDir9am == E   [Yes] (0 1) *
##  .. .. .. .. 18) WindDir9am == ESE   [Yes] (0 1) *
##  .. .. .. .. 18) WindDir9am == SE   [Yes] (0.1 0.9) *
##  .. .. .. .. 18) WindDir9am == SSE   [No] (1 0) *
##  .. .. .. .. 18) WindDir9am == S   [Yes] (0.1 0.9) *
##  .. .. .. .. 18) WindDir9am == SSW   [Yes] (0 1) *
##  .. .. .. .. 18) WindDir9am == SW   [Yes] (0.1 0.9) *
##  .. .. .. .. 18) WindDir9am == WSW   [Yes] (0.1 0.9) *
##  .. .. .. .. 18) WindDir9am == W   [Yes] (0.1 0.9) *
##  .. .. .. .. 18) WindDir9am == WNW   [Yes] (0.1 0.9) *
##  .. .. .. .. 18) WindDir9am == NW   [Yes] (0.1 0.9) *
##  .. .. .. .. 18) WindDir9am == NNW   [Yes] (0.1 0.9) *
##  .. 7) Sunshine >  8.8
##  .. .. 19) Pressure3pm <= 1011.7
##  .. .. .. 20) Humidity3pm <= 30   [No] (1 0) *
##  .. .. .. 20) Humidity3pm >  30   [Yes] (0 1) *
##  .. .. 19) Pressure3pm >  1011.7   [No] (1 0) *

Next, we use the model to predict the classes of the test data.

cl <- predict(model.wsrf.1, newdata=ds[test, vars], type="class")
actual <- ds[test, target]
(accuracy.wsrf <- sum(cl == actual, na.rm=TRUE)/length(actual))
## [1] 0.8363636

Thus, we have built a model that is around 84% accurate on unseen testing data.
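
To see where the errors occur, we can cross-tabulate the actual and predicted classes on the test data. This is a simple check using base R; we do not reproduce the output here.

table(actual, predicted=cl)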

Next we build a second model, this time without variable weighting, so it behaves like a traditional random forest.

# Here we build another model without weighting.
model.wsrf.2 <- wsrf(form, data=ds[train, vars], weights=FALSE)
print(model.wsrf.2)
## A Weighted Subspace Random Forest model with 500 trees.
## 
##   No. of variables tried at each split: 5
##                  Out-of-Bag Error Rate: 0.15
##                               Strength: 0.59
##                            Correlation: 0.20
## 
## Confusion matrix:
##      No Yes class.error
## No  211   4        0.02
## Yes  35   6        0.85

We can also derive a subset of the forest from a model, or combine multiple forests into one.

submodel.wsrf <- subset.wsrf(model.wsrf.1, 1:150)
print(submodel.wsrf)
## A Weighted Subspace Random Forest model with 150 trees.
## 
##   No. of variables tried at each split: 5
##                  Out-of-Bag Error Rate: 0.14
##                               Strength: 0.61
##                            Correlation: 0.19
## 
## Confusion matrix:
##      No Yes class.error
## No  210   5        0.02
## Yes  32   9        0.78
bigmodel.wsrf <- combine.wsrf(model.wsrf.1, model.wsrf.2)
print(bigmodel.wsrf)
## A Weighted Subspace Random Forest model with 1000 trees.
## 
##   No. of variables tried at each split: 5
##                  Out-of-Bag Error Rate: 0.14
##                               Strength: 0.61
##                            Correlation: 0.19
## 
## Confusion matrix:
##      No Yes class.error
## No  211   4        0.02
## Yes  33   8        0.80

Next, we build the model on a cluster of servers.

servers <- paste0("node", 31:40)
model.wsrf.3 <- wsrf(form, data=ds[train, vars], parallel=servers)

All we need is a character vector specifying the hostnames of the nodes to use, or a named integer vector whose names are hostnames and whose values give the number of threads to use on each node for model building, in other words, how many trees to build simultaneously. More detailed descriptions of wsrf are presented in the manual.
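
For example, a named integer vector equivalent in spirit to the call above might look as follows; the node names and thread counts are illustrative only.

# Hypothetical cluster specification: names are hostnames, values are the
# number of threads (trees built simultaneously) on each node.
servers <- c(node31=8, node32=8, node33=16)
model.wsrf.4 <- wsrf(form, data=ds[train, vars], parallel=servers)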

References

Eddelbuettel, Dirk. 2013. Seamless R and C++ Integration with Rcpp. New York: Springer.

Eddelbuettel, Dirk, and Romain François. 2011. “Rcpp: Seamless R and C++ Integration.” Journal of Statistical Software 40 (8): 1–18. http://www.jstatsoft.org/v40/i08/.

Williams, Graham J. 2011. Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery. Use R! Springer. http://www.amazon.com/gp/product/1441998896/ref=as_li_qf_sp_asin_tl?ie=UTF8&tag=togaware-20&linkCode=as2&camp=217145&creative=399373&creativeASIN=1441998896.

Xu, Baoxun, Joshua Zhexue Huang, Graham Williams, Qiang Wang, and Yunming Ye. 2012. “Classifying Very High-Dimensional Data with Random Forests Built from Small Subspaces.” International Journal of Data Warehousing and Mining (IJDWM) 8 (2). IGI Global: 44–63.


  1. C++11 support is experimental in R-devel but not yet complete; see the Daily News about R-devel on 2013-12-02 and the MinGW-w64 Notes by Duncan Murdoch.