---
title: "Getting started with the gbm package"
author: "James Hickey"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{GBM Tutorial}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, message=FALSE, echo=FALSE}
library("gbm3")
```

The `gbm` package implements gradient tree boosting for a wide variety of statistical models and loss functions; including those commonly used in statistics, such as the Cox model, but not usually associated with boosted models.  When presented with data, the `gbm` package offers the user the ability to build predictive models and explore the influence of different variables on the response, akin to a data mining or exploration task.  

To get started a user must:

* have a dataset in a `data.frame`.
* select and make an appropriate distribution object.
* set the training parameters for the gbm model.
* set any additional variable and fitting parameters.

Once these steps have been completed, a user can fit their `gbm` model by a call to `gbmt` and subsequently: evaluate its performance, make predictions, fit additional trees and plot the relative influence of the various predictor variables

### Example dataset 
To begin, this vignette will work with simulated data where the response obeys a `Bernoulli` distribution and is determined by 3 predictor variables.  Every row of the data corresponds to an individual observation and to these observations random weights are assigned.  These weights determine the importance of the associated observation in the fit.  At this stage it is also assumed that the fit will be made directly to the response and **no offset** is supplied.  If the offset vector for each observation was non-zero the `gbm` fit would be to the `observation response + offset` rather than the bare response.  Currently the package supports responses which either consist of: numerics, factors (i. e. `Bernoulli`) or are a survival object (more specifically a counting or right centered survival object) in the case of Cox proportional hazards models. **NB:** Predictor variables may have missing data (specified by `NA`) but the responses must have no missing observations.

```{r}
# create some binomial response data - N observations 
N <- 1000
X1 <- runif(N)
X2 <- runif(N)
X3 <- factor(sample(letters[1:4],N,replace=T))
mu <- c(-1,0,1,2)[as.numeric(X3)]

p <- 1/(1+exp(-(sin(3*X1) - 4*X2 + mu)))
Y <- rbinom(N,1,p) # response

# Normalized weights for each observation
w <- rexp(N)
w <- N*w/sum(w)

# Offset is set to 0
offset <- rep(0, N)

# Collate data
data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3)
```


## Creating a gbm distribution object
The `gbm` package has a number of distributions available to user.  To view what distributions are currently available simply call `available_distributions()`.

```{r}
available_distributions()
```

From the list of available distributions a default `GBMDist` object for each distribution can be created with using the function `gbm_dist` and passing the desired distribution's name as follows:
```{r}
# Create a default Bernoulli 
bern_dist <- gbm_dist("Bernoulli")
```
Certain distributions have distribution specific parameters, such as the number of degrees of freedom associated with the distribution.  These parameters are passed as arguments to `gbm_dist` and how this is done is described in another vignette entitled "model-specific-parameters".

## Setting training parameters
Once the data has been imported and the distribution object created, the next task is to specify the 
training parameters for the `gbm` package.  These parameters are passed to `gbmt` in the form of a `GBMTrainParams` object.  This S3 object can be created using the `training_params` constructor and passing it the appropriate data to the fields.  The named parameters in the constructor are as follows:

* `num_trees`: the number of iterations to use in the fitting.
* `interaction_depth`: the max number of terminal nodes in each tree - the larger this is the more likely overfitting is to occur - depths of ~ 3-10 are usually recommended.
* `min_num_obs_in_node`: the minimum number of observations that a node must have to be considered in fitting a tree.  The default of 10 observations which is usually adequate but if your dataset is small then this could be reduced.
* `shrinkage`: acts as regularization for additional iterations - the smaller the shrinkage generally the better the performance of the fit.  However, smaller shrinkage implies that the number of trees may need to be increased to achieve a certain performance.  In practice the default of `0.001` usually suffices.
* `bag_fraction`: the fraction of the training observations randomly sub-sampled to fit a tree in each iteration.  This has the effect of reducing the variance of the boosted model.  The default `bag_fraction=0.5` is recommended.
* `num_train`: the number of observations to be used in training the model.  This defaults to the minimum value such that the terminal nodes of a fitted tree can be created.
* `id`: a sequence of integers specifying the id of each observation in the data.  This defaults to `seq_len(num_train)`.
* `num_features`: the number of features randomly selected, on each iteration, from the data when growing a tree. 

Once the parameters have been decided the appropriate object is created. For further details on how these parameters and what they represent, please view: *The Elements of Statistical Learning by Friedman et al.*

```{r}
# Creating an appropriate training parameters object
train_params <- training_params(num_trees = 2000, interaction_depth = 3, 
                                num_train=nrow(data), num_features = 3)
```

## Additional variable and fitting parameters
Before fitting our boosted model there are some other parameters to consider. The monotonicity of each predictor variable with the outcome variable can be specified with through the `var_monotone` parameter. In this case the relationship is assumed to be arbitrary and the parameter is unspecified.  The names of the predictor variables may also be specified via `var_names`.  

### Cross Validation
Other parameters relating to cross-validation of the model may be provided.  Firstly, the number of folds to be used in cross-validation is set by `cv_folds`, by default no cross-validation is done by `gbmt`.  An optional vector specifying what fold each observation belongs to can also be specified through `fold_id` and if the distribution selected is `Bernoulli` the cross-validation can be stratified across the response.  

### Parallelisation
The core algorithm can be parallelized across numerous threads by passing the output of `gbmParallel()` to the `par_details` handle.  In this example, more than one thread for processing is not required and so the default is used.  

```{r}
# Example of a gbmParallel with more threads
par_detail <- gbmParallel(num_threads = 2) # Pass to par_details in gbmt
```

## Putting it all together
With the data defined, the distribution object created, the training parameters and additional parameters specified, it is now time to fit the model.  This is done by passing these objects to `gbmt` along with the model formula.  

```{r, message=FALSE}
# Create a gbm fit
fit <- gbmt(Y ~ X1 + X2 + X3, distribution=bern_dist, data=data, weights = w, 
            offset=offset, train_params = train_params,
            cv_folds=5, keep_gbm_data=TRUE)
```

The `keep_gbm_data` flag indicates that the outputted `GBMFit` object will contain the data passed to the fit within a `GBMData` object.

### Default behaviour
The process outlined above is slightly cumbersome.  Thankfully, `gbmt` has numerous default behaviours that are very useful.  Firstly, the distribution will be "guessed" if none is provided - currently it can identify Bernoulli and Cox models, otherwise it defaults to Gaussian.  The "weights" parameter defaults to be uniform across all the data rows and as the offset also defaults to 0 across all data rows.  The training parameters also specify 2000 trees, an interaction depth of 3, identify one row per observation, use half the available data in training and bag half the training data as well as use all of the predictors in the fitting.  Moreover, the weights are normalized by the routine if the distribution is not "Pairwise".

```{r, message=FALSE}
# A default fit to gaussian data
# Create data
N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- factor(sample(letters[1:4],N,replace=T))
X4 <- ordered(sample(letters[1:6],N,replace=T))
X5 <- factor(sample(letters[1:3],N,replace=T))
X6 <- 3*runif(N)
mu <- c(-1,0,1,2)[as.numeric(X3)]

SNR <- 10 # signal-to-noise ratio
Y <- X1**1.5 + 2 * (X2**.5) + mu
sigma <- sqrt(var(Y)/SNR)
Y <- Y + rnorm(N,0,sigma)

# create a bunch of missing values
X1[sample(1:N,size=100)] <- NA
X3[sample(1:N,size=300)] <- NA

# Gaussian data
gauss_data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)

# Perform a cross-validated fit
gauss_fit <- gbmt(Y ~ X1 + X2 + X3 + X4 + X5 + X6, 
                  data=gauss_data, cv_folds =5, keep_gbm_data = TRUE)
```

## Identifying the optimal iteration
A fitted model contains a number of measures of its performance.  Using these measures, the optimal iteration of a `gbm` fit can be identified. The package offers three methods for identifying the optimal iteration of the fit, these are: `'test'`, `'cv'` and `'OOB'`.  The `'test'` method identifies the optimal iteration using the performance of the fit on a set-aside independent test set, this is only available if the number of observations used to fit the model is less than the number of observations in the original data.  The `'cv'` method determines the best iteration number using the cross-validation error provided and finally `'OOB'` selects the best iteration using the out-of-bag error estimate.  Again this is only available if `bag_fraction > 0` and it will under-estimate the optimal number of iterations in the fit.  The optimal iteration number as well as a plot of the performance can then be obtained through a call to `gbmt_performance`.  In these performance plots the cross-validation error, test set error and out-of-bag improvement are represented by green, red and blue lines respectively.  The error on the training set is shown as a black line and the optimal iteration is marked with a dashed blue line.

```{r, message=FALSE}
# Use gaussian fit and determine the performance
# red line is testing error
# green line is cv error
# blue line is OOB error
best_iter_test <- gbmt_performance(gauss_fit, method='test')
best_iter_cv <- gbmt_performance(gauss_fit, method='cv')
best_iter_oob <- gbmt_performance(gauss_fit, method='OOB')
```

## Fitting additional trees
After evaluating the performance it may be the case that the optimal number of iterations is the `num_trees` parameter set.  In this instance additional trees may want to be fitted to the original dataset; this will not alter the cross-validation fit but will update the test set and OOB errors.

```{r, message=FALSE}
# Fit additional trees to gaussian fit and recalculate optimal number of iterations
gauss_fit_more <- gbm_more(gauss_fit, num_new_trees = 5000) # This is a large number of 
                                                            #extra trees!
best_iter_test_more <- gbmt_performance(gauss_fit_more, method='test')
```

## Predictions on other data
With the model fitted and the optimal number of iterations determined, the user can now make predictions on other datasets using the `gbm` package's `predict` functionality.

```{r, message=FALSE}
# Given additional data - of the same shape as our gaussian example
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- factor(sample(letters[1:4],N,replace=TRUE))
X4 <- ordered(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1,0,1,2)[as.numeric(X3)]

Y <- X1**1.5 + 2 * (X2**.5) + mu
Y <- Y + rnorm(N,0,sigma)

data2 <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)

# Then predictions can be made
preds <- predict(gauss_fit_more, newdata=data2, n.trees=best_iter_test_more)
```

## Relative influence of predictors
`gbm` also offers a measure of the relative influence of each predictor in the fitted model.  The influence of a variable $x_j$ is determined by the sum across all iterations of the improvements of a tree fit when split on that variable.

```{r}
# Relative influence of predictors - both examples
print(relative_influence(gauss_fit_more, best_iter_test_more))
print(relative_influence(fit, gbmt_performance(fit, method='cv')))
```

## Plotting the marginal effects of selected variables
As well as calculating the relative influence from the fitted trees, `gbm` also offers the ability to calculate the predictions of specific variables and make the appropriate marginal plots using `plot`.

```{r}
# Examine the marginal effects of the first two variables - gaussian example
# Shows the fitted model predictions as a function of 1st two predictors
# with all the others integrated out
plot(gauss_fit_more, var_index = 1:2)
```

## Additional useful functions
Finally there are some additional useful functions within the `gbm` package to remember.  A `GBMFit` object can be summarised and printed using `summary` and `print` respectively. Each individually fitted tree can be "prettified" using `pretty_gbm_tree` and the loss for a given distribution can be calculated using the `loss` S3 method, a `GBMDist` object and the appropriate data.