Many advanced machine learning algorithms can be cast into a common framework that minimizes the empirical risk (\(R_{emp}(w)\)) under the control of a regularization term (\(\Omega(w)\)):
\[\min_{w} J(w) := \lambda \Omega(w) + R_{emp}(w),\]
where
\[R_{emp}(w) := \frac{1}{m}\sum_{i=1}^{m} l(x_i,y_i,w) \quad \text{and} \quad \lambda > 0.\]
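For example, plugging in the hinge loss listed in the table below together with the L2 regularizer \(\Omega(w)=\|w\|_2\) recovers a linear SVM-style objective (written here only to illustrate the template; the exact scaling conventions used internally by the package may differ):
\[\min_{w} \; \lambda \|w\|_2 + \frac{1}{m}\sum_{i=1}^{m} \max\bigl(0, 1 - y_i\, w x_i\bigr).\]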
@Teo_JMLR_2010 and @Do_JMLR_2012 have proposed efficient algorithms to solve this minimization problem. The `bmrm` package implements their solution together with several adapter functions that implement popular classification, regression, and structured prediction algorithms. The adapter functions take the form of loss functions \(l(x_i,y_i,w)\) that compute a loss value on each training example \((x_i,y_i)\). The algorithms currently implemented are listed in the table below:
Learning Algorithm | Type | Loss Function | Loss Value |
---|---|---|---|
Support Vector Machine (SVM) | Linear Classifier | `hingeLoss()` | \(\max(0,1-ywx)\) |
Maximizing ROC area | Linear Classifier | `rocLoss()` | @Teo_JMLR_2010, §A.3.1 |
Maximizing F-beta score | Linear Classifier | `fbetaLoss()` | @Teo_JMLR_2010, §A.3.5 |
Logistic Regression | Regressor | `logisticLoss()` | \(\log(1+e^{-ywx})\) |
Least mean square regression | Linear Regressor | `lmsRegressionLoss()` | \((wx-y)^2/2\) |
Least absolute deviation regression | Linear Regressor | `ladRegressionLoss()` | \(|wx-y|\) |
\(\epsilon\)-insensitive regression | Linear Regressor | `epsilonInsensitiveRegressionLoss()` | \(\max(0,|wx-y|-\epsilon)\) |
Quantile regression | Linear Regressor | `quantileRegressionLoss()` | @Teo_JMLR_2010, Table 5 |
Multiclass SVM | Structure Predictor | `softMarginVectorLoss()` | @Teo_JMLR_2010, Table 6 |
Ontology classification | Structure Predictor | `ontologyLoss()` | @Teo_JMLR_2010, §A.4.2 |
Ordinal regression | Structure Predictor | `ordinalRegressionLoss()` | @Teo_JMLR_2010, §A.3.2 |

Table: List of learning algorithms implemented in the `bmrm` package. \label{tab:bmrm_learning_algorithms}
In addition to these built-in algorithms, the package is flexible enough to allow easy implementation of custom loss functions adapted to your learning problem, as sketched below.
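As an illustration, here is a minimal sketch of what a hand-written least-mean-square adapter could look like. It only assumes the general contract described at the end of this document (the loss accepts a parameter matrix W, one candidate parameter vector per column, or 0, and returns the corresponding loss values together with their gradients); the exact return convention expected by the solvers should be checked against the package documentation, so treat `myLmsLoss()` as a hypothetical example rather than the package's API.

```r
# Hypothetical custom adapter: least-mean-square regression written by hand.
# x: design matrix (one row per example), y: numeric targets.
myLmsLoss <- function(x, y) {
  function(W) {
    if (identical(W, 0)) W <- matrix(0, ncol(x), 1)  # the solvers may probe the loss at w = 0
    W <- cbind(W)                          # accept a single vector or a matrix of columns
    f <- x %*% W                           # predictions, one column per candidate w
    loss <- colMeans((f - y)^2) / 2        # empirical risk for each candidate w
    grad <- crossprod(x, f - y) / nrow(x)  # matching gradients, one column per candidate w
    attr(loss, "gradient") <- grad         # assumed convention: gradients attached as an attribute
    loss
  }
}
```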
Regarding regularization, the `bmrm` package can handle both L1 and L2 regularization of the parameter vector \(w\). L1 regularization uses the L1-norm of the parameter vector (\(\Omega(w)=\|w\|_1\)), while L2 regularization uses its L2-norm (\(\Omega(w)=\|w\|_2\)). In theory, L1 regularization yields better model sparsity and may be preferred; however, the implementation available for the L2-regularizer is much more memory efficient. The parameter \(\lambda\) controls the trade-off between model fit and model simplicity, and should be tuned to guard against overfitting, for example as sketched below.
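For instance, a simple hold-out search over a small grid of \(\lambda\) values could look like the following sketch. It assumes the regularization strength is passed to `nrbm()` through an argument named `LAMBDA` and that `hingeLoss(x, y)` accepts labels in \(\{-1,+1\}\); check `?nrbm` and `?hingeLoss` for the exact signatures.

```r
library(bmrm)

# Binary toy problem: versicolor vs the rest, encoded as +1 / -1
x <- cbind(intercept = 100, data.matrix(iris[1:4]))
y <- ifelse(iris$Species == "versicolor", +1, -1)

set.seed(1)
train <- sample(nrow(x), 100)

# Hold-out accuracy for a few values of lambda (the LAMBDA argument name is assumed)
accuracy <- sapply(c(0.01, 0.1, 1, 10), function(lambda) {
  w <- nrbm(hingeLoss(x[train, ], y[train]), LAMBDA = lambda)
  mean(sign(x[-train, ] %*% w) == y[-train])
})
accuracy  # keep the lambda with the best hold-out accuracy
```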
Most of the time, loss functions are convex, and all of the losses implemented in the package are. However, non-convex losses can also be handled by the package if necessary; see the section Choosing the Optimization Algorithms for more details.
In this quick start guide, we show how to train a multiclass SVM on the iris dataset.
```r
library(bmrm)

# Design matrix with a (scaled) intercept column, and the class labels
x <- cbind(intercept=100, data.matrix(iris[c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width")]))
y <- iris$Species

# Train the multiclass SVM and reshape the parameter vector into one column per class
w <- nrbm(softMarginVectorLoss(x, y))
w <- matrix(w, ncol(x), dimnames=list(colnames(x), levels(y)))

# Predict the class with the highest score and compare to the targets
predictions <- colnames(w)[max.col(x %*% w)]
table(target=y, prediction=predictions)
```
The `bmrm` package implements two algorithms proposed by @Teo_JMLR_2010 and @Do_JMLR_2012 to solve the above minimization problem, respectively called `bmrm()` and `nrbm()`. `nrbm()` is a memory-optimized version of `bmrm()` that can also handle non-convex risks when the parameter `convexRisk=FALSE` is set. `nrbm()` should always be preferred over `bmrm()`, but it is limited to L2-regularization. In contrast, `bmrm()` can handle L1-regularization, but consumes more memory and doesn't support non-convex losses. The table below summarizes the recommended optimization algorithm for each use case.
Regularizer | Convex Loss | Optimization Method |
---|---|---|
L1 | Yes | `bmrm()` |
L2 | Yes or No | `nrbm()` |
Table: Recommended optimization method in the different use cases.
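To make the choice concrete, the following sketch shows how the same hinge-loss risk could be passed to either solver. The argument names (`LAMBDA` for the regularization strength, and `convexRisk` as mentioned above) are assumptions; check `?bmrm` and `?nrbm` for the exact signatures and for how the L1/L2 regularizer is selected.

```r
library(bmrm)

# Binary toy problem: setosa vs the rest, encoded as +1 / -1
x <- cbind(intercept = 100, data.matrix(iris[1:4]))
y <- ifelse(iris$Species == "setosa", +1, -1)
risk <- hingeLoss(x, y)

w_l1 <- bmrm(risk, LAMBDA = 1)  # L1-regularized solver, convex losses only
w_l2 <- nrbm(risk, LAMBDA = 1)  # L2-regularized solver, generally preferred
# For a (hypothetical) non-convex risk, nrbm() is still usable with convexRisk = FALSE:
# w_nc <- nrbm(someNonConvexRisk, LAMBDA = 1, convexRisk = FALSE)
```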
The loss functions have to accept a matrix W (or 0) and return both a vector of loss values and a matrix of gradients at the input points W. When W=0,