This is an extremely fast implementation of a Naive Bayes classifier. This package is currently the only package that supports a Bernoulli distribution, a Multinomial distribution, and a Gaussian distribution, making it suitable for binary features, frequency counts, and numerical features alike. Another unique feature is the support of a mix of different event models. Only numerical variables are allowed; however, categorical variables can be transformed into dummies and used with the Bernoulli distribution. This implementation offers a huge performance gain compared to the ‘e1071’ implementation in R. Execution times were compared on a data set of tweets, and this implementation was found to be around 1135 times faster. This performance gain is only realized using a Bernoulli event model. Compared to other implementations, the minimum speed-up for the Bernoulli distribution was found to be 12.5 times. The Multinomial event model implementation is even slightly faster, but it cannot be compared with ‘e1071’, which does not implement a Multinomial distribution. Compared to other implementations of a Multinomial distribution, this package was found to give a speed-up of 12.2 times. See the comparison later in this vignette for more details. The implementation is largely based on the paper “A comparison of event models for Naive Bayes anti-spam e-mail filtering” written by K.M. Schneider (2003).
Any issues can be submitted to: https://github.com/mskogholt/fastNaiveBayes/issues
The purpose of this vignette is to explain some key aspects of this implementation in detail. Firstly, a short introduction to text classification is given as the context for further explanations about the Naive Bayes classifier. It should be noted that the Naive Bayes classifier is not restricted to text classification: it is a general classification algorithm that is most commonly applied to text classification. Secondly, the general framework of a Naive Bayes classifier is outlined in order to subsequently delve deeper into the different event models. Thirdly, a mathematical explanation is given as to why this particular implementation performs so well. In the fourth section, a description is given of the unique features that set this implementation of a Naive Bayes classifier apart from other implementations within the R community.
Text classification is the task of classifying documents by their content: that is, by the words of which they are comprised. The documents are often represented as a bag of words. This means that only the occurrence or frequency of the words in the document is taken into account; any information about the syntactic structure of these words is discarded (Hu & Liu, 2012). In many research efforts regarding document classification, Naive Bayes has been successfully applied (McCallum & Nigam, 1998). Furthermore, text classification will serve as the basis for further elaboration on the inner workings of the Naive Bayes classifier and the different event models.
Naive Bayes is a probabilistic classification method based on the Bayes theorem with a strong and naive independence assumption. Naive Bayes assumes independence between all attributes. Despite this so-called “Naive Bayes assumption”, this technique has been proven to be very effective for text classification (McCallum & Nigam, 1998). In the context of text classification, Naive Bayes estimates the posterior probability that a document, consisting of several words, belongs to a certain class and classifies the document as the class which has the highest posterior probability: \[P(C=k|D) = \frac{P(D|C=k)*P(C=k)}{P(D)}\] Where \(P(C=k|D)\) is the posterior probability that the class equals \(k\) given document, \(D\). The Bayes theorem is applied to rewrite this probability into three components: the probability of the document given the class, \(P(D|C=k)\), the prior probability of the class, \(P(C=k)\), and the probability of the document, \(P(D)\).
To classify a document, \(D\), the class, \(k\), with the highest probability is chosen as the classification. This means that we can simplify the equation a bit, since \(P(D)\) is the same for all classes. By removing the denominator, the focus is now solely on calculating the numerator, i.e. \(P(D|C=k)*P(C=k)\).
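The decision rule above can be sketched in a few lines of Python. This is only an illustration under assumed inputs: the function name `classify` and the likelihood values are made up for the example and are not part of the package.

```python
def classify(likelihoods, priors):
    """Pick the class with the largest P(D|C=k) * P(C=k).

    Both arguments are dicts keyed by class label. P(D) is never
    computed, because it is the same for every class.
    """
    scores = {k: likelihoods[k] * priors[k] for k in priors}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical values of P(D|C=k) for one document.
likelihoods = {"Ham": 0.002, "Spam": 0.01}
priors = {"Ham": 0.63, "Spam": 0.37}

label, scores = classify(likelihoods, priors)

# Dividing by P(D) rescales the scores but keeps their ordering.
evidence = sum(scores.values())
posteriors = {k: s / evidence for k, s in scores.items()}
```

Normalizing by the sum of the scores, as in the last two lines, recovers proper posterior probabilities without changing which class wins.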
The prior probability of class, \(k\), i.e. \(P(C=k)\), is simply the proportion of documents in the training dataset that have class, \(k\). For example, suppose our training dataset consists of 100 emails that have been labeled as either \(Ham\) or \(Spam\), of which 63 emails were labeled \(Ham\) and 37 emails were labeled \(Spam\). In this case, \(P(C=Spam)\) is the proportion of emails that were labeled as \(Spam\), i.e. \(\frac{37}{100}=0.37\). This prior probability estimation is the same regardless of which distribution is used within the Naive Bayes classifier.
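As a small illustration, the prior estimate for the email example can be computed as follows. The helper name `class_priors` is hypothetical and chosen only for this sketch.

```python
from collections import Counter

def class_priors(labels):
    """Return P(C=k) as the share of training documents labeled k."""
    counts = Counter(labels)
    total = len(labels)
    return {k: n / total for k, n in counts.items()}

# The example from the text: 63 Ham emails and 37 Spam emails.
labels = ["Ham"] * 63 + ["Spam"] * 37
priors = class_priors(labels)
```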
Naive Bayes is a popular classification method. Within the classification community, however, there is some confusion about this classifier: there are three different generative models in common use, namely Multinomial Naive Bayes, Bernoulli Naive Bayes, and Gaussian Naive Bayes. Most of the confusion surrounds the Multinomial and Bernoulli event models. Both are called Naive Bayes by their practitioners and both make use of the Naive Bayes assumption. However, they make different assumptions about the distributions of the features that are used. These assumptions lead to two distinct models, which are very often confused (McCallum & Nigam, 1998).
The most commonly used Naive Bayes classifier uses a Bernoulli model. This is applicable for binary features that indicate the presence or absence of a feature (1 and 0, respectively). Each document, \(D\), consists of a set of words, \(w\). Let \(V\) be the vocabulary, i.e. the collection of unique words in the complete dataset. Using the Bernoulli distribution, \(P(D_i|C=k)\) becomes: \[P(D_i|C=k) = \prod\limits_{t=1}^{|V|}{\left[b_{i,t}*P(w_{t}|C=k)+(1-b_{i,t})*(1-P(w_{t}|C=k))\right]}\] Where \(b_{i,t}=1\) if the document, \(D_i\), contains the word, \(w_t\), and \(0\) otherwise. Furthermore, \(|V|\) is the number of unique words in the dataset and \(P(w_{t}|C=k)\) is the conditional probability of word, \(w_t\), occurring in a document with class, \(k\). This is simply calculated as the proportion of documents of class, \(k\), in which word, \(w_t\), occurs, compared to the total number of documents of class, \(k\). In other words: \[P(w_{t}|C=k)=\frac{\sum_{i=1}^{N}{x_{i,t}*z_{i,k}}}{\sum_{i=1}^{N}{z_{i,k}}}\] Where \(N\) is the number of documents in the training dataset and \(x_{i,t}\) equals \(1\) if word, \(w_t\), occurs in document, \(i\), and \(0\) otherwise, so that \(x_{i,t}\) coincides with \(b_{i,t}\). Furthermore, \(z_{i,k}\) equals \(1\) if document, \(i\), is labeled as class, \(k\), and \(0\) otherwise.
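The two Bernoulli formulas can be illustrated with a minimal Python sketch. The toy data and the function names `bernoulli_word_probs` and `bernoulli_likelihood` are assumptions made for this example; no smoothing is applied here.

```python
def bernoulli_word_probs(x, z):
    """P(w_t | C=k): share of class-k documents that contain word t.

    x[i][t] is 1 if word t occurs in document i, else 0.
    z[i] is 1 if document i is labeled as class k, else 0.
    """
    n_class_docs = sum(z)
    n_words = len(x[0])
    return [
        sum(x[i][t] * z[i] for i in range(len(x))) / n_class_docs
        for t in range(n_words)
    ]

def bernoulli_likelihood(b, p):
    """P(D_i | C=k) for one document with indicators b and word probabilities p."""
    result = 1.0
    for b_t, p_t in zip(b, p):
        result *= b_t * p_t + (1 - b_t) * (1 - p_t)
    return result

# Toy training data (made up): 4 documents, 3 words in the vocabulary.
x = [[1, 0, 1],
     [1, 1, 0],
     [0, 1, 0],
     [1, 0, 0]]
z_spam = [1, 1, 0, 0]  # documents 0 and 1 are labeled Spam

p_spam = bernoulli_word_probs(x, z_spam)              # [1.0, 0.5, 0.5]
likelihood = bernoulli_likelihood([1, 0, 1], p_spam)  # 1.0 * 0.5 * 0.5
```

Note that absent words contribute to the product as well, through the factor \(1-P(w_t|C=k)\).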
The multinomial distribution is used to model features that represent the frequency with which events occurred. In other words, it uses the word counts in the documents instead of the binary representation. This means that the distribution used to calculate \(P(D_i|C=k)\) changes. This now becomes: \[P(D_i|C=k) = \prod\limits_{t=1}^{|V|}{P(w_t|C=k)^{x_{i,t}}}\] Where \(x_{i,t}\) is the frequency of word, \(t\), in document, \(i\). Here: \[P(w_t|C=k)=\frac{\sum_{i=1}^{N}{x_{i,t}*z_{i,k}}}{\sum_{s=1}^{|V|}{\sum_{i=1}^{N}{x_{i,s}*z_{i,k}}}}\] Where \(z_{i,k}\) equals \(1\) if document, \(i\), is labeled as class, \(k\), and \(0\) otherwise. Furthermore, \(|V|\) is the length of the vocabulary, i.e. the total number of unique words in the dataset.
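A comparable sketch for the Multinomial event model is shown below. Again, the data and the function names are hypothetical and only meant to mirror the two formulas above, without smoothing.

```python
def multinomial_word_probs(x, z):
    """P(w_t | C=k): share of all word occurrences in class k that are word t.

    x[i][t] is the count of word t in document i.
    z[i] is 1 if document i is labeled as class k, else 0.
    """
    n_words = len(x[0])
    per_word = [sum(x[i][t] * z[i] for i in range(len(x))) for t in range(n_words)]
    total = sum(per_word)
    return [count / total for count in per_word]

def multinomial_likelihood(counts, p):
    """P(D_i | C=k): product of P(w_t | C=k) raised to the word counts."""
    result = 1.0
    for x_t, p_t in zip(counts, p):
        result *= p_t ** x_t
    return result

# Toy word counts (made up): 3 documents, 3 words.
x = [[2, 0, 1],
     [1, 1, 0],
     [0, 3, 0]]
z = [1, 1, 0]  # documents 0 and 1 belong to class k

p = multinomial_word_probs(x, z)                   # counts 3, 1, 1 out of 5
likelihood = multinomial_likelihood([1, 0, 2], p)  # 0.6 * 0.2 ** 2
```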
A Gaussian distribution can also be used to model numerical features. Quite simply, the conditional probabilities are now assumed to follow a normal distribution, where the mean and standard deviation are estimated from the training data. In this case, \(P(D_i|C=k)\) becomes: \[P(D_i|C=k) = \prod\limits_{t=1}^{|V|}{P(w_t|C=k)}\] where \[P(w_t|C=k)=\frac{1}{\sqrt{2\pi\sigma_{t,k}^2}}e^{-\frac{(x_{i,t}-\mu_{t,k})^2}{2\sigma_{t,k}^2}}\] Here \(x_{i,t}\) is the value of feature, \(t\), in document, \(i\), and \(\mu_{t,k}\) and \(\sigma_{t,k}\) are the mean and standard deviation of feature, \(t\), within class, \(k\), estimated by their sample estimators from the training data.
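The Gaussian case can be sketched in the same way. The feature values below are made up, and using `statistics.stdev` (the sample standard deviation with an \(n-1\) denominator) is an assumption of this example rather than a statement about how the package estimates \(\sigma\).

```python
import math
from statistics import mean, stdev

def gaussian_density(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma, evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

# Values of one numerical feature for the training documents of class k (made up).
values = [4.0, 6.0, 8.0]
mu = mean(values)      # 6.0
sigma = stdev(values)  # 2.0, sample standard deviation

# Contribution of this feature for a new observation with value 6.0.
density_at_mean = gaussian_density(6.0, mu, sigma)
```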
As was explained, all three event models are part of a general Naive Bayes framework and all three prescribe different ways to estimate \(P(D_i|C=k)\). Furthermore, all three use the general Naive Bayes approach, which is to assume independence between the features and simply use the product of each individual probability, as follows: \[P(D_i|C=k) = \prod\limits_{t=1}^{|V|}{P(w_t|C=k)}\] A big benefit of this independence assumption is that different event models can be mixed simply by using the individual event models for different features.
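To make the idea of mixing event models concrete, the following sketch multiplies Bernoulli factors for binary features with Gaussian factors for numerical features. The function names and parameter values are hypothetical, and this is not the package's interface; it only shows that the product structure allows each feature to use its own event model.

```python
import math

def gaussian_density(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma, evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def mixed_likelihood(binary, p, numeric, mu, sigma):
    """P(D_i | C=k) when some features are Bernoulli and others Gaussian."""
    likelihood = 1.0
    # Bernoulli factors: p_t if the feature is present, 1 - p_t otherwise.
    for b_t, p_t in zip(binary, p):
        likelihood *= p_t if b_t else 1 - p_t
    # Gaussian factors: one density value per numerical feature.
    for x_t, mu_t, sigma_t in zip(numeric, mu, sigma):
        likelihood *= gaussian_density(x_t, mu_t, sigma_t)
    return likelihood

# Made-up parameters for one class: two binary features, one numerical feature.
likelihood = mixed_likelihood(
    binary=[1, 0], p=[0.8, 0.1],
    numeric=[5.0], mu=[5.0], sigma=[2.0],
)
```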
Another important aspect of Naive Bayes classifiers is the so-called Laplace smoothing. Consider again the probability calculation: \[P(D_i|C=k) = \prod\limits_{t=1}^{|V|}{\left[b_{i,t}*P(w_{t}|C=k)+(1-b_{i,t})*(1-P(w_{t}|C=k))\right]}\] If at any point \(P(w_t|C=k)=0\) for a word that does occur in the document, then \(P(D_i|C=k)\) will also equal \(0\), since it’s a product of the individual probabilities. The same holds for the Multinomial distribution. In order to overcome this, Laplace smoothing is used, which simply adds a small non-zero count to all the word counts, so as to not encounter zero probabilities. There is a very important distinction to be made here. A commonly made mistake is to assume that this smoothing is also applied to any features in the test set that were not encountered in the training set. This, however, is not correct. Laplace smoothing is applied so that words that never occur together with a specific class do not yield zero probabilities. Features in the test set that were not encountered in the training set are simply left out of the equation. This also makes sense: if a word was never encountered in the training set, then \(P(w_t|C=k)\) should be the same for every class, \(k\), so it cannot change which class has the highest posterior probability.
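The distinction between smoothing and ignoring unseen features can be sketched as follows. The add-\(\alpha\) form used here, \((count+\alpha)/(n+2\alpha)\) for a Bernoulli feature, is a common convention and an assumption of this example; the exact formula used by the package may differ. All names and counts are made up.

```python
def smoothed_bernoulli_probs(doc_counts, n_class_docs, alpha=1.0):
    """Smoothed P(w_t | C=k) = (count + alpha) / (n + 2 * alpha).

    doc_counts maps each training word to the number of class-k
    documents that contain it. The 2 * alpha in the denominator covers
    the two outcomes (present, absent) of a Bernoulli feature.
    """
    return {
        word: (count + alpha) / (n_class_docs + 2 * alpha)
        for word, count in doc_counts.items()
    }

def known_words_only(document_words, vocabulary):
    """Drop words that never occurred in the training set."""
    return [w for w in document_words if w in vocabulary]

# Made-up counts: "meeting" never occurs in the 2 Spam training documents.
spam_doc_counts = {"free": 2, "meeting": 0}
p_spam = smoothed_bernoulli_probs(spam_doc_counts, n_class_docs=2)

# "blockchain" was never seen during training, so it is ignored, not smoothed.
kept = known_words_only(["free", "blockchain"], p_spam)
```

With smoothing, "meeting" receives a small but non-zero probability for Spam, while "blockchain" plays no role at all in the prediction.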
As previously explained, when classifying a new document, one needs to calculate \(P(C=k|D_i) = \frac{P(D_i|C=k)*P(C=k)}{P(D_i)}\) for each class, \(k\). However, since the class with the highest posterior probability is used as the classification and \(P(D_i)\) is constant for all classes, the denominator can be ignored. This means that for prediction, only \(P(D_i|C=k)*P(C=k)\) needs to be calculated. As has been shown above, the first factor in the Bernoulli case can be written as: \[P(D_i|C=k) = \prod\limits_{t=1}^{|V|}{\left[b_{i,t}*P(w_{t}|C=k)+(1-b_{i,t})*(1-P(w_{t}|C=k))\right]}\] By taking the log transformation this becomes: \[\log\left(\prod\limits_{t=1}^{|V|}{\left[b_{i,t}*P(w_{t}|C=k)+(1-b_{i,t})*(1-P(w_{t}|C=k))\right]}\right) = \sum_{t=1}^{|V|}{\log\left(b_{i,t}*P(w_{t}|C=k)+(1-b_{i,t})*(1-P(w_{t}|C=k))\right)}\] Furthermore, since \(b_{i,t}\) is either \(0\) or \(1\), exactly one of the two terms inside each logarithm is non-zero, so the expression can be rearranged to: \[\sum_{t=1}^{|V|}{b_{i,t}*\log(P(w_{t}|C=k))} + \sum_{t=1}^{|V|}{(1-b_{i,t})*\log(1-P(w_{t}|C=k))} \] To zoom in on the first part, keep in mind that our matrix of observations, \(x\), is a matrix where each column represents a word, from \(1\) to \(|V|\), with a \(1\) if the word was observed and \(0\) otherwise. This means that the matrix of observations has \(b_{i,t}\) as its values. The log probabilities, \(\log(P(w_t|C=k))\), form a vector of length \(|V|\). We can now use matrix multiplication to derive the sums as follows: \(x * \log(P(w_t|C=k))\) for the first part and \((1-x) * \log(1-P(w_t|C=k))\) for the second part. After these two parts have been added up, one can simply raise \(e\) to the power of the outcomes to transform them back to the original probabilities. This mathematical trick is what allows one to use matrix multiplication, which in turn is what makes this specific implementation so efficient.
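The log-space rearrangement can be checked numerically with a small sketch. Plain Python lists stand in for the matrices here, and the helper `matvec` as well as the data are assumptions of this example; an actual implementation would rely on an optimized matrix product instead. The probabilities are assumed to lie strictly between 0 and 1 (for instance after smoothing), so that the logarithms are defined.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

# Two documents and three words; x[i][t] is the indicator b_{i,t}.
x = [[1, 0, 1],
     [0, 1, 0]]
p = [0.8, 0.5, 0.25]  # P(w_t | C=k), strictly between 0 and 1

log_p = [math.log(v) for v in p]
log_not_p = [math.log(1 - v) for v in p]
x_complement = [[1 - b for b in row] for row in x]

# First part: x * log(P); second part: (1 - x) * log(1 - P).
first = matvec(x, log_p)
second = matvec(x_complement, log_not_p)

# Exponentiating the sum recovers P(D_i | C=k) for every document at once.
likelihoods = [math.exp(a + b) for a, b in zip(first, second)]

# Direct product over the words, for comparison.
direct = [
    math.prod(b * q + (1 - b) * (1 - q) for b, q in zip(row, p))
    for row in x
]
```

With several classes, the vector of log probabilities becomes a matrix with one column per class, so that a single matrix product yields the scores of all documents for all classes.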
In this section, a brief overview is given of the unique features of this package. This implementation improves upon existing implementations on two points:
In order to demonstrate the power of this package, the estimation and prediction execution times of this package were compared to those of several other packages. The comparison was made on a dataset consisting of 14640 tweets, all of which were used both to train and to test the Naive Bayes classifier. After processing, a total of 2225 features, i.e. words, were used. The table below shows the comparison between execution times.
Bernoulli | Execution Time |
---|---|
fastNaiveBayes | 158.5 ms. |
fastNaiveBayes Sparse | 303.3 ms. |
e1071 | 178476 ms. |
bnlearn | 1980.9 ms. |
klaR | 182046 ms. |
naivebayes | 3655.7 ms. |
quanteda | 2059.9 ms. |
As can be seen from this table, the fastNaiveBayes algorithm using a normal R matrix was the fastest. Compared to the ‘e1071’ package, it is around 1135 times faster. It’s interesting to note that using a sparse dgCMatrix did not result in faster execution times for the Bernoulli distribution. The fastest implementations besides this package come from ‘bnlearn’ and ‘quanteda’. Compared to these two, this implementation is around 12.5 times faster.
In the table below, the results are shown when using a Multinomial distribution and the word counts instead of binary features.
Multinomial | Execution Time |
---|---|
fastNaiveBayes | 44.12 ms. |
fastNaiveBayes Sparse | 5.4 ms. |
quanteda | 68.66 ms. |
Rfast | 65.80 ms. |
The comparison cannot be made with a number of implementations, since they have not implemented a Multinomial distribution yet. The interesting thing to note here is that using a sparse dgCMatrix now gives a large increase in performance. Based on the execution times in the table above, the sparse dgCMatrix is roughly 8 times faster than the normal R matrix (5.4 ms. versus 44.12 ms.). Compared to ‘quanteda’, this implementation is still 12.7 times faster using a sparse matrix, and compared to ‘Rfast’ it is 12.2 times faster.
Hu, X., & Liu, H. (2012). Text analytics in social media. In Mining text data (pp. 385-414). Springer, Boston, MA.
McCallum, A., & Nigam, K. (1998, July). A comparison of event models for naive bayes text classification. In AAAI-98 workshop on learning for text categorization (Vol. 752, No. 1, pp. 41-48).
Schneider, K. M. (2003, April). A comparison of event models for Naive Bayes anti-spam e-mail filtering. In Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics-Volume 1 (pp. 307-314). Association for Computational Linguistics.