# Notes on Multinomial Naive Bayes

### Multinomial Naive Bayes Classifier

The Naive Bayes classifier often performs remarkably well, despite its simplicity. It is especially useful when the data is high-dimensional but scarce, and it makes a good baseline model for text classification problems.

This note presents a short derivation of the Multinomial Naive Bayes classifier, and shows how to interpret the coefficients and how to obtain feature importances from a fitted classifier.

#### Naive Bayes: general form

The theory behind naive Bayes is a straightforward application of Bayes' theorem. The probability of an observation $X$ (with $X$ being for instance a vector of word occurrences or word counts) belonging to class $Y_k$ is:

$P(Y_k| X) = P(Y_k) \ P(X| Y_k) \ / \ P(X) $

The first term after the equals sign is the prior class probability, the second term is the likelihood, and the final term, the denominator, is sometimes called the marginal likelihood. This form is the same for all Naive Bayes classifiers: Bernoulli, Multinomial and Gaussian; they differ only in how the likelihood $P(X| Y_k)$ is calculated. This is also apparent in the scikit-learn implementation (see the `BaseNB` base class). In this note, only the multinomial model is considered.

The first two terms after the equals sign can also be considered together, as the joint probability of $X$ and $Y$, because

$P(Y_k) \ P(X| Y_k) = P(X, Y_k)$

The denominator is typically not explicitly considered, since its value is fixed by the constraint that the posterior probabilities of all classes must sum to 1.
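The normalization constraint can be made concrete with a minimal numeric sketch (the joint probabilities below are made-up numbers, purely for illustration):

```
import numpy as np

# Hypothetical unnormalized joint probabilities P(Y_k) P(X | Y_k)
# for three classes (made-up numbers).
joint = np.array([0.02, 0.06, 0.12])

# Dividing by their sum is the same as dividing by P(X):
posterior = joint / joint.sum()

print(posterior)        # [0.1 0.3 0.6]
print(posterior.sum())  # 1.0
```

The marginal likelihood $P(X)$ never needs to be computed separately; it falls out of requiring the posteriors to sum to 1.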

#### Multinomial formulation

The reason that Naive Bayes is called "naive" is usually quoted as coming from the assumption that all the elements of $X$ are conditionally independent given the class. That is to say, the joint likelihood is the product of the individual conditional probabilities:

$P(X_1, X_2, \ldots, X_d | Y) = \prod_{i=1}^d P(X_i | Y) $

This reduces the likelihood to a particularly simple form.

Whereas this is indeed the ground assumption for Bernoulli and Gaussian Naive Bayes, this is **not** the assumption underlying multinomial Naive Bayes.

Instead, consider the multinomial distribution, where each word is a draw from a (large) number of tokens, each with a probability of being drawn. The naiveness of this approach is that the drawing process itself is independent of previous draws; it only depends on the class $Y_k$. Introducing $x_i$ as the number of counts for token (word) $i$, and $\theta_{k, i}$ as the probability of drawing token $i$ given class $Y_k$, the probability of a specific draw is:

$P(X | Y_k) = \frac{(\sum_i x_i)!}{\prod_i x_i!} \prod_i \theta_{k, i}^{x_i} $

Note that the division with the factorials is class-independent and can therefore be ignored: it will be compensated by the marginal likelihood because of the earlier-mentioned constraint. The $\theta_{k, i}$ are estimated straightforwardly from the word counts conditional on the class labels, typically with pseudocounts (smoothing) to avoid zero probabilities in the prediction caused by a single token.
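As a sanity check, the smoothed estimate $\theta_{k,i} = (N_{k,i} + \alpha) / (\sum_i N_{k,i} + \alpha d)$ can be computed by hand on a toy count matrix (made up for illustration) and compared against the `feature_log_prob_` attribute of a fitted scikit-learn classifier:

```
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix: 4 documents, 3 tokens, two classes (made-up data)
X = np.array([[2, 1, 0],
              [3, 0, 1],
              [0, 2, 4],
              [1, 1, 5]])
y = np.array([0, 0, 1, 1])
alpha = 1.0  # Laplace smoothing

# Hand-rolled smoothed estimate of theta_{k,i}
counts = np.array([X[y == k].sum(axis=0) for k in (0, 1)])
theta = (counts + alpha) / (counts.sum(axis=1, keepdims=True) + alpha * X.shape[1])

clf = MultinomialNB(alpha=alpha).fit(X, y)
print(np.allclose(np.log(theta), clf.feature_log_prob_))  # True
```

Note that the smoothed $\theta_{k,i}$ still sum to 1 over the vocabulary within each class, as a proper multinomial parameter vector should.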

Written out fully (with the class-independent factorial term dropped), the posterior becomes:

$P(Y_k | X) = P(Y_k) \prod_i \theta_{k, i}^{x_i} \ \ / \ (\sum_m P(Y_m) \prod_i \theta_{m, i}^{x_i} ) $

where the denominator takes care of the normalization.

Note that log-transformation reveals a simple linear decision boundary in the case of binary classification (the boundary is where this difference equals zero):

$\log P(Y_1|X) - \log P(Y_0|X) = \log P(Y_1) - \log P(Y_0) + \sum_i x_i ( \log \theta_{1,i} - \log \theta_{0,i} ) $
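This linear form can be verified directly against a fitted classifier: the difference of the class log-posteriors should equal the log prior odds plus the dot product of the counts with the log-$\theta$ differences. A small sketch on made-up toy data:

```
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix with two classes (made-up data)
X = np.array([[2, 1, 0],
              [3, 0, 1],
              [0, 2, 4],
              [1, 1, 5]])
y = np.array([0, 0, 1, 1])
clf = MultinomialNB(alpha=1.0).fit(X, y)

x = np.array([1, 0, 3])
# Right-hand side: log prior odds plus the term linear in the counts
delta = clf.feature_log_prob_[1] - clf.feature_log_prob_[0]
log_odds = (clf.class_log_prior_[1] - clf.class_log_prior_[0]) + x @ delta

# Left-hand side: difference of the log posteriors
lp = clf.predict_log_proba(x.reshape(1, -1))[0]
print(np.isclose(log_odds, lp[1] - lp[0]))  # True
```

The marginal likelihood cancels in the difference, which is why the decision boundary is exactly linear in the counts $x_i$.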

#### Getting feature importances

Feature importances can most easily be obtained from the final term in the previous equation. The largest positive contributors to this term are the largest contributors to the differential log probability of class 1 compared to class 0.

Let's implement a simple function to accomplish this.

```
# Import modules and data
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

cats = ['alt.atheism', 'soc.religion.christian']
newsgroups_train = fetch_20newsgroups(subset='train', shuffle=True,
                                      categories=cats,
                                      remove=[])  # or ('headers', 'footers', 'quotes')
newsgroups_test = fetch_20newsgroups(subset='test', shuffle=True,
                                     categories=cats,
                                     remove=[])  # or ('headers', 'footers', 'quotes')

# Vectorizer: words of at least 3 letters, tf-idf weighted
vectorizer = TfidfVectorizer(analyzer='word', token_pattern=r'\b[a-zA-Z]{3,}\b',
                             lowercase=False, min_df=5, max_df=0.7,
                             stop_words='english')
```

```
def feature_importances_NB(clf, data):
    """
    Returns the importances of words present in data, as measured
    by their contribution to the difference in the log probabilities
    between the two classes.

    clf: a fitted MultinomialNB classifier with 2 classes
    data: a 1-D numpy array with (weighted) counts
    returns: a 1-D numpy array of the size of data
    """
    # difference between the log-thetas of both classes
    delta_log_prob = clf.feature_log_prob_[1, :] - clf.feature_log_prob_[0, :]
    return np.multiply(data, delta_log_prob)


def get_top_N_features(clf, vectorizer, text, N=10):
    """
    Returns a list of tuples with the token name and its feature importance
    to class 1 relative to class 0, in terms of log-probabilities
    (positive: contributes positively to the prediction of class 1).
    """
    # Generate a dense feature vector for the text
    data = np.asarray(vectorizer.transform(text).todense()).reshape(-1)
    feature_importances = feature_importances_NB(clf, data)
    # Take the N features with the largest absolute contribution
    topN_features_idx = np.argsort(np.abs(feature_importances))[-N:]
    feature_names = vectorizer.get_feature_names_out()  # get_feature_names() in older scikit-learn
    return [(feature_names[i], '{:.3f}'.format(feature_importances[i]))
            for i in topN_features_idx[::-1]]
```

```
train_X = vectorizer.fit_transform(newsgroups_train.data)
clf = MultinomialNB(alpha=1.0)
clf.fit(train_X, newsgroups_train.target)
```

```
# Test the prediction
clf.predict_proba(train_X[1, :])
```

```
get_top_N_features(clf, vectorizer, [newsgroups_train.data[1]])
```

```
get_top_N_features(clf, vectorizer, [newsgroups_test.data[35]])
```
