
Naive Bayes Classifier

The Naive Bayes classifier is a supervised classification technique that is particularly well suited to qualitative predictors. This does not mean that quantitative variables must be excluded, but they should first be transformed by discretizing the continuous values.
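
A minimal sketch of that transformation ( the column, bin edges, and labels below are purely illustrative ) :

import pandas as pd

# Hypothetical continuous predictor : customer age
ages = pd.Series([18, 25, 31, 42, 58, 67])

# Discretize into three labelled intervals ; the cut points are illustrative
age_categories = pd.cut(ages, bins=[0, 30, 50, 100], labels=["young", "middle", "senior"])
print(age_categories.tolist())  # ['young', 'young', 'middle', 'middle', 'senior', 'senior']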

This technique predicts class probabilities and operates under the assumption that the variables are conditionally independent of each other given the class. It relies on the individual probability of each variable given a class and multiplies these individual probabilities by the overall class probability.

Bayes' Theorem

Bayes' Theorem focuses on conditional probability, linking the probability of A given B to the probability of B given A :

P(A|B) = \frac{P(B|A) P(A)}{P(B)}

The probability of A given B is the probability of B given A multiplied by the ratio of the probability of A to the probability of B.

For example :

  • What is the probability that a customer who uses an iPhone will not renew their subscription ?
  • What is the probability that a customer who does not renew their subscription uses an iPhone ?

If A = cancels subscription and B = uses an iPhone, then the probability of canceling the subscription given the use of an iPhone is the probability of having an iPhone given subscription cancellation multiplied by the total probability of canceling a subscription, divided by the total probability of using an iPhone.
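
In symbols, the sentence above reads :

P(Cancel | iPhone) = \frac{P(iPhone | Cancel) P(Cancel)}{P(iPhone)}

With purely illustrative figures : if 40 % of cancelling customers use an iPhone, 20 % of all customers cancel, and 30 % of all customers use an iPhone, then P(Cancel | iPhone) = (0.40 * 0.20) / 0.30 ≈ 0.27.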

Individual Conditional Probability

The Naive Bayes classifier assumes that the variables are independent. A car can be considered a Ferrari if it is red, sporty, and has a max power of 780 hp. Even though these characteristics are related in reality, the algorithm will determine that the car is a Ferrari by independently considering its color, type, and power as follows :

  • What is the probability that a red car is a Ferrari ?
    (individual conditional probability)
  • What is the probability that a sports car is a Ferrari ?
    (individual conditional probability)
  • What is the probability that a 780 hp engine is a Ferrari ?
    (individual conditional probability)
  • What is the probability of a Ferrari among all cars ?
    (class probability)

The algorithm then multiplies these individual probabilities by the class probability and divides the result by the sum of the same products computed over all classes, to obtain a final probability.
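
Written with Bayes' theorem and the independence assumption, the Ferrari question becomes ( a sketch, where the denominator sums the same product over every car class C_j ) :

P(Ferrari | red, sport, 780\,hp) = \frac{P(Ferrari) P(red | Ferrari) P(sport | Ferrari) P(780\,hp | Ferrari)}{\sum_{j} P(C_j) P(red | C_j) P(sport | C_j) P(780\,hp | C_j)}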

Formula

The Bayes formula attempts to answer the following question :

« What is the probability that a record containing a set of predictive values x_1, x_2, \dots, x_p belongs to a class among all classes ? »

P_{nb}(C_1 | x_1, \dots, x_p) = \frac{P(C_1) [P(x_1 | C_1) P(x_2 | C_1) P(x_3 | C_1) \dots P(x_p | C_1)]}{[P(C_1) P(x_1 | C_1) P(x_2 | C_1) \dots P(x_p | C_1)] + \dots + [P(C_m) P(x_1 | C_m) P(x_2 | C_m) \dots P(x_p | C_m)]}

This formula can be rewritten as :

P_{nb}(C_1 | x_1, \dots, x_p) = \frac{P(C_1) \prod_{i=1}^{p} P(x_i | C_1)}{\sum_{j=1}^{m} P(C_j) \prod_{i=1}^{p} P(x_i | C_j)}

The symbol \prod_{i=1}^{p} represents the product over the p predictors, and \sum_{j=1}^{m} the sum over the m classes.

Breaking down this formula, the steps are as follows :

  1. Estimate the individual conditional probability for each predictor belonging to classes C_1, \dots, C_m – that is, the probability of the value of the record to be classified given each class : P(x_j | C_1) for the first class, P(x_j | C_2) for the second class, and so on. Remember, the features are evaluated independently ;

  2. After independent evaluation, multiply the individual conditional probabilities of belonging to each class C_i for each predictor, which gives the conditional likelihood of class C_1 : [P(x_1 | C_1) P(x_2 | C_1) P(x_3 | C_1) \dots P(x_p | C_1)], ..., and of class C_i : [P(x_1 | C_i) P(x_2 | C_i) P(x_3 | C_i) \dots P(x_p | C_i)] ;

  3. Estimate the proportion of records belonging to each class C_i, i.e. the number of records of C_1 relative to the total number of records across all classes, and similarly for classes C_2, \dots, C_m ;

  4. Repeat steps 1, 2, and 3 for all classes (= conditional probability of classes C_2, \dots, C_m & proportion of records belonging to classes C_2, \dots, C_m) ;

  5. Apply the complete Naive Bayes classifier formula for all classes ;

  6. Assign the record to the class with the highest probability ;
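
These six steps can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used later on this page ; the helper function and the toy word counts ( borrowed from the spam example below, limited to two words ) are purely illustrative :

# A minimal, illustrative sketch of the six steps above.
# counts[c][x] is the number of training records of class c in which predictor x is present.
def naive_bayes_posteriors(counts, class_totals, record):
    total_records = sum(class_totals.values())
    scores = {}
    for c, n_c in class_totals.items():  # step 4 : loop over every class
        # steps 1 and 2 : product of the individual conditional probabilities P(x_i | C)
        likelihood = 1.0
        for x in record:
            likelihood *= counts[c].get(x, 0) / n_c
        # step 3 : class proportion P(C)
        prior = n_c / total_records
        scores[c] = prior * likelihood
    # step 5 : divide by the sum of the same products over all classes
    denominator = sum(scores.values())
    return {c: score / denominator for c, score in scores.items()}

# Toy counts borrowed from the spam example below
posteriors = naive_bayes_posteriors(
    counts={"email": {"Dear": 8, "Friends": 5}, "spam": {"Dear": 2, "Friends": 1}},
    class_totals={"email": 17, "spam": 8},
    record=["Dear", "Friends"],
)
print(max(posteriors, key=posteriors.get), posteriors)  # step 6 : class with the highest probability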

Step Breakdown

Let’s take a simplified example of using the Naive Bayes classifier to detect spam. We have set up a model to differentiate between spam and regular emails. The algorithm is trained ( it can classify ) and receives the following email :

« Dear John, would you be available for lunch with friends tomorrow ? Don’t bring any money, it’s Lucas’s turn to treat ».

It identifies 25 other emails in its training set that contain the words « Dear, Lunch, Friends, Money », each of which already has a label – either regular email ( 17 ) or spam ( 8 ).

Each word is considered a distinct variable, and the records are the emails :

x_1 = Dear (presence or absence of the word "Dear") ;
x_2 = Friends (presence or absence of the word "Friends") ;
x_3 = Lunch (presence or absence of the word "Lunch") ;
x_4 = Money (presence or absence of the word "Money") ;

Step 1 : Estimate the individual conditional probability for each predictor belonging to classes C_1, \dots, C_m.

This involves defining the individual probability of belonging to a class, which is a conditional probability, i.e. the probability of event A given that event B has occurred : P(A|B).

In our case, we look at the probability of each predictor x_1, x_2, x_3, x_4 given classes C_1 and C_2, i.e. P(x_j | C_1) and P(x_j | C_2).

\frac{P(C_1) [\textcolor{blue}{P(x_1 | C_1)} \textcolor{green}{P(x_2 | C_1)} \textcolor{orange}{P(x_3 | C_1)} \textcolor{red}{P(x_4 | C_1)}]}{P(C_1)[P(x_1 | C_1) P(x_2 | C_1) P(x_3 | C_1) P(x_4 | C_1)] + P(C_2)[P(x_1 | C_2) P(x_2 | C_2) P(x_3 | C_2) P(x_4 | C_2)]}

[ Figure : Bayes example ]

P(x_j | C_1) – total 17 regular emails :

P(Dear | Email) = 8/17 = 0.47 ;
P(Friends | Email) = 5/17 = 0.29 ;
P(Lunch | Email) = 3/17 = 0.18 ;
P(Money | Email) = 1/17 = 0.06 ;

Now, let’s look at the probability of each predictor given class C_2, i.e. P(x_j | C_2).

\frac{P(C_1) [P(x_1 | C_1) P(x_2 | C_1) P(x_3 | C_1) P(x_4 | C_1)]}{P(C_1)[P(x_1 | C_1) P(x_2 | C_1) P(x_3 | C_1) P(x_4 | C_1)] + P(C_2)[\textcolor{blue}{P(x_1 | C_2)} \textcolor{green}{P(x_2 | C_2)} \textcolor{orange}{P(x_3 | C_2)} \textcolor{red}{P(x_4 | C_2)}]}

[ Figure : Bayes example ]

P(x_j | C_2) – total 8 spams :

P(Dear | Spam) = 2/8 = 0.25 ;
P(Friends | Spam) = 1/8 = 0.125 ;
P(Lunch | Spam) = 1/8 = 0.125 ;
P(Money | Spam) = 4/8 = 0.50 ;
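
These individual conditional probabilities can be recomputed in a couple of lines of Python ( the word counts simply restate the fractions listed above ) :

# Word counts among the 17 regular emails and the 8 spams, matching the fractions above
email_counts = {"Dear": 8, "Friends": 5, "Lunch": 3, "Money": 1}
spam_counts = {"Dear": 2, "Friends": 1, "Lunch": 1, "Money": 4}

p_word_given_email = {word: count / 17 for word, count in email_counts.items()}
p_word_given_spam = {word: count / 8 for word, count in spam_counts.items()}

print(p_word_given_email)  # {'Dear': 0.47..., 'Friends': 0.29..., 'Lunch': 0.17..., 'Money': 0.05...}
print(p_word_given_spam)   # {'Dear': 0.25, 'Friends': 0.125, 'Lunch': 0.125, 'Money': 0.5}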

Step 2 : Estimate the class conditional probability.

This involves multiplying the individual conditional probabilities for each predictor belonging to classes C_i.

\frac{P(C_1) [\textcolor{orange}{P(x_1 | C_1) P(x_2 | C_1) P(x_3 | C_1) P(x_4 | C_1)}]}{P(C_1)[\textcolor{orange}{P(x_1 | C_1) P(x_2 | C_1) P(x_3 | C_1) P(x_4 | C_1)}] + P(C_2)[\textcolor{blue}{P(x_1 | C_2) P(x_2 | C_2) P(x_3 | C_2) P(x_4 | C_2)}]}

[ P(Dear | Email) P(Friends | Email) P(Lunch | Email) P(Money | Email) ]
[ P(Dear | Spam) P(Friends | Spam) P(Lunch | Spam) P(Money | Spam) ]

Predictor      P(Predictor | Email)         P(Predictor | Spam)
x_1 Dear       0.47                         0.25
x_2 Friends    0.29                         0.125
x_3 Lunch      0.18                         0.125
x_4 Money      0.06                         0.50
Product        0.47 x 0.29 x 0.18 x 0.06    0.25 x 0.125 x 0.125 x 0.50
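
Evaluating the two products gives 0.47 * 0.29 * 0.18 * 0.06 = 0.00147204 for regular emails and 0.25 * 0.125 * 0.125 * 0.50 = 0.001953125 for spam ; these two values are reused in step 5.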

Step 3 : Estimate the proportion of records belonging to classes C_1 and C_2.

We must consider the proportion of each class value relative to all class values. In other words, what is the proportion of spam relative to all received messages ( spam & regular emails ) ? This ensures that the proportions are measured on the same scale.

\frac{\textcolor{red}{P(C_1)} [P(x_1 | C_1) P(x_2 | C_1) P(x_3 | C_1) P(x_4 | C_1)]}{\textcolor{red}{P(C_1)}[P(x_1 | C_1) P(x_2 | C_1) P(x_3 | C_1) P(x_4 | C_1)] + \textcolor{orange}{P(C_2)}[P(x_1 | C_2) P(x_2 | C_2) P(x_3 | C_2) P(x_4 | C_2)]}

Out of 25 emails, 17 were regular emails and 8 were spam :

  • Proportion of regular emails : P(C_1) = 17/25 = 0.68, or 68 % ;
  • Proportion of spam : P(C_2) = 8/25 = 0.32, or 32 % ;

Step 4 : Repeat steps 1, 2, and 3 for all classes. ( Already done in our example ).

Step 5 : Apply the complete Naive Bayes classifier formula for all classes

\frac{P(C_1) [P(x_1 | C_1) P(x_2 | C_1) P(x_3 | C_1) P(x_4 | C_1)]}{P(C_1)[P(x_1 | C_1) P(x_2 | C_1) P(x_3 | C_1) P(x_4 | C_1)] + P(C_2)[P(x_1 | C_2) P(x_2 | C_2) P(x_3 | C_2) P(x_4 | C_2)]}
\frac{\textcolor{green}{0.68 * [0.47 * 0.29 * 0.18 * 0.06]}}{\textcolor{green}{0.68 * [0.47 * 0.29 * 0.18 * 0.06]} + \textcolor{red}{0.32 * [0.25 * 0.125 * 0.125 * 0.50]}}

Email :

\frac{0.0010009872}{0.0010009872 + 0.000625} = \frac{0.0010009872}{0.0016259872} = 61.56\%

Spams :

1 - \frac{0.0010009872}{0.0010009872 + 0.000625} = 1 - \frac{0.0010009872}{0.0016259872} = 38.44\%

Step 6 : Assign the record to the class with the highest probability

\hat{y} = Email
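
The full calculation can be checked in a few lines of Python, using the rounded probabilities from the steps above :

from math import prod

# Class proportions and individual conditional probabilities from the steps above
p_email, p_spam = 0.68, 0.32
p_words_given_email = [0.47, 0.29, 0.18, 0.06]
p_words_given_spam = [0.25, 0.125, 0.125, 0.50]

numerator_email = p_email * prod(p_words_given_email)  # 0.0010009872
numerator_spam = p_spam * prod(p_words_given_spam)     # 0.000625
denominator = numerator_email + numerator_spam         # 0.0016259872

print(f"P(Email | words) = {numerator_email / denominator:.4f}")  # ~0.6156
print(f"P(Spam | words)  = {numerator_spam / denominator:.4f}")   # ~0.3844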

Technique Specifics

It is important to note that this technique relies on the assumption that the variables are all conditionally independent of each other, which is rarely the case in practice.

However, its advantage is that it requires little data, and its calculation process is relatively simple. The Naive Bayes classifier is often used for natural language processing (text classification, sentiment analysis, etc.).

Python Code : Naive Bayes Classifier

# Step 1: Import necessary libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix

# Step 2: Load the dataset with a filter on two categories
# These categories represent texts about baseball and space
categories = ['rec.sport.baseball', 'sci.space']
data = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

# Step 3: Explore the dataset
# Print the total number of documents, available categories, and an example document
print(f"Total number of documents: {len(data.data)}")
print(f"Categories: {data.target_names}")
print(f"Example document: {data.data[0]}")

# Step 4: Split the dataset into training and testing sets
# 70% of the data will be used for training, and 30% for testing
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3, random_state=42)

# Step 5: Convert text data into a Bag of Words representation
# This creates a sparse matrix where each row corresponds to a document
# and each column corresponds to a word in the vocabulary
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)

# Step 6: Apply TF-IDF weighting to the Bag of Words matrix
"""
TF-IDF stands for Term Frequency-Inverse Document Frequency, a method used in Natural Language Processing (NLP)
to convert textual data into numerical vectors by assigning weights to words based on their importance.

1. Term Frequency (TF)
TF measures how frequently a word appears in a specific document. It is a local measure, specific to a single document.

TF(t, d) = (Number of occurrences of term t in document d) / (Total number of terms in document d)
Words that appear more frequently in a document have a higher TF value.
Example: If the word "moon" appears 5 times in a document containing 100 words, then:
TF("moon")=5/100 =0.05


2. Inverse Document Frequency (IDF)
IDF measures the importance of a word across the entire corpus. It is a global measure, specific to all documents.

IDF(t, D) = log( N / (1 + n_t) )

N : Total number of documents in the corpus.
n_t : Number of documents containing the term t.
Adding 1 to the denominator avoids division by zero when n_t = 0.

"""

# This adjusts the word frequencies based on their importance across all documents
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

# Step 7: Train the Naive Bayes model
# Use the training data (TF-IDF matrix and corresponding labels)
model = MultinomialNB()
model.fit(X_train_tfidf, y_train)

# Step 8: Prepare the test data
# Transform the test set into the same format as the training set
X_test_counts = vectorizer.transform(X_test)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

# Step 9: Predict labels for the test data
y_pred = model.predict(X_test_tfidf)

# Step 10: Evaluate the model's performance
# Print the confusion matrix and classification report
cm_test = confusion_matrix(y_test, y_pred)
print(cm_test)
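
# Print precision, recall, and F1-score for each category
print(classification_report(y_test, y_pred, target_names=data.target_names))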

# Calculate overall accuracy
accuracy = cm_test.diagonal().sum() / cm_test.sum()
print(f"\nOverall Accuracy: {accuracy * 100:.2f}%")

# Step 11: Predict the class of a new message
# Example: a new, unseen message (the model will assign it to one of the two categories above)
new_message = ["Bitcoin to the moons!!"]


# Transform the new message into a Bag of Words representation
new_message_counts = vectorizer.transform(new_message)

# Apply TF-IDF weighting to the new message
new_message_tfidf = tfidf_transformer.transform(new_message_counts)

# Step 12: Get the probabilities for each class
# This shows how likely the message is to belong to each category
proba = model.predict_proba(new_message_tfidf)

# Step 13: Predict the most likely class for the new message
prediction = model.predict(new_message_tfidf)


# Step 14: Display the results
print(f"\nMessage: {new_message[0]}")
print(f"Predicted Class: {data.target_names[prediction[0]]}")

# Display class probabilities in percentages
print("\nClass Probabilities:")
for i, category in enumerate(data.target_names):
print(f"Probability of '{category}': {proba[0][i] * 100:.2f}%")