Naive Bayes Algorithm
Naive Bayes is one of the simplest machine learning algorithms. It is a supervised learning algorithm.
Naive Bayes is a classification algorithm and is fast to train and to predict with. It is based on Bayes' theorem of probability.
It is called ‘naive’ because the algorithm assumes that all attributes are independent of each other, given the class.
The Naive Bayes algorithm is commonly used in text classification with multiple classes.
To understand how the Naive Bayes algorithm works, it is important to understand Bayes' theorem. Let’s work through an example to derive it.
Let’s assume there is a type of cancer that affects 1% of a population. The test for the cancer correctly detects the presence of cancer 90% of the time, so it misses the remaining 10% of cases. The test also gives a correct negative result 90% of the time; the remaining 10% of the time it reports cancer when there is none. With these probabilities in place, what are the chances that a person actually has cancer when they get a positive result from the test?
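A simple way to work through this question is to take some nice round numbers and calculate values. For example, in a population of 1,000 people, 10 have cancer and 9 of them test positive; of the 990 people without cancer, 99 test positive.
When a person gets a positive result from the test, the probability that the person actually has cancer =
Probability of a true positive / (Probability of a true positive + Probability of a false positive) = 9 / (9 + 99) = 8.33%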
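The round-number reasoning can be sketched in a few lines of Python. The population size of 1,000 is an illustrative assumption chosen to keep the counts round; the rates are the ones stated in the example.

```python
# Illustrative population size (an assumption chosen to give round numbers).
population = 1000

prevalence = 0.01           # 1% of people have the cancer
true_positive_rate = 0.90   # test is positive for 90% of people with cancer
false_positive_rate = 0.10  # test is positive for 10% of people without cancer

with_cancer = population * prevalence        # about 10 people
without_cancer = population - with_cancer    # about 990 people

true_positives = with_cancer * true_positive_rate       # about 9 people
false_positives = without_cancer * false_positive_rate  # about 99 people

# Share of positive tests that belong to people who actually have cancer.
p_cancer_given_positive = true_positives / (true_positives + false_positives)
print(f"{p_cancer_given_positive:.2%}")  # 8.33%
```

Even with a test that is right 90% of the time, most positive results come from the much larger group of people without cancer.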
Now let’s convert this into Bayes' theorem.
P( c|x ) = P( x|c ) P( c ) / ( P( x|c ) P( c ) + P( x|not c ) P( not c ) )
P( c|x ) = Probability of having cancer (c) given the test (x) is positive = 8.33% in our example
P( x|c ) = Probability of getting a positive test (x) given that you have cancer (c) = True positive rate = 90%
P( c ) = Chance of having cancer = 1%
P( x|not c ) = Probability of getting a positive test (x) given that you do not have cancer (c) = False positive rate = 10%
P( not c ) = Chance of not having cancer = 99%
In a simpler form, the denominator can be called P( x ): the probability of the test being positive, whether that positive is true or false.
Rewriting the equation,
P( c|x ) = P( x|c ) P( c ) / P( x )
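As a minimal sketch, Bayes' theorem in the form P( c|x ) = P( x|c ) P( c ) / P( x ) can be written as a small Python function and checked against the cancer example. The function name `posterior` and the variable names are hypothetical, chosen only for this illustration.

```python
def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(c|x) = P(x|c) * P(c) / P(x)."""
    return likelihood * prior / evidence


p_x_given_c = 0.90      # P(x|c): positive test given cancer
p_c = 0.01              # P(c): having cancer
p_x_given_not_c = 0.10  # P(x|not c): positive test given no cancer
p_not_c = 1 - p_c       # P(not c): not having cancer

# P(x): probability of a positive test, true or false.
p_x = p_x_given_c * p_c + p_x_given_not_c * p_not_c

p_c_given_x = posterior(p_x_given_c, p_c, p_x)
print(round(p_c_given_x, 4))  # 0.0833
```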
Let’s work through another example with this formula.
Here is some data for when a person, say Joe, plays tennis.
Now let’s validate the statement: when the temperature is mild, Joe will play tennis. Is this statement true?
What we need is the probability that Joe will play tennis given the temperature is mild, i.e., P(Joe Plays | Mild Temperature)
By Bayes' theorem, this is P(Mild Temperature | Joe Plays) P(Joe Plays) / P(Mild Temperature)
(4/9) * (0.64) / (0.43) ≈ 0.66
When the temperature is mild, there is a good probability (roughly two in three) that Joe will play tennis.
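The same calculation can be checked in Python. The first part uses only the three values quoted above. The second part assumes that 0.64 and 0.43 are the rounded forms of 9/14 and 6/14; those counts are an assumption made for illustration, not figures taken from the data.

```python
from fractions import Fraction

# Values quoted in the text.
p_mild_given_play = Fraction(4, 9)  # P(Mild Temperature | Joe Plays)
p_play = 0.64                       # P(Joe Plays)
p_mild = 0.43                       # P(Mild Temperature)

p_play_given_mild = float(p_mild_given_play) * p_play / p_mild
print(round(p_play_given_mild, 2))  # 0.66

# Assumption: the rounded values come from 9/14 and 6/14.
# With exact fractions the rounding error disappears.
exact = Fraction(4, 9) * Fraction(9, 14) / Fraction(6, 14)
print(exact)  # 2/3
```

Working with exact fractions avoids the small drift that comes from rounding the prior and the evidence to two decimal places.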
Naive Bayes in Python
Let’s expand this example and build a Naive Bayes model in Python.
The first step is to import all necessary libraries.
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
Now load the CSV data file using the pandas read_csv method.
Here is the CSV file used in this post.
play_tennis = pd.read_csv("PlayTennis.csv")
Our data contains details about the weather outlook, temperature, humidity and wind conditions. The last column is the target variable, which indicates whether tennis is played.
In this example we use the Python library scikit-learn (sklearn) to create a model and make predictions. scikit-learn requires the features to be numerical arrays, so we need to convert the categorical information in our data into numbers.
There are multiple ways of doing this; we will keep it simple and use a LabelEncoder for this example.
A LabelEncoder converts categorical data into numbers ranging from 0 to n-1, where n is the number of classes in the variable.
For example, in the case of Outlook, there are 3 classes: Overcast, Rain and Sunny. These are represented as 0, 1 and 2, in alphabetical order.
number = LabelEncoder()
# Encode each categorical column as integers, one column at a time.
play_tennis['Outlook'] = number.fit_transform(play_tennis['Outlook'])
play_tennis['Temperature'] = number.fit_transform(play_tennis['Temperature'])
play_tennis['Humidity'] = number.fit_transform(play_tennis['Humidity'])
play_tennis['Wind'] = number.fit_transform(play_tennis['Wind'])
play_tennis['Play Tennis'] = number.fit_transform(play_tennis['Play Tennis'])
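To see the mapping a LabelEncoder produces, here is a small standalone sketch. The list of outlook values is made up for the illustration and is not read from the CSV file.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample values for illustration only.
outlook = ["Sunny", "Overcast", "Rain", "Sunny", "Rain"]

encoder = LabelEncoder()
codes = encoder.fit_transform(outlook)

# classes_ holds the sorted labels; a label's position is its integer code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)         # {'Overcast': 0, 'Rain': 1, 'Sunny': 2}
print(codes.tolist())  # [2, 0, 1, 2, 1]

# A fitted encoder can translate codes back into labels.
print(encoder.inverse_transform([0, 2]).tolist())  # ['Overcast', 'Sunny']
```

Note that refitting a single encoder on several columns in turn means it only remembers the last column it saw. If you want to decode values later, one option is to keep a separate encoder per column.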
Now we are ready to create a model.
Let’s define the features and the target variables.
features = ["Outlook", "Temperature", "Humidity", "Wind"]
target = "Play Tennis"
To validate the performance of our model, we create a train/test split. We build the model using the train dataset and validate it on the test dataset.
We use scikit-learn’s train_test_split to do this.
features_train, features_test, target_train, target_test = train_test_split(
    play_tennis[features],
    play_tennis[target],
    test_size=0.33,
    random_state=54,
)
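To get a feel for what test_size = 0.33 means in practice, the sketch below splits a hypothetical 14-row table. The row count and the column names `feature` and `target` are assumptions made for this illustration, not properties of the CSV file.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 14 rows, one feature column, one target column.
toy = pd.DataFrame({"feature": range(14), "target": [0, 1] * 7})

X_train, X_test, y_train, y_test = train_test_split(
    toy[["feature"]], toy["target"], test_size=0.33, random_state=54
)

# The test size is rounded up to a whole number of rows; the rest is training data.
print(len(X_train), len(X_test))  # 9 5
```

Fixing random_state makes the shuffle reproducible, so the same rows land in the train and test sets on every run.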
Let’s create the model now.
model = GaussianNB()
model.fit(features_train, target_train)
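As a rough sketch of what fitting and querying the model looks like, the example below trains a GaussianNB on a handful of made-up, already label-encoded rows and asks for class probabilities. The rows, the labels and the names `toy_model` and `new_day` are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical label-encoded rows: [Outlook, Temperature, Humidity, Wind].
X = np.array([
    [2, 1, 0, 1],
    [2, 1, 0, 0],
    [0, 1, 0, 1],
    [1, 2, 0, 1],
    [1, 0, 1, 1],
    [1, 0, 1, 0],
    [0, 0, 1, 0],
    [2, 2, 0, 1],
])
# Hypothetical targets: 1 = played, 0 = did not play.
y = np.array([0, 0, 1, 1, 1, 0, 1, 0])

toy_model = GaussianNB()
toy_model.fit(X, y)

new_day = np.array([[0, 2, 1, 1]])
prediction = toy_model.predict(new_day)          # predicted class label
probabilities = toy_model.predict_proba(new_day)  # one probability per class

print(toy_model.classes_)
print(prediction)
print(probabilities)
```

GaussianNB models each feature as a continuous, normally distributed value within each class, so the integer codes are treated as numbers on a scale. scikit-learn also provides CategoricalNB, which may be a closer fit for purely categorical features; that is a possible alternative rather than what this post uses.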
Now we are ready to make predictions on the test features.
We will also measure the performance of the model using the accuracy score.
The accuracy score measures the fraction of predictions that are correct.
pred = model.predict(features_test)
accuracy = accuracy_score(target_test, pred)
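To make the accuracy score concrete, here is a small sketch with hand-written labels. The lists `actual` and `predicted` are made up for the illustration and are not outputs of the model above.

```python
from sklearn.metrics import accuracy_score

# Made-up labels for illustration: 4 of the 5 predictions match.
actual = [1, 0, 1, 1, 0]
predicted = [1, 0, 0, 1, 0]

# Default behaviour: fraction of predictions that are correct.
toy_accuracy = accuracy_score(actual, predicted)
print(toy_accuracy)  # 0.8

# With normalize=False the function returns the count of correct predictions.
correct = int(accuracy_score(actual, predicted, normalize=False))
print(correct)  # 4
```

With a test set of only a few rows, each individual prediction moves the accuracy by a large step, so the score is best read as a rough indication.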