
Introduction to Naive Bayes Classifier using R and Python

The Naive Bayes Classifier is one of the simplest Machine Learning algorithms to implement, so it is often the first classifier taught to students. However, most tutorials are somewhat incomplete and do not provide the right understanding. Hence, in this Introduction to Naive Bayes Classifier using R and Python tutorial we will learn this simple but useful concept. Bayesian Modeling is the foundation of many important statistical concepts such as Hierarchical Models (Bayesian networks), Markov Chain Monte Carlo and so on.

The Naive Bayes Classifier is a special, simplified case of Bayesian networks where we assume that each feature value is independent of the others given the class. Hierarchical Models can be used to define the dependency between features, and we can build much more complex and accurate models using JAGS, BUGS or Stan (which is out of scope for this tutorial).

This tutorial expects you to already know Bayes' Theorem and to have some understanding of Gaussian distributions.

Say we have a dataset and the classes (label/target) associated with each data point. For example, if we consider the Iris dataset with only 2 types of flower, Versicolor and Virginica, then the feature vector \( X \) will contain 4 features: petal length, petal width, sepal length and sepal width. Versicolor and Virginica will be the class \( Y \) of each sample. Using the training data, we would like to build our Naive Bayes Classifier so that, given any unlabeled data, we are able to classify the flower correctly.

We can write Bayes' Theorem as follows, where X is the feature vector and Y is the output class/target variable.

$$ p(Y|X) = \frac{p(X|Y)\,p(Y)}{p(X)} $$

As you already know, the names of each of these probabilities are:

$$ \text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{marginal}} $$

We will now use the above Bayes' Theorem to come up with the Bayes Classifier.

Simplify the Posterior Probability:

Say we have only two classes, 0 and 1 [0 = Versicolor, 1 = Virginica]. Our objective is then to find the values of \( p(Y=0|X) \) and \( p(Y=1|X) \); whichever probability is larger, we predict that the data belongs to that class.

We can define that mathematically as:

$$ \arg\max_c \; p(y_c|X) $$

Simplify the Likelihood Probability:

By saying Naive, we have assumed that each feature is independent. We can then define the likelihood as the product of the probabilities of each of the features given the class.

$$
\begin{aligned}
p(X|Y) &= p(x_1|y_c)\,p(x_2|y_c) \dots p(x_n|y_c) \\
&= \prod_{i=1}^n p(x_i|y_c) \\
&\text{where } y_c \text{ is any specific class, 0 or 1}
\end{aligned}
$$

Simplify the Prior Probability:

We can define the prior as \( \frac{m_c}{m} \), where \( m_c \) is the number of samples for the class \( c \) and \( m \) is the total number of samples in our dataset.
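As a minimal sketch of the prior, the ratio \( m_c / m \) can be computed directly from the label array. The function name `class_priors` and the toy labels are illustrative assumptions, not part of the original code.

```python
import numpy as np

def class_priors(y):
    """Return {class_label: m_c / m} for a 1-D array of integer labels."""
    y = np.asarray(y)
    labels, counts = np.unique(y, return_counts=True)
    return {int(label): float(count / y.size) for label, count in zip(labels, counts)}

# Hypothetical labels: three samples of class 0 and one sample of class 1.
priors = class_priors([0, 0, 1, 0])
print(priors)  # {0: 0.75, 1: 0.25}
```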

Simplify the Marginal Probability:

The Marginal Probability is not really useful to us since it does not depend on Y and is therefore the same for all the classes. So we can write the posterior in the following way:

$$
\begin{aligned}
p(Y|X) &= k \, p(X|Y)\,p(Y) \\
&\propto p(X|Y)\,p(Y) \\
&\text{where } k = \frac{1}{p(X)} \text{ is some constant}
\end{aligned}
$$

The constant k can be dropped during implementation since it is the same for all the classes and does not change which class has the largest value.

Final Equation:

The final equation looks like the following:

$$
\begin{aligned}
\text{prediction} &= \arg\max_c \; p(y_c|X) \\
&= \arg\max_c \prod_{i=1}^n p(x_i|y_c) \, p(y_c)
\end{aligned}
$$

However, the product of many small probabilities may create numerical issues such as underflow. We will use the log scale in our implementation; since log is a monotonic function, the argmax gives the same result.

$$
\begin{aligned}
\text{prediction} &= \arg\max_c \bigg( \sum_{i=1}^n \log p(x_i|y_c) + \log p(y_c) \bigg) \\
&= \arg\max_c \bigg( \sum_{i=1}^n \log p(x_i|y_c) + \log \frac{m_c}{m} \bigg)
\end{aligned}
$$
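The numerical issue can be illustrated with a small sketch. The likelihood values below are made up (200 hypothetical features with tiny probabilities); the point is only that the raw product underflows to zero for both classes, while the sum of logs still separates them.

```python
import numpy as np

# Hypothetical per-feature likelihoods p(x_i | y_c) for two classes.
rng = np.random.default_rng(0)
lik_class0 = rng.uniform(1e-4, 1e-3, size=200)
lik_class1 = lik_class0 * 0.9  # class 1 is slightly less likely on every feature
priors = np.array([0.5, 0.5])

# Direct product: both values underflow to 0.0, so argmax cannot tell them apart.
products = np.array([lik_class0.prod(), lik_class1.prod()]) * priors

# Log scale: the sum of logs plus the log prior stays finite.
log_scores = np.array([np.log(lik_class0).sum(), np.log(lik_class1).sum()]) + np.log(priors)

prediction = int(np.argmax(log_scores))
print(products, log_scores, prediction)
```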

Believe it or not, we are done defining our Naive Bayes Classifier. There is just one thing pending: we need to define a model to calculate the likelihood.

There are different ways to do this, and the choice really depends on the features.

Discrete Variable:

In case the features are discrete variables, we can define the likelihood simply from the probability of each feature. For example, if we are creating a classifier to detect spam emails and we have three words (discount, offer and dinner) as our features, then we can define our likelihood as:

$$
\begin{aligned}
p(X|Y=\text{spam}) &= p(\text{discount}=\text{yes}|\text{spam}) \times p(\text{offer}=\text{yes}|\text{spam}) \times p(\text{dinner}=\text{no}|\text{spam}) \\
&= (10/15) \times (7/15) \times (1/15)
\end{aligned}
$$

You can then calculate the prior and easily classify the data using the final equation we have.
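The arithmetic of the spam likelihood above can be checked with exact fractions. This sketch only reproduces the three factors given in the example; the prior and the non-spam counts are not stated in the text, so they are left out.

```python
from fractions import Fraction

# Counts read off the worked example: out of 15 spam emails,
# discount=yes in 10, offer=yes in 7, dinner=no in 1.
n_spam = 15
counts = {"discount=yes": 10, "offer=yes": 7, "dinner=no": 1}

likelihood = Fraction(1)
for feature, count in counts.items():
    likelihood *= Fraction(count, n_spam)

print(likelihood, float(likelihood))
```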

Note: In exams this often comes as a problem to solve by hand.

Continuous Variable:

In case our features are continuous (like we have in our iris dataset) we have two options:

  • Quantize the continuous values and use them as categorical variables.
  • Define a distribution and model the likelihood using it.
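For the first option, a very simple form of quantization is fixed binning. The sketch below uses `np.digitize` with hand-picked bin edges and made-up petal lengths; it is plain binning for illustration, not vector quantization.

```python
import numpy as np

# Hypothetical petal lengths (cm) and hand-picked bin edges.
petal_length = np.array([3.0, 4.2, 4.9, 5.6, 6.7])
bin_edges = np.array([4.0, 5.0, 6.0])

# np.digitize returns the index of the bin each value falls into:
# 0 for x < 4.0, 1 for 4.0 <= x < 5.0, 2 for 5.0 <= x < 6.0, 3 for x >= 6.0.
bins = np.digitize(petal_length, bin_edges)
print(bins)  # [0 1 1 2 3]
```

The resulting bin indices can then be treated like any other categorical feature.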

I will talk about vector quantization in a future video; for now, let's look more into the 2nd option.

If you plot any feature \( x_1 \), the distribution might look Normal/Gaussian, hence we can use the normal distribution to define our likelihood. For simplicity, assume we have only one feature; if we plot the data for both the classes, it might look like the following:

[Figure: distribution of a single feature plotted for both classes]

In the above case, any new point on the left side will have a higher probability for \( p(x_1|y=0) \) than for \( p(x_1|y=1) \). We can define the probability using the Univariate Gaussian Distribution.

$$ P(x|\mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-(x-\mu)^2 / 2\sigma^2} $$

We can easily estimate the mean and variance from our training data.

So our likelihood will be \( P(x|\mu, \sigma, y_c) \)
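As a rough illustration, the sketch below estimates \( \mu \) and \( \sigma \) per class from a handful of made-up feature values and evaluates the Gaussian density for a new point. The data, the name `gaussian_pdf` and the use of the sample standard deviation (`ddof=1`) are assumptions for this example.

```python
import math
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density P(x | mu, sigma)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical values of one feature x_1 for each class.
x1_class0 = np.array([4.0, 4.2, 4.5, 4.7, 4.1])
x1_class1 = np.array([5.5, 5.8, 6.0, 5.1, 5.6])

# Estimate mu and sigma per class from the training data.
mu0, sigma0 = x1_class0.mean(), x1_class0.std(ddof=1)
mu1, sigma1 = x1_class1.mean(), x1_class1.std(ddof=1)

# A new point on the left side is more likely under class 0.
x_new = 4.3
lik0 = gaussian_pdf(x_new, mu0, sigma0)
lik1 = gaussian_pdf(x_new, mu1, sigma1)
print(lik0 > lik1)  # True
```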

Important Note:

Now, you might be tempted to plot the features and, in case one looks like an exponential distribution, use the exponential distribution to define the likelihood. I need to tell you that you shouldn't do anything like that. There are several reasons:

  • Limited data may not reveal the correct distribution, hence the prediction can be wrong.
  • We don't really need to match the distribution exactly to the data; as long as we can separate the classes, our classifier will work well.

So we mostly use the Gaussian distribution for continuous variables (and the Bernoulli distribution for binary ones).

Enough of theory; let's now actually build the classifier using Python from scratch.

First let's understand the structure of our NaiveBayes class.

We will be using the seaborn package just to access the iris data without writing any code. Then we will define the skeleton of the NaiveBayes class.

In this example we will work on binary classification, hence we won't use the setosa flower type. There will be only 100 samples. We will convert the class to a numeric value in lines 26-27.
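A minimal sketch of this preparation step is shown below. The helper name `prepare_binary_iris` is hypothetical, and the column names are those of seaborn's iris dataset; the small stand-in frame is only there so the example runs without downloading anything.

```python
import pandas as pd

def prepare_binary_iris(iris):
    """Drop setosa and encode versicolor -> 0, virginica -> 1."""
    subset = iris[iris["species"] != "setosa"].copy()
    subset["species"] = subset["species"].map({"versicolor": 0, "virginica": 1})
    X = subset.drop(columns="species").to_numpy(dtype=float)
    y = subset["species"].to_numpy(dtype=int)
    return X, y

# Usage with seaborn (downloads the dataset on first call):
#   import seaborn as sns
#   X, y = prepare_binary_iris(sns.load_dataset("iris"))

# Tiny stand-in frame with the same column names as seaborn's iris data.
demo = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "sepal_width": [3.5, 3.2, 3.3],
    "petal_length": [1.4, 4.7, 6.0],
    "petal_width": [0.2, 1.4, 2.5],
    "species": ["setosa", "versicolor", "virginica"],
})
X_demo, y_demo = prepare_binary_iris(demo)
print(X_demo.shape, y_demo)
```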

In order to get a better estimate of our classifier's accuracy, we will run the classification 100 times on randomly split data and then average the results. Hence we have the loop, and inside the loop we split the data into train/test sets.
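The repeated random split could be sketched as follows. The 20% test fraction, the function names and the placeholder `baseline` scorer are assumptions; in the real loop the scorer would fit the classifier on the train part and return its accuracy on the test part.

```python
import numpy as np

def random_split(X, y, test_fraction, rng):
    """Shuffle row indices and split into train/test parts."""
    m = X.shape[0]
    order = rng.permutation(m)
    n_test = int(round(m * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

def repeated_evaluation(X, y, evaluate, n_runs=100, test_fraction=0.2, seed=0):
    """Average the score returned by `evaluate` over n_runs random splits."""
    rng = np.random.default_rng(seed)
    scores = [evaluate(*random_split(X, y, test_fraction, rng)) for _ in range(n_runs)]
    return float(np.mean(scores))

# Stand-in data and a placeholder scorer that always predicts class 0.
X = np.arange(40, dtype=float).reshape(20, 2)
y = np.array([0, 1] * 10)

def baseline(X_train, y_train, X_test, y_test):
    return float(np.mean(y_test == 0))

print(repeated_evaluation(X, y, baseline))
```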

Finally we will instantiate our class and invoke the fit() function only once.


The fit function won't return any value. Normalization is very important when implementing the NaiveBayes classifier because the scale of the data will impact the prediction. Here we will normalize each feature so that the mean is 0 and the standard deviation is 1.

We will start by calling a function named calculate_mean_sd(), which will calculate and store \( \mu \) and \( \sigma \) as class variables. Then we will call the normalize() function to scale the data.

Next, we need to calculate \( \mu \) and \( \sigma \) for each class. Finally, calculate the prior for each class and save it to a class variable.
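The per-class part of this step might look like the sketch below. It assumes the data has already been normalized, and the function name `fit_class_parameters`, the dictionary layout and the sample standard deviation (`ddof=1`) are illustrative choices rather than the original implementation.

```python
import numpy as np

def fit_class_parameters(X_scaled, y):
    """Per-class feature means, standard deviations and priors.

    X_scaled is assumed to be already normalized; y holds integer labels.
    """
    params = {}
    m = X_scaled.shape[0]
    for c in np.unique(y):
        X_c = X_scaled[y == c]
        params[int(c)] = {
            "mu": X_c.mean(axis=0),
            "sigma": X_c.std(axis=0, ddof=1),
            "prior": X_c.shape[0] / m,  # m_c / m
        }
    return params

# Hypothetical scaled data: two samples of class 0, three of class 1.
X_scaled = np.array([[-1.0, -0.5], [-0.8, -1.2], [0.9, 0.7], [1.1, 1.0], [0.7, 0.4]])
y = np.array([0, 0, 1, 1, 1])
params = fit_class_parameters(X_scaled, y)
print(params[1]["mu"], params[1]["prior"])
```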

Below are the calculate_mean_sd() and normalize() functions.
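A minimal sketch of these two helpers, under the assumption that they store per-feature statistics on the object, is given here. The `Scaler` wrapper class exists only to make the example self-contained; in the tutorial they are methods of the NaiveBayes class.

```python
import numpy as np

class Scaler:
    """Hypothetical holder for the two helpers named in the text."""

    def calculate_mean_sd(self, X):
        # Store the per-feature mean and standard deviation of the training data.
        self.mu = X.mean(axis=0)
        self.sigma = X.std(axis=0, ddof=1)

    def normalize(self, X):
        # Scale with the stored training statistics (reused later for test data).
        return (X - self.mu) / self.sigma

X_train = np.array([[5.0, 1.0], [6.0, 2.0], [7.0, 3.0]])
scaler = Scaler()
scaler.calculate_mean_sd(X_train)
X_scaled = scaler.normalize(X_train)
print(X_scaled)
```

Reusing the training statistics in normalize() is what keeps the test data on the same scale as the data the classifier was fitted on.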


We will pass the test data into the predict() function. First, scale the data using the normalize() function, which uses the \( \mu \) and \( \sigma \) calculated during training.

Next, go through each row and calculate the likelihood by looping through every feature. Keep in mind that this is very inefficient code, since it's not vectorized. In our R code we will see a much faster version.

Python doesn't have a built-in dnorm function to calculate the density of a Normal Distribution, hence we will write our own dnorm() function.

Finally, we compare the two outputs and predict the class based on the larger value.
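The loop described above could be sketched like this. The parameter dictionary layout and the test values are hypothetical, and the Gaussian log-density is written out inline (rather than calling dnorm()) so that the block stands on its own.

```python
import math
import numpy as np

def predict(X_scaled, params):
    """Row-by-row, feature-by-feature prediction (deliberately not vectorized)."""
    predictions = []
    for row in X_scaled:
        best_class, best_score = None, -math.inf
        for c, p in params.items():
            score = math.log(p["prior"])
            for x, mu, sigma in zip(row, p["mu"], p["sigma"]):
                # Log of the Gaussian density, so tiny densities do not underflow.
                score += -math.log(sigma) - 0.5 * math.log(2 * math.pi) - (x - mu) ** 2 / (2 * sigma ** 2)
            # Keep the class with the larger log score.
            if score > best_score:
                best_class, best_score = c, score
        predictions.append(best_class)
    return np.array(predictions)

# Hypothetical fitted parameters for two classes and two features.
params = {
    0: {"mu": [-1.0, -1.0], "sigma": [0.5, 0.5], "prior": 0.5},
    1: {"mu": [1.0, 1.0], "sigma": [0.5, 0.5], "prior": 0.5},
}
X_test = np.array([[-0.9, -1.2], [1.1, 0.8]])
print(predict(X_test, params))  # [0 1]
```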

Here is the dnorm() function.
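One possible version of such a function is sketched below. The argument names (`mean`, `sd`, `log`) are borrowed from R's dnorm() for familiarity; this is an assumption about the interface, not the original listing.

```python
import numpy as np

def dnorm(x, mean=0.0, sd=1.0, log=False):
    """Normal density at x; returns the log-density when log=True."""
    x = np.asarray(x, dtype=float)
    log_density = -0.5 * np.log(2 * np.pi) - np.log(sd) - (x - mean) ** 2 / (2 * sd ** 2)
    return log_density if log else np.exp(log_density)

print(dnorm(0.0))              # about 0.3989
print(dnorm([0.0, 1.0], log=True))
```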


The accuracy() function is very straightforward. Here is the code:
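A minimal sketch of such a function, assuming it simply returns the fraction of correct predictions:

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75
```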


Full Python Code:

I am not going through the complete R code here; it has inline comments instead. Basically it's the same as the Python version. However, here are the two main differences:

  • It uses the built-in dnorm() function.
  • Operations are vectorized.
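To give a feel for what vectorization buys, here is a NumPy analogue of the prediction step that replaces the row and feature loops with broadcasting. It is a sketch in Python, not the R code itself, and the array layout of `mu`, `sigma` and `priors` is an assumption.

```python
import numpy as np

def predict_vectorized(X_scaled, mu, sigma, priors):
    """mu, sigma: (n_classes, n_features); priors: (n_classes,)."""
    # Broadcast to shape (n_samples, n_classes, n_features).
    diff = X_scaled[:, None, :] - mu[None, :, :]
    log_density = (-0.5 * np.log(2 * np.pi)
                   - np.log(sigma)[None, :, :]
                   - diff ** 2 / (2 * sigma[None, :, :] ** 2))
    # Sum log-likelihoods over features and add the log prior per class.
    log_scores = log_density.sum(axis=2) + np.log(priors)[None, :]
    return np.argmax(log_scores, axis=1)

# Hypothetical fitted parameters for two classes and two features.
mu = np.array([[-1.0, -1.0], [1.0, 1.0]])
sigma = np.array([[0.5, 0.5], [0.5, 0.5]])
priors = np.array([0.5, 0.5])
X_test = np.array([[-0.9, -1.2], [1.1, 0.8]])
print(predict_vectorized(X_test, mu, sigma, priors))  # [0 1]
```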

Please find the complete project here: