A Truly Naive, Naive Bayes Estimator

Introduction

Let me guess, someone used the term “Machine Learning” in a nearby conversation and you scampered over to tell them about this cool O(n) method you implemented to estimate the parameters of a Dirichlet Distribution. No? Me neither… but I’ve been square-pegging that round hole in an attempt to broaden my horizons and get some Machine Learning tools under my belt.

However, everything that I’ve found starts at 10,000 feet and doesn’t really get down to the basics to help me understand what’s going on under the hood. For some of you, that might not be a big deal, but I frankly have a challenging time implementing algorithms in any way other than “the example” if I don’t dig a little deeper.

My most recent nemesis has been the Naive Bayes Classifier, and I couldn’t really find anything that walked me through (starting from the runway) what was really happening. So for both of you that are dealing with a similar situation, here it is: The Naive, Naive Bayes.

The Dataset

To illustrate the concept I’ve chosen a data set that is very straightforward: the Titanic dataset, which can be found on Vincent Arel-Bundock’s website.

Why I’m not Going over the Math

There are about 15 different websites where you can see what Bayes’ Theorem is about. The calculation inside a Naive Bayes classifier is a little different, but it’s simple enough to show with a very basic example while we’re constructing the algorithm.

Quick Look at the Data

Let’s look at the data. First, let’s import things (I’m assuming you’ve downloaded the .csv, have cd-ed into that folder, and are now in IPython).

#imports (used throughout)
import pandas
import numpy

In [1]: titanic_data = pandas.read_csv("titanic.csv", index_col=0)

Let’s explore our data a little

In [2]: titanic_data.head()
Out[2]: 
    survived  age  sex  class
1         1    1    1      1
2         1    1    1      1
3         1    1    1      1
4         1    1    1      1
5         1    1    1      1

Ah, OK, we can see where this is going… we’re going to be given the following information in this .csv (and you can check the docs on the data page as well).

  • survived: lived or croaked
  • age: adult or child
  • sex: yes or no… errrr, I mean male or female
  • class: This is actually 1st class, 2nd class, or 3rd class
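Before mapping anything, it’s worth confirming which integer codes each column actually uses. Here’s a minimal sketch on a toy frame (invented rows standing in for titanic_data; the real frame is coded the same way):

```python
import pandas

# Toy rows standing in for titanic_data
toy = pandas.DataFrame({
    "survived": [1, 1, 0, 0, 0],
    "age":      [1, 1, 1, 0, 1],   # 1 = adult, 0 = child
    "sex":      [1, 0, 1, 1, 0],   # 1 = male, 0 = female
    "class":    [1, 2, 3, 3, 2],   # 1st / 2nd / 3rd
})

# Print the set of codes each column uses
for col in toy.columns:
    print(col, sorted(int(v) for v in toy[col].unique()))
```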

So with a Naive Bayes classifier we’re attempting to classify a binary outcome, in this case whether they survived or not — so this data works well. We’ll use the different attributes to determine a likelihood of “whether someone survived.” (Of course, we already know that, but it gives you a sense of how you’d apply this on your own.) BTW, there’s a nifty little way to finagle Naive Bayes into multiclass classification called “One-vs-All”.
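That One-vs-All trick reduces a K-class problem to K binary ones: train one “class c vs. everyone else” model per class and pick the highest score. Here’s a deliberately toy sketch (the “classifier” just memorizes the positive rate; `train_binary` and `score` are hypothetical stand-ins, and any real binary scorer slots in the same way):

```python
def train_binary(xs, ys):
    # toy "model": just the positive rate of the relabeled data
    return sum(ys) / float(len(ys))

def score(model, x):
    # toy scorer: ignores x entirely
    return model

def one_vs_all_predict(xs, labels, x_new, classes):
    scores = {}
    for c in classes:
        ys = [1 if lab == c else 0 for lab in labels]  # relabel: c vs. rest
        scores[c] = score(train_binary(xs, ys), x_new)
    return max(scores, key=scores.get)  # highest binary score wins

labels = [1, 1, 2, 3, 3, 3]
print(one_vs_all_predict([0] * 6, labels, 0, classes=[1, 2, 3]))  # prints 3
```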

Making the Data Simple and Readable

I went through the trouble of building lookup tables (plain dicts) so the data was really easy to understand (feel free to implement this on your own without using my solution).

In [3]: index = ['1st class', '2nd class', '3rd class', 'adult', 'child', 'male', 'female']

In [4]: cols = ['survived', 'died']

In [5]: col_map = {'1st class':'class', '2nd class':'class', '3rd class':'class', 
    ...:            'adult':'age', 'child':'age', 'male':'sex', 'female':'sex'}

In [6]: val_map = {'1st class': 1, '2nd class': 2, '3rd class': 3, 
    ...:            'adult': 1, 'child': 0, 'male': 1, 'female': 0}

In [7]: val_to_key = pandas.DataFrame(numpy.array([['child', 'female', ''], 
    ...:                                           ['adult', 'male', '1st class'], 
    ...:                                           ['', '', '2nd class'], 
    ...:                                           ['', '', '3rd class']]), 
    ...:                              index=[0, 1, 2, 3], 
    ...:                              columns=['age', 'sex', 'class'])

#Conditional Probability DataFrame
In [8]: cond_prob_df = pandas.DataFrame(numpy.zeros([7, 2]), columns=cols, index=index)

Now, all that schmancy code did was create a nice, neat DataFrame that makes it easy to understand and interpret what goes where. It’s empty of course (next we’re going to fill it), but take a peek just so it makes sense.

Our “Conditional Probability DataFrame”, cond_prob_df.

In [9]: cond_prob_df.head()
Out[9]: 
            survived  died
1st class         0     0
2nd class         0     0
3rd class         0     0
adult             0     0
child             0     0

Train the Algorithm

Now let’s “train the algorithm.” By the way, in this case, that’s a super fancy phrase for calculating 16 numbers (14 conditional probabilities plus 2 priors)… but it makes us feel better about ourselves, so say it with me: “let’s train that algorithm.”

So here we go. For every single one of these cells, we are going to calculate the conditional probabilities.

By definition:

\mathbb{P}(X = x | Y = y) \triangleq \frac{\mathbb{P}(X \cap Y)}{\mathbb{P}(Y)}

I know what you’re thinking… WTF does that mean? Here’s the basic process:

Let’s take an example: say the passenger was an adult (age = 1) and survived (survived = 1). In this case, we:

  1. Calculate the number of times that both age = 1 and that same person survived, and divide it by the total number in our sample, regardless of whether they survived (that’s \mathbb{P}(X \cap Y), the numerator)
  2. Calculate the number of times someone survived in our sample and divide it by the total number in our sample (that’s \mathbb{P}(Y), the denominator).

And that’s it, just doing those two simple steps gives us \mathbb{P}(X = x | Y = y).
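Those two steps, worked on a made-up five-row sample:

```python
# Five toy (age, survived) rows: 1 = adult, 1 = survived.
rows = [(1, 1), (1, 1), (0, 1), (1, 0), (1, 0)]
n = float(len(rows))

# Step 1: the numerator, P(adult AND survived), counted over everyone.
p_joint = sum(1 for age, s in rows if age == 1 and s == 1) / n   # 2/5

# Step 2: the denominator, P(survived).
p_surv = sum(1 for _, s in rows if s == 1) / n                   # 3/5

# P(adult | survived): 2 of the 3 survivors were adults.
print(p_joint / p_surv)   # roughly 0.667
```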

So here’s my fancy schmancy code to do that (again, legibility was my primary concern here).

Calculate the Conditional Probabilities

#let's train on the first 1,000 data points
In [10]: train = titanic_data.loc[:1000, :]

In [11]: n = len(train)

In [12]: p_survived = (train['survived'] == 1).sum() / float(n)

In [13]: p_died = (train['survived'] == 0).sum() / float(n)

In [14]: for attr in index:
    ....:     #P(attribute | survived) = P(attribute AND survived) / P(survived)
    ....:     cond_prob_df.loc[attr, 'survived'] = ((train[col_map[attr]] == val_map[attr])
    ....:         & (train['survived'] == 1)).sum() / float(n) / p_survived
    ....:     cond_prob_df.loc[attr, 'died'] = ((train[col_map[attr]] == val_map[attr])
    ....:         & (train['survived'] == 0)).sum() / float(n) / p_died
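As an aside, pandas can build this kind of table for you. Here’s a sketch on a toy frame using crosstab with normalize='columns', which produces exactly these column-conditional probabilities:

```python
import pandas

# Toy frame standing in for the Titanic columns.
toy = pandas.DataFrame({
    "sex":      [1, 1, 0, 0, 1, 0],   # 1 = male, 0 = female
    "survived": [0, 1, 1, 1, 0, 1],
})

# normalize='columns' divides each count by its column total, i.e.
# P(sex | survived-status); each column sums to 1.
table = pandas.crosstab(toy["sex"], toy["survived"], normalize="columns")
print(table)
```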

We can look at our DataFrame and see that we’ve now computed the conditional probabilities:

In [15]: cond_prob_df
Out[15]: 
           survived      died
1st class  0.512626  0.201987
2nd class  0.297980  0.276490
3rd class  0.189394  0.521523
adult      0.924242  1.000000
child      0.075758  0.000000
male       0.409091  0.971854
female     0.590909  0.028146
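A quick sanity check on a table like this: within each column, the probabilities for one attribute (the three classes, the two ages, the two sexes) have to sum to 1, since every survivor was in exactly one class, one age group, and one sex. Using the “survived” column above:

```python
# Conditional probabilities from the 'survived' column of the table above.
survived = {'1st class': 0.512626, '2nd class': 0.297980, '3rd class': 0.189394,
            'adult': 0.924242, 'child': 0.075758,
            'male': 0.409091, 'female': 0.590909}

# Each attribute's values must account for all survivors exactly once.
groups = [['1st class', '2nd class', '3rd class'],
          ['adult', 'child'],
          ['male', 'female']]

for g in groups:
    print(g, round(sum(survived[k] for k in g), 4))  # each prints 1.0
```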

Using The “Trained Algorithm”

Now we can take the conditional probabilities that we’ve calculated and estimate: given that someone was an adult or a child, male or female, and so on, what is the probability they survived?

Did you get that part? That’s really the crux of the concept, because when we’re using Naive Bayes “for realz” we are likely trying to estimate something that we don’t have answers to (yet).

So to estimate a new data point (let’s take, for example, the 1,000th poor soul on the Titanic), here’s what we see (I’ve cut off whether they survived or not to drive home the point):

In [16]: new_row = titanic_data.loc[1000, :]

In [17]: new_row
Out[17]: 
age         1
sex         1
class       3
Name: 1000, dtype: int64

Ah, so now, given that this person was an adult, was also male, and was in 3rd class, is it more likely that they survived or died, based on the prior data?
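One note on why the scores we compute are log scores: multiplying lots of small probabilities underflows to zero in floating point, while summing their logs stays numerically safe (and log is monotone, so the ranking between classes is unchanged). A tiny demonstration:

```python
import numpy

# 80 factors of 1e-5: the raw product underflows double precision,
# but the sum of logs is a perfectly ordinary negative number.
probs = [1e-5] * 80
product = numpy.prod(probs)
log_sum = sum(numpy.log(p) for p in probs)

print(product)   # 0.0: underflowed
print(log_sum)   # roughly -921, still comparable between classes
```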

Here’s a quick little ditty I wrote to print out all the important information…

In [18]: survived = []

In [19]: died = []

In [20]: for ind in new_row.index:
    ....:     if ind != 'survived':
    ....:         row = val_to_key.loc[new_row[ind], ind]
    ....:         survived.append(cond_prob_df.loc[row, 'survived'])
    ....:         died.append(cond_prob_df.loc[row, 'died'])


#score each outcome: log(prior) plus the sum of log conditional probabilities
print("Log score for survived")
print(numpy.log(p_survived) + sum(numpy.log(p) for p in survived))

print("Log score for died")
print(numpy.log(p_died) + sum(numpy.log(p) for p in died))

print("Actually, they", "survived" if new_row['survived'] == 1 else "died")

Running that, the log score for “died” comes out well above the log score for “survived.” So there we have it: based on our Naive, Naive Bayes estimator, this person didn’t make it… and as it turns out… they didn’t.

If you’re looking for more depth (i.e., a little more quant-ish beefiness), the following are, I think, great places to start:

  1. Datum Box Tutorial of Naive Bayes
  2. Andrew Moore’s Tutorial… go Tartans!

As always, criticism, comments, and consideration are greatly appreciated!
