
Bayes Estimation#

References#

[1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
import json

plt.style.use('fivethirtyeight')

import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

# the stopwords corpus must be available locally; download it once if missing
nltk.download('stopwords', quiet=True)
[2]:
# load the SMS Spam Collection (tab-separated: label, message text)
dataset = pd.read_csv('/opt/datasetsRepo/smsspamcollection/SMSSpamCollection',
                      sep='\t', names=['label', 'text'])
# dataset['flag'] = dataset['label'].map({ "ham" : 0, "spam" : 1})

# draw a balanced sample of 50 spam and 50 ham messages, then shuffle the rows
df = pd.concat([
    dataset.query("label == 'spam'").sample(50),
    dataset.query("label == 'ham'").sample(50)
], axis=0).sample(frac=1, random_state=0)

df.head(3)
[2]:
      label                                               text
3807   spam  URGENT! We are trying to contact you. Last wee...
4788    ham                       Ü thk of wat to eat tonight.
1930   spam  Free 1st week entry 2 TEXTPOD 4 a chance 2 win...
[3]:
# keep runs of word characters, dropping punctuation
tokenizer = RegexpTokenizer(r'\w+')
[4]:
stop_words = stopwords.words('english')
[5]:
# per-message tokens: lower-cased, punctuation stripped, stopwords removed
freq_dict = df['text'].apply(
    lambda x: [i for i in tokenizer.tokenize(x.lower())
               if i not in stop_words]
)
[6]:
freq_dict.explode()
[6]:
3807         urgent
3807         trying
3807        contact
3807           last
3807       weekends
           ...
2438            net
2438       custcare
2438    08715705022
2438         1x150p
2438             wk
Name: text, Length: 1259, dtype: object
[7]:
# per-class token frequencies: join all messages of each class, tokenize,
# drop stopwords, and count the remaining tokens with nltk.FreqDist
freq_dict = df[['label', 'text']].groupby('label', group_keys=False)['text']\
    .apply(lambda x: " ".join(x))\
    .apply(lambda x: nltk.FreqDist([i for i in tokenizer.tokenize(x.lower())
                                    if i not in stop_words]))\
    .to_dict()
[8]:
freq_dict['ham']
[8]:
FreqDist({'lor': 6, 'like': 6, 'u': 5, 'got': 5, 'wat': 4, 'go': 4, 'ur': 3, 'come': 3, 'lt': 3, 'gt': 3, ...})
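
From these per-class counts, the quantities the Naive Bayes model below needs can be estimated directly. A minimal sketch, assuming Laplace (add-one) smoothing and a shared vocabulary over both classes; the smoothing constant and the helper name word_likelihood are illustrative additions, not part of the original notebook:

[ ]:
# class priors from the balanced sample (0.5 / 0.5 here by construction)
priors = (df['label'].value_counts() / len(df)).to_dict()

# shared vocabulary across both classes
vocab = set().union(*freq_dict.values())

def word_likelihood(word, label, alpha=1.0):
    # Laplace-smoothed estimate of P(word | label) from the FreqDist counts
    counts = freq_dict[label]
    return (counts[word] + alpha) / (counts.N() + alpha * len(vocab))

word_likelihood('urgent', 'spam'), word_likelihood('urgent', 'ham')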

Naive Bayes#

\begin{align}
P(Y=y \mid X=x) &= \frac{P(X=x \mid Y=y)\, P(Y=y)}{P(X=x)} \\
\text{where}\quad P(X=x \mid Y=y) &= \prod_{\alpha=1}^{d} P([X]_\alpha = x_\alpha \mid Y = y)
\end{align}

  • Naive Bayes naively assumes that all of the features are conditionally independent of one another given the label Y.

  • For example, in an email, every word is assumed to be independent of the other words given the spam/ham label; a concrete factorization is shown below.
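
For instance, for a hypothetical three-token message "free entry win", the independence assumption factorizes the class-conditional likelihood into a product of per-word terms:

\begin{align*} P(\text{free, entry, win} \mid Y=\text{spam}) = P(\text{free} \mid \text{spam}) \cdot P(\text{entry} \mid \text{spam}) \cdot P(\text{win} \mid \text{spam}) \end{align*}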

Bayes Classifier#

\begin{align*}
h(\vec{x}) &= \operatorname*{argmax}_{y}\; \frac{P(\vec{x} \mid y)\, P(y)}{z} \\
&= \operatorname*{argmax}_{y}\; P(y) \prod_{\alpha} P([\vec{x}]_\alpha \mid y) \\
&= \operatorname*{argmax}_{y}\; \left( \log P(y) + \sum_\alpha \log P([\vec{x}]_\alpha \mid y) \right)
\end{align*}

P.S. - In practice we prefer not to multiply many probabilities directly, for several reasons (see the References section), chief among them floating-point underflow. Taking the log converts the multiplication into addition.
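
Tying the pieces together, a rough sketch of this log-space decision rule written against the freq_dict, tokenizer, and stop_words objects defined above; the Laplace smoothing constant alpha and the helper names log_posterior / predict are assumptions added for illustration, not part of the original notebook:

[ ]:
import math

def log_posterior(text, label, alpha=1.0):
    # unnormalised log P(y) + sum over words of log P(w | y), with
    # Laplace-smoothed word likelihoods from the per-class FreqDist counts
    counts = freq_dict[label]
    vocab_size = len(set().union(*freq_dict.values()))
    prior = (df['label'] == label).mean()
    tokens = [w for w in tokenizer.tokenize(text.lower()) if w not in stop_words]
    return math.log(prior) + sum(
        math.log((counts[w] + alpha) / (counts.N() + alpha * vocab_size))
        for w in tokens
    )

def predict(text):
    # argmax over the two labels in freq_dict ('ham', 'spam')
    return max(freq_dict, key=lambda label: log_posterior(text, label))

predict("URGENT! You have won a free prize, call now")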

[ ]: