Bayes Estimation#
References#
[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import json
plt.style.use('fivethirtyeight')
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
[2]:
dataset = pd.read_csv('/opt/datasetsRepo/smsspamcollection/SMSSpamCollection',
sep='\t', names=['label', 'text'])
# dataset['flag'] = dataset['label'].map({ "ham" : 0, "spam" : 1})
df = pd.concat([
dataset.query("label == 'spam'").sample(50),
dataset.query("label == 'ham'").sample(50)
], axis = 0).sample(frac=1, random_state=0)
df.head(3)
[2]:
|      | label | text |
|------|-------|------|
| 3807 | spam  | URGENT! We are trying to contact you. Last wee... |
| 4788 | ham   | Ü thk of wat to eat tonight. |
| 1930 | spam  | Free 1st week entry 2 TEXTPOD 4 a chance 2 win... |
[3]:
tokenizer = RegexpTokenizer(r'\w+')
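As a quick sanity check, `RegexpTokenizer(r'\w+')` behaves like `re.findall(r'\w+', ...)` on a string: it keeps runs of word characters and silently drops punctuation and whitespace (the sample sentence below is illustrative, not from the dataset).

```python
import re

# r'\w+' matches maximal runs of [A-Za-z0-9_], so "URGENT!" loses its "!"
print(re.findall(r'\w+', "URGENT! We are trying to contact you."))
# → ['URGENT', 'We', 'are', 'trying', 'to', 'contact', 'you']
```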
[4]:
nltk.download('stopwords', quiet=True)  # no-op if the corpus is already present
stop_words = stopwords.words('english')
[5]:
freq_dict = df['text'].apply(
lambda x: [i for i in tokenizer.tokenize(x.lower()) \
if i not in stop_words]
)
[6]:
freq_dict.explode()
[6]:
3807 urgent
3807 trying
3807 contact
3807 last
3807 weekends
...
2438 net
2438 custcare
2438 08715705022
2438 1x150p
2438 wk
Name: text, Length: 1259, dtype: object
[7]:
freq_dict = df[['label','text']].groupby('label', group_keys=False)['text']\
.apply(lambda x: " ".join(x))\
.apply(lambda x: nltk.FreqDist([i for i in tokenizer.tokenize(x.lower()) \
if i not in stop_words]))\
.to_dict()
[8]:
freq_dict['ham']
[8]:
FreqDist({'lor': 6, 'like': 6, 'u': 5, 'got': 5, 'wat': 4, 'go': 4, 'ur': 3, 'come': 3, 'lt': 3, 'gt': 3, ...})
Naive Bayes#
\begin{align} P(Y=y | X=x) &= \frac{P(X=x | Y=y) P(Y=y)}{P(X=x)}\\ \\ &\text{Where } \\ P(X=x | Y=y) &= \prod_{\alpha=1}^{d} P([X]_\alpha = x_\alpha| Y = y) \end{align}
Naive Bayes naively assumes that all features are independently distributed given the label $Y$. For example, given that an email is labelled spam or ham, the occurrences of its individual words are treated as independent of one another.
Bayes Classifier#
\begin{align*} h(\vec{x}) &= \operatorname*{argmax}_{y} \frac{P(\vec{x} \mid y)\, P(y)}{z}\\ \\ &= \operatorname*{argmax}_{y} P(y) \prod_{\alpha} P([\vec{x}]_\alpha \mid y)\\ \\ &= \operatorname*{argmax}_{y} \left( \log P(y) + \sum_\alpha \log P([\vec{x}]_\alpha \mid y) \right) \end{align*}
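This argmax can be sketched end to end. The toy word counts and 50/50 prior below stand in for the `FreqDist` objects and balanced sample built above; `log_posterior` and `classify` are hypothetical helper names, and add-one (Laplace) smoothing is an added assumption so that a word unseen in one class does not send its log-likelihood to $-\infty$.

```python
import math

# Toy per-class word counts (stand-ins for the FreqDist objects above)
freq = {
    'spam': {'free': 4, 'win': 3, 'urgent': 2},
    'ham':  {'lor': 6, 'like': 6, 'got': 5},
}
prior = {'spam': 0.5, 'ham': 0.5}  # balanced classes, as in the 50/50 sample
vocab = {w for counts in freq.values() for w in counts}

def log_posterior(words, label, alpha=1.0):
    """log P(y) + sum_a log P(x_a | y), with add-one (Laplace) smoothing."""
    total = sum(freq[label].values())
    denom = total + alpha * len(vocab)
    score = math.log(prior[label])
    for w in words:
        score += math.log((freq[label].get(w, 0) + alpha) / denom)
    return score

def classify(words):
    # argmax over labels of the (unnormalised) log posterior
    return max(prior, key=lambda y: log_posterior(words, y))

print(classify(['free', 'win']))   # → spam
print(classify(['lor', 'like']))   # → ham
```

The normaliser $z$ is dropped entirely: it is the same for every $y$, so it cannot change which label attains the maximum.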
P.S. In practice we avoid multiplying many probabilities directly: each factor is at most 1, so long products underflow to zero in floating point (see the reference section). Taking logarithms converts the product into a numerically stable sum, and since $\log$ is monotonic the argmax is unchanged.
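The underflow problem is easy to reproduce: the product of 100 modest word probabilities (1e-5 each, a made-up value for illustration) is exactly 0.0 in double precision, while the sum of their logs remains an ordinary float.

```python
import math

p = 1e-5
prod = 1.0
log_sum = 0.0
for _ in range(100):
    prod *= p            # 1e-500 is far below the smallest positive double (~5e-324)
    log_sum += math.log(p)

print(prod)     # → 0.0, the product has underflowed and is unusable
print(log_sum)  # ≈ -1151.29, still perfectly representable
```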