Bayes Estimation#
References#
[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import json
plt.style.use('fivethirtyeight')
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
[2]:
dataset = pd.read_csv('/opt/datasetsRepo/smsspamcollection/SMSSpamCollection',
sep='\t', names=['label', 'text'])
# dataset['flag'] = dataset['label'].map({ "ham" : 0, "spam" : 1})
df = pd.concat([
dataset.query("label == 'spam'").sample(50),
dataset.query("label == 'ham'").sample(50)
], axis = 0).sample(frac=1, random_state=0)
df.head(3)
[2]:
|      | label | text |
|------|-------|------|
| 3807 | spam  | URGENT! We are trying to contact you. Last wee... |
| 4788 | ham   | Ü thk of wat to eat tonight. |
| 1930 | spam  | Free 1st week entry 2 TEXTPOD 4 a chance 2 win... |
[3]:
tokenizer = RegexpTokenizer(r'\w+')
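As a quick sanity check, `RegexpTokenizer(r'\w+')` behaves like `re.findall(r'\w+', ...)` on a string: it keeps runs of word characters and silently drops punctuation and whitespace (the sample sentence below is illustrative, not from the dataset).

```python
import re

# r'\w+' matches maximal runs of [A-Za-z0-9_], so "URGENT!" loses its "!"
print(re.findall(r'\w+', "URGENT! We are trying to contact you."))
# → ['URGENT', 'We', 'are', 'trying', 'to', 'contact', 'you']
```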
[4]:
nltk.download('stopwords', quiet=True)  # no-op if the corpus is already present
stop_words = stopwords.words('english')
[5]:
freq_dict = df['text'].apply(
lambda x: [i for i in tokenizer.tokenize(x.lower()) \
if i not in stop_words]
)
[6]:
freq_dict.explode()
[6]:
3807 urgent
3807 trying
3807 contact
3807 last
3807 weekends
...
2438 net
2438 custcare
2438 08715705022
2438 1x150p
2438 wk
Name: text, Length: 1259, dtype: object
[7]:
freq_dict = df[['label','text']].groupby('label', group_keys=False)['text']\
.apply(lambda x: " ".join(x))\
.apply(lambda x: nltk.FreqDist([i for i in tokenizer.tokenize(x.lower()) \
if i not in stop_words]))\
.to_dict()
[8]:
freq_dict['ham']
[8]:
FreqDist({'lor': 6, 'like': 6, 'u': 5, 'got': 5, 'wat': 4, 'go': 4, 'ur': 3, 'come': 3, 'lt': 3, 'gt': 3, ...})
Naive Bayes#
\begin{align} P(Y=y | X=x) &= \frac{P(X=x | Y=y) P(Y=y)}{P(X=x)}\\ \\ &\text{Where } \\ P(X=x | Y=y) &= \prod_{\alpha=1}^{d} P([X]_\alpha = x_\alpha| Y = y) \end{align}
Naive Bayes naively assumes that all features are independently distributed given the label $Y$. For example, given that an email is labelled spam or ham, the occurrences of its individual words are treated as independent of one another.
Bayes Classifier#
\begin{align*} h(\vec{x}) &= \operatorname*{argmax}_{y} \frac{P(\vec{x} \mid y)\, P(y)}{z}\\ \\ &= \operatorname*{argmax}_{y} P(y) \prod_{\alpha} P([\vec{x}]_\alpha \mid y)\\ \\ &= \operatorname*{argmax}_{y} \left( \log P(y) + \sum_\alpha \log P([\vec{x}]_\alpha \mid y) \right) \end{align*}
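This argmax can be sketched end to end. The toy word counts and 50/50 prior below stand in for the `FreqDist` objects and balanced sample built above; `log_posterior` and `classify` are hypothetical helper names, and add-one (Laplace) smoothing is an added assumption so that a word unseen in one class does not send its log-likelihood to $-\infty$.

```python
import math

# Toy per-class word counts (stand-ins for the FreqDist objects above)
freq = {
    'spam': {'free': 4, 'win': 3, 'urgent': 2},
    'ham':  {'lor': 6, 'like': 6, 'got': 5},
}
prior = {'spam': 0.5, 'ham': 0.5}  # balanced classes, as in the 50/50 sample
vocab = {w for counts in freq.values() for w in counts}

def log_posterior(words, label, alpha=1.0):
    """log P(y) + sum_a log P(x_a | y), with add-one (Laplace) smoothing."""
    total = sum(freq[label].values())
    denom = total + alpha * len(vocab)
    score = math.log(prior[label])
    for w in words:
        score += math.log((freq[label].get(w, 0) + alpha) / denom)
    return score

def classify(words):
    # argmax over labels of the (unnormalised) log posterior
    return max(prior, key=lambda y: log_posterior(words, y))

print(classify(['free', 'win']))   # → spam
print(classify(['lor', 'like']))   # → ham
```

The normaliser $z$ is dropped entirely: it is the same for every $y$, so it cannot change which label attains the maximum.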
P.S. In practice we avoid multiplying many probabilities directly: each factor is at most 1, so long products underflow to zero in floating point (see the reference section). Taking logarithms converts the product into a numerically stable sum, and since $\log$ is monotonic the argmax is unchanged.
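The underflow problem is easy to reproduce: the product of 100 modest word probabilities (1e-5 each, a made-up value for illustration) is exactly 0.0 in double precision, while the sum of their logs remains an ordinary float.

```python
import math

p = 1e-5
prod = 1.0
log_sum = 0.0
for _ in range(100):
    prod *= p            # 1e-500 is far below the smallest positive double (~5e-324)
    log_sum += math.log(p)

print(prod)     # → 0.0, the product has underflowed and is unusable
print(log_sum)  # ≈ -1151.29, still perfectly representable
```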