Covariance and Correlations#

Covariance#

Joint variability of two random variables.

\(cov(x,y) = \frac{\sum_{i=0}^{N-1}{(x_i - \bar{x})(y_i - \bar{y})}}{N-1}\)

in numpy cov result it returns a matrix

\(\begin{bmatrix} var(x) && cov(x,y) \\ cov(x,y) && var(y) \\ \end{bmatrix}\)

[1]:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style

# style.use('ggplot')

%matplotlib inline

Positive Covariance#

[2]:

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([5, 6, 7, 8, 9, 10])

[3]:

plt.axvline(x.mean(),c='g',label='mean-x')
plt.axhline(y.mean(),c='y',label='mean-y')
plt.scatter(x,y,label='x-y plot')
plt.legend()

[3]:

<matplotlib.legend.Legend at 0x7f5a6a443130>

../_images/PracticalStatistics_Correlation_6_1.png

[4]:

np.sum((x- x.mean()) * (y - y.mean())) / (x.shape[0] - 1)

[4]:

3.5

[5]:

np.cov(x), np.cov(y), np.cov(x,y)

[5]:

(array(3.5),
 array(3.5),
 array([[3.5, 3.5],
        [3.5, 3.5]]))

Zero Covariance#

[6]:

x = np.array([1, 1, 1, 1, 1, 1])
y = np.array([5, 6, 7, 8, 9, 10])

[7]:

plt.axvline(x.mean(),c='g',label='mean-x')
plt.axhline(y.mean(),c='y',label='mean-y')
plt.scatter(x,y,label='x-y plot')
plt.legend()

[7]:

<matplotlib.legend.Legend at 0x7f5a620c0cd0>

../_images/PracticalStatistics_Correlation_11_1.png

[8]:

np.cov(x), np.cov(y), np.cov(x,y)

[8]:

(array(0.),
 array(3.5),
 array([[0. , 0. ],
        [0. , 3.5]]))

[9]:

np.cov(x), np.cov(y), np.cov(y,x)

[9]:

(array(0.),
 array(3.5),
 array([[3.5, 0. ],
        [0. , 0. ]]))

[10]:

np.sum((x- x.mean()) * (y - y.mean())) / (x.shape[0] - 1)

[10]:

0.0

Negative Covariance#

[11]:

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([5, 6, 7, 8, 9, 10][::-1])

[12]:

plt.axvline(x.mean(),c='g',label='mean-x')
plt.axhline(y.mean(),c='y',label='mean-y')
plt.scatter(x,y,label='x-y plot')
plt.legend()

[12]:

<matplotlib.legend.Legend at 0x7f5a62033820>

../_images/PracticalStatistics_Correlation_17_1.png

[13]:

np.cov(x), np.cov(y), np.cov(x,y)

[13]:

(array(3.5),
 array(3.5),
 array([[ 3.5, -3.5],
        [-3.5,  3.5]]))

[14]:

np.sum((x- x.mean()) * (y - y.mean())) / (x.shape[0] - 1)

[14]:

-3.5

Correlation Coefficient#

How strong is the relationship between two variables.
1 indicates a strong positive relationship.
-1 indicates a strong negative relationship.
A result of zero indicates no relationship at all.
Not sensitive to the scale of data.

May not be useful if the variables don’t have linear relationship somehow.

Pearson Correlation#

\(\rho_{X,Y}=\frac{cov(X,Y)}{\sigma_X\sigma_Y}\)

\(\rho\) = population correlation coefficient
\(\sigma\) = standard deviation
\(\sigma^2\) = variance
\(cov(x,y)\) = covariance of x and y = \(\sigma_{x,y}\)

Positive Correlation#

[15]:

from scipy.stats import pearsonr

[16]:

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([15, 16, 17, 18, 19, 20])

[17]:

plt.axvline(x.mean(),c='g',label='mean-x')
plt.axhline(y.mean(),c='y',label='mean-y')
plt.scatter(x,y,label='x-y plot')
plt.legend()

[17]:

<matplotlib.legend.Legend at 0x7f5a4efaa2e0>

../_images/PracticalStatistics_Correlation_25_1.png

[18]:

cov_mat = np.cov(x,y)

print(cov_mat)

[[3.5 3.5]
 [3.5 3.5]]

[19]:

cov_mat[1][0] / np.sqrt(cov_mat[0][0] * cov_mat[1][1])

[19]:

1.0

[20]:

np.corrcoef(x,y)

[20]:

array([[1., 1.],
       [1., 1.]])

[21]:

pearsonr(x,y)

[21]:

(0.9999999999999999, 1.8488927466117464e-32)

Negative Correlation#

[22]:

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([15, 16, 17, 18, 19, 20][::-1])

[23]:

plt.axvline(x.mean(),c='g',label='mean-x')
plt.axhline(y.mean(),c='y',label='mean-y')
plt.scatter(x,y,label='x-y plot')
plt.legend()

[23]:

<matplotlib.legend.Legend at 0x7f5a4ef1ca90>

../_images/PracticalStatistics_Correlation_32_1.png

[24]:

cov_mat = np.cov(x,y)

print(cov_mat)

[[ 3.5 -3.5]
 [-3.5  3.5]]

[25]:

cov_mat[1][0] / np.sqrt(cov_mat[0][0] * cov_mat[1][1])

[25]:

-1.0

[26]:

np.corrcoef(x,y)

[26]:

array([[ 1., -1.],
       [-1.,  1.]])

[27]:

np.corrcoef(x,y)[0][1]

[27]:

-1.0

[28]:

pearsonr(x,y)

[28]:

(-0.9999999999999999, 1.8488927466117464e-32)

Spearman’s Rank Correlation#

\(\rho = 1 - \frac{6 \sum {d}^2}{n(n^2 - 1)}\)

\(r_x\) = ranks of x

\(r_y\) = ranks of y

d = differences of ranks
d = \(r_x - r_y\)
n = size of array

ranks#

x	ranks
1	1
2	2
3	3
4	4

x	ranks
44	4
2	1
33	3
11	2

[29]:

x = np.array([6, 4, 3, 5, 2, 1])
y = np.array([15, 16, 17, 18, 19, 20])

[30]:

def get_rank(x):
    temp = x.argsort()
    rank = np.empty_like(temp)
    rank[temp] = np.arange(len(x))
    return rank + 1

[31]:

x_rank = get_rank(x)
y_rank = get_rank(y)

print(x,x_rank)
print(y,y_rank)

[6 4 3 5 2 1] [6 4 3 5 2 1]
[15 16 17 18 19 20] [1 2 3 4 5 6]

[32]:

n = x.shape[0]
1 - ((6 * np.square(x_rank - y_rank).sum()) / (n * (n**2 - 1)))

[32]:

-0.8285714285714285

[33]:

from scipy.stats import spearmanr

[34]:

spearmanr(x,y)

[34]:

SpearmanrResult(correlation=-0.8285714285714287, pvalue=0.04156268221574334)

[35]:

plt.scatter(x,y,label='x-y plot')
plt.legend()

[35]:

<matplotlib.legend.Legend at 0x7f5a4eef8c40>

../_images/PracticalStatistics_Correlation_47_1.png

[36]:

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([15, 16, 17, 18, 19, 20])

spearmanr(x,y)

[36]:

SpearmanrResult(correlation=1.0, pvalue=0.0)

Kendall Rank Correlation (\(\tau\))#

[37]:

from scipy.stats import kendalltau

[38]:

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([15, 16, 17, 18, 19, 20])

print(kendalltau(x,y))

plt.scatter(x,y,label='x-y plot')
plt.legend()

KendalltauResult(correlation=0.9999999999999999, pvalue=0.002777777777777778)

[38]:

<matplotlib.legend.Legend at 0x7f5a4ee8f850>

../_images/PracticalStatistics_Correlation_51_2.png

[39]:

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([15, 16, 17, 18, 19, 20][::-1])


print(kendalltau(x,y))

plt.scatter(x,y,label='x-y plot')
plt.legend()

KendalltauResult(correlation=-0.9999999999999999, pvalue=0.002777777777777778)

[39]:

<matplotlib.legend.Legend at 0x7f5a4edc1ca0>

../_images/PracticalStatistics_Correlation_52_2.png

R-Squared score / Coefficient of determination#

\(R^2 = 1 - \frac{RSS}{TSS}\)

RSS = Residual Sum of Squares = \(\sum_{i=0}^{N}{(y_i - f(x_i))^2}\)

TSS = Total Sum of Squares = \(\sum_{i=0}^{N}{(y_i - \bar{y})^2}\)

for best file of regression line will have highest r2 score.

[40]:

y = np.array([1, 2, 4, 6, 7, 9, 12, 8])
f_x = np.array([0.8, 2.2, 4, 5.9, 6.3, 9.1, 11.5, 8.5])

plt.scatter(y, f_x)

[40]:

<matplotlib.collections.PathCollection at 0x7f5a4ed3ed00>

../_images/PracticalStatistics_Correlation_55_1.png

[41]:

RSS = np.square(y - f_x).sum()
TSS = np.square(y - y.mean()).sum()

print(1 - (RSS / TSS))

0.9885111989459816

[42]:

from sklearn.metrics import r2_score

r2_score(y, f_x)

[42]:

0.9885111989459816

[43]:

y = np.array([1, 2, 4, 6, 7, 9, 12, 8])
f_x = np.array([0.8, 2.2, 4, 5.9, 6.3, 9.1, 11.5, 8.5][::-1])

print(r2_score(y, f_x))

plt.scatter(y,f_x)

-2.654176548089592

[43]:

<matplotlib.collections.PathCollection at 0x7f5a4e8cd250>

../_images/PracticalStatistics_Correlation_58_2.png

[44]:

y = np.random.rand(10)
f_x = np.random.rand(10)

print(r2_score(y, f_x))

plt.scatter(y,f_x)

-1.2000772894487417

[44]:

<matplotlib.collections.PathCollection at 0x7f5a4e8a2820>

../_images/PracticalStatistics_Correlation_59_2.png

Covariance and Correlations

Contents