
Statsmodels#

  • A mathematical equation that explains the relationship between the dependent variable (Y) and the independent variable (X):

    Y = f(X)
    
  • Due to uncertainty in the result and noise, the equation becomes

    Y = f(X) + e
    

Linear Models#

\begin{align} Y &= \theta_0 + \theta_1 X_1 + \theta_2 X_2 + \dots + \theta_n X_n\\ Y &= \theta_0 + \theta_1 X + \theta_2 X^2 + \dots + \theta_n X^n\\ Y &= \theta_0 + \theta_1 \sin(X_1) + \theta_2 \cos(X_2) \end{align}

  • All of these are linear models: they are linear in the parameters \(\theta_i\), even when the terms themselves are non-linear functions of X.

Linear Regression#

\begin{align} Y &= \theta_0 + \theta_1 X + e \end{align}

  • \(\theta_0, \theta_1\): coefficients

  • \(e\): normally distributed residual error

  • The linear regression model assumes that the residuals are independent and normally distributed.

  • The model is fitted to the data using the ordinary least squares (OLS) approach, which chooses the coefficients that minimise the sum of squared residuals.
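
As a minimal sketch of what the ordinary least squares fit does (using a small synthetic dataset made up for illustration), the coefficients can be obtained by solving the least-squares problem directly with NumPy; statsmodels performs an equivalent computation internally:

    import numpy as np

    # synthetic data: y = 2 + 3x + noise (hypothetical example)
    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 50)
    y = 2 + 3 * x + rng.normal(scale=0.5, size=x.size)

    # design matrix with an intercept column
    X = np.column_stack([np.ones_like(x), x])

    # least-squares solution of Y = X theta + e
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(theta)  # approximately [2, 3]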

Non-Linear Models#

  • In most cases, non-linear models can be generalized to linear models.

  • Examples: Binomial Regression, Poisson Regression (see the GLM sketch below)
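
As a minimal sketch (with a hypothetical count dataset), such models can be fitted through the GLM interface by choosing an appropriate family:

    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # hypothetical count data
    df = pd.DataFrame({'y': [0, 1, 1, 2, 3, 5],
                       'x': [1., 2., 3., 4., 5., 6.]})

    # Poisson family with a log link generalizes the linear model to counts
    model = smf.glm('y ~ x', data=df, family=sm.families.Poisson())
    print(model.fit().params)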

Design Matrices#

  • Once the model is chosen, the design matrices are constructed: \(Y = XB + e\)

Variable   Description
--------   -----------
Y          vector/matrix of the dependent variable
X          vector/matrix of the independent variables
B          vector/matrix of coefficients
e          residual error

Creating a Model#

Using the statsmodels library#

  • OLS (ordinary least squares)

  • GLM (generalized linear model)

  • WLS (weighted least squares)

  • ols

  • glm

  • wls

    Uppercase names take design matrices as args
    Lowercase names take Patsy formulas and DataFrames as args
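
A minimal sketch of the two calling conventions (using a small hypothetical DataFrame df), specifying the same model both ways:

    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # hypothetical data
    df = pd.DataFrame({'y': [1., 2., 3., 4., 5.],
                       'x1': [6., 7., 8., 9., 10.]})

    # uppercase API: pass design matrices directly
    X = sm.add_constant(df[['x1']])   # prepend the intercept column
    modelUpper = sm.OLS(df['y'], X)

    # lowercase API: pass a Patsy formula and a DataFrame
    modelLower = smf.ols('y ~ x1', data=df)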
    

Fitting a Model#

  • The fit() method returns a results object that provides further methods, attributes, and the coefficient matrix for analysis.
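
A minimal sketch of the attributes commonly inspected on the fitted results object (continuing the hypothetical modelLower from the previous sketch):

    result = modelLower.fit()

    print(result.params)        # estimated coefficients
    print(result.rsquared)      # R-squared of the fit
    print(result.resid)         # residual errors
    print(result.fittedvalues)  # in-sample predictions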

View Model Summary#

  • The summary() method describes the fit of the model in text form.


Construct Design Matrices#

\(Y = \theta_0 + \theta_1 X_1 + \theta_2 X_2 + \theta_3 X_1 X_2\)

Design Matrix with Numpy#

[1]:
import numpy as np

# dependent variable as a column vector
Y = np.array([1,2,3,4,5]).reshape(-1,1)

x1 = np.array([6,7,8,9,10])
x2 = np.array([11,12,13,14,15])

# design matrix: intercept column, x1, x2 and the interaction term x1*x2
X = np.vstack([np.ones(5), x1, x2, x1*x2]).T

print(Y)
print(X)
[[1]
 [2]
 [3]
 [4]
 [5]]
[[  1.   6.  11.  66.]
 [  1.   7.  12.  84.]
 [  1.   8.  13. 104.]
 [  1.   9.  14. 126.]
 [  1.  10.  15. 150.]]

Design Matrix with patsy#

  • allows defining a model easily

  • constructs relevant design matrices (patsy.dmatrices)

  • takes a formula in string form and a dictionary-like object with data arrays for the variables as arguments

Parse tree of the formula Y ~ 1 + x1 + x2 + x1*x2:

    ~
 /    \
Y     +
    /   \
   1     +
       /   \
      x1    +
          /   \
        x2     *
             /   \
           x1    x2
  • ‘y ~ np.log(x1)’: NumPy functions can often be used to transform terms in the expression.

  • ‘y ~ I(x1 + x2)’: I() is the identity function, used to escape arithmetic expressions so they are evaluated as-is rather than parsed as formula terms.

  • ‘y ~ C(x1)’: treats the variable x1 as a categorical variable (a short sketch of these operators follows below).
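
A minimal sketch of these operators (with small made-up arrays), showing how I() and C() change the resulting design matrix:

    import numpy as np
    import patsy

    data = {'y': np.array([1., 2., 3., 4., 5.]),
            'x1': np.array([6., 7., 8., 9., 10.]),
            'x2': np.array([0, 1, 0, 1, 0])}

    # I() evaluates x1 + x2 arithmetically instead of adding two model terms
    _, X_sum = patsy.dmatrices('y ~ I(x1 + x2)', data)
    print(X_sum)

    # C() dummy-codes x2 as a categorical variable
    _, X_cat = patsy.dmatrices('y ~ C(x2)', data)
    print(X_cat)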

[2]:
import patsy

Y = np.array([1, 2, 3, 4, 5])
x1 = np.array([6, 7, 8, 9, 10])
x2 = np.array([11, 12, 13, 14, 15])
data = {
    'Y' : Y,
    'x1' : x1,
    'x2' : x2,
}

equation = 'Y ~ 1 + x1 + x2 + x1*x2'

# dmatrices builds the response and design matrices from the formula
Y, X = patsy.dmatrices(equation, data)

print(Y)
print(X)
[[1.]
 [2.]
 [3.]
 [4.]
 [5.]]
[[  1.   6.  11.  66.]
 [  1.   7.  12.  84.]
 [  1.   8.  13. 104.]
 [  1.   9.  14. 126.]
 [  1.  10.  15. 150.]]

Linear Model Creation using statsmodels#

Using the built-in Icecream dataset#

[4]:
import statsmodels.api as sm

icecream = sm.datasets.get_rdataset("Icecream","Ecdat")

dataset = icecream.data
dataset.head()
[4]:
    cons  income  price  temp
0  0.386      78  0.270    41
1  0.374      79  0.282    56
2  0.393      81  0.277    63
3  0.425      80  0.280    68
4  0.406      76  0.272    69
[5]:
import statsmodels.formula.api as smf

linearModel1 = smf.ols('cons ~ price + temp', dataset)

fitModel1 = linearModel1.fit()

print(fitModel1.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                   cons   R-squared:                       0.633
Model:                            OLS   Adj. R-squared:                  0.606
Method:                 Least Squares   F-statistic:                     23.27
Date:                Thu, 15 Oct 2020   Prob (F-statistic):           1.34e-06
Time:                        21:53:43   Log-Likelihood:                 54.607
No. Observations:                  30   AIC:                            -103.2
Df Residuals:                      27   BIC:                            -99.01
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.5966      0.258      2.309      0.029       0.067       1.127
price         -1.4018      0.925     -1.515      0.141      -3.300       0.496
temp           0.0030      0.000      6.448      0.000       0.002       0.004
==============================================================================
Omnibus:                        0.991   Durbin-Watson:                   0.656
Prob(Omnibus):                  0.609   Jarque-Bera (JB):                0.220
Skew:                          -0.107   Prob(JB):                        0.896
Kurtosis:                       3.361   Cond. No.                     6.58e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.58e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
[6]:
linearModel2 = smf.ols('cons ~ income + temp', dataset)

fitModel2 = linearModel2.fit()

print(fitModel2.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                   cons   R-squared:                       0.702
Model:                            OLS   Adj. R-squared:                  0.680
Method:                 Least Squares   F-statistic:                     31.81
Date:                Thu, 15 Oct 2020   Prob (F-statistic):           7.96e-08
Time:                        21:53:43   Log-Likelihood:                 57.742
No. Observations:                  30   AIC:                            -109.5
Df Residuals:                      27   BIC:                            -105.3
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.1132      0.108     -1.045      0.305      -0.335       0.109
income         0.0035      0.001      3.017      0.006       0.001       0.006
temp           0.0035      0.000      7.963      0.000       0.003       0.004
==============================================================================
Omnibus:                        2.264   Durbin-Watson:                   1.003
Prob(Omnibus):                  0.322   Jarque-Bera (JB):                1.094
Skew:                           0.386   Prob(JB):                        0.579
Kurtosis:                       3.528   Cond. No.                     1.56e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.56e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
[7]:
linearModel3 = smf.ols('cons ~ -1 + income + temp', dataset)  # '-1' drops the intercept

fitModel3 = linearModel3.fit()

print(fitModel3.summary())
                                 OLS Regression Results
=======================================================================================
Dep. Variable:                   cons   R-squared (uncentered):                   0.990
Model:                            OLS   Adj. R-squared (uncentered):              0.990
Method:                 Least Squares   F-statistic:                              1426.
Date:                Thu, 15 Oct 2020   Prob (F-statistic):                    6.77e-29
Time:                        21:53:43   Log-Likelihood:                          57.146
No. Observations:                  30   AIC:                                     -110.3
Df Residuals:                      28   BIC:                                     -107.5
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
income         0.0023      0.000      9.906      0.000       0.002       0.003
temp           0.0033      0.000      8.571      0.000       0.003       0.004
==============================================================================
Omnibus:                        3.584   Durbin-Watson:                   0.887
Prob(Omnibus):                  0.167   Jarque-Bera (JB):                2.089
Skew:                           0.508   Prob(JB):                        0.352
Kurtosis:                       3.798   Cond. No.                         6.45
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[8]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
df = sm.datasets.get_rdataset("mtcars").data


model = smf.ols('np.log(wt) ~ np.log(mpg)', df)  # log-log model via NumPy transforms in the formula
trainedModel = model.fit()

print(trainedModel.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:             np.log(wt)   R-squared:                       0.806
Model:                            OLS   Adj. R-squared:                  0.799
Method:                 Least Squares   F-statistic:                     124.4
Date:                Thu, 15 Oct 2020   Prob (F-statistic):           3.41e-12
Time:                        21:53:48   Log-Likelihood:                 18.024
No. Observations:                  32   AIC:                            -32.05
Df Residuals:                      30   BIC:                            -29.12
Df Model:                           1
Covariance Type:            nonrobust
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       3.9522      0.255     15.495      0.000       3.431       4.473
np.log(mpg)    -0.9570      0.086    -11.152      0.000      -1.132      -0.782
==============================================================================
Omnibus:                        1.199   Durbin-Watson:                   1.625
Prob(Omnibus):                  0.549   Jarque-Bera (JB):                1.159
Skew:                           0.349   Prob(JB):                        0.560
Kurtosis:                       2.381   Cond. No.                         33.5
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Logistic Regression#

  • Logit : Logistic Regression

  • MNLogit : Multinomial Logistic Regression

  • Poisson : Poisson Regression

\begin{align} z &= \theta^T x\\ g(z) &= \frac{1}{1 + e^{-z}}\\ h_\theta (x) &= g(\theta^T x) \end{align}
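
A minimal sketch of the logistic (sigmoid) function g, which squashes any real-valued score into the (0, 1) probability range:

    import numpy as np

    def sigmoid(z):
        """Logistic function g(z) = 1 / (1 + exp(-z))."""
        return 1.0 / (1.0 + np.exp(-z))

    print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # ~[0.007, 0.5, 0.993]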

[9]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd


df = sm.datasets.get_rdataset('iris').data

## Logit requires a binary target, so keep only two of the three species
dfSubset = df[(df['Species'] == "versicolor") | (df['Species'] == "virginica")].copy()


## preprocessing
## label encoding done manually

dfSubset["Species"] = dfSubset['Species'].map({
    "versicolor" : 1,
    "virginica" : 0
})

## rename columns: patsy formulas cannot contain '.' in variable names
dfSubset.columns = [column.replace(".","_") for column in dfSubset.columns]

## Creating a model
model = smf.logit('Species ~ Petal_Length + Petal_Width ', data = dfSubset)
trainedModel = model.fit()

print(trainedModel.summary())



## Make predictions on randomly generated test data
dfTest = pd.DataFrame({
    "Petal_Length" : np.random.randn(20) * 0.7 + 6,
    "Petal_Width" : np.random.randn(20) * 0.7 + 1
})


dfTest['rawSpecies'] = trainedModel.predict(dfTest)  # predicted probabilities
dfTest['Species'] = dfTest.rawSpecies.apply(lambda x: 1 if x > 0.5 else 0)  # threshold at 0.5

print("*---- Test data and Predictions ----*")
dfTest.head()
Optimization terminated successfully.
         Current function value: 0.102818
         Iterations 10
                           Logit Regression Results
==============================================================================
Dep. Variable:                Species   No. Observations:                  100
Model:                          Logit   Df Residuals:                       97
Method:                           MLE   Df Model:                            2
Date:                Thu, 15 Oct 2020   Pseudo R-squ.:                  0.8517
Time:                        21:53:53   Log-Likelihood:                -10.282
converged:                       True   LL-Null:                       -69.315
Covariance Type:            nonrobust   LLR p-value:                 2.303e-26
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       45.2723     13.612      3.326      0.001      18.594      71.951
Petal_Length    -5.7545      2.306     -2.496      0.013     -10.274      -1.235
Petal_Width    -10.4467      3.756     -2.782      0.005     -17.808      -3.086
================================================================================

Possibly complete quasi-separation: A fraction 0.34 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
*---- Test data and Predictions ----*
[9]:
   Petal_Length  Petal_Width  rawSpecies  Species
0      6.437427     0.075640    0.999412        1
1      6.710820     1.414629    0.000296        0
2      5.009607     1.536436    0.597176        1
3      5.885385     1.335874    0.072375        0
4      6.339593     0.180516    0.998998        1

Poisson Regression Model#

  • Poisson regression is a generalized linear model form of regression analysis used to model count data and contingency tables. Poisson regression assumes the response variable Y has a Poisson distribution, and assumes the logarithm of its expected value can be modeled by a linear combination of unknown parameters. A Poisson regression model is sometimes known as a log-linear model, especially when used to model contingency tables.

  • It describes a process where the dependent variable is the count of successes over many attempts, each attempt having a very low probability of success.

[10]:
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf


df = pd.read_csv("https://stats.idre.ucla.edu/stat/data/poisson_sim.csv")


model = smf.poisson('num_awards ~ math + C(prog)', data = df)  # C(prog) treats the program type as categorical
trainedModel = model.fit()


print(trainedModel.summary())
Optimization terminated successfully.
         Current function value: 0.913761
         Iterations 6
                          Poisson Regression Results
==============================================================================
Dep. Variable:             num_awards   No. Observations:                  200
Model:                        Poisson   Df Residuals:                      196
Method:                           MLE   Df Model:                            3
Date:                Thu, 15 Oct 2020   Pseudo R-squ.:                  0.2118
Time:                        21:53:55   Log-Likelihood:                -182.75
converged:                       True   LL-Null:                       -231.86
Covariance Type:            nonrobust   LLR p-value:                 3.747e-21
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       -5.2471      0.658     -7.969      0.000      -6.538      -3.957
C(prog)[T.2]     1.0839      0.358      3.025      0.002       0.382       1.786
C(prog)[T.3]     0.3698      0.441      0.838      0.402      -0.495       1.234
math             0.0702      0.011      6.619      0.000       0.049       0.091
================================================================================
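
Since the Poisson model is fitted on the log scale, the coefficients are easier to interpret after exponentiating them. As a minimal follow-up sketch (continuing with trainedModel from above), exp(coef) gives the multiplicative effect on the expected count:

    import numpy as np

    # exp(coef): multiplicative change in the expected number of awards,
    # e.g. each extra math point scales the expected count by exp(0.0702) ≈ 1.07
    print(np.exp(trainedModel.params))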