Statsmodels#

Statsmodels provides statistical models and tools for data analysis. It is designed to support the estimation and testing of statistical models in various contexts, including econometrics, linear and non-linear regression, time series analysis, and more. The library is divided into different modules that cover a wide range of statistical techniques and allows users to perform hypothesis testing, statistical modeling, and exploration of relationships within datasets.

import statsmodels.api as sm
import numpy as np
import pandas as pd

np.random.seed(0)

data = pd.DataFrame({
    'Target': np.random.normal(100, 10, 10),
    'X1': np.random.normal(0, 5, 10),
    'X2': np.random.uniform(0, 100, 10),
    'X3': np.random.choice(['A', 'B'], 10)
})

data = pd.get_dummies(data, columns=['X3'], drop_first=True)
data['X3_B'] = data['X3_B'].astype(float)

data
Target X1 X2 X3_B
0 117.640523 0.720218 14.335329 0.0
1 104.001572 7.271368 94.466892 0.0
2 109.787380 3.805189 52.184832 0.0
3 122.408932 0.608375 41.466194 1.0
4 118.675580 2.219316 26.455561 1.0
5 90.227221 1.668372 77.423369 0.0
6 109.500884 7.470395 45.615033 1.0
7 98.486428 -1.025791 56.843395 0.0
8 98.967811 1.565339 1.878980 0.0
9 104.105985 -4.270479 61.763550 1.0

Regression & Classification#

Model

Function

Linear

sm.OLS

Logit

sm.Logit

Poisson

sm.GLM with family=sm.families.Poisson()

Gamma

sm.GLM with family=sm.families.Gamma()

Negative Binomial

sm.GLM with family=sm.families.NegativeBinomial()

Zero-Inflated Poisson

ZeroInflatedPoisson

Zero-Inflated Negative Binomial

ZeroInflatedNegativeBinomial

Robust Regression

sm.RLM

Quantile Regression

sm.QuantReg

model = sm.OLS(data['Target'], data[['X1', 'X2', 'X3_B']])
results = model.fit()

results.summary()
C:\Users\silve\AppData\Roaming\jupyterlab-desktop\jlab_server\lib\site-packages\scipy\stats\_stats_py.py:1736: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=10
  warnings.warn("kurtosistest only valid for n>=20 ... continuing "
OLS Regression Results
Dep. Variable: Target R-squared (uncentered): 0.802
Model: OLS Adj. R-squared (uncentered): 0.717
Method: Least Squares F-statistic: 9.444
Date: Sun, 03 Dec 2023 Prob (F-statistic): 0.00741
Time: 23:58:13 Log-Likelihood: -52.899
No. Observations: 10 AIC: 111.8
Df Residuals: 7 BIC: 112.7
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
X1 2.7610 5.473 0.505 0.629 -10.180 15.702
X2 1.2587 0.446 2.820 0.026 0.203 2.314
X3_B 54.3494 33.395 1.627 0.148 -24.617 133.315
Omnibus: 1.154 Durbin-Watson: 1.854
Prob(Omnibus): 0.562 Jarque-Bera (JB): 0.864
Skew: 0.599 Prob(JB): 0.649
Kurtosis: 2.203 Cond. No. 99.9


Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.

ANOVA & ANCOVA#

ANOVA

from statsmodels.formula.api import ols

model = ols('Target ~ X3_B', data=data)
results = model.fit()

anova_table = sm.stats.anova_lm(results, typ=2)
anova_table
sum_sq df F PR(>F)
X3_B 263.979903 1.0 3.146072 0.114038
Residual 671.262297 8.0 NaN NaN

ANCOVA

from statsmodels.formula.api import ols

model = ols('Target ~ X3_B + X1', data=data)
results = model.fit()

anova_table = sm.stats.anova_lm(results, typ=2)
anova_table
sum_sq df F PR(>F)
X3_B 276.635923 1.0 2.961864 0.128928
X1 17.467390 1.0 0.187018 0.678420
Residual 653.794908 7.0 NaN NaN

Time Series#

t = np.arange(100)
trend = 0.5 * t
season = 10 * np.sin(2 * np.pi * t / 12)
noise = np.random.normal(0, 1, 100)

y = trend + noise
y_season = trend + season + noise
x = 5 * np.sin(2 * np.pi * t)

data = pd.DataFrame({'Time': t, 'Value': y, 'Value_Season': y_season, 'Exo_Var': x})
data
Time Value Value_Season Exo_Var
0 0 1.787484 1.787484 0.000000e+00
1 1 -0.069517 4.930483 -1.224647e-15
2 2 1.175387 9.835641 -2.449294e-15
3 3 1.037494 11.037494 -3.673940e-15
4 4 0.914199 9.574453 -4.898587e-15
... ... ... ... ...
95 95 47.935183 42.935183 -9.760036e-15
96 96 47.954252 47.954252 -1.175661e-13
97 97 48.549898 53.549898 -2.253721e-13
98 98 48.064470 56.724724 -3.331782e-13
99 99 49.787388 59.787388 1.274499e-13

100 rows × 4 columns

ARIMA#

from statsmodels.tsa.arima.model import ARIMA

model = ARIMA(data['Value'], order=(1, 1, 1))
results = model.fit()

results.summary()
SARIMAX Results
Dep. Variable: Value No. Observations: 100
Model: ARIMA(1, 1, 1) Log Likelihood -165.516
Date: Sun, 03 Dec 2023 AIC 337.031
Time: 23:58:13 BIC 344.817
Sample: 0 HQIC 340.181
- 100
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
ar.L1 -0.4647 0.222 -2.096 0.036 -0.899 -0.030
ma.L1 0.0899 0.256 0.352 0.725 -0.411 0.591
sigma2 1.6556 0.276 5.990 0.000 1.114 2.197
Ljung-Box (L1) (Q): 10.42 Jarque-Bera (JB): 0.98
Prob(Q): 0.00 Prob(JB): 0.61
Heteroskedasticity (H): 1.40 Skew: -0.24
Prob(H) (two-sided): 0.33 Kurtosis: 2.88


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

SARIMA#

from statsmodels.tsa.statespace.sarimax import SARIMAX

model = SARIMAX(data['Value_Season'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
results = model.fit()

results.summary()
SARIMAX Results
Dep. Variable: Value_Season No. Observations: 100
Model: SARIMAX(1, 1, 1)x(1, 1, 1, 12) Log Likelihood -125.397
Date: Sun, 03 Dec 2023 AIC 260.794
Time: 23:58:14 BIC 273.123
Sample: 0 HQIC 265.758
- 100
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
ar.L1 -0.0647 0.123 -0.528 0.598 -0.305 0.176
ma.L1 -0.9438 0.076 -12.340 0.000 -1.094 -0.794
ar.S.L12 -0.0445 0.174 -0.256 0.798 -0.385 0.296
ma.S.L12 -0.9854 3.389 -0.291 0.771 -7.628 5.658
sigma2 0.7542 2.483 0.304 0.761 -4.113 5.621
Ljung-Box (L1) (Q): 0.00 Jarque-Bera (JB): 1.15
Prob(Q): 0.96 Prob(JB): 0.56
Heteroskedasticity (H): 1.67 Skew: -0.22
Prob(H) (two-sided): 0.17 Kurtosis: 2.66


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

ARIMAX & SARIMAX#

model = ARIMA(data['Value'], order=(1, 1, 1), exog=data[['Exo_Var']])
# model = SARIMAX(data['Value_Season'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
results = model.fit()

results.summary()
SARIMAX Results
Dep. Variable: Value No. Observations: 100
Model: ARIMA(1, 1, 1) Log Likelihood -162.957
Date: Sun, 03 Dec 2023 AIC 333.915
Time: 23:58:14 BIC 344.295
Sample: 0 HQIC 338.115
- 100
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
Exo_Var 1.954e+12 2.18e-14 8.97e+25 0.000 1.95e+12 1.95e+12
ar.L1 -0.5121 0.219 -2.341 0.019 -0.941 -0.083
ma.L1 0.1342 0.255 0.526 0.599 -0.366 0.634
sigma2 1.5720 0.268 5.868 0.000 1.047 2.097
Ljung-Box (L1) (Q): 10.99 Jarque-Bera (JB): 1.26
Prob(Q): 0.00 Prob(JB): 0.53
Heteroskedasticity (H): 1.34 Skew: -0.27
Prob(H) (two-sided): 0.40 Kurtosis: 2.91


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 5.9e+28. Standard errors may be unstable.

VAR#

from statsmodels.tsa.api import VAR

var_data = data[['Value', 'Exo_Var']]
model = VAR(var_data)
results = model.fit()

results.summary()
  Summary of Regression Results   
==================================
Model:                         VAR
Method:                        OLS
Date:           Sun, 03, Dec, 2023
Time:                     23:58:14
--------------------------------------------------------------------
No. of Equations:         2.00000    BIC:                   -59.2358
Nobs:                     99.0000    HQIC:                  -59.3294
Log likelihood:           2665.01    FPE:                1.60673e-26
AIC:                     -59.3930    Det(Omega_mle):     1.51360e-26
--------------------------------------------------------------------
Results for equation Value
========================================================================================
                      coefficient            std. error           t-stat            prob
----------------------------------------------------------------------------------------
const                    0.527636              0.261169            2.020           0.043
L1.Value                 0.992909              0.010100           98.311           0.000
L1.Exo_Var  -1910883832505.852539  1461402552441.010742           -1.308           0.191
========================================================================================

Results for equation Exo_Var
=============================================================================
                coefficient       std. error           t-stat            prob
-----------------------------------------------------------------------------
const             -0.000000         0.000000           -0.093           0.926
L1.Value          -0.000000         0.000000           -4.129           0.000
L1.Exo_Var        -0.114682         0.104965           -1.093           0.275
=============================================================================

Correlation matrix of residuals
              Value   Exo_Var
Value      1.000000  0.156168
Exo_Var    0.156168  1.000000

Panel Data#

Tests#

Category

Type

Function

ANOVA & ANCOVA

sm.stats.anova_lm

Autocorrelation

Ljung-Box

sm.stats.diagnostic.acorr_ljungbox

Homoschedasticity

Breusch–Pagan

sm.stats.het_breuschpagan

White

sm.stats.het_white

Normality

D’Agostino-Pearson

sm.stats.normaltest

Kolmogorov-Smirnov

sm.stats.kstest

Shapiro-Wilk

sm.stats.shapiro

Parameter Values

Likelihood Ratio

compare_lr_test

Lagrange Multipliers

score_test

Wald

wald_test

Seasonality

Kwiatkowski-Phillips-Schmidt-Shin (KPSS)

sm.tsa.kpss

Stationarity

Dickey-Fuller (ADF)

sm.tsa.stattools.adfuller

Time Series Causality

Granger

sm.tsa.stattools.grangercausalitytests