Statsmodels

Statsmodels#

Statsmodels provides statistical models and tools for data analysis. It is designed to support the estimation and testing of statistical models in various contexts, including econometrics, linear and non-linear regression, time series analysis, and more. The library is divided into different modules that cover a wide range of statistical techniques and allows users to perform hypothesis testing, statistical modeling, and exploration of relationships within datasets.

import statsmodels.api as sm

import numpy as np
import pandas as pd

np.random.seed(0)

data = pd.DataFrame({
    'Target': np.random.normal(100, 10, 10),
    'X1': np.random.normal(0, 5, 10),
    'X2': np.random.uniform(0, 100, 10),
    'X3': np.random.choice(['A', 'B'], 10)
})

data = pd.get_dummies(data, columns=['X3'], drop_first=True)
data['X3_B'] = data['X3_B'].astype(float)

data

	Target	X1	X2	X3_B
0	117.640523	0.720218	14.335329	0.0
1	104.001572	7.271368	94.466892	0.0
2	109.787380	3.805189	52.184832	0.0
3	122.408932	0.608375	41.466194	1.0
4	118.675580	2.219316	26.455561	1.0
5	90.227221	1.668372	77.423369	0.0
6	109.500884	7.470395	45.615033	1.0
7	98.486428	-1.025791	56.843395	0.0
8	98.967811	1.565339	1.878980	0.0
9	104.105985	-4.270479	61.763550	1.0

Regression & Classification#

Model	Function
Linear	`sm.OLS`
Logit	`sm.Logit`
Poisson	`sm.GLM` with `family=sm.families.Poisson()`
Gamma	`sm.GLM` with `family=sm.families.Gamma()`
Negative Binomial	`sm.GLM` with `family=sm.families.NegativeBinomial()`
Zero-Inflated Poisson	`ZeroInflatedPoisson`
Zero-Inflated Negative Binomial	`ZeroInflatedNegativeBinomial`
Robust Regression	`sm.RLM`
Quantile Regression	`sm.QuantReg`

model = sm.OLS(data['Target'], data[['X1', 'X2', 'X3_B']])
results = model.fit()

results.summary()

C:\Users\silve\AppData\Roaming\jupyterlab-desktop\jlab_server\lib\site-packages\scipy\stats\_stats_py.py:1736: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=10
  warnings.warn("kurtosistest only valid for n>=20 ... continuing "

OLS Regression Results
Dep. Variable:	Target	R-squared (uncentered):	0.802
Model:	OLS	Adj. R-squared (uncentered):	0.717
Method:	Least Squares	F-statistic:	9.444
Date:	Sun, 03 Dec 2023	Prob (F-statistic):	0.00741
Time:	23:58:13	Log-Likelihood:	-52.899
No. Observations:	10	AIC:	111.8
Df Residuals:	7	BIC:	112.7
Df Model:	3
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
X1	2.7610	5.473	0.505	0.629	-10.180	15.702
X2	1.2587	0.446	2.820	0.026	0.203	2.314
X3_B	54.3494	33.395	1.627	0.148	-24.617	133.315

Omnibus:	1.154	Durbin-Watson:	1.854
Prob(Omnibus):	0.562	Jarque-Bera (JB):	0.864
Skew:	0.599	Prob(JB):	0.649
Kurtosis:	2.203	Cond. No.	99.9

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.

ANOVA & ANCOVA#

ANOVA

from statsmodels.formula.api import ols

model = ols('Target ~ X3_B', data=data)
results = model.fit()

anova_table = sm.stats.anova_lm(results, typ=2)
anova_table

	sum_sq	df	F	PR(>F)
X3_B	263.979903	1.0	3.146072	0.114038
Residual	671.262297	8.0	NaN	NaN

ANCOVA

from statsmodels.formula.api import ols

model = ols('Target ~ X3_B + X1', data=data)
results = model.fit()

anova_table = sm.stats.anova_lm(results, typ=2)
anova_table

	sum_sq	df	F	PR(>F)
X3_B	276.635923	1.0	2.961864	0.128928
X1	17.467390	1.0	0.187018	0.678420
Residual	653.794908	7.0	NaN	NaN

Time Series#

t = np.arange(100)
trend = 0.5 * t
season = 10 * np.sin(2 * np.pi * t / 12)
noise = np.random.normal(0, 1, 100)

y = trend + noise
y_season = trend + season + noise
x = 5 * np.sin(2 * np.pi * t)

data = pd.DataFrame({'Time': t, 'Value': y, 'Value_Season': y_season, 'Exo_Var': x})
data

	Time	Value	Value_Season	Exo_Var
0	0	1.787484	1.787484	0.000000e+00
1	1	-0.069517	4.930483	-1.224647e-15
2	2	1.175387	9.835641	-2.449294e-15
3	3	1.037494	11.037494	-3.673940e-15
4	4	0.914199	9.574453	-4.898587e-15
...	...	...	...	...
95	95	47.935183	42.935183	-9.760036e-15
96	96	47.954252	47.954252	-1.175661e-13
97	97	48.549898	53.549898	-2.253721e-13
98	98	48.064470	56.724724	-3.331782e-13
99	99	49.787388	59.787388	1.274499e-13

100 rows × 4 columns

ARIMA#

from statsmodels.tsa.arima.model import ARIMA

model = ARIMA(data['Value'], order=(1, 1, 1))
results = model.fit()

results.summary()

SARIMAX Results
Dep. Variable:	Value	No. Observations:	100
Model:	ARIMA(1, 1, 1)	Log Likelihood	-165.516
Date:	Sun, 03 Dec 2023	AIC	337.031
Time:	23:58:13	BIC	344.817
Sample:	0	HQIC	340.181
	- 100
Covariance Type:	opg

	coef	std err	z	P>\|z\|	[0.025	0.975]
ar.L1	-0.4647	0.222	-2.096	0.036	-0.899	-0.030
ma.L1	0.0899	0.256	0.352	0.725	-0.411	0.591
sigma2	1.6556	0.276	5.990	0.000	1.114	2.197

Ljung-Box (L1) (Q):	10.42	Jarque-Bera (JB):	0.98
Prob(Q):	0.00	Prob(JB):	0.61
Heteroskedasticity (H):	1.40	Skew:	-0.24
Prob(H) (two-sided):	0.33	Kurtosis:	2.88

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

SARIMA#

from statsmodels.tsa.statespace.sarimax import SARIMAX

model = SARIMAX(data['Value_Season'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
results = model.fit()

results.summary()

SARIMAX Results
Dep. Variable:	Value_Season	No. Observations:	100
Model:	SARIMAX(1, 1, 1)x(1, 1, 1, 12)	Log Likelihood	-125.397
Date:	Sun, 03 Dec 2023	AIC	260.794
Time:	23:58:14	BIC	273.123
Sample:	0	HQIC	265.758
	- 100
Covariance Type:	opg

	coef	std err	z	P>\|z\|	[0.025	0.975]
ar.L1	-0.0647	0.123	-0.528	0.598	-0.305	0.176
ma.L1	-0.9438	0.076	-12.340	0.000	-1.094	-0.794
ar.S.L12	-0.0445	0.174	-0.256	0.798	-0.385	0.296
ma.S.L12	-0.9854	3.389	-0.291	0.771	-7.628	5.658
sigma2	0.7542	2.483	0.304	0.761	-4.113	5.621

Ljung-Box (L1) (Q):	0.00	Jarque-Bera (JB):	1.15
Prob(Q):	0.96	Prob(JB):	0.56
Heteroskedasticity (H):	1.67	Skew:	-0.22
Prob(H) (two-sided):	0.17	Kurtosis:	2.66

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

ARIMAX & SARIMAX#

model = ARIMA(data['Value'], order=(1, 1, 1), exog=data[['Exo_Var']])
# model = SARIMAX(data['Value_Season'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
results = model.fit()

results.summary()

SARIMAX Results
Dep. Variable:	Value	No. Observations:	100
Model:	ARIMA(1, 1, 1)	Log Likelihood	-162.957
Date:	Sun, 03 Dec 2023	AIC	333.915
Time:	23:58:14	BIC	344.295
Sample:	0	HQIC	338.115
	- 100
Covariance Type:	opg

	coef	std err	z	P>\|z\|	[0.025	0.975]
Exo_Var	1.954e+12	2.18e-14	8.97e+25	0.000	1.95e+12	1.95e+12
ar.L1	-0.5121	0.219	-2.341	0.019	-0.941	-0.083
ma.L1	0.1342	0.255	0.526	0.599	-0.366	0.634
sigma2	1.5720	0.268	5.868	0.000	1.047	2.097

Ljung-Box (L1) (Q):	10.99	Jarque-Bera (JB):	1.26
Prob(Q):	0.00	Prob(JB):	0.53
Heteroskedasticity (H):	1.34	Skew:	-0.27
Prob(H) (two-sided):	0.40	Kurtosis:	2.91

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 5.9e+28. Standard errors may be unstable.

VAR#

from statsmodels.tsa.api import VAR

var_data = data[['Value', 'Exo_Var']]
model = VAR(var_data)
results = model.fit()

results.summary()

  Summary of Regression Results   
==================================
Model:                         VAR
Method:                        OLS
Date:           Sun, 03, Dec, 2023
Time:                     23:58:14
--------------------------------------------------------------------
No. of Equations:         2.00000    BIC:                   -59.2358
Nobs:                     99.0000    HQIC:                  -59.3294
Log likelihood:           2665.01    FPE:                1.60673e-26
AIC:                     -59.3930    Det(Omega_mle):     1.51360e-26
--------------------------------------------------------------------
Results for equation Value
========================================================================================
                      coefficient            std. error           t-stat            prob
----------------------------------------------------------------------------------------
const                    0.527636              0.261169            2.020           0.043
L1.Value                 0.992909              0.010100           98.311           0.000
L1.Exo_Var  -1910883832505.852539  1461402552441.010742           -1.308           0.191
========================================================================================

Results for equation Exo_Var
=============================================================================
                coefficient       std. error           t-stat            prob
-----------------------------------------------------------------------------
const             -0.000000         0.000000           -0.093           0.926
L1.Value          -0.000000         0.000000           -4.129           0.000
L1.Exo_Var        -0.114682         0.104965           -1.093           0.275
=============================================================================

Correlation matrix of residuals
              Value   Exo_Var
Value      1.000000  0.156168
Exo_Var    0.156168  1.000000

Panel Data#

Tests#

Category	Type	Function
ANOVA & ANCOVA		`sm.stats.anova_lm`
Autocorrelation	Ljung-Box	`sm.stats.diagnostic.acorr_ljungbox`
Homoschedasticity	Breusch–Pagan	`sm.stats.het_breuschpagan`
	White	`sm.stats.het_white`
Normality	D’Agostino-Pearson	`sm.stats.normaltest`
	Kolmogorov-Smirnov	`sm.stats.kstest`
	Shapiro-Wilk	`sm.stats.shapiro`
Parameter Values	Likelihood Ratio	`compare_lr_test`
	Lagrange Multipliers	`score_test`
	Wald	`wald_test`
Seasonality	Kwiatkowski-Phillips-Schmidt-Shin (KPSS)	`sm.tsa.kpss`
Stationarity	Dickey-Fuller (ADF)	`sm.tsa.stattools.adfuller`
Time Series Causality	Granger	`sm.tsa.stattools.grangercausalitytests`