Statsmodels#

Statsmodels provides statistical models and tools for data analysis. It supports the estimation and testing of statistical models in a variety of contexts, including econometrics, linear and non-linear regression, time series analysis, and more. The library is organized into modules that cover a wide range of statistical techniques, allowing users to perform hypothesis testing, statistical modeling, and exploration of relationships within datasets.
import statsmodels.api as sm
import numpy as np
import pandas as pd

np.random.seed(0)

# Toy dataset: a continuous target, two numeric predictors, and one categorical predictor
data = pd.DataFrame({
    'Target': np.random.normal(100, 10, 10),
    'X1': np.random.normal(0, 5, 10),
    'X2': np.random.uniform(0, 100, 10),
    'X3': np.random.choice(['A', 'B'], 10)
})

# One-hot encode the categorical column and cast the dummy to float for statsmodels
data = pd.get_dummies(data, columns=['X3'], drop_first=True)
data['X3_B'] = data['X3_B'].astype(float)
data
| | Target | X1 | X2 | X3_B |
|---|---|---|---|---|
| 0 | 117.640523 | 0.720218 | 14.335329 | 0.0 |
| 1 | 104.001572 | 7.271368 | 94.466892 | 0.0 |
| 2 | 109.787380 | 3.805189 | 52.184832 | 0.0 |
| 3 | 122.408932 | 0.608375 | 41.466194 | 1.0 |
| 4 | 118.675580 | 2.219316 | 26.455561 | 1.0 |
| 5 | 90.227221 | 1.668372 | 77.423369 | 0.0 |
| 6 | 109.500884 | 7.470395 | 45.615033 | 1.0 |
| 7 | 98.486428 | -1.025791 | 56.843395 | 0.0 |
| 8 | 98.967811 | 1.565339 | 1.878980 | 0.0 |
| 9 | 104.105985 | -4.270479 | 61.763550 | 1.0 |
Regression & Classification#
| Model | Function |
|---|---|
| Linear | sm.OLS |
| Logit | sm.Logit |
| Poisson | sm.Poisson |
| Gamma | sm.GLM with sm.families.Gamma() |
| Negative Binomial | sm.NegativeBinomial |
| Zero-Inflated Poisson | sm.ZeroInflatedPoisson |
| Zero-Inflated Negative Binomial | sm.ZeroInflatedNegativeBinomialP |
| Robust Regression | sm.RLM |
| Quantile Regression | sm.QuantReg |
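As a hedged illustration of the GLM-style entries in the table, the sketch below fits a Gamma model with a log link to the toy data above; the intercept, the predictor set, and the link are illustrative assumptions rather than part of the original example.
# Sketch: Gamma GLM with a log link on the toy data (illustrative choices, not from the original)
X = sm.add_constant(data[['X1', 'X2', 'X3_B']])
gamma_model = sm.GLM(data['Target'], X, family=sm.families.Gamma(link=sm.families.links.Log()))
gamma_results = gamma_model.fit()
print(gamma_results.summary())
The next cell fits the linear (OLS) entry on the same predictors, this time without a constant term.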
model = sm.OLS(data['Target'], data[['X1', 'X2', 'X3_B']])   # no constant term is added here (see note [1] below)
results = model.fit()
results.summary()
UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=10
| Dep. Variable: | Target | R-squared (uncentered): | 0.802 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared (uncentered): | 0.717 |
| Method: | Least Squares | F-statistic: | 9.444 |
| Date: | Sun, 03 Dec 2023 | Prob (F-statistic): | 0.00741 |
| Time: | 23:58:13 | Log-Likelihood: | -52.899 |
| No. Observations: | 10 | AIC: | 111.8 |
| Df Residuals: | 7 | BIC: | 112.7 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| X1 | 2.7610 | 5.473 | 0.505 | 0.629 | -10.180 | 15.702 |
| X2 | 1.2587 | 0.446 | 2.820 | 0.026 | 0.203 | 2.314 |
| X3_B | 54.3494 | 33.395 | 1.627 | 0.148 | -24.617 | 133.315 |
| Omnibus: | 1.154 | Durbin-Watson: | 1.854 |
|---|---|---|---|
| Prob(Omnibus): | 0.562 | Jarque-Bera (JB): | 0.864 |
| Skew: | 0.599 | Prob(JB): | 0.649 |
| Kurtosis: | 2.203 | Cond. No. | 99.9 |
Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
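Notes [1] and [2] can both be addressed when refitting: adding a constant via sm.add_constant restores the usual centered R², and a robust covariance type relaxes the error-covariance assumption. A minimal sketch, with HC3 as an assumed (illustrative) choice of robust estimator:
# Sketch: refit with an intercept (note [1]) and heteroscedasticity-robust standard errors (note [2])
X = sm.add_constant(data[['X1', 'X2', 'X3_B']])       # prepends a 'const' column of ones
robust_results = sm.OLS(data['Target'], X).fit(cov_type='HC3')
print(robust_results.summary())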
ANOVA & ANCOVA#
ANOVA
from statsmodels.formula.api import ols
model = ols('Target ~ X3_B', data=data)
results = model.fit()
anova_table = sm.stats.anova_lm(results, typ=2)
anova_table
| | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| X3_B | 263.979903 | 1.0 | 3.146072 | 0.114038 |
| Residual | 671.262297 | 8.0 | NaN | NaN |
ANCOVA
from statsmodels.formula.api import ols
model = ols('Target ~ X3_B + X1', data=data)
results = model.fit()
anova_table = sm.stats.anova_lm(results, typ=2)
anova_table
| | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| X3_B | 276.635923 | 1.0 | 2.961864 | 0.128928 |
| X1 | 17.467390 | 1.0 | 0.187018 | 0.678420 |
| Residual | 653.794908 | 7.0 | NaN | NaN |
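The formula interface also accepts interaction terms; the sketch below (an assumed extension of the example above, not part of the original analysis) tests whether the slope of X1 differs between the two X3 groups:
# Sketch: ANCOVA with a group-by-covariate interaction term
model = ols('Target ~ X3_B * X1', data=data)   # expands to X3_B + X1 + X3_B:X1
results = model.fit()
sm.stats.anova_lm(results, typ=2)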
Time Series#
# Synthetic series: linear trend, 12-period seasonality, and Gaussian noise
t = np.arange(100)
trend = 0.5 * t
season = 10 * np.sin(2 * np.pi * t / 12)
noise = np.random.normal(0, 1, 100)
y = trend + noise                   # trend + noise
y_season = trend + season + noise   # trend + seasonality + noise
# Exogenous variable: sin(2*pi*t) is zero at every integer t, so this column is only
# floating-point rounding error, which makes the ARIMAX and VAR fits below ill-conditioned
x = 5 * np.sin(2 * np.pi * t)
data = pd.DataFrame({'Time': t, 'Value': y, 'Value_Season': y_season, 'Exo_Var': x})
data
| | Time | Value | Value_Season | Exo_Var |
|---|---|---|---|---|
| 0 | 0 | 1.787484 | 1.787484 | 0.000000e+00 |
| 1 | 1 | -0.069517 | 4.930483 | -1.224647e-15 |
| 2 | 2 | 1.175387 | 9.835641 | -2.449294e-15 |
| 3 | 3 | 1.037494 | 11.037494 | -3.673940e-15 |
| 4 | 4 | 0.914199 | 9.574453 | -4.898587e-15 |
| ... | ... | ... | ... | ... |
| 95 | 95 | 47.935183 | 42.935183 | -9.760036e-15 |
| 96 | 96 | 47.954252 | 47.954252 | -1.175661e-13 |
| 97 | 97 | 48.549898 | 53.549898 | -2.253721e-13 |
| 98 | 98 | 48.064470 | 56.724724 | -3.331782e-13 |
| 99 | 99 | 49.787388 | 59.787388 | 1.274499e-13 |
100 rows × 4 columns
ARIMA#
from statsmodels.tsa.arima.model import ARIMA
model = ARIMA(data['Value'], order=(1, 1, 1))
results = model.fit()
results.summary()
| Dep. Variable: | Value | No. Observations: | 100 |
|---|---|---|---|
| Model: | ARIMA(1, 1, 1) | Log Likelihood | -165.516 |
| Date: | Sun, 03 Dec 2023 | AIC | 337.031 |
| Time: | 23:58:13 | BIC | 344.817 |
| Sample: | 0 | HQIC | 340.181 |
| | - 100 | | |
| Covariance Type: | opg | | |
| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| ar.L1 | -0.4647 | 0.222 | -2.096 | 0.036 | -0.899 | -0.030 |
| ma.L1 | 0.0899 | 0.256 | 0.352 | 0.725 | -0.411 | 0.591 |
| sigma2 | 1.6556 | 0.276 | 5.990 | 0.000 | 1.114 | 2.197 |
| Ljung-Box (L1) (Q): | 10.42 | Jarque-Bera (JB): | 0.98 |
|---|---|---|---|
| Prob(Q): | 0.00 | Prob(JB): | 0.61 |
| Heteroskedasticity (H): | 1.40 | Skew: | -0.24 |
| Prob(H) (two-sided): | 0.33 | Kurtosis: | 2.88 |
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
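The fitted results object can produce out-of-sample forecasts; a minimal sketch (the 10-step horizon is an arbitrary choice):
# Sketch: forecast the next 10 periods from the fitted ARIMA(1, 1, 1) model
forecast = results.forecast(steps=10)
print(forecast)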
SARIMA#
from statsmodels.tsa.statespace.sarimax import SARIMAX
model = SARIMAX(data['Value_Season'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))  # seasonal period 12 matches the simulated seasonality
results = model.fit()
results.summary()
| Dep. Variable: | Value_Season | No. Observations: | 100 |
|---|---|---|---|
| Model: | SARIMAX(1, 1, 1)x(1, 1, 1, 12) | Log Likelihood | -125.397 |
| Date: | Sun, 03 Dec 2023 | AIC | 260.794 |
| Time: | 23:58:14 | BIC | 273.123 |
| Sample: | 0 | HQIC | 265.758 |
| | - 100 | | |
| Covariance Type: | opg | | |
| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| ar.L1 | -0.0647 | 0.123 | -0.528 | 0.598 | -0.305 | 0.176 |
| ma.L1 | -0.9438 | 0.076 | -12.340 | 0.000 | -1.094 | -0.794 |
| ar.S.L12 | -0.0445 | 0.174 | -0.256 | 0.798 | -0.385 | 0.296 |
| ma.S.L12 | -0.9854 | 3.389 | -0.291 | 0.771 | -7.628 | 5.658 |
| sigma2 | 0.7542 | 2.483 | 0.304 | 0.761 | -4.113 | 5.621 |
| Ljung-Box (L1) (Q): | 0.00 | Jarque-Bera (JB): | 1.15 |
|---|---|---|---|
| Prob(Q): | 0.96 | Prob(JB): | 0.56 |
| Heteroskedasticity (H): | 1.67 | Skew: | -0.22 |
| Prob(H) (two-sided): | 0.17 | Kurtosis: | 2.66 |
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
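For the seasonal model, get_forecast additionally exposes prediction intervals; a short sketch (12 steps, one full seasonal cycle, is an assumed choice):
# Sketch: 12-step-ahead forecast with 95% confidence intervals from the SARIMAX results
pred = results.get_forecast(steps=12)
print(pred.predicted_mean)
print(pred.conf_int())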
ARIMAX & SARIMAX#
model = ARIMA(data['Value'], order=(1, 1, 1), exog=data[['Exo_Var']])
# model = SARIMAX(data['Value_Season'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 12), exog=data[['Exo_Var']])
results = model.fit()
results.summary()
| Dep. Variable: | Value | No. Observations: | 100 |
|---|---|---|---|
| Model: | ARIMA(1, 1, 1) | Log Likelihood | -162.957 |
| Date: | Sun, 03 Dec 2023 | AIC | 333.915 |
| Time: | 23:58:14 | BIC | 344.295 |
| Sample: | 0 | HQIC | 338.115 |
| | - 100 | | |
| Covariance Type: | opg | | |
| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Exo_Var | 1.954e+12 | 2.18e-14 | 8.97e+25 | 0.000 | 1.95e+12 | 1.95e+12 |
| ar.L1 | -0.5121 | 0.219 | -2.341 | 0.019 | -0.941 | -0.083 |
| ma.L1 | 0.1342 | 0.255 | 0.526 | 0.599 | -0.366 | 0.634 |
| sigma2 | 1.5720 | 0.268 | 5.868 | 0.000 | 1.047 | 2.097 |
| Ljung-Box (L1) (Q): | 10.99 | Jarque-Bera (JB): | 1.26 |
|---|---|---|---|
| Prob(Q): | 0.00 | Prob(JB): | 0.53 |
| Heteroskedasticity (H): | 1.34 | Skew: | -0.27 |
| Prob(H) (two-sided): | 0.40 | Kurtosis: | 2.91 |
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 5.9e+28. Standard errors may be unstable.
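Warning [2] is a consequence of the near-zero Exo_Var column noted earlier. When exogenous regressors are included, their future values must be supplied at forecast time; a minimal sketch (the 10-step horizon and the zero-valued placeholder exog are illustrative assumptions):
# Sketch: forecasting an ARIMAX model requires exog values for the forecast horizon
future_exog = np.zeros((10, 1))                    # placeholder future values of Exo_Var
forecast = results.forecast(steps=10, exog=future_exog)
print(forecast)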
VAR#
from statsmodels.tsa.api import VAR
var_data = data[['Value', 'Exo_Var']]   # Exo_Var is numerically ~0 (see note above), hence the extreme L1.Exo_Var coefficient below
model = VAR(var_data)
results = model.fit()
results.summary()
Summary of Regression Results
==================================
Model: VAR
Method: OLS
Date: Sun, 03, Dec, 2023
Time: 23:58:14
--------------------------------------------------------------------
No. of Equations: 2.00000 BIC: -59.2358
Nobs: 99.0000 HQIC: -59.3294
Log likelihood: 2665.01 FPE: 1.60673e-26
AIC: -59.3930 Det(Omega_mle): 1.51360e-26
--------------------------------------------------------------------
Results for equation Value
========================================================================================
coefficient std. error t-stat prob
----------------------------------------------------------------------------------------
const 0.527636 0.261169 2.020 0.043
L1.Value 0.992909 0.010100 98.311 0.000
L1.Exo_Var -1910883832505.852539 1461402552441.010742 -1.308 0.191
========================================================================================
Results for equation Exo_Var
=============================================================================
coefficient std. error t-stat prob
-----------------------------------------------------------------------------
const -0.000000 0.000000 -0.093 0.926
L1.Value -0.000000 0.000000 -4.129 0.000
L1.Exo_Var -0.114682 0.104965 -1.093 0.275
=============================================================================
Correlation matrix of residuals
Value Exo_Var
Value 1.000000 0.156168
Exo_Var 0.156168 1.000000
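Lag-order selection and forecasting follow the same results-object pattern; a brief sketch (maxlags=5 and the 5-step horizon are assumed values):
# Sketch: pick a lag order by information criteria, then forecast ahead
order = model.select_order(maxlags=5)
print(order.summary())
lag = results.k_ar                                    # lag order of the fitted VAR
forecast = results.forecast(var_data.values[-lag:], steps=5)
print(forecast)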
Panel Data#
Tests#
| Category | Type | Function |
|---|---|---|
| ANOVA & ANCOVA | | sm.stats.anova_lm |
| Autocorrelation | Ljung-Box | sm.stats.acorr_ljungbox |
| Homoscedasticity | Breusch–Pagan | sm.stats.het_breuschpagan |
| | White | sm.stats.het_white |
| Normality | D’Agostino-Pearson | scipy.stats.normaltest |
| | Kolmogorov-Smirnov | scipy.stats.kstest |
| | Shapiro-Wilk | scipy.stats.shapiro |
| Parameter Values | Likelihood Ratio | results.compare_lr_test |
| | Lagrange Multipliers | results.compare_lm_test |
| | Wald | results.wald_test |
| Stationarity | Kwiatkowski-Phillips-Schmidt-Shin (KPSS) | statsmodels.tsa.stattools.kpss |
| | Augmented Dickey-Fuller (ADF) | statsmodels.tsa.stattools.adfuller |
| Time Series Causality | Granger | statsmodels.tsa.stattools.grangercausalitytests |
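As a hedged illustration of how a few of these tests are called (reusing the time-series data above; the detrending regression and the lag counts are illustrative assumptions):
from scipy import stats
from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.tsa.stattools import adfuller

# Residuals from a simple detrending regression of Value on Time (illustrative)
resid = sm.OLS(data['Value'], sm.add_constant(data['Time'])).fit().resid

print(acorr_ljungbox(resid, lags=[10]))   # Ljung-Box: residual autocorrelation up to lag 10
print(stats.shapiro(resid))               # Shapiro-Wilk: normality of the residuals
print(adfuller(data['Value']))            # Augmented Dickey-Fuller: stationarity of the raw series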