Model Diagnostics¶
After fitting a GLM, you need to assess whether the model fits well and identify areas for improvement. This chapter covers the diagnostic tools available in RustyStats.
Residuals¶
Residuals measure the discrepancy between observed and fitted values. GLMs have several types of residuals, each with different properties.
Response Residuals¶
The simplest residual is just the difference between observed and fitted values:

\[ r_i = y_i - \hat{\mu}_i \]
Properties:

- Easy to interpret
- Not standardized (scale depends on \(\mu\))
- Variance is not constant
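For illustration, the response residual can be computed directly with NumPy (synthetic arrays, not RustyStats output):

```python
import numpy as np

# Synthetic observed values and fitted means (illustrative only)
y = np.array([0.0, 2.0, 1.0, 3.0])
mu = np.array([0.5, 1.5, 1.0, 2.5])

# Response residual: the raw difference between observed and fitted
resid_response = y - mu
```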
Pearson Residuals¶
Standardized by the variance function:

\[ r_i^P = \frac{y_i - \hat{\mu}_i}{\sqrt{V(\hat{\mu}_i)}} \]
Properties:

- Approximately standardized (unit variance under the model)
- Useful for detecting outliers
- Sum of squares = Pearson \(\chi^2\) statistic
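A minimal NumPy sketch for a Poisson model, where \(V(\mu) = \mu\); summing the squared residuals reproduces the Pearson \(\chi^2\) statistic (synthetic data):

```python
import numpy as np

# Synthetic observed counts and fitted means
y = np.array([0.0, 2.0, 1.0, 4.0])
mu = np.array([0.5, 1.5, 1.0, 2.0])

# Pearson residual for a Poisson model: V(mu) = mu
resid_pearson = (y - mu) / np.sqrt(mu)

# Sum of squares gives the Pearson chi-squared statistic
pearson_chi2 = np.sum(resid_pearson**2)
```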
Deviance Residuals¶
Based on the contribution to the deviance:

\[ r_i^D = \operatorname{sign}(y_i - \hat{\mu}_i)\,\sqrt{d_i} \]

where \(d_i\) is the unit deviance of observation \(i\).
Properties:

- Sum of squares = model deviance
- More normally distributed than Pearson residuals
- Preferred for most diagnostic plots
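Sketch of the Poisson case, where the unit deviance is \(d_i = 2\,[\,y_i \log(y_i/\hat{\mu}_i) - (y_i - \hat{\mu}_i)\,]\), with the log term taken as 0 when \(y_i = 0\) (synthetic data):

```python
import numpy as np

y = np.array([0.0, 2.0, 3.0])
mu = np.array([0.5, 1.5, 2.0])

# Poisson unit deviance; the y*log(y/mu) term is 0 when y == 0
with np.errstate(divide="ignore", invalid="ignore"):
    log_term = np.where(y > 0, y * np.log(y / mu), 0.0)
d = 2.0 * (log_term - (y - mu))

# Deviance residual: signed square root of the unit deviance
resid_dev = np.sign(y - mu) * np.sqrt(d)

# Sum of squared deviance residuals equals the model deviance
deviance = np.sum(resid_dev**2)
```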
Working Residuals¶
Used internally by IRLS:

\[ r_i^W = (y_i - \hat{\mu}_i)\,\frac{\partial \eta_i}{\partial \mu_i} \]
Properties:

- On the linear predictor scale
- Useful for partial residual plots
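For a log link, \(\partial\eta/\partial\mu = 1/\mu\), so the working residual reduces to \((y_i - \hat{\mu}_i)/\hat{\mu}_i\) (synthetic data):

```python
import numpy as np

y = np.array([0.0, 2.0, 3.0])
mu = np.array([0.5, 1.5, 2.0])

# Working residual with a log link: (y - mu) * d(eta)/d(mu) = (y - mu) / mu
resid_working = (y - mu) / mu
```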
Goodness-of-Fit Statistics¶
Deviance¶
The deviance measures overall model fit as twice the log-likelihood gap between the saturated model and the fitted model:

\[ D = 2\phi\,\bigl(\ell_{\text{sat}} - \ell_{\text{model}}\bigr) = \sum_i d_i \]
Lower deviance = better fit. For comparing nested models, the drop in deviance is asymptotically chi-squared under the null hypothesis that the extra terms are unnecessary:

\[ D_{\text{small}} - D_{\text{large}} \sim \chi^2_{\,p_{\text{large}} - p_{\text{small}}} \]
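For example, a likelihood-ratio test between two nested models can be run directly on their deviances (the deviances and parameter counts below are hypothetical):

```python
from scipy import stats

# Hypothetical deviances and parameter counts for two nested models
deviance_small, p_small = 512.4, 3
deviance_large, p_large = 498.1, 5

# Under H0, the deviance drop is chi-squared on the extra degrees of freedom
delta = deviance_small - deviance_large
p_value = stats.chi2.sf(delta, df=p_large - p_small)
```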
Null Deviance¶
Deviance of the intercept-only model, denoted \(D_{\text{null}}\).
Pseudo-R² (one of many definitions):

\[ R^2_{\text{pseudo}} = 1 - \frac{D}{D_{\text{null}}} \]
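Computed directly from the two deviances (hypothetical values):

```python
# Pseudo R-squared from model and null deviances (hypothetical values)
deviance = 498.1
null_deviance = 640.0

pseudo_r2 = 1.0 - deviance / null_deviance
```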
Pearson Chi-Squared¶
The sum of squared Pearson residuals:

\[ \chi^2_P = \sum_i \frac{(y_i - \hat{\mu}_i)^2}{V(\hat{\mu}_i)} \]
Dispersion Estimation¶
For families with unknown dispersion (Gaussian, Gamma, quasi-families), the dispersion is estimated from the Pearson chi-squared statistic:

\[ \hat{\phi} = \frac{\chi^2_P}{n - p} \]

where \(n\) is the number of observations and \(p\) the number of fitted parameters.
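A small NumPy sketch of this method-of-moments estimate for a Poisson-type variance (synthetic data; the parameter count \(p\) is hypothetical):

```python
import numpy as np

# Synthetic observed values and fitted means
y = np.array([0.0, 2.0, 1.0, 4.0, 3.0, 0.0])
mu = np.array([0.8, 1.6, 1.2, 2.4, 2.0, 0.6])
n, p = len(y), 2  # p: number of fitted parameters (hypothetical)

# phi_hat = Pearson chi2 / residual degrees of freedom, with V(mu) = mu
pearson_chi2 = np.sum((y - mu) ** 2 / mu)
phi_hat = pearson_chi2 / (n - p)
```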
Checking for Overdispersion¶
For Poisson/Binomial, dispersion should be ≈ 1:
```python
dispersion_ratio = result.pearson_chi2() / result.df_resid
print(f"Dispersion ratio: {dispersion_ratio:.2f}")

# > 1.5 suggests overdispersion
# < 0.7 suggests underdispersion
```
Information Criteria¶
Log-Likelihood¶
Both criteria below are computed from the maximized log-likelihood \(\ell\) of the fitted model.
AIC (Akaike Information Criterion)¶
Balances fit (log-likelihood) against complexity (number of parameters):

\[ \mathrm{AIC} = -2\ell + 2p \]
Lower is better. AIC favors parsimony.
BIC (Bayesian Information Criterion)¶
\[ \mathrm{BIC} = -2\ell + p \log n \]

More conservative than AIC for large samples: the \(p \log n\) penalty exceeds AIC's \(2p\) once \(n > e^2 \approx 7.4\).
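Both criteria follow directly from the log-likelihood; a quick check with hypothetical values:

```python
import numpy as np

# Hypothetical log-likelihood, parameter count, and sample size
llf, p, n = -245.2, 5, 1000

aic = -2 * llf + 2 * p
bic = -2 * llf + p * np.log(n)  # BIC penalty grows with log(n)
```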
When to Use Each¶
| Criterion | Penalty | Best For |
|---|---|---|
| AIC | 2p | Prediction, model averaging |
| BIC | p log(n) | Model selection, large samples |
Calibration Diagnostics¶
Calibration measures how well predicted probabilities/means match observed values.
Actual vs. Expected Ratio¶
```python
diagnostics = result.diagnostics(data=data, categorical_factors=["Region"])
print(f"Overall A/E: {diagnostics.calibration['actual_expected_ratio']:.3f}")
```
Interpretation:

- A/E = 1.0: perfect overall calibration
- A/E > 1.0: model underpredicts
- A/E < 1.0: model overpredicts
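The overall ratio is just total observed over total predicted, which can be checked by hand (synthetic arrays, not RustyStats output):

```python
import numpy as np

# Synthetic observed values and model predictions
y = np.array([0.0, 2.0, 1.0, 4.0])
mu = np.array([0.5, 1.5, 1.5, 3.5])

# A/E: total actual over total expected
ae = y.sum() / mu.sum()
```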
Calibration by Decile¶
Split predictions into 10 groups and compare A/E in each:
```python
for decile in diagnostics.calibration['by_decile']:
    print(f"Decile {decile['decile']}: A/E = {decile['ae_ratio']:.3f}")
```
Good calibration: A/E ≈ 1 in all deciles.
Hosmer-Lemeshow Test¶
Formal test of calibration for binomial models. Observations are sorted into \(G\) groups (typically deciles of predicted probability) and observed event counts are compared with expected counts:

\[ H = \sum_{g=1}^{G} \frac{(O_g - E_g)^2}{n_g \bar{\pi}_g (1 - \bar{\pi}_g)} \]

where \(O_g\) and \(E_g\) are the observed and expected event counts in group \(g\), \(n_g\) is the group size, and \(\bar{\pi}_g = E_g / n_g\). Under \(H_0\) (good calibration): \(H \sim \chi^2_{G-2}\)
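A sketch of the statistic on hypothetical grouped counts (observed events, expected events, and group sizes are made up for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical grouped data: observed events, expected events, group sizes
obs = np.array([5.0, 9.0, 14.0, 22.0, 30.0])
exp_events = np.array([6.2, 8.5, 15.1, 20.8, 29.4])
n_g = np.full(5, 100.0)

# Hosmer-Lemeshow statistic: sum of (O - E)^2 / (n * pi * (1 - pi))
pi = exp_events / n_g  # mean predicted probability per group
H = np.sum((obs - exp_events) ** 2 / (n_g * pi * (1 - pi)))

# Compare with chi-squared on G - 2 degrees of freedom
p_value = stats.chi2.sf(H, df=len(obs) - 2)
```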
Discrimination Diagnostics¶
Discrimination measures how well the model separates high and low values.
Gini Coefficient¶
Measures the area between the Lorenz curve \(L(x)\) and the line of equality, normalized so that the maximum is 1:

\[ G = 2\int_0^1 \bigl(L(x) - x\bigr)\,dx \]
Interpretation:

- Gini = 0: no discrimination (random predictions)
- Gini = 1: perfect discrimination
- Typical values: 0.3-0.5 for insurance models
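A sketch computing the Gini coefficient as twice the area between an empirical Lorenz curve and the diagonal, with records ordered by predicted risk (synthetic data, equal exposure per record):

```python
import numpy as np

# Synthetic predictions and observed response (equal exposure per record)
pred = np.array([0.1, 0.4, 0.2, 0.9, 0.6])
y = np.array([0.0, 1.0, 0.0, 3.0, 1.0])

# Order records by predicted risk (highest first), accumulate response share
order = np.argsort(-pred)
cum_y = np.concatenate([[0.0], np.cumsum(y[order]) / y.sum()])
cum_x = np.linspace(0.0, 1.0, len(y) + 1)  # cumulative exposure share

# Trapezoid area under the Lorenz curve; Gini = 2 * (area - 1/2)
area = np.sum((cum_y[1:] + cum_y[:-1]) / 2.0 * np.diff(cum_x))
gini = 2.0 * (area - 0.5)
```

With this descending-risk ordering the curve lies above the diagonal, so a discriminating model yields a positive Gini.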
Lorenz Curve¶
Plots cumulative % of response vs. cumulative % of exposure, ordered by predicted risk:
```python
lorenz = diagnostics.discrimination['lorenz_curve']
# Plot: x = cumulative exposure %, y = cumulative claims %
```
Lift¶
How much better does the model do than random in the top decile?

\[ \text{Lift} = \frac{\bar{y}_{\text{top decile}}}{\bar{y}} \]
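A sketch of the computation: take the mean response among the 10% of records with the highest predictions and divide by the overall mean (simulated data, not RustyStats output):

```python
import numpy as np

# Simulated example: outcomes correlated with predictions
rng = np.random.default_rng(0)
pred = rng.random(2000)
y = (rng.random(2000) < pred).astype(float)

# Lift: mean response in the top decile of predictions vs. the overall mean
top = np.argsort(-pred)[: len(pred) // 10]
lift = y[top].mean() / y.mean()
```

A lift of 2.0 would mean the top decile has twice the average response rate; a random model gives lift ≈ 1.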
Factor Diagnostics¶
Analyze model performance by individual factors.
A/E by Factor Level¶
For categorical factors:
```python
for factor in diagnostics.factors:
    if factor.factor_type == "categorical":
        for level in factor.actual_vs_expected:
            print(f"{factor.name}={level['level']}: A/E={level['ae_ratio']:.3f}")
```
Residual Patterns¶
Check if residuals are correlated with factors:
```python
for factor in diagnostics.factors:
    corr = factor.residual_pattern['correlation_with_residuals']
    if abs(corr) > 0.05:
        print(f"Warning: {factor.name} has residual correlation {corr:.3f}")
```
High correlation suggests the factor effect is misspecified.
Interaction Detection¶
RustyStats can automatically detect potential interactions:
```python
diagnostics = result.diagnostics(
    data=data,
    categorical_factors=["Region", "VehBrand"],
    continuous_factors=["Age"],
)

for ic in diagnostics.interaction_candidates:
    print(f"Consider: {ic['factor1']} × {ic['factor2']} (strength={ic['strength']:.3f})")
```
The algorithm uses greedy residual-based detection to find factor pairs that explain residual variance.
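The idea can be sketched as follows (an illustrative reimplementation, not RustyStats's actual algorithm): score each factor pair by the share of residual variance explained by the cell means of their cross-classification.

```python
import numpy as np
from itertools import combinations

def interaction_strength(resid, f1, f2):
    """Share of residual sum of squares explained by cell means of f1 x f2."""
    cells = {}
    for r, a, b in zip(resid, f1, f2):
        cells.setdefault((a, b), []).append(r)
    explained = sum(len(v) * np.mean(v) ** 2 for v in cells.values())
    total = float(np.sum(np.square(resid)))
    return explained / total if total > 0 else 0.0

# Residuals with a pure Region x Brand interaction pattern (synthetic)
resid = np.array([1.0, 1.2, -1.1, -0.9, -1.0, -1.2, 1.1, 0.9])
factors = {
    "Region": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Brand":  ["X", "X", "Y", "Y", "X", "X", "Y", "Y"],
}

scores = {
    (f1, f2): interaction_strength(resid, factors[f1], factors[f2])
    for f1, f2 in combinations(factors, 2)
}
```

A pair whose cell means soak up most of the residual variance, as here, is a strong interaction candidate.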
Pre-Fit Data Exploration¶
Explore data before fitting any model:
```python
exploration = rs.explore_data(
    data=data,
    response="ClaimNb",
    categorical_factors=["Region", "VehBrand"],
    continuous_factors=["Age", "VehPower"],
    exposure="Exposure",
    family="poisson",
    detect_interactions=True,
)

print(exploration.response_stats)
print(exploration.to_json())
```
JSON Export for LLM Integration¶
Export diagnostics in compact JSON format:
```python
json_str = result.diagnostics_json(
    data=data,
    categorical_factors=["Region"],
    continuous_factors=["Age"],
)

# Feed to an LLM for analysis
response = llm.analyze(f"Analyze this model: {json_str}")
```
Diagnostic Plots (External)¶
RustyStats focuses on computation. For visualization, use matplotlib/plotly:
```python
import matplotlib.pyplot as plt
from scipy import stats

# Residuals vs. fitted values
plt.scatter(result.fittedvalues, result.resid_deviance())
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Fitted Values')
plt.ylabel('Deviance Residuals')

# Q-Q plot of deviance residuals
stats.probplot(result.resid_deviance(), plot=plt)

# Calibration plot
# ... using diagnostics.calibration data
```
Summary¶
Key diagnostics workflow:
- Check convergence: `result.converged`, `result.iterations`
- Assess overall fit: deviance, AIC, BIC
- Check dispersion: Pearson χ²/df ≈ 1
- Examine residuals: patterns suggest model misspecification
- Verify calibration: A/E by decile
- Assess discrimination: Gini coefficient
- Factor analysis: A/E by level, residual correlations
- Detect interactions: automated candidates