Regularization¶
Regularization adds a penalty to the objective function to prevent overfitting and enable variable selection. This chapter covers Ridge, Lasso, and Elastic Net regularization for GLMs.
Why Regularize?¶
Standard GLM fitting minimizes the deviance:

\[\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \; D(\mathbf{y}, \boldsymbol{\mu})\]
Problems can arise with:

- Many predictors: overfitting, poor generalization
- Correlated predictors: unstable coefficients
- Variable selection: which predictors matter?
Regularization adds a penalty term:

\[\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \left\{ D(\mathbf{y}, \boldsymbol{\mu}) + \lambda P(\boldsymbol{\beta}) \right\}\]

where \(\lambda \geq 0\) controls the penalty strength and \(P(\boldsymbol{\beta})\) is the penalty function.
Ridge Regression (L2)¶
The Penalty¶

\[P(\boldsymbol{\beta}) = \|\boldsymbol{\beta}\|_2^2 = \sum_{j=1}^{p} \beta_j^2\]
Full Objective¶

\[\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \left\{ D(\mathbf{y}, \boldsymbol{\mu}) + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}\]
Properties¶
- Shrinks all coefficients toward zero
- Never sets coefficients exactly to zero
- Handles multicollinearity well
- Smooth, differentiable penalty
When to Use¶
- Many correlated predictors
- Want to keep all variables
- Stabilize coefficient estimates
Example¶
import rustystats as rs
# Ridge regression (l1_ratio = 0)
result = rs.glm("y ~ x1 + x2 + C(cat)", data, family="gaussian").fit(
    alpha=0.1,    # Penalty strength
    l1_ratio=0.0  # Pure L2 (Ridge)
)
Lasso Regression (L1)¶
The Penalty¶

\[P(\boldsymbol{\beta}) = \|\boldsymbol{\beta}\|_1 = \sum_{j=1}^{p} |\beta_j|\]
Full Objective¶

\[\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \left\{ D(\mathbf{y}, \boldsymbol{\mu}) + \lambda \sum_{j=1}^{p} |\beta_j| \right\}\]
Properties¶
- Sets some coefficients exactly to zero
- Performs variable selection
- Selects at most \(n\) variables when \(p > n\) (high-dimensional settings)
- Non-smooth penalty (requires special optimization)
The Soft Thresholding Operator¶
The non-smooth L1 penalty is handled with the soft thresholding operator

\[S(z, \gamma) = \operatorname{sign}(z) \max(|z| - \gamma, 0)\]

implemented as:
pub fn soft_threshold(z: f64, gamma: f64) -> f64 {
    if z > gamma {
        z - gamma
    } else if z < -gamma {
        z + gamma
    } else {
        0.0
    }
}
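For readers following along in Python, the same operator can be mirrored directly (an illustrative translation, not part of the rustystats API):

```python
def soft_threshold(z: float, gamma: float) -> float:
    """S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    if z > gamma:
        return z - gamma
    elif z < -gamma:
        return z + gamma
    return 0.0

# Values inside [-gamma, gamma] collapse to exactly zero,
# which is what makes Lasso solutions sparse.
print(soft_threshold(3.0, 1.0))   # 2.0
print(soft_threshold(-3.0, 1.0))  # -2.0
print(soft_threshold(0.5, 1.0))   # 0.0
```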
When to Use¶
- Feature selection is desired
- Many predictors, few are relevant
- Interpretable sparse models
Example¶
# Lasso (l1_ratio = 1)
result = rs.glm("y ~ x1 + x2 + C(cat)", data, family="poisson").fit(
    alpha=0.1,    # Penalty strength
    l1_ratio=1.0  # Pure L1 (Lasso)
)
print(f"Non-zero coefficients: {result.n_nonzero()}")
print(f"Selected features: {result.selected_features()}")
Elastic Net¶
The Penalty¶
A weighted combination of L1 and L2:

\[P(\boldsymbol{\beta}) = \rho \|\boldsymbol{\beta}\|_1 + \frac{1 - \rho}{2} \|\boldsymbol{\beta}\|_2^2\]

where \(\rho \in [0, 1]\) is the L1 ratio.
Full Objective¶

\[\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \left\{ D(\mathbf{y}, \boldsymbol{\mu}) + \lambda \left[ \rho \|\boldsymbol{\beta}\|_1 + \frac{1 - \rho}{2} \|\boldsymbol{\beta}\|_2^2 \right] \right\}\]
Properties¶
- Combines variable selection (L1) with stability (L2)
- Can select more than \(n\) variables
- Handles groups of correlated predictors better than pure Lasso
When to Use¶
- Correlated predictors that should be selected together
- Want some variable selection but also stability
- General default when unsure
Example¶
# Elastic Net (0 < l1_ratio < 1)
result = rs.glm("y ~ x1 + x2 + C(cat)", data, family="gaussian").fit(
    alpha=0.1,
    l1_ratio=0.5  # 50% L1, 50% L2
)
Coordinate Descent Algorithm¶
RustyStats uses coordinate descent for regularized GLMs, following the glmnet approach.
Algorithm Overview¶
Initialize β = 0 (or from IRLS)
Repeat until converged:
    For j = 1, ..., p:
        Compute partial residual: r_j = z - X_{-j} β_{-j}
        Update β_j by soft thresholding:
            β_j = S(⟨X_j, W r_j⟩, λρ) / (⟨X_j, W X_j⟩ + λ(1-ρ))
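The loop above can be sketched in pure Python for the Gaussian case with unit weights (W = I), so the working response z is just y. This is an illustrative toy, not the rustystats implementation:

```python
def soft_threshold(z, gamma):
    # S(z, gamma) = sign(z) * max(|z| - gamma, 0)
    return (z - gamma) if z > gamma else (z + gamma) if z < -gamma else 0.0

def lasso_cd(X, y, lam, n_sweeps=50):
    """Coordinate descent for (1/2n)||y - X beta||^2 + lam * ||beta||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            # Partial residual: response minus the fit from all other coordinates
            r = [y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                 for i in range(n)]
            z = sum(X[i][j] * r[i] for i in range(n)) / n
            denom = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(z, lam) / denom
    return beta

# Orthogonal toy design: here the Lasso solution is soft-thresholded OLS
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [3 * row[0] + 0.5 * row[1] for row in X]
print(lasso_cd(X, y, lam=1.0))  # [2.0, 0.0]: second coefficient exactly zero
```

With orthogonal columns each coordinate update is exact, so the algorithm converges in a single sweep; correlated columns are what make repeated sweeps necessary.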
Why Coordinate Descent?¶
- Efficient: \(O(np)\) per cycle
- Warm starts: Path from large λ to small λ
- Sparse updates: Skip zero coefficients
Implementation¶
Key optimizations:

- Covariance updates (compute X'WX once)
- Active set strategy (focus on non-zero coefficients)
- Parallelization via Rayon
Choosing λ (Alpha)¶
The penalty strength λ (called alpha in RustyStats API) is crucial.
Tuning Alpha¶
Try different alpha values to find the right balance:
# Test different regularization strengths
for alpha in [0.001, 0.01, 0.1, 1.0]:
    result = rs.glm("y ~ x1 + x2 + C(cat)", data, family="poisson").fit(
        alpha=alpha, l1_ratio=1.0
    )
    print(f"α={alpha}: {result.n_nonzero()} features, deviance={result.deviance:.2f}")
The 1-SE Rule¶
Choose the largest \(\lambda\) whose cross-validated error is within one standard error of the minimum:

\[\lambda_{1\text{SE}} = \max \left\{ \lambda : \mathrm{CV}(\lambda) \leq \mathrm{CV}(\lambda_{\min}) + \mathrm{SE}(\lambda_{\min}) \right\}\]
This gives a simpler model with comparable performance.
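The rule is straightforward to apply given per-lambda CV means and standard errors. A minimal sketch with hypothetical numbers (not a rustystats API):

```python
def lambda_1se(lambdas, cv_mean, cv_se):
    """Largest lambda whose CV error is within 1 SE of the minimum.
    Assumes `lambdas` is sorted largest-to-smallest (glmnet-style path)."""
    i_min = min(range(len(cv_mean)), key=cv_mean.__getitem__)
    threshold = cv_mean[i_min] + cv_se[i_min]
    for lam, err in zip(lambdas, cv_mean):
        if err <= threshold:
            return lam  # first hit on a decreasing path = largest lambda

# Hypothetical CV results along a decreasing path:
lambdas = [1.0, 0.5, 0.1, 0.01]
cv_mean = [8.0, 5.2, 5.0, 5.1]
cv_se   = [0.4, 0.3, 0.3, 0.3]
print(lambda_1se(lambdas, cv_mean, cv_se))  # 0.5, not the minimizer 0.1
```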
Standardization¶
Why Standardize?¶
The penalty is applied uniformly to all coefficients. Without standardization, a variable on a larger scale has a numerically smaller coefficient for the same effect size, so it is effectively penalized less than a variable on a smaller scale.
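A toy numeric illustration of the issue, using the soft-thresholding update from earlier (hypothetical data, not library code): the same feature expressed in different units crosses the penalty threshold differently.

```python
def soft_threshold(z, gamma):
    return (z - gamma) if z > gamma else (z + gamma) if z < -gamma else 0.0

def lasso_coef_1d(x, y, lam):
    """Single-predictor Lasso coefficient: S(<x,y>/n, lam) / (<x,x>/n)."""
    n = len(x)
    z = sum(a * b for a, b in zip(x, y)) / n
    return soft_threshold(z, lam) / (sum(a * a for a in x) / n)

x_m  = [100.0, -100.0, 50.0, -50.0]   # a distance, in meters
x_km = [v / 1000.0 for v in x_m]      # the same distance, in kilometers
y    = [0.01 * v for v in x_m]        # identical response in both cases

print(lasso_coef_1d(x_m, y, lam=0.5))   # nonzero: the feature survives
print(lasso_coef_1d(x_km, y, lam=0.5))  # 0.0: zeroed out purely by units
```

Standardizing both versions to unit variance would give identical fits, which is why the library does it internally.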
Internal Standardization¶
RustyStats internally standardizes features before fitting:
# Features are standardized internally
# Coefficients are returned on original scale
result = rs.glm("y ~ x1 + x2", data, family="gaussian").fit(alpha=0.1, l1_ratio=1.0)
Intercept¶
The intercept \(\beta_0\) is never penalized: the penalty sums above run over \(j = 1, \ldots, p\) only, so the overall mean level of the response is not shrunk toward zero.
Regularization with Different Families¶
Regularization works with all GLM families:
# Regularized Poisson
result = rs.glm("y ~ x1 + x2", data, family="poisson").fit(alpha=0.1, l1_ratio=1.0)
# Regularized Binomial (Logistic)
result = rs.glm("y ~ x1 + x2", data, family="binomial").fit(alpha=0.1, l1_ratio=0.5)
# Regularized Gamma
result = rs.glm("y ~ x1 + x2", data, family="gamma").fit(alpha=0.1, l1_ratio=0.0)
The coordinate descent algorithm adapts to each family's variance function and link.
Degrees of Freedom¶
For regularized models, the effective degrees of freedom is less than \(p\):
- Ridge: \(\text{df} = \text{tr}(\mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T)\)
- Lasso: Approximately the number of non-zero coefficients
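Since \(\text{tr}(\mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T) = \sum_j d_j / (d_j + \lambda)\), where \(d_j\) are the eigenvalues of \(\mathbf{X}^T\mathbf{X}\), the Ridge df is easy to check numerically in a simple diagonal case (illustrative sketch):

```python
def ridge_df(eigvals, lam):
    """Effective df of Ridge: sum_j d_j / (d_j + lam),
    where d_j are the eigenvalues of X^T X."""
    return sum(d / (d + lam) for d in eigvals)

# Orthonormal design (X^T X = I), p = 4:
print(ridge_df([1.0, 1.0, 1.0, 1.0], lam=0.0))  # 4.0: no penalty, full df
print(ridge_df([1.0, 1.0, 1.0, 1.0], lam=1.0))  # 2.0: each direction halved
```

Note that Ridge df shrinks continuously with \(\lambda\), whereas Lasso df drops in steps as coefficients hit zero.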
Regularization and Inference¶
Standard Errors with Regularization
Standard errors from regularized models are biased and should be interpreted with caution. The coefficients are shrunk toward zero, which affects the sampling distribution.
For valid inference with selected variables:

1. Use cross-validation to select variables
2. Refit an unregularized model with the selected variables
3. Use standard errors from the unregularized model
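A toy numeric illustration of why refitting matters, in pure Python with an orthogonal design (where the Lasso solution is just soft-thresholded OLS; hypothetical numbers):

```python
def soft_threshold(z, gamma):
    return (z - gamma) if z > gamma else (z + gamma) if z < -gamma else 0.0

# Orthogonal design with <x_j, x_j>/n = 1
x1 = [1.0, 1.0, -1.0, -1.0]
x2 = [1.0, -1.0, 1.0, -1.0]
y  = [3 * a + 0.5 * b for a, b in zip(x1, x2)]
n  = len(y)

# OLS coefficients (orthogonal case): <x_j, y>/n
ols = [sum(a * b for a, b in zip(col, y)) / n for col in (x1, x2)]

# Lasso selects x1 and zeroes x2, but the kept coefficient is shrunk
lasso = [soft_threshold(z, 1.0) for z in ols]       # [2.0, 0.0]

# Refit without the penalty on the selected variable only: shrinkage gone
refit = sum(a * b for a, b in zip(x1, y)) / n       # 3.0
```

The Lasso estimate 2.0 is biased toward zero; the unpenalized refit recovers the full 3.0, and its standard errors have the usual interpretation.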
Summary¶
| Method | Penalty | Variable Selection | Best For |
|---|---|---|---|
| Ridge | L2: \(\|\boldsymbol{\beta}\|_2^2\) | No | Multicollinearity |
| Lasso | L1: \(\|\boldsymbol{\beta}\|_1\) | Yes | Sparse models |
| Elastic Net | Mix | Yes (grouped) | Correlated predictors |
Key parameters:
- alpha: Penalty strength (larger = more shrinkage)
- l1_ratio: Mix of L1/L2 (1 = Lasso, 0 = Ridge)
Best practices:
1. Use cross-validation to choose alpha
2. Consider the 1-SE rule for parsimony
3. Start with l1_ratio=0.5 (Elastic Net) unless you have a specific reason