Skip to content

Regularization

Regularization adds a penalty to the objective function to prevent overfitting and enable variable selection. This chapter covers Ridge, Lasso, and Elastic Net regularization for GLMs.

Why Regularize?

Standard GLM fitting minimizes deviance:

\[ \hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} D(\boldsymbol{\beta}) \]

Problems can arise with: - Many predictors: Overfitting, poor generalization - Correlated predictors: Unstable coefficients - Variable selection: Which predictors matter?

Regularization adds a penalty term:

\[ \hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \left[ D(\boldsymbol{\beta}) + \lambda P(\boldsymbol{\beta}) \right] \]

where \(\lambda\) controls the penalty strength and \(P(\boldsymbol{\beta})\) is the penalty function.

Ridge Regression (L2)

The Penalty

\[ P_{\text{Ridge}}(\boldsymbol{\beta}) = \sum_{j=1}^{p} \beta_j^2 = \|\boldsymbol{\beta}\|_2^2 \]

Full Objective

\[ \min_{\boldsymbol{\beta}} \left[ D(\boldsymbol{\beta}) + \lambda \sum_{j=1}^{p} \beta_j^2 \right] \]

Properties

  • Shrinks all coefficients toward zero
  • Never sets coefficients exactly to zero
  • Handles multicollinearity well
  • Smooth, differentiable penalty

When to Use

  • Many correlated predictors
  • Want to keep all variables
  • Stabilize coefficient estimates

Example

import rustystats as rs

# Ridge regression (l1_ratio = 0)
result = rs.glm("y ~ x1 + x2 + C(cat)", data, family="gaussian").fit(
    alpha=0.1,      # Penalty strength
    l1_ratio=0.0    # Pure L2 (Ridge)
)

Lasso Regression (L1)

The Penalty

\[ P_{\text{Lasso}}(\boldsymbol{\beta}) = \sum_{j=1}^{p} |\beta_j| = \|\boldsymbol{\beta}\|_1 \]

Full Objective

\[ \min_{\boldsymbol{\beta}} \left[ D(\boldsymbol{\beta}) + \lambda \sum_{j=1}^{p} |\beta_j| \right] \]

Properties

  • Sets some coefficients exactly to zero
  • Performs variable selection
  • Selects at most \(n\) variables (in high-dimensional settings)
  • Non-smooth penalty (requires special optimization)

The Soft Thresholding Operator

Lasso uses soft thresholding:

\[ S(z, \gamma) = \text{sign}(z) \max(|z| - \gamma, 0) \]
pub fn soft_threshold(z: f64, gamma: f64) -> f64 {
    if z > gamma {
        z - gamma
    } else if z < -gamma {
        z + gamma
    } else {
        0.0
    }
}

When to Use

  • Feature selection is desired
  • Many predictors, few are relevant
  • Interpretable sparse models

Example

# Lasso (l1_ratio = 1)
result = rs.glm("y ~ x1 + x2 + C(cat)", data, family="poisson").fit(
    alpha=0.1,      # Penalty strength
    l1_ratio=1.0    # Pure L1 (Lasso)
)

print(f"Non-zero coefficients: {result.n_nonzero()}")
print(f"Selected features: {result.selected_features()}")

Elastic Net

The Penalty

A weighted combination of L1 and L2:

\[ P_{\text{EN}}(\boldsymbol{\beta}) = \rho \|\boldsymbol{\beta}\|_1 + \frac{1-\rho}{2} \|\boldsymbol{\beta}\|_2^2 \]

where \(\rho \in [0, 1]\) is the L1 ratio.

Full Objective

\[ \min_{\boldsymbol{\beta}} \left[ D(\boldsymbol{\beta}) + \lambda \left( \rho \sum_j |\beta_j| + \frac{1-\rho}{2} \sum_j \beta_j^2 \right) \right] \]

Properties

  • Combines variable selection (L1) with stability (L2)
  • Can select more than \(n\) variables
  • Handles groups of correlated predictors better than pure Lasso

When to Use

  • Correlated predictors that should be selected together
  • Want some variable selection but also stability
  • General default when unsure

Example

# Elastic Net (0 < l1_ratio < 1)
result = rs.glm("y ~ x1 + x2 + C(cat)", data, family="gaussian").fit(
    alpha=0.1,
    l1_ratio=0.5    # 50% L1, 50% L2
)

Coordinate Descent Algorithm

RustyStats uses coordinate descent for regularized GLMs, following the glmnet approach.

Algorithm Overview

Initialize β = 0 (or from IRLS)
Repeat until converged:
    For j = 1, ..., p:
        Compute partial residual: r_j = z - X_{-j} β_{-j}
        Update β_j by soft thresholding:
            β_j = S(⟨X_j, W r_j⟩, λρ) / (⟨X_j, W X_j⟩ + λ(1-ρ))

Why Coordinate Descent?

  • Efficient: \(O(np)\) per cycle
  • Warm starts: Path from large λ to small λ
  • Sparse updates: Skip zero coefficients

Implementation

crates/rustystats-core/src/solvers/coordinate_descent.rs

Key optimizations: - Covariance updates (compute X'WX once) - Active set strategy (focus on non-zero coefficients) - Parallelization via Rayon

Choosing λ (Alpha)

The penalty strength λ (called alpha in RustyStats API) is crucial.

Tuning Alpha

Try different alpha values to find the right balance:

# Test different regularization strengths
for alpha in [0.001, 0.01, 0.1, 1.0]:
    result = rs.glm("y ~ x1 + x2 + C(cat)", data, family="poisson").fit(
        alpha=alpha, l1_ratio=1.0
    )
    print(f"α={alpha}: {result.n_nonzero()} features, deviance={result.deviance:.2f}")

The 1-SE Rule

Choose the largest λ within one standard error of the minimum CV error:

λ_1se: largest λ such that CV_error(λ) ≤ CV_error(λ_min) + SE(λ_min)

This gives a simpler model with comparable performance.

Standardization

Why Standardize?

The penalty is applied uniformly to all coefficients. Without standardization, variables on different scales are penalized differently.

Internal Standardization

RustyStats internally standardizes features before fitting:

# Features are standardized internally
# Coefficients are returned on original scale
result = rs.glm("y ~ x1 + x2", data, family="gaussian").fit(alpha=0.1, l1_ratio=1.0)

Intercept

The intercept is never penalized:

\[ \min_{\beta_0, \boldsymbol{\beta}} \left[ D(\beta_0, \boldsymbol{\beta}) + \lambda P(\boldsymbol{\beta}) \right] \]

Regularization with Different Families

Regularization works with all GLM families:

# Regularized Poisson
result = rs.glm("y ~ x1 + x2", data, family="poisson").fit(alpha=0.1, l1_ratio=1.0)

# Regularized Binomial (Logistic)
result = rs.glm("y ~ x1 + x2", data, family="binomial").fit(alpha=0.1, l1_ratio=0.5)

# Regularized Gamma
result = rs.glm("y ~ x1 + x2", data, family="gamma").fit(alpha=0.1, l1_ratio=0.0)

The coordinate descent algorithm adapts to each family's variance function and link.

Degrees of Freedom

For regularized models, the effective degrees of freedom is less than \(p\):

  • Ridge: \(\text{df} = \text{tr}(\mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T)\)
  • Lasso: Approximately the number of non-zero coefficients

Regularization and Inference

Standard Errors with Regularization

Standard errors from regularized models are biased and should be interpreted with caution. The coefficients are shrunk toward zero, which affects the sampling distribution.

For valid inference with selected variables: 1. Use cross-validation to select variables 2. Refit an unregularized model with selected variables 3. Use standard errors from the unregularized model

Summary

Method Penalty Variable Selection Best For
Ridge L2: \(\|\boldsymbol{\beta}\|_2^2\) No Multicollinearity
Lasso L1: \(\|\boldsymbol{\beta}\|_1\) Yes Sparse models
Elastic Net Mix Yes (grouped) Correlated predictors

Key parameters: - alpha: Penalty strength (larger = more shrinkage) - l1_ratio: Mix of L1/L2 (1 = Lasso, 0 = Ridge)

Best practices: 1. Use cross-validation to choose alpha 2. Consider the 1-SE rule for parsimony 3. Start with l1_ratio=0.5 (Elastic Net) unless you have a specific reason