Distribution Families¶

This chapter provides a comprehensive treatment of the distribution families available in RustyStats. For each family, we derive the key quantities from first principles and show how they connect to the exponential family framework.

Prerequisites: Familiarity with the exponential family and GLM framework.

Part 1: The Exponential Family Foundation¶

1.1 Recap: Exponential Family Form¶

A distribution belongs to the exponential family if its density can be written as:

\[ f(y; \theta, \phi) = \exp\left\{\frac{y\theta - b(\theta)}{\phi} + c(y, \phi)\right\} \]

where: - $\theta$ = canonical parameter (related to the mean) - $\phi$ = dispersion parameter - $b(\theta)$ = cumulant function - $c(y, \phi)$ = normalization term

1.2 Key Derived Quantities¶

From $b(\theta)$, we derive:

Quantity	Formula	Meaning
Mean	$\mu = b'(\theta)$	Expected value of $Y$
Variance	$\text{Var}(Y) = \phi \cdot b''(\theta)$	Spread of $Y$
Variance function	$V(\mu) = b''(\theta)$	Variance as function of mean

1.3 The Family Trait in RustyStats¶

Every family in RustyStats implements:

pub trait Family: Send + Sync {
    /// Distribution name
    fn name(&self) -> &str;

    /// Variance function V(μ): how variance depends on mean
    fn variance(&self, mu: &Array1<f64>) -> Array1<f64>;

    /// Unit deviance d(y, μ): contribution of each observation
    fn unit_deviance(&self, y: &Array1<f64>, mu: &Array1<f64>) -> Array1<f64>;

    /// Total deviance (sum of weighted unit deviances)
    fn deviance(&self, y: &Array1<f64>, mu: &Array1<f64>, 
                weights: Option<&Array1<f64>>) -> f64;

    /// Canonical link function for this family
    fn default_link(&self) -> Box<dyn Link>;

    /// Initialize μ from y for IRLS
    fn initialize_mu(&self, y: &Array1<f64>) -> Array1<f64>;

    /// Check if μ values are in valid range
    fn is_valid_mu(&self, mu: &Array1<f64>) -> bool;
}

Part 2: The Gaussian Family¶

2.1 The Distribution¶

The Gaussian (Normal) distribution has density:

\[ f(y; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(y-\mu)^2}{2\sigma^2}\right\} \]

Support: $y \in (-\infty, +\infty)$

Parameters: - $\mu$ = mean (location) - $\sigma^2$ = variance (spread)

2.2 Exponential Family Form¶

Rewrite the density:

\[ f(y; \mu, \sigma^2) = \exp\left\{-\frac{(y-\mu)^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2)\right\} \]

Expand the quadratic:

\[ = \exp\left\{-\frac{y^2 - 2y\mu + \mu^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2)\right\} \]

\[ = \exp\left\{\frac{y\mu - \mu^2/2}{\sigma^2} - \frac{y^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2)\right\} \]

Matching to exponential family form:

Component	Value
$\theta$	$\mu$
$\phi$	$\sigma^2$
$b(\theta)$	$\theta^2/2 = \mu^2/2$
$c(y, \phi)$	$-\frac{y^2}{2\phi} - \frac{1}{2}\log(2\pi\phi)$

2.3 Derived Quantities¶

Mean: $\mu = b'(\theta) = \theta$ ✓

Variance function: $V(\mu) = b''(\theta) = 1$

Variance: $\text{Var}(Y) = \phi \cdot V(\mu) = \sigma^2 \cdot 1 = \sigma^2$ ✓

2.4 Unit Deviance¶

The deviance compares the saturated model (one parameter per observation) to the fitted model. For Gaussian:

\[ D = 2[\ell_{\text{saturated}} - \ell_{\text{fitted}}] \]

The log-likelihood is: $$ \ell = -\frac{1}{2\sigma^2}\sum_i (y_i - \mu_i)^2 - \frac{n}{2}\log(2\pi\sigma^2) $$

For the saturated model, $\hat{\mu}_i = y_i$, so $\ell_{\text{sat}} = -\frac{n}{2}\log(2\pi\sigma^2)$.

For the fitted model with predicted means $\hat{\mu}_i$: $$ \ell_{\text{fit}} = -\frac{1}{2\sigma^2}\sum_i (y_i - \hat{\mu}_i)^2 - \frac{n}{2}\log(2\pi\sigma^2) $$

Thus: $$ D = 2[\ell_{\text{sat}} - \ell_{\text{fit}}] = \frac{1}{\sigma^2}\sum_i (y_i - \hat{\mu}_i)^2 $$

The scaled deviance is $D^* = D/\phi = \sum_i (y_i - \hat{\mu}_i)^2$.

The unit deviance is: $$ \boxed{d(y, \mu) = (y - \mu)^2} $$

This is just the squared residual!

2.5 Canonical Link¶

For Gaussian, $\theta = \mu$, so the canonical link is the identity: $$ \eta = g(\mu) = \mu $$

2.6 IRLS Properties¶

With identity link: - Weights: $w_i = 1/[V(\mu_i) \cdot g'(\mu_i)^2] = 1/[1 \cdot 1] = 1$ - Working response: $z_i = \eta_i + (y_i - \mu_i) \cdot 1 = y_i$

IRLS reduces to ordinary least squares and converges in one iteration.

2.7 Implementation¶

pub struct GaussianFamily;

impl Family for GaussianFamily {
    fn name(&self) -> &str { "Gaussian" }

    fn variance(&self, mu: &Array1<f64>) -> Array1<f64> {
        Array1::ones(mu.len())  // V(μ) = 1
    }

    fn unit_deviance(&self, y: &Array1<f64>, mu: &Array1<f64>) -> Array1<f64> {
        (y - mu).mapv(|r| r * r)  // d(y,μ) = (y-μ)²
    }

    fn default_link(&self) -> Box<dyn Link> {
        Box::new(IdentityLink)
    }

    fn initialize_mu(&self, y: &Array1<f64>) -> Array1<f64> {
        y.clone()  // Start with μ = y
    }

    fn is_valid_mu(&self, _mu: &Array1<f64>) -> bool {
        true  // Any real μ is valid
    }
}

2.8 When to Use¶

Continuous responses that can be positive or negative
When variance doesn't depend on the mean
Traditional linear regression problems

Part 3: The Poisson Family¶

3.1 The Distribution¶

The Poisson distribution models counts:

\[ P(Y = y) = \frac{\mu^y e^{-\mu}}{y!}, \quad y = 0, 1, 2, \ldots \]

Support: $y \in \{0, 1, 2, \ldots\}$

Parameter: $\mu > 0$ = mean (and variance)

3.2 Exponential Family Form¶

Take the log: $$ \log P(Y = y) = y \log\mu - \mu - \log(y!) $$

Exponentiating: $$ P(Y = y) = \exp{y \log\mu - \mu - \log(y!)} $$

Matching to exponential family:

Component	Value
$\theta$	$\log\mu$
$\phi$	$1$ (fixed)
$b(\theta)$	$e^\theta = \mu$
$c(y, \phi)$	$-\log(y!)$

3.3 Derived Quantities¶

Mean: $\mu = b'(\theta) = e^\theta$ ✓ (since $\theta = \log\mu$)

Variance function: $V(\mu) = b''(\theta) = e^\theta = \mu$

Variance: $\text{Var}(Y) = \phi \cdot V(\mu) = 1 \cdot \mu = \mu$

Equidispersion

For Poisson, mean = variance. This is called equidispersion and is a key assumption to check.

3.4 Unit Deviance Derivation¶

The log-likelihood for one observation: $$ \ell(y; \mu) = y\log\mu - \mu - \log(y!) $$

Saturated model ($\hat{\mu} = y$): $$ \ell_{\text{sat}} = y\log y - y - \log(y!) \quad \text{(for } y > 0\text{)} $$

Fitted model: $$ \ell_{\text{fit}} = y\log\mu - \mu - \log(y!) $$

Unit deviance: $$ d(y, \mu) = 2[\ell_{\text{sat}} - \ell_{\text{fit}}] = 2[y\log y - y - y\log\mu + \mu] $$

\[ \boxed{d(y, \mu) = 2\left[y\log\frac{y}{\mu} - (y - \mu)\right]} \]

Special case: When $y = 0$: $$ d(0, \mu) = 2[0 - 0 - 0 + \mu] = 2\mu $$

3.5 Canonical Link¶

Since $\theta = \log\mu$, the canonical link is the log: $$ \eta = g(\mu) = \log\mu $$

3.6 IRLS Properties¶

With log link: - $g'(\mu) = 1/\mu$ - Weights: $w_i = 1/[V(\mu_i) \cdot g'(\mu_i)^2] = 1/[\mu_i \cdot (1/\mu_i)^2] = \mu_i$ - Working response: $z_i = \log\mu_i + (y_i - \mu_i)/\mu_i$

Observations with larger predicted means get higher weight.

3.7 Worked Example: Claim Frequency¶

Data: 5 policyholders with claim counts

Policyholder	Exposure	Claims	Rate
1	1.0	0	0.0
2	0.5	1	2.0
3	1.0	2	2.0
4	0.75	1	1.33
5	1.0	3	3.0

Model: $\log(\mu_i) = \beta_0 + \log(\text{exposure}_i)$

The offset $\log(\text{exposure})$ makes this a rate model: $$ \log(\mu_i/\text{exposure}_i) = \beta_0 $$

MLE for constant rate model: $$ \hat{\beta}_0 = \log\left(\frac{\sum y_i}{\sum \text{exposure}_i}\right) = \log\left(\frac{7}{4.25}\right) = 0.499 $$

Predicted rate: $e^{0.499} = 1.65$ claims per exposure unit.

Deviance: $$ D = 2\sum_i \left[y_i\log\frac{y_i}{\hat{\mu}_i} - (y_i - \hat{\mu}_i)\right] $$

where $\hat{\mu}_i = 1.65 \times \text{exposure}_i$.

3.8 Checking Overdispersion¶

If Var$(Y) > \mu$, the data is overdispersed. Test using:

\[ \text{Dispersion ratio} = \frac{\text{Pearson } \chi^2}{n - p} = \frac{\sum (y_i - \hat{\mu}_i)^2/\hat{\mu}_i}{n - p} \]

Ratio $\approx 1$: Poisson is appropriate
Ratio $> 1$: Consider QuasiPoisson or Negative Binomial

3.9 Implementation¶

pub struct PoissonFamily;

impl Family for PoissonFamily {
    fn name(&self) -> &str { "Poisson" }

    fn variance(&self, mu: &Array1<f64>) -> Array1<f64> {
        mu.clone()  // V(μ) = μ
    }

    fn unit_deviance(&self, y: &Array1<f64>, mu: &Array1<f64>) -> Array1<f64> {
        Zip::from(y).and(mu).map_collect(|&yi, &mui| {
            if yi > 0.0 {
                2.0 * (yi * (yi / mui).ln() - (yi - mui))
            } else {
                2.0 * mui  // Special case for y = 0
            }
        })
    }

    fn default_link(&self) -> Box<dyn Link> {
        Box::new(LogLink)
    }

    fn initialize_mu(&self, y: &Array1<f64>) -> Array1<f64> {
        y.mapv(|yi| (yi + 0.1).max(0.1))  // Avoid log(0)
    }

    fn is_valid_mu(&self, mu: &Array1<f64>) -> bool {
        mu.iter().all(|&m| m > 0.0)
    }
}

Part 4: The Binomial Family¶

4.1 The Distribution¶

For $n$ trials with success probability $p$:

\[ P(Y = y) = \binom{n}{y} p^y (1-p)^{n-y}, \quad y = 0, 1, \ldots, n \]

For binary data ($n = 1$): $$ P(Y = 1) = \mu, \quad P(Y = 0) = 1 - \mu $$

Support: $y \in \{0, 1\}$ (binary) or $y \in [0, 1]$ (proportion)

Parameter: $\mu = p \in (0, 1)$

4.2 Exponential Family Form¶

For binary $Y \in \{0, 1\}$: $$ P(Y = y) = \mu^y (1-\mu)^{1-y} $$

Take the log: $$ \log P(Y = y) = y\log\mu + (1-y)\log(1-\mu) = y\log\frac{\mu}{1-\mu} + \log(1-\mu) $$

Exponentiating: $$ P(Y = y) = \exp\left{y\log\frac{\mu}{1-\mu} + \log(1-\mu)\right} $$

Matching to exponential family:

Component	Value
$\theta$	$\log\frac{\mu}{1-\mu}$ (log-odds)
$\phi$	$1$ (fixed)
$b(\theta)$	$\log(1 + e^\theta)$
$c(y, \phi)$	$0$

4.3 Derived Quantities¶

Mean: We have $\mu = \frac{e^\theta}{1 + e^\theta}$. Check: $$ b'(\theta) = \frac{e^\theta}{1 + e^\theta} = \mu \quad ✓ $$

Variance function: $$ b''(\theta) = \frac{e^\theta}{(1 + e^\theta)^2} = \mu(1-\mu) $$

So $V(\mu) = \mu(1-\mu)$.

Variance: $\text{Var}(Y) = \phi \cdot V(\mu) = \mu(1-\mu)$

The variance is maximized at $\mu = 0.5$ and goes to zero at $\mu \to 0$ or $\mu \to 1$.

4.4 Unit Deviance Derivation¶

Log-likelihood for binary $y$: $$ \ell(y; \mu) = y\log\mu + (1-y)\log(1-\mu) $$

Saturated model: $\hat{\mu} = y$, but this gives $\log 0$ when $y \in \{0, 1\}$. By convention, $\ell_{\text{sat}} = 0$.

Fitted model: $$ \ell_{\text{fit}} = y\log\mu + (1-y)\log(1-\mu) $$

Unit deviance: $$ d(y, \mu) = -2\ell_{\text{fit}} = -2[y\log\mu + (1-y)\log(1-\mu)] $$

Rewriting: $$ \boxed{d(y, \mu) = 2\left[y\log\frac{y}{\mu} + (1-y)\log\frac{1-y}{1-\mu}\right]} $$

Special cases: - $y = 1$: $d(1, \mu) = -2\log\mu$ - $y = 0$: $d(0, \mu) = -2\log(1-\mu)$

4.5 Canonical Link: The Logit¶

The canonical link is: $$ \eta = g(\mu) = \log\frac{\mu}{1-\mu} \quad \text{(log-odds or logit)} $$

The inverse (logistic function): $$ \mu = g^{-1}(\eta) = \frac{e^\eta}{1 + e^\eta} = \frac{1}{1 + e^{-\eta}} $$

The derivative: $$ g'(\mu) = \frac{d}{d\mu}\log\frac{\mu}{1-\mu} = \frac{1}{\mu} + \frac{1}{1-\mu} = \frac{1}{\mu(1-\mu)} $$

4.6 IRLS Properties¶

With logit link: - Weights: $w_i = 1/[V(\mu_i) \cdot g'(\mu_i)^2] = 1/[\mu_i(1-\mu_i) \cdot 1/(\mu_i(1-\mu_i))^2] = \mu_i(1-\mu_i)$ - Working response: $z_i = \log\frac{\mu_i}{1-\mu_i} + \frac{y_i - \mu_i}{\mu_i(1-\mu_i)}$

Weights are largest when $\mu \approx 0.5$ (most uncertainty) and smallest near 0 or 1.

4.7 Interpretation: Odds Ratios¶

The logit model: $$ \log\frac{\mu}{1-\mu} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p $$

For a one-unit increase in $x_j$: $$ \log\frac{\mu_{\text{new}}}{1-\mu_{\text{new}}} - \log\frac{\mu_{\text{old}}}{1-\mu_{\text{old}}} = \beta_j $$

Thus: $$ \frac{\text{Odds}{\text{new}}}{\text{Odds} $$}}} = e^{\beta_j

$e^{\beta_j}$ is the odds ratio for a one-unit increase in $x_j$.

4.8 Worked Example: Claims Occurrence¶

Data: Whether policyholder had a claim

Age	Has Claim
25	1
35	0
45	1
55	1
65	0

Model: $\text{logit}(\mu_i) = \beta_0 + \beta_1 \cdot \text{age}_i$

Let $x = (\text{age} - 45)/10$ for numerical stability.

Iteration 1: - Initialize: $\mu^{(0)} = (0.55, 0.45, 0.55, 0.55, 0.45)$ (slightly toward observed) - Compute weights: $w_i = \mu_i(1-\mu_i)$ - Compute working response - Solve WLS...

After convergence (typically 5-8 iterations): - $\hat{\beta}_0 \approx 0.4$ (log-odds at age 45) - $\hat{\beta}_1 \approx 0.05$ (effect of 10-year age increase) - Odds ratio for 10-year increase: $e^{0.05} \approx 1.05$

4.9 Complete Separation¶

A special problem in logistic regression: if a predictor perfectly separates 0s from 1s, the MLE doesn't exist.

Example: All young people have claims, all old people don't.

The algorithm tries $\beta_1 \to \infty$, which never converges.

Solutions: 1. Remove the separating variable 2. Add regularization (Ridge/Lasso) 3. Use Firth's penalized likelihood

4.10 Implementation¶

pub struct BinomialFamily;

impl Family for BinomialFamily {
    fn name(&self) -> &str { "Binomial" }

    fn variance(&self, mu: &Array1<f64>) -> Array1<f64> {
        mu.mapv(|m| m * (1.0 - m))  // V(μ) = μ(1-μ)
    }

    fn unit_deviance(&self, y: &Array1<f64>, mu: &Array1<f64>) -> Array1<f64> {
        Zip::from(y).and(mu).map_collect(|&yi, &mui| {
            let term1 = if yi > 0.0 { yi * (yi / mui).ln() } else { 0.0 };
            let term2 = if yi < 1.0 { (1.0 - yi) * ((1.0 - yi) / (1.0 - mui)).ln() } else { 0.0 };
            2.0 * (term1 + term2)
        })
    }

    fn default_link(&self) -> Box<dyn Link> {
        Box::new(LogitLink)
    }

    fn initialize_mu(&self, y: &Array1<f64>) -> Array1<f64> {
        y.mapv(|yi| (yi + 0.5) / 2.0)  // Shrink toward 0.5
    }

    fn is_valid_mu(&self, mu: &Array1<f64>) -> bool {
        mu.iter().all(|&m| m > 0.0 && m < 1.0)
    }
}

Part 5: The Gamma Family¶

5.1 The Distribution¶

The Gamma distribution for positive continuous data:

\[ f(y; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} y^{\alpha-1} e^{-\beta y}, \quad y > 0 \]

Reparameterized in terms of mean $\mu$ and shape $\nu$: - $\mu = \alpha/\beta$ (mean) - $\nu = \alpha$ (shape)

\[ E(Y) = \mu, \quad \text{Var}(Y) = \mu^2/\nu \]

5.2 Exponential Family Form¶

With $\alpha = \nu$ and $\beta = \nu/\mu$:

\[ f(y; \mu, \nu) = \frac{(\nu/\mu)^\nu}{\Gamma(\nu)} y^{\nu-1} \exp\{-\nu y/\mu\} \]

Rewriting: $$ \log f = \nu[\log(\nu/\mu) + (\nu-1)/\nu \cdot \log y - y/\mu] - \log\Gamma(\nu) $$

This is messy, but the key matching gives:

Component	Value
$\theta$	$-1/\mu$
$\phi$	$1/\nu$
$b(\theta)$	$-\log(-\theta)$

5.3 Derived Quantities¶

Mean: $\mu = b'(\theta) = -1/\theta = \mu$ ✓

Variance function: $$ b''(\theta) = 1/\theta^2 = \mu^2 $$

So $V(\mu) = \mu^2$.

Variance: $\text{Var}(Y) = \phi \cdot V(\mu) = \mu^2/\nu$

5.4 Key Property: Constant CV¶

The coefficient of variation (CV) is: $$ \text{CV} = \frac{\text{SD}}{\text{Mean}} = \frac{\sqrt{\mu^2/\nu}}{\mu} = \frac{1}{\sqrt{\nu}} $$

This is constant regardless of the mean! This makes Gamma ideal for: - Claim severity (larger claims are more variable, proportionally) - Financial data with percentage-based volatility

5.5 Unit Deviance¶

\[ \boxed{d(y, \mu) = 2\left[-\log\frac{y}{\mu} + \frac{y - \mu}{\mu}\right]} \]

Or equivalently: $$ d(y, \mu) = 2\left[\frac{y - \mu}{\mu} - \log\frac{y}{\mu}\right] $$

5.6 Links¶

Canonical link (inverse): $\eta = -1/\mu$

This is rarely used because: - $\mu > 0$ but $\eta$ can be any real number - The relationship is awkward to interpret

Common link (log): $\eta = \log\mu$ - $\mu = e^\eta$ is always positive - Coefficients have multiplicative interpretation - Recommended in most applications

5.7 IRLS with Log Link¶

With log link: - $g'(\mu) = 1/\mu$ - Weights: $w_i = 1/[\mu_i^2 \cdot (1/\mu_i)^2] = 1$

Constant weights! This is a happy coincidence when $V(\mu) = \mu^2$ and $g'(\mu) = 1/\mu$.

5.8 Worked Example: Claim Severity¶

Data: Claim amounts for 4 claims

Claim	Amount	log(Amount)
1	500	6.21
2	1200	7.09
3	800	6.68
4	2500	7.82

Model: $\log(\mu_i) = \beta_0$ (constant model)

MLE: $\hat{\beta}_0 = \frac{1}{4}\sum \log(y_i) = 6.95$

(Note: For Gamma, the MLE of log-mean is the mean of logs, not log of means!)

Predicted mean: $\hat{\mu} = e^{6.95} = 1047$

Dispersion: Estimated from residuals, $\hat{\phi} \approx 0.15$, so $\hat{\nu} \approx 6.7$.

5.9 Implementation¶

pub struct GammaFamily;

impl Family for GammaFamily {
    fn name(&self) -> &str { "Gamma" }

    fn variance(&self, mu: &Array1<f64>) -> Array1<f64> {
        mu.mapv(|m| m * m)  // V(μ) = μ²
    }

    fn unit_deviance(&self, y: &Array1<f64>, mu: &Array1<f64>) -> Array1<f64> {
        Zip::from(y).and(mu).map_collect(|&yi, &mui| {
            2.0 * ((yi - mui) / mui - (yi / mui).ln())
        })
    }

    fn default_link(&self) -> Box<dyn Link> {
        Box::new(LogLink)  // Log is more practical than canonical inverse
    }

    fn initialize_mu(&self, y: &Array1<f64>) -> Array1<f64> {
        y.mapv(|yi| yi.max(1e-10))  // Ensure positive
    }

    fn is_valid_mu(&self, mu: &Array1<f64>) -> bool {
        mu.iter().all(|&m| m > 0.0)
    }
}

Part 6: The Tweedie Family¶

6.1 The Power Variance Family¶

Tweedie distributions have variance function: $$ V(\mu) = \mu^p $$

where $p$ is the variance power.

Power $p$	Distribution	Support
$p = 0$	Gaussian	$(-\infty, \infty)$
$p = 1$	Poisson	$\{0, 1, 2, \ldots\}$
$1 < p < 2$	Compound Poisson-Gamma	$\{0\} \cup (0, \infty)$
$p = 2$	Gamma	$(0, \infty)$
$p = 3$	Inverse Gaussian	$(0, \infty)$

6.2 The Sweet Spot: $1 < p < 2$¶

For $1 < p < 2$, Tweedie has a remarkable property: it supports exact zeros AND positive continuous values.

This is modeled as a compound Poisson process: $$ Y = \sum_{j=1}^{N} Z_j $$

where: - $N \sim \text{Poisson}(\lambda)$ — number of events - $Z_j \sim \text{Gamma}(\alpha, \beta)$ — individual event sizes (i.i.d.)

When $N = 0$: $Y = 0$ (exact zero) When $N > 0$: $Y > 0$ (sum of positive Gammas)

6.3 Connection to Frequency-Severity¶

In insurance, this naturally models pure premium: - $N$ = number of claims (Poisson) - $Z_j$ = individual claim amount (Gamma) - $Y$ = total claims = pure premium

The parameters connect: - $E(N) = \lambda$ - $E(Z) = \alpha/\beta$ - $E(Y) = \lambda \cdot \alpha/\beta = \mu$

6.4 Unit Deviance¶

For $1 < p < 2$:

\[ \boxed{d(y, \mu) = 2\left[\frac{y^{2-p}}{(1-p)(2-p)} - \frac{y\mu^{1-p}}{1-p} + \frac{\mu^{2-p}}{2-p}\right]} \]

When $y = 0$: $$ d(0, \mu) = \frac{2\mu^{2-p}}{2-p} $$

6.5 Choosing the Variance Power $p$¶

$p \to 1$: More Poisson-like (frequency dominates)
$p = 1.5$: Balanced (common default)
$p \to 2$: More Gamma-like (severity dominates)

Estimating $p$: Profile the likelihood over a grid of $p$ values.

6.6 Implementation¶

pub struct TweedieFamily {
    pub var_power: f64,  // p in V(μ) = μ^p
}

impl Family for TweedieFamily {
    fn variance(&self, mu: &Array1<f64>) -> Array1<f64> {
        mu.mapv(|m| m.powf(self.var_power))
    }

    fn unit_deviance(&self, y: &Array1<f64>, mu: &Array1<f64>) -> Array1<f64> {
        let p = self.var_power;
        Zip::from(y).and(mu).map_collect(|&yi, &mui| {
            if yi == 0.0 {
                2.0 * mui.powf(2.0 - p) / (2.0 - p)
            } else {
                let t1 = yi.powf(2.0 - p) / ((1.0 - p) * (2.0 - p));
                let t2 = yi * mui.powf(1.0 - p) / (1.0 - p);
                let t3 = mui.powf(2.0 - p) / (2.0 - p);
                2.0 * (t1 - t2 + t3)
            }
        })
    }

    fn default_link(&self) -> Box<dyn Link> {
        Box::new(LogLink)
    }
}

Part 7: Quasi-Families¶

7.1 The Quasi-Likelihood Approach¶

Sometimes we don't have a full probability distribution—just: - A mean model: $E(Y) = \mu$ - A variance model: $\text{Var}(Y) = \phi \cdot V(\mu)$

Quasi-likelihood uses just these two relationships without specifying the full distribution.

7.2 QuasiPoisson¶

Same variance function as Poisson, but $\phi$ is estimated:

\[ V(\mu) = \mu, \quad \phi \text{ estimated from data} \]

Effect: - Point estimates: Identical to Poisson - Standard errors: Multiplied by $\sqrt{\hat{\phi}}$ - Confidence intervals: Wider if $\hat{\phi} > 1$

7.3 QuasiBinomial¶

Same idea for binomial:

\[ V(\mu) = \mu(1-\mu), \quad \phi \text{ estimated} \]

7.4 Dispersion Estimation¶

\[ \hat{\phi} = \frac{1}{n-p} \sum_{i=1}^n \frac{(y_i - \hat{\mu}_i)^2}{V(\hat{\mu}_i)} \]

This is the Pearson chi-square divided by residual degrees of freedom.

7.5 When to Use Quasi-Families¶

Quick fix for overdispersion
When you don't trust the full distributional assumption
When AIC/BIC aren't the primary concern

Caution: Quasi-likelihood inference is valid for large samples. AIC/BIC interpretation is murky.

Part 8: The Negative Binomial Family¶

8.1 The Distribution¶

The Negative Binomial with mean $\mu$ and shape $\theta$:

\[ P(Y = y) = \frac{\Gamma(y + \theta)}{\Gamma(\theta) \cdot y!} \left(\frac{\theta}{\theta + \mu}\right)^\theta \left(\frac{\mu}{\theta + \mu}\right)^y \]

Variance: $$ \text{Var}(Y) = \mu + \frac{\mu^2}{\theta} $$

8.2 Variance Function¶

\[ V(\mu) = \mu + \frac{\mu^2}{\theta} \]

This is quadratic in $\mu$, unlike QuasiPoisson's linear variance.

8.3 The $\theta$ Parameter¶

Large $\theta$ ($\to \infty$): Approaches Poisson ($V \to \mu$)
Small $\theta$ (near 0): Strong overdispersion ($V \to \mu^2/\theta$)

8.4 Comparison: QuasiPoisson vs. Negative Binomial¶

Aspect	QuasiPoisson	Negative Binomial
Variance	$\phi \cdot \mu$ (linear)	$\mu + \mu^2/\theta$ (quadratic)
True distribution	No	Yes
AIC/BIC valid	No	Yes
Estimation	$\phi$ from residuals	$\theta$ from likelihood

Rule of thumb: Use NegBin when overdispersion increases with the mean.

8.5 Estimating $\theta$¶

RustyStats estimates $\theta$ using an iterative moment-based approach:

Initial fit: Fit a Poisson GLM to get starting $\hat{\mu}$
Moment estimation: Estimate $\theta$ from residuals: $$\hat{\theta} = \frac{\bar{\mu}^2}{\text{Var}(Y - \mu) - \bar{\mu}}$$
Iterate: Fit NegBin GLM with current $\theta$, re-estimate $\theta$, repeat until convergence

Bounds: $\theta$ is constrained to $[0.01, 1000]$ to prevent numerical issues.

8.6 Numerical Stability¶

Negative Binomial fitting is numerically challenging due to:

Larger variance function: $V(\mu) = \mu + \mu^2/\theta$ produces smaller IRLS weights than Poisson
Ill-conditioned design matrices: Common with splines + target encoding
Large $\theta$ instability: When $\theta \to \infty$, the log-likelihood involves $\log\Gamma(y + \theta) - \log\Gamma(\theta)$ which loses precision

RustyStats Solution:

RustyStats applies minimum ridge regularization (α ≥ 1e-6) automatically for NegBin models:

# The formula API automatically applies minimum regularization
result = rs.glm("claims ~ ns(age, df=5) + TE(region)", 
                data, family="negbinomial").fit()

# You'll see "Method: IRLS + Ridge" in the summary

Coefficient Interpretation

The ridge penalty introduces negligible bias (~1e-6 × coefficient magnitude) but makes inference approximate. For actuarial applications, this bias is typically irrelevant.

8.7 Log-Likelihood for Large $\theta$¶

When $\theta > 100$, the NegBin distribution is essentially Poisson. RustyStats uses the Poisson log-likelihood in this case to avoid numerical overflow:

\[ \ell(\mu; y) = \sum_i \left[ y_i \log\mu_i - \mu_i - \log\Gamma(y_i + 1) \right] \]

8.8 Diagnostics Warnings¶

The diagnostics system generates warnings for NegBin models:

Warning	Condition	Recommendation
`negbinomial_regularization`	Always	Informational: ridge penalty applied
`negbinomial_large_theta`	$\theta \geq 100$	Consider Poisson instead
`negbinomial_small_theta`	$\theta \leq 0.1$	Check for missing covariates

# Access warnings via diagnostics
diag = result.diagnostics(data, ...)
for w in diag.warnings:
    print(f"[{w['type']}]: {w['message']}")

Part 9: Summary and Selection Guide¶

9.1 Decision Tree¶

Response type?
│
├── Continuous
│   ├── Can be negative → Gaussian
│   └── Always positive
│       ├── Has exact zeros → Tweedie (p ∈ (1,2))
│       └── No zeros → Gamma
│
└── Discrete
    ├── Binary (0/1) → Binomial
    │   └── Overdispersed? → QuasiBinomial
    │
    └── Counts (0, 1, 2, ...)
        ├── Var ≈ Mean → Poisson
        └── Var > Mean (overdispersion)
            ├── Quick fix → QuasiPoisson
            └── Better model → Negative Binomial

9.2 Key Formulas Summary¶

Family	$V(\mu)$	$\phi$	Canonical Link	Support
Gaussian	$1$	estimated	Identity	$\mathbb{R}$
Poisson	$\mu$	$1$	Log	$\mathbb{N}_0$
Binomial	$\mu(1-\mu)$	$1$	Logit	$\{0,1\}$
Gamma	$\mu^2$	estimated	Inverse (Log used)	$\mathbb{R}^+$
Tweedie	$\mu^p$	estimated	Log	$\{0\} \cup \mathbb{R}^+$
NegBin	$\mu + \mu^2/\theta$	$1$	Log	$\mathbb{N}_0$

Exercises¶

Exercise 1: Exponential Family

Show that the Exponential distribution ($f(y) = \lambda e^{-\lambda y}$) is a special case of the Gamma family with $\nu = 1$.

Exercise 2: Deviance Calculation

For data $y = (3, 0, 2, 5)$ with fitted means $\mu = (2.5, 0.5, 2.5, 4.5)$:

a) Calculate the unit deviances for Poisson

b) Calculate the total deviance

c) What does a high deviance indicate?

Exercise 3: Overdispersion

You fit a Poisson model and get Pearson $\chi^2 = 180$ with 90 degrees of freedom.

a) Calculate the dispersion ratio

b) Is this evidence of overdispersion?

c) How would QuasiPoisson adjust the standard errors?

Exercise 4: Tweedie

A Tweedie model with $p = 1.6$ and $\mu = 100$ has $\phi = 2$.

a) Calculate $V(\mu)$

b) Calculate $\text{Var}(Y)$

c) What's the CV?

Component	Value
\(\theta\)	\(\mu\)
\(\phi\)	\(\sigma^2\)
\(b(\theta)\)	\(\theta^2/2 = \mu^2/2\)
\(c(y, \phi)\)	\(-\frac{y^2}{2\phi} - \frac{1}{2}\log(2\pi\phi)\)

Component	Value
\(\theta\)	\(\log\mu\)
\(\phi\)	\(1\) (fixed)
\(b(\theta)\)	\(e^\theta = \mu\)
\(c(y, \phi)\)	\(-\log(y!)\)

Component	Value
\(\theta\)	\(\log\frac{\mu}{1-\mu}\) (log-odds)
\(\phi\)	\(1\) (fixed)
\(b(\theta)\)	\(\log(1 + e^\theta)\)
\(c(y, \phi)\)	\(0\)

Component	Value
\(\theta\)	\(-1/\mu\)
\(\phi\)	\(1/\nu\)
\(b(\theta)\)	\(-\log(-\theta)\)

Power \(p\)	Distribution	Support
\(p = 0\)	Gaussian	\((-\infty, \infty)\)
\(p = 1\)	Poisson	\(\{0, 1, 2, \ldots\}\)
\(1 < p < 2\)	Compound Poisson-Gamma	\(\{0\} \cup (0, \infty)\)
\(p = 2\)	Gamma	\((0, \infty)\)
\(p = 3\)	Inverse Gaussian	\((0, \infty)\)

Quantity	Formula	Meaning
Mean	\(\mu = b'(\theta)\)	Expected value of \(Y\)
Variance	\(\text{Var}(Y) = \phi \cdot b''(\theta)\)	Spread of \(Y\)
Variance function	\(V(\mu) = b''(\theta)\)	Variance as function of mean

Aspect	QuasiPoisson	Negative Binomial
Variance	\(\phi \cdot \mu\) (linear)	\(\mu + \mu^2/\theta\) (quadratic)
True distribution	No	Yes
AIC/BIC valid	No	Yes
Estimation	\(\phi\) from residuals	\(\theta\) from likelihood

Family	\(V(\mu)\)	\(\phi\)	Canonical Link	Support
Gaussian	\(1\)	estimated	Identity	\(\mathbb{R}\)
Poisson	\(\mu\)	\(1\)	Log	\(\mathbb{N}_0\)
Binomial	\(\mu(1-\mu)\)	\(1\)	Logit	\(\{0,1\}\)
Gamma	\(\mu^2\)	estimated	Inverse (Log used)	\(\mathbb{R}^+\)
Tweedie	\(\mu^p\)	estimated	Log	\(\{0\} \cup \mathbb{R}^+\)
NegBin	\(\mu + \mu^2/\theta\)	\(1\)	Log	\(\mathbb{N}_0\)