Interaction Detection

This document explains how RustyStats detects potential interactions between factors, helping you decide which interaction terms to include in your GLM.

High-Level Explanation (Non-Technical)

What is an Interaction?

An interaction occurs when the effect of one factor on the outcome depends on the value of another factor.

Example without interaction:
- Older drivers have fewer claims
- Sports cars have more claims
- These effects are independent: the age effect is the same whether driving a sedan or sports car

Example with interaction:
- Young drivers in sports cars have disproportionately more claims than you'd expect from adding the "young driver" effect and "sports car" effect together
- The combination is worse than the sum of its parts

Why Detect Interactions?

Including important interactions in your model:

Improves accuracy — Captures real patterns the main effects miss
Better pricing — Avoids under/over-charging specific segments
Regulatory compliance — Demonstrates you've properly modeled risk

Missing important interactions can lead to:

Systematic under-pricing for high-risk segments
Cross-subsidization between customer groups
Poor model performance on specific populations

How RustyStats Detects Interactions

RustyStats uses two approaches depending on whether you have a fitted model:

Pre-Fit Detection (Data Exploration)

Before fitting a model, we look at how the response rate varies across combinations of factors:

Divide each factor into groups (bins for continuous, levels for categorical)
Create cells from all combinations (e.g., Age bin × Region)
Calculate average response in each cell
Compare to what we'd expect if effects were independent
Flag pairs where the combined effect differs significantly

This answers: "Which factor pairs show non-additive effects on the response?"

Post-Fit Detection (Model Diagnostics)

After fitting a model, we look at residual patterns:

Compute residuals (actual - predicted)
Group by factor combinations
Check if residuals are systematically high/low for certain combinations
Flag pairs where the model misses a pattern

This answers: "Which factor pairs does my current model fail to capture?"

Interpreting the Results

The output includes:

Field	Meaning
`factor1`, `factor2`	The two factors with potential interaction
`interaction_strength`	Share of variance explained by the factor-pair cells (response variance pre-fit, residual variance post-fit), on a 0-1 scale
`pvalue`	Statistical significance (lower = more confident)
`n_cells`	Number of valid combinations tested

Rules of thumb:

strength > 0.01: Worth investigating
strength > 0.05: Likely important
pvalue < 0.01: Statistically significant

What to Do With the Results

Review top candidates — Look at the factor pairs with highest strength
Visualize the pattern — Plot response rates by combination
Business sense check — Does the interaction make intuitive sense?
Test in model — Add the interaction term and compare fit metrics

Technical Explanation

Mathematical Framework

Interaction Effect Definition

For two factors X₁ and X₂, an interaction exists when:

E[Y | X₁, X₂] ≠ g⁻¹(β₀ + β₁X₁ + β₂X₂)

where g is the link function. The model requires an additional term:

E[Y | X₁, X₂] = g⁻¹(β₀ + β₁X₁ + β₂X₂ + β₃X₁X₂)
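To make the extra term concrete, here is a small worked sketch under a log link, where each coefficient acts multiplicatively on the predicted mean. The coefficient values are invented for illustration and are not estimates from any dataset; X₁ and X₂ are treated as 0/1 indicators (young driver, sports car), and `predicted_mean` is a hypothetical helper.

```python
import math

# Illustrative coefficients only (not fitted values); log link, g(mu) = log(mu)
beta0 = math.log(0.05)   # baseline claim frequency of 5%
beta1 = math.log(1.5)    # young driver multiplies the mean by 1.5
beta2 = math.log(1.4)    # sports car multiplies the mean by 1.4
beta3 = math.log(1.3)    # extra uplift when both indicators are 1

def predicted_mean(x1, x2, with_interaction):
    """Apply the inverse log link to the linear predictor."""
    eta = beta0 + beta1 * x1 + beta2 * x2
    if with_interaction:
        eta += beta3 * x1 * x2
    return math.exp(eta)

main_only = predicted_mean(1, 1, with_interaction=False)   # 0.05 * 1.5 * 1.4
with_term = predicted_mean(1, 1, with_interaction=True)    # 0.05 * 1.5 * 1.4 * 1.3
print(f"main effects only: {main_only:.4f}")
print(f"with interaction:  {with_term:.4f}")
```

The interaction term changes the prediction only in the cell where both indicators are 1; every other cell is unaffected.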

Pre-Fit Detection Algorithm

Step 1: Factor Ranking by Marginal Effect

First, we rank factors by their univariate association with the response using eta-squared (η²):

import numpy as np

def compute_eta_squared(y_rate, exposure, factor_bins):
    """Compute the exposure-weighted variance explained by a factor grouping."""
    overall_mean = np.average(y_rate, weights=exposure)

    # Total exposure-weighted sum of squares around the overall mean
    ss_total = np.sum(exposure * (y_rate - overall_mean) ** 2)

    # Between-level sum of squares: each level's mean vs. the overall mean
    ss_between = 0.0
    for level in np.unique(factor_bins):
        mask = factor_bins == level
        level_mean = np.average(y_rate[mask], weights=exposure[mask])
        ss_between += exposure[mask].sum() * (level_mean - overall_mean) ** 2

    return ss_between / ss_total

Interpretation:
- η² = 0: Factor has no marginal effect
- η² = 0.01: 1% of response variance explained
- η² = 0.10: Strong effect

Step 2: Pairwise Interaction Testing

For top-ranked factors, we test all pairs:

import numpy as np

def compute_interaction_strength(y_rate, exposure, bins1, bins2, min_cell_count):
    """Compute R² from the interaction cell grouping."""
    # Encode each (bins1, bins2) pair as one cell ID (assumes bins2 < 1000)
    cell_ids = bins1 * 1000 + bins2

    # Keep only cells with enough observations
    unique_cells, counts = np.unique(cell_ids, return_counts=True)
    valid_cells = unique_cells[counts >= min_cell_count]

    if len(valid_cells) < 4:
        return None  # Insufficient data

    # Exposure-weighted total sum of squares over all observations
    overall_mean = np.average(y_rate, weights=exposure)
    ss_total = np.sum(exposure * (y_rate - overall_mean) ** 2)

    # Sum of squares explained by the valid cells
    ss_model = 0.0
    for cell in valid_cells:
        mask = cell_ids == cell
        cell_mean = np.average(y_rate[mask], weights=exposure[mask])
        ss_model += exposure[mask].sum() * (cell_mean - overall_mean) ** 2

    return ss_model / ss_total
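The `bins1 * 1000 + bins2` encoding packs a pair of bin indices into one integer. A quick check, assuming non-negative integer bin indices with `bins2` always below 1000, shows that the packing is reversible and therefore collision-free. `encode_cell` and `decode_cell` are hypothetical helper names used only for this illustration.

```python
def encode_cell(bin1, bin2, base=1000):
    """Pack two non-negative bin indices into one integer cell ID."""
    if not 0 <= bin2 < base:
        raise ValueError("bin2 must lie in [0, base) to avoid collisions")
    return bin1 * base + bin2

def decode_cell(cell_id, base=1000):
    """Recover the (bin1, bin2) pair from a cell ID."""
    return divmod(cell_id, base)

# All 5 x 5 combinations of quantile bins
pairs = [(i, j) for i in range(5) for j in range(5)]
cell_ids = [encode_cell(i, j) for i, j in pairs]

print(len(set(cell_ids)))   # 25 distinct IDs for 25 distinct pairs
print(decode_cell(3004))    # (3, 4)
```

A categorical factor with 1000 or more levels in the second position would break this assumption, so a larger base (or a tuple key) would be needed in that case.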

Step 3: Statistical Significance

We use an F-test to assess significance:

F = (SS_model / df_model) / (SS_residual / df_residual)

where:
  SS_residual = SS_total - SS_model
  df_model = n_cells - 1
  df_residual = n_observations - n_cells

The p-value comes from the F-distribution with (df_model, df_residual) degrees of freedom.
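As a worked sketch of the arithmetic, the snippet below computes the F statistic from sums of squares, taking the residual sum of squares as the total minus the model sum of squares. The input numbers are made up for illustration, and `f_statistic` is a hypothetical helper, not a RustyStats function. If SciPy is available, the p-value is the upper tail, `scipy.stats.f.sf(f_value, df_model, df_residual)`.

```python
def f_statistic(ss_model, ss_total, n_cells, n_observations):
    """Return the F statistic for a cell grouping and its degrees of freedom."""
    df_model = n_cells - 1
    df_residual = n_observations - n_cells
    if df_model <= 0 or df_residual <= 0:
        raise ValueError("need at least 2 cells and more observations than cells")
    ss_residual = ss_total - ss_model
    f_value = (ss_model / df_model) / (ss_residual / df_residual)
    return f_value, df_model, df_residual

# Illustrative inputs: 25 cells, 10,000 observations, R^2 = 2 / 100 = 0.02
f_value, df_model, df_residual = f_statistic(
    ss_model=2.0, ss_total=100.0, n_cells=25, n_observations=10_000
)
print(f"F = {f_value:.3f} on ({df_model}, {df_residual}) degrees of freedom")
```

With many observations, even a small R² can produce a large F statistic, which is why the strength and the p-value are worth reading together.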

Post-Fit Detection Algorithm

After fitting a model, we detect interactions the model is missing:

Step 1: Compute Residuals

pearson_residuals = (y - mu) / sqrt(variance(mu))
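For a Poisson family the variance function is V(μ) = μ, so the formula reduces to (y − μ) / √μ. A minimal sketch with illustrative numbers follows; `poisson_pearson_residuals` is a hypothetical helper name.

```python
import math

def poisson_pearson_residuals(y, mu):
    """Pearson residuals for a Poisson model, where Var(Y) = mu."""
    return [(yi - mi) / math.sqrt(mi) for yi, mi in zip(y, mu)]

y = [0, 1, 3]            # observed claim counts (illustrative)
mu = [0.25, 1.0, 1.0]    # fitted means (illustrative)
residuals = poisson_pearson_residuals(y, mu)
print(residuals)         # [-0.5, 0.0, 2.0]
```

Dividing by the standard deviation puts policies with different expected counts on a comparable scale before they are grouped.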

Step 2: Residual Association

For each factor, measure its association with the residuals:

Continuous factors:

correlation = weighted_corr(factor_values, pearson_residuals, weights=exposure)

Categorical factors:

eta_squared = variance_between_levels / total_variance
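The `weighted_corr` helper is not defined in this document. One common definition, shown here as an assumption rather than as the library's implementation, is the Pearson correlation computed with weighted means, variances, and covariance. The input values are illustrative.

```python
import math

def weighted_corr(x, y, weights):
    """Weighted Pearson correlation between x and y."""
    total = sum(weights)
    mean_x = sum(w * xi for w, xi in zip(weights, x)) / total
    mean_y = sum(w * yi for w, yi in zip(weights, y)) / total
    cov = sum(w * (xi - mean_x) * (yi - mean_y) for w, xi, yi in zip(weights, x, y))
    var_x = sum(w * (xi - mean_x) ** 2 for w, xi in zip(weights, x))
    var_y = sum(w * (yi - mean_y) ** 2 for w, yi in zip(weights, y))
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant input; report 0
    return cov / math.sqrt(var_x * var_y)

driver_age = [20, 30, 40, 50]
residuals = [1.0, 0.5, -0.5, -1.0]
exposure = [1.0, 0.5, 0.5, 1.0]
print(round(weighted_corr(driver_age, residuals, exposure), 3))
```

A value far from zero suggests the model's treatment of that factor leaves a trend in the residuals.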

Step 3: Interaction Residual Patterns

For factor pairs, check if residual means vary across combinations:

import numpy as np

def interaction_residual_strength(residuals, bins1, bins2):
    """How much residual variance is explained by the interaction cells?"""
    # Encode each (bins1, bins2) pair as one cell ID (assumes bins2 < 1000)
    cell_ids = bins1 * 1000 + bins2
    overall_mean = residuals.mean()

    ss_total = np.sum((residuals - overall_mean) ** 2)

    # Between-cell sum of squares, weighted by the number of rows per cell
    ss_between = 0.0
    for cell in np.unique(cell_ids):
        mask = cell_ids == cell
        cell_mean = residuals[mask].mean()
        ss_between += mask.sum() * (cell_mean - overall_mean) ** 2

    return ss_between / ss_total

Discretization Strategy

Continuous factors are discretized into 5 quantile bins:

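import numpy as np

def discretize(values, n_bins=5):
    """Assign each value to one of n_bins quantile bins (labels 0..n_bins-1)."""
    quantiles = np.percentile(values, np.linspace(0, 100, n_bins + 1))
    # Interior quantiles are the bin edges; the min and max are dropped
    return np.digitize(values, quantiles[1:-1])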

Using 5 bins provides:
- Sufficient granularity to detect non-linear interactions
- Enough observations per cell for stable estimates
- Manageable number of combinations (5 × 5 = 25 cells maximum for two continuous factors)
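A dependency-free sketch of the same quantile binning, using only the standard library, shows what the bin labels look like. Here `statistics.quantiles(..., method="inclusive")` plays the role of the interior percentiles and `bisect_right` mirrors the default behavior of `digitize`; treat this as an illustration of the idea rather than the RustyStats code.

```python
from bisect import bisect_right
from collections import Counter
from statistics import quantiles

def discretize(values, n_bins=5):
    """Assign each value a quantile bin index in 0..n_bins-1."""
    # n_bins - 1 interior cut points between the minimum and maximum
    cut_points = quantiles(values, n=n_bins, method="inclusive")
    return [bisect_right(cut_points, v) for v in values]

ages = list(range(1, 101))   # 100 illustrative values: 1..100
bins = discretize(ages)
print(Counter(bins))         # 20 values in each of the 5 bins
```

Heavily tied values (for example a factor that is zero for most policies) can make several cut points coincide, leaving some bins empty.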

Configuration Parameters

Parameter	Default	Description
`max_factors`	10	Maximum factors to consider (top by marginal effect)
`min_effect_size`	0.001	Minimum η² to include factor in pairwise testing
`max_candidates`	5	Maximum interaction candidates to return
`min_cell_count`	30	Minimum observations per cell for valid estimate
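One way to read these parameters together is as successive filters on the search. The sketch below is an assumption about how they could combine, not the library's actual control flow; `select_factors`, `candidate_pairs`, and the η² values are illustrative.

```python
from itertools import combinations

def select_factors(eta_squared_by_factor, max_factors=10, min_effect_size=0.001):
    """Keep factors above the effect-size floor, strongest first, capped at max_factors."""
    kept = [(name, eta) for name, eta in eta_squared_by_factor.items()
            if eta >= min_effect_size]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in kept[:max_factors]]

def candidate_pairs(factors):
    """All unordered factor pairs: p * (p - 1) / 2 of them."""
    return list(combinations(factors, 2))

# Illustrative eta-squared values, not real results
eta_squared = {"DrivAge": 0.012, "VehAge": 0.004, "Region": 0.008, "VehPower": 0.0004}
factors = select_factors(eta_squared)
pairs = candidate_pairs(factors)
print(factors)   # VehPower falls below min_effect_size and is dropped
print(pairs)
```

With the default `max_factors` of 10, at most 45 pairs are tested, which keeps the pairwise stage cheap.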

Computational Complexity

Factor ranking: O(n × p) where n = observations, p = factors
Pairwise testing: O(n × p²) worst case
Total: O(n × p²)

For 678,000 observations and 10 factors, this takes < 1 second.

Limitations

Only pairwise: Does not detect 3-way or higher interactions
Discretization: May miss interactions that depend on exact continuous values
Correlation ≠ Causation: Statistical significance doesn't imply the interaction should be modeled
Multiple testing: With many factor pairs, some will appear significant by chance
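One standard way to temper the multiple-testing problem is to adjust the p-values before applying a threshold. The sketch below implements the Benjamini-Hochberg procedure in plain Python. It is a general statistical technique, not something this document says RustyStats applies, and the p-values are illustrative.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

raw = [0.001, 0.008, 0.039, 0.041, 0.60]   # illustrative p-values for 5 factor pairs
adjusted = benjamini_hochberg(raw)
for p, q in zip(raw, adjusted):
    print(f"raw={p:.3f}  adjusted={q:.3f}")
```

Here the third and fourth pairs pass a raw 0.05 cutoff but not an adjusted one.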

Best Practices

Use domain knowledge — Prioritize interactions that make business sense
Validate on holdout — Test if interaction improves out-of-sample performance
Check stability — Ensure interaction effect is consistent across time periods
Consider parsimony — Only add interactions that meaningfully improve the model

Code Examples

Pre-Fit Exploration

import rustystats as rs

# Explore data for potential interactions
exploration = rs.explore_data(
    data=data,
    response="ClaimNb",
    categorical_factors=["Region", "VehBrand", "Area"],
    continuous_factors=["DrivAge", "VehAge", "VehPower"],
    exposure="Exposure",
    family="poisson",
)

# Check interaction candidates
for ic in exploration.interaction_candidates:
    print(f"{ic.factor1} × {ic.factor2}: strength={ic.interaction_strength:.4f}, p={ic.pvalue:.4f}")

Post-Fit Diagnostics

# Fit base model
result = rs.glm(
    "ClaimNb ~ DrivAge + VehAge + C(Region)",
    data=data,
    family="poisson",
    offset="Exposure"
).fit()

# Check for missing interactions
diagnostics = result.diagnostics(
    data=data,
    categorical_factors=["Region", "VehBrand"],
    continuous_factors=["DrivAge", "VehAge"],
)

# Flag detected interactions that may be worth adding to the model
for ic in diagnostics.interaction_candidates:
    if ic.interaction_strength > 0.01:
        print(f"Consider adding: {ic.factor1}:{ic.factor2}")
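The rules of thumb from the interpretation section can be folded into a small triage helper. This sketch assumes only the `interaction_strength` and `pvalue` fields described earlier; the `Candidate` class is a stand-in for whatever object the library returns, and the labels and example values are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """Stand-in for an interaction candidate (hypothetical, for illustration)."""
    factor1: str
    factor2: str
    interaction_strength: float
    pvalue: float

def triage(candidate):
    """Label a candidate using the strength and p-value rules of thumb."""
    if candidate.pvalue >= 0.01:
        return "not significant"
    if candidate.interaction_strength > 0.05:
        return "likely important"
    if candidate.interaction_strength > 0.01:
        return "worth investigating"
    return "weak"

examples = [
    Candidate("DrivAge", "VehPower", 0.062, 0.0004),
    Candidate("DrivAge", "Region", 0.018, 0.003),
    Candidate("VehAge", "Area", 0.030, 0.20),
]
for c in examples:
    print(f"{c.factor1} x {c.factor2}: {triage(c)}")
```

Checking significance before strength is one possible ordering; either way, the labels are a starting point for review, not a decision rule.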

Adding an Interaction

# Add interaction term
result_with_interaction = rs.glm(
    "ClaimNb ~ DrivAge * VehAge + C(Region)",  # main effects plus the DrivAge:VehAge interaction
    data=data,
    family="poisson",
    offset="Exposure"
).fit()

# Compare AIC
print(f"Base AIC: {result.aic():.1f}")
print(f"With interaction AIC: {result_with_interaction.aic():.1f}")
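AIC trades goodness of fit against the number of parameters: AIC = 2k − 2·ln(L), with lower values preferred. The sketch below shows the comparison with made-up log-likelihoods; the `aic` function here is a local helper, separate from the `result.aic()` method used above.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L); lower is better."""
    return 2 * n_params - 2 * log_likelihood

# Made-up values for illustration only
base_aic = aic(log_likelihood=-1050.0, n_params=5)
interaction_aic = aic(log_likelihood=-1044.0, n_params=6)

delta = base_aic - interaction_aic
print(f"Base AIC: {base_aic:.1f}")                     # 2110.0
print(f"With interaction AIC: {interaction_aic:.1f}")  # 2100.0
print(f"Improvement: {delta:.1f}")                     # 10.0
```

Each extra parameter costs 2 AIC points, so a one-parameter interaction has to raise the log-likelihood by more than 1 to come out ahead.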