Interaction Detection
This document explains how RustyStats detects potential interactions between factors, helping you decide which interaction terms to include in your GLM.
High-Level Explanation (Non-Technical)
What is an Interaction?
An interaction occurs when the effect of one factor on the outcome depends on the value of another factor.
Example without interaction:
- Older drivers have fewer claims
- Sports cars have more claims
- These effects are independent: the age effect is the same whether driving a sedan or a sports car
Example with interaction:
- Young drivers in sports cars have disproportionately more claims than you'd expect from adding the "young driver" effect and the "sports car" effect together
- The combination is worse than the sum of its parts
Why Detect Interactions?
Including important interactions in your model:
- Improves accuracy — Captures real patterns the main effects miss
- Better pricing — Avoids under/over-charging specific segments
- Regulatory compliance — Demonstrates you've properly modeled risk
Missing important interactions can lead to:
- Systematic under-pricing for high-risk segments
- Cross-subsidization between customer groups
- Poor model performance on specific populations
How RustyStats Detects Interactions
RustyStats uses two approaches depending on whether you have a fitted model:
Pre-Fit Detection (Data Exploration)
Before fitting a model, we look at how the response rate varies across combinations of factors:
- Divide each factor into groups (bins for continuous, levels for categorical)
- Create cells from all combinations (e.g., Age bin × Region)
- Calculate average response in each cell
- Compare to what we'd expect if effects were independent
- Flag pairs where the combined effect differs significantly
This answers: "Which factor pairs show non-additive effects on the response?"
Post-Fit Detection (Model Diagnostics)
After fitting a model, we look at residual patterns:
- Compute residuals (actual - predicted)
- Group by factor combinations
- Check if residuals are systematically high/low for certain combinations
- Flag pairs where the model misses a pattern
This answers: "Which factor pairs does my current model fail to capture?"
Interpreting the Results
The output includes:
| Field | Meaning |
|---|---|
| factor1, factor2 | The two factors with potential interaction |
| interaction_strength | How much variance is explained (0-1 scale) |
| pvalue | Statistical significance (lower = more confident) |
| n_cells | Number of valid combinations tested |
Rules of thumb:
- strength > 0.01: Worth investigating
- strength > 0.05: Likely important
- pvalue < 0.01: Statistically significant
What to Do With the Results
- Review top candidates — Look at the factor pairs with highest strength
- Visualize the pattern — Plot response rates by combination
- Business sense check — Does the interaction make intuitive sense?
- Test in model — Add the interaction term and compare fit metrics
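Technical Explanation
Mathematical Framework
Interaction Effect Definition
For two factors X₁ and X₂, an interaction exists when no choice of main-effect terms f₁ and f₂ gives an additive form on the link scale:
$$g\big(E[Y \mid X_1, X_2]\big) \neq \beta_0 + f_1(X_1) + f_2(X_2)$$
where g is the link function. The model requires an additional term:
$$g\big(E[Y \mid X_1, X_2]\big) = \beta_0 + f_1(X_1) + f_2(X_2) + f_{12}(X_1, X_2)$$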
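To make the definition concrete, here is a small numeric sketch under a log link. The coefficient values are invented for illustration and do not come from any fitted model; the function name `frequency` is likewise hypothetical.

```python
import math

# Hypothetical log-scale coefficients (illustrative values only).
beta0 = math.log(0.05)   # baseline claim frequency of 5%
beta_young = 0.40        # main effect: young driver
beta_sports = 0.30       # main effect: sports car
beta_inter = 0.25        # interaction: young driver AND sports car


def frequency(young: int, sports: int, with_interaction: bool) -> float:
    """Expected frequency under a log link, g(mu) = log(mu)."""
    eta = beta0 + beta_young * young + beta_sports * sports
    if with_interaction:
        eta += beta_inter * young * sports
    return math.exp(eta)


main_only = frequency(1, 1, with_interaction=False)
with_term = frequency(1, 1, with_interaction=True)
print(f"main effects only: {main_only:.4f}")
print(f"with interaction:  {with_term:.4f}")
print(f"ratio:             {with_term / main_only:.4f}")  # equals exp(beta_inter)
```

With a log link, "no interaction" means the effects multiply on the response scale, so the interaction term measures the departure from that multiplicative baseline rather than from a simple sum of frequencies.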
Pre-Fit Detection Algorithm
Step 1: Factor Ranking by Marginal Effect
First, we rank factors by their univariate association with the response using eta-squared (η²):
```python
import numpy as np

def compute_eta_squared(y_rate, exposure, factor_bins):
    """Compute exposure-weighted variance explained by factor grouping.

    All three inputs are expected to be 1-D NumPy arrays of equal length.
    """
    overall_mean = np.average(y_rate, weights=exposure)
    ss_total = np.sum(exposure * (y_rate - overall_mean) ** 2)
    ss_between = 0.0
    for level in np.unique(factor_bins):
        mask = factor_bins == level
        level_mean = np.average(y_rate[mask], weights=exposure[mask])
        ss_between += np.sum(exposure[mask]) * (level_mean - overall_mean) ** 2
    return ss_between / ss_total
```
Interpretation:
- η² = 0: Factor has no marginal effect
- η² = 0.01: 1% of response variance explained
- η² = 0.10: Strong effect
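As a rough illustration of how η² separates an informative grouping from an uninformative one, the following standard-library sketch recomputes the same exposure-weighted quantity on toy data. The function name `eta_squared` and all data values are made up for this example.

```python
from collections import defaultdict


def eta_squared(y_rate, exposure, levels):
    """Exposure-weighted share of variance explained by a grouping."""
    total_w = sum(exposure)
    overall = sum(w * y for w, y in zip(exposure, y_rate)) / total_w
    ss_total = sum(w * (y - overall) ** 2 for w, y in zip(exposure, y_rate))

    # Accumulate exposure and exposure-weighted response per level.
    w_sum = defaultdict(float)
    wy_sum = defaultdict(float)
    for y, w, lvl in zip(y_rate, exposure, levels):
        w_sum[lvl] += w
        wy_sum[lvl] += w * y

    ss_between = sum(
        w_sum[lvl] * (wy_sum[lvl] / w_sum[lvl] - overall) ** 2 for lvl in w_sum
    )
    return ss_between / ss_total if ss_total > 0 else 0.0


# Toy data: region "A" has a visibly higher rate than region "B".
y_rate = [0.10, 0.12, 0.11, 0.04, 0.05, 0.06]
exposure = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
region = ["A", "A", "A", "B", "B", "B"]
shuffled = ["A", "B", "A", "B", "A", "B"]  # labels unrelated to the rate

eta_region = eta_squared(y_rate, exposure, region)
eta_shuffled = eta_squared(y_rate, exposure, shuffled)
print(f"region:   {eta_region:.3f}")
print(f"shuffled: {eta_shuffled:.3f}")
```

The grouping that tracks the response yields a value close to 1, while the unrelated labelling stays near 0.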
Step 2: Pairwise Interaction Testing
For top-ranked factors, we test all pairs:
```python
import numpy as np

def compute_interaction_strength(y_rate, exposure, bins1, bins2, min_cell_count):
    """Compute R² from interaction cell grouping (inputs are NumPy arrays)."""
    # Create interaction cells; this encoding assumes bins2 < 1000
    cell_ids = bins1 * 1000 + bins2
    # Filter cells with sufficient data
    cells, counts = np.unique(cell_ids, return_counts=True)
    valid_cells = cells[counts >= min_cell_count]
    if len(valid_cells) < 4:
        return None  # Insufficient data
    # Compute variance explained by cell grouping
    overall_mean = np.average(y_rate, weights=exposure)
    ss_total = np.sum(exposure * (y_rate - overall_mean) ** 2)
    ss_model = 0.0
    for cell in valid_cells:
        mask = cell_ids == cell
        cell_mean = np.average(y_rate[mask], weights=exposure[mask])
        ss_model += np.sum(exposure[mask]) * (cell_mean - overall_mean) ** 2
    r_squared = ss_model / ss_total
    return r_squared
```
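The cell-filtering step can also be sketched with tuple keys instead of the `bins1 * 1000 + bins2` encoding. This is only an illustrative alternative, not the library's implementation; `valid_cells` is a hypothetical helper. Tuple keys sidestep the collision that the integer encoding would produce if a factor ever had 1000 or more levels.

```python
from collections import Counter


def valid_cells(bins1, bins2, min_cell_count=30):
    """Return the set of (bin1, bin2) cells with enough observations."""
    counts = Counter(zip(bins1, bins2))
    return {cell for cell, n in counts.items() if n >= min_cell_count}


# Toy data: cell (0, 0) has 3 rows, (1, 1) has 2 rows, (2, 0) has 1 row.
bins1 = [0, 0, 0, 1, 1, 2]
bins2 = [0, 0, 0, 1, 1, 0]
cells = valid_cells(bins1, bins2, min_cell_count=2)
print(sorted(cells))

# The integer encoding maps two different cells to the same id
# once the second factor reaches 1000 levels.
collision = (0 * 1000 + 1000) == (1 * 1000 + 0)
print(collision)
```

With 5 quantile bins and typical categorical factors the integer encoding is safe; the tuple form simply removes the assumption.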
Step 3: Statistical Significance
We use an F-test to assess significance:
F = (SS_model / df_model) / (SS_residual / df_residual)
where:
df_model = n_cells - 1
df_residual = n_observations - n_cells
The p-value comes from the F-distribution with (df_model, df_residual) degrees of freedom.
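Because R² = SS_model / SS_total, the F statistic can be written directly in terms of the interaction strength, assuming SS_residual = SS_total - SS_model. The sketch below uses that identity; `f_statistic` is a hypothetical helper name and the observation counts are illustrative.

```python
def f_statistic(r_squared: float, n_cells: int, n_observations: int) -> float:
    """F statistic for a cell grouping, expressed through R².

    Assumes SS_residual = SS_total - SS_model, so that
    F = (R² / df_model) / ((1 - R²) / df_residual).
    """
    df_model = n_cells - 1
    df_residual = n_observations - n_cells
    if df_model <= 0 or df_residual <= 0 or not 0.0 <= r_squared < 1.0:
        raise ValueError("need n_cells > 1, n_observations > n_cells, 0 <= R² < 1")
    return (r_squared / df_model) / ((1.0 - r_squared) / df_residual)


# Same strength (0.01) and cell count, very different sample sizes.
f_small = f_statistic(0.01, n_cells=25, n_observations=1_000)
f_large = f_statistic(0.01, n_cells=25, n_observations=678_000)
print(f"n = 1,000:   F = {f_small:.2f}")
print(f"n = 678,000: F = {f_large:.2f}")
```

The same strength gives a far larger F on a large portfolio, which is why the strength and the p-value are best read together. The p-value itself is the upper tail of the F-distribution at this statistic; with SciPy installed, `scipy.stats.f.sf(F, df_model, df_residual)` is one way to obtain it.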
Post-Fit Detection Algorithm
After fitting a model, we detect interactions the model is missing:
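Step 1: Compute Residuals
As in the high-level description, residuals are actual minus predicted values:
$$r_i = y_i - \hat{\mu}_i$$
Step 2: Residual Association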
For each factor, compute correlation with residuals:
Continuous factors:
Categorical factors:
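One way to measure residual association is sketched below. It assumes Pearson correlation for continuous factors and η² of the residuals by level for categorical factors; both choices, along with the helper names `pearson_corr` and `residual_eta_squared`, are assumptions for illustration rather than RustyStats' confirmed statistics.

```python
import math
from collections import defaultdict


def pearson_corr(x, r):
    """Pearson correlation between a continuous factor and the residuals."""
    n = len(x)
    mx, mr = sum(x) / n, sum(r) / n
    sxy = sum((a - mx) * (b - mr) for a, b in zip(x, r))
    sxx = sum((a - mx) ** 2 for a in x)
    srr = sum((b - mr) ** 2 for b in r)
    return sxy / math.sqrt(sxx * srr) if sxx > 0 and srr > 0 else 0.0


def residual_eta_squared(levels, r):
    """Share of residual variance explained by a categorical factor."""
    mean_r = sum(r) / len(r)
    ss_total = sum((b - mean_r) ** 2 for b in r)
    groups = defaultdict(list)
    for lvl, b in zip(levels, r):
        groups[lvl].append(b)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - mean_r) ** 2 for g in groups.values()
    )
    return ss_between / ss_total if ss_total > 0 else 0.0


# Toy residuals that rise steadily with driver age.
residuals = [-0.2, -0.1, 0.0, 0.1, 0.2]
driver_age = [20, 30, 40, 50, 60]
region = ["N", "N", "S", "S", "S"]

corr_age = pearson_corr(driver_age, residuals)
eta_region = residual_eta_squared(region, residuals)
print(f"corr(DrivAge, residual) = {corr_age:.3f}")
print(f"eta²(Region, residual)  = {eta_region:.3f}")
```

A factor whose residual association is far from zero suggests the fitted model has not fully captured that factor's effect.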
Step 3: Interaction Residual Patterns
For factor pairs, check if residual means vary across combinations:
```python
import numpy as np

def interaction_residual_strength(residuals, bins1, bins2):
    """How much residual variance is explained by the interaction?"""
    cell_ids = bins1 * 1000 + bins2  # assumes bins2 < 1000
    overall_mean = np.mean(residuals)
    ss_total = np.sum((residuals - overall_mean) ** 2)
    ss_between = 0.0
    for cell in np.unique(cell_ids):
        mask = cell_ids == cell
        cell_mean = np.mean(residuals[mask])
        ss_between += np.count_nonzero(mask) * (cell_mean - overall_mean) ** 2
    return ss_between / ss_total
```
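The toy example below illustrates why residuals are grouped by factor pairs rather than by single factors. The residuals follow a checkerboard pattern, so each factor alone explains none of the residual variance while the pair explains all of it. The data and the helper name `between_share` are invented for this sketch.

```python
from collections import defaultdict


def between_share(residuals, keys):
    """Share of residual variance explained by grouping on `keys`."""
    mean_r = sum(residuals) / len(residuals)
    ss_total = sum((r - mean_r) ** 2 for r in residuals)
    groups = defaultdict(list)
    for k, r in zip(keys, residuals):
        groups[k].append(r)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - mean_r) ** 2 for g in groups.values()
    )
    return ss_between / ss_total if ss_total > 0 else 0.0


# Checkerboard residuals: positive when the bins match, negative otherwise.
bins1 = [0, 0, 1, 1] * 2
bins2 = [0, 1, 0, 1] * 2
residuals = [0.3, -0.3, -0.3, 0.3] * 2

alone1 = between_share(residuals, bins1)
alone2 = between_share(residuals, bins2)
pair = between_share(residuals, list(zip(bins1, bins2)))
print(f"factor 1 alone: {alone1:.3f}")
print(f"factor 2 alone: {alone2:.3f}")
print(f"pair:           {pair:.3f}")
```

Real residuals are noisier, but the same logic applies: a pair that explains residual variance beyond what either factor explains alone points to a missing interaction term.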
Discretization Strategy
Continuous factors are discretized into 5 quantile bins:
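```python
import numpy as np

def discretize(values, n_bins=5):
    # Interior quantile edges split the data into n_bins groups (indices 0..n_bins-1)
    quantiles = np.percentile(values, np.linspace(0, 100, n_bins + 1))
    return np.digitize(values, quantiles[1:-1])
```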
Using 5 bins provides:
- Sufficient granularity to detect non-linear interactions
- Enough observations per cell for stable estimates
- Manageable number of combinations (5 × 5 = 25 cells maximum)
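Quantile binning can be tried out with the standard library alone. The sketch below is an approximation of the idea rather than the library's code: it uses `statistics.quantiles` for the interior cut points and `bisect` for the bin lookup, and the driver ages are hypothetical.

```python
import bisect
import statistics


def discretize(values, n_bins=5):
    """Assign each value a quantile-bin index in 0..n_bins-1."""
    # statistics.quantiles returns the n_bins - 1 interior cut points.
    cuts = statistics.quantiles(values, n=n_bins, method="inclusive")
    # bisect_right counts how many cut points are <= the value.
    return [bisect.bisect_right(cuts, v) for v in values]


ages = list(range(18, 68))  # 50 hypothetical driver ages, all distinct
bins = discretize(ages)
for b in range(5):
    print(f"bin {b}: {bins.count(b)} observations")
```

With distinct values the bins come out equally sized. When a factor has many tied values, several cut points can coincide and some bins may end up empty, which reduces the number of usable cells.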
Configuration Parameters
| Parameter | Default | Description |
|---|---|---|
| max_factors | 10 | Maximum factors to consider (top by marginal effect) |
| min_effect_size | 0.001 | Minimum η² to include factor in pairwise testing |
| max_candidates | 5 | Maximum interaction candidates to return |
| min_cell_count | 30 | Minimum observations per cell for valid estimate |
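To show how the first two parameters might combine, here is a small sketch. `InteractionConfig` and `select_factors` are hypothetical names used only for illustration and are not part of the RustyStats API; treating `min_effect_size` as an inclusive threshold is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InteractionConfig:  # hypothetical container, not a RustyStats class
    max_factors: int = 10
    min_effect_size: float = 0.001
    max_candidates: int = 5
    min_cell_count: int = 30


def select_factors(eta_by_factor: dict, cfg: InteractionConfig) -> list:
    """Keep factors with η² >= min_effect_size, then the top max_factors."""
    kept = [
        (name, eta)
        for name, eta in eta_by_factor.items()
        if eta >= cfg.min_effect_size  # assumption: threshold is inclusive
    ]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in kept[: cfg.max_factors]]


# Illustrative η² values; VehAge falls below the default threshold.
etas = {"DrivAge": 0.02, "VehAge": 0.0005, "Region": 0.008}
selected = select_factors(etas, InteractionConfig())
print(selected)
```

Lowering `min_effect_size` admits weaker factors into pairwise testing at the cost of more pairs to evaluate.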
Computational Complexity
- Factor ranking: O(n × p) where n = observations, p = factors
- Pairwise testing: O(n × p²) worst case
- Total: O(n × p²)
For 678,000 observations and 10 factors, this takes < 1 second.
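The p² term comes from the number of factor pairs, p(p - 1)/2, each of which needs one pass over the data. A quick way to see how many pairwise tests a given factor list implies (the factor names are taken from the examples in this document):

```python
from itertools import combinations
from math import comb

factors = ["DrivAge", "VehAge", "VehPower", "Region", "VehBrand", "Area"]
pairs = list(combinations(factors, 2))
n_pairs = len(pairs)          # pairwise tests for these 6 factors
n_pairs_at_cap = comb(10, 2)  # upper bound with the default max_factors = 10

print(f"{n_pairs} pairs for {len(factors)} factors")
print(f"{n_pairs_at_cap} pairs at the max_factors cap")
```

Because `max_factors` caps p, the number of pairs stays bounded no matter how many columns the dataset has.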
Limitations
- Only pairwise: Does not detect 3-way or higher interactions
- Discretization: May miss interactions that depend on exact continuous values
- Correlation ≠ Causation: Statistical significance doesn't imply the interaction should be modeled
- Multiple testing: With many factor pairs, some will appear significant by chance
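One common way to guard against the multiple-testing issue is to adjust the p-values before acting on them. The sketch below applies the Benjamini-Hochberg procedure, which controls the false discovery rate; it is a post-processing step you could run yourself on the reported p-values, not something the source states RustyStats does. The p-values are invented.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return booleans: True where a test survives FDR control at `alpha`."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k whose p-value satisfies p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Everything ranked at or below the cutoff is kept.
    keep = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            keep[idx] = True
    return keep


pvals = [0.001, 0.008, 0.039, 0.041, 0.20]  # hypothetical candidate p-values
naive = [p < 0.05 for p in pvals]
adjusted = benjamini_hochberg(pvals, alpha=0.05)
print(f"unadjusted: {sum(naive)} flagged")
print(f"adjusted:   {sum(adjusted)} flagged")
```

Fewer pairs survive the adjustment than the unadjusted threshold would flag, which is the intended effect when many pairs are tested at once.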
Best Practices
- Use domain knowledge — Prioritize interactions that make business sense
- Validate on holdout — Test if interaction improves out-of-sample performance
- Check stability — Ensure interaction effect is consistent across time periods
- Consider parsimony — Only add interactions that meaningfully improve the model
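For the holdout check, one option for a Poisson frequency model is to compare mean Poisson deviance on held-out data. The sketch below assumes you already have predicted claim counts from both models; the function name and all numbers are hypothetical.

```python
import math


def mean_poisson_deviance(y, mu):
    """Average Poisson deviance; lower is better. Requires every mu > 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        # The y * log(y / mu) term is defined as 0 when y == 0.
        term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += 2.0 * (term - (yi - mi))
    return total / len(y)


# Hypothetical holdout claim counts and predictions from two models.
y_holdout = [0, 1, 0, 2, 0, 1]
mu_base = [0.3, 0.6, 0.3, 0.8, 0.4, 0.6]
mu_interaction = [0.2, 0.8, 0.2, 1.6, 0.3, 0.9]

dev_base = mean_poisson_deviance(y_holdout, mu_base)
dev_interaction = mean_poisson_deviance(y_holdout, mu_interaction)
print(f"base model:       {dev_base:.4f}")
print(f"with interaction: {dev_interaction:.4f}")
```

An interaction is worth keeping when the out-of-sample deviance drops by a margin that holds up across time periods, not just on a single split.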
Code Examples
Pre-Fit Exploration
```python
import rustystats as rs

# Explore data for potential interactions
exploration = rs.explore_data(
    data=data,
    response="ClaimNb",
    categorical_factors=["Region", "VehBrand", "Area"],
    continuous_factors=["DrivAge", "VehAge", "VehPower"],
    exposure="Exposure",
    family="poisson",
)

# Check interaction candidates
for ic in exploration.interaction_candidates:
    print(f"{ic.factor1} × {ic.factor2}: strength={ic.interaction_strength:.4f}, p={ic.pvalue:.4f}")
```
Post-Fit Diagnostics
```python
# Fit base model
result = rs.glm(
    "ClaimNb ~ DrivAge + VehAge + C(Region)",
    data=data,
    family="poisson",
    offset="Exposure"
).fit()

# Check for missing interactions
diagnostics = result.diagnostics(
    data=data,
    categorical_factors=["Region", "VehBrand"],
    continuous_factors=["DrivAge", "VehAge"],
)

# Report candidates strong enough to consider adding to the model
for ic in diagnostics.interaction_candidates:
    if ic.interaction_strength > 0.01:
        print(f"Consider adding: {ic.factor1}:{ic.factor2}")
```
Adding an Interaction
```python
# Add interaction term
result_with_interaction = rs.glm(
    "ClaimNb ~ DrivAge * VehAge + C(Region)",  # expands to DrivAge + VehAge + DrivAge:VehAge
    data=data,
    family="poisson",
    offset="Exposure"
).fit()

# Compare AIC (lower is better)
print(f"Base AIC: {result.aic():.1f}")
print(f"With interaction AIC: {result_with_interaction.aic():.1f}")
```