Dict API Reference

The Dict API (glm_dict) is RustyStats' primary interface. Models are specified as plain Python dicts rather than formula strings, which makes it well suited to programmatic model building in automated workflows and agents.

glm_dict

Create a GLM specification with dict-based term definitions.

rustystats.glm_dict(
    response,
    terms,
    data,
    family="gaussian",
    link=None,
    offset=None,
    weights=None,
    interactions=None,
    theta=None,
    var_power=1.5,
    seed=None,
)

Parameters

Parameter	Type	Default	Description
`response`	str	required	Column name for response variable
`terms`	dict	required	Term specifications (see below)
`data`	DataFrame	required	Polars or Pandas DataFrame
`family`	str	`"gaussian"`	Distribution family
`link`	str	None	Link function (optional)
`offset`	str	None	Column name for offset
`weights`	str	None	Column name for weights
`interactions`	list	None	Interaction specifications (see below)
`theta`	float	None	Negative Binomial dispersion
`var_power`	float	1.5	Tweedie variance power
`seed`	int	None	Random seed for reproducibility

Returns

A FormulaGLMDict object. Call .fit() on it to fit the model.

Term Types

Each term in the terms dict maps a variable name to a specification dict.
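Because a specification is plain data, a terms dict can be generated from column metadata instead of being written by hand. The sketch below is illustrative only: `build_terms`, the `schema` layout, and the cardinality threshold are hypothetical and not part of RustyStats. Only the term type names come from this reference.

```python
def build_terms(schema, high_cardinality=50):
    """Map {column: (kind, n_unique)} to a Dict API style terms dict.

    `schema`, `kind`, and `high_cardinality` are illustrative assumptions.
    """
    terms = {}
    for column, (kind, n_unique) in schema.items():
        if kind == "numeric":
            terms[column] = {"type": "linear"}
        elif kind == "category" and n_unique > high_cardinality:
            # Many levels: use target encoding instead of one dummy per level
            terms[column] = {"type": "target_encoding"}
        elif kind == "category":
            terms[column] = {"type": "categorical"}
        else:
            raise ValueError(f"unknown kind for {column!r}: {kind!r}")
    return terms


schema = {
    "Age": ("numeric", 80),
    "Region": ("category", 22),
    "Brand": ("category", 400),
}
terms = build_terms(schema)
```

The resulting dict has the same shape as the hand-written examples in the sections that follow, so it could be passed as the `terms` argument.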

linear

Raw continuous variable.

terms = {
    "Age": {"type": "linear"},
    "VehPower": {"type": "linear", "monotonicity": "increasing"},  # β ≥ 0
}

Parameter	Type	Description
`monotonicity`	str	`"increasing"` (β ≥ 0) or `"decreasing"` (β ≤ 0)

categorical

Dummy encoding for categorical variables.

terms = {
    "Region": {"type": "categorical"},
    "Area": {"type": "categorical", "levels": ["A", "B", "C"]},  # Explicit levels
    "IsParis": {"type": "categorical", "level": "Paris", "source": "Region"},  # Single-level indicator
}

Parameter	Type	Description
`levels`	list	Explicit level ordering (optional)
`level`	str	Single level to create 0/1 indicator for
`source`	str	Column the indicator is built from, when the term name differs from the column name
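To make "dummy encoding" concrete, the following sketch builds the 0/1 columns by hand. It assumes the first level in the ordering is the reference level and is dropped. That choice is a common convention, not something this reference states about RustyStats, and `dummy_encode` is a hypothetical helper.

```python
def dummy_encode(values, levels=None):
    """Return {level: 0/1 column} for every level except the first.

    Assumption: the first level is the reference and gets no column.
    """
    if levels is None:
        levels = sorted(set(values))  # assumed default ordering
    non_reference = levels[1:]
    return {lvl: [1 if v == lvl else 0 for v in values] for lvl in non_reference}


area = ["A", "C", "B", "A"]
columns = dummy_encode(area, levels=["A", "B", "C"])
```

Under this convention, passing `levels` explicitly controls which level acts as the baseline.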

Single-Level Indicators

Create a binary indicator for a specific category level:

terms = {
    # 0/1 indicator: 1 if Region == "Paris", else 0
    "IsParis": {"type": "categorical", "level": "Paris", "source": "Region"},
}

Useful for:
- Testing specific level effects
- Creating custom groupings
- Simplifying high-cardinality factors to key levels

bs (B-spline)

B-spline basis for non-linear effects.

terms = {
    "Age": {"type": "bs"},                              # Penalized smooth (default k=10)
    "VehAge": {"type": "bs", "df": 5},                  # Fixed 5 df
    "Income": {"type": "bs", "k": 15},                  # Penalized with 15 basis functions
    "Risk": {"type": "bs", "monotonicity": "increasing"},  # Monotonic
}

Parameter	Type	Default	Description
`df`	int	-	Fixed degrees of freedom (no penalty)
`k`	int	10	Basis size for penalized smooth
`degree`	int	3	Polynomial degree
`monotonicity`	str	-	`"increasing"` or `"decreasing"`

Behavior:
- No df or k → penalized smooth with k=10, auto-tuned via GCV
- df=5 → fixed 5 degrees of freedom, no penalty
- k=15 → penalized smooth with 15 basis functions
- monotonicity → I-spline basis with coefficient constraints
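For readers unfamiliar with what "basis functions" and "degree" refer to, here is a minimal pure-Python evaluation of a B-spline basis using the Cox-de Boor recursion. It is a generic textbook sketch, not RustyStats' implementation, and the knot vector is an assumption chosen for illustration. A knot vector of length m with degree p yields m - p - 1 basis functions.

```python
def bspline_basis(x, knots, degree=3):
    """Evaluate every B-spline basis function at x (Cox-de Boor recursion)."""
    n_intervals = len(knots) - 1
    # Degree 0: indicator of the half-open knot interval containing x
    basis = [1.0 if knots[i] <= x < knots[i + 1] else 0.0 for i in range(n_intervals)]
    if x == knots[-1]:
        # Close the last non-empty interval so the right boundary is covered
        last = max(i for i in range(n_intervals) if knots[i] < knots[i + 1])
        basis[last] = 1.0
    for d in range(1, degree + 1):
        raised = []
        for i in range(n_intervals - d):
            left_den = knots[i + d] - knots[i]
            right_den = knots[i + d + 1] - knots[i + 1]
            # Terms with a zero denominator (repeated knots) contribute nothing
            left = (x - knots[i]) / left_den * basis[i] if left_den > 0 else 0.0
            right = (knots[i + d + 1] - x) / right_den * basis[i + 1] if right_den > 0 else 0.0
            raised.append(left + right)
        basis = raised
    return basis


# Clamped cubic knot vector on [0, 1] with one interior knot (illustrative)
knots = [0.0, 0.0, 0.0, 0.0, 0.5, 1.0, 1.0, 1.0, 1.0]
row = bspline_basis(0.25, knots, degree=3)
```

Each observation becomes one such row of the design matrix. The values are non-negative and sum to one inside the knot range.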

ns (Natural spline)

Natural cubic spline with linear extrapolation beyond boundaries.

terms = {
    "Age": {"type": "ns"},           # Penalized smooth (default k=10)
    "Income": {"type": "ns", "df": 4},  # Fixed 4 df
}

Parameter	Type	Default	Description
`df`	int	-	Fixed degrees of freedom
`k`	int	10	Basis size for penalized smooth

target_encoding

Regularized target encoding for high-cardinality categoricals.

terms = {
    "Brand": {"type": "target_encoding"},
    "Model": {"type": "target_encoding", "prior_weight": 2.0},
}

Parameter	Type	Default	Description
`prior_weight`	float	1.0	Regularization toward global mean
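One common way to read `prior_weight` is as the number of pseudo-observations at the global mean that are blended into each category's mean. The sketch below assumes that additive-smoothing formula. The exact scheme RustyStats uses is not stated in this reference, and `target_encode` is a hypothetical helper.

```python
from collections import defaultdict


def target_encode(categories, y, prior_weight=1.0):
    """Smoothed category means: (sum_y + w * global_mean) / (n + w).

    Assumed formula for illustration only.
    """
    global_mean = sum(y) / len(y)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, value in zip(categories, y):
        sums[category] += value
        counts[category] += 1
    return {
        category: (sums[category] + prior_weight * global_mean)
        / (counts[category] + prior_weight)
        for category in counts
    }


brand = ["a", "a", "b", "c", "c", "c"]
claims = [1, 0, 1, 0, 0, 0]
weak = target_encode(brand, claims, prior_weight=1.0)
strong = target_encode(brand, claims, prior_weight=2.0)
```

A larger `prior_weight` pulls every category, and rare categories most of all, closer to the global mean. Note that this naive version encodes each row using its own target value. Practical implementations usually guard against that leakage, for example with out-of-fold or ordered statistics.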

expression

Arbitrary arithmetic expressions (like R's I()).

terms = {
    "Age2": {"type": "expression", "expr": "Age ** 2"},
    "BMI": {"type": "expression", "expr": "Weight / (Height ** 2)"},
    "LogDensity": {"type": "expression", "expr": "log(Density)"},
}

Parameter	Type	Description
`expr`	str	Python expression using column names
`monotonicity`	str	`"increasing"` or `"decreasing"` (optional)

Supported operations: +, -, *, /, **, log, exp, sqrt
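The restricted operator list suggests that expressions are parsed rather than passed to `eval`. As a hedged illustration of how such a whitelist could work, the sketch below evaluates an expression against a single row using Python's `ast` module. `eval_expr` is hypothetical and says nothing about how RustyStats parses expressions internally.

```python
import ast
import math
import operator

# Whitelist mirroring the operations listed above
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}
_FUNCTIONS = {"log": math.log, "exp": math.exp, "sqrt": math.sqrt}


def eval_expr(expr, row):
    """Evaluate a whitelisted arithmetic expression against one row (a dict)."""

    def walk(node):
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.Name):
            return row[node.id]  # column lookup by name
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in _FUNCTIONS
            and len(node.args) == 1
            and not node.keywords
        ):
            return _FUNCTIONS[node.func.id](walk(node.args[0]))
        raise ValueError(f"unsupported syntax in expression: {expr!r}")

    return walk(ast.parse(expr, mode="eval").body)


row = {"Weight": 81.0, "Height": 1.8, "Density": 1.0}
bmi = eval_expr("Weight / (Height ** 2)", row)
log_density = eval_expr("log(Density)", row)
```

Anything outside the whitelist, such as attribute access or calls to other functions, raises `ValueError` instead of being executed.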

Interactions

Interactions are specified as a list of dicts. Each interaction dict contains variable specifications plus control flags.

Standard Interactions

Product terms between variables.

interactions = [
    # Continuous × Continuous
    {
        "Age": {"type": "linear"},
        "VehPower": {"type": "linear"},
        "include_main": True,  # Adds Age + VehPower + Age:VehPower
    },
    # Categorical × Continuous
    {
        "Region": {"type": "categorical"},
        "Age": {"type": "bs", "df": 4},
        "include_main": True,  # Region-specific age curves
    },
    # Categorical × Categorical
    {
        "Region": {"type": "categorical"},
        "Area": {"type": "categorical"},
        "include_main": False,  # Interaction only
    },
]

Parameter	Type	Default	Description
`include_main`	bool	True	Include main effects alongside interaction
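As a rough picture of what a product term adds to the design matrix, the sketch below builds interaction columns by hand for the continuous × continuous and categorical × continuous cases. The helper names and the column naming scheme are hypothetical, and reference-level handling is left out for brevity.

```python
def product_column(x, z):
    """Continuous x continuous: elementwise product, e.g. Age:VehPower."""
    return [a * b for a, b in zip(x, z)]


def by_level_columns(categories, x, levels):
    """Categorical x continuous: one copy of x per level, zero elsewhere."""
    return {
        f"{level}:x": [xi if c == level else 0.0 for c, xi in zip(categories, x)]
        for level in levels
    }


age = [30.0, 45.0, 60.0]
veh_power = [5.0, 7.0, 6.0]
region = ["North", "South", "North"]

age_by_power = product_column(age, veh_power)
age_by_region = by_level_columns(region, age, levels=["North", "South"])
```

With `include_main` set to False, only columns like these would enter the model. With True, the main-effect columns for each variable are included as well.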

Target Encoding Interactions

Combined target encoding for variable combinations: TE(Brand:Region).

interactions = [
    {
        "Brand": {"type": "categorical"},
        "Region": {"type": "categorical"},
        "target_encoding": True,
        "prior_weight": 1.0,  # Optional
    },
]

Creates a single encoded column for the brand×region combination, useful for high-cardinality interaction effects.

Frequency Encoding Interactions

Combined frequency encoding for variable combinations: FE(Brand:Region).

interactions = [
    {
        "Brand": {"type": "categorical"},
        "Region": {"type": "categorical"},
        "frequency_encoding": True,
    },
]

Encodes combinations by their frequency in the training data.
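A minimal sketch of frequency encoding for a pair of variables follows. It assumes the encoded value is the relative frequency of each combination. Whether RustyStats uses proportions or raw counts is not stated here, and `frequency_encode_pairs` is a hypothetical helper.

```python
from collections import Counter


def frequency_encode_pairs(left, right):
    """Replace each (left, right) combination by its share of the rows."""
    pairs = list(zip(left, right))
    counts = Counter(pairs)
    n_rows = len(pairs)
    return [counts[pair] / n_rows for pair in pairs]


brand = ["a", "a", "b", "a"]
region = ["N", "N", "S", "S"]
encoded = frequency_encode_pairs(brand, region)
```

Unlike target encoding, this uses no information from the response, so it cannot leak the target.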

Fitting

fit()

Fit the model with optional regularization.

# `model` is the object returned by rustystats.glm_dict(...)
result = model.fit()  # Standard IRLS

# With CV-based regularization
result = model.fit(regularization="ridge")  # "ridge", "lasso", "elastic_net"
result = model.fit(regularization="lasso", selection="1se", cv=5)

# With explicit alpha
result = model.fit(alpha=0.1, l1_ratio=0.0)  # Ridge
result = model.fit(alpha=0.1, l1_ratio=1.0)  # Lasso

Parameter	Type	Default	Description
`regularization`	str	None	`"ridge"`, `"lasso"`, or `"elastic_net"`
`selection`	str	`"min"`	`"min"` or `"1se"` for CV selection
`cv`	int	5	Number of CV folds
`alpha`	float	0.0	Explicit regularization strength
`l1_ratio`	float	1.0	Elastic Net mixing (0=Ridge, 1=Lasso)
`cv_seed`	int	None	Seed for reproducible CV folds
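The `selection` and `l1_ratio` options follow naming that is familiar from glmnet and scikit-learn. The sketch below shows the conventional meaning of those options under that assumption: "1se" picks the strongest penalty whose cross-validated error is within one standard error of the minimum, and the penalty mixes L1 and L2 terms. Both functions are hypothetical illustrations, and the exact scaling RustyStats applies is not documented in this reference.

```python
def select_alpha(alphas, cv_mean, cv_se, selection="min"):
    """Pick a regularization strength from cross-validation results."""
    best = min(range(len(alphas)), key=lambda i: cv_mean[i])
    if selection == "min":
        return alphas[best]
    if selection == "1se":
        # Largest alpha whose error is within one standard error of the best
        threshold = cv_mean[best] + cv_se[best]
        return max(a for a, m in zip(alphas, cv_mean) if m <= threshold)
    raise ValueError(f"unknown selection rule: {selection!r}")


def elastic_net_penalty(beta, alpha, l1_ratio):
    """alpha * (l1_ratio * |beta|_1 + (1 - l1_ratio) / 2 * |beta|_2^2)."""
    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    return alpha * (l1_ratio * l1 + 0.5 * (1.0 - l1_ratio) * l2)


alphas = [0.001, 0.01, 0.1, 1.0]
cv_mean = [0.55, 0.50, 0.51, 0.60]  # made-up CV errors for illustration
cv_se = [0.02, 0.02, 0.02, 0.02]

alpha_min = select_alpha(alphas, cv_mean, cv_se, selection="min")
alpha_1se = select_alpha(alphas, cv_mean, cv_se, selection="1se")
```

The "1se" rule trades a small amount of cross-validated accuracy for a sparser or more heavily shrunk model.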

Complete Examples

Insurance Frequency Model

import rustystats as rs
import polars as pl

data = pl.read_parquet("insurance.parquet")

result = rs.glm_dict(
    response="ClaimCount",
    terms={
        "VehAge": {"type": "bs", "monotonicity": "increasing"},
        "DrivAge": {"type": "bs"},
        "BonusMalus": {"type": "linear", "monotonicity": "increasing"},
        "VehPower": {"type": "linear"},
        "Region": {"type": "categorical"},
        "Brand": {"type": "target_encoding"},
    },
    interactions=[
        {
            "VehAge": {"type": "linear"},
            "Region": {"type": "categorical"},
            "include_main": True,
        },
    ],
    data=data,
    family="poisson",
    offset="Exposure",
    seed=42,
).fit()

print(result.summary())

Regularized Model

result = rs.glm_dict(
    response="ClaimCount",
    terms={
        "Age": {"type": "linear"},
        "Income": {"type": "linear"},
        "Region": {"type": "categorical"},
    },
    data=data,
    family="poisson",
).fit(regularization="elastic_net", selection="1se")

print(f"Selected alpha: {result.alpha}")
print(f"Non-zero features: {result.n_nonzero()}")

High-Cardinality Features

result = rs.glm_dict(
    response="ClaimCount",
    terms={
        "Age": {"type": "bs"},
        "Brand": {"type": "target_encoding"},
        "Model": {"type": "target_encoding"},
        "ZipCode": {"type": "target_encoding", "prior_weight": 2.0},
    },
    interactions=[
        {
            "Brand": {"type": "categorical"},
            "Region": {"type": "categorical"},
            "target_encoding": True,  # TE(Brand:Region)
        },
    ],
    data=data,
    family="poisson",
    offset="Exposure",
).fit()

Validation

validate()

Check design matrix for issues before fitting.

model = rs.glm_dict(
    response="y",
    terms={"x": {"type": "ns", "df": 4}, "cat": {"type": "categorical"}},
    data=data,
    family="poisson",
)
results = model.validate()

if not results['valid']:
    print("Issues:", results['suggestions'])

Checks performed:
- Rank deficiency (linearly dependent columns)
- High multicollinearity (condition number)
- Zero variance columns
- NaN/Inf values
- Highly correlated column pairs (>0.999)
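Three of these checks are simple enough to sketch without a linear algebra library. The example below is a simplified stand-in, not the logic behind `validate()`: `check_columns` is hypothetical, it mirrors only the `valid` and `suggestions` keys shown above, and it omits the rank and condition-number checks.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant columns."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def check_columns(columns, corr_threshold=0.999):
    """Return {'valid': bool, 'suggestions': [...]} for a dict of numeric columns."""
    suggestions = []
    clean = []  # columns that are finite and non-constant
    for name, col in columns.items():
        if any(math.isnan(v) or math.isinf(v) for v in col):
            suggestions.append(f"{name}: contains NaN/Inf values")
        elif min(col) == max(col):
            suggestions.append(f"{name}: zero variance")
        else:
            clean.append(name)
    # Pairwise correlation only among columns that passed the checks above
    for i, a in enumerate(clean):
        for b in clean[i + 1:]:
            if abs(pearson(columns[a], columns[b])) > corr_threshold:
                suggestions.append(f"{a}, {b}: correlation above {corr_threshold}")
    return {"valid": not suggestions, "suggestions": suggestions}


report = check_columns({
    "x": [1.0, 2.0, 3.0, 4.0],
    "x_doubled": [2.0, 4.0, 6.0, 8.0],
    "const": [1.0, 1.0, 1.0, 1.0],
    "z": [1.0, -1.0, 0.5, 2.0],
})
```

Here `x_doubled` is an exact multiple of `x`, so the pair is flagged, and `const` is flagged for having no variation.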

Comparison: Dict API vs Formula API

Feature	Dict API	Formula API
Programmatic building	✓ Native	Requires string construction
Agent/automation friendly	✓ Yes	Requires string parsing
Complex interactions	✓ Explicit	Limited syntax
TE interactions	✓ Yes	Limited
FE interactions	✓ Yes	No
Monotonicity constraints	✓ `linear`, `bs`, and `expression` terms	Limited

The Dict API is recommended for production systems and automated workflows.