Dict API Reference¶
The Dict API (glm_dict) is RustyStats' primary interface. It is designed for programmatic model building, which makes it well suited to automated workflows and agents.
glm_dict¶
Create a GLM specification with dict-based term definitions.
rustystats.glm_dict(
    response,
    terms,
    data,
    family="gaussian",
    link=None,
    offset=None,
    weights=None,
    interactions=None,
    theta=None,
    var_power=1.5,
    seed=None,
)
Parameters¶
| Parameter | Type | Description |
|---|---|---|
| response | str | Column name for response variable |
| terms | dict | Term specifications (see below) |
| data | DataFrame | Polars or Pandas DataFrame |
| family | str | Distribution family |
| link | str | Link function (optional) |
| offset | str | Column name for offset |
| weights | str | Column name for weights |
| interactions | list | Interaction specifications (see below) |
| theta | float | Negative Binomial dispersion |
| var_power | float | Tweedie variance power |
| seed | int | Random seed for reproducibility |
Returns¶
A FormulaGLMDict object. Call .fit() on it to fit the model.
Term Types¶
Each term in the terms dict maps a variable name to a specification dict.
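Because each specification is a plain dict, the terms argument can be generated from column metadata rather than written by hand. The following is a minimal sketch of that idea. The `build_terms` helper, the `schema` and `cardinality` inputs, and the `max_levels` threshold are all hypothetical and not part of RustyStats; only the shape of the resulting dict follows this reference.

```python
def build_terms(schema, cardinality, max_levels=20):
    """Map column metadata to term specification dicts.

    `schema`, `cardinality`, and `max_levels` are hypothetical inputs for
    this sketch; only the output shape follows the Dict API reference.
    """
    terms = {}
    for column, kind in schema.items():
        if kind == "numeric":
            # Continuous column: plain linear term
            terms[column] = {"type": "linear"}
        elif cardinality.get(column, 0) > max_levels:
            # Many levels: fall back to regularized target encoding
            terms[column] = {"type": "target_encoding"}
        else:
            # Few levels: dummy encoding
            terms[column] = {"type": "categorical"}
    return terms


schema = {"Age": "numeric", "Region": "category", "Brand": "category"}
cardinality = {"Region": 12, "Brand": 300}
terms = build_terms(schema, cardinality)
print(terms)
```

The resulting dict can be passed as the terms argument of glm_dict.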
linear¶
Raw continuous variable.
terms = {
    "Age": {"type": "linear"},
    "VehPower": {"type": "linear", "monotonicity": "increasing"},  # β ≥ 0
}
| Parameter | Type | Description |
|---|---|---|
| monotonicity | str | "increasing" (β ≥ 0) or "decreasing" (β ≤ 0) |
categorical¶
Dummy encoding for categorical variables.
terms = {
    "Region": {"type": "categorical"},
    "Area": {"type": "categorical", "levels": ["A", "B", "C"]},  # Explicit levels
    "IsParis": {"type": "categorical", "level": "Paris"},  # Single level indicator
}
| Parameter | Type | Description |
|---|---|---|
| levels | list | Explicit level ordering (optional) |
| level | str | Single level to create 0/1 indicator for |
Single-Level Indicators¶
Create a binary indicator for a specific category level:
terms = {
    # 0/1 indicator: 1 if Region == "Paris", else 0
    "IsParis": {"type": "categorical", "level": "Paris", "source": "Region"},
}
Useful for:
- Testing specific level effects
- Creating custom groupings
- Simplifying high-cardinality factors to key levels
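The column that a single-level indicator represents can be written out in plain Python. This sketch only illustrates the 0/1 rule stated above; `level_indicator` is a hypothetical helper, not a RustyStats function.

```python
def level_indicator(values, level):
    """Return 1 where the value equals `level`, else 0."""
    return [1 if v == level else 0 for v in values]


region = ["Paris", "Lyon", "Paris", "Nice"]
is_paris = level_indicator(region, "Paris")
print(is_paris)  # [1, 0, 1, 0]
```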
bs (B-spline)¶
B-spline basis for non-linear effects.
terms = {
    "Age": {"type": "bs"},  # Penalized smooth (default k=10)
    "VehAge": {"type": "bs", "df": 5},  # Fixed 5 df
    "Income": {"type": "bs", "k": 15},  # Penalized with 15 basis functions
    "Risk": {"type": "bs", "monotonicity": "increasing"},  # Monotonic
}
| Parameter | Type | Default | Description |
|---|---|---|---|
| df | int | - | Fixed degrees of freedom (no penalty) |
| k | int | 10 | Basis size for penalized smooth |
| degree | int | 3 | Polynomial degree |
| monotonicity | str | - | "increasing" or "decreasing" |
Behavior:
- No df or k → penalized smooth with k=10, auto-tuned via GCV
- df=5 → fixed 5 degrees of freedom, no penalty
- k=15 → penalized smooth with 15 basis functions
- monotonicity → I-spline basis with coefficient constraints
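For readers unfamiliar with the basis itself, the sketch below evaluates a cubic B-spline basis with the standard Cox-de Boor recursion. It is a generic textbook construction, not RustyStats' implementation: the clamped knot vector is an arbitrary choice for illustration, and the library's knot placement, penalty, and I-spline handling are not shown.

```python
def bspline_basis(x, knots, degree):
    """Evaluate all B-spline basis functions at x (Cox-de Boor recursion)."""
    n_intervals = len(knots) - 1
    # Degree 0: indicator of the half-open knot interval containing x
    basis = [1.0 if knots[i] <= x < knots[i + 1] else 0.0
             for i in range(n_intervals)]
    if x == knots[-1]:
        # Close the last non-empty interval so the right boundary is covered
        for i in range(n_intervals - 1, -1, -1):
            if knots[i] < knots[i + 1]:
                basis[i] = 1.0
                break
    # Raise the degree one step at a time
    for d in range(1, degree + 1):
        raised = []
        for i in range(len(knots) - d - 1):
            left = right = 0.0
            if knots[i + d] != knots[i]:
                left = (x - knots[i]) / (knots[i + d] - knots[i]) * basis[i]
            if knots[i + d + 1] != knots[i + 1]:
                right = ((knots[i + d + 1] - x)
                         / (knots[i + d + 1] - knots[i + 1]) * basis[i + 1])
            raised.append(left + right)
        basis = raised
    return basis


# Clamped cubic knot vector on [0, 1] with one interior knot (illustrative)
knots = [0.0, 0.0, 0.0, 0.0, 0.5, 1.0, 1.0, 1.0, 1.0]
row = bspline_basis(0.25, knots, degree=3)
print(len(row), sum(row))
```

Each observation contributes one such row to the design matrix. With this knot vector there are five basis functions, and they sum to one at every point in [0, 1].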
ns (Natural spline)¶
Natural cubic spline with linear extrapolation beyond boundaries.
terms = {
    "Age": {"type": "ns"},  # Penalized smooth (default k=10)
    "Income": {"type": "ns", "df": 4},  # Fixed 4 df
}
| Parameter | Type | Default | Description |
|---|---|---|---|
| df | int | - | Fixed degrees of freedom |
| k | int | 10 | Basis size for penalized smooth |
target_encoding¶
Regularized target encoding for high-cardinality categoricals.
terms = {
    "Brand": {"type": "target_encoding"},
    "Model": {"type": "target_encoding", "prior_weight": 2.0},
}
| Parameter | Type | Default | Description |
|---|---|---|---|
| prior_weight | float | 1.0 | Regularization toward global mean |
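To make the role of prior_weight concrete, the sketch below uses one common shrinkage form, in which each level's mean is pulled toward the global mean as if prior_weight extra observations at the global mean had been added. This formula is an assumption for illustration. The encoder RustyStats actually uses is not specified in this reference and may differ, for example by using ordered statistics to limit target leakage.

```python
from collections import defaultdict


def shrunk_target_means(categories, targets, prior_weight=1.0):
    """Per-level target mean shrunk toward the global mean (assumed formula)."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, y in zip(categories, targets):
        sums[level] += y
        counts[level] += 1
    # Larger prior_weight pulls sparse levels more strongly to the global mean
    return {
        level: (sums[level] + prior_weight * global_mean)
        / (counts[level] + prior_weight)
        for level in counts
    }


brands = ["A", "A", "A", "B"]
claims = [1.0, 0.0, 1.0, 1.0]
encoding = shrunk_target_means(brands, claims, prior_weight=1.0)
print(encoding)
```

Level "B" has a single observation, so its encoded value moves further from its raw mean than level "A" does.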
expression¶
Arbitrary arithmetic expressions (like R's I()).
terms = {
    "Age2": {"type": "expression", "expr": "Age ** 2"},
    "BMI": {"type": "expression", "expr": "Weight / (Height ** 2)"},
    "LogDensity": {"type": "expression", "expr": "log(Density)"},
}
| Parameter | Type | Description |
|---|---|---|
| expr | str | Python expression using column names |
| monotonicity | str | "increasing" or "decreasing" (optional) |
Supported operations: +, -, *, /, **, log, exp, sqrt
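The sketch below shows what an expression term computes for each row, by evaluating the string against a namespace containing only the column values and the three supported functions. It is an illustration of the semantics only: `eval_expression` is a hypothetical helper, RustyStats' own parser is not shown, and Python's eval with stripped builtins is not a security sandbox.

```python
import math

# Functions listed under "Supported operations"
ALLOWED_FUNCS = {"log": math.log, "exp": math.exp, "sqrt": math.sqrt}


def eval_expression(expr, columns):
    """Evaluate an arithmetic expression row by row over named columns."""
    n_rows = len(next(iter(columns.values())))
    code = compile(expr, "<expr>", "eval")
    out = []
    for i in range(n_rows):
        # Column names resolve to this row's values
        row = {name: values[i] for name, values in columns.items()}
        out.append(eval(code, {"__builtins__": {}, **ALLOWED_FUNCS}, row))
    return out


columns = {"Weight": [70.0, 80.0], "Height": [1.75, 2.0], "Density": [1.0, 10.0]}
bmi = eval_expression("Weight / (Height ** 2)", columns)
log_density = eval_expression("log(Density)", columns)
print(bmi, log_density)
```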
Interactions¶
Interactions are specified as a list of dicts. Each interaction dict contains variable specifications plus control flags.
Standard Interactions¶
Product terms between variables.
interactions = [
    # Continuous × Continuous
    {
        "Age": {"type": "linear"},
        "VehPower": {"type": "linear"},
        "include_main": True,  # Adds Age + VehPower + Age:VehPower
    },
    # Categorical × Continuous
    {
        "Region": {"type": "categorical"},
        "Age": {"type": "bs", "df": 4},
        "include_main": True,  # Region-specific age curves
    },
    # Categorical × Categorical
    {
        "Region": {"type": "categorical"},
        "Area": {"type": "categorical"},
        "include_main": False,  # Interaction only
    },
]
| Parameter | Type | Default | Description |
|---|---|---|---|
| include_main | bool | True | Include main effects alongside interaction |
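A product term is the element-wise product of the interacting columns. The sketch below shows the continuous × continuous case and the categorical × continuous case, where each level gets its own copy of the continuous column. The helper names and the "level:name" column labels are hypothetical, and details such as which reference level the library drops are not covered here.

```python
def product_interaction(x, z):
    """Continuous × continuous: element-wise product of two columns."""
    return [a * b for a, b in zip(x, z)]


def categorical_by_continuous(categories, x, levels, name="x"):
    """Categorical × continuous: one column per level, x inside the level, 0 outside."""
    return {
        f"{level}:{name}": [xi if c == level else 0.0
                            for c, xi in zip(categories, x)]
        for level in levels
    }


age = [20.0, 40.0]
veh_power = [5.0, 7.0]
region = ["North", "South"]

age_by_power = product_interaction(age, veh_power)
age_by_region = categorical_by_continuous(region, age, ["North", "South"], name="Age")
print(age_by_power)
print(age_by_region)
```

With a spline term in place of the linear one, the same product would be taken with each basis column, which is what yields level-specific curves.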
Target Encoding Interactions¶
Combined target encoding for variable combinations: TE(Brand:Region).
interactions = [
    {
        "Brand": {"type": "categorical"},
        "Region": {"type": "categorical"},
        "target_encoding": True,
        "prior_weight": 1.0,  # Optional
    },
]
Creates a single encoded column for the brand×region combination, useful for high-cardinality interaction effects.
Frequency Encoding Interactions¶
Combined frequency encoding for variable combinations: FE(Brand:Region).
interactions = [
    {
        "Brand": {"type": "categorical"},
        "Region": {"type": "categorical"},
        "frequency_encoding": True,
    },
]
Encodes combinations by their frequency in the training data.
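As a rough illustration, the sketch below replaces each row's (Brand, Region) pair with the share of training rows that have the same pair. Whether RustyStats stores raw counts or proportions is not stated in this reference, so the use of proportions here is an assumption, and `frequency_encode` is a hypothetical helper.

```python
from collections import Counter


def frequency_encode(*columns):
    """Encode each row by the relative frequency of its value combination."""
    combos = list(zip(*columns))
    counts = Counter(combos)
    n_rows = len(combos)
    return [counts[combo] / n_rows for combo in combos]


brand = ["A", "A", "B", "A"]
region = ["N", "N", "N", "S"]
encoded = frequency_encode(brand, region)
print(encoded)  # [0.5, 0.5, 0.25, 0.25]
```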
Fitting¶
fit()¶
Fit the model with optional regularization.
result = model.fit() # Standard IRLS
# With CV-based regularization
result = model.fit(regularization="ridge") # "ridge", "lasso", "elastic_net"
result = model.fit(regularization="lasso", selection="1se", cv=5)
# With explicit alpha
result = model.fit(alpha=0.1, l1_ratio=0.0) # Ridge
result = model.fit(alpha=0.1, l1_ratio=1.0) # Lasso
| Parameter | Type | Default | Description |
|---|---|---|---|
| regularization | str | None | "ridge", "lasso", or "elastic_net" |
| selection | str | "min" | "min" or "1se" for CV selection |
| cv | int | 5 | Number of CV folds |
| alpha | float | 0.0 | Explicit regularization strength |
| l1_ratio | float | 1.0 | Elastic Net mixing (0=Ridge, 1=Lasso) |
| cv_seed | int | None | Seed for reproducible CV folds |
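The sketch below illustrates how alpha, l1_ratio, and selection are conventionally interpreted. The penalty uses the glmnet-style form with a factor of 1/2 on the squared term, and "1se" picks the strongest penalty whose mean CV error is within one standard error of the minimum. Both are standard conventions assumed here for illustration. This reference only fixes the endpoints (0=Ridge, 1=Lasso), so the library's exact scaling may differ.

```python
def elastic_net_penalty(coefs, alpha, l1_ratio):
    """Assumed glmnet-style penalty: alpha * (r*|b|_1 + (1-r)/2 * |b|_2^2)."""
    l1 = sum(abs(b) for b in coefs)
    l2 = sum(b * b for b in coefs)
    return alpha * (l1_ratio * l1 + 0.5 * (1.0 - l1_ratio) * l2)


def select_alpha(alphas, cv_mean, cv_se, selection="min"):
    """Pick alpha from a CV path using the "min" or "1se" rule."""
    best = min(range(len(alphas)), key=lambda i: cv_mean[i])
    if selection == "min":
        return alphas[best]
    # 1se: largest alpha whose error is within one SE of the minimum
    threshold = cv_mean[best] + cv_se[best]
    return max(a for a, m in zip(alphas, cv_mean) if m <= threshold)


alphas = [0.01, 0.1, 1.0]
cv_mean = [0.50, 0.52, 0.70]  # made-up CV errors for illustration
cv_se = [0.03, 0.03, 0.04]

alpha_min = select_alpha(alphas, cv_mean, cv_se, selection="min")
alpha_1se = select_alpha(alphas, cv_mean, cv_se, selection="1se")
print(alpha_min, alpha_1se)
```

The "1se" rule trades a small amount of CV error for a sparser or more heavily shrunk model.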
Complete Examples¶
Insurance Frequency Model¶
import rustystats as rs
import polars as pl

data = pl.read_parquet("insurance.parquet")

result = rs.glm_dict(
    response="ClaimCount",
    terms={
        "VehAge": {"type": "bs", "monotonicity": "increasing"},
        "DrivAge": {"type": "bs"},
        "BonusMalus": {"type": "linear", "monotonicity": "increasing"},
        "VehPower": {"type": "linear"},
        "Region": {"type": "categorical"},
        "Brand": {"type": "target_encoding"},
    },
    interactions=[
        {
            "VehAge": {"type": "linear"},
            "Region": {"type": "categorical"},
            "include_main": True,
        },
    ],
    data=data,
    family="poisson",
    offset="Exposure",
    seed=42,
).fit()

print(result.summary())
Regularized Model¶
result = rs.glm_dict(
    response="ClaimCount",
    terms={
        "Age": {"type": "linear"},
        "Income": {"type": "linear"},
        "Region": {"type": "categorical"},
    },
    data=data,
    family="poisson",
).fit(regularization="elastic_net", selection="1se")

print(f"Selected alpha: {result.alpha}")
print(f"Non-zero features: {result.n_nonzero()}")
High-Cardinality Features¶
result = rs.glm_dict(
    response="ClaimCount",
    terms={
        "Age": {"type": "bs"},
        "Brand": {"type": "target_encoding"},
        "Model": {"type": "target_encoding"},
        "ZipCode": {"type": "target_encoding", "prior_weight": 2.0},
    },
    interactions=[
        {
            "Brand": {"type": "categorical"},
            "Region": {"type": "categorical"},
            "target_encoding": True,  # TE(Brand:Region)
        },
    ],
    data=data,
    family="poisson",
    offset="Exposure",
).fit()
Validation¶
validate()¶
Check design matrix for issues before fitting.
model = rs.glm_dict(
    response="y",
    terms={"x": {"type": "ns", "df": 4}, "cat": {"type": "categorical"}},
    data=data,
    family="poisson",
)

results = model.validate()
if not results['valid']:
    print("Issues:", results['suggestions'])
Checks performed:
- Rank deficiency (linearly dependent columns)
- High multicollinearity (condition number)
- Zero variance columns
- NaN/Inf values
- Highly correlated column pairs (>0.999)
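Three of these checks are simple enough to sketch in plain Python: NaN/Inf values, zero variance, and highly correlated pairs. The code below is an illustration of what such checks look for, not the library's implementation. `check_columns` is a hypothetical helper, the rank and condition-number checks are omitted, and the returned dict only mirrors the 'valid' and 'suggestions' keys used in the example above.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def check_columns(columns, corr_threshold=0.999):
    """Flag NaN/Inf values, constant columns, and near-duplicate pairs."""
    suggestions = []
    usable = []
    for name, values in columns.items():
        if any(math.isnan(v) or math.isinf(v) for v in values):
            suggestions.append(f"{name}: contains NaN/Inf values")
        elif max(values) == min(values):
            suggestions.append(f"{name}: zero variance")
        else:
            usable.append(name)
    # Correlation is only defined for finite, non-constant columns
    for i, a in enumerate(usable):
        for b in usable[i + 1:]:
            if abs(pearson(columns[a], columns[b])) > corr_threshold:
                suggestions.append(f"{a} and {b}: highly correlated")
    return {"valid": not suggestions, "suggestions": suggestions}


columns = {
    "x": [1.0, 2.0, 3.0, 4.0],
    "x_copy": [2.0, 4.0, 6.0, 8.0],
    "const": [1.0, 1.0, 1.0, 1.0],
}
report = check_columns(columns)
print(report)
```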
Comparison: Dict API vs Formula API¶
| Feature | Dict API | Formula API |
|---|---|---|
| Programmatic building | ✓ Native | Requires string construction |
| Agent/automation friendly | ✓ Yes | String parsing |
| Complex interactions | ✓ Explicit | Limited syntax |
| TE interactions | ✓ Yes | Limited |
| FE interactions | ✓ Yes | No |
| Monotonicity constraints | ✓ All term types | Limited |
The Dict API is recommended for production systems and automated workflows.
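One practical consequence of dict-based specifications is that a model definition can be stored and exchanged as JSON. The sketch below round-trips a specification through the standard json module. Grouping response, family, terms, and interactions into a single `spec` dict is a hypothetical layout chosen for this example, not a RustyStats format.

```python
import json

# Hypothetical wrapper layout; the keys match glm_dict keyword arguments
spec = {
    "response": "ClaimCount",
    "family": "poisson",
    "terms": {
        "Age": {"type": "bs", "k": 15},
        "Region": {"type": "categorical"},
    },
    "interactions": [
        {
            "Brand": {"type": "categorical"},
            "Region": {"type": "categorical"},
            "target_encoding": True,
        },
    ],
}

text = json.dumps(spec, indent=2, sort_keys=True)  # e.g. commit to version control
restored = json.loads(text)
print(restored == spec)

# Assuming the keys match the documented signature, the restored dict could
# then be unpacked into the call: rs.glm_dict(data=data, **restored)
```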