Target Encoding: The CatBoost Approach¶
This chapter explains how RustyStats handles high-cardinality categorical variables using CatBoost-style ordered target encoding. We'll build the intuition step-by-step, show why naive approaches fail, and explain why the CatBoost method works.
The Problem: High-Cardinality Categoricals¶
What's a High-Cardinality Categorical?¶
A categorical variable with many unique values:
| Variable | Example Values | Cardinality |
|---|---|---|
| Gender | Male, Female, Other | Low (3) |
| US State | CA, TX, NY, ... | Medium (50) |
| ZIP Code | 90210, 10001, ... | High (40,000+) |
| Brand | Toyota, Ford, ... | High (varies) |
| Customer ID | C001, C002, ... | Very High (millions) |
Why Are They Problematic?¶
One-hot encoding creates a column for each category:
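For instance, a minimal sketch using `pandas.get_dummies` (pandas is used here purely for illustration):

```python
import pandas as pd

# A handful of rows already needs one column per unique brand;
# with 10,000 distinct brands this becomes 10,000 columns.
df = pd.DataFrame({"brand": ["Toyota", "Ford", "Honda", "Toyota", "Ford", "Toyota"]})
one_hot = pd.get_dummies(df["brand"], prefix="brand")
print(one_hot.shape)  # (6, 3) -- one column per unique brand
```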
Problems:

- Dimensionality explosion: 10,000 brands = 10,000 new columns
- Sparse data: most of the new columns are 0 for any given row
- Rare categories: some brands appear only once, so no reliable coefficient can be estimated
- Overfitting: the model memorizes rare categories instead of learning patterns
Naive Solution: Mean Target Encoding¶
The Idea¶
Replace each category with the mean of the target for that category:
| Category | Target | Encoded |
|---|---|---|
| A | 1 | 0.67 (mean of A's targets: 1, 1, 0) |
| B | 0 | 0.33 (mean of B's targets: 0, 1, 0) |
| A | 1 | 0.67 |
| B | 1 | 0.33 |
| A | 0 | 0.67 |
| B | 0 | 0.33 |
This seems clever—we compress arbitrary categories into a single informative number!
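As a sketch, naive mean encoding for the table above is a one-liner with pandas (illustrative only):

```python
import pandas as pd

df = pd.DataFrame({"category": ["A", "B", "A", "B", "A", "B"],
                   "target":   [1,    0,   1,   1,   0,   0]})
# Each row receives the mean target of its category -- including its own target value
df["encoded"] = df.groupby("category")["target"].transform("mean")
print(df["encoded"].round(2).tolist())  # [0.67, 0.33, 0.67, 0.33, 0.67, 0.33]
```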
The Fatal Flaw: Target Leakage¶
Target leakage occurs when information about the target "leaks" into the features.
Consider a rare category that appears only once:
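Extending the pandas sketch with a singleton category (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({"category": ["A", "A", "B", "B", "RARE"],
                   "target":   [1,    0,   0,   1,   1]})
df["encoded"] = df.groupby("category")["target"].transform("mean")
print(df.tail(1))
#   category  target  encoded
# 4     RARE       1      1.0   <- the encoding is the row's own target
```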
The encoding for "RARE" is exactly its target value! The model can achieve perfect prediction on training data by memorizing these encodings—but it will fail on new data.
This is overfitting in disguise.
Even for categories with a few observations, the encoding contains substantial information about the specific target values in those rows.
The CatBoost Solution: Ordered Target Statistics¶
The Key Insight¶
Don't let an observation see its own target (or targets from "future" observations).
CatBoost's solution:

1. Randomly order all observations
2. For each observation, compute the encoding using only observations that came before it in the random order
Step-by-Step Example¶
Data (6 observations):

| Index | Category | Target |
|---|---|---|
| 0 | A | 1 |
| 1 | B | 0 |
| 2 | A | 1 |
| 3 | B | 0 |
| 4 | A | 0 |
| 5 | B | 1 |
Step 1: Random permutation order: [3, 0, 5, 2, 4, 1]
Step 2: Process in order, tracking running statistics:
| Process Order | Original Index | Category | Target | Sum_before | Count_before | Encoded |
|---|---|---|---|---|---|---|
| 1st | 3 | B | 0 | 0 | 0 | prior |
| 2nd | 0 | A | 1 | 0 | 0 | prior |
| 3rd | 5 | B | 1 | 0 | 1 | 0/1 = 0.0 |
| 4th | 2 | A | 1 | 1 | 1 | 1/1 = 1.0 |
| 5th | 4 | A | 0 | 2 | 2 | 2/2 = 1.0 |
| 6th | 1 | B | 0 | 1 | 2 | 1/2 = 0.5 |
The first observation of each category sees no data for that category—it gets the prior (global mean).
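The same running statistics can be reproduced with a short sketch (raw ordered statistics, no prior blending yet; here the prior is simply the global mean, 0.5):

```python
import numpy as np

categories = np.array(["A", "B", "A", "B", "A", "B"])
target = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0])
order = [3, 0, 5, 2, 4, 1]      # the random permutation from the example
prior = target.mean()           # 0.5

sums, counts, encoded = {}, {}, np.empty(len(target))
for i in order:
    c = categories[i]
    s, n = sums.get(c, 0.0), counts.get(c, 0)
    encoded[i] = prior if n == 0 else s / n   # only "past" observations contribute
    sums[c] = s + target[i]
    counts[c] = n + 1

print(encoded.tolist())  # [0.5, 0.5, 1.0, 0.5, 1.0, 0.0]
```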
Why This Prevents Leakage¶
- Each observation is encoded using only data it hasn't "seen"
- An observation's own target is never used in its encoding
- Rare categories get the prior (global mean) as their first encoding—no memorization!
Adding Regularization: The Prior¶
The Problem with Raw Statistics¶
What if a category has only 1 observation before the current one?
This is very noisy! One observation doesn't give reliable information.
The Solution: Blend with Prior¶
Add a "pseudo-count" that pulls the estimate toward the global mean:
Where: - prior = global mean of target (e.g., 0.5) - prior_weight = regularization strength (e.g., 1.0)
Example with prior_weight = 1.0¶
If prior = 0.5, prior_weight = 1.0, and we've seen 1 observation with target = 1:

encoded = (1 + 0.5 × 1.0) / (1 + 1.0) = 1.5 / 2 = 0.75

Instead of the raw 1.0, we get 0.75, pulled toward the global mean.
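The same calculation as a tiny helper (names are illustrative, not part of the RustyStats API):

```python
def blended_estimate(sum_before, count_before, prior, prior_weight=1.0):
    # prior_weight acts as pseudo-counts that pull the estimate toward the prior
    return (sum_before + prior * prior_weight) / (count_before + prior_weight)

print(blended_estimate(1, 1, prior=0.5))  # 0.75 instead of the raw 1.0
```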
Interpretation of Prior Weight¶
| prior_weight | Effect |
|---|---|
| 0 | No regularization (raw statistics) |
| 1 | Balance: 1 observation ≈ 1 "prior observation" |
| 10 | Heavy regularization: need 10+ observations to overcome prior |
| 100 | Very heavy: rare categories essentially get the prior |
Rule of thumb: Start with prior_weight = 1.0. Increase if overfitting on rare categories.
Multiple Permutations: Reducing Variance¶
The Problem¶
The encoding depends on the random permutation order:
- Permutation [0,1,2,3,4,5] gives different encodings than [5,4,3,2,1,0]
- This adds noise to the features
The Solution¶
Average encodings across multiple random permutations:
encoded = average([
encode_with_permutation(seed=1),
encode_with_permutation(seed=2),
encode_with_permutation(seed=3),
encode_with_permutation(seed=4),
])
How Many Permutations?¶
| n_permutations | Variance | Speed |
|---|---|---|
| 1 | High | Fast |
| 4 (default) | Medium | Good balance |
| 10 | Low | Slower |
| 100 | Very low | Much slower |
Rule of thumb: 4 permutations is usually sufficient. Increase if you see high variance in cross-validation.
The Complete Algorithm¶
Algorithm: CatBoost-style Ordered Target Encoding
Input:
- categories[n]: categorical values
- target[n]: target variable
- prior_weight: regularization strength (default: 1.0)
- n_permutations: number of random orderings (default: 4)
Output:
- encoded[n]: encoded values
1. Compute prior = mean(target)
2. For each permutation p = 1 to n_permutations:
a. Generate random permutation order π
b. Initialize: sum_by_category = {}, count_by_category = {}
c. For i in permutation order π:
category = categories[i]
sum_before = sum_by_category.get(category, 0)
count_before = count_by_category.get(category, 0)
encoded_p[i] = (sum_before + prior × prior_weight) / (count_before + prior_weight)
sum_by_category[category] = sum_before + target[i]
count_by_category[category] = count_before + 1
3. Return the average across permutations: encoded = mean(encoded_1, ..., encoded_{n_permutations})
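As a concrete illustration, here is a minimal NumPy sketch of this algorithm. It is not the RustyStats implementation; function and variable names are ours:

```python
import numpy as np

def ordered_target_encode(categories, target, prior_weight=1.0, n_permutations=4, seed=42):
    """Minimal sketch of CatBoost-style ordered target encoding (training time)."""
    categories = np.asarray(categories)
    target = np.asarray(target, dtype=float)
    prior = target.mean()
    rng = np.random.default_rng(seed)
    encodings = np.zeros((n_permutations, len(target)))

    for p in range(n_permutations):
        order = rng.permutation(len(target))
        sums, counts = {}, {}
        for i in order:
            c = categories[i]
            s, n = sums.get(c, 0.0), counts.get(c, 0)
            # Only "past" observations contribute; the prior acts as pseudo-counts
            encodings[p, i] = (s + prior * prior_weight) / (n + prior_weight)
            sums[c] = s + target[i]
            counts[c] = n + 1

    # Averaging across permutations reduces the variance from any single ordering
    return encodings.mean(axis=0), prior

encoded, prior = ordered_target_encode(["A", "B", "A", "B", "A", "B"], [1, 0, 1, 0, 0, 1])
print(encoded, prior)
```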
Applying to New Data¶
For prediction on new data, we don't need ordering. Each category is encoded with the full training statistics:

encoded = (sum_category + prior × prior_weight) / (count_category + prior_weight)

For unseen categories (not present in training), sum and count are zero, so the encoding reduces to the prior:

encoded = prior
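A minimal sketch of this prediction-time step, assuming per-category sums and counts saved from training (names are illustrative):

```python
def apply_encoding(new_categories, sums, counts, prior, prior_weight=1.0):
    # Full training statistics; unseen categories fall back to the prior
    return [
        (sums.get(c, 0.0) + prior * prior_weight) / (counts.get(c, 0) + prior_weight)
        for c in new_categories
    ]
```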
RustyStats Implementation¶
Basic Usage¶
import rustystats as rs
import numpy as np
# Training data
categories = ["Toyota", "Ford", "Toyota", "Honda", "Ford", "Toyota"]
target = np.array([1.0, 0.0, 1.0, 0.5, 0.0, 0.8])
# Encode
encoded, name, prior, stats = rs.target_encode(
categories,
target,
var_name="brand",
prior_weight=1.0,
n_permutations=4,
seed=42 # For reproducibility
)
print(f"Encoded values: {encoded}")
print(f"Column name: {name}") # "TE(brand)"
print(f"Prior (global mean): {prior:.3f}")
For Prediction¶
# New data (including unseen category)
new_categories = ["Toyota", "Ford", "BMW"] # BMW not in training
# Apply encoding using training statistics
new_encoded = rs.apply_target_encoding(
new_categories,
stats, # From training
prior, # From training
prior_weight=1.0
)
print(new_encoded)
# Toyota: uses training stats
# Ford: uses training stats
# BMW: gets prior (unseen category)
Scikit-learn Style API¶
# Fit-transform pattern
encoder = rs.TargetEncoder(prior_weight=1.0, n_permutations=4, seed=42)
# Training: uses ordered statistics
train_encoded = encoder.fit_transform(train_categories, train_target)
# Prediction: uses full training statistics
test_encoded = encoder.transform(test_categories)
In Formulas¶
# TE() in formulas automatically applies target encoding
result = rs.glm(
"claims ~ TE(brand) + TE(region) + age + C(gender)",
data,
family="poisson"
).fit()
When to Use Target Encoding¶
Good Use Cases¶
| Scenario | Why Target Encoding Helps |
|---|---|
| High-cardinality categoricals (100+ levels) | Avoids dimensionality explosion |
| Rare categories | Prior regularization prevents overfitting |
| Models without native categorical handling (e.g., GLMs) | Compresses each category to a single numeric feature |
| Memory constraints | One column instead of thousands |
When NOT to Use¶
| Scenario | Better Alternative |
|---|---|
| Low-cardinality (< 10 levels) | One-hot encoding or C() |
| Very few observations | Just use prior (or drop variable) |
| Categories have no relationship to target | Drop the variable |
| Interpretability is critical | One-hot (coefficients are clearer) |
Comparison with Alternatives¶
| Method | Handles High Cardinality | Prevents Leakage | Regularization | Interpretable |
|---|---|---|---|---|
| One-hot encoding | ❌ | ✅ | ❌ | ✅ |
| Label encoding | ✅ | ✅ | ❌ | ❌ |
| Naive mean encoding | ✅ | ❌ | ❌ | ⚠️ |
| CatBoost target encoding | ✅ | ✅ | ✅ | ⚠️ |
| Leave-one-out encoding | ✅ | ⚠️ | ❌ | ⚠️ |
CatBoost-style is the only method that handles all three challenges: high cardinality, target leakage, and rare categories.
Summary¶
| Concept | Description |
|---|---|
| Target leakage | When features contain information about the target they shouldn't have |
| Ordered statistics | Encode each observation using only "past" data in random order |
| Prior | Global mean used for regularization and unseen categories |
| Prior weight | Controls regularization strength (higher = more pull toward prior) |
| Multiple permutations | Average across random orderings to reduce variance |
The CatBoost approach is elegant: by simply changing the order of computation and adding a prior, we solve both overfitting and high-cardinality problems in one technique.
References¶
- Prokhorenkova et al. (2018). CatBoost: unbiased boosting with categorical features. NeurIPS.
- Micci-Barreca (2001). A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. ACM SIGKDD Explorations Newsletter.
Next Steps¶
- Formula API — TE() syntax in formulas
- Code Walkthrough — Implementation details