Performance Guide¶

This guide covers performance optimization, profiling, and benchmarking for RustyStats.

Performance Characteristics¶

Complexity Analysis¶

Operation	Complexity	Notes
IRLS iteration	O(np² + p³)	X'WX computation + solve
Single coordinate descent step	O(np)	Update one coefficient
Full CD iteration (all p)	O(np²)	With Gram matrix
Spline basis evaluation	O(n × df × degree)	Per-observation
Target encoding	O(n × n_perms)	With permutations

Typical Performance¶

Dataset	Features	Family	Time (release)
10K rows	20	Poisson	~20ms
100K rows	50	Poisson	~200ms
500K rows	50	Poisson	~800ms
1M rows	100	Gaussian	~2s

Build Optimization¶

Always Use Release Mode¶

# Development (fast compile, slow run)
maturin develop

# Release (slow compile, fast run)
maturin develop --release

Release mode enables: - Optimizations (-O3) - LTO (Link-Time Optimization) - Inlining - SIMD auto-vectorization

Cargo Profile¶

# Cargo.toml
[profile.release]
lto = true
codegen-units = 1
opt-level = 3

Parallelization¶

Rayon Configuration¶

Rayon automatically uses all available cores. Control with:

// Limit threads
rayon::ThreadPoolBuilder::new()
    .num_threads(4)
    .build_global()
    .unwrap();

Or environment variable:

RAYON_NUM_THREADS=4 python script.py

Parallel Patterns¶

Parallel Map¶

let results: Vec<_> = data.par_iter()
    .map(|x| expensive_compute(x))
    .collect();

Parallel Fold-Reduce¶

let sum = (0..n).into_par_iter()
    .fold(|| 0.0, |acc, i| acc + data[i])
    .reduce(|| 0.0, |a, b| a + b);

When to Parallelize¶

Good: Independent computations, large n
Bad: Small n (overhead dominates), sequential dependencies

Rule of thumb: Parallelize when n > 1000.

Memory Optimization¶

Avoid Unnecessary Copies¶

// Bad: clones array
fn process(data: Array1<f64>) { ... }

// Good: borrows
fn process(data: &Array1<f64>) { ... }

// Good: moves ownership when needed
fn consume(data: Array1<f64>) -> Array1<f64> { ... }

Pre-allocate Buffers¶

// Bad: allocates in loop
for iter in 0..max_iter {
    let buffer = Array1::zeros(n);
    // ...
}

// Good: allocate once
let mut buffer = Array1::zeros(n);
for iter in 0..max_iter {
    buffer.fill(0.0);
    // ...
}

Use Views¶

// No copy - just a view
let column = matrix.column(j);
let slice = array.slice(s![10..20]);

Memory Layout¶

ndarray uses row-major by default. Access patterns matter:

// Good: row-wise access (contiguous)
for row in matrix.rows() {
    for &val in row.iter() { ... }
}

// Bad: column-wise access (strided)
for col in matrix.columns() {
    for &val in col.iter() { ... }
}

Profiling¶

Rust Profiling¶

Using perf (Linux)¶

# Build with debug symbols
cargo build --release

# Record profile
perf record -g ./target/release/benchmark

# View report
perf report

Using flamegraph¶

# Install
cargo install flamegraph

# Generate
cargo flamegraph --bin benchmark

Using cargo-instruments (macOS)¶

cargo install cargo-instruments
cargo instruments -t "Time Profiler" --release

Python Profiling¶

cProfile¶

import cProfile
import pstats

with cProfile.Profile() as pr:
    result = rs.fit_glm(y, X, family="poisson")

stats = pstats.Stats(pr)
stats.sort_stats('cumulative')
stats.print_stats(10)

line_profiler¶

# Install: pip install line_profiler

@profile
def benchmark():
    result = rs.fit_glm(y, X, family="poisson")
    return result

# Run: kernprof -l -v script.py

Benchmarking¶

Rust Benchmarks¶

// benches/irls.rs
use criterion::{criterion_group, criterion_main, Criterion};

fn bench_irls(c: &mut Criterion) {
    let (y, x) = generate_data(10000, 20);

    c.bench_function("irls_10k_20", |b| {
        b.iter(|| {
            fit_glm(&y, &x, &PoissonFamily, &LogLink, &IRLSConfig::default())
        })
    });
}

criterion_group!(benches, bench_irls);
criterion_main!(benches);

Run with:

cargo bench

Python Benchmarks¶

import time
import numpy as np
import rustystats as rs

def benchmark(n, p, family="poisson", n_runs=10):
    """Benchmark GLM fitting."""
    y = np.random.poisson(5, n).astype(float)
    X = np.column_stack([np.ones(n), np.random.randn(n, p)])

    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        result = rs.fit_glm(y, X, family=family)
        times.append(time.perf_counter() - start)

    print(f"n={n}, p={p}, family={family}")
    print(f"  Mean: {np.mean(times)*1000:.1f}ms")
    print(f"  Std:  {np.std(times)*1000:.1f}ms")
    print(f"  Iterations: {result.iterations}")

# Run benchmarks
for n in [10000, 100000, 500000]:
    benchmark(n, 20)

Common Bottlenecks¶

1. X'WX Computation¶

The Gram matrix computation is often the bottleneck:

// O(np²) - parallelized
let xtwx = compute_gram_matrix(x, w);

Optimization: Use flat Vec instead of Array2 for better cache performance.

2. Matrix Solve¶

Solving (X'WX)β = X'Wz is O(p³):

let beta = cholesky_solve(&xtwx, &xtwz)?;

Optimization: For very large p, consider iterative solvers.

3. Memory Allocation¶

Allocations in hot loops hurt performance:

// Profile with: MALLOC_CONF=prof:true

// Look for patterns like:
for iter in 0..max_iter {
    let temp = compute();  // Allocation every iteration
}

4. Cache Misses¶

Random memory access patterns cause cache misses:

// Bad: random access
for &i in random_indices.iter() {
    result += data[i];
}

// Good: sequential access
for &val in data.iter() {
    result += val;
}

Optimization Checklist¶

[ ] Build with --release
[ ] Enable LTO in Cargo.toml
[ ] Parallelize with Rayon where appropriate
[ ] Pre-allocate buffers outside loops
[ ] Use views instead of copies
[ ] Prefer row-major access patterns
[ ] Profile to find actual bottlenecks
[ ] Benchmark before and after changes

Comparison with Other Libraries¶

vs Statsmodels¶

Operation	RustyStats	Statsmodels
Poisson 100K × 20	~100ms	~500ms
Lasso GLM 100K × 50	~500ms	N/A

vs glmnet (R)¶

Operation	RustyStats	glmnet
Lasso path 100K × 100	~2s	~1.5s

glmnet is highly optimized; RustyStats is competitive.

Why RustyStats is Fast¶

Rust: No GC, predictable performance
Rayon: Automatic parallelization
SIMD: Auto-vectorization by LLVM
Cache-friendly: Careful memory layout
Zero-copy: NumPy interop where possible