Predicts per-policyholder claim severity on the French Motor Third-Party Liability (freMTPL) insurance dataset and compares four regression approaches. Built as a hands-on walkthrough of the ML workflow applied to actuarial pricing.
freMTPLfreq (policy-level exposure and claim count) joined with freMTPLsev (claim-level loss amounts), both shipped in archive.zip:
| File | Rows | Description |
|---|---|---|
| `freMTPLfreq.csv` | ~678K policies | Policy features + Exposure, ClaimNb |
| `freMTPLsev.csv` | ~26K claim records | IDpol, ClaimAmount |
Target: total claim cost per policy after rolling severity up to policy level.
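The roll-up described above can be sketched with pandas. This is a minimal illustration on made-up rows, not the notebook's code; it assumes both files share an `IDpol` key (the table above lists it only for the severity file), and the frame names `freq`, `sev` and `policies` are hypothetical.

```python
import pandas as pd

# Tiny hypothetical stand-ins for freMTPLfreq.csv and freMTPLsev.csv.
freq = pd.DataFrame({
    "IDpol": [1, 2, 3],
    "Exposure": [1.0, 0.5, 0.25],
    "ClaimNb": [0, 2, 1],
})
sev = pd.DataFrame({
    "IDpol": [2, 2, 3],
    "ClaimAmount": [1200.0, 300.0, 850.0],
})

# Roll claim-level amounts up to one row per policy.
sev_per_policy = sev.groupby("IDpol", as_index=False)["ClaimAmount"].sum()

# A left join keeps policies without claims; their total cost becomes 0.
policies = freq.merge(sev_per_policy, on="IDpol", how="left")
policies["ClaimAmount"] = policies["ClaimAmount"].fillna(0.0)

print(policies)
```

The left join matters here: an inner join would silently drop every claim-free policy and remove the zero-inflation that the rest of the workflow is built around.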

    flowchart LR
    A[freMTPLfreq.csv<br/>freMTPLsev.csv] --> B[EDA<br/>distributions, pair plots]
    B --> C[Feature engineering<br/>ClaimFreq = ClaimNb / Exposure]
    C --> D[Train/test split]
    D --> E[Encode categoricals<br/>OneHot + Label]
    E --> F[Scale numerics<br/>MinMaxScaler]
    F --> G[Lasso L1 feature selection<br/>α=5e-5, SelectFromModel]
    G --> H{5-fold CV<br/>neg MAE}
    H --> RF[Random Forest<br/>tune n_estimators]
    H --> PG[Poisson GLM<br/>tune α]
    H --> TG[Tweedie GLM<br/>power=1.8, tune α]
    H --> XG[XGBoost<br/>tune n_estimators, lr=0.01]
    RF --> M[Fit on train<br/>predict on validation]
    PG --> M
    TG --> M
    XG --> M
    M --> R[Rank by MAE<br/>select best model]

| File | Purpose |
|---|---|
| `Insurance_Analytics_Modelling -Git.ipynb` | End-to-end notebook (68 cells, 12 steps) |
| `archive.zip` | Compressed dataset (freMTPLfreq.csv + freMTPLsev.csv) |
The notebook was originally written against Kaggle's /kaggle/input/fremtpl-french-motor-tpl-insurance-claims/ paths. To run locally:

    python -m venv .venv && source .venv/bin/activate
    pip install numpy pandas matplotlib seaborn scikit-learn xgboost jupyter
    unzip archive.zip -d data/
    # update MTPL_filepath at the top of the notebook to point at ./data/
    jupyter notebook "Insurance_Analytics_Modelling -Git.ipynb"

The notebook is organised into 12 numbered steps:
- Imports: `numpy`, `pandas`, `matplotlib`, `seaborn`, `sklearn`, `xgboost`.
- EDA: distributions of `Exposure`, `ClaimNb`, `ClaimAmount`; bivariate plots.
- Added features: derive `ClaimFreq = ClaimNb / Exposure`; aggregate claim amounts to policy grain.
- Train/test split: `sklearn.model_selection.train_test_split`.
- Encoding: `OneHotEncoder` and `LabelEncoder` for categorical fields (Region, VehBrand, VehGas, etc.).
- Descriptive stats: `seaborn` pair plots for target vs. predictors.
- Feature selection: `Lasso(alpha=5e-5).fit(X_train_scale, y_train)` followed by `SelectFromModel` to drop features whose coefficients shrink to zero.
- Scaling: `MinMaxScaler` applied before regularised models.
- Cross-validated hyperparameter search: for each model, a scoring function runs 5-fold CV with `neg_mean_absolute_error`, and the hyperparameter value giving the lowest mean MAE is selected.
- Model training: fit each tuned model on the full training set.
- Model evaluation: compute MAE on the validation split and rank.
- Suggestions: limitations and next steps (see below).
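The scaling and feature-selection steps can be sketched as follows. This is an illustration on synthetic data rather than the notebook's code: the arrays are made up, and `alpha=0.05` is deliberately much larger than the notebook's `5e-5` so that the toy example visibly zeroes out the noise features.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic data (assumption): only the first two of six features drive the target.
X_train = rng.normal(size=(500, 6))
y_train = 3.0 * X_train[:, 0] - 2.0 * X_train[:, 1] + rng.normal(scale=0.1, size=500)

# Scale to [0, 1] first so the L1 penalty treats all features comparably.
scaler = MinMaxScaler()
X_train_scale = scaler.fit_transform(X_train)

# Fit the L1-penalised model; uninformative features get a zero coefficient.
lasso = Lasso(alpha=0.05).fit(X_train_scale, y_train)

# Keep only the features whose coefficient survived the penalty.
selector = SelectFromModel(lasso, prefit=True)
X_train_selected = selector.transform(X_train_scale)
kept = selector.get_support()

print("kept features:", np.flatnonzero(kept))
print("selected shape:", X_train_selected.shape)
```

On real data the same fitted `scaler` and `selector` would be reused to transform the validation split, so that no information from it leaks into the selection.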
| Model | Why this model | Hyperparameter swept |
|---|---|---|
| Random Forest (`RandomForestRegressor`) | Non-linear baseline, handles mixed types | `n_estimators` |
| Poisson GLM (`PoissonRegressor`) | Natural fit for count-like, right-skewed targets | `alpha` (L2 strength) |
| Tweedie GLM (`TweedieRegressor`, `power=1.8`) | Compound Poisson-Gamma for zero-inflated claim costs | `alpha` (L2 strength) |
| XGBoost (`XGBRegressor`, `learning_rate=0.01`) | Strong tabular baseline; captures non-linear feature interactions | `n_estimators` |
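A one-dimensional sweep of the kind listed in the table might look like the sketch below. The helper name `sweep`, the candidate values and the synthetic zero-inflated target are all assumptions made for illustration; the notebook's own scoring function may differ. XGBoost is left out to keep the example dependent on scikit-learn only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import TweedieRegressor
from sklearn.model_selection import cross_val_score


def sweep(make_model, values, X, y, cv=5):
    """Return (best_value, {value: mean CV MAE}) for a 1-D hyperparameter sweep."""
    mean_mae = {}
    for v in values:
        scores = cross_val_score(
            make_model(v), X, y, cv=cv, scoring="neg_mean_absolute_error"
        )
        mean_mae[v] = -scores.mean()  # scorer returns negative MAE; flip the sign
    best = min(mean_mae, key=mean_mae.get)
    return best, mean_mae


rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 4))
# Synthetic zero-inflated, non-negative target (assumption): ~20% of rows have a claim.
has_claim = rng.uniform(size=300) < 0.2
y = np.where(has_claim, rng.gamma(shape=2.0, scale=500.0 * (1 + X[:, 0])), 0.0)

best_alpha, glm_mae = sweep(
    lambda a: TweedieRegressor(power=1.8, alpha=a, max_iter=1000),
    [0.01, 0.1, 1.0], X, y,
)
best_n, rf_mae = sweep(
    lambda n: RandomForestRegressor(n_estimators=n, random_state=0),
    [25, 50], X, y,
)

print("Tweedie alpha:", best_alpha, glm_mae)
print("RF n_estimators:", best_n, rf_mae)
```

Passing a factory (`make_model`) rather than a fitted estimator keeps each candidate independent, so the same helper works for any of the four models.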
Models are ranked by Mean Absolute Error on the held-out validation split:

    MAE_results = {'RF': MAE_RF, 'PGLM': MAE_PGLM, 'TGLM': MAE_TGLM, 'XGB': MAE_XGB}
    best_model = min(MAE_results, key=MAE_results.get)

MAE was chosen over RMSE because the freMTPL target is heavily zero-inflated (most policies have no claims), and squared-error metrics weight the few large losses so heavily that model selection risks overfitting them.
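A small worked example may make the difference between the two metrics concrete. The five-policy target and the two constant predictors below are hypothetical; they only illustrate how MAE and RMSE can rank the same candidates differently on a zero-inflated target.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical toy target: four claim-free policies and one large loss.
y_true = np.array([0.0, 0.0, 0.0, 0.0, 1000.0])

candidates = {
    "always_zero": np.zeros(5),        # predicts the median of y_true
    "always_mean": np.full(5, 200.0),  # predicts the mean of y_true
}

mae = {k: mean_absolute_error(y_true, p) for k, p in candidates.items()}
rmse = {k: float(np.sqrt(mean_squared_error(y_true, p))) for k, p in candidates.items()}

# Same selection idiom as MAE_results above.
best_by_mae = min(mae, key=mae.get)     # "always_zero" (MAE 200 vs 320)
best_by_rmse = min(rmse, key=rmse.get)  # "always_mean" (RMSE 400 vs ~447.2)

print(mae, best_by_mae)
print(rmse, best_by_rmse)
```

RMSE is pulled toward the single large loss, while MAE is not. The flip side is that MAE can reward predictions close to the median, which is zero for this kind of target, so the choice of metric is a trade-off rather than a free win.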
Captured in the notebook's final markdown cell:
- Severity granularity: modelled total per-policy loss rather than per-claim severity. A two-step actuarial approach (frequency × average severity) or a Gamma GLM on non-zero claim amounts would be more aligned with standard practice.
- Single train/test cycle: no walk-forward or repeated re-training was applied; production-grade pricing should repeat the fit-and-evaluate cycle.
- Categorical encoding: one-hot encoding on high-cardinality fields like `Region` blows up dimensionality; target/mean encoding or embeddings would be preferable.
- Hyperparameter optimisation: currently a 1-D sweep per parameter. A full `GridSearchCV` or `Optuna` study would explore parameter interactions.
- Feature selection: Lasso `alpha=5e-5` was fixed; tuning the L1 strength alongside the model would balance bias and variance more rigorously.
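The last two suggestions could be combined by putting the selector and the model in one pipeline and searching both penalties jointly. The sketch below is not part of the notebook: the step names (`scale`, `select`, `model`), the grid values and the synthetic data are all assumptions, and only the Tweedie GLM is shown.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, TweedieRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 5))
# Synthetic zero-inflated target (assumption): ~20% of rows carry a claim.
has_claim = rng.uniform(size=300) < 0.2
y = np.where(has_claim, rng.gamma(shape=2.0, scale=500.0 * (1 + X[:, 0])), 0.0)

pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("select", SelectFromModel(Lasso(max_iter=10_000))),
    ("model", TweedieRegressor(power=1.8, max_iter=1000)),
])

# Search the L1 strength of the selector together with the GLM's L2 strength.
param_grid = {
    "select__estimator__alpha": [5e-5, 5e-3, 5e-1],
    "model__alpha": [0.01, 0.1, 1.0],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_absolute_error")
search.fit(X, y)

print(search.best_params_)
print("best CV MAE:", -search.best_score_)
```

Because scaling and selection sit inside the pipeline, they are re-fitted within each CV fold, which avoids selecting features on data that is later used for scoring.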