Insurance Modelling — French MTPL Claim Severity

Predicts total claim cost per policy (claim severity rolled up to policy level) on the French Motor Third-Party Liability (freMTPL) insurance dataset and compares four regression approaches. Built as a hands-on walkthrough of the ML workflow applied to actuarial pricing.

Data

freMTPLfreq (policy-level exposure and claim count) joined with freMTPLsev (claim-level loss amounts), both shipped in archive.zip:

| File | Rows | Description |
| --- | --- | --- |
| `freMTPLfreq.csv` | ~678K policies | Policy features + `Exposure`, `ClaimNb` |
| `freMTPLsev.csv` | ~26K claim records | `IDpol`, `ClaimAmount` |

Target: total claim cost per policy after rolling severity up to policy level.
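
A minimal sketch of the roll-up and join described above, using tiny made-up frames in place of the real CSVs. It assumes both files share the `IDpol` key and that a policy with no matching claim rows has a loss of zero; the variable names (`freq`, `sev`, `policy_loss`, `data`) are illustrative and not taken from the notebook.

```python
import pandas as pd

# Toy stand-ins for freMTPLfreq.csv and freMTPLsev.csv (values are illustrative only).
freq = pd.DataFrame({
    "IDpol": [1, 2, 3],
    "Exposure": [1.0, 0.5, 0.25],
    "ClaimNb": [0, 2, 1],
})
sev = pd.DataFrame({
    "IDpol": [2, 2, 3],
    "ClaimAmount": [1000.0, 250.0, 400.0],
})

# Roll claim-level amounts up to one row per policy.
policy_loss = sev.groupby("IDpol", as_index=False)["ClaimAmount"].sum()

# A left join keeps claim-free policies; their missing loss is set to 0 (assumption).
data = freq.merge(policy_loss, on="IDpol", how="left")
data["ClaimAmount"] = data["ClaimAmount"].fillna(0.0)

# Derived feature named in the pipeline: claims per unit of exposure.
data["ClaimFreq"] = data["ClaimNb"] / data["Exposure"]
```

With the real files, the two `pd.DataFrame(...)` literals would be replaced by `pd.read_csv` calls on the extracted CSVs.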

Pipeline

flowchart LR
    A[freMTPLfreq.csv<br/>freMTPLsev.csv] --> B[EDA<br/>distributions, pair plots]
    B --> C[Feature engineering<br/>ClaimFreq = ClaimNb / Exposure]
    C --> D[Train/test split]
    D --> E[Encode categoricals<br/>OneHot + Label]
    E --> F[Scale numerics<br/>MinMaxScaler]
    F --> G[Lasso L1 feature selection<br/>α=5e-5, SelectFromModel]
    G --> H{5-fold CV<br/>neg MAE}
    H --> RF[Random Forest<br/>tune n_estimators]
    H --> PG[Poisson GLM<br/>tune α]
    H --> TG[Tweedie GLM<br/>power=1.8, tune α]
    H --> XG[XGBoost<br/>tune n_estimators, lr=0.01]
    RF --> M[Fit on train<br/>predict on validation]
    PG --> M
    TG --> M
    XG --> M
    M --> R[Rank by MAE<br/>select best model]
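
The middle of the flowchart (split, encode, scale, Lasso selection) might be sketched as below. This is not the notebook's code: it runs on synthetic data, uses a `ColumnTransformer` for convenience, and covers only the one-hot half of the encoding step. The column values, the toy target, `max_iter`, and all variable names other than `X_train_scale` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Synthetic stand-in data; column names follow freMTPL, values are made up.
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "DrivAge": rng.integers(18, 90, n),
    "VehAge": rng.integers(0, 20, n),
    "VehGas": rng.choice(["Diesel", "Regular"], n),
    "Region": rng.choice(["R11", "R24", "R82"], n),
})
y = 50.0 * X["VehAge"] + rng.gamma(2.0, 100.0, n)  # toy target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Encode categoricals and scale numerics; both are fitted on training rows only.
preprocess = ColumnTransformer(
    [
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["VehGas", "Region"]),
        ("num", MinMaxScaler(), ["DrivAge", "VehAge"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)
X_train_scale = preprocess.fit_transform(X_train)
X_test_scale = preprocess.transform(X_test)

# L1-penalised fit, then keep only features whose coefficients are not (near) zero.
selector = SelectFromModel(Lasso(alpha=5e-5, max_iter=10_000))
X_train_sel = selector.fit_transform(X_train_scale, y_train)
X_test_sel = selector.transform(X_test_scale)
```

Fitting the encoder, scaler and selector on the training rows and only transforming the test rows keeps information from the held-out data out of the preprocessing.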

Files

| File | Purpose |
| --- | --- |
| `Insurance_Analytics_Modelling -Git.ipynb` | End-to-end notebook (68 cells, 12 steps) |
| `archive.zip` | Compressed dataset (`freMTPLfreq.csv` + `freMTPLsev.csv`) |

Setup

The notebook was originally written against Kaggle's /kaggle/input/fremtpl-french-motor-tpl-insurance-claims/ paths. To run locally:

python -m venv .venv && source .venv/bin/activate
pip install numpy pandas matplotlib seaborn scikit-learn xgboost jupyter

unzip archive.zip -d data/
# update MTPL_filepath at the top of the notebook to point at ./data/
jupyter notebook "Insurance_Analytics_Modelling -Git.ipynb"
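
The path change mentioned in the comment above could look roughly like this in the notebook's first cell. It is a sketch under assumptions: that `MTPL_filepath` holds a folder path, and that the CSVs sit directly under `data/` after unzipping. `KAGGLE_DIR`, `LOCAL_DIR` and `resolve_data_dir` are hypothetical names introduced for illustration.

```python
from pathlib import Path

# Kaggle input folder from the original notebook, and the local folder used by `unzip`.
KAGGLE_DIR = Path("/kaggle/input/fremtpl-french-motor-tpl-insurance-claims")
LOCAL_DIR = Path("data")

def resolve_data_dir(kaggle_dir=KAGGLE_DIR, local_dir=LOCAL_DIR):
    """Return the Kaggle folder when it exists, otherwise the local folder."""
    return kaggle_dir if kaggle_dir.is_dir() else local_dir

# Assumption: MTPL_filepath is a directory and the CSV names match the archive contents.
MTPL_filepath = resolve_data_dir()
freq_path = MTPL_filepath / "freMTPLfreq.csv"
sev_path = MTPL_filepath / "freMTPLsev.csv"
```

The fallback lets the same cell run unchanged on Kaggle and locally.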

Workflow

The notebook is organised into 12 numbered steps:

1. Imports: numpy, pandas, matplotlib, seaborn, sklearn, xgboost.
2. EDA: distributions of Exposure, ClaimNb, ClaimAmount; bivariate plots.
3. Added features: derive ClaimFreq = ClaimNb / Exposure; aggregate claim amounts to policy grain.
4. Train/test split: sklearn.model_selection.train_test_split.
5. Encoding: OneHotEncoder and LabelEncoder for categorical fields (Region, VehBrand, VehGas, etc.).
6. Descriptive stats: seaborn pair plots for target vs. predictors.
7. Feature selection: Lasso(alpha=5e-5).fit(X_train_scale, y_train) followed by SelectFromModel to drop the features whose coefficients were shrunk to zero.
8. Scaling: MinMaxScaler applied before regularised models.
9. Cross-validated hyperparameter search: for each model, a scoring function runs 5-fold CV with neg_mean_absolute_error, and the hyperparameter value giving the lowest mean MAE is selected.
10. Model training: fit each tuned model on the full training set.
11. Model evaluation: compute MAE on the validation split and rank.
12. Suggestions: limitations and next steps (see below).
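
The cross-validated search in the list above can be sketched as a small scoring helper plus a one-dimensional sweep. The helper name `cv_mae`, the synthetic data, and the candidate grid are assumptions made for illustration; the notebook's own function and grid values are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def cv_mae(model, X, y, folds=5):
    """Mean MAE across folds (sign flipped back from sklearn's negated score)."""
    scores = cross_val_score(
        model, X, y, cv=folds, scoring="neg_mean_absolute_error"
    )
    return -scores.mean()

# Synthetic data standing in for the selected training features.
rng = np.random.default_rng(0)
X_train = rng.random((200, 4))
y_train = 100.0 * X_train[:, 0] + rng.gamma(2.0, 10.0, 200)

# Hypothetical candidate grid for n_estimators.
candidates = [10, 25, 50]
results = {
    n: cv_mae(RandomForestRegressor(n_estimators=n, random_state=0), X_train, y_train)
    for n in candidates
}

# Lowest mean MAE wins.
best_n = min(results, key=results.get)
```

scikit-learn reports MAE as a negated score so that higher is always better, which is why the helper flips the sign before comparing candidates.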

Models compared

| Model | Why this model | Hyperparameter swept |
| --- | --- | --- |
| Random Forest (`RandomForestRegressor`) | Non-linear baseline, handles mixed types | `n_estimators` |
| Poisson GLM (`PoissonRegressor`) | Natural fit for count-like, right-skewed targets | `alpha` (L2 strength) |
| Tweedie GLM (`TweedieRegressor`, `power=1.8`) | Compound Poisson-Gamma for zero-inflated claim costs | `alpha` (L2 strength) |
| XGBoost (`XGBRegressor`, `learning_rate=0.01`) | Strong tabular baseline; captures non-linear feature interactions | `n_estimators` |
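
As an illustration of the two GLM rows, the sketch below fits `PoissonRegressor` and `TweedieRegressor(power=1.8)` to a synthetic zero-inflated target. The data, the `alpha=1.0` value and `max_iter` are assumptions rather than the notebook's tuned settings, and the MAE here is computed in-sample purely to show the calls.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor, TweedieRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n = 500
X = rng.random((n, 3))

# Zero-inflated toy target: most rows are 0, the rest are Gamma-distributed losses.
has_claim = rng.random(n) < 0.1 + 0.2 * X[:, 0]
y = np.where(has_claim, rng.gamma(2.0, 500.0, n), 0.0)

# alpha is the L2 penalty strength swept in the table; 1.0 is a placeholder value.
models = {
    "PGLM": PoissonRegressor(alpha=1.0, max_iter=1000),
    "TGLM": TweedieRegressor(power=1.8, alpha=1.0, link="log", max_iter=1000),
}

mae = {}
for name, model in models.items():
    model.fit(X, y)            # both GLMs require a non-negative target
    pred = model.predict(X)    # the log link gives strictly positive predictions
    mae[name] = mean_absolute_error(y, pred)
```

Both estimators accept exact zeros in the target, which is what makes them usable on per-policy costs where most rows have no claim.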

Evaluation

Models are ranked by Mean Absolute Error on the held-out validation split:

MAE_results = {'RF': MAE_RF, 'PGLM': MAE_PGLM, 'TGLM': MAE_TGLM, 'XGB': MAE_XGB}
best_model = min(MAE_results, key=MAE_results.get)

MAE was chosen over RMSE because the freMTPL target is heavily zero-inflated (most policies have no claims) and right-skewed: squared-error metrics let the few large losses dominate the score, which risks tuning the models toward those outliers.
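
The effect of a single large loss on each metric can be seen with a small made-up portfolio. The numbers below are illustrative only and are not drawn from freMTPL.

```python
import numpy as np

# Toy portfolio: 98 claim-free policies, one ordinary claim, one very large claim.
y_true = np.array([0.0] * 98 + [1_000.0, 100_000.0])
y_pred = np.full_like(y_true, y_true.mean())  # constant prediction at the portfolio mean

errors = y_true - y_pred
mae = np.abs(errors).mean()
rmse = np.sqrt((errors ** 2).mean())

# Share of each metric's total that comes from the single largest loss.
share_abs = np.abs(errors)[-1] / np.abs(errors).sum()
share_sq = errors[-1] ** 2 / (errors ** 2).sum()
```

In this example the largest claim accounts for half of the total absolute error but about 99% of the total squared error, so an RMSE-driven search would be steered almost entirely by that one policy.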

Limitations and next steps

Captured in the notebook's final markdown cell:

- Severity granularity: the notebook models total per-policy loss rather than per-claim severity. A two-step actuarial approach (frequency × average severity) or a Gamma GLM on non-zero claim amounts would be more aligned with standard practice.
- Single train/test cycle: the models were evaluated on one train/test split only, with no walk-forward or repeated re-training; production-grade pricing should repeat the cycle over several splits or periods.
- Categorical encoding: one-hot encoding on high-cardinality fields like Region blows up dimensionality; target/mean encoding or embeddings would be preferable.
- Hyperparameter optimisation: currently a 1-D sweep over a single parameter per model. A full GridSearchCV or Optuna study would explore parameter interactions.
- Feature selection: the Lasso alpha=5e-5 was fixed; tuning the L1 strength alongside the model would balance bias and variance more rigorously.
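
The two-step approach named under severity granularity could be sketched as follows. This is a suggestion for future work on synthetic data, not something the notebook implements: the simulated portfolio, the `alpha=1e-4` penalty, and all variable names are assumptions.

```python
import numpy as np
from sklearn.linear_model import GammaRegressor, PoissonRegressor

# Simulated portfolio (all values made up).
rng = np.random.default_rng(0)
n = 2000
X = rng.random((n, 3))
exposure = rng.uniform(0.1, 1.0, n)
claim_nb = rng.poisson(exposure * np.exp(-2.0 + X[:, 0]))

# Total loss per policy: a sum of k Gamma(2, 500) claims is Gamma(2k, 500).
total_loss = np.where(
    claim_nb > 0,
    rng.gamma(2.0 * np.maximum(claim_nb, 1), 500.0),
    0.0,
)

# Step 1: claim frequency per unit of exposure, weighted by exposure.
freq_model = PoissonRegressor(alpha=1e-4, max_iter=1000)
freq_model.fit(X, claim_nb / exposure, sample_weight=exposure)

# Step 2: average severity, fitted only on policies with at least one claim
# and weighted by the number of claims behind each average.
mask = claim_nb > 0
sev_model = GammaRegressor(alpha=1e-4, max_iter=1000)
sev_model.fit(
    X[mask], total_loss[mask] / claim_nb[mask], sample_weight=claim_nb[mask]
)

# Pure premium per unit of exposure, then expected loss for each policy.
pure_premium = freq_model.predict(X) * sev_model.predict(X)
expected_loss = exposure * pure_premium
```

Splitting the problem this way keeps the zeros in the frequency model and lets the severity model see only strictly positive amounts, which is what a Gamma GLM requires.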
