JayDS22/Insurance_Modelling

Insurance Modelling — French MTPL Claim Severity

Predicts per-policyholder claim severity on the French Motor Third-Party Liability (freMTPL) insurance dataset and compares four regression approaches. Built as a hands-on walkthrough of the ML workflow applied to actuarial pricing.

Data

freMTPLfreq (policy-level exposure and claim count) joined with freMTPLsev (claim-level loss amounts), both shipped in archive.zip:

| File | Rows | Description |
| --- | --- | --- |
| freMTPLfreq.csv | ~678K policies | Policy features + Exposure, ClaimNb |
| freMTPLsev.csv | ~26K claim records | IDpol, ClaimAmount |

Target: total claim cost per policy after rolling severity up to policy level.
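
The roll-up and join described above might look like the following minimal sketch. The column names come from the two files; the tiny in-memory frames, the left join, and the zero fill for claim-free policies are illustrative assumptions, not necessarily what the notebook does.

```python
import pandas as pd

# Tiny stand-in frames (assumed values); the real data lives in
# freMTPLfreq.csv and freMTPLsev.csv.
freq = pd.DataFrame({
    "IDpol": [1, 2, 3],
    "Exposure": [1.0, 0.5, 0.25],
    "ClaimNb": [0, 2, 1],
})
sev = pd.DataFrame({
    "IDpol": [2, 2, 3],
    "ClaimAmount": [1000.0, 250.0, 400.0],
})

# Roll claim-level amounts up to one row per policy.
sev_by_policy = sev.groupby("IDpol", as_index=False)["ClaimAmount"].sum()

# A left join keeps policies without claims; their total cost becomes 0.
policies = freq.merge(sev_by_policy, on="IDpol", how="left")
policies["ClaimAmount"] = policies["ClaimAmount"].fillna(0.0)

# Derived feature from the pipeline: claims per unit of exposure.
policies["ClaimFreq"] = policies["ClaimNb"] / policies["Exposure"]
print(policies)
```

Aggregating before the join avoids duplicating policy rows when a policy has several claims.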

Pipeline

flowchart LR
    A[freMTPLfreq.csv<br/>freMTPLsev.csv] --> B[EDA<br/>distributions, pair plots]
    B --> C[Feature engineering<br/>ClaimFreq = ClaimNb / Exposure]
    C --> D[Train/test split]
    D --> E[Encode categoricals<br/>OneHot + Label]
    E --> F[Scale numerics<br/>MinMaxScaler]
    F --> G[Lasso L1 feature selection<br/>α=5e-5, SelectFromModel]
    G --> H{5-fold CV<br/>neg MAE}
    H --> RF[Random Forest<br/>tune n_estimators]
    H --> PG[Poisson GLM<br/>tune α]
    H --> TG[Tweedie GLM<br/>power=1.8, tune α]
    H --> XG[XGBoost<br/>tune n_estimators, lr=0.01]
    RF --> M[Fit on train<br/>predict on validation]
    PG --> M
    TG --> M
    XG --> M
    M --> R[Rank by MAE<br/>select best model]
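
The scale and select stages of the flowchart can be sketched roughly as below. The synthetic data and the `max_iter` setting are assumptions for illustration; only `MinMaxScaler`, `Lasso(alpha=5e-5)`, and `SelectFromModel` are taken from the pipeline itself.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumed): five numeric columns on different scales.
rng = np.random.default_rng(0)
X_train = rng.random((500, 5)) * np.array([1, 10, 100, 1, 1])
# Only the first two columns drive the target.
y_train = 3.0 * X_train[:, 0] + 0.2 * X_train[:, 1] + rng.normal(scale=0.05, size=500)

# Scale every column to [0, 1] so the L1 penalty treats them comparably.
scaler = MinMaxScaler()
X_train_scale = scaler.fit_transform(X_train)

# Fit Lasso and keep only columns whose coefficient is not shrunk to ~zero.
selector = SelectFromModel(Lasso(alpha=5e-5, max_iter=10000))
selector.fit(X_train_scale, y_train)
kept = selector.get_support()            # boolean mask over the columns
X_train_sel = selector.transform(X_train_scale)
print(kept, X_train_sel.shape)
```

With an alpha this small the penalty is weak, so only coefficients that are already very close to zero get dropped.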

Files

| File | Purpose |
| --- | --- |
| Insurance_Analytics_Modelling -Git.ipynb | End-to-end notebook (68 cells, 12 steps) |
| archive.zip | Compressed dataset (freMTPLfreq.csv + freMTPLsev.csv) |

Setup

The notebook was originally written against Kaggle's /kaggle/input/fremtpl-french-motor-tpl-insurance-claims/ paths. To run locally:

python -m venv .venv && source .venv/bin/activate
pip install numpy pandas matplotlib seaborn scikit-learn xgboost jupyter

unzip archive.zip -d data/
# update MTPL_filepath at the top of the notebook to point at ./data/
jupyter notebook "Insurance_Analytics_Modelling -Git.ipynb"

Workflow

The notebook is organised into 12 numbered steps:

  1. Imports: numpy, pandas, matplotlib, seaborn, sklearn, xgboost.
  2. EDA: distributions of Exposure, ClaimNb, ClaimAmount; bivariate plots.
  3. Added features: derive ClaimFreq = ClaimNb / Exposure; aggregate claim amounts to policy grain.
  4. Train/test split: sklearn.model_selection.train_test_split.
  5. Encoding: OneHotEncoder and LabelEncoder for categorical fields (Region, VehBrand, VehGas, etc.).
  6. Descriptive stats: seaborn pair plots for target vs. predictors.
  7. Feature selection: Lasso(alpha=5e-5).fit(X_train_scale, y_train) followed by SelectFromModel to drop features whose coefficients were shrunk to zero.
  8. Scaling: MinMaxScaler applied before regularised models.
  9. Cross-validated hyperparameter search: for each model, a scoring function runs 5-fold CV with neg_mean_absolute_error, and the hyperparameter value giving the lowest mean MAE is selected.
  10. Model training: fit each tuned model on the full training set.
  11. Model evaluation: compute MAE on the validation split and rank.
  12. Suggestions: limitations and next steps (see below).
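
The cross-validated sweep in step 9 could be sketched as follows. The helper name `cv_mae`, the synthetic data, and the candidate values are hypothetical; the 5-fold setup and the `neg_mean_absolute_error` scorer follow the description above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def cv_mae(model, X, y, folds=5):
    """Mean MAE across CV folds (hypothetical helper; sign flipped to positive)."""
    scores = cross_val_score(model, X, y, cv=folds,
                             scoring="neg_mean_absolute_error")
    return -scores.mean()

# Synthetic stand-in data (assumed).
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# 1-D sweep over a single hyperparameter, here n_estimators.
candidates = [10, 50]
results = {
    n: cv_mae(RandomForestRegressor(n_estimators=n, random_state=0), X, y)
    for n in candidates
}
best_n = min(results, key=results.get)
print(results, best_n)
```

scikit-learn scorers follow a "higher is better" convention, which is why the MAE comes back negated and is flipped before comparison.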

Models compared

| Model | Why this model | Hyperparameter swept |
| --- | --- | --- |
| Random Forest (RandomForestRegressor) | Non-linear baseline, handles mixed types | n_estimators |
| Poisson GLM (PoissonRegressor) | Natural fit for count-like, right-skewed targets | alpha (L2 strength) |
| Tweedie GLM (TweedieRegressor, power=1.8) | Compound Poisson-Gamma for zero-inflated claim costs | alpha (L2 strength) |
| XGBoost (XGBRegressor, learning_rate=0.01) | Strong tabular baseline; captures non-linear feature interactions | n_estimators |
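
As a rough sketch, the four estimators in the table can be instantiated as shown here. The values of `n_estimators`, `alpha`, and `max_iter` are placeholders rather than the tuned settings from the notebook, the data is synthetic, and XGBoost is included only if the package is installed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import PoissonRegressor, TweedieRegressor

# Placeholder hyperparameters (assumed); power=1.8 and learning_rate=0.01
# are the fixed settings named in the table.
models = {
    "RF": RandomForestRegressor(n_estimators=100, random_state=0),
    "PGLM": PoissonRegressor(alpha=1.0, max_iter=1000),
    "TGLM": TweedieRegressor(power=1.8, alpha=1.0, max_iter=1000),
}
try:
    from xgboost import XGBRegressor
    models["XGB"] = XGBRegressor(n_estimators=100, learning_rate=0.01)
except ImportError:
    pass  # xgboost is optional for this sketch

# Synthetic zero-inflated, non-negative target (assumed): roughly 80% zeros.
rng = np.random.default_rng(0)
X = rng.random((300, 3))
y = np.where(rng.random(300) < 0.8, 0.0, rng.gamma(2.0, 500.0, size=300))

for model in models.values():
    model.fit(X, y)
preds = {name: model.predict(X) for name, model in models.items()}
```

Both GLMs use a log link here, so their predictions are strictly positive even though most observed targets are zero.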

Evaluation

Models are ranked by Mean Absolute Error on the held-out validation split:

MAE_results = {'RF': MAE_RF, 'PGLM': MAE_PGLM, 'TGLM': MAE_TGLM, 'XGB': MAE_XGB}
best_model = min(MAE_results, key=MAE_results.get)

MAE was chosen over RMSE because the freMTPL target is heavily zero-inflated (most policies have no claims), and squared-error metrics over-penalise the few large losses in a way that risks overfitting them.
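
A small worked example, using made-up numbers, shows how a single large loss affects the two metrics differently when most policies have no claims:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up portfolio: nine policies without claims and one large loss.
y_true = np.array([0.0] * 9 + [10000.0])
# A naive model that predicts zero cost for everyone.
y_pred = np.zeros(10)

mae = mean_absolute_error(y_true, y_pred)           # 10000 / 10 = 1000
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # sqrt(1e8 / 10), about 3162.28
print(mae, rmse)
```

The one large claim contributes linearly to MAE but quadratically to the squared error, so it carries far more weight under RMSE.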

Limitations and next steps

Captured in the notebook's final markdown cell:

  1. Severity granularity — modelled total per-policy loss rather than per-claim severity. A two-step actuarial approach (frequency × average severity) or a Gamma GLM on non-zero claim amounts would be more aligned with standard practice.
  2. Single train/test cycle — no walk-forward or repeated re-training was applied; production-grade pricing should iterate.
  3. Categorical encoding — one-hot encoding on high-cardinality fields like Region blows up dimensionality; target/mean encoding or embeddings would be preferable.
  4. Hyperparameter optimisation — currently a 1-D sweep per parameter. A full GridSearchCV or Optuna study would explore parameter interactions.
  5. Feature selection — Lasso alpha=5e-5 was fixed; tuning the L1 strength alongside the model would balance bias and variance more rigorously.
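
Points 4 and 5 could be addressed together along the lines of this sketch, which tunes the Lasso strength jointly with a model hyperparameter. The pipeline step names, grid values, and synthetic data are all assumptions for illustration; nothing like this exists in the notebook yet.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumed): two informative columns, three noise columns.
rng = np.random.default_rng(0)
X = rng.random((300, 5))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

# Scaler and selector live inside the pipeline, so they are re-fit in each fold.
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("select", SelectFromModel(Lasso(max_iter=10000))),
    ("model", RandomForestRegressor(random_state=0)),
])

# Joint grid over the L1 strength and the forest size (illustrative values).
param_grid = {
    "select__estimator__alpha": [1e-3, 1e-2],
    "model__n_estimators": [20, 50],
}
search = GridSearchCV(pipe, param_grid, cv=5,
                      scoring="neg_mean_absolute_error")
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```

Keeping preprocessing inside the pipeline also prevents the validation folds from influencing the scaler or the selected features.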

About

This repository contains an ML project that uses linear regression models, GLMs, and other methodologies to perform an in-depth analysis of motor insurance data.
