
StackingEnsemble - Multi-Layer Stacking & Blending Library

A robust Python library for building multi-layer stacking and blending ensemble models, designed for regression tasks with full scikit-learn compatibility.

Python 3.8+ scikit-learn 1.0+ License: MIT

Overview

The StackingEnsemble class provides a complete framework for implementing two powerful ensemble strategies:

  • ✅ Stacking: uses K-fold cross-validation to generate out-of-fold predictions (reduces overfitting)
  • ✅ Blending: uses a hold-out validation set (faster training, better for large datasets)
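The difference between the two strategies can be sketched with plain scikit-learn primitives. This is a conceptual illustration of the two prediction-generation schemes, not the library's internal implementation:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = make_regression(n_samples=200, n_features=5, random_state=0)

# Stacking: out-of-fold predictions -- every training row is predicted
# by a model that never saw it during fitting.
oof_preds = cross_val_predict(Ridge(), X, y, cv=5)

# Blending: fit on one part, predict only on a reserved hold-out part.
X_fit, X_hold, y_fit, y_hold = train_test_split(X, y, test_size=0.2, random_state=0)
hold_preds = Ridge().fit(X_fit, y_fit).predict(X_hold)

print(oof_preds.shape)   # one prediction per training row: (200,)
print(hold_preds.shape)  # predictions only for the hold-out rows: (40,)
```

Stacking produces meta-features for the full training set at the cost of `n_folds` fits per model; blending fits each model once but trains the meta-model on fewer rows.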

Key features:

  • Unlimited number of layers with arbitrary models per layer
  • Feature passthrough (original features + predictions for each layer)
  • Full scikit-learn estimator compatibility
  • Comprehensive input validation and error handling
  • Model persistence (save/load)
  • Performance metrics per layer and model
  • Overfitting detection
  • Tree-style structure visualization
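Feature passthrough amounts to column-wise concatenation of the original features with each layer's predictions. A minimal NumPy illustration with toy arrays (not the library's internals):

```python
import numpy as np

# Toy data: 4 samples, 3 original features, predictions from 2 layer models
X = np.arange(12, dtype=float).reshape(4, 3)
layer_preds = np.ones((4, 2))

# Without passthrough, the next layer sees only the 2 prediction columns;
# with passthrough, originals and predictions are stacked side by side.
next_layer_input = np.hstack([X, layer_preds])
print(next_layer_input.shape)  # (4, 5)
```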

Installation

# Install requirements
pip install numpy pandas scikit-learn joblib

Class: StackingEnsemble

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| layers | list of lists | required | List of layers, each containing scikit-learn compatible models. Each model must implement fit() and predict(). |
| meta_model | estimator | required | Model that combines predictions from the final layer into the final output. |
| n_folds | int | 5 | Number of folds for K-fold cross-validation (stacking mode only). Minimum value: 2. |
| blending | bool | False | If True, uses hold-out blending instead of K-fold stacking. |
| blend_size | float | 0.2 | Proportion of training data reserved as the hold-out set (blending mode only). Must be between 0 and 1. |
| random_state | int | None | Seed for reproducible data splitting and training. |
| passthrough_features | bool | False | If True, original input features are concatenated with predictions at every layer. |

Attributes (after fitting):

| Attribute | Type | Description |
|---|---|---|
| fitted_layer_models_ | list | Fitted models stored for each layer |
| fitted_meta_model_ | estimator | Fitted meta-model instance |
| is_fitted_ | bool | Flag indicating whether the model has been trained |
| training_features_ | pd.Index | Column names from the training data |
| _version_ | str | Library version string |

Methods

__init__(self, layers, meta_model, **kwargs)

Initialize and validate ensemble parameters.

Raises:

  • ValueError: If invalid parameters are provided
  • ValueError: If models don't implement required methods

fit(self, X, y)

Fits the entire ensemble to training data.

Parameters:

  • X: pandas.DataFrame or numpy.ndarray - Feature matrix
  • y: pandas.Series, numpy.ndarray, or list - Target vector

Returns: self (fitted estimator instance)

Raises:

  • TypeError: For invalid input types
  • ValueError: If X and y have mismatched dimensions
  • RuntimeError: For errors during model training

predict(self, X)

Generate predictions using the fitted ensemble.

Parameters:

  • X: pandas.DataFrame or numpy.ndarray - Feature matrix for prediction

Returns: numpy.ndarray - Predicted values

Raises:

  • NotFittedError: If model hasn't been fitted
  • RuntimeError: For prediction errors

print_structure(self)

Prints ensemble structure in tree format showing layers, models, and non-default parameters.


save(self, path)

Save fitted model to disk using joblib serialization.

Parameters:

  • path: str - File path for saved model

load(cls, path) (classmethod)

Load previously saved model from disk.

Parameters:

  • path: str - Path to saved model file

Returns: StackingEnsemble - Loaded model instance
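Since persistence goes through joblib serialization, the round trip behaves like any joblib dump/load. A sketch with a plain scikit-learn estimator standing in for a fitted ensemble:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Fit a plain scikit-learn model as a stand-in for a fitted ensemble
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + 1
model = LinearRegression().fit(X, y)

# Round-trip through joblib, the same mechanism save()/load() rely on
path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(model, path)
restored = joblib.load(path)

print(np.allclose(restored.predict(X), model.predict(X)))  # True
```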


score(self, X, y, sample_weight=None)

Calculate R² score (scikit-learn compatible).

Returns: float - Coefficient of determination


get_layer_performance(self, X, y)

Calculate detailed performance metrics for every model in every layer.

Returns: Dict with metrics: r2, rmse, mae, mape, explained_variance
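The five metrics named above can all be reproduced with scikit-learn; the sketch below shows how each is computed on toy data (the library's exact dictionary layout may differ):

```python
import numpy as np
from sklearn.metrics import (
    explained_variance_score,
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.5])

metrics = {
    "r2": r2_score(y_true, y_pred),
    "rmse": np.sqrt(mean_squared_error(y_true, y_pred)),
    "mae": mean_absolute_error(y_true, y_pred),
    "mape": mean_absolute_percentage_error(y_true, y_pred),
    "explained_variance": explained_variance_score(y_true, y_pred),
}
print(metrics)
```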


check_for_overfitting(self, X_train, y_train, X_test, y_test, threshold=0.1)

Compare train vs test performance to detect overfitting.

Parameters:

  • threshold: float - R² drop threshold to flag overfitting

Returns: Dict with overfitting assessment


get_params(self, deep=True)

Get estimator parameters (scikit-learn compatible).


set_params(self, **params)

Set estimator parameters (scikit-learn compatible).


Example Usage

# NOTE: the import path below is illustrative; adjust it to wherever
# StackingEnsemble lives in your installation
from stacking_ensemble import StackingEnsemble

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression

# Generate sample data
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define ensemble structure
layer_1 = [
    LinearRegression(),
    RandomForestRegressor(n_estimators=100, max_depth=5),
    SVR(kernel='rbf', C=1.0)
]

layer_2 = [
    GradientBoostingRegressor(n_estimators=50),
    RandomForestRegressor(n_estimators=50)
]

meta_model = LinearRegression()

# Create ensemble
ensemble = StackingEnsemble(
    layers=[layer_1, layer_2],
    meta_model=meta_model,
    n_folds=5,
    blending=False,
    random_state=42,
    passthrough_features=False
)

# Train model
ensemble.fit(X_train, y_train)

# Make predictions
y_pred = ensemble.predict(X_test)

# Print structure
ensemble.print_structure()

# Get performance metrics
performance = ensemble.get_layer_performance(X_test, y_test)
print("\nTest R² Score:", ensemble.score(X_test, y_test))

# Check for overfitting
overfit_check = ensemble.check_for_overfitting(X_train, y_train, X_test, y_test)
print("\nOverfitting:", overfit_check["overfitting_detected"])

# Save model
ensemble.save("stacking_model.joblib")

# Load model
loaded_ensemble = StackingEnsemble.load("stacking_model.joblib")

Stacking vs Blending Comparison

| Feature | Stacking | Blending |
|---|---|---|
| Method | K-fold cross-validation | Hold-out split |
| Data usage | Uses all data for training | Reserves a portion as hold-out |
| Overfitting risk | Lower | Higher |
| Training speed | Slower (n_folds × training) | Faster |
| Bias | Lower | Slightly higher |
| Recommended for | Small/medium datasets | Large datasets, production |

Best Practices

  1. Layer Design:

    • Start with 2-3 layers maximum
    • Use diverse model types per layer
    • Avoid putting strong models only in early layers
  2. Meta Model Selection:

    • Simple linear models work best for meta-model
    • Avoid complex models for meta level (causes overfitting)
  3. Performance Tips:

    • Enabling passthrough_features=True often improves accuracy, since later layers can see the original features
    • For large datasets, use blending mode
    • Keep n_folds between 3 and 7 for stacking
  4. Troubleshooting:

    • If overfitting is detected: reduce model complexity, increase n_folds, or switch to blending
    • If performance is poor: add more diverse models or enable feature passthrough

API Reference

Full method documentation with type hints and detailed descriptions available in source code docstrings.

License

MIT License - See LICENSE file for details.

Repository

https://github.com/suraj5424/Stacking-library