SynapseML version
1.0.11
System information
- Language version (e.g. python 3.8, scala 2.12): 2.12
- Spark Version (e.g. 3.2.3): 3.5.0
- Spark Platform (e.g. Synapse, Databricks): Databricks
Describe the problem
The current Spark-based LightGBM in SynapseML has several problems:
- AUC-PR is not available as a metric. With pandas, scikit-learn provides [average_precision_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html) to measure performance on imbalanced data. A previous issue asked for this, but the metric was never added. AUC-ROC (usually just called AUC) and AUC-PR are not the same: two models can have the same AUC-ROC while their AUC-PR differs, and AUC-PR reflects how well the model identifies the minority class.
- The validationIndicatorCol param does not seem to work. Even when I point it at a boolean column, the performance is almost the same as when it is not set.
- The training progress is not visible after setting verbosity=1, for example the loss at each iteration: iteration 1 loss... iteration 2 loss... iteration 3 loss...
- Most importantly, the model's performance is not comparable with the pandas version. My pandas and PySpark versions have the same or similar AUC-ROC but very different AUC-PR. In the results below, the SynapseML AUC-PR is roughly 2 to 3 times lower than the pandas version.
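To make the first point concrete, here is a minimal pure-Python sketch (not SynapseML or scikit-learn code) with made-up labels and scores. Two hypothetical models rank the same 2 positives among 8 negatives. The helper names `roc_auc` and `average_precision` are my own, and both assume there are no tied scores.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1 for p in pos for n in neg if p > n)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive (no ties assumed)."""
    ranked = sorted(zip(scores, labels), reverse=True)
    hits, precisions = 0, []
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)


# Made-up data: 2 positives followed by 8 negatives.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores_a = [1.0, 0.7, 0.9, 0.8, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]  # hypothetical model A
scores_b = [0.9, 0.8, 1.0, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]  # hypothetical model B

print(roc_auc(labels, scores_a), average_precision(labels, scores_a))
print(roc_auc(labels, scores_b), average_precision(labels, scores_b))
```

Both models reach an AUC-ROC of 0.875, but the average precision is 0.75 for model A and about 0.58 for model B, because A places one positive at the very top of the ranking.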
This library does not seem to have been maintained for a long time. I hope the support team can look into this problem, and I would highly appreciate your help! Below I include the PySpark and pandas versions of the code.
Code to reproduce issue
"""
-----------------------
training AUC: 0.9142080523534258
validation AUC: 0.9124695449945541
validation AUPRC: 0.17989272651979454
2024 Test AUC-ROC: 0.8760969857493492
2024 Test AUC-PR: 0.19127180999636964
2024 Test RMSE: 0.010704145654228314
2025 Test AUC-ROC: 0.8492785593472886
2025 Test AUC-PR: 0.10618822272549298
2025 Test RMSE: 0.01013130584232711
"""
from pyspark.sql.functions import lit
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from synapse.ml.lightgbm import LightGBMClassifier
import numpy as np
from pyspark.ml.evaluation import BinaryClassificationEvaluator, RegressionEvaluator
# Load the feature column names
feature_cols = np.load("xxx").tolist()
# Create VectorAssembler
lgbm_assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
lgbm = LightGBMClassifier(
    featuresCol="features",
    labelCol="conversion_flag",
    numIterations=500,
    learningRate=0.810127,
    numLeaves=50,
    lambdaL1=0.799914,
    lambdaL2=0.080473,
    maxDepth=8,
    minSumHessianInLeaf=13.061164,
    baggingFraction=0.686218,
    featureFraction=0.579906,
    objective="binary",
    metric="binary_logloss",
    isProvideTrainingMetric=True,
    validationIndicatorCol="val_col",  # boolean column: True marks validation rows
    # featuresShapCol="shap_values",
    earlyStoppingRound=20,
    # useBarrierExecutionMode=True,
    # dataTransferMode='streaming',
)
evaluator_auc = BinaryClassificationEvaluator(labelCol="conversion_flag", rawPredictionCol="rawPrediction", metricName="areaUnderROC")
evaluator_pr = BinaryClassificationEvaluator(labelCol="conversion_flag", rawPredictionCol="rawPrediction", metricName="areaUnderPR")
evaluator_rmse = RegressionEvaluator(labelCol="conversion_flag", predictionCol="prediction", metricName="rmse")
# lgbm.setPassThroughArgs("print_every_n_iterations=5")
pipeline = Pipeline(stages=[lgbm_assembler, lgbm])
model = pipeline.fit(train_val_data)
predictions_train = model.transform(sampled_train)
predictions = model.transform(val_data)
train_auc = evaluator_auc.evaluate(predictions_train)
val_auc = evaluator_auc.evaluate(predictions)
val_auprc = evaluator_pr.evaluate(predictions)
# Make predictions on val set
# avg_auc = model.avgMetrics[0]
print('-----------------------')
print('training AUC:', train_auc)
print('validation AUC:', val_auc)
print('validation AUPRC:', val_auprc)
# %%
# Evaluate model on test set 2024
test_predictions_2024 = model.transform(test_data_2024)
test_auc_roc_2024 = evaluator_auc.evaluate(test_predictions_2024)
test_auc_pr_2024 = evaluator_pr.evaluate(test_predictions_2024)
test_rmse_2024 = evaluator_rmse.evaluate(test_predictions_2024)
print(f"2024 Test AUC-ROC: {test_auc_roc_2024}")
print(f"2024 Test AUC-PR: {test_auc_pr_2024}")
print(f"2024 Test RMSE: {test_rmse_2024}")
# %%
# %% Evaluate model on test set 2025
test_predictions_2025 = model.transform(test_data_2025)
test_auc_roc_2025 = evaluator_auc.evaluate(test_predictions_2025)
test_auc_pr_2025 = evaluator_pr.evaluate(test_predictions_2025)
test_rmse_2025 = evaluator_rmse.evaluate(test_predictions_2025)
print(f"2025 Test AUC-ROC: {test_auc_roc_2025}")
print(f"2025 Test AUC-PR: {test_auc_pr_2025}")
print(f"2025 Test RMSE: {test_rmse_2025}")
"""
Training AUC: 0.999317
Validation AUC: 0.916813
Validation AUPRC: 0.477665
2024 Test AUC-ROC: 0.882917
2024 Test AUC-PR: 0.428740
2024 Test RMSE: 0.015985
2025 Test AUC-ROC: 0.838868
2025 Test AUC-PR: 0.311869
2025 Test RMSE: 0.014423
"""
import lightgbm as lgb
import pandas as pd
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, mean_squared_error
from sklearn.model_selection import train_test_split
# use the best parameters
best_params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'boosting_type': 'gbdt',
    'lambda_l1': 0.799914,
    'lambda_l2': 0.080473,
    'learning_rate': 0.810127,
    'max_depth': 8,
    'min_sum_hessian_in_leaf': 13.061164,
    'num_iterations': 167,
    'num_leaves': 50,
    'feature_fraction': 0.579906,
    'bagging_fraction': 0.686218,
    'verbose': -1,
    'random_state': 52725,
}
# load the feature list
feature_cols = np.load("xxx.npy").tolist()
# train
X_train_sampled = sampled_train[feature_cols]
y_train_sampled = sampled_train['conversion_flag']
# validation
X_val = val_data[feature_cols]
y_val = val_data['conversion_flag']
# create lgbm dataset
train_dataset = lgb.Dataset(X_train_sampled, label=y_train_sampled)
val_dataset = lgb.Dataset(X_val, label=y_val, reference=train_dataset)
# model training
model = lgb.train(
    best_params,
    train_dataset,
    valid_sets=[val_dataset],
    # valid_sets=[train_dataset, val_dataset],
    # valid_names=['train', 'val'],
    callbacks=[
        lgb.early_stopping(stopping_rounds=20),
        lgb.log_evaluation(period=5),
    ],
)
# prediction
train_predictions = model.predict(X_train_sampled, num_iteration=model.best_iteration)
val_predictions = model.predict(X_val, num_iteration=model.best_iteration)
# evaluate model performance
train_auc = roc_auc_score(y_train_sampled, train_predictions)
val_auc = roc_auc_score(y_val, val_predictions)
val_auprc = average_precision_score(y_val, val_predictions)
print('-----------------------')
print(f'Training AUC: {train_auc:.6f}')
print(f'Validation AUC: {val_auc:.6f}')
print(f'Validation AUPRC: {val_auprc:.6f}')
# evaluate test data - 2024
X_test_2024 = test_data_2024[feature_cols]
y_test_2024 = test_data_2024['conversion_flag']
test_predictions_2024 = model.predict(X_test_2024, num_iteration=model.best_iteration)
test_auc_roc_2024 = roc_auc_score(y_test_2024, test_predictions_2024)
test_auc_pr_2024 = average_precision_score(y_test_2024, test_predictions_2024)
test_rmse_2024 = np.sqrt(mean_squared_error(y_test_2024, test_predictions_2024))
print(f"2024 Test AUC-ROC: {test_auc_roc_2024:.6f}")
print(f"2024 Test AUC-PR: {test_auc_pr_2024:.6f}")
print(f"2024 Test RMSE: {test_rmse_2024:.6f}")
# evaluate test data - 2025
X_test_2025 = test_data_2025[feature_cols]
y_test_2025 = test_data_2025['conversion_flag']
test_predictions_2025 = model.predict(X_test_2025, num_iteration=model.best_iteration)
test_auc_roc_2025 = roc_auc_score(y_test_2025, test_predictions_2025)
test_auc_pr_2025 = average_precision_score(y_test_2025, test_predictions_2025)
test_rmse_2025 = np.sqrt(mean_squared_error(y_test_2025, test_predictions_2025))
print(f"2025 Test AUC-ROC: {test_auc_roc_2025:.6f}")
print(f"2025 Test AUC-PR: {test_auc_pr_2025:.6f}")
print(f"2025 Test RMSE: {test_rmse_2025:.6f}")
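One caveat when comparing the RMSE rows of the two runs: in the PySpark script the `RegressionEvaluator` reads the `prediction` column, which for a Spark ML classifier holds the predicted class label (0.0 or 1.0), while the pandas script passes the probabilities returned by `model.predict`. The sketch below uses made-up values and a hypothetical `rmse` helper to show that the two quantities differ even for identical model scores, so the RMSE numbers may not be directly comparable.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Made-up labels and predicted probabilities, for illustration only.
y_true = [0, 0, 0, 0, 1]
probs = [0.1, 0.2, 0.1, 0.3, 0.4]

# Hard labels at a 0.5 threshold, analogous to a classifier's label output.
hard_labels = [1.0 if p >= 0.5 else 0.0 for p in probs]

rmse_on_probs = rmse(y_true, probs)         # RMSE over probabilities
rmse_on_labels = rmse(y_true, hard_labels)  # RMSE over class labels

print(f"RMSE on probabilities: {rmse_on_probs:.4f}")
print(f"RMSE on hard labels:   {rmse_on_labels:.4f}")
```

With hard 0/1 labels the squared error of each row is either 0 or 1, so that RMSE is just the square root of the misclassification rate. For a like-for-like comparison with the pandas numbers, the positive-class value from the `probability` column is probably the closer analogue, assuming the default output column names are in use.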
Other info / logs
No response
What component(s) does this bug affect?
What language(s) does this bug affect?
What integration(s) does this bug affect?