[BUG] Cannot achieve same performance as pandas environment based LightGBM #2390

@stupidoge

Description

SynapseML version

1.0.11

System information

  • Language version (e.g. python 3.8, scala 2.12): 2.12
  • Spark Version (e.g. 3.2.3): 3.5.0
  • Spark Platform (e.g. Synapse, Databricks): Databricks

Describe the problem

The current Spark-based LightGBM has several problems:

  1. AUC-PR is not available as a metric. In the pandas environment, scikit-learn provides [average_precision_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html) for evaluating models on imbalanced data. I see there was a previous issue about this, but the metric still has not been added. AUC-ROC (usually just called AUC) and AUC-PR are not the same: even when two models have the same AUC-ROC, AUC-PR shows how well each one identifies the minority class.
  2. The validationIndicatorCol param does not seem to work. Even when I add a boolean column and set this parameter, the performance is almost the same as when it is not set.
  3. The training progress is not shown after setting verbosity=1, such as the loss at each iteration. For example: iteration 1 loss... iteration 2 loss... iteration 3 loss...
  4. **Most importantly**, the model's performance is not comparable with the pandas version. My pandas and PySpark versions have the same or similar AUC-ROC but different AUC-PR. The SynapseML AUC-PR is as much as 4 times worse than the pandas version.
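
A minimal sketch of why point 1 matters, using made-up rankings rather than the data from this issue. The helpers `ranking`, `roc_auc`, and `average_precision` are hypothetical names written in plain Python for illustration; they are not part of SynapseML or scikit-learn. The average precision here assumes all scores are distinct.

```python
# Toy illustration with made-up rankings: equal AUC-ROC, different AUC-PR.
# All helper names below are hypothetical and exist only for this example.

def ranking(n_rows, positive_ranks):
    """Build (labels, scores) for n_rows rows; rank 1 gets the highest score."""
    labels = [1 if rank in positive_ranks else 0 for rank in range(1, n_rows + 1)]
    scores = [float(n_rows - rank) for rank in range(1, n_rows + 1)]
    return labels, scores


def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    correct = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return correct / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision measured at each positive row (assumes distinct scores)."""
    ranked = sorted(zip(scores, labels), reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits


# 2 positives among 100 rows in both cases.
labels_a, scores_a = ranking(100, {1, 12})  # one positive at the very top
labels_b, scores_b = ranking(100, {6, 7})   # both positives behind 5 negatives

print("A:", roc_auc(labels_a, scores_a), average_precision(labels_a, scores_a))
print("B:", roc_auc(labels_b, scores_b), average_precision(labels_b, scores_b))
```

Both rankings get the same AUC-ROC (186/196, about 0.949), but the average precision is 7/12 (about 0.583) for A and 19/84 (about 0.226) for B. With so few positives, AUC-ROC barely reacts to where the positives sit near the top of the ranking, while AUC-PR does.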

This library does not seem to have been maintained for a long time, so I hope someone from the support team can look into this problem. I would highly appreciate your help! Below I have included both versions, PySpark and pandas.

Code to reproduce issue

"""
-----------------------
training AUC: 0.9142080523534258
validation AUC: 0.9124695449945541
validation AUPRC: 0.17989272651979454
2024 Test AUC-ROC: 0.8760969857493492
2024 Test AUC-PR: 0.19127180999636964
2024 Test RMSE: 0.010704145654228314
2025 Test AUC-ROC: 0.8492785593472886
2025 Test AUC-PR: 0.10618822272549298
2025 Test RMSE: 0.01013130584232711
"""

from pyspark.sql.functions import lit
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from synapse.ml.lightgbm import LightGBMClassifier
import numpy as np
from pyspark.ml.evaluation import BinaryClassificationEvaluator, RegressionEvaluator



# Load the feature column names
feature_cols = np.load("xxx").tolist()

# Create VectorAssembler
lgbm_assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
lgbm = LightGBMClassifier(
    featuresCol="features",
    labelCol="conversion_flag",
    numIterations=500,
    learningRate=0.810127,
    numLeaves=50,
    lambdaL1=0.799914,
    lambdaL2=0.080473,
    maxDepth=8,
    minSumHessianInLeaf=13.061164,
    baggingFraction=0.686218,
    featureFraction=0.579906,
    objective='binary',
    metric='binary_logloss',
    isProvideTrainingMetric=True,
    validationIndicatorCol='val_col',
    # featuresShapCol="shap_values",
    earlyStoppingRound=20,
    # useBarrierExecutionMode=True, 
    # dataTransferMode='streaming',
)

evaluator_auc = BinaryClassificationEvaluator(labelCol="conversion_flag", rawPredictionCol="rawPrediction", metricName="areaUnderROC")
evaluator_pr = BinaryClassificationEvaluator(labelCol="conversion_flag", rawPredictionCol="rawPrediction", metricName="areaUnderPR")
evaluator_rmse = RegressionEvaluator(labelCol="conversion_flag", predictionCol="prediction", metricName="rmse")

# lgbm.setPassThroughArgs("print_every_n_iterations=5")
pipeline = Pipeline(stages=[lgbm_assembler, lgbm])
model = pipeline.fit(train_val_data)
predictions_train = model.transform(sampled_train)
predictions = model.transform(val_data)

train_auc = evaluator_auc.evaluate(predictions_train)
val_auc = evaluator_auc.evaluate(predictions)
val_auprc = evaluator_pr.evaluate(predictions)  

# Make predictions on val set
# avg_auc = model.avgMetrics[0]
print('-----------------------')
print('training AUC:', train_auc)
print('validation AUC:', val_auc)
print('validation AUPRC:', val_auprc)

# %%
# Evaluate model on test set 2024
test_predictions_2024 = model.transform(test_data_2024)

test_auc_roc_2024 = evaluator_auc.evaluate(test_predictions_2024)
test_auc_pr_2024 = evaluator_pr.evaluate(test_predictions_2024)
test_rmse_2024 = evaluator_rmse.evaluate(test_predictions_2024)

print(f"2024 Test AUC-ROC: {test_auc_roc_2024}")
print(f"2024 Test AUC-PR: {test_auc_pr_2024}")
print(f"2024 Test RMSE: {test_rmse_2024}")
# %%

# %% Evaluate model on test set 2025
test_predictions_2025 = model.transform(test_data_2025)

test_auc_roc_2025 = evaluator_auc.evaluate(test_predictions_2025)
test_auc_pr_2025 = evaluator_pr.evaluate(test_predictions_2025)
test_rmse_2025 = evaluator_rmse.evaluate(test_predictions_2025)

print(f"2025 Test AUC-ROC: {test_auc_roc_2025}")
print(f"2025 Test AUC-PR: {test_auc_pr_2025}")
print(f"2025 Test RMSE: {test_rmse_2025}")

"""
Training AUC: 0.999317
Validation AUC: 0.916813
Validation AUPRC: 0.477665
2024 Test AUC-ROC: 0.882917
2024 Test AUC-PR: 0.428740
2024 Test RMSE: 0.015985
2025 Test AUC-ROC: 0.838868
2025 Test AUC-PR: 0.311869
2025 Test RMSE: 0.014423
"""
import lightgbm as lgb
import pandas as pd
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, mean_squared_error
from sklearn.model_selection import train_test_split

# use the best parameters
best_params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'boosting_type': 'gbdt',
    'lambda_l1': 0.799914,
    'lambda_l2': 0.080473,
    'learning_rate': 0.810127,
    'max_depth': 8,
    'min_sum_hessian_in_leaf': 13.061164,
    'num_iterations': 167,
    'num_leaves': 50,
    'feature_fraction': 0.579906,
    'bagging_fraction': 0.686218,
    'verbose': -1,
    'random_state': 52725
}



# load the feature list
feature_cols = np.load("xxx.npy").tolist()

# train
X_train_sampled = sampled_train[feature_cols]
y_train_sampled = sampled_train['conversion_flag']

# validation
X_val = val_data[feature_cols]
y_val = val_data['conversion_flag']

# create lgbm dataset
train_dataset = lgb.Dataset(X_train_sampled, label=y_train_sampled)
val_dataset = lgb.Dataset(X_val, label=y_val, reference=train_dataset)

# model training
model = lgb.train(
    best_params,
    train_dataset,
    valid_sets=val_dataset,
    # valid_sets=[train_dataset, val_dataset],
    # valid_names=['train', 'val'],
    callbacks=[
        lgb.early_stopping(stopping_rounds=20),
        lgb.log_evaluation(period=5)
    ]
)

# prediction
train_predictions = model.predict(X_train_sampled, num_iteration=model.best_iteration)
val_predictions = model.predict(X_val, num_iteration=model.best_iteration)

# evaluate model performance
train_auc = roc_auc_score(y_train_sampled, train_predictions)
val_auc = roc_auc_score(y_val, val_predictions)
val_auprc = average_precision_score(y_val, val_predictions)

print('-----------------------')
print(f'Training AUC: {train_auc:.6f}')
print(f'Validation AUC: {val_auc:.6f}')
print(f'Validation AUPRC: {val_auprc:.6f}')

# evaluate test data - 2024
X_test_2024 = test_data_2024[feature_cols]
y_test_2024 = test_data_2024['conversion_flag']
test_predictions_2024 = model.predict(X_test_2024, num_iteration=model.best_iteration)

test_auc_roc_2024 = roc_auc_score(y_test_2024, test_predictions_2024)
test_auc_pr_2024 = average_precision_score(y_test_2024, test_predictions_2024)
test_rmse_2024 = np.sqrt(mean_squared_error(y_test_2024, test_predictions_2024))

print(f"2024 Test AUC-ROC: {test_auc_roc_2024:.6f}")
print(f"2024 Test AUC-PR: {test_auc_pr_2024:.6f}")
print(f"2024 Test RMSE: {test_rmse_2024:.6f}")

# evaluate test data - 2025
X_test_2025 = test_data_2025[feature_cols]
y_test_2025 = test_data_2025['conversion_flag']
test_predictions_2025 = model.predict(X_test_2025, num_iteration=model.best_iteration)

test_auc_roc_2025 = roc_auc_score(y_test_2025, test_predictions_2025)
test_auc_pr_2025 = average_precision_score(y_test_2025, test_predictions_2025)
test_rmse_2025 = np.sqrt(mean_squared_error(y_test_2025, test_predictions_2025))

print(f"2025 Test AUC-ROC: {test_auc_roc_2025:.6f}")
print(f"2025 Test AUC-PR: {test_auc_pr_2025:.6f}")
print(f"2025 Test RMSE: {test_rmse_2025:.6f}")

Other info / logs

No response

What component(s) does this bug affect?

  • area/cognitive: Cognitive project
  • area/core: Core project
  • area/deep-learning: DeepLearning project
  • area/lightgbm: Lightgbm project
  • area/opencv: Opencv project
  • area/vw: VW project
  • area/website: Website
  • area/build: Project build system
  • area/notebooks: Samples under notebooks folder
  • area/docker: Docker usage
  • area/models: models related issue

What language(s) does this bug affect?

  • language/scala: Scala source code
  • language/python: Pyspark APIs
  • language/r: R APIs
  • language/csharp: .NET APIs
  • language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • integrations/synapse: Azure Synapse integrations
  • integrations/azureml: Azure ML integrations
  • integrations/databricks: Databricks integrations
