[BUG] Cannot achieve the same performance as LightGBM in a pandas environment #2390

### SynapseML version

1.0.11

### System information

- **Language version** (e.g. python 3.8, scala 2.12): 2.12
- **Spark Version** (e.g. 3.2.3): 3.5.0
- **Spark Platform** (e.g. Synapse, Databricks): Databricks



### Describe the problem

The current Spark-based LightGBM has several problems:
1. We don't have AUC-PR as a metric. In the pandas ecosystem, [`average_precision_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html) is available for measuring performance on imbalanced data. I see a previous issue about this, but the metric still has not been added. AUC-ROC (usually just called AUC) and AUC-PR are not the same: even when two models have the same AUC-ROC, AUC-PR shows how well each one ranks the minority class.
2. The `validationIndicatorCol` param does not seem to work. Even when the column is a boolean and the param is set, the performance is almost the same as when it is not set.
3. The training progress, such as the loss at each iteration, is not shown after setting `verbosity=1`. For example: iteration 1 loss... iteration 2 loss... iteration 3 loss...
4. **Most importantly**, the model's performance is not comparable with the pandas version. My pandas and PySpark versions have the same or similar AUC-ROC but very different AUC-PR. In the results below, the SynapseML AUC-PR is roughly 2 to 3 times lower than the pandas version (for example, validation AUPRC of 0.18 versus 0.48).
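
To illustrate the point in item 1 that equal AUC-ROC does not imply equal AUC-PR, here is a minimal sketch in plain Python. The helper names and the toy rankings are made up for this illustration and are not part of SynapseML or scikit-learn; the sketch also assumes there are no tied scores.

```python
# Toy illustration only: helper names and data are hypothetical.

def roc_auc(labels_by_rank):
    """AUC-ROC for a ranking with no tied scores.

    labels_by_rank[i] is the true label (0 or 1) of the item ranked i-th,
    highest score first.
    """
    n_pos = sum(labels_by_rank)
    n_neg = len(labels_by_rank) - n_pos
    negatives_seen = 0
    misordered_pairs = 0
    for label in labels_by_rank:
        if label == 1:
            # Every negative ranked above this positive is a misordered pair.
            misordered_pairs += negatives_seen
        else:
            negatives_seen += 1
    return 1 - misordered_pairs / (n_pos * n_neg)


def average_precision(labels_by_rank):
    """Average precision: mean of the precision at the rank of each positive."""
    n_pos = sum(labels_by_rank)
    hits = 0
    total = 0.0
    for rank, label in enumerate(labels_by_rank, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


def ranking(positive_ranks, n_items=100):
    """Build a 0/1 label list with positives at the given 1-based ranks."""
    return [1 if rank in positive_ranks else 0 for rank in range(1, n_items + 1)]


# 2 positives among 100 items (2% positive rate).
model_a = ranking({1, 12})  # one positive at the very top, one further down
model_b = ranking({6, 7})   # both positives just below the top

print("AUC-ROC:", roc_auc(model_a), roc_auc(model_b))
print("AUC-PR :", average_precision(model_a), average_precision(model_b))
```

Both toy rankings have the same AUC-ROC (about 0.949), but the average precision is about 0.58 for the first and about 0.23 for the second.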


The library does not seem to have been actively maintained for a while, so I hope someone on the support team can look into this problem. I would highly appreciate your help! Below are the two versions of the code, PySpark and pandas.




### Code to reproduce issue

```python
"""
-----------------------
training AUC: 0.9142080523534258
validation AUC: 0.9124695449945541
validation AUPRC: 0.17989272651979454
2024 Test AUC-ROC: 0.8760969857493492
2024 Test AUC-PR: 0.19127180999636964
2024 Test RMSE: 0.010704145654228314
2025 Test AUC-ROC: 0.8492785593472886
2025 Test AUC-PR: 0.10618822272549298
2025 Test RMSE: 0.01013130584232711
"""

import numpy as np
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator, RegressionEvaluator
from pyspark.ml.feature import VectorAssembler
from synapse.ml.lightgbm import LightGBMClassifier



# Load the list of feature column names
feature_cols = np.load("xxx").tolist()

# Assemble the feature columns into a single vector column
lgbm_assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
lgbm = LightGBMClassifier(
    featuresCol="features",
    labelCol="conversion_flag",
    numIterations=500,
    learningRate=0.810127,
    numLeaves=50,
    lambdaL1=0.799914,
    lambdaL2=0.080473,
    maxDepth=8,
    minSumHessianInLeaf=13.061164,
    baggingFraction=0.686218,
    featureFraction=0.579906,
    objective="binary",
    metric="binary_logloss",
    isProvideTrainingMetric=True,
    validationIndicatorCol="val_col",  # boolean column: True marks validation rows
    # featuresShapCol="shap_values",
    earlyStoppingRound=20,
    # useBarrierExecutionMode=True,
    # dataTransferMode='streaming',
)

evaluator_auc = BinaryClassificationEvaluator(labelCol="conversion_flag", rawPredictionCol="rawPrediction", metricName="areaUnderROC")
evaluator_pr = BinaryClassificationEvaluator(labelCol="conversion_flag", rawPredictionCol="rawPrediction", metricName="areaUnderPR")
evaluator_rmse = RegressionEvaluator(labelCol="conversion_flag", predictionCol="prediction", metricName="rmse")

# lgbm.setPassThroughArgs("print_every_n_iterations=5")
pipeline = Pipeline(stages=[lgbm_assembler, lgbm])
model = pipeline.fit(train_val_data)
predictions_train = model.transform(sampled_train)
predictions = model.transform(val_data)

train_auc = evaluator_auc.evaluate(predictions_train)
val_auc = evaluator_auc.evaluate(predictions)
val_auprc = evaluator_pr.evaluate(predictions)  

# Report training and validation metrics
# avg_auc = model.avgMetrics[0]
print('-----------------------')
print('training AUC:', train_auc)
print('validation AUC:', val_auc)
print('validation AUPRC:', val_auprc)

# %%
# Evaluate model on test set 2024
test_predictions_2024 = model.transform(test_data_2024)

test_auc_roc_2024 = evaluator_auc.evaluate(test_predictions_2024)
test_auc_pr_2024 = evaluator_pr.evaluate(test_predictions_2024)
test_rmse_2024 = evaluator_rmse.evaluate(test_predictions_2024)

print(f"2024 Test AUC-ROC: {test_auc_roc_2024}")
print(f"2024 Test AUC-PR: {test_auc_pr_2024}")
print(f"2024 Test RMSE: {test_rmse_2024}")
# %%

# %% Evaluate model on test set 2025
test_predictions_2025 = model.transform(test_data_2025)

test_auc_roc_2025 = evaluator_auc.evaluate(test_predictions_2025)
test_auc_pr_2025 = evaluator_pr.evaluate(test_predictions_2025)
test_rmse_2025 = evaluator_rmse.evaluate(test_predictions_2025)

print(f"2025 Test AUC-ROC: {test_auc_roc_2025}")
print(f"2025 Test AUC-PR: {test_auc_pr_2025}")
print(f"2025 Test RMSE: {test_rmse_2025}")

```



```python
"""
Training AUC: 0.999317
Validation AUC: 0.916813
Validation AUPRC: 0.477665
2024 Test AUC-ROC: 0.882917
2024 Test AUC-PR: 0.428740
2024 Test RMSE: 0.015985
2025 Test AUC-ROC: 0.838868
2025 Test AUC-PR: 0.311869
2025 Test RMSE: 0.014423
"""
import lightgbm as lgb
import pandas as pd
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, mean_squared_error
from sklearn.model_selection import train_test_split

# use the best parameters
best_params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'boosting_type': 'gbdt',
    'lambda_l1': 0.799914,
    'lambda_l2': 0.080473,
    'learning_rate': 0.810127,
    'max_depth': 8,
    'min_sum_hessian_in_leaf': 13.061164,
    'num_iterations': 167,
    'num_leaves': 50,
    'feature_fraction': 0.579906,
    'bagging_fraction': 0.686218,
    'verbose': -1,
    'random_state': 52725
}



# load the feature list
feature_cols = np.load("xxx.npy").tolist()

# train
X_train_sampled = sampled_train[feature_cols]
y_train_sampled = sampled_train['conversion_flag']

# validation
X_val = val_data[feature_cols]
y_val = val_data['conversion_flag']

# create lgbm dataset
train_dataset = lgb.Dataset(X_train_sampled, label=y_train_sampled)
val_dataset = lgb.Dataset(X_val, label=y_val, reference=train_dataset)

# model training
model = lgb.train(
    best_params,
    train_dataset,
    valid_sets=[val_dataset],
    # valid_sets=[train_dataset, val_dataset],
    # valid_names=['train', 'val'],
    callbacks=[
        lgb.early_stopping(stopping_rounds=20),  # stop if no improvement for 20 rounds
        lgb.log_evaluation(period=5)  # log the validation metric every 5 iterations
    ]
)

# prediction
train_predictions = model.predict(X_train_sampled, num_iteration=model.best_iteration)
val_predictions = model.predict(X_val, num_iteration=model.best_iteration)

# evaluate model performance
train_auc = roc_auc_score(y_train_sampled, train_predictions)
val_auc = roc_auc_score(y_val, val_predictions)
val_auprc = average_precision_score(y_val, val_predictions)

print('-----------------------')
print(f'Training AUC: {train_auc:.6f}')
print(f'Validation AUC: {val_auc:.6f}')
print(f'Validation AUPRC: {val_auprc:.6f}')

# evaluate test data - 2024
X_test_2024 = test_data_2024[feature_cols]
y_test_2024 = test_data_2024['conversion_flag']
test_predictions_2024 = model.predict(X_test_2024, num_iteration=model.best_iteration)

test_auc_roc_2024 = roc_auc_score(y_test_2024, test_predictions_2024)
test_auc_pr_2024 = average_precision_score(y_test_2024, test_predictions_2024)
test_rmse_2024 = np.sqrt(mean_squared_error(y_test_2024, test_predictions_2024))

print(f"2024 Test AUC-ROC: {test_auc_roc_2024:.6f}")
print(f"2024 Test AUC-PR: {test_auc_pr_2024:.6f}")
print(f"2024 Test RMSE: {test_rmse_2024:.6f}")

# evaluate test data - 2025
X_test_2025 = test_data_2025[feature_cols]
y_test_2025 = test_data_2025['conversion_flag']
test_predictions_2025 = model.predict(X_test_2025, num_iteration=model.best_iteration)

test_auc_roc_2025 = roc_auc_score(y_test_2025, test_predictions_2025)
test_auc_pr_2025 = average_precision_score(y_test_2025, test_predictions_2025)
test_rmse_2025 = np.sqrt(mean_squared_error(y_test_2025, test_predictions_2025))

print(f"2025 Test AUC-ROC: {test_auc_roc_2025:.6f}")
print(f"2025 Test AUC-PR: {test_auc_pr_2025:.6f}")
print(f"2025 Test RMSE: {test_rmse_2025:.6f}")

```
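
One thing that may be worth ruling out when comparing the two AUC-PR numbers is the metric definition itself. This rests on an assumption that has not been verified against this Spark version: Spark's `areaUnderPR` is understood to integrate the precision-recall curve with the trapezoidal rule after prepending a point at recall 0, whereas scikit-learn's `average_precision_score` uses a step-wise sum without interpolation. The sketch below, in plain Python with hypothetical helper names and toy data, shows that the two rules can give different areas for the same ranking. It is a sanity check on the definitions, not a diagnosis of the gap.

```python
# Toy illustration only: helper names and data are hypothetical,
# and the ranking is assumed to have no tied scores.

def pr_points(labels_by_rank):
    """(recall, precision) after each prefix of the ranking, highest score first."""
    n_pos = sum(labels_by_rank)
    hits = 0
    points = []
    for rank, label in enumerate(labels_by_rank, start=1):
        hits += label
        points.append((hits / n_pos, hits / rank))
    return points


def step_area(points):
    """Step-wise sum of (R_n - R_{n-1}) * P_n, i.e. average precision."""
    area = 0.0
    prev_recall = 0.0
    for recall, precision in points:
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area


def trapezoid_area(points):
    """Trapezoidal area after prepending (0, first precision).

    This mirrors the assumed Spark behaviour described in the text above.
    """
    curve = [(0.0, points[0][1])] + points
    area = 0.0
    for (r0, p0), (r1, p1) in zip(curve, curve[1:]):
        area += (r1 - r0) * (p0 + p1) / 2
    return area


# Five items ranked by score; positives at ranks 2 and 5.
labels_by_rank = [0, 1, 0, 0, 1]
points = pr_points(labels_by_rank)

print("step-wise  :", step_area(points))
print("trapezoidal:", trapezoid_area(points))
```

On this toy ranking the step-wise area is 0.45 and the trapezoidal area is 0.2875, so the same predictions can score differently under the two rules.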

### Other info / logs

_No response_

### What component(s) does this bug affect?

- [ ] `area/cognitive`: Cognitive project
- [ ] `area/core`: Core project
- [ ] `area/deep-learning`: DeepLearning project
- [x] `area/lightgbm`: Lightgbm project
- [ ] `area/opencv`: Opencv project
- [ ] `area/vw`: VW project
- [ ] `area/website`: Website
- [ ] `area/build`: Project build system
- [ ] `area/notebooks`: Samples under notebooks folder
- [ ] `area/docker`: Docker usage
- [ ] `area/models`: models related issue

### What language(s) does this bug affect?

- [ ] `language/scala`: Scala source code
- [x] `language/python`: Pyspark APIs
- [ ] `language/r`: R APIs
- [ ] `language/csharp`: .NET APIs
- [ ] `language/new`: Proposals for new client languages

### What integration(s) does this bug affect?

- [x] `integrations/synapse`: Azure Synapse integrations
- [ ] `integrations/azureml`: Azure ML integrations
- [x] `integrations/databricks`: Databricks integrations
