Error: 'could not find the monitored key in the returned metrics' #20

@storyofblue

@yellowcap @oliverroick @alukach @sunu @AliceR

Hi there,

I ran into the error shown below. I hope you can help me, thanks!

INFO:numexpr.utils:Note: detected 168 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
INFO:numexpr.utils:Note: NumExpr detected 168 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
INFO:albumentations.check_version:A new version of Albumentations is available: 2.0.8 (you have 1.4.10). Upgrade using: pip install --upgrade albumentations
/mnt/scratch/users/quinnnew/multi-temporal-crop-classification_classes4_xjcrops/
INFO: Seed set to 0
INFO:lightning.fabric.utilities.seed:Seed set to 0
/mnt/scratch/users/quinnnew/anaconda3/lib/python3.12/site-packages/lightning/fabric/plugins/environments/slurm.py:204: The srun command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with srun like so: srun python example_multitemporalcrop_class4_linux.py ...
INFO: Using bfloat16 Automatic Mixed Precision (AMP)
INFO:lightning.pytorch.utilities.rank_zero:Using bfloat16 Automatic Mixed Precision (AMP)
INFO: GPU available: True (cuda), used: True
INFO:lightning.pytorch.utilities.rank_zero:GPU available: True (cuda), used: True
INFO: TPU available: False, using: 0 TPU cores
INFO:lightning.pytorch.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO: HPU available: False, using: 0 HPUs
INFO:lightning.pytorch.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO: Trainer(limit_predict_batches=1) was configured so 1 batch will be used.
INFO:lightning.pytorch.utilities.rank_zero:Trainer(limit_predict_batches=1) was configured so 1 batch will be used.
INFO:root:Loaded weights for HLSBands.BLUE in position 0 of patch embed
INFO:root:Loaded weights for HLSBands.GREEN in position 1 of patch embed
INFO:root:Loaded weights for HLSBands.RED in position 2 of patch embed
INFO:root:Loaded weights for HLSBands.NIR_NARROW in position 3 of patch embed
INFO:root:Loaded weights for HLSBands.SWIR_1 in position 4 of patch embed
INFO:root:Loaded weights for HLSBands.SWIR_2 in position 5 of patch embed
WARNING:root:Decoder UperNetDecoder does not have an includes_head attribute. Falling back to the value of the registry.
/mnt/scratch/users/quinnnew/anaconda3/lib/python3.12/site-packages/terratorch/models/decoders/upernet_decoder.py:37: UserWarning: DeprecationWarning: scale_modules is deprecated and will be removed in future versions. Use LearnedInterpolateToPyramidal neck instead.
warnings.warn(
INFO: You are using a CUDA device ('NVIDIA L4') that has Tensor Cores. To properly utilize them, you should set torch.set_float32_matmul_precision('medium' | 'high') which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
INFO:lightning.pytorch.utilities.rank_zero:You are using a CUDA device ('NVIDIA L4') that has Tensor Cores. To properly utilize them, you should set torch.set_float32_matmul_precision('medium' | 'high') which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
INFO: Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
INFO:lightning.fabric.utilities.distributed:Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
INFO:numexpr.utils:Note: detected 168 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
INFO:numexpr.utils:Note: NumExpr detected 168 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
INFO:albumentations.check_version:A new version of Albumentations is available: 2.0.8 (you have 1.4.10). Upgrade using: pip install --upgrade albumentations
/mnt/scratch/users/quinnnew/multi-temporal-crop-classification_classes4_xjcrops/
INFO: [rank: 1] Seed set to 0
INFO:lightning.fabric.utilities.seed:[rank: 1] Seed set to 0
INFO:root:Loaded weights for HLSBands.BLUE in position 0 of patch embed
INFO:root:Loaded weights for HLSBands.GREEN in position 1 of patch embed
INFO:root:Loaded weights for HLSBands.RED in position 2 of patch embed
INFO:root:Loaded weights for HLSBands.NIR_NARROW in position 3 of patch embed
INFO:root:Loaded weights for HLSBands.SWIR_1 in position 4 of patch embed
INFO:root:Loaded weights for HLSBands.SWIR_2 in position 5 of patch embed
WARNING:root:Decoder UperNetDecoder does not have an includes_head attribute. Falling back to the value of the registry.
/mnt/scratch/users/quinnnew/anaconda3/lib/python3.12/site-packages/terratorch/models/decoders/upernet_decoder.py:37: UserWarning: DeprecationWarning: scale_modules is deprecated and will be removed in future versions. Use LearnedInterpolateToPyramidal neck instead.
warnings.warn(
INFO: Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
INFO:lightning.fabric.utilities.distributed:Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
INFO: ----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes

INFO:lightning.pytorch.utilities.rank_zero:----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes

INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
INFO:lightning.pytorch.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
INFO: LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
INFO:lightning.pytorch.accelerators.cuda:LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
INFO:
INFO:
  | Name          | Type             | Params | Mode
--------------------------------------------------------
0 | model         | PixelWiseModel   | 364 M  | train
1 | criterion     | CrossEntropyLoss | 0      | train
2 | train_metrics | MetricCollection | 0      | train
3 | val_metrics   | MetricCollection | 0      | train
4 | test_metrics  | ModuleList       | 0      | train

364 M Trainable params
0 Non-trainable params
364 M Total params
1,457.832 Total estimated model params size (MB)
625 Modules in train mode
0 Modules in eval mode
INFO:lightning.pytorch.callbacks.model_summary:
  | Name          | Type             | Params | Mode
--------------------------------------------------------
0 | model         | PixelWiseModel   | 364 M  | train
1 | criterion     | CrossEntropyLoss | 0      | train
2 | train_metrics | MetricCollection | 0      | train
3 | val_metrics   | MetricCollection | 0      | train
4 | test_metrics  | ModuleList       | 0      | train

364 M Trainable params
0 Non-trainable params
364 M Total params
1,457.832 Total estimated model params size (MB)
625 Modules in train mode
0 Modules in eval mode
Epoch 0: 100%|████████████████████| 772/772 [04:15<00:00, 3.02it/s, v_num=4]
[rank1]: Traceback (most recent call last):
[rank1]: File "/mnt/scratch/users/quinnnew/Prithvi-EO-2.0-main/examples/example_multitemporalcrop_class4_linux.py", line 190, in <module>
[rank1]: trainer.fit(model, datamodule=data_module)
[rank1]: File "/mnt/scratch/users/quinnnew/anaconda3/lib/python3.12/site-packages/lightning/pytorch/trainer/trainer.py", line 538, in fit
[rank1]: call._call_and_handle_interrupt(
[rank1]: File "/mnt/scratch/users/quinnnew/anaconda3/lib/python3.12/site-packages/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
[rank1]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/mnt/scratch/users/quinnnew/anaconda3/lib/python3.12/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
[rank1]: return function(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/mnt/scratch/users/quinnnew/anaconda3/lib/python3.12/site-packages/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
[rank1]: self._run(model, ckpt_path=ckpt_path)
[rank1]: File "/mnt/scratch/users/quinnnew/anaconda3/lib/python3.12/site-packages/lightning/pytorch/trainer/trainer.py", line 981, in _run
[rank1]: results = self._run_stage()
[rank1]: ^^^^^^^^^^^^^^^^^
[rank1]: File "/mnt/scratch/users/quinnnew/anaconda3/lib/python3.12/site-packages/lightning/pytorch/trainer/trainer.py", line 1025, in _run_stage
[rank1]: self.fit_loop.run()
[rank1]: File "/mnt/scratch/users/quinnnew/anaconda3/lib/python3.12/site-packages/lightning/pytorch/loops/fit_loop.py", line 206, in run
[rank1]: self.on_advance_end()
[rank1]: File "/mnt/scratch/users/quinnnew/anaconda3/lib/python3.12/site-packages/lightning/pytorch/loops/fit_loop.py", line 378, in on_advance_end
[rank1]: call._call_callback_hooks(trainer, "on_train_epoch_end", monitoring_callbacks=True)
[rank1]: File "/mnt/scratch/users/quinnnew/anaconda3/lib/python3.12/site-packages/lightning/pytorch/trainer/call.py", line 218, in _call_callback_hooks
[rank1]: fn(trainer, trainer.lightning_module, *args, **kwargs)
[rank1]: File "/mnt/scratch/users/quinnnew/anaconda3/lib/python3.12/site-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 325, in on_train_epoch_end
[rank1]: self._save_topk_checkpoint(trainer, monitor_candidates)
[rank1]: File "/mnt/scratch/users/quinnnew/anaconda3/lib/python3.12/site-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 383, in _save_topk_checkpoint
[rank1]: raise MisconfigurationException(m)
[rank1]: lightning.fabric.utilities.exceptions.MisconfigurationException: ModelCheckpoint(monitor='val/Multiclass_Jaccard_Index') could not find the monitored key in the returned metrics: ['train/loss', 'val/loss', 'val/Accuracy', 'val/multiclassaccuracy_0', 'val/multiclassaccuracy_1', 'val/multiclassaccuracy_2', 'val/multiclassaccuracy_3', 'val/multiclassaccuracy_4', 'val/multiclassaccuracy_5', 'val/multiclassaccuracy_6', 'val/multiclassaccuracy_7', 'val/multiclassaccuracy_8', 'val/multiclassaccuracy_9', 'val/multiclassaccuracy_10', 'val/multiclassaccuracy_11', 'val/multiclassaccuracy_12', 'val/F1_Score', 'val/multiclassjaccardindex_0', 'val/multiclassjaccardindex_1', 'val/multiclassjaccardindex_2', 'val/multiclassjaccardindex_3', 'val/multiclassjaccardindex_4', 'val/multiclassjaccardindex_5', 'val/multiclassjaccardindex_6', 'val/multiclassjaccardindex_7', 'val/multiclassjaccardindex_8', 'val/multiclassjaccardindex_9', 'val/multiclassjaccardindex_10', 'val/multiclassjaccardindex_11', 'val/multiclassjaccardindex_12', 'val/Pixel_Accuracy', 'val/mIoU', 'val/mIoU_Micro', 'train/Accuracy', 'train/multiclassaccuracy_0', 'train/multiclassaccuracy_1', 'train/multiclassaccuracy_2', 'train/multiclassaccuracy_3', 'train/multiclassaccuracy_4', 'train/multiclassaccuracy_5', 'train/multiclassaccuracy_6', 'train/multiclassaccuracy_7', 'train/multiclassaccuracy_8', 'train/multiclassaccuracy_9', 'train/multiclassaccuracy_10', 'train/multiclassaccuracy_11', 'train/multiclassaccuracy_12', 'train/F1_Score', 'train/multiclassjaccardindex_0', 'train/multiclassjaccardindex_1', 'train/multiclassjaccardindex_2', 'train/multiclassjaccardindex_3', 'train/multiclassjaccardindex_4', 'train/multiclassjaccardindex_5', 'train/multiclassjaccardindex_6', 'train/multiclassjaccardindex_7', 'train/multiclassjaccardindex_8', 'train/multiclassjaccardindex_9', 'train/multiclassjaccardindex_10', 'train/multiclassjaccardindex_11', 'train/multiclassjaccardindex_12', 'train/Pixel_Accuracy', 'train/mIoU', 
'train/mIoU_Micro', 'epoch', 'step']. HINT: Did you call log('val/Multiclass_Jaccard_Index', value) in the LightningModule?
[rank0]: Traceback (most recent call last):
[rank0]: File "/mnt/scratch/users/quinnnew/Prithvi-EO-2.0-main/examples/example_multitemporalcrop_class4_linux.py", line 190, in <module>
[rank0]: trainer.fit(model, datamodule=data_module)
[rank0]: File "/mnt/scratch/users/quinnnew/anaconda3/lib/python3.12/site-packages/lightning/pytorch/trainer/trainer.py", line 538, in fit
[rank0]: call._call_and_handle_interrupt(
[rank0]: File "/mnt/scratch/users/quinnnew/anaconda3/lib/python3.12/site-packages/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
[rank0]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/scratch/users/quinnnew/anaconda3/lib/python3.12/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
[rank0]: return function(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/scratch/users/quinnnew/anaconda3/lib/python3.12/site-packages/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
[rank0]: self._run(model, ckpt_path=ckpt_path)
[rank0]: File "/mnt/scratch/users/quinnnew/anaconda3/lib/python3.12/site-packages/lightning/pytorch/trainer/trainer.py", line 981, in _run
[rank0]: results = self._run_stage()
[rank0]: ^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/scratch/users/quinnnew/anaconda3/lib/python3.12/site-packages/lightning/pytorch/trainer/trainer.py", line 1025, in _run_stage
[rank0]: self.fit_loop.run()
[rank0]: File "/mnt/scratch/users/quinnnew/anaconda3/lib/python3.12/site-packages/lightning/pytorch/loops/fit_loop.py", line 206, in run
[rank0]: self.on_advance_end()
[rank0]: File "/mnt/scratch/users/quinnnew/anaconda3/lib/python3.12/site-packages/lightning/pytorch/loops/fit_loop.py", line 378, in on_advance_end
[rank0]: call._call_callback_hooks(trainer, "on_train_epoch_end", monitoring_callbacks=True)
[rank0]: File "/mnt/scratch/users/quinnnew/anaconda3/lib/python3.12/site-packages/lightning/pytorch/trainer/call.py", line 218, in _call_callback_hooks
[rank0]: fn(trainer, trainer.lightning_module, *args, **kwargs)
[rank0]: File "/mnt/scratch/users/quinnnew/anaconda3/lib/python3.12/site-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 325, in on_train_epoch_end
[rank0]: self._save_topk_checkpoint(trainer, monitor_candidates)
[rank0]: File "/mnt/scratch/users/quinnnew/anaconda3/lib/python3.12/site-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 383, in _save_topk_checkpoint
[rank0]: raise MisconfigurationException(m)
[rank0]: lightning.fabric.utilities.exceptions.MisconfigurationException: ModelCheckpoint(monitor='val/Multiclass_Jaccard_Index') could not find the monitored key in the returned metrics: ['train/loss', 'val/loss', 'val/Accuracy', 'val/multiclassaccuracy_0', 'val/multiclassaccuracy_1', 'val/multiclassaccuracy_2', 'val/multiclassaccuracy_3', 'val/multiclassaccuracy_4', 'val/multiclassaccuracy_5', 'val/multiclassaccuracy_6', 'val/multiclassaccuracy_7', 'val/multiclassaccuracy_8', 'val/multiclassaccuracy_9', 'val/multiclassaccuracy_10', 'val/multiclassaccuracy_11', 'val/multiclassaccuracy_12', 'val/F1_Score', 'val/multiclassjaccardindex_0', 'val/multiclassjaccardindex_1', 'val/multiclassjaccardindex_2', 'val/multiclassjaccardindex_3', 'val/multiclassjaccardindex_4', 'val/multiclassjaccardindex_5', 'val/multiclassjaccardindex_6', 'val/multiclassjaccardindex_7', 'val/multiclassjaccardindex_8', 'val/multiclassjaccardindex_9', 'val/multiclassjaccardindex_10', 'val/multiclassjaccardindex_11', 'val/multiclassjaccardindex_12', 'val/Pixel_Accuracy', 'val/mIoU', 'val/mIoU_Micro', 'train/Accuracy', 'train/multiclassaccuracy_0', 'train/multiclassaccuracy_1', 'train/multiclassaccuracy_2', 'train/multiclassaccuracy_3', 'train/multiclassaccuracy_4', 'train/multiclassaccuracy_5', 'train/multiclassaccuracy_6', 'train/multiclassaccuracy_7', 'train/multiclassaccuracy_8', 'train/multiclassaccuracy_9', 'train/multiclassaccuracy_10', 'train/multiclassaccuracy_11', 'train/multiclassaccuracy_12', 'train/F1_Score', 'train/multiclassjaccardindex_0', 'train/multiclassjaccardindex_1', 'train/multiclassjaccardindex_2', 'train/multiclassjaccardindex_3', 'train/multiclassjaccardindex_4', 'train/multiclassjaccardindex_5', 'train/multiclassjaccardindex_6', 'train/multiclassjaccardindex_7', 'train/multiclassjaccardindex_8', 'train/multiclassjaccardindex_9', 'train/multiclassjaccardindex_10', 'train/multiclassjaccardindex_11', 'train/multiclassjaccardindex_12', 'train/Pixel_Accuracy', 'train/mIoU', 
'train/mIoU_Micro', 'epoch', 'step']. HINT: Did you call log('val/Multiclass_Jaccard_Index', value) in the LightningModule?
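
The exception says that the key given to ModelCheckpoint(monitor=...) is not among the metric names that were logged. As a rough diagnostic sketch (the helper name `check_monitor` is hypothetical and not part of Lightning or TerraTorch), the monitored name can be compared against the logged keys to see which validation metrics are available instead. The key list below is an abbreviated subset of the one printed in the traceback.

```python
# Sketch only: check a checkpoint monitor name against the metric keys
# that were actually logged. The helper name is hypothetical.

def check_monitor(monitor, logged_keys):
    """Return (found, candidates) for a monitor key.

    candidates lists the logged keys that share the monitor's stage
    prefix (e.g. "val/") when the exact key is missing.
    """
    if monitor in logged_keys:
        return True, []
    prefix = monitor.split("/", 1)[0] + "/"
    candidates = [k for k in logged_keys if k.startswith(prefix)]
    return False, candidates


# Abbreviated subset of the keys listed in the traceback above.
logged_keys = [
    "train/loss", "val/loss", "val/Accuracy", "val/F1_Score",
    "val/Pixel_Accuracy", "val/mIoU", "val/mIoU_Micro", "epoch", "step",
]

found, candidates = check_monitor("val/Multiclass_Jaccard_Index", logged_keys)
print(found)       # False: the monitored key was never logged
print(candidates)  # validation keys that could be monitored instead
```

If `val/mIoU` is the metric that was meant, one option would be to pass `monitor="val/mIoU"` with `mode="max"` to ModelCheckpoint. Whether that key corresponds to the Jaccard index the example script expects is an assumption and should be checked against the installed TerraTorch version.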
