🐛[bug] JSON Serialization issue when profiling is enabled #10263
Open
Description
Describe the bug
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0] File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/_pytorch_trial.py", line 754, in _train_for_op
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0] metrics = self._aggregate_training_metrics(training_metrics)
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0] File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/_pytorch_trial.py", line 355, in _aggregate_training_metrics
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0] self.core_context.train.report_training_metrics(
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0] File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/core/_train.py", line 124, in report_training_metrics
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0] self._report_trial_metrics(util._LEGACY_TRAINING, steps_completed, metrics, batch_metrics)
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0] File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/core/_train.py", line 91, in _report_trial_metrics
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0] self._metrics.report(
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0] File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/core/_metrics.py", line 51, in report
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0] self._maybe_raise_exception()
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0] File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/core/_metrics.py", line 65, in _maybe_raise_exception
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0] raise err_msg
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0] File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/core/_metrics.py", line 156, in run
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0] self._post_metrics(
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0] File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/core/_metrics.py", line 186, in _post_metrics
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0] bindings.post_ReportTrialMetrics(self._session, body=body, metrics_trialId=self._trial_id)
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0] File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/common/api/bindings.py", line 21817, in post_ReportTrialMetrics
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0] _resp = session._do_request(
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0] File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/common/api/_session.py", line 272, in _do_request
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0] return _do_request(
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0] File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/common/api/_session.py", line 43, in _do_request
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0] data = det.util.json_encode(json)
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0] File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/util.py", line 246, in json_encode
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0] return json.dumps(jsonable(obj), indent=indent, sort_keys=sort_keys)
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0] File "/opt/conda/lib/python3.8/json/__init__.py", line 231, in dumps
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0] return _default_encoder.encode(obj)
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0] File "/opt/conda/lib/python3.8/json/encoder.py", line 199, in encode
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0] chunks = self.iterencode(o, _one_shot=True)
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0] File "/opt/conda/lib/python3.8/json/encoder.py", line 257, in iterencode
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0] return _iterencode(o, 0)
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0] TypeError: keys must be str, int, float, bool or None, not bytes
Determined 0.32.1. This error is raised when I enable profiling by setting
profiling:
  enabled: true
in the experiment config. When I disable profiling, the error disappears and the trainer reports metrics normally.
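For reference, the final TypeError can be reproduced outside Determined with plain json.dumps. The sketch below is only an illustration of the failure mode: the payload and the decode_bytes_keys helper are hypothetical, and I have not confirmed where in the profiling path the bytes key actually originates or that this is how it should be fixed.

```python
import json


def decode_bytes_keys(obj):
    """Recursively turn bytes keys and values into str so json.dumps accepts them.

    Hypothetical helper for illustration; not part of Determined.
    """
    if isinstance(obj, dict):
        return {
            (k.decode("utf-8", errors="replace") if isinstance(k, bytes) else k):
                decode_bytes_keys(v)
            for k, v in obj.items()
        }
    if isinstance(obj, (list, tuple)):
        return [decode_bytes_keys(v) for v in obj]
    if isinstance(obj, bytes):
        return obj.decode("utf-8", errors="replace")
    return obj


# Hypothetical metrics payload with one bytes key, standing in for whatever
# the profiler adds to the reported metrics.
metrics = {b"gpu_util": 0.93, "loss": 0.12}

try:
    json.dumps(metrics)
    raised = False
    error_message = ""
except TypeError as exc:
    # Same error class and wording as the last line of the trace above.
    raised = True
    error_message = str(exc)

# After decoding the keys, the same payload serializes without error.
encoded = json.dumps(decode_bytes_keys(metrics), sort_keys=True)
```

If the real payload has the same shape, sanitizing the keys before the metrics are reported might serve as a temporary workaround, but that is an assumption on my side.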
Reproduction Steps
- Create a PyTorchTrial
- Set profiling.enabled: true in the experiment config
- Submit the experiment to the master with 4 GPUs
Expected Behavior
System metrics should have been reported to the WebUI
Screenshot
Error trace is presented above.
Environment
- Device or hardware: 4x NVIDIA A100
- OS: Ubuntu 20.04
- Browser: Edge
- Version: 0.32.1 (Determined)
Additional Context
No response