🐛[bug] JSON Serialization issue when profiling is enabled #10263

@melihakay

Describe the bug

<none> [2025-07-31 11:26:47] [e20eda90] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/_pytorch_trial.py", line 754, in _train_for_op
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0]     metrics = self._aggregate_training_metrics(training_metrics)
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/_pytorch_trial.py", line 355, in _aggregate_training_metrics
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0]     self.core_context.train.report_training_metrics(
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/core/_train.py", line 124, in report_training_metrics
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0]     self._report_trial_metrics(util._LEGACY_TRAINING, steps_completed, metrics, batch_metrics)
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/core/_train.py", line 91, in _report_trial_metrics
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0]     self._metrics.report(
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/core/_metrics.py", line 51, in report
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0]     self._maybe_raise_exception()
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/core/_metrics.py", line 65, in _maybe_raise_exception
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0]     raise err_msg
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/core/_metrics.py", line 156, in run
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0]     self._post_metrics(
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/core/_metrics.py", line 186, in _post_metrics
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0]     bindings.post_ReportTrialMetrics(self._session, body=body, metrics_trialId=self._trial_id)
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/common/api/bindings.py", line 21817, in post_ReportTrialMetrics
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0]     _resp = session._do_request(
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/common/api/_session.py", line 272, in _do_request
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0]     return _do_request(
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/common/api/_session.py", line 43, in _do_request
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0]     data = det.util.json_encode(json)
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/util.py", line 246, in json_encode
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0]     return json.dumps(jsonable(obj), indent=indent, sort_keys=sort_keys)
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0]   File "/opt/conda/lib/python3.8/json/__init__.py", line 231, in dumps
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0]     return _default_encoder.encode(obj)
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0]   File "/opt/conda/lib/python3.8/json/encoder.py", line 199, in encode
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0]     chunks = self.iterencode(o, _one_shot=True)
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0]   File "/opt/conda/lib/python3.8/json/encoder.py", line 257, in iterencode
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0]     return _iterencode(o, 0)
<none> [2025-07-31 11:26:47] [e20eda90] [rank=0] TypeError: keys must be str, int, float, bool or None, not bytes

Determined 0.32.1. This error is raised when I enable profiling by setting

profiling:
  enabled: true

in the experiment config. The error vanishes and the trainer reports metrics without any problem when I disable profiling.
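
The final TypeError in the trace can be reproduced outside Determined with the standard library alone. The sketch below assumes, without confirmation from the trace, that some metrics dict reaches json.dumps with a bytes key; the payload and the helper decode_bytes_keys are hypothetical illustrations, not part of Determined's API or a confirmed fix.

```python
import json


def decode_bytes_keys(obj):
    """Hypothetical helper: recursively turn bytes keys and values into str."""
    if isinstance(obj, dict):
        return {
            (k.decode("utf-8", errors="replace") if isinstance(k, bytes) else k):
                decode_bytes_keys(v)
            for k, v in obj.items()
        }
    if isinstance(obj, (list, tuple)):
        return [decode_bytes_keys(v) for v in obj]
    if isinstance(obj, bytes):
        return obj.decode("utf-8", errors="replace")
    return obj


# Hypothetical payload: one key is bytes, as the error message suggests.
metrics = {b"gpu_util": 0.93, "loss": 0.12}

error_message = None
try:
    json.dumps(metrics)  # the json module rejects bytes keys
except TypeError as exc:
    error_message = str(exc)

# After decoding, every key is a str and encoding succeeds.
encoded = json.dumps(decode_bytes_keys(metrics), sort_keys=True)
print(error_message)
print(encoded)
```

If the offending key can be identified this way, it may help narrow down which profiling metric is produced with a bytes name.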

Reproduction Steps

  1. Create a PyTorchTrial
  2. Set profiling.enabled: true in the experiment config
  3. Submit the experiment to the master with 4 GPUs

Expected Behavior

System metrics should be reported to the WebUI.

Screenshot

Error trace is presented above.

Environment

  • Device or hardware: 4x NVIDIA A100
  • OS: Ubuntu 20.04
  • Browser: Edge
  • Version: 0.32.1 (Determined)

Additional Context

No response
