Consolidate Training/Evaluation Run Outputs into a Single runs/ Directory #293

@sadamov

Problem

A single training run scatters five kinds of output across as many as four unrelated top-level directories:

| Artifact | Current location |
| --- | --- |
| Model checkpoints | saved_models/<run-name>/ |
| Lightning CSV logs | lightning_logs/version_N/ (auto-incremented, not named) |
| W&B offline cache | wandb/offline-run-<timestamp>-<id>/ |
| Eval artifacts (PDFs, CSVs, .pt files) | wandb/ or mlruns/ via self.logger.save_dir |
| MLFlow artifacts | mlruns/ (hard-coded in CustomMLFlowLogger.save_dir) |

Run Name Format

Auto-generated in neural_lam/train_model.py as:
{prefix}{model}-{processor_layers}x{hidden_dim}-{MM_DD_HH}-{random_4digits}
e.g. train-graph_lam-2x64-02_21_12-4571. Override with --logger_run_name.
The run name is applied to saved_models/ and the external logger, but not to lightning_logs/.
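
As a rough illustration of the format above, the name could be assembled like this. This is a sketch, not the code in neural_lam/train_model.py: the helper name make_run_name and its now/rng parameters are hypothetical, and zero-padding the random suffix to four digits is an assumption.

```python
import random
import time


def make_run_name(prefix, model, processor_layers, hidden_dim, now=None, rng=random):
    """Build a name shaped like 'train-graph_lam-2x64-02_21_12-4571'.

    Hypothetical helper for illustration only; `now` (a time.struct_time)
    and `rng` exist so the result can be made reproducible in tests.
    """
    # Month_day_hour, matching the {MM_DD_HH} part of the format.
    timestamp = time.strftime("%m_%d_%H", now if now is not None else time.localtime())
    # Assumption: the random suffix is zero-padded to four digits.
    random_id = rng.randint(0, 9999)
    return f"{prefix}{model}-{processor_layers}x{hidden_dim}-{timestamp}-{random_id:04d}"


print(make_run_name("train-", "graph_lam", 2, 64))
```

Because the timestamp only has hour resolution, the random suffix is what keeps two runs started within the same hour from colliding.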

Proposed Change

All output should go under runs/<run-name>/:

  1. train_model.py: set ModelCheckpoint(dirpath=f"runs/{run_name}/checkpoints") and default_root_dir=f"runs/{run_name}" on pl.Trainer.
  2. utils.py: pass save_dir=f"runs/{run_name}" to WandbLogger and CustomMLFlowLogger.
  3. custom_loggers.py: make save_dir return the run-scoped path instead of "mlruns".
  4. ar_model.py: no changes needed, since artifacts already use self.logger.save_dir and will follow the run-scoped path.
  5. README.md / .gitignore: replace saved_models/, lightning_logs/, wandb/ references with runs/.
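
The layout in steps 1 to 3 can be sketched with the standard library alone. The helpers run_dir and checkpoint_dir and the class RunScopedLoggerSketch are hypothetical names used only for illustration; the class is a stand-in for CustomMLFlowLogger that shows the save_dir change and nothing else. The comments indicate where the resulting paths would be passed to the Lightning objects named in the list.

```python
from pathlib import Path


def run_dir(run_name, root="runs"):
    """Directory that would hold every output of one run (hypothetical helper)."""
    return Path(root) / run_name


def checkpoint_dir(run_name, root="runs"):
    """Checkpoint subdirectory inside the run directory (hypothetical helper)."""
    return run_dir(run_name, root) / "checkpoints"


class RunScopedLoggerSketch:
    """Stand-in for CustomMLFlowLogger, showing only the save_dir change."""

    def __init__(self, save_dir):
        self._save_dir = str(save_dir)

    @property
    def save_dir(self):
        # Returns the run-scoped path instead of a hard-coded "mlruns".
        return self._save_dir


run_name = "train-graph_lam-2x64-02_21_12-4571"

# Step 1: ModelCheckpoint(dirpath=str(checkpoint_dir(run_name)))
#         pl.Trainer(default_root_dir=str(run_dir(run_name)), ...)
# Step 2: WandbLogger(save_dir=str(run_dir(run_name)), ...)
# Step 3: the logger's save_dir property returns the same run directory.
logger = RunScopedLoggerSketch(save_dir=run_dir(run_name))

print(checkpoint_dir(run_name).as_posix())
print(Path(logger.save_dir).as_posix())
```

Deriving every path from one run_dir helper keeps the checkpoint, logger, and artifact locations from drifting apart if the root directory is renamed later.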

Metadata

Labels

bug, enhancement, good first issue, help wanted
