fix consolidate training run outputs into a single runs/ directory #580

Open
sudhansu-24 wants to merge 1 commit into mllam:main from sudhansu-24:fix-run-outputs
Conversation

@sudhansu-24

Describe your changes

Training and evaluation artifacts are written under a single directory runs/<run-name>/: ModelCheckpoint uses runs/<run-name>/checkpoints/, Trainer(default_root_dir=...) keeps Lightning CSV logs under that run instead of a top-level lightning_logs/, and WandbLogger / CustomMLFlowLogger use save_dir=run_dir so internal logger paths and code using self.logger.save_dir (e.g. plots) stay under the run root. Checkpoints remain outside W&B’s wandb/ subtree so large files are not synced by default.

Motivation: Issue #293 and maintainer feedback (W&B selective sync, common run root, MLflow temp images not in CWD).
Dependencies: None
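
As a rough illustration of the layout described above, the following minimal sketch computes the per-run paths using only the standard library. The helper name `build_run_paths` and the default root `runs` are assumptions made for illustration and are not taken from this PR's diff; the comments indicate where each path would be handed to Lightning and the loggers.

```python
from pathlib import Path


def build_run_paths(run_name: str, root: str = "runs") -> dict[str, Path]:
    """Return the directories used by one training run (hypothetical helper)."""
    run_dir = Path(root) / run_name
    return {
        # Would be passed as Trainer(default_root_dir=...) and as
        # save_dir=... to WandbLogger / CustomMLFlowLogger.
        "run_dir": run_dir,
        # Would be passed as ModelCheckpoint(dirpath=...); it sits next to,
        # not inside, any wandb/ subtree created under the run directory.
        "checkpoints": run_dir / "checkpoints",
    }
```

For a run named `example-run`, this yields `runs/example-run` as the run root and `runs/example-run/checkpoints` for checkpoints. Creating the directories is left to the caller (or to Lightning itself).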

Issue Link

closes #293

Type of change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📖 Documentation (Addition or improvements to documentation)

Checklist before requesting a review

  • My branch is up-to-date with the target branch - if not update your fork with the changes from the target branch (use pull with --rebase option if possible).
  • I have performed a self-review of my code
  • For any new/modified functions/classes I have added docstrings that clearly describe its purpose, expected inputs and returned values
  • I have placed in-line comments to clarify the intent of any hard-to-understand passages of my code
  • I have updated the README to cover introduced code changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have given the PR a name that clearly describes the change, written in imperative form (context).
  • I have requested a reviewer and an assignee (assignee is responsible for merging). This applies only if you have write access to the repo, otherwise feel free to tag a maintainer to add a reviewer and assignee.

Checklist for reviewers

Each PR comes with its own improvements and flaws. The reviewer should check the following:

  • the code is readable
  • the code is well tested
  • the code is documented (including return types and parameters)
  • the code is easy to maintain

Author checklist after completed review

  • I have added a line to the CHANGELOG describing this change, in a section
    reflecting type of change (add section where missing):
    • added: when you have added new functionality
    • changed: when default behaviour of the code has been changed
    • fixes: when your contribution fixes a bug
    • maintenance: when your contribution relates to repo maintenance, e.g. CI/CD or documentation

Checklist for assignee

  • PR is up to date with the base branch
  • the tests pass
  • (if the PR is not just maintenance/bugfix) the PR is assigned to the next milestone. If it is not, propose it for a future milestone.
  • author has added an entry to the changelog (and designated the change as added, changed, fixed or maintenance)
  • Once the PR is ready to be merged, squash commits and merge the PR.

@joeloskarsson
Collaborator

@sadamov, assigning you here to decide later whether this should be closed in favor of #297, or what the best path forward is.

@sudhansu-24
Author

Thanks @sadamov for clarifying.

I’ll continue implementation/revisions on #580 and coordinate here.
@Shyam-Sunder-saini @techaadii, if you have any pending changes or preferences from your earlier work that should be included, please share them and I will incorporate them into this PR so we can converge quickly.

@Shyam-Sunder-saini

Thanks @sudhansu-24 for taking this forward!

From my side, I’ve aligned all training artifacts so they are now scoped under runs/<run-name>/, including checkpoints, Lightning logs, W&B, and MLflow outputs.

I also updated the logger setup so that both WandbLogger and CustomMLFlowLogger use the same save_dir (run directory). Additionally, MLflow now falls back to the run directory if MLFLOW_TRACKING_URI is not set.
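
A minimal sketch of what such a fallback could look like, assuming a local file-based store is acceptable when the environment variable is unset. The function name `resolve_mlflow_tracking_uri` and the `mlruns` subfolder are hypothetical and not taken from the actual change.

```python
import os
from pathlib import Path


def resolve_mlflow_tracking_uri(run_dir: str) -> str:
    """Prefer MLFLOW_TRACKING_URI; otherwise point at a store under run_dir."""
    uri = os.environ.get("MLFLOW_TRACKING_URI")
    if uri:
        return uri
    # Assumption: use a file-based store in an "mlruns" subfolder of the run
    # directory so that nothing is written to the current working directory.
    return (Path(run_dir).resolve() / "mlruns").as_uri()
```

The returned string could then be given to the MLflow logger as its tracking URI, so a remote server is used when configured and the run directory is used otherwise.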

Currently, only one logger is active at a time (default is W&B), and MLflow artifacts are generated when explicitly running with --logger mlflow.
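
The single-logger selection could be sketched as below. Only the `--logger mlflow` form appears in this thread; the `wandb` choice value and the helper name `parse_logger_choice` are assumptions for illustration.

```python
import argparse


def parse_logger_choice(argv: list[str]) -> str:
    """Parse which single logger to activate (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    # Assumption: W&B is selected by the value "wandb" and is the default.
    parser.add_argument("--logger", choices=["wandb", "mlflow"], default="wandb")
    return parser.parse_args(argv).logger
```

With no arguments this returns the assumed default `wandb`; passing `--logger mlflow` selects MLflow instead, so exactly one logger is constructed per run.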

If there are any preferences around structure or logging behavior from earlier work, I’m happy to incorporate them.

Let me know if you’d like me to push any additional changes to the PR.

@sudhansu-24
Author

sudhansu-24 commented Apr 8, 2026

Thanks @Shyam-Sunder-saini.

Could you share the exact changes you want added beyond the current #580 state (especially around the MLFLOW_TRACKING_URI fallback), either as a short checklist by file or as a commit/PR branch we can cherry-pick from?

If you post that, I will incorporate it quickly so we can finalize the review.

Development

Successfully merging this pull request may close these issues.

Consolidate Training/Evaluation Run Outputs into a Single runs/ Directory

4 participants