feat(truss): add training checkpoint viewer to truss cli #2271
William-Gao1 wants to merge 3 commits into main
…navigation Replace the flat file list view with a directory-aware file explorer for checkpoint drill-down. Users can navigate nested directories (e.g. rank-0/, rank-1/), view small files in a pager, and get download URLs for large files. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…irs, and safetensor viewing Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
5bf0cef to 5dea218
rcano-baseten left a comment
If I wanted to use this with the cache summary (which wouldn't support file previews), how hard would it be to make that work?
truss/cli/train/common.py
Outdated
    return f"{bytes} B"


def _normalize_iso_timestamp(iso_timestamp: str) -> str:
Can you check if we do this in the codebase already?
Good catch, I'll use that instead.
truss/cli/train/checkpoint.py
Outdated
def _show_url(url: str) -> None:
    """Display a download URL and wait for left-arrow to dismiss."""
    from prompt_toolkit import Application
Is there a reason we prefer inline imports here?
truss/cli/train/checkpoint.py
Outdated
    choices: list[dict] = []
    if path_stack:
        choices.append({"name": "..", "value": ("back", None)})
    for d in sorted(dirs, key=lambda x: x["name"]):
        size_str = cli_common.format_bytes_to_human_readable(d["total_size"])
        label = f"{d['name']}/ ({size_str}, {d['file_count']} files)"
        if d.get("checkpoint_type"):
            ckpt_type = d["checkpoint_type"]
            base_model = d.get("base_model", "")
            annotation_parts = [ckpt_type]
            if base_model:
                annotation_parts.append(base_model)
            label += f" [{' \u00b7 '.join(annotation_parts)}]"
        choices.append({"name": label, "value": ("dir", d["name"])})
    for f in sorted(dir_files, key=lambda x: x["_rel_path"]):
        name = f["_rel_path"].split("/")[-1]
        size_str = cli_common.format_bytes_to_human_readable(f.get("size_bytes", 0))
        choices.append({"name": f"{name} ({size_str})", "value": ("file", f)})
    choices.append({"name": EXIT_OPTION, "value": ("exit", None)})
It might read more cleanly if you put this in a subroutine.
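A minimal sketch of what that extraction could look like. The names `build_choices`, `_dir_label`, and `format_bytes` are hypothetical (`format_bytes` stands in for `cli_common.format_bytes_to_human_readable`), and the `EXIT_OPTION` value is a placeholder, so treat this as one possible shape rather than the PR's code.

```python
from typing import Any

EXIT_OPTION = "Exit"  # placeholder value; the real constant is defined in the PR


def format_bytes(num_bytes: int) -> str:
    """Hypothetical stand-in for cli_common.format_bytes_to_human_readable."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024:
            return f"{size:.0f} {unit}" if unit == "B" else f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} TB"


def _dir_label(d: dict[str, Any]) -> str:
    """Build the display label for one directory entry."""
    label = f"{d['name']}/ ({format_bytes(d['total_size'])}, {d['file_count']} files)"
    if d.get("checkpoint_type"):
        parts = [d["checkpoint_type"]]
        if d.get("base_model"):
            parts.append(d["base_model"])
        # Joining outside the f-string also avoids a backslash escape inside an
        # f-string expression, which is a SyntaxError before Python 3.12.
        annotation = " \u00b7 ".join(parts)
        label += f" [{annotation}]"
    return label


def build_choices(
    path_stack: list[str],
    dirs: list[dict[str, Any]],
    dir_files: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Return the prompt choices for the current directory level."""
    choices: list[dict[str, Any]] = []
    if path_stack:
        choices.append({"name": "..", "value": ("back", None)})
    for d in sorted(dirs, key=lambda x: x["name"]):
        choices.append({"name": _dir_label(d), "value": ("dir", d["name"])})
    for f in sorted(dir_files, key=lambda x: x["_rel_path"]):
        name = f["_rel_path"].split("/")[-1]
        size_str = format_bytes(f.get("size_bytes", 0))
        choices.append({"name": f"{name} ({size_str})", "value": ("file", f)})
    choices.append({"name": EXIT_OPTION, "value": ("exit", None)})
    return choices
```

Keeping the label logic in its own helper means the navigation loop only has to dispatch on the `(action, payload)` tuples.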
        path_stack.append(payload)
    elif action == "file":
        file_name = payload.get("relative_file_name", "")
        if file_name.endswith(".safetensors"):
There are other weights file formats. Do they provide the same headers as safetensors?
No, only safetensors provides this metadata.
truss/cli/train/checkpoint.py
Outdated
    size_resp.raise_for_status()
    header_size = struct.unpack("<Q", size_resp.content)[0]

    if header_size > 10_000_000:
What is this constant? Why does it need to exist? How frequently would we expect to see something bigger?
It's there in case the safetensors file is corrupt. The header should really never be above a few KB, so 10 MB is more than enough.
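For context, a safetensors file begins with an 8-byte little-endian unsigned integer giving the length of a JSON header, followed by the header itself. The sketch below shows the size guard in that setting. It works on bytes already in memory rather than on HTTP range responses, and `parse_safetensors_header` and `MAX_HEADER_SIZE` are illustrative names, not the PR's.

```python
import json
import struct

MAX_HEADER_SIZE = 10_000_000  # same guard value as in the hunk above


def parse_safetensors_header(prefix: bytes, max_header_size: int = MAX_HEADER_SIZE) -> dict:
    """Parse the JSON header from the leading bytes of a safetensors file."""
    if len(prefix) < 8:
        raise ValueError("need at least 8 bytes to read the header length")
    # The first 8 bytes hold the header length as a little-endian uint64.
    (header_size,) = struct.unpack("<Q", prefix[:8])
    # A corrupt or non-safetensors file can declare an absurd length; reject it
    # before trying to fetch or decode that many bytes.
    if header_size > max_header_size:
        raise ValueError(
            f"header size {header_size} exceeds limit {max_header_size}; file may be corrupt"
        )
    header_bytes = prefix[8 : 8 + header_size]
    if len(header_bytes) < header_size:
        raise ValueError("prefix is shorter than the declared header size")
    return json.loads(header_bytes)


# Build a tiny in-memory file: length prefix, JSON header, then tensor data.
payload = json.dumps(
    {
        "__metadata__": {"format": "pt"},
        "w": {"dtype": "F32", "shape": [2, 3], "data_offsets": [0, 24]},
    }
).encode("utf-8")
blob = struct.pack("<Q", len(payload)) + payload + b"\x00" * 24
header = parse_safetensors_header(blob)
```

Without the guard, a garbage length prefix would turn into a request for an arbitrarily large byte range.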
truss/cli/train/checkpoint.py
Outdated
    if metadata:
        lines.append("Metadata:")
        for k, v in sorted(metadata.items()):
            lines.append(f" {k}: {v}")
        lines.append("")

    lines.append(
        f"Tensors: {len(tensors)} | Parameters: {total_params:,} | Size: {cli_common.format_bytes_to_human_readable(total_bytes)}"
    )
    lines.append("")
It might be better to build an object here and then define a serialization method on the class (e.g. __str__).
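One way to read that suggestion, as a sketch only: collect the values in a small dataclass and let `__str__` own the formatting. `SafetensorsSummary` and its field names are assumptions for illustration, and the size is printed as a plain byte count instead of going through `cli_common.format_bytes_to_human_readable`.

```python
from dataclasses import dataclass, field


@dataclass
class SafetensorsSummary:
    """Hypothetical container for the values the viewer prints."""

    tensor_count: int
    total_params: int
    total_bytes: int
    metadata: dict[str, str] = field(default_factory=dict)

    def __str__(self) -> str:
        lines: list[str] = []
        if self.metadata:
            lines.append("Metadata:")
            for key, value in sorted(self.metadata.items()):
                lines.append(f"  {key}: {value}")
            lines.append("")
        lines.append(
            f"Tensors: {self.tensor_count} | "
            f"Parameters: {self.total_params:,} | "
            f"Size: {self.total_bytes:,} B"
        )
        return "\n".join(lines)


summary = SafetensorsSummary(2, 1500, 6000, {"format": "pt"})
text = str(summary)
```

The calling code then only needs `str(summary)`, and the formatting can be unit-tested without a pager.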
def _highlight_content(content: str, file_name: str) -> str:
    """Apply syntax highlighting with ANSI escape codes if a lexer is available."""
    try:
Why do we need try/except here?
get_lexer_for_filename can raise a ClassNotFound error when no lexer matches the file name.
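A sketch of that fallback, narrowed to the one exception pygments raises when no lexer matches. The function name `highlight_content` and the choice of `TerminalFormatter` are assumptions, and the guarded import exists only so the example runs without pygments installed. The PR's actual formatter is not shown in the hunk.

```python
try:
    from pygments import highlight
    from pygments.formatters import TerminalFormatter
    from pygments.lexers import get_lexer_for_filename
    from pygments.util import ClassNotFound

    HAVE_PYGMENTS = True
except ImportError:  # keeps this sketch importable without pygments
    HAVE_PYGMENTS = False


def highlight_content(content: str, file_name: str) -> str:
    """Return ANSI-highlighted content, or the content unchanged if no lexer fits."""
    if not HAVE_PYGMENTS:
        return content
    try:
        lexer = get_lexer_for_filename(file_name)
    except ClassNotFound:
        # Unknown extension: show the raw text rather than failing the viewer.
        return content
    return highlight(content, lexer, TerminalFormatter())


plain = highlight_content("hello", "notes.unknownext")
colored = highlight_content("x = 1\n", "example.py")
```

Catching only `ClassNotFound` keeps unrelated errors visible instead of silently downgrading to plain text.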
truss/cli/train/checkpoint.py
Outdated
    num_params = 1
    for dim in shape:
        num_params *= dim
    size_bytes = num_params * DTYPE_SIZES.get(dtype, 0)
Is 0 a sane default here? Would it be better for us to communicate that we weren't able to calculate the size?
Changed to None so unknown dtypes surface as "?" in the size column.
truss/cli/train/checkpoint.py
Outdated
    for dim in shape:
        num_params *= dim
    size_bytes = num_params * DTYPE_SIZES.get(dtype, 0)
    tensors.append((name, dtype, shape, num_params, size_bytes))
It would be good to have a class, e.g. TensorSummary.
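A possible shape for such a class, sketched under assumptions: the `DTYPE_SIZES` table here is a deliberately partial stand-in for the PR's mapping, and the property names are illustrative. It also folds in the earlier point about returning `None` for unknown dtypes so the size column can show "?".

```python
from dataclasses import dataclass
from math import prod
from typing import Optional

# Bytes per element for a few safetensors dtype names; intentionally incomplete.
DTYPE_SIZES = {
    "F64": 8, "F32": 4, "F16": 2, "BF16": 2,
    "I64": 8, "I32": 4, "I16": 2, "I8": 1, "U8": 1, "BOOL": 1,
}


@dataclass(frozen=True)
class TensorSummary:
    """Replaces the (name, dtype, shape, num_params, size_bytes) tuple."""

    name: str
    dtype: str
    shape: tuple[int, ...]

    @property
    def num_params(self) -> int:
        # prod(()) is 1, which is the right element count for a scalar tensor.
        return prod(self.shape)

    @property
    def size_bytes(self) -> Optional[int]:
        element_size = DTYPE_SIZES.get(self.dtype)
        if element_size is None:
            return None  # unknown dtype: size cannot be computed
        return self.num_params * element_size

    @property
    def size_label(self) -> str:
        return "?" if self.size_bytes is None else f"{self.size_bytes:,} B"


known = TensorSummary("w", "F32", (2, 3))
unknown = TensorSummary("q", "F8_E4M3", (4,))
```

Named fields make the table-rendering code easier to follow than indexing into a five-element tuple.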
            pass

        _open_in_pager(content, file_name)
    except requests.RequestException as e:
If we are in the CLI viewer and the request fails, what does the console print look like? Does it hijack the CLI viewer? Does it emit a log at the bottom? It would be good to see what this looks like (you can just console.print from a happy path to see).
The console will clear first, then the error will print, and the user is prompted to press Enter to go back to the file viewer.
5dea218 to 26dee61
rcano-baseten left a comment
Some comments on scoping of the command
Might be worth scoping this to checkpoint_viewer.py instead of checkpoint.py.
)


@train.group(name="checkpoints")
We're putting this at the highest level, e.g. truss checkpoints. What's the reason for not putting this in truss train? I think without train, job_id becomes unintuitive.
@checkpoints.command(name="list")
@click.argument("job_id", type=str)
Is job_id or job-id more canonical?
Why force a job ID? Why not let them list all checkpoints?
    default=checkpoint_mod.SORT_BY_CREATED,
    help="Sort checkpoints by checkpoint-id, size, created date, or type.",
)
@click.option(
Do we want them to be able to look at a specific checkpoint, e.g. --checkpoint-name?
🚀 What
Adds the truss train checkpoints list <job_id> CLI command, which lists checkpoints for a training job with sorting, multiple output formats, and an interactive file explorer.
Features:
💻 How
truss/cli/train/checkpoint.py: Viewer classes (CLI table, CSV, JSON) behind an ABC, interactive file explorer using InquirerPy's FuzzyPrompt with a custom _ColoredFuzzyControl for per-choice styling, safetensor header parsing via range requests, and pygments syntax highlighting
truss/cli/train_commands.py: truss train checkpoints list click command wiring
🔬 Testing