feat(truss): add training checkpoint viewer to truss cli#2271

Open
William-Gao1 wants to merge 3 commits into main from will/train-checkpoint-cli

William-Gao1 (Contributor) commented Mar 6, 2026

🚀 What

Adds the truss train checkpoints list <job_id> CLI command, which lists checkpoints for a training job with sorting, multiple output formats, and an interactive file explorer.

Features:

  • Non-interactive mode: Table, CSV, or JSON output with sort/order options
  • Interactive mode: Fuzzy-searchable file explorer with directory navigation, colored directory entries, checkpoint type annotations, safetensor header inspection, syntax-highlighted file viewing, and presigned URL display

💻 How

  • truss/cli/train/checkpoint.py: Viewer classes (CLI table, CSV, JSON) behind an ABC, interactive file explorer using InquirerPy's FuzzyPrompt with a custom _ColoredFuzzyControl for per-choice styling, safetensor header parsing via range requests, and pygments syntax highlighting
  • truss/cli/train_commands.py: truss train checkpoints list click command wiring
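
As a rough illustration of the "viewer classes behind an ABC" design described above, the following is a minimal sketch and not the PR's actual code. The class names, the `VIEWERS` registry, the `sort_checkpoints` helper, and the record fields (`checkpoint_id`, `size_bytes`, `created_at`) are all assumptions made for the example.

```python
import csv
import io
import json
from abc import ABC, abstractmethod


def sort_checkpoints(checkpoints, sort_by="created_at", descending=False):
    """Return a new list of checkpoint dicts ordered by one field (hypothetical helper)."""
    return sorted(checkpoints, key=lambda c: c[sort_by], reverse=descending)


class CheckpointViewer(ABC):
    """Common interface so the command can pick an output format at runtime."""

    @abstractmethod
    def render(self, checkpoints) -> str:
        ...


class CsvViewer(CheckpointViewer):
    # Assumed column set; the real command may expose different fields.
    FIELDS = ["checkpoint_id", "size_bytes", "created_at"]

    def render(self, checkpoints) -> str:
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=self.FIELDS, lineterminator="\n")
        writer.writeheader()
        writer.writerows(checkpoints)
        return buf.getvalue()


class JsonViewer(CheckpointViewer):
    def render(self, checkpoints) -> str:
        return json.dumps(checkpoints, indent=2)


# Hypothetical registry mapping a format name to its viewer class.
VIEWERS = {"csv": CsvViewer, "json": JsonViewer}

sample = [
    {"checkpoint_id": "checkpoint-200", "size_bytes": 2048, "created_at": "2026-03-06T12:00:00Z"},
    {"checkpoint_id": "checkpoint-100", "size_bytes": 1024, "created_at": "2026-03-06T11:00:00Z"},
]
ordered = sort_checkpoints(sample, sort_by="created_at")
csv_text = VIEWERS["csv"]().render(ordered)
json_text = VIEWERS["json"]().render(ordered)
print(csv_text)
```

Keeping sorting separate from rendering means each viewer only has to turn an already ordered list into text, which is one plausible reason to put the formats behind a shared base class.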

🔬 Testing

  • 16 unit tests covering sort ordering, all output formats, empty states, API errors, directory listing with/without checkpoint annotations, nested directories, file fetching
  • Manual testing of interactive explorer against live training jobs

William-Gao1 and others added 2 commits March 6, 2026 11:46
…navigation

Replace the flat file list view with a directory-aware file explorer for
checkpoint drill-down. Users can navigate nested directories (e.g. rank-0/,
rank-1/), view small files in a pager, and get download URLs for large files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…irs, and safetensor viewing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
William-Gao1 force-pushed the will/train-checkpoint-cli branch from 5bf0cef to 5dea218 on March 6, 2026 19:46
rcano-baseten (Contributor) left a comment

If I wanted to use this with the cache summary (which wouldn't support file previews), how hard would it be to make that work?

return f"{bytes} B"


def _normalize_iso_timestamp(iso_timestamp: str) -> str:
Contributor

Can you check if we do this in the codebase already?

Contributor Author

Good catch, I'll use that instead.


def _show_url(url: str) -> None:
    """Display a download URL and wait for left-arrow to dismiss."""
    from prompt_toolkit import Application
Contributor

Is there a reason we prefer inline imports here?

Contributor Author

moved

Comment on lines +386 to +404
choices: list[dict] = []
if path_stack:
    choices.append({"name": "..", "value": ("back", None)})
for d in sorted(dirs, key=lambda x: x["name"]):
    size_str = cli_common.format_bytes_to_human_readable(d["total_size"])
    label = f"{d['name']}/ ({size_str}, {d['file_count']} files)"
    if d.get("checkpoint_type"):
        ckpt_type = d["checkpoint_type"]
        base_model = d.get("base_model", "")
        annotation_parts = [ckpt_type]
        if base_model:
            annotation_parts.append(base_model)
        label += f" [{' \u00b7 '.join(annotation_parts)}]"
    choices.append({"name": label, "value": ("dir", d["name"])})
for f in sorted(dir_files, key=lambda x: x["_rel_path"]):
    name = f["_rel_path"].split("/")[-1]
    size_str = cli_common.format_bytes_to_human_readable(f.get("size_bytes", 0))
    choices.append({"name": f"{name} ({size_str})", "value": ("file", f)})
choices.append({"name": EXIT_OPTION, "value": ("exit", None)})
Contributor

It might read more cleanly if you put this in a subroutine.

Contributor Author

done
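
For readers following the thread, here is one way the quoted block could look once extracted into a subroutine. This is a sketch under assumptions, not the code that landed: `build_choices` and `format_size` are hypothetical names (the latter stands in for `cli_common.format_bytes_to_human_readable`), and the value of `EXIT_OPTION` is a guess.

```python
EXIT_OPTION = "Exit"  # assumed label; the PR's actual constant value is not shown
SEPARATOR = " \u00b7 "  # middle dot between annotation parts


def format_size(num_bytes: int) -> str:
    """Hypothetical stand-in for cli_common.format_bytes_to_human_readable."""
    return f"{num_bytes} B"


def build_choices(path_stack, dirs, dir_files):
    """Build the prompt choices for one directory level of the explorer."""
    choices = []
    if path_stack:
        # Only offer ".." when we are below the root.
        choices.append({"name": "..", "value": ("back", None)})
    for d in sorted(dirs, key=lambda x: x["name"]):
        label = f"{d['name']}/ ({format_size(d['total_size'])}, {d['file_count']} files)"
        if d.get("checkpoint_type"):
            parts = [d["checkpoint_type"]]
            if d.get("base_model"):
                parts.append(d["base_model"])
            label += f" [{SEPARATOR.join(parts)}]"
        choices.append({"name": label, "value": ("dir", d["name"])})
    for f in sorted(dir_files, key=lambda x: x["_rel_path"]):
        name = f["_rel_path"].split("/")[-1]
        size = format_size(f.get("size_bytes", 0))
        choices.append({"name": f"{name} ({size})", "value": ("file", f)})
    choices.append({"name": EXIT_OPTION, "value": ("exit", None)})
    return choices


choices = build_choices(
    path_stack=["rank-0"],
    dirs=[{"name": "checkpoint-100", "total_size": 10, "file_count": 2,
           "checkpoint_type": "lora", "base_model": "base"}],
    dir_files=[{"_rel_path": "rank-0/config.json", "size_bytes": 5}],
)
for choice in choices:
    print(choice["name"])
```

Hoisting the separator into a constant also keeps the backslash escape out of the f-string expression, which Python versions before 3.12 reject.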

    path_stack.append(payload)
elif action == "file":
    file_name = payload.get("relative_file_name", "")
    if file_name.endswith(".safetensors"):
Contributor

there are other weights files - do they provide the same headers as safetensors?

Contributor Author

No, only safetensors provides this metadata.

size_resp.raise_for_status()
header_size = struct.unpack("<Q", size_resp.content)[0]

if header_size > 10_000_000:
Contributor

what is this constant? Why does it need to exist? How frequently would we expect to see something bigger?

Contributor Author

It guards against a corrupt safetensors file. The header should really never be above a few KB, so 10 MB is more than enough.
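
To make the size check concrete: a safetensors file starts with an 8-byte little-endian unsigned integer giving the length of a JSON header, followed by the header itself. The sketch below shows the guard in isolation. The function name and the `read_range` callable are assumptions for illustration (in the PR the bytes come from HTTP range requests), and an in-memory blob stands in for the remote file.

```python
import json
import struct

MAX_HEADER_BYTES = 10_000_000  # same cap as the constant discussed above


def read_safetensors_header(read_range):
    """Parse a safetensors JSON header.

    read_range(start, end) is a hypothetical callable returning the bytes
    from start to end inclusive, e.g. backed by an HTTP Range request.
    """
    (header_size,) = struct.unpack("<Q", read_range(0, 7))
    if header_size > MAX_HEADER_BYTES:
        # A corrupt length prefix could otherwise trigger a huge download.
        raise ValueError(f"header of {header_size} bytes looks corrupt")
    raw = read_range(8, 8 + header_size - 1)
    return json.loads(raw)


# In-memory stand-in for a remote file: length prefix, JSON header, tensor data.
header = {
    "__metadata__": {"format": "pt"},
    "w": {"dtype": "F32", "shape": [2, 3], "data_offsets": [0, 24]},
}
payload = json.dumps(header).encode("utf-8")
blob = struct.pack("<Q", len(payload)) + payload + bytes(24)

parsed = read_safetensors_header(lambda start, end: blob[start : end + 1])
print(parsed["w"]["shape"])
```

Because the length prefix is read before the header, the cap can be enforced without fetching anything beyond the first 8 bytes.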

Comment on lines +528 to +537
if metadata:
    lines.append("Metadata:")
    for k, v in sorted(metadata.items()):
        lines.append(f" {k}: {v}")
    lines.append("")

lines.append(
    f"Tensors: {len(tensors)} | Parameters: {total_params:,} | Size: {cli_common.format_bytes_to_human_readable(total_bytes)}"
)
lines.append("")
Contributor

It might be better to build an object here and then define a serialization method on the class (e.g. __str__).
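
One possible shape for that suggestion is sketched below: the summary data lives in a small dataclass and the line-building logic moves into its `__str__` method. The class name `SafetensorsSummary` and its fields are hypothetical, and the size is printed in raw bytes rather than through the PR's formatting helper.

```python
from dataclasses import dataclass, field


@dataclass
class SafetensorsSummary:
    """Hypothetical container for the values the quoted code prints."""

    metadata: dict = field(default_factory=dict)
    tensor_count: int = 0
    total_params: int = 0
    total_bytes: int = 0

    def __str__(self) -> str:
        lines = []
        if self.metadata:
            lines.append("Metadata:")
            for key, value in sorted(self.metadata.items()):
                lines.append(f"  {key}: {value}")
            lines.append("")
        lines.append(
            f"Tensors: {self.tensor_count} | "
            f"Parameters: {self.total_params:,} | "
            f"Size: {self.total_bytes} B"
        )
        return "\n".join(lines)


summary = SafetensorsSummary(
    metadata={"format": "pt"}, tensor_count=2, total_params=1500, total_bytes=6000
)
text = str(summary)
print(text)
```

With this split, the parsing code only populates fields, and tests can assert on the object without matching formatted strings.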


def _highlight_content(content: str, file_name: str) -> str:
    """Apply syntax highlighting with ANSI escape codes if a lexer is available."""
    try:
Contributor

why do we need try/except here?

Contributor Author

get_lexer_for_filename can raise a ClassNotFound error when no lexer matches the file name.
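
The fallback being discussed can be sketched as follows. This is an illustration rather than the PR's implementation: the function name is changed to `highlight_content`, the choice of `TerminalFormatter` is an assumption, and the guarded import exists only so the sketch still runs where pygments is not installed.

```python
try:
    from pygments import highlight
    from pygments.formatters import TerminalFormatter
    from pygments.lexers import get_lexer_for_filename
    from pygments.util import ClassNotFound

    HAVE_PYGMENTS = True
except ImportError:  # keeps the sketch runnable without pygments
    HAVE_PYGMENTS = False


def highlight_content(content: str, file_name: str) -> str:
    """Return ANSI-highlighted content, or the content unchanged if no lexer matches."""
    if not HAVE_PYGMENTS:
        return content
    try:
        lexer = get_lexer_for_filename(file_name)
    except ClassNotFound:
        # Unknown extension: show the raw text rather than failing the viewer.
        return content
    return highlight(content, lexer, TerminalFormatter())


print(highlight_content("plain text", "notes.unknown-ext-xyz"))
```

Catching only `ClassNotFound` keeps the fallback narrow, so unrelated errors still surface.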

num_params = 1
for dim in shape:
    num_params *= dim
size_bytes = num_params * DTYPE_SIZES.get(dtype, 0)
Contributor

Is 0 a sane default here? Would it be better for us to communicate that we weren't able to calculate the size?

Contributor Author

changed to None so unknown dtypes surface as "?" in the size column

for dim in shape:
    num_params *= dim
size_bytes = num_params * DTYPE_SIZES.get(dtype, 0)
tensors.append((name, dtype, shape, num_params, size_bytes))
Contributor

it would be good to have a class, e.g. TensorSummary
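
A sketch of what such a class might look like, combined with the earlier change that reports unknown dtypes as "?" instead of 0 bytes. The field names, the `size_label` property, and the partial `DTYPE_SIZES` table are assumptions for illustration. The dtype tags are ones safetensors uses, but the list is not exhaustive.

```python
import math
from dataclasses import dataclass
from typing import Optional

# Bytes per element for a few safetensors dtype tags (illustrative, not exhaustive).
DTYPE_SIZES = {"F64": 8, "F32": 4, "F16": 2, "BF16": 2, "I64": 8, "I32": 4, "I8": 1, "U8": 1, "BOOL": 1}


@dataclass(frozen=True)
class TensorSummary:
    """Hypothetical replacement for the (name, dtype, shape, ...) tuple."""

    name: str
    dtype: str
    shape: tuple

    @property
    def num_params(self) -> int:
        # math.prod of an empty shape is 1, which matches a scalar tensor.
        return math.prod(self.shape)

    @property
    def size_bytes(self) -> Optional[int]:
        element_size = DTYPE_SIZES.get(self.dtype)
        return None if element_size is None else self.num_params * element_size

    @property
    def size_label(self) -> str:
        # Unknown dtypes surface as "?" rather than a misleading 0.
        return "?" if self.size_bytes is None else f"{self.size_bytes} B"


known = TensorSummary(name="w", dtype="F32", shape=(2, 3))
unknown = TensorSummary(name="q", dtype="MYSTERY", shape=(4,))
print(known.size_label, unknown.size_label)
```

Deriving `num_params` and `size_bytes` from the stored fields avoids keeping three values in sync by hand, at the cost of recomputing them on each access.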

pass

_open_in_pager(content, file_name)
except requests.RequestException as e:
Contributor

If we are in the CLI viewer and the request fails, what does the console print look like? Does it hijack the CLI viewer? Does it emit a log at the bottom? It would be good to see what this looks like (you can just console.print from a happy path to see).

Contributor Author

The console clears first, the error prints, and then the user is prompted to press Enter to go back to the file viewer.

William-Gao1 force-pushed the will/train-checkpoint-cli branch from 5dea218 to 26dee61 on March 9, 2026 21:14
rcano-baseten (Contributor) left a comment

Some comments on scoping of the command

Contributor

might be worth scoping this to checkpoint_viewer.py instead of checkpoint.py

)


@train.group(name="checkpoints")
Contributor

We're putting this at the highest level, e.g. truss checkpoints. What's the reason for not putting this in truss train? I think without train, job_id becomes unintuitive.



@checkpoints.command(name="list")
@click.argument("job_id", type=str)
Contributor

is job_id or job-id more canonical?

Contributor

why force a job id? Why not let them list all checkpoints?

    default=checkpoint_mod.SORT_BY_CREATED,
    help="Sort checkpoints by checkpoint-id, size, created date, or type.",
)
@click.option(
Contributor

do we want them to be able to look at a specific checkpoint? e.g. --checkpoint-name

Labels: None yet

Projects: None yet

2 participants