samples_per_second_per_gpu or tokens_per_second_per_gpu?

I'm probably missing something, but isn't the value computed here tokens per second per GPU rather than samples per second per GPU:
https://github.com/mlfoundations/open_lm/blob/083fa31449c3456e889269e44913578acfced67a/open_lm/train.py#L282

inputs.numel() counts every token in the batch; for the number of samples it would be inputs.shape[0], no?
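
To make the difference concrete, here is a minimal sketch of what I mean. It is not the open_lm code: it assumes inputs has shape (batch_size, seq_len), and the function name, the shape, and the timing are made up for illustration. It uses plain Python, relying only on the fact that numel() equals the product of the tensor's dimensions.

```python
from math import prod


def throughput_per_gpu(input_shape, batch_time_s):
    """Return (samples_per_second, tokens_per_second) for one GPU.

    input_shape is assumed to be (batch_size, seq_len); batch_time_s is the
    wall-clock time of one step in seconds. Both are hypothetical inputs.
    """
    num_samples = input_shape[0]    # what inputs.shape[0] would give
    num_tokens = prod(input_shape)  # what inputs.numel() would give
    return num_samples / batch_time_s, num_tokens / batch_time_s


# Made-up example: 8 sequences of 2048 tokens processed in 0.5 s.
samples_per_s, tokens_per_s = throughput_per_gpu((8, 2048), 0.5)
print(f"samples/s/gpu: {samples_per_s}, tokens/s/gpu: {tokens_per_s}")
```

Under these assumptions the two numbers differ by a factor of seq_len, so a value computed from numel() but logged under a samples_per_second name would be off by that factor.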