I'm probably missing something, but isn't this tokens per second per GPU rather than samples per second? https://github.com/mlfoundations/open_lm/blob/083fa31449c3456e889269e44913578acfced67a/open_lm/train.py#L282 inputs.numel() counts every token in the batch; for samples it would be inputs.shape[0], no?
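
To make the difference concrete, here is a rough sketch in plain Python. The batch size, sequence length, and step time are made-up values, not taken from open_lm, and it assumes inputs is a 2-D tensor of shape (batch, seq_len), so that numel() equals batch * seq_len:

```python
# Hypothetical values, chosen only for illustration (not from open_lm).
batch_size = 8     # what inputs.shape[0] would return for a (batch, seq_len) tensor
seq_len = 2048     # tokens per sample
batch_time = 0.5   # seconds spent on one training step

# For a 2-D tensor, numel() is the product of both dimensions.
tokens_in_batch = batch_size * seq_len   # what inputs.numel() would return
samples_in_batch = batch_size            # what inputs.shape[0] would return

tokens_per_second = tokens_in_batch / batch_time
samples_per_second = samples_in_batch / batch_time

print(f"tokens/s:  {tokens_per_second:.0f}")
print(f"samples/s: {samples_per_second:.0f}")
```

If that assumption holds, the two rates differ by exactly a factor of seq_len, so a metric labeled as samples per second but computed from numel() would be inflated by that factor.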