Struggling to reproduce reference performance, much longer step times #839

@alat-rights


I'm on a fresh Runpod 8xH100 instance with the runpod/parameter-golf:latest (y5cejece4j) template.

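I cannot reproduce the reference performance for either the naive baseline or 2026-03-23_LeakyReLU_LegalTTT_ParallelMuon, even with identical seeds (1337) and the provided training code and invocations. Step times are much longer: my naive-baseline run averages roughly 65 ms per step, versus roughly 43 ms per step in the reference log.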

Logs for the naive baseline follow. First, the nvidia-smi output:

root@72de56139f89:/workspace/parameter-golf# nvidia-smi
Thu Mar 26 10:28:24 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.211.01             Driver Version: 570.211.01     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:18:00.0 Off |                    0 |
| N/A   22C    P0             72W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:2A:00.0 Off |                    0 |
| N/A   22C    P0             69W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  |   00000000:3A:00.0 Off |                    0 |
| N/A   23C    P0             73W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  |   00000000:5D:00.0 Off |                    0 |
| N/A   21C    P0             69W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  |   00000000:9A:00.0 Off |                    0 |
| N/A   20C    P0             69W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  |   00000000:AB:00.0 Off |                    0 |
| N/A   22C    P0             71W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  |   00000000:BA:00.0 Off |                    0 |
| N/A   21C    P0             71W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  |   00000000:DB:00.0 Off |                    0 |
| N/A   19C    P0             71W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

My training log for the naive baseline:


W0326 10:07:52.181000 38731 torch/distributed/run.py:803] 
W0326 10:07:52.181000 38731 torch/distributed/run.py:803] *****************************************
W0326 10:07:52.181000 38731 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0326 10:07:52.181000 38731 torch/distributed/run.py:803] *****************************************
logs/hf_verify_sp1024_8gpu.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=/workspace/parameter-golf/data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:80
val_loader:shards pattern=/workspace/parameter-golf/data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
model_params:17059912
world_size:8 grad_accum_steps:1
sdp_backends:cudnn=False flash=True mem_efficient=False math=False
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.05 head_lr:0.0 matrix_lr:0.04 scalar_lr:0.04
train_batch_tokens:524288 train_seq_len:1024 iterations:20000 warmup_steps:20 max_wallclock_seconds:600.000
seed:1337
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
step:0/20000 val_loss:6.9357 val_bpb:4.1077 train_time:0ms step_avg:0.02ms
step:1/20000 train_loss:6.9370 train_time:41ms step_avg:41.14ms
step:2/20000 train_loss:16.8366 train_time:93ms step_avg:46.31ms
step:3/20000 train_loss:8.7609 train_time:147ms step_avg:48.87ms
step:4/20000 train_loss:6.6385 train_time:205ms step_avg:51.23ms
step:5/20000 train_loss:6.6117 train_time:262ms step_avg:52.38ms
step:6/20000 train_loss:7.4221 train_time:321ms step_avg:53.47ms
step:7/20000 train_loss:6.3509 train_time:403ms step_avg:57.57ms
step:8/20000 train_loss:6.1585 train_time:470ms step_avg:58.76ms
step:9/20000 train_loss:6.0680 train_time:541ms step_avg:60.06ms
step:10/20000 train_loss:5.9747 train_time:613ms step_avg:61.34ms
step:50/20000 train_loss:4.1046 train_time:2364ms step_avg:47.29ms
step:100/20000 train_loss:3.4084 train_time:4542ms step_avg:45.42ms
step:150/20000 train_loss:3.0593 train_time:6722ms step_avg:44.81ms
step:200/20000 train_loss:2.8436 train_time:12222ms step_avg:61.11ms
step:200/20000 val_loss:2.8381 val_bpb:1.6809 train_time:12259ms step_avg:61.30ms
step:250/20000 train_loss:2.7501 train_time:14419ms step_avg:57.67ms
step:300/20000 train_loss:2.4808 train_time:17439ms step_avg:58.13ms
step:350/20000 train_loss:2.6710 train_time:19618ms step_avg:56.05ms
step:400/20000 train_loss:2.3628 train_time:25779ms step_avg:64.45ms
step:400/20000 val_loss:2.5640 val_bpb:1.5186 train_time:25817ms step_avg:64.54ms
step:450/20000 train_loss:2.5109 train_time:27981ms step_avg:62.18ms
step:500/20000 train_loss:2.5001 train_time:30847ms step_avg:61.69ms
step:550/20000 train_loss:2.3996 train_time:33032ms step_avg:60.06ms
step:600/20000 train_loss:2.5468 train_time:38617ms step_avg:64.36ms
step:600/20000 val_loss:2.4480 val_bpb:1.4498 train_time:38640ms step_avg:64.40ms
step:650/20000 train_loss:2.3817 train_time:40804ms step_avg:62.77ms
step:700/20000 train_loss:2.4430 train_time:42993ms step_avg:61.42ms
step:750/20000 train_loss:2.2734 train_time:45185ms step_avg:60.25ms
step:800/20000 train_loss:2.2909 train_time:50714ms step_avg:63.39ms
step:800/20000 val_loss:2.3806 val_bpb:1.4099 train_time:50736ms step_avg:63.42ms
step:850/20000 train_loss:2.7210 train_time:52900ms step_avg:62.24ms
step:900/20000 train_loss:2.3387 train_time:55100ms step_avg:61.22ms
step:950/20000 train_loss:2.3968 train_time:57300ms step_avg:60.32ms
step:1000/20000 train_loss:2.3723 train_time:64450ms step_avg:64.45ms
step:1000/20000 val_loss:2.3352 val_bpb:1.3830 train_time:64474ms step_avg:64.47ms
step:1050/20000 train_loss:2.4881 train_time:66641ms step_avg:63.47ms
step:1100/20000 train_loss:2.2629 train_time:68830ms step_avg:62.57ms
step:1150/20000 train_loss:2.2566 train_time:75617ms step_avg:65.75ms
step:1200/20000 train_loss:2.3887 train_time:77807ms step_avg:64.84ms
step:1200/20000 val_loss:2.3032 val_bpb:1.3641 train_time:77832ms step_avg:64.86ms
step:1250/20000 train_loss:2.2103 train_time:79996ms step_avg:64.00ms
step:1300/20000 train_loss:2.3581 train_time:82189ms step_avg:63.22ms
step:1350/20000 train_loss:2.2702 train_time:87208ms step_avg:64.60ms
step:1400/20000 train_loss:2.4319 train_time:89401ms step_avg:63.86ms
step:1400/20000 val_loss:2.2820 val_bpb:1.3515 train_time:89423ms step_avg:63.87ms
step:1450/20000 train_loss:2.2385 train_time:91589ms step_avg:63.16ms
step:1500/20000 train_loss:2.2232 train_time:93784ms step_avg:62.52ms
step:1550/20000 train_loss:2.1570 train_time:100390ms step_avg:64.77ms
step:1600/20000 train_loss:2.1008 train_time:102573ms step_avg:64.11ms
step:1600/20000 val_loss:2.2662 val_bpb:1.3422 train_time:102598ms step_avg:64.12ms
step:1650/20000 train_loss:2.2292 train_time:104762ms step_avg:63.49ms
step:1700/20000 train_loss:2.1752 train_time:106953ms step_avg:62.91ms
step:1750/20000 train_loss:2.2534 train_time:112810ms step_avg:64.46ms
step:1800/20000 train_loss:2.2015 train_time:115004ms step_avg:63.89ms
step:1800/20000 val_loss:2.2513 val_bpb:1.3333 train_time:115030ms step_avg:63.91ms
step:1850/20000 train_loss:2.3017 train_time:117199ms step_avg:63.35ms
step:1900/20000 train_loss:2.1939 train_time:119395ms step_avg:62.84ms
step:1950/20000 train_loss:2.2104 train_time:125293ms step_avg:64.25ms
step:2000/20000 train_loss:2.2502 train_time:127487ms step_avg:63.74ms
step:2000/20000 val_loss:2.2357 val_bpb:1.3241 train_time:127512ms step_avg:63.76ms
step:2050/20000 train_loss:2.2524 train_time:129682ms step_avg:63.26ms
step:2100/20000 train_loss:2.2698 train_time:137123ms step_avg:65.30ms
step:2150/20000 train_loss:2.1906 train_time:139329ms step_avg:64.80ms
step:2200/20000 train_loss:2.0769 train_time:141533ms step_avg:64.33ms
step:2200/20000 val_loss:2.2273 val_bpb:1.3192 train_time:141579ms step_avg:64.35ms
step:2250/20000 train_loss:2.1630 train_time:143747ms step_avg:63.89ms
step:2300/20000 train_loss:2.3797 train_time:149807ms step_avg:65.13ms
step:2350/20000 train_loss:2.2003 train_time:152011ms step_avg:64.69ms
step:2400/20000 train_loss:2.1982 train_time:154212ms step_avg:64.26ms
step:2400/20000 val_loss:2.2171 val_bpb:1.3131 train_time:154250ms step_avg:64.27ms
step:2450/20000 train_loss:2.2077 train_time:156411ms step_avg:63.84ms
step:2500/20000 train_loss:2.1232 train_time:162731ms step_avg:65.09ms
step:2550/20000 train_loss:2.1367 train_time:164917ms step_avg:64.67ms
step:2600/20000 train_loss:2.4127 train_time:167115ms step_avg:64.28ms
step:2600/20000 val_loss:2.2175 val_bpb:1.3133 train_time:167138ms step_avg:64.28ms
step:2650/20000 train_loss:2.2415 train_time:169303ms step_avg:63.89ms
step:2700/20000 train_loss:2.1581 train_time:177437ms step_avg:65.72ms
step:2750/20000 train_loss:2.3616 train_time:179624ms step_avg:65.32ms
step:2800/20000 train_loss:2.2357 train_time:181812ms step_avg:64.93ms
step:2800/20000 val_loss:2.2028 val_bpb:1.3046 train_time:181906ms step_avg:64.97ms
step:2850/20000 train_loss:2.1887 train_time:184070ms step_avg:64.59ms
step:2900/20000 train_loss:2.1762 train_time:189909ms step_avg:65.49ms
step:2950/20000 train_loss:2.2392 train_time:192104ms step_avg:65.12ms
step:3000/20000 train_loss:2.2283 train_time:194296ms step_avg:64.77ms
step:3000/20000 val_loss:2.1952 val_bpb:1.3001 train_time:195206ms step_avg:65.07ms
step:3050/20000 train_loss:2.1716 train_time:197366ms step_avg:64.71ms
step:3100/20000 train_loss:2.2066 train_time:203719ms step_avg:65.72ms
step:3150/20000 train_loss:2.1595 train_time:205914ms step_avg:65.37ms
step:3200/20000 train_loss:2.1908 train_time:208116ms step_avg:65.04ms
step:3200/20000 val_loss:2.1901 val_bpb:1.2971 train_time:208156ms step_avg:65.05ms
step:3250/20000 train_loss:2.0910 train_time:215242ms step_avg:66.23ms
step:3300/20000 train_loss:2.2378 train_time:217437ms step_avg:65.89ms
step:3350/20000 train_loss:2.0963 train_time:219626ms step_avg:65.56ms
step:3400/20000 train_loss:2.1563 train_time:221809ms step_avg:65.24ms
step:3400/20000 val_loss:2.1869 val_bpb:1.2952 train_time:221831ms step_avg:65.24ms
step:3450/20000 train_loss:2.1115 train_time:229289ms step_avg:66.46ms
step:3500/20000 train_loss:2.2524 train_time:231474ms step_avg:66.14ms
step:3550/20000 train_loss:2.3867 train_time:233669ms step_avg:65.82ms
step:3600/20000 train_loss:2.1157 train_time:235875ms step_avg:65.52ms
step:3600/20000 val_loss:2.1791 val_bpb:1.2906 train_time:235903ms step_avg:65.53ms
step:3650/20000 train_loss:2.2171 train_time:241787ms step_avg:66.24ms
step:3700/20000 train_loss:2.1504 train_time:243979ms step_avg:65.94ms
step:3750/20000 train_loss:2.1465 train_time:246173ms step_avg:65.65ms
step:3800/20000 train_loss:2.2223 train_time:248367ms step_avg:65.36ms
step:3800/20000 val_loss:2.1752 val_bpb:1.2883 train_time:248390ms step_avg:65.37ms
step:3850/20000 train_loss:2.1783 train_time:254159ms step_avg:66.02ms
step:3900/20000 train_loss:1.9947 train_time:256347ms step_avg:65.73ms
step:3950/20000 train_loss:2.1285 train_time:258533ms step_avg:65.45ms
step:4000/20000 train_loss:2.1629 train_time:260724ms step_avg:65.18ms
step:4000/20000 val_loss:2.1705 val_bpb:1.2855 train_time:260757ms step_avg:65.19ms
step:4050/20000 train_loss:2.1010 train_time:266555ms step_avg:65.82ms
step:4100/20000 train_loss:2.1858 train_time:268748ms step_avg:65.55ms
step:4150/20000 train_loss:2.3291 train_time:270932ms step_avg:65.28ms
step:4200/20000 train_loss:2.1738 train_time:276543ms step_avg:65.84ms
step:4200/20000 val_loss:2.1666 val_bpb:1.2832 train_time:276557ms step_avg:65.85ms
step:4250/20000 train_loss:2.1285 train_time:278756ms step_avg:65.59ms
step:4300/20000 train_loss:2.0297 train_time:280979ms step_avg:65.34ms
step:4350/20000 train_loss:2.2138 train_time:283203ms step_avg:65.10ms
step:4400/20000 train_loss:2.1096 train_time:290188ms step_avg:65.95ms
step:4400/20000 val_loss:2.1664 val_bpb:1.2831 train_time:290211ms step_avg:65.96ms
step:4450/20000 train_loss:2.0655 train_time:292407ms step_avg:65.71ms
step:4500/20000 train_loss:2.2575 train_time:294619ms step_avg:65.47ms
step:4550/20000 train_loss:2.0548 train_time:296809ms step_avg:65.23ms
step:4600/20000 train_loss:1.9728 train_time:302894ms step_avg:65.85ms
step:4600/20000 val_loss:2.1624 val_bpb:1.2807 train_time:302918ms step_avg:65.85ms
step:4650/20000 train_loss:2.0727 train_time:305083ms step_avg:65.61ms
step:4700/20000 train_loss:2.2706 train_time:307271ms step_avg:65.38ms
step:4750/20000 train_loss:1.9790 train_time:309461ms step_avg:65.15ms
step:4800/20000 train_loss:2.2608 train_time:315638ms step_avg:65.76ms
step:4800/20000 val_loss:2.1583 val_bpb:1.2783 train_time:315661ms step_avg:65.76ms
step:4850/20000 train_loss:2.1497 train_time:317826ms step_avg:65.53ms
step:4900/20000 train_loss:2.1648 train_time:320015ms step_avg:65.31ms
step:4950/20000 train_loss:2.3433 train_time:322208ms step_avg:65.09ms
step:5000/20000 train_loss:2.0301 train_time:328572ms step_avg:65.71ms
step:5000/20000 val_loss:2.1535 val_bpb:1.2754 train_time:328597ms step_avg:65.72ms
step:5050/20000 train_loss:2.2050 train_time:330762ms step_avg:65.50ms
step:5100/20000 train_loss:2.0267 train_time:332946ms step_avg:65.28ms
step:5150/20000 train_loss:2.2798 train_time:339056ms step_avg:65.84ms
step:5200/20000 train_loss:2.1755 train_time:341258ms step_avg:65.63ms
step:5200/20000 val_loss:2.1542 val_bpb:1.2758 train_time:341283ms step_avg:65.63ms
step:5250/20000 train_loss:2.1231 train_time:343446ms step_avg:65.42ms
step:5300/20000 train_loss:2.2174 train_time:345632ms step_avg:65.21ms
step:5350/20000 train_loss:2.1410 train_time:351113ms step_avg:65.63ms
step:5400/20000 train_loss:2.1844 train_time:353348ms step_avg:65.43ms
step:5400/20000 val_loss:2.1496 val_bpb:1.2731 train_time:353367ms step_avg:65.44ms
step:5450/20000 train_loss:2.1989 train_time:355535ms step_avg:65.24ms
step:5500/20000 train_loss:2.1371 train_time:357727ms step_avg:65.04ms
step:5550/20000 train_loss:2.1076 train_time:363334ms step_avg:65.47ms
step:5600/20000 train_loss:2.1843 train_time:365526ms step_avg:65.27ms
step:5600/20000 val_loss:2.1500 val_bpb:1.2734 train_time:365549ms step_avg:65.28ms
step:5650/20000 train_loss:2.0571 train_time:367715ms step_avg:65.08ms
step:5700/20000 train_loss:2.1784 train_time:369906ms step_avg:64.90ms
step:5750/20000 train_loss:2.2154 train_time:375724ms step_avg:65.34ms
step:5800/20000 train_loss:2.1433 train_time:377912ms step_avg:65.16ms
step:5800/20000 val_loss:2.1469 val_bpb:1.2715 train_time:377936ms step_avg:65.16ms
step:5850/20000 train_loss:2.1833 train_time:380100ms step_avg:64.97ms
step:5900/20000 train_loss:2.0915 train_time:382302ms step_avg:64.80ms
step:5950/20000 train_loss:2.1358 train_time:388213ms step_avg:65.25ms
step:6000/20000 train_loss:2.2211 train_time:390455ms step_avg:65.08ms
step:6000/20000 val_loss:2.1449 val_bpb:1.2703 train_time:390475ms step_avg:65.08ms
step:6050/20000 train_loss:2.1209 train_time:392668ms step_avg:64.90ms
step:6100/20000 train_loss:2.1219 train_time:394883ms step_avg:64.73ms
step:6150/20000 train_loss:2.0993 train_time:401261ms step_avg:65.25ms
step:6200/20000 train_loss:2.0869 train_time:403473ms step_avg:65.08ms
step:6200/20000 val_loss:2.1427 val_bpb:1.2690 train_time:403494ms step_avg:65.08ms
step:6250/20000 train_loss:2.1561 train_time:405679ms step_avg:64.91ms
step:6300/20000 train_loss:2.0395 train_time:411922ms step_avg:65.38ms
step:6350/20000 train_loss:2.0195 train_time:414125ms step_avg:65.22ms
step:6400/20000 train_loss:2.1599 train_time:416352ms step_avg:65.06ms
step:6400/20000 val_loss:2.1393 val_bpb:1.2670 train_time:416394ms step_avg:65.06ms
step:6450/20000 train_loss:2.0753 train_time:418559ms step_avg:64.89ms
step:6500/20000 train_loss:2.0809 train_time:426220ms step_avg:65.57ms
step:6550/20000 train_loss:2.2109 train_time:428424ms step_avg:65.41ms
step:6600/20000 train_loss:2.1235 train_time:430623ms step_avg:65.25ms
step:6600/20000 val_loss:2.1362 val_bpb:1.2652 train_time:430657ms step_avg:65.25ms
step:6650/20000 train_loss:2.2952 train_time:432825ms step_avg:65.09ms
step:6700/20000 train_loss:2.1620 train_time:439261ms step_avg:65.56ms
step:6750/20000 train_loss:2.3302 train_time:441450ms step_avg:65.40ms
step:6800/20000 train_loss:2.1887 train_time:443640ms step_avg:65.24ms
step:6800/20000 val_loss:2.1360 val_bpb:1.2651 train_time:443662ms step_avg:65.24ms
step:6850/20000 train_loss:2.0244 train_time:445832ms step_avg:65.09ms
step:6900/20000 train_loss:2.0972 train_time:451414ms step_avg:65.42ms
step:6950/20000 train_loss:2.1799 train_time:453602ms step_avg:65.27ms
step:7000/20000 train_loss:2.2235 train_time:455859ms step_avg:65.12ms
step:7000/20000 val_loss:2.1329 val_bpb:1.2632 train_time:455884ms step_avg:65.13ms
step:7050/20000 train_loss:2.2507 train_time:458047ms step_avg:64.97ms
step:7100/20000 train_loss:2.0697 train_time:464827ms step_avg:65.47ms
step:7150/20000 train_loss:2.1542 train_time:467014ms step_avg:65.32ms
step:7200/20000 train_loss:2.2003 train_time:469201ms step_avg:65.17ms
step:7200/20000 val_loss:2.1329 val_bpb:1.2632 train_time:469227ms step_avg:65.17ms
step:7250/20000 train_loss:2.1001 train_time:475975ms step_avg:65.65ms
step:7300/20000 train_loss:2.0894 train_time:478170ms step_avg:65.50ms
step:7350/20000 train_loss:2.1803 train_time:480359ms step_avg:65.35ms
step:7400/20000 train_loss:2.1214 train_time:482546ms step_avg:65.21ms
step:7400/20000 val_loss:2.1293 val_bpb:1.2611 train_time:483026ms step_avg:65.27ms
step:7450/20000 train_loss:2.1153 train_time:489585ms step_avg:65.72ms
step:7500/20000 train_loss:2.1144 train_time:491792ms step_avg:65.57ms
step:7550/20000 train_loss:2.1721 train_time:494230ms step_avg:65.46ms
step:7600/20000 train_loss:1.9986 train_time:496419ms step_avg:65.32ms
step:7600/20000 val_loss:2.1283 val_bpb:1.2605 train_time:496443ms step_avg:65.32ms
step:7650/20000 train_loss:2.2902 train_time:502545ms step_avg:65.69ms
step:7700/20000 train_loss:2.0891 train_time:504741ms step_avg:65.55ms
step:7750/20000 train_loss:2.1101 train_time:506979ms step_avg:65.42ms
step:7800/20000 train_loss:2.1444 train_time:509172ms step_avg:65.28ms
step:7800/20000 val_loss:2.1257 val_bpb:1.2590 train_time:509213ms step_avg:65.28ms
step:7850/20000 train_loss:1.9981 train_time:514829ms step_avg:65.58ms
step:7900/20000 train_loss:2.1303 train_time:517029ms step_avg:65.45ms
step:7950/20000 train_loss:2.0969 train_time:519240ms step_avg:65.31ms
step:8000/20000 train_loss:2.1145 train_time:521446ms step_avg:65.18ms
step:8000/20000 val_loss:2.1236 val_bpb:1.2577 train_time:521478ms step_avg:65.18ms
step:8050/20000 train_loss:2.0790 train_time:527243ms step_avg:65.50ms
step:8100/20000 train_loss:2.1468 train_time:529432ms step_avg:65.36ms
step:8150/20000 train_loss:2.2525 train_time:531621ms step_avg:65.23ms
step:8200/20000 train_loss:2.1836 train_time:533816ms step_avg:65.10ms
step:8200/20000 val_loss:2.1161 val_bpb:1.2532 train_time:533840ms step_avg:65.10ms
step:8250/20000 train_loss:2.1361 train_time:539681ms step_avg:65.42ms
step:8300/20000 train_loss:2.1134 train_time:541888ms step_avg:65.29ms
step:8350/20000 train_loss:2.2161 train_time:544084ms step_avg:65.16ms
step:8400/20000 train_loss:2.1214 train_time:549747ms step_avg:65.45ms
step:8400/20000 val_loss:2.1075 val_bpb:1.2482 train_time:549770ms step_avg:65.45ms
step:8450/20000 train_loss:2.2130 train_time:551942ms step_avg:65.32ms
step:8500/20000 train_loss:2.1096 train_time:554150ms step_avg:65.19ms
step:8550/20000 train_loss:2.1797 train_time:556356ms step_avg:65.07ms
step:8600/20000 train_loss:2.1220 train_time:562016ms step_avg:65.35ms
step:8600/20000 val_loss:2.0976 val_bpb:1.2423 train_time:562039ms step_avg:65.35ms
step:8650/20000 train_loss:2.0762 train_time:564239ms step_avg:65.23ms
step:8700/20000 train_loss:2.0090 train_time:566459ms step_avg:65.11ms
step:8750/20000 train_loss:2.1720 train_time:568680ms step_avg:64.99ms
step:8800/20000 train_loss:2.0775 train_time:574264ms step_avg:65.26ms
step:8800/20000 val_loss:2.0900 val_bpb:1.2378 train_time:574295ms step_avg:65.26ms
step:8850/20000 train_loss:2.2884 train_time:576460ms step_avg:65.14ms
step:8900/20000 train_loss:2.1719 train_time:578676ms step_avg:65.02ms
step:8950/20000 train_loss:2.1322 train_time:580878ms step_avg:64.90ms
step:9000/20000 train_loss:1.9979 train_time:586592ms step_avg:65.18ms
step:9000/20000 val_loss:2.0827 val_bpb:1.2335 train_time:586620ms step_avg:65.18ms
step:9050/20000 train_loss:2.0320 train_time:588783ms step_avg:65.06ms
step:9100/20000 train_loss:2.2803 train_time:590976ms step_avg:64.94ms
step:9150/20000 train_loss:1.9698 train_time:593174ms step_avg:64.83ms
step:9195/20000 val_loss:2.0771 val_bpb:1.2302 train_time:599279ms step_avg:65.17ms
stopping_early: wallclock_cap train_time:599279ms step:9195/20000
peak memory allocated: 10119 MiB reserved: 10438 MiB
Serialized model: 67224983 bytes
Code size: 47686 bytes
Total submission size: 67272669 bytes
Serialized model int8+zlib: 15812832 bytes (payload:17178912 raw_torch:17224025 payload_ratio:3.91x)
Total submission size int8+zlib: 15860518 bytes
final_int8_zlib_roundtrip val_loss:2.0966 val_bpb:1.2417 eval_time:1385ms
final_int8_zlib_roundtrip_exact val_loss:2.09656445 val_bpb:1.24170356
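
To make the slowdown easier to localize, here is a minimal sketch that parses the step lines from the log above and reports the time per step between consecutive logged steps, rather than the cumulative step_avg. The function name interval_step_times and the 60 ms threshold are my own choices for illustration, not part of the training code.

```python
import re

# Matches training-step lines such as
# "step:200/20000 train_loss:2.8436 train_time:12222ms step_avg:61.11ms".
# Validation lines (val_loss/val_bpb) are deliberately not matched.
STEP_RE = re.compile(r"step:(\d+)/\d+ train_loss:[\d.]+ train_time:(\d+)ms")


def interval_step_times(log_text):
    """Return (start_step, end_step, ms_per_step) for consecutive logged steps."""
    points = [(int(m.group(1)), int(m.group(2))) for m in STEP_RE.finditer(log_text)]
    intervals = []
    for (s0, t0), (s1, t1) in zip(points, points[1:]):
        if s1 > s0:
            intervals.append((s0, s1, (t1 - t0) / (s1 - s0)))
    return intervals


# Lines copied verbatim from the training log above.
sample = """\
step:50/20000 train_loss:4.1046 train_time:2364ms step_avg:47.29ms
step:100/20000 train_loss:3.4084 train_time:4542ms step_avg:45.42ms
step:150/20000 train_loss:3.0593 train_time:6722ms step_avg:44.81ms
step:200/20000 train_loss:2.8436 train_time:12222ms step_avg:61.11ms
step:200/20000 val_loss:2.8381 val_bpb:1.6809 train_time:12259ms step_avg:61.30ms
step:250/20000 train_loss:2.7501 train_time:14419ms step_avg:57.67ms
"""

SLOW_MS = 60.0  # arbitrary illustrative threshold, not from the repo

for start, end, ms in interval_step_times(sample):
    flag = "  <-- slow interval" if ms > SLOW_MS else ""
    print(f"steps {start}-{end}: {ms:.2f} ms/step{flag}")
```

On these sample lines, every interval comes out at about 43.6 to 43.9 ms per step except steps 150-200, which take 110 ms per step. So in this part of the log the extra time is concentrated in a single interval rather than spread evenly across steps.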

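As you can see, my run's logged hyperparameters (model_params, world_size, attention settings, learning rates, train_batch_tokens, warmup_steps, seed) are identical to those in the reference training log below. The data lines do differ: train_shards is 80 in my run vs. 25 in the reference, and the validation set has 62021632 tokens vs. 63779840.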

W0318 14:37:59.159000 871689 site-packages/torch/distributed/run.py:852] 
W0318 14:37:59.159000 871689 site-packages/torch/distributed/run.py:852] *****************************************
W0318 14:37:59.159000 871689 site-packages/torch/distributed/run.py:852] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0318 14:37:59.159000 871689 site-packages/torch/distributed/run.py:852] *****************************************
[W318 14:38:11.514156940 Utils.hpp:137] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
[W318 14:38:11.543417305 Utils.hpp:137] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
[W318 14:38:11.552597211 Utils.hpp:137] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
NCCL version 2.27.5+cuda12.9
[W318 14:38:11.832390267 Utils.hpp:137] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
[W318 14:38:11.842257581 Utils.hpp:137] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
[W318 14:38:11.842253680 Utils.hpp:137] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
[W318 14:38:11.899166383 Utils.hpp:137] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
[W318 14:38:11.901800020 Utils.hpp:137] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())

[2026-03-18 14:38:12] pgut1-0:871784:871848 [5] ibvwrap.c:94 NCCL WARN Call to ibv_open_device failed

[2026-03-18 14:38:12] pgut1-0:871784:871848 [5] p2p_plugin.c:565 NCCL WARN NET/IB : Unable to open device mlx5_an0

[2026-03-18 14:38:12] pgut1-0:871786:871849 [7] ibvwrap.c:94 NCCL WARN Call to ibv_open_device failed

[2026-03-18 14:38:12] pgut1-0:871786:871849 [7] p2p_plugin.c:565 NCCL WARN NET/IB : Unable to open device mlx5_an0

[2026-03-18 14:38:12] pgut1-0:871779:871850 [0] ibvwrap.c:94 NCCL WARN Call to ibv_open_device failed

[2026-03-18 14:38:12] pgut1-0:871779:871850 [0] p2p_plugin.c:565 NCCL WARN NET/IB : Unable to open device mlx5_an0

[2026-03-18 14:38:12] pgut1-0:871780:871857 [1] ibvwrap.c:94 NCCL WARN Call to ibv_open_device failed

[2026-03-18 14:38:12] pgut1-0:871780:871857 [1] p2p_plugin.c:565 NCCL WARN NET/IB : Unable to open device mlx5_an0

[2026-03-18 14:38:12] pgut1-0:871781:871858 [2] ibvwrap.c:94 NCCL WARN Call to ibv_open_device failed

[2026-03-18 14:38:12] pgut1-0:871781:871858 [2] p2p_plugin.c:565 NCCL WARN NET/IB : Unable to open device mlx5_an0

[2026-03-18 14:38:12] pgut1-0:871783:871859 [4] ibvwrap.c:94 NCCL WARN Call to ibv_open_device failed

[2026-03-18 14:38:12] pgut1-0:871783:871859 [4] p2p_plugin.c:565 NCCL WARN NET/IB : Unable to open device mlx5_an0

[2026-03-18 14:38:12] pgut1-0:871782:871864 [3] ibvwrap.c:94 NCCL WARN Call to ibv_open_device failed

[2026-03-18 14:38:12] pgut1-0:871782:871864 [3] p2p_plugin.c:565 NCCL WARN NET/IB : Unable to open device mlx5_an0

[2026-03-18 14:38:12] pgut1-0:871785:871865 [6] ibvwrap.c:94 NCCL WARN Call to ibv_open_device failed

[2026-03-18 14:38:12] pgut1-0:871785:871865 [6] p2p_plugin.c:565 NCCL WARN NET/IB : Unable to open device mlx5_an0
logs/hf_verify_sp1024_8gpu.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=/root/code/parameter-golf/data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:25
val_loader:shards pattern=/root/code/parameter-golf/data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:63779840
[rank0]:[W318 14:38:18.833454927 Utils.hpp:112] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
model_params:17059912
world_size:8 grad_accum_steps:1
sdp_backends:cudnn=False flash=True mem_efficient=False math=False
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.05 head_lr:0.0 matrix_lr:0.04 scalar_lr:0.04
train_batch_tokens:524288 train_seq_len:1024 iterations:20000 warmup_steps:20 max_wallclock_seconds:600.000
seed:1337
[rank3]:[W318 14:38:18.835915381 Utils.hpp:112] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
[rank7]:[W318 14:38:18.835951425 Utils.hpp:112] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
[rank6]:[W318 14:38:18.835967008 Utils.hpp:112] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
[rank2]:[W318 14:38:18.836023454 Utils.hpp:112] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
[rank5]:[W318 14:38:18.836119632 Utils.hpp:112] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
[rank4]:[W318 14:38:18.836127772 Utils.hpp:112] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
[rank1]:[W318 14:38:18.836354967 Utils.hpp:112] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
step:0/20000 val_loss:6.9370 val_bpb:4.0978 train_time:0ms step_avg:0.01ms
step:1/20000 train_loss:6.9408 train_time:24ms step_avg:23.99ms
step:2/20000 train_loss:16.8763 train_time:67ms step_avg:33.39ms
step:3/20000 train_loss:9.0044 train_time:110ms step_avg:36.62ms
step:4/20000 train_loss:6.5686 train_time:152ms step_avg:37.99ms
step:5/20000 train_loss:6.6665 train_time:195ms step_avg:38.97ms
step:6/20000 train_loss:6.5027 train_time:239ms step_avg:39.81ms
step:7/20000 train_loss:6.2808 train_time:280ms step_avg:40.05ms
step:8/20000 train_loss:5.9951 train_time:324ms step_avg:40.52ms
step:9/20000 train_loss:6.0187 train_time:367ms step_avg:40.77ms
step:10/20000 train_loss:5.9718 train_time:409ms step_avg:40.93ms
step:50/20000 train_loss:3.9508 train_time:2126ms step_avg:42.52ms
step:100/20000 train_loss:3.3373 train_time:4267ms step_avg:42.67ms
step:150/20000 train_loss:2.9651 train_time:6414ms step_avg:42.76ms
step:200/20000 train_loss:2.8041 train_time:8677ms step_avg:43.38ms
step:200/20000 val_loss:2.8397 val_bpb:1.6774 train_time:8699ms step_avg:43.49ms
step:250/20000 train_loss:2.7379 train_time:10816ms step_avg:43.27ms
step:300/20000 train_loss:2.6613 train_time:12958ms step_avg:43.19ms
step:350/20000 train_loss:2.6434 train_time:15097ms step_avg:43.13ms
step:400/20000 train_loss:2.7684 train_time:17357ms step_avg:43.39ms
step:400/20000 val_loss:2.5687 val_bpb:1.5174 train_time:17382ms step_avg:43.45ms
step:450/20000 train_loss:2.6035 train_time:19502ms step_avg:43.34ms
step:500/20000 train_loss:2.5265 train_time:21643ms step_avg:43.29ms
step:550/20000 train_loss:2.4803 train_time:23782ms step_avg:43.24ms
step:600/20000 train_loss:2.4731 train_time:26034ms step_avg:43.39ms
step:600/20000 val_loss:2.4456 val_bpb:1.4447 train_time:26059ms step_avg:43.43ms
step:650/20000 train_loss:2.3204 train_time:28175ms step_avg:43.35ms
step:700/20000 train_loss:2.5926 train_time:30315ms step_avg:43.31ms
step:750/20000 train_loss:2.4301 train_time:32457ms step_avg:43.28ms
step:800/20000 train_loss:2.4775 train_time:34707ms step_avg:43.38ms
step:800/20000 val_loss:2.3868 val_bpb:1.4099 train_time:34732ms step_avg:43.42ms
step:850/20000 train_loss:2.3941 train_time:36851ms step_avg:43.35ms
step:900/20000 train_loss:2.3716 train_time:38990ms step_avg:43.32ms
step:950/20000 train_loss:2.3216 train_time:41131ms step_avg:43.30ms
step:1000/20000 train_loss:2.3030 train_time:43390ms step_avg:43.39ms
step:1000/20000 val_loss:2.3370 val_bpb:1.3805 train_time:43415ms step_avg:43.42ms
step:1050/20000 train_loss:2.3893 train_time:45532ms step_avg:43.36ms
step:1100/20000 train_loss:2.4145 train_time:47675ms step_avg:43.34ms
step:1150/20000 train_loss:2.2261 train_time:49933ms step_avg:43.42ms
step:1200/20000 train_loss:2.2607 train_time:52072ms step_avg:43.39ms
step:1200/20000 val_loss:2.3026 val_bpb:1.3602 train_time:52097ms step_avg:43.41ms
step:1250/20000 train_loss:2.3312 train_time:54219ms step_avg:43.38ms
step:1300/20000 train_loss:2.3575 train_time:56363ms step_avg:43.36ms
step:1350/20000 train_loss:2.2774 train_time:58628ms step_avg:43.43ms
step:1400/20000 train_loss:2.2436 train_time:60772ms step_avg:43.41ms
step:1400/20000 val_loss:2.2812 val_bpb:1.3475 train_time:60797ms step_avg:43.43ms
step:1450/20000 train_loss:2.3006 train_time:62917ms step_avg:43.39ms
step:1500/20000 train_loss:2.2831 train_time:65060ms step_avg:43.37ms
step:1550/20000 train_loss:2.2957 train_time:67324ms step_avg:43.43ms
step:1600/20000 train_loss:2.2187 train_time:69467ms step_avg:43.42ms
step:1600/20000 val_loss:2.2631 val_bpb:1.3368 train_time:69491ms step_avg:43.43ms
step:1650/20000 train_loss:2.2629 train_time:71614ms step_avg:43.40ms
step:1700/20000 train_loss:2.2619 train_time:73759ms step_avg:43.39ms
step:1750/20000 train_loss:2.1068 train_time:76028ms step_avg:43.44ms
step:1800/20000 train_loss:2.3312 train_time:78171ms step_avg:43.43ms
step:1800/20000 val_loss:2.2479 val_bpb:1.3279 train_time:78197ms step_avg:43.44ms
step:1850/20000 train_loss:2.2211 train_time:80317ms step_avg:43.41ms
step:1900/20000 train_loss:2.2477 train_time:82462ms step_avg:43.40ms
step:1950/20000 train_loss:2.2707 train_time:84723ms step_avg:43.45ms
step:2000/20000 train_loss:2.2346 train_time:86867ms step_avg:43.43ms
step:2000/20000 val_loss:2.2368 val_bpb:1.3213 train_time:86892ms step_avg:43.45ms
step:2050/20000 train_loss:2.0689 train_time:89013ms step_avg:43.42ms
step:2100/20000 train_loss:2.3382 train_time:91276ms step_avg:43.46ms
step:2150/20000 train_loss:2.1161 train_time:93418ms step_avg:43.45ms
step:2200/20000 train_loss:2.2380 train_time:95565ms step_avg:43.44ms
step:2200/20000 val_loss:2.2251 val_bpb:1.3144 train_time:95590ms step_avg:43.45ms
step:2250/20000 train_loss:2.2362 train_time:97711ms step_avg:43.43ms
step:2300/20000 train_loss:2.2390 train_time:99973ms step_avg:43.47ms
step:2350/20000 train_loss:2.1494 train_time:102118ms step_avg:43.45ms
step:2400/20000 train_loss:2.1004 train_time:104264ms step_avg:43.44ms
step:2400/20000 val_loss:2.2158 val_bpb:1.3089 train_time:104288ms step_avg:43.45ms
step:2450/20000 train_loss:2.2078 train_time:106409ms step_avg:43.43ms
step:2500/20000 train_loss:2.2990 train_time:108679ms step_avg:43.47ms
step:2550/20000 train_loss:2.3510 train_time:110825ms step_avg:43.46ms
step:2600/20000 train_loss:2.1989 train_time:112969ms step_avg:43.45ms
step:2600/20000 val_loss:2.2097 val_bpb:1.3053 train_time:112994ms step_avg:43.46ms
step:2650/20000 train_loss:2.0953 train_time:115115ms step_avg:43.44ms
step:2700/20000 train_loss:2.2119 train_time:117382ms step_avg:43.47ms
step:2750/20000 train_loss:2.2833 train_time:119524ms step_avg:43.46ms
step:2800/20000 train_loss:2.2056 train_time:121673ms step_avg:43.45ms
step:2800/20000 val_loss:2.2011 val_bpb:1.3002 train_time:121697ms step_avg:43.46ms
step:2850/20000 train_loss:2.1613 train_time:123815ms step_avg:43.44ms
step:2900/20000 train_loss:2.2400 train_time:126078ms step_avg:43.48ms
step:2950/20000 train_loss:2.2531 train_time:128222ms step_avg:43.47ms
step:3000/20000 train_loss:2.1098 train_time:130368ms step_avg:43.46ms
step:3000/20000 val_loss:2.1953 val_bpb:1.2968 train_time:130392ms step_avg:43.46ms
step:3050/20000 train_loss:2.4246 train_time:132514ms step_avg:43.45ms
step:3100/20000 train_loss:2.1884 train_time:134780ms step_avg:43.48ms
step:3150/20000 train_loss:2.2749 train_time:136926ms step_avg:43.47ms
step:3200/20000 train_loss:2.1492 train_time:139071ms step_avg:43.46ms
step:3200/20000 val_loss:2.1881 val_bpb:1.2925 train_time:139096ms step_avg:43.47ms
step:3250/20000 train_loss:2.1286 train_time:141341ms step_avg:43.49ms
step:3300/20000 train_loss:2.1058 train_time:143485ms step_avg:43.48ms
step:3350/20000 train_loss:2.2214 train_time:145628ms step_avg:43.47ms
step:3400/20000 train_loss:2.2454 train_time:147773ms step_avg:43.46ms
step:3400/20000 val_loss:2.1854 val_bpb:1.2909 train_time:147798ms step_avg:43.47ms
step:3450/20000 train_loss:2.2601 train_time:150039ms step_avg:43.49ms
step:3500/20000 train_loss:2.1183 train_time:152184ms step_avg:43.48ms
step:3550/20000 train_loss:2.0846 train_time:154329ms step_avg:43.47ms
step:3600/20000 train_loss:2.2507 train_time:156472ms step_avg:43.46ms
step:3600/20000 val_loss:2.1784 val_bpb:1.2868 train_time:156496ms step_avg:43.47ms
step:3650/20000 train_loss:2.1383 train_time:158738ms step_avg:43.49ms
step:3700/20000 train_loss:2.2848 train_time:160882ms step_avg:43.48ms
step:3750/20000 train_loss:2.1982 train_time:163029ms step_avg:43.47ms
step:3800/20000 train_loss:2.1399 train_time:165176ms step_avg:43.47ms
step:3800/20000 val_loss:2.1767 val_bpb:1.2858 train_time:165200ms step_avg:43.47ms
step:3850/20000 train_loss:2.3361 train_time:167438ms step_avg:43.49ms
step:3900/20000 train_loss:2.2756 train_time:169582ms step_avg:43.48ms
step:3950/20000 train_loss:2.1261 train_time:171729ms step_avg:43.48ms
step:4000/20000 train_loss:2.1437 train_time:173878ms step_avg:43.47ms
step:4000/20000 val_loss:2.1718 val_bpb:1.2829 train_time:173903ms step_avg:43.48ms
step:4050/20000 train_loss:2.1718 train_time:176147ms step_avg:43.49ms
step:4100/20000 train_loss:2.1899 train_time:178291ms step_avg:43.49ms
step:4150/20000 train_loss:2.1285 train_time:180438ms step_avg:43.48ms
step:4200/20000 train_loss:2.0498 train_time:182707ms step_avg:43.50ms
step:4200/20000 val_loss:2.1666 val_bpb:1.2798 train_time:182731ms step_avg:43.51ms
step:4250/20000 train_loss:2.2487 train_time:184852ms step_avg:43.49ms
step:4300/20000 train_loss:2.1979 train_time:186996ms step_avg:43.49ms
step:4350/20000 train_loss:2.1314 train_time:189141ms step_avg:43.48ms
step:4400/20000 train_loss:2.1727 train_time:191402ms step_avg:43.50ms
step:4400/20000 val_loss:2.1625 val_bpb:1.2774 train_time:191427ms step_avg:43.51ms
step:4450/20000 train_loss:2.1882 train_time:193549ms step_avg:43.49ms
step:4500/20000 train_loss:2.0735 train_time:195696ms step_avg:43.49ms
step:4550/20000 train_loss:2.1347 train_time:197840ms step_avg:43.48ms
step:4600/20000 train_loss:2.1710 train_time:200091ms step_avg:43.50ms
step:4600/20000 val_loss:2.1597 val_bpb:1.2757 train_time:200114ms step_avg:43.50ms
step:4650/20000 train_loss:2.2563 train_time:202236ms step_avg:43.49ms
step:4700/20000 train_loss:2.2077 train_time:204381ms step_avg:43.49ms
step:4750/20000 train_loss:2.1328 train_time:206643ms step_avg:43.50ms
step:4800/20000 train_loss:2.1473 train_time:208788ms step_avg:43.50ms
step:4800/20000 val_loss:2.1579 val_bpb:1.2747 train_time:208812ms step_avg:43.50ms
step:4850/20000 train_loss:2.2067 train_time:210933ms step_avg:43.49ms
step:4900/20000 train_loss:2.1119 train_time:213078ms step_avg:43.49ms
step:4950/20000 train_loss:2.0031 train_time:215339ms step_avg:43.50ms
step:5000/20000 train_loss:2.1104 train_time:217483ms step_avg:43.50ms
step:5000/20000 val_loss:2.1532 val_bpb:1.2719 train_time:217508ms step_avg:43.50ms
step:5050/20000 train_loss:2.0232 train_time:219627ms step_avg:43.49ms
step:5100/20000 train_loss:2.1995 train_time:221774ms step_avg:43.49ms
step:5150/20000 train_loss:2.0709 train_time:224038ms step_avg:43.50ms
step:5200/20000 train_loss:2.0972 train_time:226182ms step_avg:43.50ms
step:5200/20000 val_loss:2.1501 val_bpb:1.2701 train_time:226207ms step_avg:43.50ms
step:5250/20000 train_loss:2.1395 train_time:228330ms step_avg:43.49ms
step:5300/20000 train_loss:2.0947 train_time:230476ms step_avg:43.49ms
step:5350/20000 train_loss:2.0819 train_time:232740ms step_avg:43.50ms
step:5400/20000 train_loss:2.2099 train_time:234884ms step_avg:43.50ms
step:5400/20000 val_loss:2.1475 val_bpb:1.2685 train_time:234909ms step_avg:43.50ms
step:5450/20000 train_loss:2.1314 train_time:237031ms step_avg:43.49ms
step:5500/20000 train_loss:2.2057 train_time:239295ms step_avg:43.51ms
step:5550/20000 train_loss:2.0856 train_time:241437ms step_avg:43.50ms
step:5600/20000 train_loss:2.1448 train_time:243583ms step_avg:43.50ms
step:5600/20000 val_loss:2.1455 val_bpb:1.2674 train_time:243608ms step_avg:43.50ms
step:5650/20000 train_loss:2.0312 train_time:245730ms step_avg:43.49ms
step:5700/20000 train_loss:2.1392 train_time:247996ms step_avg:43.51ms
step:5750/20000 train_loss:2.0206 train_time:250140ms step_avg:43.50ms
step:5800/20000 train_loss:2.2107 train_time:252283ms step_avg:43.50ms
step:5800/20000 val_loss:2.1439 val_bpb:1.2664 train_time:252308ms step_avg:43.50ms
step:5850/20000 train_loss:2.0973 train_time:254429ms step_avg:43.49ms
step:5900/20000 train_loss:2.1270 train_time:256697ms step_avg:43.51ms
step:5950/20000 train_loss:2.0899 train_time:258840ms step_avg:43.50ms
step:6000/20000 train_loss:2.2182 train_time:260985ms step_avg:43.50ms
step:6000/20000 val_loss:2.1445 val_bpb:1.2668 train_time:261009ms step_avg:43.50ms
step:6050/20000 train_loss:2.1230 train_time:263130ms step_avg:43.49ms
step:6100/20000 train_loss:2.1640 train_time:265401ms step_avg:43.51ms
step:6150/20000 train_loss:2.1960 train_time:267547ms step_avg:43.50ms
step:6200/20000 train_loss:2.1217 train_time:269692ms step_avg:43.50ms
step:6200/20000 val_loss:2.1416 val_bpb:1.2651 train_time:269717ms step_avg:43.50ms
step:6250/20000 train_loss:2.1106 train_time:271837ms step_avg:43.49ms
step:6300/20000 train_loss:2.1989 train_time:274105ms step_avg:43.51ms
step:6350/20000 train_loss:2.1738 train_time:276249ms step_avg:43.50ms
step:6400/20000 train_loss:2.1333 train_time:278396ms step_avg:43.50ms
step:6400/20000 val_loss:2.1377 val_bpb:1.2628 train_time:278421ms step_avg:43.50ms
step:6450/20000 train_loss:1.9696 train_time:280544ms step_avg:43.50ms
step:6500/20000 train_loss:2.1279 train_time:282815ms step_avg:43.51ms
step:6550/20000 train_loss:2.2768 train_time:284958ms step_avg:43.51ms
step:6600/20000 train_loss:2.1060 train_time:287102ms step_avg:43.50ms
step:6600/20000 val_loss:2.1354 val_bpb:1.2614 train_time:287126ms step_avg:43.50ms
step:6650/20000 train_loss:2.1036 train_time:289368ms step_avg:43.51ms
step:6700/20000 train_loss:2.1438 train_time:291511ms step_avg:43.51ms
step:6750/20000 train_loss:1.8938 train_time:293654ms step_avg:43.50ms
step:6800/20000 train_loss:2.1809 train_time:295799ms step_avg:43.50ms
step:6800/20000 val_loss:2.1342 val_bpb:1.2607 train_time:295824ms step_avg:43.50ms
step:6850/20000 train_loss:2.0978 train_time:298068ms step_avg:43.51ms
step:6900/20000 train_loss:2.1146 train_time:300210ms step_avg:43.51ms
step:6950/20000 train_loss:2.1328 train_time:302354ms step_avg:43.50ms
step:7000/20000 train_loss:2.1537 train_time:304499ms step_avg:43.50ms
step:7000/20000 val_loss:2.1326 val_bpb:1.2598 train_time:304523ms step_avg:43.50ms
step:7050/20000 train_loss:2.1382 train_time:306765ms step_avg:43.51ms
step:7100/20000 train_loss:2.1078 train_time:308911ms step_avg:43.51ms
step:7150/20000 train_loss:2.1952 train_time:311056ms step_avg:43.50ms
step:7200/20000 train_loss:2.1143 train_time:313204ms step_avg:43.50ms
step:7200/20000 val_loss:2.1299 val_bpb:1.2582 train_time:313228ms step_avg:43.50ms
step:7250/20000 train_loss:2.1009 train_time:315469ms step_avg:43.51ms
step:7300/20000 train_loss:2.1529 train_time:317612ms step_avg:43.51ms
step:7350/20000 train_loss:2.1532 train_time:319759ms step_avg:43.50ms
step:7400/20000 train_loss:2.1137 train_time:321901ms step_avg:43.50ms
step:7400/20000 val_loss:2.1282 val_bpb:1.2572 train_time:321927ms step_avg:43.50ms
step:7450/20000 train_loss:2.4067 train_time:324167ms step_avg:43.51ms
step:7500/20000 train_loss:2.0751 train_time:326311ms step_avg:43.51ms
step:7550/20000 train_loss:2.1258 train_time:328457ms step_avg:43.50ms
step:7600/20000 train_loss:2.1723 train_time:330730ms step_avg:43.52ms
step:7600/20000 val_loss:2.1289 val_bpb:1.2576 train_time:330754ms step_avg:43.52ms
step:7650/20000 train_loss:2.2193 train_time:332878ms step_avg:43.51ms
step:7700/20000 train_loss:2.1329 train_time:335023ms step_avg:43.51ms
step:7750/20000 train_loss:2.0562 train_time:337169ms step_avg:43.51ms
step:7800/20000 train_loss:2.1669 train_time:339436ms step_avg:43.52ms
step:7800/20000 val_loss:2.1252 val_bpb:1.2554 train_time:339460ms step_avg:43.52ms
step:7850/20000 train_loss:2.0994 train_time:341583ms step_avg:43.51ms
step:7900/20000 train_loss:2.1585 train_time:343729ms step_avg:43.51ms
step:7950/20000 train_loss:2.1319 train_time:345873ms step_avg:43.51ms
step:8000/20000 train_loss:2.2613 train_time:348141ms step_avg:43.52ms
step:8000/20000 val_loss:2.1232 val_bpb:1.2542 train_time:348165ms step_avg:43.52ms
step:8050/20000 train_loss:2.1775 train_time:350287ms step_avg:43.51ms
step:8100/20000 train_loss:1.9587 train_time:352431ms step_avg:43.51ms
step:8150/20000 train_loss:2.0401 train_time:354575ms step_avg:43.51ms
step:8200/20000 train_loss:2.1076 train_time:356845ms step_avg:43.52ms
step:8200/20000 val_loss:2.1228 val_bpb:1.2540 train_time:356869ms step_avg:43.52ms
step:8250/20000 train_loss:2.0951 train_time:358988ms step_avg:43.51ms
step:8300/20000 train_loss:2.2244 train_time:361133ms step_avg:43.51ms
step:8350/20000 train_loss:2.0681 train_time:363279ms step_avg:43.51ms
step:8400/20000 train_loss:2.1494 train_time:365552ms step_avg:43.52ms
step:8400/20000 val_loss:2.1201 val_bpb:1.2524 train_time:365577ms step_avg:43.52ms
step:8450/20000 train_loss:2.1278 train_time:367698ms step_avg:43.51ms
step:8500/20000 train_loss:2.0289 train_time:369845ms step_avg:43.51ms
step:8550/20000 train_loss:2.0465 train_time:372114ms step_avg:43.52ms
step:8600/20000 train_loss:2.0682 train_time:374259ms step_avg:43.52ms
step:8600/20000 val_loss:2.1206 val_bpb:1.2526 train_time:374282ms step_avg:43.52ms
step:8650/20000 train_loss:2.2717 train_time:376403ms step_avg:43.51ms
step:8700/20000 train_loss:2.1795 train_time:378549ms step_avg:43.51ms
step:8750/20000 train_loss:2.0492 train_time:380817ms step_avg:43.52ms
step:8800/20000 train_loss:2.1100 train_time:382964ms step_avg:43.52ms
step:8800/20000 val_loss:2.1192 val_bpb:1.2518 train_time:382989ms step_avg:43.52ms
step:8850/20000 train_loss:2.4323 train_time:385110ms step_avg:43.52ms
step:8900/20000 train_loss:2.1016 train_time:387258ms step_avg:43.51ms
step:8950/20000 train_loss:2.0290 train_time:389530ms step_avg:43.52ms
step:9000/20000 train_loss:2.1119 train_time:391675ms step_avg:43.52ms
step:9000/20000 val_loss:2.1204 val_bpb:1.2525 train_time:391698ms step_avg:43.52ms
step:9050/20000 train_loss:2.0826 train_time:393819ms step_avg:43.52ms
step:9100/20000 train_loss:2.0427 train_time:395963ms step_avg:43.51ms
step:9150/20000 train_loss:2.1201 train_time:398238ms step_avg:43.52ms
step:9200/20000 train_loss:2.1490 train_time:400385ms step_avg:43.52ms
step:9200/20000 val_loss:2.1170 val_bpb:1.2505 train_time:400409ms step_avg:43.52ms
step:9250/20000 train_loss:2.1221 train_time:402534ms step_avg:43.52ms
step:9300/20000 train_loss:2.4550 train_time:404680ms step_avg:43.51ms
step:9350/20000 train_loss:2.0384 train_time:406932ms step_avg:43.52ms
step:9400/20000 train_loss:2.0736 train_time:409077ms step_avg:43.52ms
step:9400/20000 val_loss:2.1139 val_bpb:1.2487 train_time:409102ms step_avg:43.52ms
step:9450/20000 train_loss:2.1096 train_time:411223ms step_avg:43.52ms
step:9500/20000 train_loss:2.1070 train_time:413493ms step_avg:43.53ms
step:9550/20000 train_loss:2.0249 train_time:415641ms step_avg:43.52ms
step:9600/20000 train_loss:2.1141 train_time:417785ms step_avg:43.52ms
step:9600/20000 val_loss:2.1138 val_bpb:1.2486 train_time:417809ms step_avg:43.52ms
step:9650/20000 train_loss:2.0183 train_time:419932ms step_avg:43.52ms
step:9700/20000 train_loss:2.1482 train_time:422212ms step_avg:43.53ms
step:9750/20000 train_loss:2.1811 train_time:424359ms step_avg:43.52ms
step:9800/20000 train_loss:2.1011 train_time:426503ms step_avg:43.52ms
step:9800/20000 val_loss:2.1143 val_bpb:1.2489 train_time:426528ms step_avg:43.52ms
step:9850/20000 train_loss:2.1134 train_time:428771ms step_avg:43.53ms
step:9900/20000 train_loss:2.0497 train_time:430915ms step_avg:43.53ms
step:9950/20000 train_loss:2.1989 train_time:433061ms step_avg:43.52ms
step:10000/20000 train_loss:2.1982 train_time:435207ms step_avg:43.52ms
step:10000/20000 val_loss:2.1122 val_bpb:1.2477 train_time:435232ms step_avg:43.52ms
step:10050/20000 train_loss:2.0940 train_time:437485ms step_avg:43.53ms
step:10100/20000 train_loss:2.1277 train_time:439630ms step_avg:43.53ms
step:10150/20000 train_loss:2.0896 train_time:441773ms step_avg:43.52ms
step:10200/20000 train_loss:2.0642 train_time:443918ms step_avg:43.52ms
step:10200/20000 val_loss:2.1112 val_bpb:1.2471 train_time:443941ms step_avg:43.52ms
step:10250/20000 train_loss:2.0627 train_time:446192ms step_avg:43.53ms
step:10300/20000 train_loss:2.2191 train_time:448339ms step_avg:43.53ms
step:10350/20000 train_loss:2.1354 train_time:450485ms step_avg:43.53ms
step:10400/20000 train_loss:2.0705 train_time:452630ms step_avg:43.52ms
step:10400/20000 val_loss:2.1098 val_bpb:1.2463 train_time:452654ms step_avg:43.52ms
step:10450/20000 train_loss:2.0663 train_time:454900ms step_avg:43.53ms
step:10500/20000 train_loss:2.1334 train_time:457046ms step_avg:43.53ms
step:10550/20000 train_loss:2.1931 train_time:459192ms step_avg:43.53ms
step:10600/20000 train_loss:2.0978 train_time:461337ms step_avg:43.52ms
step:10600/20000 val_loss:2.1081 val_bpb:1.2453 train_time:461361ms step_avg:43.52ms
step:10650/20000 train_loss:2.0676 train_time:463610ms step_avg:43.53ms
step:10700/20000 train_loss:2.2333 train_time:465754ms step_avg:43.53ms
step:10750/20000 train_loss:2.1661 train_time:467899ms step_avg:43.53ms
step:10800/20000 train_loss:2.0966 train_time:470044ms step_avg:43.52ms
step:10800/20000 val_loss:2.1081 val_bpb:1.2453 train_time:470069ms step_avg:43.52ms
step:10850/20000 train_loss:2.0708 train_time:472323ms step_avg:43.53ms
step:10900/20000 train_loss:2.1666 train_time:474468ms step_avg:43.53ms
step:10950/20000 train_loss:2.1079 train_time:476615ms step_avg:43.53ms
step:11000/20000 train_loss:2.0774 train_time:478893ms step_avg:43.54ms
step:11000/20000 val_loss:2.1069 val_bpb:1.2446 train_time:478917ms step_avg:43.54ms
step:11050/20000 train_loss:2.1288 train_time:481038ms step_avg:43.53ms
step:11100/20000 train_loss:2.0801 train_time:483185ms step_avg:43.53ms
step:11150/20000 train_loss:1.8743 train_time:485331ms step_avg:43.53ms
step:11200/20000 train_loss:2.1471 train_time:487603ms step_avg:43.54ms
step:11200/20000 val_loss:2.1080 val_bpb:1.2452 train_time:487627ms step_avg:43.54ms
step:11250/20000 train_loss:2.2046 train_time:489748ms step_avg:43.53ms
step:11300/20000 train_loss:2.0957 train_time:491892ms step_avg:43.53ms
step:11350/20000 train_loss:2.0963 train_time:494038ms step_avg:43.53ms
step:11400/20000 train_loss:2.3223 train_time:496318ms step_avg:43.54ms
step:11400/20000 val_loss:2.1051 val_bpb:1.2435 train_time:496342ms step_avg:43.54ms
step:11450/20000 train_loss:2.0724 train_time:498464ms step_avg:43.53ms
step:11500/20000 train_loss:2.1197 train_time:500609ms step_avg:43.53ms
step:11550/20000 train_loss:2.0975 train_time:502754ms step_avg:43.53ms
step:11600/20000 train_loss:2.1091 train_time:505029ms step_avg:43.54ms
step:11600/20000 val_loss:2.1054 val_bpb:1.2437 train_time:505053ms step_avg:43.54ms
step:11650/20000 train_loss:2.1235 train_time:507175ms step_avg:43.53ms
step:11700/20000 train_loss:2.0795 train_time:509324ms step_avg:43.53ms
step:11750/20000 train_loss:2.0662 train_time:511469ms step_avg:43.53ms
step:11800/20000 train_loss:2.0765 train_time:513742ms step_avg:43.54ms
step:11800/20000 val_loss:2.1048 val_bpb:1.2433 train_time:513766ms step_avg:43.54ms
step:11850/20000 train_loss:2.1202 train_time:515888ms step_avg:43.53ms
step:11900/20000 train_loss:2.1029 train_time:518033ms step_avg:43.53ms
step:11950/20000 train_loss:2.1512 train_time:520308ms step_avg:43.54ms
step:12000/20000 train_loss:2.1814 train_time:522453ms step_avg:43.54ms
step:12000/20000 val_loss:2.1029 val_bpb:1.2422 train_time:522477ms step_avg:43.54ms
step:12050/20000 train_loss:2.1085 train_time:524601ms step_avg:43.54ms
step:12100/20000 train_loss:2.0347 train_time:526747ms step_avg:43.53ms
step:12150/20000 train_loss:2.0601 train_time:529018ms step_avg:43.54ms
step:12200/20000 train_loss:2.0387 train_time:531162ms step_avg:43.54ms
step:12200/20000 val_loss:2.1021 val_bpb:1.2418 train_time:531186ms step_avg:43.54ms
step:12250/20000 train_loss:2.0381 train_time:533312ms step_avg:43.54ms
step:12300/20000 train_loss:2.1302 train_time:535458ms step_avg:43.53ms
step:12350/20000 train_loss:2.1272 train_time:537727ms step_avg:43.54ms
step:12400/20000 train_loss:2.1828 train_time:539873ms step_avg:43.54ms
step:12400/20000 val_loss:2.1001 val_bpb:1.2406 train_time:539897ms step_avg:43.54ms
step:12450/20000 train_loss:2.1003 train_time:542019ms step_avg:43.54ms
step:12500/20000 train_loss:2.0696 train_time:544164ms step_avg:43.53ms
step:12550/20000 train_loss:2.1302 train_time:546436ms step_avg:43.54ms
step:12600/20000 train_loss:2.0527 train_time:548582ms step_avg:43.54ms
step:12600/20000 val_loss:2.0998 val_bpb:1.2404 train_time:548606ms step_avg:43.54ms
step:12650/20000 train_loss:2.1438 train_time:550728ms step_avg:43.54ms
step:12700/20000 train_loss:2.2689 train_time:552877ms step_avg:43.53ms
step:12750/20000 train_loss:2.1438 train_time:555147ms step_avg:43.54ms
step:12800/20000 train_loss:2.0105 train_time:557293ms step_avg:43.54ms
step:12800/20000 val_loss:2.0930 val_bpb:1.2364 train_time:557317ms step_avg:43.54ms
step:12850/20000 train_loss:2.0413 train_time:559440ms step_avg:43.54ms
step:12900/20000 train_loss:2.0630 train_time:561586ms step_avg:43.53ms
step:12950/20000 train_loss:2.1627 train_time:563863ms step_avg:43.54ms
step:13000/20000 train_loss:1.9579 train_time:566009ms step_avg:43.54ms
step:13000/20000 val_loss:2.0859 val_bpb:1.2322 train_time:566032ms step_avg:43.54ms
step:13050/20000 train_loss:2.0206 train_time:568155ms step_avg:43.54ms
step:13100/20000 train_loss:1.9294 train_time:570432ms step_avg:43.54ms
step:13150/20000 train_loss:2.0689 train_time:572576ms step_avg:43.54ms
step:13200/20000 train_loss:2.0074 train_time:574722ms step_avg:43.54ms
step:13200/20000 val_loss:2.0790 val_bpb:1.2281 train_time:574747ms step_avg:43.54ms
step:13250/20000 train_loss:2.0596 train_time:576871ms step_avg:43.54ms
step:13300/20000 train_loss:1.9474 train_time:579143ms step_avg:43.54ms
step:13350/20000 train_loss:2.0459 train_time:581289ms step_avg:43.54ms
step:13400/20000 train_loss:2.0441 train_time:583434ms step_avg:43.54ms
step:13400/20000 val_loss:2.0718 val_bpb:1.2239 train_time:583458ms step_avg:43.54ms
step:13450/20000 train_loss:2.1638 train_time:585582ms step_avg:43.54ms
step:13500/20000 train_loss:2.1216 train_time:587857ms step_avg:43.54ms
step:13550/20000 train_loss:2.1855 train_time:590003ms step_avg:43.54ms
step:13600/20000 train_loss:2.0234 train_time:592147ms step_avg:43.54ms
step:13600/20000 val_loss:2.0649 val_bpb:1.2197 train_time:592172ms step_avg:43.54ms
step:13650/20000 train_loss:2.0316 train_time:594295ms step_avg:43.54ms
step:13700/20000 train_loss:2.0323 train_time:596577ms step_avg:43.55ms
step:13750/20000 train_loss:1.9910 train_time:598726ms step_avg:43.54ms
step:13780/20000 val_loss:2.0606 val_bpb:1.2172 train_time:600038ms step_avg:43.54ms
stopping_early: wallclock_cap train_time:600038ms step:13780/20000
peak memory allocated: 10184 MiB reserved: 10200 MiB
Serialized model: 67224983 bytes
Code size: 47642 bytes
Total submission size: 67272625 bytes
Serialized model int8+zlib: 15815847 bytes (payload:17178912 raw_torch:17224025 payload_ratio:3.91x)
Total submission size int8+zlib: 15863489 bytes
final_int8_zlib_roundtrip val_loss:2.0727 val_bpb:1.2244 eval_time:1401ms
final_int8_zlib_roundtrip_exact val_loss:2.07269931 val_bpb:1.22436570

As you can see, the gap in step time (and thus in the step count reached before the wallclock cap) is substantial: 13,780 steps vs. 9,195.
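To put a number on that gap, here is a rough sketch of the arithmetic. It parses the final step line of the 13,780-step log above and compares its per-step time with the 9,195-step run. The 600,000 ms train time for the slower run is an assumption (that it stopped at the same wallclock cap), and `parse_step_line` and `implied_step_ms` are hypothetical helper names, not part of the training code.

```python
import re

# Matches training-log lines of the form shown above, e.g.
#   step:13780/20000 val_loss:2.0606 val_bpb:1.2172 train_time:600038ms step_avg:43.54ms
LINE_RE = re.compile(r"step:(\d+)/\d+ .*train_time:(\d+)ms step_avg:([\d.]+)ms")


def parse_step_line(line):
    """Return (step, train_time_ms, step_avg_ms), or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2)), float(m.group(3))


def implied_step_ms(train_time_ms, steps):
    """Average time per step implied by total train time and step count."""
    return train_time_ms / steps


# Final step line copied from the 13,780-step log above.
fast_line = (
    "step:13780/20000 val_loss:2.0606 val_bpb:1.2172 "
    "train_time:600038ms step_avg:43.54ms"
)
fast_steps, fast_train_ms, fast_logged_avg = parse_step_line(fast_line)
fast_ms = implied_step_ms(fast_train_ms, fast_steps)

# ASSUMPTION: the 9,195-step run also stopped at roughly the 600 s wallclock cap.
slow_ms = implied_step_ms(600_000, 9195)

slowdown = slow_ms / fast_ms
print(f"fast: {fast_ms:.2f} ms/step, slow: {slow_ms:.2f} ms/step, ratio: {slowdown:.2f}x")
```

Under that assumption, the slower run spends roughly 65 ms per step against roughly 43.5 ms, a slowdown of about 1.5x.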

I'm very confused about what could be going on here, and would be curious whether anyone else is experiencing similar problems.
