On-Policy Distillation (OPD) Trainer hangs indefinitely after "Starting training loop" #1570

@harisarang

Description

I am attempting to run On-Policy Distillation (OPD) using the reverse-text example, distilling a 0.6B SFT student model (Qwen3-0.6B-Reverse-Text-SFT) from a 0.6B RL teacher model.

Initialization completes successfully (the orchestrator, teacher, and trainer processes all start), the model weights load, and the optimizer initializes. However, once the log line "Starting training loop" appears, the process hangs indefinitely: no rollouts are collected and no training steps are performed.

System Configuration

  • Setup: 4 GPUs (IDs 0, 1, 2, 3) (4xRTX 5090)
    • GPU 0: Inference
    • GPU 1: Trainer
    • GPU 2, 3: Teacher Inference
  • Python Version: 3.12
  • Task: reverse-text

Reproduction Steps

1. Configuration (examples/reverse_text/opd.toml):

max_steps = 20
seq_len = 2048

teacher_gpu_ids = [2, 3]

[teacher_inference.model]
name = "PrimeIntellect/Qwen3-0.6B-Reverse-Text-RL"

[teacher_inference.server]
port = 8001

[ckpt]

[model]
name = "PrimeIntellect/Qwen3-0.6B-Reverse-Text-SFT"

[trainer.loss]
teacher_tau = 0.5

[orchestrator]
batch_size = 128
rollouts_per_example = 16

[orchestrator.sampling]
max_tokens = 128

[[orchestrator.env]]
id = "reverse-text"

[trainer.optim]
lr = 3e-6

[inference]

2. Command used:

uv run rl @ examples/reverse_text/opd.toml \
  --model.name PrimeIntellect/Qwen3-0.6B-Reverse-Text-SFT \
  --wandb.project prime-rl \
  --wandb.name reverse-text-opd-1
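
To rule out the inference servers as the cause of the hang, one thing I can check is whether both servers are actually serving. The sketch below is a guess at a diagnostic, not part of the repro: it assumes the servers are vLLM-style and expose a /health endpoint, and it assumes the student inference server sits on the default port 8000 (only the teacher port 8001 is set in opd.toml above).

```python
import urllib.request
import urllib.error


def server_is_healthy(port: int, timeout: float = 2.0) -> bool:
    """Return True if an HTTP server on localhost answers GET /health with 200."""
    url = f"http://localhost:{port}/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    # Ports are assumptions: 8001 is the teacher port from opd.toml;
    # 8000 is a guess at the default student inference port.
    for name, port in [("student", 8000), ("teacher", 8001)]:
        print(f"{name} inference server healthy: {server_is_healthy(port)}")
```

If either check returns False while the trainer is stuck, the orchestrator is probably waiting on an inference server that never came up (or is on a different port).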

Logs

10:19:14    INFO Starting RL run
...
10:19:14    INFO Starting inference process on GPU(s) 0
10:19:14    INFO Starting teacher inference process on GPU(s) 2 3
10:19:14    INFO Starting orchestrator process
10:19:14    INFO Starting trainer process on GPU(s) 1
...
[default0]:10:19:33    INFO Starting RL trainer in World(world_size=1, rank=0, local_rank=0, local_world_size=1, num_nodes=1)
...
[default0]:10:19:35    INFO Building 1-D device mesh with ['dp_shard'], [1]
...
[default0]:10:19:41    INFO Initializing tokenizer (name='Qwen/Qwen3-0.6B' trust_remote_code=False chat_template=None)
[default0]:10:19:41    INFO Initializing optimizer (lr=3e-06 weight_decay=0.01 max_norm=1.0 type='adamw' betas1=0.9 betas2=0.999)
[default0]:10:19:41    INFO Using `token` importance ratio ...
[default0]:10:19:41    INFO Using `constant` scheduler (type='constant')
[default0]:10:19:41    INFO Initializing weight broadcast (type='filesystem' save_sharded=True save_format='safetensors')
[default0]:10:19:41    INFO Starting from step 0 (total_tokens=0, total_samples=0)
[default0]:10:19:41    INFO Initializing data loader (fake=None)
[default0]:10:19:41    INFO Starting training loop (max_steps=20)
# (PROCESS HANGS HERE INDEFINITELY)
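
To see where the trainer is actually blocked, a stack dump of the hung process would help. A minimal sketch, assuming one can add a couple of lines near the top of the trainer entrypoint: registering the stdlib faulthandler on SIGUSR1 makes the process print every thread's stack to stderr when it receives `kill -USR1 <pid>`, without attaching a debugger.

```python
import faulthandler
import signal

# After this, `kill -USR1 <trainer pid>` dumps all thread stacks to stderr,
# showing which call the training loop is blocked in.
faulthandler.register(signal.SIGUSR1, all_threads=True)
```

(An external sampling profiler such as py-spy would give the same information without modifying the code.)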

Can you please help me out here?
