On-Policy Distillation (OPD) Trainer hangs indefinitely after "Starting training loop" #1570

@harisarang

Description

I am attempting to run On-Policy Distillation (OPD) using the reverse-text example, distilling a 0.6B SFT student model (Qwen3-0.6B-Reverse-Text-SFT) from a 0.6B RL teacher model.

Initialization completes successfully (the orchestrator, teacher, and trainer processes all start), the model weights load, and the optimizer initializes. However, once the log line "Starting training loop" appears, the process hangs indefinitely: no rollouts are collected and no training steps are performed.

System Configuration

  • Setup: 4 GPUs (IDs 0, 1, 2, 3) (4xRTX 5090)
    • GPU 0: Inference
    • GPU 1: Trainer
    • GPU 2, 3: Teacher Inference
  • Python Version: 3.12
  • Task: reverse-text

Reproduction Steps

1. Configuration (examples/reverse_text/opd.toml):

max_steps = 20
seq_len = 2048

teacher_gpu_ids = [2, 3]

[teacher_inference.model]
name = "PrimeIntellect/Qwen3-0.6B-Reverse-Text-RL"

[teacher_inference.server]
port = 8001

[ckpt]

[model]
name = "PrimeIntellect/Qwen3-0.6B-Reverse-Text-SFT"

[trainer.loss]
teacher_tau = 0.5

[orchestrator]
batch_size = 128
rollouts_per_example = 16

[orchestrator.sampling]
max_tokens = 128

[[orchestrator.env]]
id = "reverse-text"

[trainer.optim]
lr = 3e-6

[inference]

2. Command used:

uv run rl @ examples/reverse_text/opd.toml \
  --model.name PrimeIntellect/Qwen3-0.6B-Reverse-Text-SFT \
  --wandb.project prime-rl \
  --wandb.name reverse-text-opd-1
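
To rule out the inference servers as the cause of the hang, one thing I can check is whether both servers are actually serving. The sketch below is a guess at a diagnostic, not part of the repro: it assumes the servers are vLLM-style and expose a /health endpoint, and it assumes the student inference server sits on the default port 8000 (only the teacher port 8001 is set in opd.toml above).

```python
import urllib.request
import urllib.error


def server_is_healthy(port: int, timeout: float = 2.0) -> bool:
    """Return True if an HTTP server on localhost answers GET /health with 200."""
    url = f"http://localhost:{port}/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    # Ports are assumptions: 8001 is the teacher port from opd.toml;
    # 8000 is a guess at the default student inference port.
    for name, port in [("student", 8000), ("teacher", 8001)]:
        print(f"{name} inference server healthy: {server_is_healthy(port)}")
```

If either check returns False while the trainer is stuck, the orchestrator is probably waiting on an inference server that never came up (or is on a different port).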

Logs

10:19:14    INFO Starting RL run
...
10:19:14    INFO Starting inference process on GPU(s) 0
10:19:14    INFO Starting teacher inference process on GPU(s) 2 3
10:19:14    INFO Starting orchestrator process
10:19:14    INFO Starting trainer process on GPU(s) 1
...
[default0]:10:19:33    INFO Starting RL trainer in World(world_size=1, rank=0, local_rank=0, local_world_size=1, num_nodes=1)
...
[default0]:10:19:35    INFO Building 1-D device mesh with ['dp_shard'], [1]
...
[default0]:10:19:41    INFO Initializing tokenizer (name='Qwen/Qwen3-0.6B' trust_remote_code=False chat_template=None)
[default0]:10:19:41    INFO Initializing optimizer (lr=3e-06 weight_decay=0.01 max_norm=1.0 type='adamw' betas1=0.9 betas2=0.999)
[default0]:10:19:41    INFO Using `token` importance ratio ...
[default0]:10:19:41    INFO Using `constant` scheduler (type='constant')
[default0]:10:19:41    INFO Initializing weight broadcast (type='filesystem' save_sharded=True save_format='safetensors')
[default0]:10:19:41    INFO Starting from step 0 (total_tokens=0, total_samples=0)
[default0]:10:19:41    INFO Initializing data loader (fake=None)
[default0]:10:19:41    INFO Starting training loop (max_steps=20)
# (PROCESS HANGS HERE INDEFINITELY)
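
To see where the trainer is actually blocked, a stack dump of the hung process would help. A minimal sketch, assuming one can add a couple of lines near the top of the trainer entrypoint: registering the stdlib faulthandler on SIGUSR1 makes the process print every thread's stack to stderr when it receives `kill -USR1 <pid>`, without attaching a debugger.

```python
import faulthandler
import signal

# After this, `kill -USR1 <trainer pid>` dumps all thread stacks to stderr,
# showing which call the training loop is blocked in.
faulthandler.register(signal.SIGUSR1, all_threads=True)
```

(An external sampling profiler such as py-spy would give the same information without modifying the code.)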

Can you please help me out here?
