Description
I am attempting to run On-Policy Distillation (OPD) using the reverse-text example, distilling a 0.6B SFT model (Qwen3-0.6B-Reverse-Text-SFT) against a 0.6B RL teacher model (Qwen3-0.6B-Reverse-Text-RL).
Initialization completes successfully: the orchestrator, teacher, and trainer processes start, the model weights load, and the optimizer initializes. However, once the log line `Starting training loop` appears, the process hangs completely. No rollouts are collected and no training steps are performed.
System Configuration
- Setup: 4 GPUs (IDs 0-3), 4x RTX 5090
  - GPU 0: inference
  - GPU 1: trainer
  - GPUs 2, 3: teacher inference
- Python version: 3.12
- Task: reverse-text
Reproduction Steps
1. Configuration (`examples/reverse_text/opd.toml`):

```toml
max_steps = 20
seq_len = 2048
teacher_gpu_ids = [2, 3]

[teacher_inference.model]
name = "PrimeIntellect/Qwen3-0.6B-Reverse-Text-RL"

[teacher_inference.server]
port = 8001

[ckpt]

[model]
name = "PrimeIntellect/Qwen3-0.6B-Reverse-Text-SFT"

[trainer.loss]
teacher_tau = 0.5

[orchestrator]
batch_size = 128
rollouts_per_example = 16

[orchestrator.sampling]
max_tokens = 128

[[orchestrator.env]]
id = "reverse-text"

[trainer.optim]
lr = 3e-6

[inference]
```

2. Command used:
```sh
uv run rl @ examples/reverse_text/opd.toml \
  --model.name PrimeIntellect/Qwen3-0.6B-Reverse-Text-SFT \
  --wandb.project prime-rl \
  --wandb.name reverse-text-opd-1
```

Logs
```
10:19:14 INFO Starting RL run
...
10:19:14 INFO Starting inference process on GPU(s) 0
10:19:14 INFO Starting teacher inference process on GPU(s) 2 3
10:19:14 INFO Starting orchestrator process
10:19:14 INFO Starting trainer process on GPU(s) 1
...
[default0]:10:19:33 INFO Starting RL trainer in World(world_size=1, rank=0, local_rank=0, local_world_size=1, num_nodes=1)
...
[default0]:10:19:35 INFO Building 1-D device mesh with ['dp_shard'], [1]
...
[default0]:10:19:41 INFO Initializing tokenizer (name='Qwen/Qwen3-0.6B' trust_remote_code=False chat_template=None)
[default0]:10:19:41 INFO Initializing optimizer (lr=3e-06 weight_decay=0.01 max_norm=1.0 type='adamw' betas1=0.9 betas2=0.999)
[default0]:10:19:41 INFO Using `token` importance ratio ...
[default0]:10:19:41 INFO Using `constant` scheduler (type='constant')
[default0]:10:19:41 INFO Initializing weight broadcast (type='filesystem' save_sharded=True save_format='safetensors')
[default0]:10:19:41 INFO Starting from step 0 (total_tokens=0, total_samples=0)
[default0]:10:19:41 INFO Initializing data loader (fake=None)
[default0]:10:19:41 INFO Starting training loop (max_steps=20)
# (PROCESS HANGS HERE INDEFINITELY)
```
Can you please help me out here?
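For debugging, one thing that can be checked while the run hangs is whether both inference servers are actually reachable. A minimal probe sketch, assuming a vLLM-style `/health` route; port 8001 is the teacher port from the config above, while 8000 is only an assumed default for the student inference server:

```python
import urllib.request
import urllib.error

def health_url(port: int) -> str:
    """Health endpoint of a local OpenAI-compatible (vLLM-style) server."""
    return f"http://localhost:{port}/health"

def is_up(port: int, timeout: float = 3.0) -> bool:
    """Return True if the server on `port` answers its health check."""
    try:
        with urllib.request.urlopen(health_url(port), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

# 8000 is an ASSUMED default for the student inference server;
# 8001 comes from [teacher_inference.server] in the config above.
for port in (8000, 8001):
    print(f"port {port}: {'up' if is_up(port) else 'unreachable'}")
```

If the teacher server never comes up (or is on a different port than the orchestrator expects), the rollout loop would block exactly where the logs stop.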