
Releases: PrimeIntellect-ai/prime-rl

v0.4.0 release

06 Feb 22:29
f870f3c


1. Bring Your Own Algorithms

Researchers can now plug in custom loss functions and advantage functions without modifying the core training code. Define your own RL objectives and advantage estimators, configure them via TOML, and experiment freely.

  • Custom Loss: provide a per-sequence loss function via LossInputs / LossOutputs dataclasses
  • Custom Advantage: provide a per-problem advantage function via AdvantageInputs / AdvantageOutputs dataclasses
  • Configure everything in your TOML config with type = "custom", import_path and kwargs
# Custom loss
[loss]
type = "custom"
import_path = "my_module.ppo_clip_loss"
kwargs = { clip_eps = 0.2 }

# Custom advantage
[advantage]
type = "custom"
import_path = "my_module.normalized_advantage"
kwargs = { eps = 1e-8 }
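
To illustrate the shape such a plug-in might take, here is a hedged sketch of a PPO-style custom loss. The LossInputs / LossOutputs dataclass names come from this release, but the field names and exact signature below are assumptions for illustration only; see docs/bring-your-own-algorithms.md for the real interface.

```python
from dataclasses import dataclass

import torch


# Assumed fields, for illustration; the real dataclasses may differ.
@dataclass
class LossInputs:
    logprobs: torch.Tensor      # per-token logprobs under the current policy
    old_logprobs: torch.Tensor  # per-token logprobs at rollout time
    advantages: torch.Tensor    # per-token advantage estimates


@dataclass
class LossOutputs:
    loss: torch.Tensor


def ppo_clip_loss(inputs: LossInputs, clip_eps: float = 0.2) -> LossOutputs:
    # Standard PPO clipped surrogate objective over one sequence.
    ratio = torch.exp(inputs.logprobs - inputs.old_logprobs)
    unclipped = ratio * inputs.advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * inputs.advantages
    return LossOutputs(loss=-torch.minimum(unclipped, clipped).mean())
```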

See docs/bring-your-own-algorithms.md for full documentation.

#1715 — Bring your own algorithms

2. Multimodal RL Training

Added experimental support for multimodal reinforcement learning training, enabling RL fine-tuning of vision-language models (VLMs). This opens up new possibilities for training models that can reason over both text and images using reinforcement learning.

Key capabilities:

  • Train VLMs with the same GRPO/PPO algorithms used for text-only models
  • Multi-turn conversation support for multi-modal interactions, allowing complex dialogue flows with interleaved images and text
  • Compatible with existing reward functions and verifiers
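
For intuition, a multi-turn multimodal rollout might look like the following OpenAI-style chat list with interleaved text and image parts. This is a hypothetical shape only; the exact message schema prime-rl expects may differ, and the URL is a placeholder.

```python
# Hypothetical multi-turn, multimodal conversation (illustrative schema).
messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "What is shown in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
    ]},
    {"role": "assistant", "content": "A cat sitting on a windowsill."},
    # A follow-up turn referring back to the image from the first turn.
    {"role": "user", "content": "What color is it?"},
]
```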

#1680 — Add multimodal training (experimental)
#1703 — Add multi-turn support for multi-modal RL

3. Performance & Parallelism

Expert Parallelism (EP)

Added support for Expert Parallelism, a distributed training strategy for Mixture of Experts (MoE) models.

#1595 — Expert Parallelism support
#1614 — Add CP and EP to benchmarks

Flash Attention 4

Added FA4 support for fast attention on Blackwell.

#1726 — Flash Attention 4

FA3 Ring-Attention Kernel

Previously, our ring-attention path still used the Flash Attention 2 kernel. It can now use FA3 instead, for a significant speedup on long-context training.

#1727 — Add FA3 ring-attention kernel wrapper and benchmark coverage

Optimizer State CPU Offload

Offload optimizer states (e.g. Adam first and second moments) to CPU memory. Particularly useful to reduce memory usage when doing RL experiments at smaller scale, allowing large MoE models to fit on a couple of training nodes. The performance reduction is negligible in RL because large batch sizes mean many gradient accumulation steps, and the cost of offloading weights to CPU is amortized.
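
The amortization argument can be made concrete with a back-of-envelope calculation (the numbers below are illustrative, not measurements): with K gradient-accumulation micro-steps per optimizer step, the CPU transfer cost is paid once per K forward/backward passes, so its relative overhead shrinks as K grows.

```python
def relative_offload_overhead(transfer_s: float, microstep_s: float, k: int) -> float:
    """Fraction of total step time spent on optimizer-state CPU transfers,
    assuming one transfer per optimizer step and k micro-steps per step."""
    return transfer_s / (transfer_s + k * microstep_s)

# With a hypothetical 2 s transfer and 0.5 s micro-steps:
# k=1 puts 80% of the step in transfers; k=64 drops that below 6%.
```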

#1694 — Add optimizer state CPU offload

3-Stage Chunked LM Head Loss

Improved memory efficiency for the language model head loss computation via a 3-stage chunked approach. Instead of materializing the full logit tensor, the loss is computed in chunks, reducing peak memory usage. This is especially beneficial for large-vocabulary models where the logit tensor can be a major memory bottleneck during the backward pass.
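
The idea behind chunking can be sketched as follows. This is an illustrative reference implementation, not prime-rl's actual 3-stage kernel: the loss is accumulated over row chunks so only a `[chunk_size, vocab]` slice of logits is alive at any time.

```python
import torch
import torch.nn.functional as F


def chunked_lm_head_loss(hidden, lm_head_weight, targets, chunk_size=2048):
    """Cross-entropy over the vocabulary without materializing the full
    [num_tokens, vocab] logit tensor (illustrative sketch)."""
    total = hidden.new_zeros(())
    count = 0
    for start in range(0, hidden.size(0), chunk_size):
        h = hidden[start:start + chunk_size]
        t = targets[start:start + chunk_size]
        logits = h @ lm_head_weight.T  # only one chunk of logits lives at once
        total = total + F.cross_entropy(logits, t, reduction="sum")
        count += t.numel()
    return total / count
```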

#1649 — 3-stage logic for chunked lm head loss

4. Other Improvements

  • Elastic Inference Pool: New elastic inference pool with DNS-based service discovery for dynamic scaling of inference servers at runtime. Add or remove servers without restarting the training loop, with automatic health checking and failover. #1617, #1704
  • Temperature Scheduler: Control sampling temperature throughout training with various scheduling strategies, enabling curriculum-style exploration. #1624
  • JSON Structured Logging: JSON structured logging for easier log aggregation and analysis in production. #1681
  • Gemma3 Support: Added native support for Gemma3 models. #1648
  • Worker Rate Limiting: Rate limiting for worker job submissions to control dispatch pace. #1711
  • K8s Health Probes: Health probes for inference and trainer, plus parallel pod management for faster scaling. #1719, #1718
  • Multi-run Checkpointing: Checkpoint support for multiple concurrent training runs. #1593, #1632
  • RunsManager Refactor: Renamed Runs → RunsManager with hook cleanup, and ability to evict runs with bad batches. #1619, #1634
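
As a toy illustration of the curriculum-style exploration the temperature scheduler enables, a linear schedule might decay sampling temperature from an exploratory to an exploitative value over training. The function name and defaults below are hypothetical; prime-rl's scheduling strategies and config keys may differ.

```python
def linear_temperature(step: int, total_steps: int,
                       start: float = 1.0, end: float = 0.6) -> float:
    """Linearly interpolate sampling temperature from start to end."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * frac
```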

Breaking Changes

  • vLLM upgraded to 0.14: Upgraded vLLM dependency to version 0.14. This may require updating your environment. Token chat preprocessing has been aligned with vLLM 0.14 behavior. #1625, #1637

  • Liger kernel model deprecated: The Liger kernel model implementation has been deprecated. #1691


Bug Fixes

#1717 — Fix race condition
#1725 — Fix int64 JSON serialization in Chinese character metrics
#1720 — Handle empty completion_temperatures in prepare_sample
#1712 — Use stable checkpoints for orchestrator resume
#1702 — Fix eval watcher only picks up checkpoints in increasing order
#1693 — Fix NCCL update
#1690 — Don't create config dir on trainer during config validation
#1686 — Make NCCL broadcast compatible with DP
#1683 — Fix bug where hosted RL rollouts were missing final message
#1670 — Zombie guard on checkpoint
#1678 — Only master clean weight
#1665 — Fix support for NCCL mode when resuming from checkpoint
#1650 — Fix KL mismatch by resetting prefix cache
#1644 — Fix weight update when enforce_eager=True
#1642 — Use discovery in eval
#1636 — Fix CPU offloading
#1630 — Make search for line more robust
#1612 — Fix timeout overcounting
#1609 — Auto-restart env workers on unexpected death
#1596 — Fix trainer crash when all rollouts in a batch fail
#1613 — Use step change instead of batch size to demarcate when to update

Misc

#1722 — Add AMD Instinct MI300X/MI325X peak FLOPS for MFU calculation
#1724 — Strip @Version suffix from env IDs before loading as Python modules
#1700 — Track Chinese characters
#1677 — Wandb async RL inflight
#1671 — Cancel all rollout eval
#1640 — Add mismatch-KL stability checks for nightly math runs
#1635 — Weights reload configuration
#1638 — Add INFO log when orchestrator resumes after checkpoint wait
#1631 — Ensure eval results upload before exiting subprocess
#1629 — Assert when only trainer or orchestrator wandb is configured
#1622 — Add retry with exponential backoff for empty training batches
#1601 — Add health endpoint for worker nodes in multi-node training
#1604 — Check for current step based on progress to know what is valid for this step
#1543

v0.3.0 release

16 Jan 02:36
32adbba


Highlights

1. Fused LM head / chunking (logits + loss)

We introduced a fused LM head with selective logprobs, significantly decreasing the peak VRAM required by the RL loss function. This is now enabled by default and should greatly reduce the VRAM requirements for RL training.
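
"Selective logprobs" means keeping only the logprob of each target token rather than the full `[num_tokens, vocab]` logprob tensor. The sketch below illustrates the idea in plain PyTorch; it is not the fused kernel itself (which additionally avoids materializing the full logits).

```python
import torch
import torch.nn.functional as F


def selective_logprobs(hidden, lm_head_weight, targets):
    """Per-token logprob of the target tokens only (illustrative sketch)."""
    logits = hidden @ lm_head_weight.T
    log_z = torch.logsumexp(logits, dim=-1)
    chosen = logits.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return chosen - log_z
```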

Example on Qwen/Qwen3-0.6B at 16384 sequence length, where peak VRAM dropped from 44.2 GiB to 3.3 GiB:
With the previous implementation

                                                            Benchmark                                                            
┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃    Step ┃             MFU              ┃           Throughput            ┃            Step Time            ┃   Peak Memory    ┃
┡━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│       1 │            7.85%             │             12.12K              │             21.63s              │     44.2 GiB     │
│       2 │            7.89%             │             12.19K              │             21.39s              │     44.2 GiB     │
│       3 │            7.83%             │             12.10K              │             21.99s              │     44.2 GiB     │
│         │                              │                                 │                                 │                  │
│ Overall │ 7.86% ± 0.03% [7.83%, 7.89%] │ 12.13K ± 46.45 [12.10K, 12.19K] │ 21.67s ± 0.30s [21.39s, 21.99s] │ 44.2 GiB (93.1%) │
└─────────┴──────────────────────────────┴─────────────────────────────────┴─────────────────────────────────┴──────────────────┘

With the fused chunked LM head

                                                           Benchmark                                                           
┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃    Step ┃             MFU              ┃           Throughput            ┃            Step Time            ┃    Peak Memory   ┃
┡━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│       1 │            8.07%             │             12.47K              │             21.02s              │      3.3 GiB     │
│       2 │            8.10%             │             12.51K              │             20.90s              │      3.3 GiB     │
│       3 │            8.14%             │             12.56K              │             20.67s              │      3.3 GiB     │
│         │                              │                                 │                                 │                  │
│ Overall │ 8.10% ± 0.03% [8.07%, 8.14%] │ 12.51K ± 48.19 [12.47K, 12.56K] │ 20.86s ± 0.18s [20.67s, 21.02s] │   3.3 GiB (6.9%) │
└─────────┴──────────────────────────────┴─────────────────────────────────┴─────────────────────────────────┴──────────────────┘

#1525 — Fused LM Head implementation.
#1544 — Default fused_lm_head_chunk_size=2048 for RL.
#1545 — Enable loss chunking for non-custom HF models.

2. On-policy distillation

Added on-policy distillation: the student learns from a teacher on the student's own rollouts, so it stays on-policy while still getting dense, step-by-step guidance. Compared to normal (off-policy) distillation, this reduces the mismatch from teacher-only states and helps the model learn to recover from its own mistakes instead of only imitating perfect trajectories.
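
One common loss for this setup, shown here as a conceptual sketch rather than prime-rl's actual recipe (see docs/on_policy_distillation.md for that), is a per-token reverse KL between student and teacher evaluated on the student's own rollout tokens:

```python
import torch
import torch.nn.functional as F


def distill_loss(student_logits, teacher_logits):
    """Mean per-token reverse KL(student || teacher) over rollout tokens.

    Reverse KL takes the expectation under the student's own distribution,
    which matches the on-policy setting: the student is supervised on the
    states it actually visits.
    """
    s = F.log_softmax(student_logits, dim=-1)
    t = F.log_softmax(teacher_logits, dim=-1)
    return (s.exp() * (s - t)).sum(-1).mean()
```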

Quickstart docs at https://github.com/PrimeIntellect-ai/prime-rl/blob/main/docs/on_policy_distillation.md

#1458 — Add support for on-policy distillation

3. Advanced multi-LoRA support

LoRAs are now first-class citizens in prime-rl. This release adds preliminary support for training multiple separate LoRAs from different runs with the same trainer and inference deployment. We also now support training LoRAs for MoE experts.

#1571 — Update LoRA default alpha to 32.
#1567 — Change LoRA alpha default to 32.
#1526 — MoE LoRA support.
#1520 — Retry load_lora_adapter (NFS delays).

4. New model support

We now natively support AFMoE!

#1515 — AFMoE support

5. Trainer observability / metrics

The prime-rl RL trainer can now optionally expose metrics through a Prometheus metrics server.

#1547 — Prometheus metrics server for trainer.

6. Refactor of environment logging

Logs from a given environment can now be redirected to a dedicated log file, and verifier logger output is intercepted and reformatted into the prime-rl format.

#1594
#1561


Breaking changes

  • Config rename: ckpt.keep → ckpt.keep_last (and new ckpt.keep_interval). Update configs that still set ckpt.keep. (2025-12-31)

  • Behavior change / defaults: MultiLoRAMoE / QwenMoE now enables training expert LoRAs by default via target_modules changes.

  • Behavior change (RL defaults): RL training auto-sets model.fused_lm_head_chunk_size=2048 when unset (except impl='liger_kernel'). This can change memory/throughput characteristics vs v0.2. (2026-01-05)

  • Default change: model.lora.alpha default changed 16.0 → 32.0 (impacts effective LoRA scaling if you relied on the old default). (2026-01-10)
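
For the `ckpt.keep` rename above, a config that previously set `ckpt.keep = 5` would become the following (the `keep_interval` value is illustrative; it is optional and keeps periodic checkpoints in addition to the last N):

```toml
[ckpt]
keep_last = 5        # was: keep = 5
keep_interval = 100  # optional: also retain every 100th checkpoint
```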


Bug fixes

#1568 — Unique rollout request IDs to avoid collisions.
#1546 — Detect dead worker process in collect_responses.
#1563 — Fix orchestrator null-batch handling.
#1537 — Fix checkpoint cleanup on resume + cancelled rollout metric.
#1531 — Fix NCCL handshake.
#1529 — Fix W&B integration.
#1520 — Retry load_lora_adapter for NFS delays.

Misc

#1551 — TrainingSample reward: adds reward to TrainingSample for logging/consumption in training pipelines.
#1521 — Checkpoint retention policy: adds keep_interval to keep periodic checkpoints in addition to “last N”.
#1536 — Blackwell kernels: enables grouped_mm on Blackwell GPUs.
#1557 — Cumsum dtype: switches multilinear cumsum dtype to int32 (avoids wider dtype overhead).
#1571 — Update LoRA layer default alpha from 16.0 to 32.0.
#1567 — Change LoRA alpha default to 32.
#1538 — Add step param to Monitor.log() interface.
#1518 — Refactor online eval.
#1533 — README updates.
#1550 — Remove PR template.
#1522 — Docs/changelog entry for ckpt.keep_last + ckpt.keep_interval.
#1506 — Inference readiness handshake (later reverted).
#1528 — Revert inference readiness handshake from #1506.
#1516 — Remove step usage in W&B monitor (later reverted).
#1530 — Revert W&B monitor “step” removal from #1516.
#1580 — add fa3 dependency

v0.2

31 Dec 08:16
017b3fe


Second major release of prime-rl.

This release includes the major redesign of the library that was used to train Intellect-3.

prime-rl is entering a more stable phase: we have validated most of our design at scale and believe it is maintainable in the long run. We also adopted nightly tests that run a diverse set of training runs for many hours, covering single-turn, multi-turn, and agentic workflows. This allows us to catch regressions in performance or convergence.

prime-rl will now adopt a regular release schedule.

v0.1

11 Jul 19:48
8c44bb8


Pre-v1 refactor release. First and last release before the v1 betas.