[mod & add] fix spo algorithm, add dapo and cispo algorithm in the RLAIF part.#658
vanking20000918 wants to merge 4 commits into jingyaogong:master
Conversation
…Optimization) I mainly added two mechanisms to the original GRPO algorithm, following this [paper](https://arxiv.org/pdf/2503.14476): Clip-Higher and Dynamic Sampling.
…orithm This is the main addition relative to the GRPO algorithm: by changing the gradient of out-of-bounds tokens from "directly set to 0" to "bounded clipping", we ensure that high-value exploration tokens can continue to participate in parameter updates while maintaining training stability. In practice, though, this change did not help much in my experiments, because the ratio stays very close to 1, meaning tokens are seldom out of bounds.
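A minimal PyTorch sketch of the Clip-Higher idea described above (the function name is mine; the default `eps_low = 0.2` and `eps_high = 0.28` follow the DAPO paper, but check the repo's actual hyperparameters):

```python
import torch

def clip_higher_pg_loss(log_probs, old_log_probs, advantages,
                        eps_low=0.2, eps_high=0.28):
    # Probability ratio between the current and the old policy.
    ratio = torch.exp(log_probs - old_log_probs)
    # Clip-Higher: decouple the clip range so the upper bound (eps_high)
    # exceeds the lower bound (eps_low), letting low-probability
    # exploration tokens grow more before their gradient is cut off.
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    # Standard PPO-style pessimistic objective, negated to form a loss.
    loss = -torch.min(ratio * advantages, clipped * advantages)
    return loss.mean()
```

With symmetric clipping (`eps_low == eps_high`) this reduces to the usual GRPO/PPO clipped objective.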
The DAPO algorithm in the code adds Clip-Higher and Dynamic Sampling on top of the GRPO algorithm, following the paper above. In my experiments, these stabilized the training process and the policy loss.
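The Dynamic Sampling part can be sketched as a simple filter (names are illustrative, not from the repo): prompt groups whose rollouts all received the same reward have zero group-normalized advantage and contribute no gradient, so they are dropped (and in DAPO, resampled until the batch is full).

```python
def dynamic_sampling_filter(groups):
    """Keep only prompt groups whose sampled rewards are not all identical.

    `groups` is a list of per-prompt reward lists (one reward per rollout).
    A group where every rollout scored the same (all correct or all wrong)
    gives zero advantage after group normalization, so it is discarded.
    """
    return [g for g in groups if max(g) > min(g)]
```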
The main addition in the CISPO algorithm relative to the GRPO algorithm: by changing the gradient of out-of-bounds tokens from "directly set to 0" to "bounded clipping", we ensure that high-value exploration tokens can continue to participate in parameter updates while maintaining training stability. ps: the CISPO algorithm works very badly in experiments in the minimind project, so this script (train_cispo.py) is provided for reference only.
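As CISPO is usually described (in the MiniMax-M1 paper), the importance-sampling weight itself is clipped and detached, and it scales a REINFORCE-style log-probability term, so even out-of-bounds tokens keep a bounded, nonzero gradient. A sketch under that reading (function name and epsilon defaults are mine, not necessarily what train_cispo.py uses):

```python
import torch

def cispo_loss(log_probs, old_log_probs, advantages,
               eps_low=1.0, eps_high=0.28):
    ratio = torch.exp(log_probs - old_log_probs)
    # Clip the IS weight and detach it: clipping bounds its magnitude,
    # while detaching means the gradient flows through log_probs for
    # every token instead of being zeroed when the ratio is clipped.
    is_weight = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high).detach()
    loss = -(is_weight * advantages * log_probs)
    return loss.mean()
```

Note the contrast with PPO-style clipping: there, a clipped token's gradient vanishes entirely; here only the weight is bounded.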




The code in the SPO part may be a little incomplete: it is missing the Prioritized Prompt Sampling component from the original paper.
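For completeness, here is a hypothetical sketch of what a prioritized prompt sampler could look like (this is my guess at the mechanism, not the SPO paper's definition): draw prompts with probability proportional to a priority score, e.g. one that favors prompts of intermediate difficulty.

```python
import random

def sample_prompts(prompt_ids, priorities, k):
    """Hypothetical prioritized sampling sketch (names are illustrative).

    Draws k prompts with probability proportional to `priorities`, e.g.
    a score reflecting how close each prompt's recent success rate is
    to 0.5, so training focuses on neither-trivial-nor-impossible prompts.
    """
    total = sum(priorities)
    weights = [p / total for p in priorities]
    return random.choices(prompt_ids, weights=weights, k=k)
```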

ps: These modifications are based on my personal understanding of the papers and on Gemini; they could be wrong, so take them with caution.