
[mod & add] fix spo algorithm, add dapo and cispo algorithm in the RLAIF part.#658

Closed
vanking20000918 wants to merge 4 commits intojingyaogong:masterfrom
vanking20000918:qingguofan

Conversation


@vanking20000918 vanking20000918 commented Jan 30, 2026

The code in the SPO part may be a little incomplete; that is, it misses the Prioritized Prompt Sampling step from the original paper.

ps: The modification is based on my personal understanding of the paper and Gemini. It could be wrong, so take it with caution.
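To make the missing step concrete, here is a minimal pure-Python sketch of prioritized prompt sampling, assuming the common idea that prompts with a success rate near 0.5 are the most informative. The p·(1−p) priority, the `eps` floor, and all function names are my own illustration, not the paper's exact formula:

```python
import random

def prompt_priority(success_rate: float, eps: float = 0.1) -> float:
    # p * (1 - p) peaks at success_rate == 0.5 and vanishes at 0 or 1,
    # so "too easy" and "too hard" prompts get low priority.
    # eps keeps every prompt's sampling probability strictly positive.
    return success_rate * (1.0 - success_rate) + eps

def sample_prompts(success_rates: dict, k: int) -> list:
    # Draw k prompts (with replacement) weighted by priority,
    # instead of sampling prompts uniformly as plain GRPO does.
    prompts = list(success_rates)
    weights = [prompt_priority(success_rates[p]) for p in prompts]
    return random.choices(prompts, weights=weights, k=k)
```

In a training loop, `success_rates` would be running per-prompt accuracy estimates updated after each rollout batch.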

@vanking20000918
Author

The main test result is that policy_loss and advantage_mean differ from the original code:

  1. policy_loss: the variance of the modified policy loss decreases significantly, indicating a more certain direction for policy optimization, with trajectories leaning toward the correct direction. Personally, I believe this is due to adopting the prioritized sampling from the original paper, which clarifies the direction of trajectory optimization and avoids the impact of intra-group degeneration.
  2. advantage_mean: after the modification there is a noticeable trend opposite to the original, but the variance of the change is almost the same, which I cannot explain.


@vanking20000918 vanking20000918 left a comment


Actually there are some differences from the original test, such as the policy loss... Maybe it's due to the difference in the SFT model.

@vanking20000918 vanking20000918 marked this pull request as ready for review January 30, 2026 04:24
…Optimization)

I mainly add two techniques from this [paper](https://arxiv.org/pdf/2503.14476) to the original GRPO algorithm: Clip-Higher & Dynamic Sampling.
…orithm

This is the main addition relative to the GRPO algorithm: by changing the gradient of out-of-bounds tokens from "directly set to 0" to "bounded clipping", high-value exploration tokens can continue to participate in parameter updates while maintaining training stability.
Actually, this change does not perform very well in experiments, because the ratio stays almost at 1, meaning tokens are seldom out of bounds.
@vanking20000918 vanking20000918 changed the title [mod] fix spo algorithm in RLAIF part [mod & add] fix spo algorithm, add dapo and cispo algorithm in the RLAIF part. Feb 2, 2026
@vanking20000918
Author

The DAPO algorithm in the code adds Clip-Higher and Dynamic Sampling on top of the GRPO algorithm, following this paper.

In the experiments, I found that these stabilize the training process and the policy loss.

@vanking20000918
Author

The main addition in the CISPO algorithm relative to the GRPO algorithm: by changing the gradient of out-of-bounds tokens from "directly set to 0" to "bounded clipping", high-value exploration tokens can continue to participate in parameter updates while maintaining training stability.
Actually, this change does not perform very well in experiments, because the ratio stays almost at 1, meaning tokens are seldom out of bounds.

The referenced source website

ps: The CISPO algorithm performs very badly in experiments on the minimind project, so this script (train_cispo.py) is for reference only.
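For comparison with plain GRPO clipping, here is a one-token sketch of the CISPO-style objective as I understand it: instead of zeroing the gradient of out-of-range tokens, the importance-sampling weight itself is clipped and treated as a constant, so every token still contributes a bounded update through its log-probability. Names and the clip bounds are illustrative, and real autograd code must detach the clipped weight:

```python
import math

def cispo_token_loss(logp_new, logp_old, advantage,
                     eps_low=0.2, eps_high=0.28):
    """Single-token CISPO objective term (plain-Python sketch).

    PPO/GRPO clipping zeroes the gradient of tokens whose ratio leaves
    [1 - eps_low, 1 + eps_high]; here the clipped ratio acts as a fixed
    weight on the policy-gradient term, so no token is silenced.
    """
    ratio = math.exp(logp_new - logp_old)
    weight = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # In tensor code: loss = -(weight.detach() * advantage * logp_new),
    # so gradients flow only through logp_new, never through the weight.
    return -(weight * advantage * logp_new)
```

This also matches the observation above: when the ratio stays near 1 the weight is never clipped, so CISPO and plain GRPO behave almost identically.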

@vanking20000918 vanking20000918 deleted the qingguofan branch March 1, 2026 01:59