
[mod & add] fix spo algorithm, add dapo and cispo algorithm in the RLAIF part.#658

Closed
vanking20000918 wants to merge 4 commits intojingyaogong:masterfrom
vanking20000918:qingguofan

Conversation


@vanking20000918 vanking20000918 commented Jan 30, 2026

The code in the SPO part may be a little incomplete; that is, it misses the Prioritized Prompt Sampling step from the original paper.

ps: The modification is based on my personal understanding of the paper and Gemini. It could be wrong, so take it with caution.
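To make the missing step concrete, here is a minimal pure-Python sketch of prioritized prompt sampling, assuming the common idea that prompts with a success rate near 0.5 are the most informative. The p·(1−p) priority, the `eps` floor, and all function names are my own illustration, not the paper's exact formula:

```python
import random

def prompt_priority(success_rate: float, eps: float = 0.1) -> float:
    # p * (1 - p) peaks at success_rate == 0.5 and vanishes at 0 or 1,
    # so "too easy" and "too hard" prompts get low priority.
    # eps keeps every prompt's sampling probability strictly positive.
    return success_rate * (1.0 - success_rate) + eps

def sample_prompts(success_rates: dict, k: int) -> list:
    # Draw k prompts (with replacement) weighted by priority,
    # instead of sampling prompts uniformly as plain GRPO does.
    prompts = list(success_rates)
    weights = [prompt_priority(success_rates[p]) for p in prompts]
    return random.choices(prompts, weights=weights, k=k)
```

In a training loop, `success_rates` would be running per-prompt accuracy estimates updated after each rollout batch.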

@vanking20000918
Author

The main test result is that policy_loss and advantage_mean differ from the original code:

  1. policy_loss: the variance of the modified policy loss decreases significantly, indicating a more certain direction for policy optimization, with trajectories leaning toward the correct direction. Personally, I believe this is due to adopting the prioritized sampling from the original paper, which clarifies the direction of trajectory optimization and avoids the impact of intra-group degeneration.
  2. advantage_mean: after the modification there is a noticeable trend opposite to the original, but the variance of the change is almost the same, which I cannot explain.


@vanking20000918 vanking20000918 left a comment


Actually there are some differences from the original test, such as the policy loss... Maybe it's due to the difference in the SFT model.

@vanking20000918 vanking20000918 marked this pull request as ready for review January 30, 2026 04:24
…Optimization)

I mainly add two techniques from this [paper](https://arxiv.org/pdf/2503.14476) to the original GRPO algorithm: Clip-Higher & Dynamic Sampling.
…orithm

This is the main addition relative to the GRPO algorithm: by changing the gradient of out-of-bounds tokens from "directly set to 0" to "bounded clipping", high-value exploration tokens can continue to participate in parameter updates while maintaining training stability.
Actually, this change does not perform very well in experiments, because the ratio stays almost at 1, meaning tokens are seldom out of bounds.
@vanking20000918 vanking20000918 changed the title [mod] fix spo algorithm in RLAIF part [mod & add] fix spo algorithm, add dapo and cispo algorithm in the RLAIF part. Feb 2, 2026
@vanking20000918
Author

The DAPO algorithm in the code adds Clip-Higher and Dynamic Sampling on top of the GRPO algorithm, following this paper.

In the experiments, I found that these stabilize the training process and the policy loss.

@vanking20000918
Author

The main addition in the CISPO algorithm relative to the GRPO algorithm: by changing the gradient of out-of-bounds tokens from "directly set to 0" to "bounded clipping", high-value exploration tokens can continue to participate in parameter updates while maintaining training stability.
Actually, this change does not perform very well in experiments, because the ratio stays almost at 1, meaning tokens are seldom out of bounds.

The referenced source website

ps: The CISPO algorithm performs very badly in experiments on the minimind project, so this script (train_cispo.py) is for reference only.
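For comparison with plain GRPO clipping, here is a one-token sketch of the CISPO-style objective as I understand it: instead of zeroing the gradient of out-of-range tokens, the importance-sampling weight itself is clipped and treated as a constant, so every token still contributes a bounded update through its log-probability. Names and the clip bounds are illustrative, and real autograd code must detach the clipped weight:

```python
import math

def cispo_token_loss(logp_new, logp_old, advantage,
                     eps_low=0.2, eps_high=0.28):
    """Single-token CISPO objective term (plain-Python sketch).

    PPO/GRPO clipping zeroes the gradient of tokens whose ratio leaves
    [1 - eps_low, 1 + eps_high]; here the clipped ratio acts as a fixed
    weight on the policy-gradient term, so no token is silenced.
    """
    ratio = math.exp(logp_new - logp_old)
    weight = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # In tensor code: loss = -(weight.detach() * advantage * logp_new),
    # so gradients flow only through logp_new, never through the weight.
    return -(weight * advantage * logp_new)
```

This also matches the observation above: when the ratio stays near 1 the weight is never clipped, so CISPO and plain GRPO behave almost identically.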

@vanking20000918 vanking20000918 deleted the qingguofan branch March 1, 2026 01:59