This repository contains reinforcement learning experiments for sequence generalization on the activity and lis datasets. Training loops, actor/reference coordination, and evaluation utilities build on top of the verl framework while customizing data pipelines and rewards for this project.
- Parquet files for the activity task live in `seqdata/activity/`.
- Parquet files for the lis task live in `seqdata/lis/`.
- Both folders include train/test splits as well as `*_reason.parquet` variants, which are used for the explicit reasoning-format reward.
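A small helper makes the layout above concrete. This is only an illustrative sketch: the exact split filenames (`train.parquet`, `test.parquet`) are assumed from the description, and `split_path` is not a function in this repository.

```python
from pathlib import Path

def split_path(task: str, split: str, reason: bool = False) -> Path:
    """Resolve a parquet split under seqdata/ following the layout above.

    Assumed filenames: <split>.parquet and <split>_reason.parquet.
    """
    name = f"{split}_reason.parquet" if reason else f"{split}.parquet"
    return Path("seqdata") / task / name
```

For example, `split_path("activity", "train", reason=True)` resolves to `seqdata/activity/train_reason.parquet`.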
- Shell scripts in `myscripts/` are the primary entrypoints for running GRPO training. Each script pins the dataset split, model checkpoint, rollout configuration, and reward function selection for a specific experiment (e.g. `bash myscripts/activity_answer_qwen2-7b.sh`).
- The scripts assume the directory layout above; update the `seqdata` folders to swap in new datasets without touching the launch configs.
- Custom reward shaping lives in `verl/utils/reward_score/myreward.py`. The launcher scripts reference functions from this module via the `custom_reward_function` overrides passed to verl.
- Modify or extend this module when introducing new rewards; all scripts pick up the changes automatically.
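A new reward in `myreward.py` would follow the callable shape that verl's `custom_reward_function` hook expects, which at the time of writing is `(data_source, solution_str, ground_truth, extra_info)`. The sketch below is a hypothetical example, not one of this repository's actual rewards; the `<answer>` tag convention and the 0.1 format bonus are assumptions for illustration.

```python
import re

def format_reward(data_source, solution_str, ground_truth, extra_info=None):
    """Hypothetical reasoning-format reward: full credit for a correct
    answer inside <answer>...</answer> tags, a small bonus for the
    format alone, and zero otherwise. (Illustrative values only.)"""
    m = re.search(r"<answer>(.*?)</answer>", solution_str, re.DOTALL)
    if m is None:
        return 0.0
    answer = m.group(1).strip()
    if answer == str(ground_truth).strip():
        return 1.0
    return 0.1  # assumed format bonus, not the repo's actual shaping
```

Once such a function is defined in `myreward.py`, the launcher scripts can select it by name through their `custom_reward_function` overrides.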
- Use `python pass_k.py --task activity --model Qwen/Qwen2.5-7B-Instruct --k 256` (or the `myscripts/pass_k.sh` helper) to measure pass@k metrics on the saved models.
- Adjust the `--task` flag to switch between the activity and lis datasets, or change `--model`/`--k` as needed for alternative checkpoints and sampling depths.
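For reference, the standard unbiased pass@k estimator (Chen et al., 2021) computes, from `n` sampled completions of which `c` are correct, the probability that at least one of `k` draws is correct. Whether `pass_k.py` uses this exact estimator is not confirmed here; this is the conventional definition.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from n samples with c correct:
    1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: some draw must be correct
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For instance, with 2 samples and 1 correct, `pass_at_k(2, 1, 1)` gives 0.5.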
- verl dependencies and CLI flags follow the upstream project. Refer to the official verl documentation if you need to customize distributed launch parameters or model backends beyond what the scripts provide.