Add end-to-end GRPO + OpenReward notebook (Local ORS / Toolathlon Gym / Qwen3.5-4B) #5747
rycerzes wants to merge 5 commits into
- `HintParams` and `list_task_tools` method to expose the hint tool during task tool discovery
- update `OpenRewardSpec` to support task-specific tool discovery
- refactor task tools probe indices
- fix ListToolsOutput import
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 8b88e87.
"    return p\n",
"\n",
"def _get_running_container_id(name: str) -> str:\n",
"    p = _run(f\"docker ps --filter name=^/{name}$ --format {{.ID}}\", check=False)\n",
Docker format template produces literal text instead of ID
Low Severity
The `--format {{.ID}}` inside an f-string produces the literal `{.ID}` instead of Docker's required Go template `{{.ID}}`. To emit double braces in the shell command, the f-string needs `{{{{.ID}}}}`. The notebook output confirms this: it prints `({.ID})` instead of the actual container ID. The code only works incidentally because any truthy output satisfies the "is running" check, but `container_id` is wrong.
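The escaping rule described in this comment can be checked without Docker. A minimal sketch, assuming a hypothetical container name `toolathlon-ors`; it only builds the command strings and does not execute anything:

```python
# Hypothetical container name, used only for illustration.
name = "toolathlon-ors"

# As written in the notebook: inside an f-string, "{{" and "}}" collapse
# to single braces, so Docker receives the literal text {.ID}.
buggy = f"docker ps --filter name=^/{name}$ --format {{.ID}}"

# Doubling the braces once more yields the Go template Docker expects.
fixed = f"docker ps --filter name=^/{name}$ --format {{{{.ID}}}}"

print(buggy)
print(fixed)
```

Another option is to keep the template in a plain (non-f) string and concatenate it, which avoids the quadruple braces entirely.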
wadeKeith left a comment
Great educational resource - the end-to-end GRPO + OpenReward notebook with Local ORS and Toolathlon Gym makes the workflow much more approachable. LGTM! Reviewed by Hermes Agent.


What does this PR do?
Adds a self-contained Jupyter notebook (`examples/notebooks/grpo_openreward_toolathlon_qwen3_5_4b.ipynb`) that demonstrates end-to-end GRPO training using:
- `trl.experimental.openreward.OpenRewardSpec` for environment/reward wiring
- `Qwen/Qwen3.5-4B` loaded via `transformers` + `peft` LoRA (bf16, no QLoRA)

Design notes
0.03 * min(n_calls, 30)) on top of the ORS outcome so that GRPO metrics visibly move during short training runs. This is explicitly documented as not the official Toolathlon benchmark reward.

Before submitting
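The shaping term mentioned in the design notes can be sketched roughly as follows. This reads "on top of the ORS outcome" as plain addition, which is an assumption; the function and parameter names are hypothetical and this is not the notebook's actual code:

```python
def shaped_reward(ors_outcome: float, n_calls: int) -> float:
    """Hypothetical sketch: ORS outcome plus a capped per-call bonus."""
    # The bonus grows by 0.03 per tool call and stops growing after 30 calls,
    # so it can contribute at most 0.03 * 30 to the total.
    bonus = 0.03 * min(n_calls, 30)
    return ors_outcome + bonus
```

Under this reading, the cap keeps the bonus bounded regardless of how many tool calls a rollout makes.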
AI writing disclosure
Who can review?
@adithya-s-k
Note
Medium Risk
Changes how `OpenRewardSpec` discovers/binds tools by opening probe sessions to include task-scoped tools, which can affect rollout behavior and add extra ORS calls; impact is limited to the experimental OpenReward integration plus tests/examples.

Overview
Adds a new end-to-end notebook (`examples/notebooks/grpo_openreward_toolathlon_qwen3_5_4b.ipynb`) that spins up a local Toolathlon ORS via Docker and runs GRPO training against it with `Qwen/Qwen3.5-4B` + LoRA.

Extends `trl.experimental.openreward.OpenRewardSpec` to optionally discover and bind task-scoped tools by probing `session.list_tools()` (ORS `/task_tools`) instead of only `environment.list_tools()`, with new controls `discover_task_tools` and `task_tools_discovery_index` and logic to merge tool specs across probed task indices.

Updates the experimental echo ORS test env to expose a session-only `hint` tool and adds test coverage for task-tool binding, opt-out behavior, and single-index discovery probing.
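To illustrate what merging tool specs across probed task indices could look like, here is a minimal sketch. It is not TRL's implementation: the function name, the dict-based spec shape, and the "first occurrence wins" rule are all assumptions made for illustration:

```python
def merge_tool_specs(env_tools, probed_tools):
    """Hypothetical sketch: merge environment-level and task-scoped tool specs.

    env_tools: list of specs (dicts with a "name" key) from the environment.
    probed_tools: mapping of task index -> list of specs seen in a probe session.
    """
    merged = {}
    # Environment-level tools come first and take precedence on name clashes.
    for spec in env_tools:
        merged.setdefault(spec["name"], spec)
    # Visit probed task indices in a stable order; keep the first spec per name.
    for index in sorted(probed_tools):
        for spec in probed_tools[index]:
            merged.setdefault(spec["name"], spec)
    return list(merged.values())


env = [{"name": "echo"}]
probed = {0: [{"name": "echo"}, {"name": "hint"}], 1: [{"name": "hint"}]}
print([t["name"] for t in merge_tool_specs(env, probed)])
```

Here the session-only `hint` tool appears once in the merged result even though two probed indices expose it.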