Add end-to-end GRPO + OpenReward notebook (Local ORS / Toolathlon Gym / Qwen3.5-4B)#5747

Open: rycerzes wants to merge 5 commits into huggingface:main from rycerzes:feat/openreward-recipe

@rycerzes rycerzes commented May 11, 2026

What does this PR do?

Adds a self-contained Jupyter notebook (examples/notebooks/grpo_openreward_toolathlon_qwen3_5_4b.ipynb) that demonstrates end-to-end GRPO training using:

  • trl.experimental.openreward.OpenRewardSpec for environment/reward wiring
  • A local ORS server (Toolathlon Gym Docker container) started and managed from within the notebook
  • Toolathlon Gym as the tool-use evaluation environment (25 MCP servers, per-task tool discovery)
  • Qwen/Qwen3.5-4B loaded via transformers + peft LoRA (bf16, no QLoRA)

Design notes

  • The demo reward function adds a small per-tool-call bonus (0.03 * min(n_calls, 30)) on top of the ORS outcome so that GRPO metrics visibly move during short training runs. This is explicitly documented as not the official Toolathlon benchmark reward.
  • Container lifecycle is fully managed: pull → start → health-check → atexit cleanup.
  • Works on any local Jupyter environment with Docker access; no external ORS deployment required.
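
The demo reward described in the design notes can be sketched as follows. This is only an illustration of the stated formula (`0.03 * min(n_calls, 30)` added to the ORS outcome); the function name `shaped_reward` and its signature are hypothetical, and the notebook's actual reward function may be structured differently.

```python
def shaped_reward(
    ors_outcome: float,
    n_calls: int,
    bonus_per_call: float = 0.03,
    max_calls: int = 30,
) -> float:
    """Return the ORS outcome plus a capped per-tool-call bonus.

    Hypothetical helper: only the formula comes from the PR description.
    """
    # The bonus grows linearly with the number of tool calls, then saturates.
    bonus = bonus_per_call * min(n_calls, max_calls)
    return ors_outcome + bonus


# A rollout that failed the task but made 10 tool calls still gets a small signal.
print(shaped_reward(0.0, 10))
# Calls beyond the cap add nothing: 50 calls earn the same bonus as 30.
print(shaped_reward(1.0, 50) == shaped_reward(1.0, 30))
```

With these defaults the bonus can never exceed 0.9, which is why the result should not be read as the official Toolathlon benchmark reward.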

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

AI writing disclosure

  • No AI usage: the PR was written entirely by a human.
  • AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
  • AI-generated: the PR was mostly or fully generated by an AI tool.

Who can review?

@adithya-s-k

Note: This PR depends on and should be merged after #5729.


Note

Medium Risk
Changes how OpenRewardSpec discovers/binds tools by opening probe sessions to include task-scoped tools, which can affect rollout behavior and add extra ORS calls; impact is limited to the experimental OpenReward integration plus tests/examples.

Overview
Adds a new end-to-end notebook (examples/notebooks/grpo_openreward_toolathlon_qwen3_5_4b.ipynb) that spins up a local Toolathlon ORS via Docker and runs GRPO training against it with Qwen/Qwen3.5-4B + LoRA.
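
A rough sketch of the pull → start → health-check → atexit cleanup lifecycle, assuming a container that publishes a single port and answers HTTP 200 on some health URL. Everything named here (`start_managed_container`, the image, container name, port, and health URL passed in by the caller) is hypothetical and not taken from the notebook; only standard `docker pull`, `docker run`, and `docker stop` subcommands are used.

```python
import atexit
import subprocess
import time
import urllib.request
from typing import Callable, Sequence


def _docker(args: Sequence[str]) -> None:
    # Run a docker subcommand and raise if it exits non-zero.
    subprocess.run(["docker", *args], check=True)


def _is_healthy(url: str) -> bool:
    # Treat any network-level failure as "not ready yet".
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False


def start_managed_container(
    image: str,
    name: str,
    port: int,
    health_url: str,
    run: Callable[[Sequence[str]], None] = _docker,
    probe: Callable[[str], bool] = _is_healthy,
    retries: int = 30,
    delay: float = 1.0,
    register_cleanup: Callable[[Callable[[], None]], object] = atexit.register,
) -> None:
    """Pull and start a container, register cleanup, then wait for health."""
    run(["pull", image])
    run(["run", "-d", "--rm", "--name", name, "-p", f"{port}:{port}", image])

    def cleanup() -> None:
        # Best-effort stop; the container may already be gone.
        try:
            run(["stop", name])
        except Exception:
            pass

    # Registered before the health check so a failed start is still cleaned up.
    register_cleanup(cleanup)

    for _ in range(retries):
        if probe(health_url):
            return
        time.sleep(delay)
    raise TimeoutError(f"{name} did not become healthy at {health_url}")
```

The `run`, `probe`, and `register_cleanup` parameters exist only so the flow can be exercised without Docker; in a notebook the defaults would be used.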

Extends trl.experimental.openreward.OpenRewardSpec to optionally discover and bind task-scoped tools by probing session.list_tools() (ORS /task_tools) instead of relying only on environment.list_tools(). Two new controls, discover_task_tools and task_tools_discovery_index, configure the probing, and new logic merges tool specs across the probed task indices.
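
To make the merge step concrete, here is a minimal sketch of combining environment-level tools with tools found in probed task sessions. It is not the `OpenRewardSpec` implementation: the function `merge_tool_specs`, the dict-based spec shape, and the "first definition of a name wins" rule are all assumptions chosen for illustration.

```python
from typing import Any, Dict, List, Mapping, Sequence

ToolSpec = Dict[str, Any]


def merge_tool_specs(
    env_tools: Sequence[ToolSpec],
    task_tools_by_index: Mapping[int, Sequence[ToolSpec]],
) -> List[ToolSpec]:
    """Merge tool specs by name, keeping the first spec seen for each name.

    Hypothetical helper: environment-level tools take priority, then
    task-scoped tools in ascending order of the probed task index.
    """
    merged: Dict[str, ToolSpec] = {}
    for spec in env_tools:
        merged.setdefault(spec["name"], spec)
    for index in sorted(task_tools_by_index):
        for spec in task_tools_by_index[index]:
            merged.setdefault(spec["name"], spec)
    # Dicts preserve insertion order, so the result is deterministic.
    return list(merged.values())


env_tools = [{"name": "search", "description": "environment-level"}]
task_tools = {
    0: [{"name": "hint", "description": "session-only tool from task 0"}],
    1: [{"name": "hint", "description": "same tool seen again in task 1"}],
}
print([tool["name"] for tool in merge_tool_specs(env_tools, task_tools)])
```

Probing a single index would correspond to passing a one-entry mapping; opting out of task-tool discovery would correspond to passing an empty one.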

Updates the experimental echo ORS test env to expose a session-only hint tool and adds test coverage for task-tool binding, opt-out behavior, and single-index discovery probing.

Reviewed by Cursor Bugbot for commit 8b88e87. Bugbot is set up for automated code reviews on this repo.

rycerzes added 5 commits May 11, 2026 20:03
- `HintParams`, and `list_task_tools` method to expose the hint tool during task tool discovery.
- update `OpenRewardSpec` to support task-specific tool discovery
- refactor task tools probe indices

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

" return p\n",
"\n",
"def _get_running_container_id(name: str) -> str:\n",
" p = _run(f\"docker ps --filter name=^/{name}$ --format {{.ID}}\", check=False)\n",

Docker format template produces literal text instead of ID

Low Severity

The --format {{.ID}} inside an f-string produces the literal {.ID} instead of Docker's required Go template {{.ID}}. To emit double braces in the shell command, the f-string needs {{{{.ID}}}}. The notebook output confirms this: it prints ({.ID}) instead of the actual container ID. The code only works incidentally because any truthy output satisfies the "is running" check, but container_id is wrong.
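
The brace-escaping behavior behind this report can be reproduced in isolation. The container name below is a placeholder, and the list-based variant is one possible way to sidestep the escaping, not necessarily the fix the PR will adopt.

```python
name = "example-container"  # placeholder name, not the one used in the notebook

# Inside an f-string, "{{" and "}}" each collapse to a single literal brace.
buggy = f"docker ps --filter name=^/{name}$ --format {{.ID}}"
print(buggy)  # docker ps --filter name=^/example-container$ --format {.ID}

# Doubling the braces again yields the Go template Docker expects.
fixed = f"docker ps --filter name=^/{name}$ --format {{{{.ID}}}}"
print(fixed)  # docker ps --filter name=^/example-container$ --format {{.ID}}

# Alternative: build an argument list so the template lives in a plain string.
argv = ["docker", "ps", "--filter", f"name=^/{name}$", "--format", "{{.ID}}"]
print(argv[-1])  # {{.ID}}
```

Keeping the template in a non-f-string, as in `argv`, avoids the quadruple braces entirely.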


@wadeKeith wadeKeith left a comment

Great educational resource - the end-to-end GRPO + OpenReward notebook with Local ORS and Toolathlon Gym makes the workflow much more approachable. LGTM! Reviewed by Hermes Agent.
