Add end-to-end GRPO + OpenReward notebook (Local ORS / Toolathlon Gym / Qwen3.5-4B) by rycerzes · Pull Request #5747 · huggingface/trl

rycerzes · 2026-05-11T17:24:11Z

What does this PR do?

Adds a self-contained Jupyter notebook (examples/notebooks/grpo_openreward_toolathlon_qwen3_5_4b.ipynb) that demonstrates end-to-end GRPO training using:

trl.experimental.openreward.OpenRewardSpec for environment/reward wiring
A local ORS server (Toolathlon Gym Docker container) started and managed from within the notebook
Toolathlon Gym as the tool-use evaluation environment (25 MCP servers, per-task tool discovery)
Qwen/Qwen3.5-4B loaded via transformers + peft LoRA (bf16, no QLoRA)

Design notes

The demo reward function adds a small per-tool-call bonus (0.03 * min(n_calls, 30)) on top of the ORS outcome so that GRPO metrics visibly move during short training runs. This is explicitly documented as not the official Toolathlon benchmark reward.
Container lifecycle is fully managed: pull → start → health-check → atexit cleanup.
Works on any local Jupyter environment with Docker access; no external ORS deployment required.
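
As a rough illustration of the demo reward described in the design notes, the shaping could look like the sketch below. The function name `shaped_reward` and its parameter names are hypothetical and do not come from the notebook; only the formula `0.03 * min(n_calls, 30)` added to the ORS outcome is taken from the description above.

```python
def shaped_reward(
    ors_outcome: float,
    n_calls: int,
    bonus_per_call: float = 0.03,  # per-tool-call bonus from the PR description
    max_counted_calls: int = 30,   # calls beyond this cap earn no extra bonus
) -> float:
    """Demo-only reward: ORS outcome plus a capped per-tool-call bonus.

    Hypothetical helper for illustration; not the official Toolathlon reward.
    """
    bonus = bonus_per_call * min(n_calls, max_counted_calls)
    return ors_outcome + bonus


if __name__ == "__main__":
    # A failed task with 5 tool calls still earns a small, non-zero reward.
    print(shaped_reward(0.0, 5))
    # The bonus stops growing once the cap of 30 calls is reached.
    print(shaped_reward(1.0, 100))
```

Capping the bonus keeps the shaping term bounded at 0.9, so a rollout cannot raise its reward indefinitely by issuing more tool calls.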

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

AI writing disclosure

No AI usage: the PR was written entirely by a human.
AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
AI-generated: the PR was mostly or fully generated by an AI tool.

Who can review?

@adithya-s-k

Note: This PR depends on and should be merged after #5729.

Note

Medium Risk
Changes how OpenRewardSpec discovers/binds tools by opening probe sessions to include task-scoped tools, which can affect rollout behavior and add extra ORS calls; impact is limited to the experimental OpenReward integration plus tests/examples.

Overview
Adds a new end-to-end notebook (examples/notebooks/grpo_openreward_toolathlon_qwen3_5_4b.ipynb) that spins up a local Toolathlon ORS via Docker and runs GRPO training against it with Qwen/Qwen3.5-4B + LoRA.

Extends trl.experimental.openreward.OpenRewardSpec to optionally discover and bind task-scoped tools by probing session.list_tools() (ORS /task_tools) instead of only environment.list_tools(), with new controls discover_task_tools and task_tools_discovery_index and logic to merge tool specs across probed task indices.
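
One way to picture the "merge tool specs across probed task indices" step is a union keyed by tool name. The sketch below is a hypothetical illustration using plain dictionaries; `merge_tool_specs` and the spec layout are assumptions and are not the code added to `OpenRewardSpec` in this PR.

```python
from typing import Any


def merge_tool_specs(probes: list[list[dict[str, Any]]]) -> list[dict[str, Any]]:
    """Union tool specs from several probes, keeping the first spec seen per name.

    Hypothetical helper: each probe is the list of tool specs returned for one
    source (the environment, or a session opened on one task index).
    """
    merged: dict[str, dict[str, Any]] = {}
    for specs in probes:
        for spec in specs:
            # setdefault keeps the earliest spec, so environment-level tools
            # take precedence when they are listed first.
            merged.setdefault(spec["name"], spec)
    return list(merged.values())


# Assumed example data: one environment-level tool, one session-only tool.
env_tools = [{"name": "search", "description": "environment-level tool"}]
task0_tools = [
    {"name": "search", "description": "same tool, seen again in a session"},
    {"name": "hint", "description": "session-only tool"},
]

merged = merge_tool_specs([env_tools, task0_tools])
print([tool["name"] for tool in merged])
```

Keeping the first spec per name is only one possible policy; the actual merge logic in the PR may resolve conflicts differently.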

Updates the experimental echo ORS test env to expose a session-only hint tool and adds test coverage for task-tool binding, opt-out behavior, and single-index discovery probing.

Reviewed by Cursor Bugbot for commit 8b88e87. Bugbot is set up for automated code reviews on this repo.


cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


cursor · 2026-05-11T17:25:51Z

+    "    return p\n",
+    "\n",
+    "def _get_running_container_id(name: str) -> str:\n",
+    "    p = _run(f\"docker ps --filter name=^/{name}$ --format {{.ID}}\", check=False)\n",


Docker format template produces literal text instead of ID

Low Severity

The --format {{.ID}} inside an f-string produces the literal {.ID} instead of Docker's required Go template {{.ID}}. To emit double braces in the shell command, the f-string needs {{{{.ID}}}}. The notebook output confirms this: it prints ({.ID}) instead of the actual container ID. The code only works incidentally because any truthy output satisfies the "is running" check, but container_id is wrong.

Additional Locations (1)

examples/notebooks/grpo_openreward_toolathlon_qwen3_5_4b.ipynb#L224-L225
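
The brace-escaping behaviour described in this comment can be checked in isolation. The sketch below is a minimal reproduction, not the notebook's code: the container name `toolathlon` and the helper `build_ps_command` are hypothetical, and no Docker command is actually executed.

```python
name = "toolathlon"  # hypothetical container name, for illustration only

# In an f-string, "{{" and "}}" are escapes for single literal braces,
# so this yields "{.ID}" rather than Docker's Go template "{{.ID}}".
buggy = f"docker ps --filter name=^/{name}$ --format {{.ID}}"

# Doubling each brace again yields the intended literal "{{.ID}}".
fixed = f"docker ps --filter name=^/{name}$ --format {{{{.ID}}}}"


def build_ps_command(container_name: str) -> list[str]:
    """Build the command as an argument list (hypothetical helper).

    The template lives in a plain string, so no brace escaping is needed,
    and only the filter argument uses an f-string.
    """
    return [
        "docker", "ps",
        "--filter", f"name=^/{container_name}$",
        "--format", "{{.ID}}",
    ]


if __name__ == "__main__":
    print(buggy)
    print(fixed)
    print(build_ps_command(name))
```

Passing an argument list to `subprocess.run` without `shell=True` also avoids shell interpretation of `^`, `$`, and the braces, which is one reason to prefer it over a formatted shell string.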


wadeKeith

Great educational resource - the end-to-end GRPO + OpenReward notebook with Local ORS and Toolathlon Gym makes the workflow much more approachable. LGTM! Reviewed by Hermes Agent.

rycerzes added 5 commits May 11, 2026 20:03

fix: add hint tool and task tools discovery

6c7d590

- `HintParams`, and `list_task_tools` method to expose the hint tool during task tool discovery. - update `OpenRewardSpec` to support task-specific tool discovery

Merge remote-tracking branch 'origin' into feat/openreward-example

603cd3a

fix: tool specs normalization

5f37813

- refactor task tools probe indices

feat: test for task_tools_discovery_index parameter

f48d157

- fix ListToolsOutput import

feat: notebook for GRPO with OpenReward (Local ORS) + Toolathlon Gym …

8b88e87

…+ Qwen3.5-4B

cursor Bot reviewed May 11, 2026


wadeKeith reviewed May 12, 2026


rycerzes wants to merge 5 commits into huggingface:main from rycerzes:feat/openreward-recipe
