[Discussion] RoboGate: Adversarial Safety Benchmark for Pick-and-Place (68 scenarios, 50K+ failure dictionary) #508
liveplex-cpu started this conversation in Ideas
RoboGate — Adversarial Safety Benchmark for Industrial Pick-and-Place
Hi Isaac Lab-Arena team and community,
We would like to share RoboGate, an adversarial safety benchmark designed to complement existing Isaac Lab-Arena benchmarks like Lightwheel RoboFinals and RoboCasa Tasks.
What RoboGate Does
Focus: Pre-deployment safety validation — answering "Is this policy safe to deploy?" rather than "How well does this policy perform?"
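A pre-deployment safety gate of this kind can be framed as a hard pass/fail check rather than a score. Here is a minimal sketch of the idea; the function name, result schema, and `safety_violation` key are illustrative assumptions, not RoboGate's actual API:

```python
def safety_gate(results, max_violations=0):
    """Pre-deployment gate: pass only if adversarial safety violations
    stay within budget (default: zero tolerance).

    `results` is a list of per-scenario dicts; the keys used here are
    hypothetical, not RoboGate's real output format.
    """
    violations = sum(1 for r in results if r["safety_violation"])
    return violations <= max_violations

# Two example scenario outcomes (made-up names for illustration).
runs = [
    {"scenario": "occluded_grasp", "safety_violation": False},
    {"scenario": "adversarial_clutter", "safety_violation": True},
]
print(safety_gate(runs))  # a single violation fails the gate
```

The design choice this reflects: a deployment gate answers a binary question, so a single adversarial failure blocks deployment even if the aggregate success rate looks strong.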
5-Model VLA Leaderboard
Every VLA model we evaluated, including NVIDIA's official GR00T N1.6, scores 0% on scenarios that a scripted IK controller solves with a 100% success rate.
How RoboGate Complements Lightwheel Benchmarks
The two approaches are complementary: Lightwheel benchmarks measure how broadly a policy generalizes across tasks, while RoboGate probes how safely it behaves under adversarial conditions. Together, they provide a complete picture: generalization breadth (Lightwheel) plus safety depth (RoboGate).
Pull Request
We have an open PR integrating the benchmark into Isaac Lab-Arena:
The PR adds a `--mock` mode for CI/CD testing without GPU resources, built on `ArenaEnvBuilder`.
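A mock mode like this typically swaps the GPU-backed environment for a stub with the same interface, so the benchmark harness can be exercised end to end in CI. The sketch below shows the pattern under stated assumptions: `MockArenaEnv`, its observation shape, and the step return format are all hypothetical, not the actual PR implementation:

```python
import argparse

class MockArenaEnv:
    """Stand-in for a GPU-backed Isaac Lab Arena env (hypothetical stub).

    Exposes the same reset/step surface so CI can run the benchmark
    pipeline without simulator or GPU access.
    """
    def reset(self):
        return {"obs": [0.0] * 7}  # placeholder observation

    def step(self, action):
        # Report a completed, violation-free step so downstream
        # bookkeeping can be tested end to end.
        return {"obs": [0.0] * 7}, 0.0, True, {"safety_violation": False}

def build_env(mock: bool):
    """Return the mock env when --mock is set; otherwise a real env would be built."""
    if mock:
        return MockArenaEnv()
    raise RuntimeError("GPU environment unavailable; pass --mock for CI runs")

parser = argparse.ArgumentParser()
parser.add_argument("--mock", action="store_true")
args = parser.parse_args(["--mock"])  # CI would pass --mock on the command line

env = build_env(mock=args.mock)
env.reset()
_, _, done, info = env.step([0.0] * 7)
print(done, info["safety_violation"])
```

The key property is that the flag only changes environment construction; everything downstream of `build_env` runs the identical code path in CI and on real hardware.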
We welcome feedback on how RoboGate can best integrate with the Arena ecosystem. Happy to adjust the benchmark design based on maintainer and community input.
— AgentAI Co., Ltd.