EBench: Beyond the Robot Success-Rate Scalar

EBench diagnoses generalist mobile-manipulation policies across 26 tasks and nine capability and generalization axes, and finds that models nearly tied on success rate can have strikingly different capability profiles.

Strip the leaderboard down to its single number and you lose the only information that actually matters. That is the argument behind EBench, a new simulation benchmark from a group including Ning Gao and Jinliang Zheng that sets out to diagnose generalist mobile-manipulation policies beyond a single success-rate scalar. The phrasing is deliberate. The robot-learning field has spent two years compressing the capability of so-called generalist policies — the vision-language-action models meant to drive any robot through any task — into one headline percentage, and EBench's whole premise is that the headline percentage lies by omission.

The construction is the contribution. EBench comprises 26 diverse and challenging manipulation tasks, but the key move is that each task is annotated along five capability dimensions and four generalization dimensions, nine axes in total. Capability dimensions ask what a policy can physically do: atomic skills, dexterous manipulation, mobile manipulation, and so on. Generalization dimensions ask how it holds up when the world shifts away from training: new objects, new scenes, new instructions. A model no longer earns one score; it earns a profile across nine axes. That is the difference between a thermometer and a blood panel, and it is exactly the kind of multi-dimensional measurement the field has been missing.
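To make the thermometer-versus-blood-panel point concrete, here is a minimal sketch of how per-axis scores could be derived from task-level annotations. Everything in it is an assumption for illustration: the task names, the axis tags, the simple per-axis averaging, and the success rates are invented, and none of it is EBench's actual scoring code or data.

```python
from collections import defaultdict

# Hypothetical annotations: each task is tagged with the axes it exercises.
# These names and tags are illustrative, not EBench's real task list.
TASK_AXES = {
    "task_1": ["atomic_skill"],
    "task_2": ["dexterous"],
    "task_3": ["mobile"],
    "task_4": ["mobile", "dexterous"],
}


def overall(success_by_task):
    """Single-number summary: mean success rate over all tasks."""
    return sum(success_by_task.values()) / len(success_by_task)


def capability_profile(success_by_task, task_axes):
    """Mean success rate per axis (assumed aggregation: plain average)."""
    buckets = defaultdict(list)
    for task, rate in success_by_task.items():
        for axis in task_axes[task]:
            buckets[axis].append(rate)
    return {axis: sum(rates) / len(rates) for axis, rates in buckets.items()}


# Made-up success rates for two hypothetical policies.
policy_a = {"task_1": 0.75, "task_2": 0.0, "task_3": 1.0, "task_4": 0.25}
policy_b = {"task_1": 0.25, "task_2": 1.0, "task_3": 0.25, "task_4": 0.5}

for name, policy in [("policy_a", policy_a), ("policy_b", policy_b)]:
    print(name, overall(policy), capability_profile(policy, TASK_AXES))
```

Both hypothetical policies average out to the same overall number, yet one is strongest on the mobile axis and the other on the dexterous axis, which is the kind of difference a single scalar cannot show.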

"We evaluate state-of-the-art generalist manipulation models including $π_0$, $π_{0.5}$, XVLA, and InternVLA-A1, and reveal that models with near success rates exhibit strikingly different capability profiles"— arXiv 2606.18239, source

That sentence is the result the rest of the field should sit with. Four of the most-cited generalist policies (Physical Intelligence's pi-0 and pi-0.5, plus XVLA and InternVLA-A1) were run through the full battery, and the ones that look interchangeable on the scoreboard are not interchangeable in practice. The paper's specifics are blunt: pi-0.5 achieves the highest test success rate and the best train-to-test retention, InternVLA-A1 dominates mobile manipulation but collapses on dexterous tasks, and XVLA shows strengths on a set of atomic skills disjoint from the others'. Three of the four models, three non-overlapping shapes of competence, and an overall number that would have flattened all of it into a near-tie.

Why "collapses on dexterous tasks" is the line that matters

The InternVLA-A1 finding is the most instructive because it maps directly onto a real deployment decision. A policy that dominates mobile manipulation — driving a base around, reaching, gross pick-and-place — but collapses on dexterous tasks is precisely the robot you would want for warehouse tote-moving and precisely the robot you would not trust to thread a connector or manipulate a deformable object. If you bought it on its aggregate success rate, you would discover that gap only after deployment, in the failure mode that costs you. EBench's value proposition is that it surfaces the gap before procurement rather than after, by reporting where the competence lives rather than just how much of it there is on average.

The retention dimension deserves equal attention. The paper singles out pi-0.5 for the best train-to-test retention, a measure of how much of a policy's training-set skill survives the move to held-out test conditions. Retention is the quiet killer in robot learning: a model can post a gaudy training number and shed most of it the moment the lighting, the object set, or the instruction phrasing drifts. By scoring retention explicitly, EBench separates models that learned the task from models that memorized the demonstrations, a distinction the success-rate scalar erases entirely.
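One simple way to put a number on retention is the ratio of test success to train success. That ratio is an assumption made here for illustration; the paper may define retention differently, and the figures below are invented rather than taken from EBench.

```python
def retention(train_success, test_success):
    """Fraction of training-set success that survives on held-out conditions.

    Assumed definition (test / train); not necessarily the paper's formula.
    """
    if train_success <= 0:
        raise ValueError("train_success must be positive")
    return test_success / train_success


# Made-up numbers for two hypothetical policies.
memorizer = retention(train_success=0.8, test_success=0.2)
learner = retention(train_success=0.5, test_success=0.4)

print(f"memorizer retains {memorizer:.0%} of its training success")
print(f"learner retains {learner:.0%} of its training success")
```

Under this definition the policy with the gaudier training number keeps a quarter of it, while the more modest one keeps most of its skill and ends up ahead on the test split.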

The generalization axes are the honest part

Beyond capability profiling, EBench analyzes generalization from four representative perspectives and identifies the impact of different distribution-shift factors. This is where benchmarks usually get evasive, because generalization is where impressive policies tend to fall apart, and a single in-distribution success rate conveniently hides it. By decomposing distribution shift into distinct factors, and reporting which ones each model is brittle to, EBench gives a buyer or a researcher a map of the conditions under which a given policy can be trusted outside the lab. That is more useful than a generalization average, because the shift factors are not equally likely in any given deployment; knowing your robot is robust to new objects but brittle to new scenes tells you whether it survives a relocation.
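A rough sketch of that decomposition: compare in-distribution success with success under each shift factor, and report the drop per factor instead of one average. The factor names follow the examples given in this article, the absolute-drop metric is an assumption, and the numbers are invented for illustration.

```python
def shift_drops(in_dist_success, success_under_shift):
    """Absolute drop in success rate for each distribution-shift factor."""
    return {
        factor: in_dist_success - rate
        for factor, rate in success_under_shift.items()
    }


def most_brittle(drops):
    """Shift factor with the largest drop in success rate."""
    return max(drops, key=drops.get)


# Made-up numbers for one hypothetical policy.
in_dist = 0.75
under_shift = {
    "new_objects": 0.625,
    "new_scenes": 0.25,
    "new_instructions": 0.5,
}

drops = shift_drops(in_dist, under_shift)
print(drops)
print("most brittle to:", most_brittle(drops))
```

For this hypothetical policy the drop under new scenes dwarfs the drop under new objects, which is the relocation risk an averaged generalization score would blur.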

The skeptical reading has to acknowledge the obvious limit: EBench is a simulation benchmark. Simulated manipulation does not fully capture contact dynamics, sensor noise, calibration drift, or the long tail of real-world friction that separates a sim hero from a fielded robot. A model's nine-axis profile in EBench is a hypothesis about its real-world profile, not a measurement of it, and the sim-to-real gap can scramble these rankings. The 26-task suite is also a curated slice; capability and generalization are larger spaces than any benchmark fully covers, and the choice of axes is itself an editorial position about what matters. None of that is disqualifying, since every benchmark is a model of competence rather than competence itself, but it scopes the claims to "diagnostic signal" rather than "verdict," which is exactly how the authors frame it.

The reason this belongs on the sector's front page is that it disciplines the conversation. Generalist manipulation policies are the hottest claim in robotics right now, and they are almost always pitched with a single success number across a self-selected task set. EBench's contribution is to insist that two policies tied on that number can be opposite robots — one a mobile workhorse that fumbles fine work, another a dexterous specialist that cannot get around — and that the only honest way to compare them is to publish the profile. For a field perpetually tempted to crown a winner on one digit, that is the right kind of friction, and it is the kind of evidence that lets a reader audit a generalist-policy claim instead of taking it on faith.
