REALM: A Real-to-Sim Aligned Benchmark for Generalization in Robotic Manipulation

1CIIRC CTU, 2UvA

🚧 Website under construction 🚧

We are working on delivering a pre-print and open-sourcing the codebase for everyone to use as soon as possible. If you have questions, feedback, ideas, or are interested in using REALM please feel free to reach out to Martin Sedlacek.

Overview

Overview Figure

REALM is a large-scale real-to-sim aligned simulation environment and benchmark for generalization in robotic manipulation. It supports 7 distinct manipulation skills and stress-tests them against 15 perturbations. Through empirical validation, we show that evaluation results in simulation are strongly correlated with real-world performance.

Abstract

Vision-Language-Action (VLA) models empower robots to understand and execute tasks described by natural language instructions. However, a key challenge lies in their ability to generalize beyond the specific environments and conditions they were trained on. It is presently difficult to gain reliable insights into how these models adapt to unseen conditions and scenarios, as real-world evaluation is prohibitively expensive to scale and many current simulation benchmarks are saturated or lack the realism required to serve as a proxy for real-world performance. To address this gap, we present REALM, a new simulation environment and benchmark designed to evaluate the generalization capabilities of VLA models, with a specific emphasis on establishing a strong correlation between simulated and real-world performance through high-fidelity visuals and robot control alignment. Our environment offers a unified suite of 15 perturbation factors from 3 categories, some of which were previously unaddressed or scattered across multiple benchmarks and simulation frameworks. We support 7 common manipulation skills on the DROID robot platform and a highly diverse set of objects, which can be used to create hundreds of tasks. Finally, we establish two task sets that form our benchmark and evaluate the π₀, π₀-FAST, and GR00T N1.5 VLA models, showing that generalization and robustness remain an open challenge.

Policy Evaluations

Perturbations

"Pick up the spoon."

Perturbation Descriptions

Description of the perturbations used in our approach.

Real-to-Sim

Real-to-Sim Figure

Sim-to-real validation of REALM. Task progression is shown in the real world (x-axis) and in simulation (y-axis). Left: we observe a strong Pearson correlation (r), with data points close to the identity line (gray dashed), and a low Mean Maximum Rank Violation (MMRV) across 7 tasks under 5 visual and behavioral perturbations. Right: the results remain highly correlated under individual perturbations. We observe p < 0.001 for all settings, indicating that REALM's real-to-sim alignment is strong and statistically significant.
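The two alignment metrics above can be sketched in a few lines. This is a minimal illustration, not REALM's actual evaluation code: the Pearson r is the standard sample correlation, and the MMRV follows the commonly used formulation (as in SIMPLER-style real-to-sim metrics), where each setting is penalized by the largest real-world performance gap whose ordering the simulator gets wrong — the exact definition used in REALM is an assumption here.

```python
import numpy as np

def pearson_r(real, sim):
    """Pearson correlation between real and simulated task progression."""
    real, sim = np.asarray(real, float), np.asarray(sim, float)
    return np.corrcoef(real, sim)[0, 1]

def mmrv(real, sim):
    """Mean Maximum Rank Violation (assumed SIMPLER-style definition).

    For each setting i, find the largest real-world performance gap
    |real[i] - real[j]| over all settings j whose relative ordering in
    simulation disagrees with the real-world ordering, then average.
    0 means the simulator ranks all settings exactly as reality does.
    """
    real, sim = np.asarray(real, float), np.asarray(sim, float)
    n = len(real)
    worst = np.zeros(n)
    for i in range(n):
        for j in range(n):
            # Ordering disagreement between real and simulated results.
            if (real[i] < real[j]) != (sim[i] < sim[j]):
                worst[i] = max(worst[i], abs(real[i] - real[j]))
    return worst.mean()
```

A perfectly rank-preserving simulator yields MMRV = 0 even if absolute scores differ, which is why the paper reports it alongside Pearson r.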

Trajectory Following Figure

Visualization of the control alignment. A trajectory is replayed in simulation with default robot control (left) and with our aligned control (right). The yellow trajectory is the ground truth from the real robot; the blue one is from simulation. Our control alignment results in significantly more realistic trajectory following.

Attention Maps Figure

Attention maps from the π₀ action expert. We replay the same robot trajectory solving a task in the real world (top) and in simulation (bottom) and observe that the model attends largely to similar patches. We compute the cosine similarity between the attention maps from real and simulated images, averaged over approx. 280 video frames and all layers and attention heads of the π₀ action expert during the last step of the flow-matching process, yielding a high similarity score of 0.85/1.

Benchmark

This section presents a detailed evaluation of three VLAs under 15 distinct perturbations on selected tasks from the REALM benchmark.

Overall Task

Results on 8 tasks in the REALM-base task set, spanning 5 basic manipulation skills.


Benchmark Image 1
Put Task
Benchmark Image 3
Pick Task
Benchmark Image 5
Stack Task
Benchmark Image 7
Push Task
Benchmark Image 9
Rotate Task

Binary success rate:

Overall Task

Success rate (y-axis) on the REALM-base task set (x-axis). The violin plots [29] show Bayesian posteriors of the success rates, computed from a uniform Beta prior and the observed data.
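The posterior behind each violin follows directly from Beta-binomial conjugacy: a uniform Beta(1, 1) prior combined with s successes out of n binary trials gives a Beta(1 + s, 1 + n − s) posterior over the success rate. A minimal stdlib-only sketch (the sampling step is just one convenient way to feed a violin plot; the plotting library itself is left out):

```python
import random

def success_posterior(successes, trials, n_samples=10_000, seed=0):
    """Posterior over a policy's success rate from binary trial outcomes.

    Uniform Beta(1, 1) prior + binomial likelihood (conjugacy) yields a
    Beta(1 + successes, 1 + failures) posterior. Returns the posterior
    parameters and Monte Carlo samples suitable for a violin plot.
    """
    alpha = 1 + successes
    beta = 1 + (trials - successes)
    rng = random.Random(seed)
    samples = [rng.betavariate(alpha, beta) for _ in range(n_samples)]
    return alpha, beta, samples
```

For example, 7 successes in 10 episodes gives a Beta(8, 4) posterior with mean 8/12 ≈ 0.67; the width of the violin then honestly reflects how few episodes were run, rather than reporting a bare point estimate.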

Overall Task

Time to completion. Average time to (successfully) complete a task, for each model and task in the REALM-base task set. The time in seconds is obtained by dividing the number of simulation timesteps by the (fixed) control frequency, and thus does not account for model inference time or latency.
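The conversion from simulation timesteps to seconds is simple arithmetic; a one-line sketch (the function name and example frequency are illustrative, not taken from the benchmark):

```python
def time_to_completion(num_timesteps: int, control_hz: float) -> float:
    """Wall-clock task time implied by a rollout: simulation timesteps
    divided by the fixed control frequency. Deliberately excludes model
    inference time and latency, matching the benchmark's convention."""
    return num_timesteps / control_hz
```

For instance, a 150-step rollout at a 15 Hz control frequency corresponds to 10 seconds of simulated execution.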

Takeaways

In this work, we introduced REALM: a new high-fidelity simulation environment and benchmark for generalization in robotic manipulation, and performed extensive real-to-sim validation showing that our results can serve as a good proxy for real-world performance. We then evaluated three VLAs under 15 perturbation factors and showed that generalization and robustness are still far from solved for these models. Based on the extensive experiments conducted, we also draw the following conclusions:

  1. High-fidelity simulation with aligned robot control can serve as a valuable proxy for real-world performance, mitigating the issue of saturated simulation benchmarks.
  2. Despite their VLM backbones being pretrained on Internet-scale data, models show a noticeable performance drop under purely semantic perturbations, especially those requiring spatial reasoning.
  3. There is still a noticeable sensitivity to camera view for all models despite the unusually high diversity of viewpoints in the DROID dataset.
  4. Behavioral generalization across objects and their properties is the most challenging for all tested models.
  5. Conversely, all tested models seem to generalize well across known skills when the manipulated object remains the same.
  6. Reliability and robustness, especially under perturbations, remain highly challenging, and models still exhibit very low success rates on many basic manipulation tasks.

While we recognize the tremendous progress that enabled VLAs to start performing manipulation tasks in unseen settings, including many scenes in our simulation, we believe these results strongly indicate that current models still lack the capabilities for any autonomous real-world deployment.