🚧 Website under construction 🚧
We are working on delivering a pre-print and open-sourcing the codebase for everyone to use as soon as possible. If you have questions, feedback, or ideas, or are interested in using REALM, please feel free to reach out to Martin Sedlacek.
REALM is a large-scale real-to-sim aligned simulation environment and benchmark for generalization
in robotic manipulation. It supports 7 distinct manipulation skills and stress-tests them
against 15 perturbations. Through empirical validation, we show that evaluation results in
simulation are strongly correlated with real-world performance.
Vision-Language-Action (VLA) models empower robots to understand and execute tasks described by natural language instructions. However, a key challenge lies in their ability to generalize beyond the specific environments and conditions they were trained on. It is presently difficult to gain reliable insights into how these models adapt to unseen conditions and scenarios, as real-world evaluation is prohibitively expensive to scale and many current simulation benchmarks are saturated or lack the realism required to serve as a proxy for real-world performance. To address this gap, we present REALM, a new simulation environment and benchmark designed to evaluate the generalization capabilities of VLA models, with a specific emphasis on establishing a strong correlation between simulated and real-world performance through high-fidelity visuals and robot control alignment. Our environment offers a unified suite of 15 perturbation factors from 3 categories, some of which were previously unaddressed or scattered across multiple benchmarks and simulation frameworks. We support 7 common manipulation skills on the DROID robot platform and a highly diverse set of objects, which can be used to create hundreds of tasks. Finally, we establish two task sets that form our benchmark and evaluate the π₀, π₀-FAST, and GR00T N1.5 VLA models, showing that generalization and robustness remain an open challenge.
"Pick up the spoon."
Description of the perturbations used in our approach.
Sim-to-real validation of REALM. Task progression is shown in the real world (x-axis) and simulation (y-axis). Left: We show a strong Pearson correlation (r) with datapoints close to the identity line (gray dashed line) and a low Mean Maximum Rank Violation (MMRV) on 7 tasks under 5 visual and behavioral perturbations. Right: The results are also highly correlated under individual perturbations. We observe a p-value of p < 0.001 for all settings, indicating that REALM has a strong and statistically significant real-to-sim alignment.
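The two alignment metrics above can be sketched as follows. This is an illustrative implementation, not the benchmark's released code: `pearson_r` is the standard correlation coefficient, and `mmrv` follows the SIMPLER-style definition of Mean Maximum Rank Violation, where a violation occurs whenever simulation orders two evaluations differently than reality, weighted by the real-world performance gap.

```python
import numpy as np

def pearson_r(real, sim):
    """Pearson correlation between paired real and simulated scores."""
    real, sim = np.asarray(real, float), np.asarray(sim, float)
    return float(np.corrcoef(real, sim)[0, 1])

def mmrv(real, sim):
    """Mean Maximum Rank Violation (SIMPLER-style sketch).

    For each entry i, find the largest real-world score gap |real_i - real_j|
    among pairs whose ordering the simulator gets wrong, then average these
    maxima over all entries. 0 means sim preserves every real-world ranking.
    """
    real, sim = np.asarray(real, float), np.asarray(sim, float)
    n = len(real)
    violations = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # An ordering flip between sim and real is a rank violation,
            # weighted by how far apart the real scores actually are.
            if (sim[i] < sim[j]) != (real[i] < real[j]):
                violations[i] = max(violations[i], abs(real[i] - real[j]))
    return float(violations.mean())
```

A low MMRV combined with a high Pearson r is what makes simulated scores usable as a proxy: policies are not only scored similarly, they are also ranked consistently.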
Visualization of the control alignment. A trajectory replay in simulation with default robot control (left) and our aligned control (right). The yellow trajectory is the ground truth from the real robot; the blue one is from simulation. Our control alignment results in significantly more realistic trajectory following.
Attention maps from the π₀ action expert. We replay the same robot trajectory solving a task in reality (top) and in simulation (bottom) and observe that the model largely attends to similar patches. We compute the cosine similarity between the attention maps from real and simulated images, averaged over approximately 280 frames in the video and over all layers and attention heads in the π₀ action expert during the last step of the flow-matching process, yielding a high similarity score of 0.85/1.
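The averaging described above can be sketched as below. The function name and the tensor layout (frames × layers × heads × patches) are assumptions for illustration; extracting the actual attention tensors from the π₀ action expert is model-specific and not shown.

```python
import numpy as np

def attention_map_similarity(real_attn, sim_attn):
    """Mean cosine similarity between paired real/sim attention maps.

    real_attn, sim_attn: arrays of shape (frames, layers, heads, patches)
    holding attention weights over image patches for matched frames.
    Returns the cosine similarity per (frame, layer, head), averaged.
    """
    real = np.asarray(real_attn, float)
    sim = np.asarray(sim_attn, float)
    # Flatten everything except the patch dimension so each row is one map.
    real = real.reshape(-1, real.shape[-1])
    sim = sim.reshape(-1, sim.shape[-1])
    num = (real * sim).sum(axis=-1)
    denom = np.linalg.norm(real, axis=-1) * np.linalg.norm(sim, axis=-1)
    return float((num / denom).mean())
```

A score near 1 indicates that the model "looks at" the same image regions in simulation as in reality, which is additional evidence of visual alignment beyond raw task success rates.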
This section presents a detailed evaluation of three VLAs under 15 distinct perturbations on selected tasks from the REALM benchmark.
Results on 8 tasks in the REALM-base task set spanning 5 basic manipulation skills.
Binary success rate:
Success rate (y-axis) on the REALM-base task set (x-axis). Violin plots [29] show the Bayesian posterior of the success rate under a uniform Beta prior, given the observed data.
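The posterior shown in these violin plots has a simple closed form: with a uniform Beta(1, 1) prior and k successes in n binary trials, the posterior over the success rate is Beta(1 + k, 1 + n − k). A minimal sketch (function name and sampling setup are illustrative, not the benchmark's code):

```python
import numpy as np

def success_rate_posterior(successes, trials, n_samples=10_000, seed=0):
    """Posterior over a binary success rate under a uniform Beta(1,1) prior.

    Conjugacy gives the posterior Beta(1 + k, 1 + n - k) in closed form;
    the returned samples are what a violin plot of it would visualize.
    """
    rng = np.random.default_rng(seed)
    a, b = 1 + successes, 1 + trials - successes
    samples = rng.beta(a, b, size=n_samples)
    post_mean = a / (a + b)  # closed-form posterior mean (k+1)/(n+2)
    return samples, post_mean
```

Plotting the full posterior rather than a point estimate makes the uncertainty from the finite number of evaluation rollouts visible, which matters when comparing models whose raw success counts are close.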
Time to completion. Average time to successfully complete a task, for each model and task in the REALM-base task set. The time in seconds is obtained by dividing the number of timesteps in simulation by the (fixed) control frequency, and thus does not account for model inference time or latency.
In this work, we introduced REALM, a new high-fidelity simulation environment and benchmark for generalization in robotic manipulation, and performed extensive real-to-sim validation showing that our results can serve as a good proxy for real-world performance. We then evaluated three VLAs under 15 perturbation factors and showed that generalization and robustness are still far from solved for these models. Based on the extensive experiments conducted, we also draw the following conclusions:
While we recognize the tremendous progress that has enabled VLAs to start performing manipulation tasks in unseen settings, including many scenes in our simulation, we believe these results strongly indicate that current models still lack the capabilities required for autonomous real-world deployment.