REALM is a large-scale realistic simulation environment and benchmark for generalization
in robotic manipulation. It supports 7 distinct manipulation skills and stress-tests them
against 15 perturbations. Through empirical validation, we show that evaluation results in
simulation are strongly correlated with real-world performance.
Abstract— Vision-Language-Action (VLA) models empower robots to understand and execute tasks described by natural language instructions. However, a key challenge lies in their ability to generalize beyond the specific environments and conditions they were trained on, which is presently difficult and expensive to evaluate in the real world. To address this gap, we present REALM, a new simulation environment and benchmark designed to evaluate the generalization capabilities of VLA models, with a specific emphasis on establishing a strong correlation between simulated and real-world performance through high-fidelity visuals and aligned robot control. Our environment offers a suite of 15 perturbation factors, 7 manipulation skills, and more than 4,000 objects. Finally, we establish two task sets that form our benchmark and evaluate the π0, π0-FAST, and GR00T N1.5 VLA models, showing that generalization and robustness remain an open challenge. More broadly, we also show that simulation provides a valuable proxy for the real world and allows us to systematically probe for and quantify the weaknesses and failure modes of VLAs.
While we recognize the tremendous progress that has enabled VLAs to begin performing manipulation tasks in unseen settings, including many scenes in our simulation, we believe these results indicate that current models still lack the capabilities required for autonomous real-world deployment.
To assess the robustness of VLAs under variable conditions, we implemented perturbations that change the visual, behavioral, and semantic properties of the tasks and environments. We adopt 14 of the 22 perturbations from the ☆-Gen taxonomy and introduce a 15th perturbation, V-LIGHT, for scene illumination.
Perturbation wheel: examples of how the scene can change under each perturbation for the instruction "Pick up the spoon."
Description of the perturbations used in our approach:
Sim-to-real validation of REALM. Task progression is shown in the real world (x-axis) and in simulation (y-axis). Left: We observe a strong Pearson correlation (r), with data points close to the identity line (gray dashed), and a low Mean Maximum Rank Violation (MMRV) on 7 tasks under 5 visual and behavioral perturbations. Right: The results are also highly correlated under individual perturbations. We observe a p-value of p < 0.001 between real and simulated rollouts for all settings, indicating that REALM is a strong proxy for real-world performance.
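For concreteness, the following is a minimal Python sketch of the two consistency metrics, assuming the MMRV definition introduced by the SIMPLER benchmark; the scores below are illustrative toy values, not actual results.

```python
# Sketch of the sim-to-real consistency metrics: Pearson correlation and
# Mean Maximum Rank Violation (MMRV). MMRV definition assumed to follow
# the SIMPLER benchmark; variable names and scores are illustrative.
import numpy as np
from scipy.stats import pearsonr

def mmrv(sim: np.ndarray, real: np.ndarray) -> float:
    """Mean Maximum Rank Violation between simulated and real scores.

    For each pair (i, j), the rank violation is |real[i] - real[j]| if
    simulation orders the pair differently than reality, else 0. MMRV
    averages, over i, the worst violation that i participates in.
    """
    n = len(sim)
    violations = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if (sim[i] < sim[j]) != (real[i] < real[j]):
                violations[i, j] = abs(real[i] - real[j])
    return float(violations.max(axis=1).mean())

# Toy example: task-progression scores for a handful of settings.
sim_scores = np.array([0.82, 0.55, 0.31, 0.74])
real_scores = np.array([0.78, 0.60, 0.25, 0.70])
r, p = pearsonr(sim_scores, real_scores)
print(f"Pearson r = {r:.3f} (p = {p:.3g}), MMRV = {mmrv(sim_scores, real_scores):.3f}")
```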
Visualization of the control alignment. A trajectory replay in simulation with default robot control (left) and our aligned control (right). The yellow trajectory is the ground truth from a real robot; the blue one is from simulation. Our control alignment yields significantly more realistic trajectory following.
Attention maps from the π0 action expert. We replay the same robot trajectory solving a task in reality (top) and in simulation (bottom) and observe that the model largely attends to similar patches. We compute the cosine similarity between the attention maps from real and simulated images, averaged over approx. 280 video frames and over all layers and attention heads of the π0 action expert at the last step of the flow-matching process, yielding a high similarity score of 0.85/1.
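A minimal sketch of this similarity computation, assuming the per-frame attention maps have already been extracted from the model; the array shapes and names below are illustrative, not part of any released API.

```python
# Compare attention maps from paired real and simulated frames by
# averaging cosine similarity over frames, layers, and heads.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    a, b = a.ravel(), b.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def mean_attention_similarity(attn_real: np.ndarray, attn_sim: np.ndarray) -> float:
    """Average cosine similarity over frames, layers, and heads.

    Assumed shape: [frames, layers, heads, patches, patches].
    """
    F, L, H = attn_real.shape[:3]
    sims = [
        cosine_similarity(attn_real[f, l, h], attn_sim[f, l, h])
        for f in range(F) for l in range(L) for h in range(H)
    ]
    return float(np.mean(sims))
```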
This section presents a detailed evaluation of three VLAs under 15 distinct perturbations on the 8 tasks in the REALM-base task set.
Results on 8 tasks in the REALM-base task set spanning 5 basic manipulation skills.
Success Rate (y-axis) on the REALM-base task set (x-axis). The violin plots show Bayesian posteriors of success rates under a uniform Beta prior, given the observed data.
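This posterior has a closed form: with a uniform Beta(1, 1) prior and s successes out of n rollouts, the posterior over the success rate is Beta(1 + s, 1 + n − s). A minimal sketch (the counts are illustrative):

```python
# Bayesian success-rate posterior under a uniform Beta(1, 1) prior:
# s successes in n rollouts gives a Beta(1 + s, 1 + n - s) posterior.
from scipy.stats import beta

def success_rate_posterior(successes: int, rollouts: int):
    return beta(1 + successes, 1 + rollouts - successes)

post = success_rate_posterior(successes=14, rollouts=25)  # illustrative counts
lo, hi = post.interval(0.95)                              # 95% credible interval
print(f"mean = {post.mean():.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```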
Time to completion. Average time to successfully complete a task for each model and task in the REALM-base task set. The time in seconds is obtained by dividing the number of simulation timesteps by the fixed control frequency, and thus does not account for model inference time or latency.
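The computation amounts to a single division; a minimal sketch, with the control-frequency value assumed purely for illustration:

```python
# Completion time in seconds: simulation timesteps divided by the fixed
# control frequency. The 10 Hz default is an assumed, illustrative value.
def time_to_completion(num_timesteps: int, control_hz: float = 10.0) -> float:
    """Seconds to complete a task, excluding inference time and latency."""
    return num_timesteps / control_hz

print(time_to_completion(312))  # e.g., 312 steps at 10 Hz -> 31.2 s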
This section presents a detailed evaluation of the three VLA models under 15 distinct perturbations on the 10 tasks in the REALM-base and REALM-articulated task sets. For each task, we show a visualization on the left and, on the right, the task progression in the nominal (default) setting (black) and under individual perturbations (colored curves).
@article{sedlacek2025realm,
title={REALM: A Real-to-Sim Validated Benchmark for Generalization in Robotic Manipulation},
author={Martin Sedlacek and Pavlo Yefanov and Georgy Ponimatkin and Jai Bardhan and Simon Pilc and Mederic Fourmy and Evangelos Kazakos and Cees G. M. Snoek and Josef Sivic and Vladimir Petrik},
journal={arXiv preprint arXiv:2512.19562},
year={2025}
}