REALM: A Real-to-Sim Aligned Benchmark for Generalization in Robotic Manipulation

1CIIRC CTU, 2UvA

🚧 Website under construction 🚧

We are working on delivering a pre-print and open-sourcing the codebase for everyone to use as soon as possible. If you have questions, feedback, ideas, or are interested in using REALM please feel free to reach out to Martin Sedlacek.

Overview

Overview Figure

REALM is a large-scale real-to-sim aligned simulation environment and benchmark for generalization in robotic manipulation. It supports 7 distinct manipulation skills and stress-tests them against 15 perturbations. Through empirical validation, we show that evaluation results in simulation are strongly correlated with real-world performance.

Abstract

Vision-Language-Action (VLA) models empower robots to understand and execute tasks described by natural language instructions. However, a key challenge lies in their ability to generalize beyond the specific environments and conditions they were trained on. It is presently difficult to gain reliable insights into how these models adapt to unseen conditions and scenarios, as real-world evaluation is prohibitively expensive to scale and many current simulation benchmarks are saturated or lack the realism required to serve as a proxy for real-world performance. To address this gap, we present REALM, a new simulation environment and benchmark designed to evaluate the generalization capabilities of VLA models, with a specific emphasis on establishing a strong correlation between simulated and real-world performance through high-fidelity visuals and robot control alignment. Our environment offers a unified suite of 15 perturbation factors from 3 categories, some of which were previously unaddressed or scattered across multiple benchmarks and simulation frameworks. We support 7 common manipulation skills on the DROID robot platform and a highly diverse set of objects, which can be used to create hundreds of tasks. Finally, we establish two task sets that form our benchmark and evaluate the π₀, π₀-FAST, and GR00T N1.5 VLA models, showing that generalization and robustness remain an open challenge.

Policy Evaluations

Perturbations

"Pick up the spoon."

Perturbation Descriptions

Description of the perturbations used in our approach.

Real-to-Sim

Real-to-Sim Figure

Sim-to-real validation of REALM. Task progression is shown in the real world (x-axis) and in simulation (y-axis). Left: we observe a strong Pearson correlation (r), with data points close to the identity line (gray dashed), and a low Mean Maximum Rank Violation (MMRV) across 7 tasks under 5 visual and behavioral perturbations. Right: the results remain highly correlated under individual perturbations. We observe p < 0.001 for all settings, indicating that REALM's real-to-sim alignment is strong and statistically significant.
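The two alignment metrics above can be sketched in a few lines. This is a minimal illustration, not REALM's actual evaluation code: the Pearson r is the standard sample correlation, and the MMRV follows the commonly used formulation (as in SIMPLER-style real-to-sim metrics), where each setting is penalized by the largest real-world performance gap whose ordering the simulator gets wrong — the exact definition used in REALM is an assumption here.

```python
import numpy as np

def pearson_r(real, sim):
    """Pearson correlation between real and simulated task progression."""
    real, sim = np.asarray(real, float), np.asarray(sim, float)
    return np.corrcoef(real, sim)[0, 1]

def mmrv(real, sim):
    """Mean Maximum Rank Violation (assumed SIMPLER-style definition).

    For each setting i, find the largest real-world performance gap
    |real[i] - real[j]| over all settings j whose relative ordering in
    simulation disagrees with the real-world ordering, then average.
    0 means the simulator ranks all settings exactly as reality does.
    """
    real, sim = np.asarray(real, float), np.asarray(sim, float)
    n = len(real)
    worst = np.zeros(n)
    for i in range(n):
        for j in range(n):
            # Ordering disagreement between real and simulated results.
            if (real[i] < real[j]) != (sim[i] < sim[j]):
                worst[i] = max(worst[i], abs(real[i] - real[j]))
    return worst.mean()
```

A perfectly rank-preserving simulator yields MMRV = 0 even if absolute scores differ, which is why the paper reports it alongside Pearson r.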

Trajectory Following Figure

Visualization of the control alignment. A trajectory is replayed in simulation with default robot control (left) and with our aligned control (right). The yellow trajectory is the ground truth from the real robot; the blue one is from simulation. Our control alignment results in significantly more realistic trajectory following.

Attention Maps Figure

Attention maps from the π₀ action expert. We replay the same robot trajectory solving a task in the real world (top) and in simulation (bottom) and observe that the model attends largely to similar patches. We compute the cosine similarity between the attention maps from real and simulated images, averaged over approx. 280 video frames and all layers and attention heads of the π₀ action expert during the last step of the flow-matching process, yielding a high similarity score of 0.85/1.

Benchmark

This section presents a detailed evaluation of three VLAs under 15 distinct perturbations on selected tasks from the REALM benchmark.

Overall Task

Results on 8 tasks in the REALM-base task set, spanning 5 basic manipulation skills.


Benchmark Image 1
Put Task
Benchmark Image 3
Pick Task
Benchmark Image 5
Stack Task
Benchmark Image 7
Push Task
Benchmark Image 9
Rotate Task

Binary success rate:

Overall Task

Success rate (y-axis) on the REALM-base task set (x-axis). The violin plots [29] show Bayesian posteriors of the success rates, computed from a uniform Beta prior and the observed data.
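The posterior behind each violin follows directly from Beta-binomial conjugacy: a uniform Beta(1, 1) prior combined with s successes out of n binary trials gives a Beta(1 + s, 1 + n − s) posterior over the success rate. A minimal stdlib-only sketch (the sampling step is just one convenient way to feed a violin plot; the plotting library itself is left out):

```python
import random

def success_posterior(successes, trials, n_samples=10_000, seed=0):
    """Posterior over a policy's success rate from binary trial outcomes.

    Uniform Beta(1, 1) prior + binomial likelihood (conjugacy) yields a
    Beta(1 + successes, 1 + failures) posterior. Returns the posterior
    parameters and Monte Carlo samples suitable for a violin plot.
    """
    alpha = 1 + successes
    beta = 1 + (trials - successes)
    rng = random.Random(seed)
    samples = [rng.betavariate(alpha, beta) for _ in range(n_samples)]
    return alpha, beta, samples
```

For example, 7 successes in 10 episodes gives a Beta(8, 4) posterior with mean 8/12 ≈ 0.67; the width of the violin then honestly reflects how few episodes were run, rather than reporting a bare point estimate.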

Overall Task

Time to completion. Average time to (successfully) complete a task, for each model and task in the REALM-base task set. The time in seconds is obtained by dividing the number of simulation timesteps by the (fixed) control frequency, and thus does not account for model inference time or latency.
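The conversion from simulation timesteps to seconds is simple arithmetic; a one-line sketch (the function name and example frequency are illustrative, not taken from the benchmark):

```python
def time_to_completion(num_timesteps: int, control_hz: float) -> float:
    """Wall-clock task time implied by a rollout: simulation timesteps
    divided by the fixed control frequency. Deliberately excludes model
    inference time and latency, matching the benchmark's convention."""
    return num_timesteps / control_hz
```

For instance, a 150-step rollout at a 15 Hz control frequency corresponds to 10 seconds of simulated execution.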

Takeaways

In this work, we introduced REALM: a new high-fidelity simulation environment and benchmark for generalization in robotic manipulation, and performed extensive real-to-sim validation showing that our results can serve as a good proxy for real-world performance. We then evaluated three VLAs under 15 perturbation factors and showed that generalization and robustness are still far from solved for these models. Based on the extensive experiments conducted, we also draw the following conclusions:

  1. High-fidelity simulation with aligned robot control can serve as a valuable proxy for real-world performance, mitigating the issue of saturated simulation benchmarks.
  2. Despite their VLM backbones being pretrained on Internet-scale data, models show a noticeable performance drop under purely semantic perturbations, especially those requiring spatial reasoning.
  3. There is still a noticeable sensitivity to camera view for all models despite the unusually high diversity of viewpoints in the DROID dataset.
  4. Behavioral generalization across objects and their properties is the most challenging for all tested models.
  5. Conversely, all tested models seem to generalize well across known skills when the manipulated object remains the same.
  6. Reliability and robustness, especially under perturbations, remain highly challenging, and models still exhibit very low success rates on many basic manipulation tasks.

While we recognize the tremendous progress that enabled VLAs to start performing manipulation tasks in unseen settings, including many scenes in our simulation, we believe these results strongly indicate that current models still lack the capabilities for any autonomous real-world deployment.