REALM: A Real-to-Sim Validated Benchmark for Generalization in Robotic Manipulation

Martin Sedlacek1,2, Pavlo Yefanov1,2, Georgy Ponimatkin1,2, Jai Bardhan1, Simon Pilc1,2, Mederic Fourmy1, Evangelos Kazakos1, Cees Snoek3, Josef Sivic1, Vladimir Petrik1
1 Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague; 2 Faculty of Electrical Engineering, Czech Technical University in Prague; 3 University of Amsterdam

Teaser video

Overview

REALM is a large-scale, realistic simulation environment and benchmark for generalization in robotic manipulation. It supports 7 distinct manipulation skills and stress-tests them against 15 perturbations. Through empirical validation, we show that evaluation results in simulation are strongly correlated with real-world performance.

Abstract

Vision-Language-Action (VLA) models empower robots to understand and execute tasks described by natural language instructions. However, a key challenge lies in their ability to generalize beyond the specific environments and conditions they were trained on, which is presently difficult and expensive to evaluate in the real world. To address this gap, we present REALM, a new simulation environment and benchmark designed to evaluate the generalization capabilities of VLA models, with a specific emphasis on establishing a strong correlation between simulated and real-world performance through high-fidelity visuals and aligned robot control. Our environment offers a suite of 15 perturbation factors, 7 manipulation skills, and more than 4,000 objects. Finally, we establish two task sets that form our benchmark and evaluate the π0, π0-FAST, and GR00T N1.5 VLA models, showing that generalization and robustness remain an open challenge. More broadly, we also show that simulation gives us a valuable proxy for the real world and allows us to systematically probe for and quantify the weaknesses and failure modes of VLAs.

Takeaways

  1. High-fidelity simulation with aligned robot control can serve as a valuable proxy for real-world performance, mitigating the issue of saturated simulation benchmarks.
  2. Despite VLM backbones pretrained on Internet-scale data, models show a noticeable performance drop under purely semantic perturbations.
  3. All models remain noticeably sensitive to the camera viewpoint, despite the unusually high diversity of viewpoints in the DROID dataset.
  4. Behavioral generalization across objects and their properties is the most challenging for all tested models.
  5. Conversely, all tested models seem to generalize well across known skills when the manipulated object remains the same.
  6. Reliability and robustness, especially under perturbations, remain highly challenging: models still exhibit very low success rates on many basic manipulation tasks.

While we recognize the tremendous progress that enabled VLAs to start performing manipulation tasks in unseen settings, including many scenes in our simulation, we believe these results indicate that current models still lack the capabilities for autonomous real-world deployment.

Example Policy Evaluations

Perturbations

To assess the robustness of VLAs under variable conditions, we implemented perturbations that change the visual, behavioral, and semantic properties of the tasks and environments. We adopt 14 of the 22 perturbations from the ☆-Gen taxonomy and introduce a separate 15th perturbation, V-LIGHT, for scene illumination.


Visualizer

Try hovering over the colored segments in the wheel below to see examples of how the scene can change!

"Pick up the spoon."

Descriptions of the perturbations used in our approach:

Perturbation Descriptions

Real-to-Sim Validation

Real-to-sim validation of REALM. Task progression is shown in the real world (x-axis) and in simulation (y-axis). Left: We show a strong Pearson correlation (r), with data points close to the identity line (gray, dashed), and a low Mean Maximum Rank Violation (MMRV) on 7 tasks under 5 visual and behavioral perturbations. Right: The results are also highly correlated under individual perturbations. We observe a p-value of p < 0.001 between real and simulated rollouts for all settings, indicating that REALM is a strong proxy for real-world performance.
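
For reference, a minimal sketch of how these two agreement metrics can be computed, assuming the standard MMRV formulation (pairwise rank violations weighted by the simulation score gap); the score values below are made up for illustration:

```python
import numpy as np
from scipy.stats import pearsonr

def mmrv(real, sim):
    """Mean Maximum Rank Violation: for each setting i, take the largest
    sim-score gap |sim_i - sim_j| over settings j whose ordering in
    simulation disagrees with their ordering in the real world."""
    real, sim = np.asarray(real, float), np.asarray(sim, float)
    worst = np.zeros(len(real))
    for i in range(len(real)):
        for j in range(len(real)):
            if (sim[i] < sim[j]) != (real[i] < real[j]):
                worst[i] = max(worst[i], abs(sim[i] - sim[j]))
    return worst.mean()

# Hypothetical task-progression scores for 7 task/perturbation settings.
real = [0.90, 0.40, 0.70, 0.20, 0.60, 0.80, 0.50]
sim  = [0.85, 0.45, 0.65, 0.25, 0.55, 0.75, 0.50]
r, p = pearsonr(real, sim)
print(f"Pearson r = {r:.3f} (p = {p:.2g}), MMRV = {mmrv(real, sim):.3f}")
```

A low MMRV means that whenever simulation ranks one setting above another, the real world agrees, which is what matters when using simulation to compare policies.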


Visualization of the control alignment. A trajectory is replayed in simulation with the default robot control (left) and with our aligned control (right). The yellow trajectory is the ground truth from a real robot; the blue one is from simulation. Our control alignment results in significantly more realistic trajectory following.
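
One simple way to quantify such alignment is the mean end-effector position error between the real trajectory and its simulated replay. A minimal sketch, assuming time-aligned position samples (the function name and array shapes are illustrative, not the exact metric used in the paper):

```python
import numpy as np

def trajectory_error(real_xyz: np.ndarray, sim_xyz: np.ndarray) -> float:
    """Mean Euclidean end-effector position error (meters) between a real
    ground-truth trajectory and its simulated replay, assuming both are
    sampled at the same timesteps and shaped (T, 3)."""
    return float(np.linalg.norm(real_xyz - sim_xyz, axis=-1).mean())
```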


Attention maps from the π0 action expert. We replay the same robot trajectory solving a task in reality (top) and in simulation (bottom) and observe that the model largely attends to similar patches. We compute the cosine similarity between the attention maps from real and simulated images at the last step of the flow-matching process, averaged over approx. 280 frames of the video and over all layers and attention heads of the π0 action expert, yielding a high similarity score of 0.85 out of 1.
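
As an illustration, the similarity score can be computed as below; the tensor layout (frames, layers, heads, patches) is an assumption made for this sketch, not the exact shapes used by π0:

```python
import numpy as np

def attention_similarity(att_real: np.ndarray, att_sim: np.ndarray) -> float:
    """Cosine similarity between real and simulated attention maps,
    averaged over frames, layers, and heads.

    att_*: attention over image patches at the last flow-matching step,
    shaped (frames, layers, heads, patches) -- an illustrative layout.
    """
    a = att_real.reshape(-1, att_real.shape[-1])
    b = att_sim.reshape(-1, att_sim.shape[-1])
    cos = (a * b).sum(-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8
    )
    return float(cos.mean())  # ~0.85 for the rollout shown above
```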

Benchmarking Results

This section presents a detailed evaluation of three VLAs under 15 distinct perturbations on the 8 tasks of the REALM-base task set.

Results on the 8 tasks in the REALM-base task set, spanning 5 basic manipulation skills.


Success rate (y-axis) on the REALM-base task set (x-axis). The violin plots show the Bayesian posteriors of the success rates given a uniform Beta prior and the observed data.
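
Concretely, with a uniform Beta(1, 1) prior and k successes in n rollouts, the posterior over the success rate is Beta(1 + k, 1 + n - k); the densities drawn as violins can be obtained as in this sketch (the counts are made up):

```python
import numpy as np
from scipy.stats import beta

k, n = 12, 50                         # hypothetical: 12 successes in 50 rollouts
posterior = beta(1 + k, 1 + (n - k))  # conjugate update of the uniform prior

x = np.linspace(0.0, 1.0, 500)
density = posterior.pdf(x)            # this curve is what a violin shows
lo, hi = posterior.ppf([0.025, 0.975])
print(f"posterior mean {posterior.mean():.3f}, 95% credible interval [{lo:.3f}, {hi:.3f}]")
```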

Time to completion. Average time to successfully complete a task, for each model and task in the REALM-base task set. The time in seconds is obtained by dividing the number of simulation timesteps by the (fixed) control frequency, and thus does not account for model inference time or latency.
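
In code, the conversion is a single division; the control frequency below is a placeholder, not the value used by REALM:

```python
CONTROL_HZ = 15.0  # placeholder fixed control frequency (Hz)

def time_to_completion(num_sim_timesteps: int, control_hz: float = CONTROL_HZ) -> float:
    """Task duration in seconds, excluding model inference time and latency."""
    return num_sim_timesteps / control_hz

print(time_to_completion(180))  # 180 steps at 15 Hz -> 12.0 s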



Individual Task Results

This section presents a detailed evaluation of the three VLA models under 15 distinct perturbations on 10 tasks from the REALM-base and REALM-articulated task sets. On the left, we show a visualization of the task; on the right, we show the task progression in the nominal, default setting (black) and under individual perturbations (colored curves).


Instruction: Put the green block in the bowl

Task visualization

Results under perturbations


Instruction: Put the banana in the box

Task visualization

Results under perturbations


Instruction: Rotate the marker

Task visualization

Results under perturbations


Instruction: Rotate the mug

Task visualization

Results under perturbations


Instruction: Pick up the spoon

Task visualization

Results under perturbations


Instruction: Pick up the water bottle

Task visualization

Results under perturbations


Instruction: Stack the green block on top of the yellow block

Task visualization

Results under perturbations


Instruction: Push the switch

Task visualization

Results under perturbations


Instruction: Open the drawer

Task visualization

Results under perturbations

BibTeX

@article{sedlacek2025realm,
    title={REALM: A Real-to-Sim Validated Benchmark for Generalization in Robotic Manipulation},
    author={Martin Sedlacek and Pavlo Yefanov and Georgy Ponimatkin and Jai Bardhan and Simon Pilc and Mederic Fourmy and Evangelos Kazakos and Cees G. M. Snoek and Josef Sivic and Vladimir Petrik},
    journal={arXiv preprint arXiv:2512.19562},
    year={2025}
}