Recoleta Item Note

Foundational World Models Accurately Detect Bimanual Manipulator Failures

This paper proposes a method for runtime failure detection in bimanual robots: train a probabilistic world model in the compressed latent space of a pretrained visual foundation model, and use its uncertainty as an…

world-modelfailure-detectionbimanual-manipulationconformal-predictionrobot-anomaly-detectionfoundation-vision-model

This paper proposes a method for runtime failure detection in bimanual robots: train a probabilistic world model in the compressed latent space of a pretrained visual foundation model, and use its uncertainty as an anomaly score. The core value is that it can detect impending or ongoing manipulation failures more reliably without explicitly enumerating failure modes in a high-dimensional space.

  • The paper addresses the problem of how to detect anomalous failures in real time during deployment of bimanual visuomotor robots; this matters because failures can lead to degraded performance, equipment damage, and even risks to human safety.
  • The state space of such robots is enormous, including multi-view images, actions, and proprioceptive signals, making it almost impossible to manually define all failure modes.
  • Traditional statistical or simple reconstruction-based anomaly detection methods struggle with high-dimensional, multimodal, temporally correlated robot behavior data.
  • First use the pretrained visual foundation model NVIDIA Cosmos Tokenizer to compress multi-view images into a latent space, then combine action and proprioceptive history to train a history-conditioned probabilistic world model that predicts the next state.
  • The world model uses a VAE-style output of the future latent variable distribution; its predictive variance/standard deviation can be directly used as a score for “how uncertain I am about the current behavior.” Put simply: the model learns only “normal behavior,” and when it encounters behavior unlike the training distribution, uncertainty increases.
  • The paper designs two failure scores: (1) world model uncertainty; (2) world model prediction error (the difference between the prediction and the true next state in latent space).
  • It uses conformal prediction to automatically set thresholds on a calibration set containing only normal trajectories, thereby converting continuous scores into runtime “normal/anomalous” judgments while controlling the false positive rate.
  • It also introduces a new real-world dataset, Bimanual Cable Manipulation, which includes multi-camera views, action/proprioceptive signals, and annotated failure trajectories, for evaluating failures such as dropped cables in real bimanual cable manipulation.
  • On Bimanual Cable Manipulation, the authors’ method WM uncertainty achieves the best overall classification accuracy: 92.0±6.4%, including 87.9±17.0% on normal trajectories and 95.1±5.5% on failure trajectories; the corresponding threshold is 85% conformal prediction threshold.
  • On the same dataset, WM prediction error reaches 87.9±6.4% overall accuracy; it outperforms multiple baselines, but is lower than 92.0±6.4% for WM uncertainty.
  • Compared with learning-based baselines, the best author method is about 2.7 percentage points higher in overall accuracy than the second-best learning method logpZO, which has 89.3±6.8%; the abstract also claims its failure detection rate is 3.8% higher.
  • Compared with other baselines: AE reconstruction error 61.0±4.2%, AE sim 66.4±6.1%, SPARC 42.6±6.8%, PCA K-means 48.6±12.6%, Random 38.7±6.4%; this shows the method clearly outperforms statistical techniques and several common anomaly detection methods.
  • In terms of parameter efficiency, the authors’ world model has only 569.7k trainable parameters, while the second-best learning-based method has about 10M; that is, it achieves better results with about 1/20 the parameter count.
  • In the Push-T simulation, the method demonstrates the ability to distinguish visual anomalies (changing color) and dynamics anomalies (changing friction) using 1028 normal training trajectories, 128 validation trajectories, 512 normal test trajectories, and 2048 failure trajectories (4 failure modes, 512 each), but the excerpt does not provide the full quantitative metrics for that experiment.
Built with Recoleta

Run your own research radar

Turn arXiv, Hacker News, OpenReview, Hugging Face Daily Papers, and RSS into local Markdown, Obsidian notes, Telegram digests, and a public site.