Act-Observe-Rewrite: Multimodal Coding Agents as In-Context Policy Learners for Robot Manipulation
Summary
AOR proposes a way for robots to learn without training neural policies and without relying on demonstrations or reward engineering: after each failure, a multimodal LLM directly rewrites executable Python controller code. The core contribution is to make the entire low-level controller implementation, rather than a skill selector or a set of parameters, the object of in-context learning, so the model can diagnose and fix the causes of failure from visual evidence.
Problem
- When robot foundation models or VLAs fail in a specific deployment setting, it is usually difficult to identify the cause and adapt quickly without retraining.
- Existing LLM-based robotics methods mostly stop at high-level planning, skill selection, or one-shot code generation, which makes it hard to fix low-level manipulation errors in geometry, perception, contact, and control.
- This matters because real manipulation tasks are often affected by details such as camera coordinate systems, grasp geometry, and control smoothness, and addressing these issues through large-scale data or RL retraining is costly and slow to debug.
Approach
- AOR uses a dual-timescale closed loop: within an episode, a Python controller executes in real time; between episodes, a multimodal LLM inspects keyframe images and structured results, analyzes the failure, and generates a new controller class.
- The policy representation is not parameters or a skill library, but fully executable Python code, so the LLM can change not only what to do but also how to do it, including phase structure, geometric computation, state-machine logic, and control details.
- The context provided to the LLM includes the current controller source code, structured episode results (reward, step count, phase logs, minimum distance, and an oscillation flag), and keyframe images. The prompt asks it to first describe the failure mode, locate the root cause (vision, logic, or parameters), and name the most important modification, and only then output code.
- To prevent code generation from going out of control, the system includes mechanisms such as a compilation sandbox, action clamp, exception-safe stop, and fallback to the previous working controller on failure.
- In the robosuite examples, AOR autonomously discovered and fixed several key issues: a back-projection sign error caused by OpenGL camera-coordinate conventions, the need to keep the end effector stationary during grasping, and the need to smooth actions with an exponential moving average (EMA).
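The loop and safeguards described above can be sketched roughly as follows. This is a minimal illustration and not the paper's implementation: the names `load_controller`, `run_episode`, `aor_loop`, the `Controller` class convention, the `propose` callback (a stand-in for the multimodal LLM call), and the action limit of 1.0 are all assumptions. A bare `exec` into a fresh namespace only stands in for the compilation sandbox and provides no real isolation.

```python
from typing import Callable, List, Optional, Sequence

ACTION_LIMIT = 1.0  # assumed symmetric action bound, for illustration only


def clamp_action(action: Sequence[float], limit: float = ACTION_LIMIT) -> List[float]:
    """Action clamp: force every component into [-limit, limit]."""
    return [max(-limit, min(limit, float(a))) for a in action]


class EmaSmoother:
    """Exponential moving average over successive actions."""

    def __init__(self, alpha: float = 0.3) -> None:
        self.alpha = alpha
        self.state: Optional[List[float]] = None

    def __call__(self, action: Sequence[float]) -> List[float]:
        if self.state is None:
            self.state = [float(a) for a in action]
        else:
            self.state = [self.alpha * a + (1.0 - self.alpha) * s
                          for a, s in zip(action, self.state)]
        return list(self.state)


def load_controller(source: str):
    """Compile candidate source in a fresh namespace (stand-in for a sandbox).

    Returns an instance of the class named `Controller`, or None on any error.
    """
    namespace: dict = {}
    try:
        exec(compile(source, "<controller>", "exec"), namespace)
        return namespace["Controller"]()
    except Exception:
        return None


def run_episode(controller, observations: Sequence, action_dim: int = 2) -> List[List[float]]:
    """Fast timescale: step the controller; on an exception, emit zeros and stop."""
    executed = []
    for obs in observations:
        try:
            action = controller.act(obs)
        except Exception:
            executed.append([0.0] * action_dim)  # exception-safe stop
            break
        executed.append(clamp_action(action))
    return executed


def aor_loop(source: str, propose: Callable[[str, list], str],
             observations: Sequence, rounds: int) -> str:
    """Slow timescale: act, observe, then ask `propose` for rewritten source."""
    for _ in range(rounds):
        actions = run_episode(load_controller(source), observations)
        candidate = propose(source, actions)  # hypothetical LLM rewrite step
        if load_controller(candidate) is not None:
            source = candidate  # accept only code that loads
        # otherwise fall back to the previous working controller
    return source
```

In a real system `propose` would also receive keyframe images and the structured episode results, and acceptance would likely depend on more than whether the candidate compiles.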
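To illustrate the kind of sign error mentioned in the last bullet, here is a hedged sketch of pinhole back-projection under two common camera-frame conventions. The summary does not say which term the agent corrected, so the function names, the intrinsics `fx, fy, cx, cy`, and the treatment of `depth` as a positive distance along the optical axis are assumptions made for illustration. In the OpenCV-style frame the camera looks along +z with +y pointing down in the image; in the OpenGL-style frame it looks along -z with +y pointing up, so the y and z components flip sign.

```python
from typing import Tuple

Point = Tuple[float, float, float]


def backproject_opencv(u: float, v: float, depth: float,
                       fx: float, fy: float, cx: float, cy: float) -> Point:
    """Pixel (u, v) plus depth to a camera-frame point, +z forward, +y down."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def backproject_opengl(u: float, v: float, depth: float,
                       fx: float, fy: float, cx: float, cy: float) -> Point:
    """Same pixel and depth, but in a frame with -z forward and +y up."""
    x, y, z = backproject_opencv(u, v, depth, fx, fy, cx, cy)
    return (x, -y, -z)


# A pixel to the lower right of the principal point, 2 units from the camera.
p_cv = backproject_opencv(150.0, 150.0, 2.0, 100.0, 100.0, 50.0, 50.0)
p_gl = backproject_opengl(150.0, 150.0, 2.0, 100.0, 100.0, 50.0, 50.0)
```

Feeding a point computed under one convention into a camera-to-world transform that expects the other places the target on the wrong side of the camera, which is the sort of failure that is visible in keyframe images.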
Results
- The paper reports validating AOR on three robosuite manipulation tasks, with 100% success on two tasks and 91% success on the third.
- The abstract explicitly emphasizes that these results were achieved with no demonstrations, no reward engineering, and no gradient updates.
- The authors say the remaining failures mainly occur in the Stack task: the LLM has identified that contact between the gripper and the target block is the cause, but has not yet found a placement strategy that avoids this contact, so performance remains at 91% rather than 100%.
- The paper cites several quantitative results from related work as background; these are not AOR’s own experiments. For example, Reflexion achieved +22% on ALFWorld and +11% on HumanEval; ReAct showed a +34% absolute improvement over RL/imitation baselines on ALFWorld; OpenVLA achieved +16.5% relative to RT-2-X (55B) after LoRA fine-tuning; Diffusion Policy outperformed prior methods by 46.9% across 12 tasks. These numbers position AOR relative to prior work and are not direct experimental comparisons.
- The provided excerpt does not include more detailed AOR experimental table information, such as the exact number of samples per task, number of trials, variance, or step-by-step baseline comparisons.