Recoleta Item Note

Hierarchical Latent Action Model

latent-action-model, hierarchical-policy, skill-discovery, actionless-video, robot-learning

HiLAM aims to learn longer-horizon latent skills from action-unlabeled videos, rather than only recovering low-level actions between adjacent frames. It takes short-term action sequences extracted by an existing latent action model and further compresses them into variable-length high-level skills by chunking, for hierarchical robot policy pretraining.

  • Existing Latent Action Models (LAMs) mostly model short-horizon frame transitions. They can capture low-level motion, but often overlook the more important long-horizon skill structure in videos.
  • This matters because training robots and world models requires large amounts of data, while data with action labels is expensive and scarce. There is plenty of unlabeled video, but if only short-term motion can be extracted, the higher-level behavioral information in it is wasted.
  • Methods that define skills using fixed windows, fixed skill sets, or language alone struggle to handle the fact that real skill durations are variable, execution speeds differ, and behaviors are diverse.
  • The core idea is simple: first use a pretrained inverse dynamics model (IDM) to turn actionless videos into a sequence of low-level latent actions, then use a hierarchical sequence model to automatically split that sequence into segments, with each segment corresponding to a high-level latent skill.
  • HiLAM uses H-Net's dynamic chunking: if adjacent token features differ greatly, a new segment is started at that position. This enables automatic discovery of skill boundaries, without manual annotation and without requiring fixed skill lengths.
  • Training does three things: predict the next latent action (latent next-token prediction), use a pretrained forward dynamics model (FDM) to reconstruct future frames from the predicted actions in order to preserve “actionness,” and add chunk ratio regularization to avoid degenerate segmentation.
  • After learning skills, the authors train a hierarchical policy: the high-level policy predicts latent skills from the current observation and language, and the low-level policy then predicts low-level actions from the observation and that skill; finally, only the low-level policy is fine-tuned to output real robot actions.
  • This design reuses an existing LAM as the low-level extractor, making it computationally better suited for handling long-horizon trajectories.
  • In data-efficiency experiments on LIBERO-Long, when only 10% of the expert demonstrations are used for fine-tuning, BAKU reaches a 23% success rate while HiLAM reaches 45%, nearly double.
  • On LIBERO-Long with 50% of the demonstrations, HiLAM reaches 84%, comparable to BAKU trained on 100% of the data; with 100% of the demonstrations, HiLAM reaches 94%, which the summary reports as clearly higher than BAKU.
  • The paper claims it consistently outperforms the strong baseline BAKU across all four suites: LIBERO-Spatial / Object / Goal / Long, but the excerpt does not provide the full numeric table for each suite.
  • The LIBERO-Long ablation in Table 1 shows that the best setting is human video pretraining + stage-2 latent skill (z^2) + stage-0 latent action (z^0), with a success rate of 0.94. For comparison, BAKU + human pretraining + z^0 latent action reaches 0.91, and HiLAM without large-scale pretraining reaches 0.67.
  • Under robot video pretraining, HiLAM is also effective: z^1 skill + z^0 action = 0.90, z^2 skill + z^0 action = 0.90; this indicates the method does not depend on a single data source.
  • Non-hierarchical BAKU with latent conditioning also improves (e.g. 0.87 / 0.91), but still trails the best HiLAM at 0.94, supporting the authors’ claim that high-level skills + hierarchical policy is more effective.
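
The chunking step described above (start a new segment wherever adjacent latent-action features differ strongly) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a boundary score of 0.5 * (1 - cosine similarity) between raw adjacent features and a fixed threshold of 0.5, whereas the real model would score boundaries on learned representations. The names `boundary_probs`, `chunk`, `pool`, and `ratio_penalty` are hypothetical, and the squared-error penalty is only a stand-in for the chunk ratio regularization, whose exact form this note does not give.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def boundary_probs(feats):
    """Score each position: high when a token differs from its predecessor.

    Assumption: score = 0.5 * (1 - cos), so identical directions give 0 and
    opposite directions give 1. The first token always opens a segment.
    """
    probs = [1.0]
    for prev, cur in zip(feats, feats[1:]):
        probs.append(0.5 * (1.0 - cosine(prev, cur)))
    return probs


def chunk(feats, threshold=0.5):
    """Split token indices into variable-length segments at high-score positions."""
    segments = []
    for i, p in enumerate(boundary_probs(feats)):
        if p >= threshold:
            segments.append([i])      # start a new segment here
        else:
            segments[-1].append(i)    # extend the current segment
    return segments


def pool(feats, segments):
    """Mean-pool the features of each segment into one vector per segment."""
    dim = len(feats[0])
    return [
        [sum(feats[i][d] for i in seg) / len(seg) for d in range(dim)]
        for seg in segments
    ]


def ratio_penalty(num_segments, num_tokens, target_ratio):
    """Hypothetical regularizer: penalize drifting from a target segments/tokens ratio.

    It discourages the two degenerate cases: one segment per token, or a
    single segment covering the whole sequence.
    """
    return (num_segments / num_tokens - target_ratio) ** 2


# Toy sequence of six 2-D "latent actions": three similar, two similar, one different.
feats = [
    [1.0, 0.0],
    [1.0, 0.1],
    [0.9, 0.0],
    [-0.2, 1.0],
    [-0.1, 1.0],
    [-1.0, -0.3],
]
segments = chunk(feats)
skills = pool(feats, segments)
penalty = ratio_penalty(len(segments), len(feats), target_ratio=0.5)
print(segments, penalty)
```

On this toy input the sequence splits into three segments of lengths 3, 2, and 1, which shows how segment length follows the data rather than a fixed window. In the method summarized above, each segment would correspond to one high-level latent skill that the high-level policy later predicts.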