ELAN4D: Embodiment-Centric 4D Supervision for Vision-Language-Action Models via Plug-and-Play Adaptation

Published as an arXiv preprint, 2026

ELAN4D is a plug-and-play training framework for Vision-Language-Action models that adds embodiment-centric 4D supervision through future robot keypoint tracks. Using forward kinematics from proprioceptive states, it derives 3D displacement tracks for robot joints and the end-effector, supervises a lightweight auxiliary track decoder during training, and discards the decoder at inference so the base policy interface remains unchanged.
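As a rough illustration of how such supervision targets could be built, the sketch below derives displacement tracks from a short sequence of joint states. It is not the paper's implementation: the toy planar arm and the names `planar_fk` and `displacement_tracks` are assumptions made for this example, and a real robot would use its own kinematic model. The auxiliary track decoder and its training loss are not shown.

```python
import math

def planar_fk(joint_angles, link_lengths):
    """Toy forward kinematics for a planar serial arm (hypothetical stand-in
    for a real robot model). Returns one 3D keypoint per link tip, with z = 0;
    the last keypoint is the end-effector."""
    x, y, theta = 0.0, 0.0, 0.0
    keypoints = []
    for angle, length in zip(joint_angles, link_lengths):
        theta += angle                      # accumulate joint rotations
        x += length * math.cos(theta)
        y += length * math.sin(theta)
        keypoints.append((x, y, 0.0))
    return keypoints

def displacement_tracks(states, link_lengths):
    """states[0] is the current joint state, states[1:] are future states.
    Returns tracks[t][k]: 3D displacement of keypoint k at future step t + 1
    relative to its current position."""
    current = planar_fk(states[0], link_lengths)
    tracks = []
    for future_state in states[1:]:
        future = planar_fk(future_state, link_lengths)
        tracks.append([
            tuple(f - c for f, c in zip(future_kp, current_kp))
            for future_kp, current_kp in zip(future, current)
        ])
    return tracks

# Example: a two-link arm rotating its first joint by 90 degrees in two steps.
example_states = [[0.0, 0.0], [math.pi / 4, 0.0], [math.pi / 2, 0.0]]
example_tracks = displacement_tracks(example_states, link_lengths=[1.0, 1.0])
```

Because the targets depend only on proprioceptive states and the kinematic model, they can be computed offline from logged trajectories and need no extra annotation.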

The paper evaluates ELAN4D on LIBERO, LIBERO-Plus, RoboTwin2.0, and real-world manipulation tasks, reporting consistent improvements over strong VLA baselines and better generalization under camera, background, and layout shifts.