Experiments
Video for motion, audio for force.
Pick a task and a run. Each row plays the generated reference video with its audio, our force-aware execution on a Franka arm tracking the audio-derived force profile, and a kinematic-only baseline for comparison. Turn the sound on: the contact audio is the force signal.
The video plays the full generated audio. The plot keeps only the frames where SAM-audio and distance both confirm real contact, and only those frames drive execution.
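As a rough illustration of this gating step, the sketch below keeps a frame only when an audio-based contact flag and a distance test agree. The function name `contact_mask`, the per-frame inputs, and the 1 cm threshold are all assumptions made for the example; the page does not state how either signal is thresholded.

```python
import numpy as np

def contact_mask(audio_contact, distance_m, max_distance_m=0.01):
    """Return a per-frame boolean mask of confirmed contact.

    audio_contact  : per-frame flags, True where the audio cue reports contact
                     (hypothetical input; stands in for the SAM-audio output).
    distance_m     : per-frame distance between tool and surface, in meters
                     (hypothetical input).
    max_distance_m : assumed distance threshold, not taken from the source.
    """
    audio_contact = np.asarray(audio_contact, dtype=bool)
    distance_m = np.asarray(distance_m, dtype=float)
    if audio_contact.shape != distance_m.shape:
        raise ValueError("inputs must have one entry per frame")
    # A frame counts as contact only when both cues agree.
    return audio_contact & (distance_m <= max_distance_m)

mask = contact_mask([True, True, False, True], [0.005, 0.05, 0.001, 0.01])
```

Requiring agreement between the two cues discards frames where generated audio is loud but the tool is still far from the surface, and frames where the tool is close but silent.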
Abstract
Audio is the force signal video can't see.
Recent advances in video generation enable learning robot manipulation trajectories from generated videos. However, these approaches produce purely kinematic trajectories that lack force information, leading to failure in contact-rich tasks where appropriate contact forces are essential for success. Generated audio carries a complementary and underexplored signal: contact sounds encode force dynamics that video alone cannot capture.
We present a pipeline that jointly leverages generated video and audio to recover both motion trajectories and contact force profiles from a single task description. We execute these force-aware trajectories on a Franka Panda robot using a closed-loop force regulator that tracks the audio-derived force profile during contact. Real-robot experiments on whiteboard wiping, carrot peeling, and lamp button pressing demonstrate that our force-aware pipeline achieves successful contact-rich manipulation from generated video in cases where a kinematic-only baseline fails.
Pipeline
Audio as the force signal.
Our vision pipeline segments objects (SAM 2), estimates depth (Depth Pro), and tracks 3D points (SpaTracker) to detect contacts and infer force directions. The audio pipeline extracts loudness as a force-magnitude proxy. A 1 kHz impedance controller closes the loop on the audio-derived force profile.
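To make the loudness-as-force idea concrete, here is a minimal sketch that computes one RMS loudness value per video frame and rescales it linearly into a target force range. The RMS measure, the linear mapping, the function names, and the 1 to 10 N range are assumptions chosen for illustration; the text above says only that loudness serves as a force-magnitude proxy.

```python
import numpy as np

def frame_rms(audio, sample_rate, fps):
    """RMS loudness of a mono signal, one value per video frame."""
    audio = np.asarray(audio, dtype=float)
    hop = int(round(sample_rate / fps))      # audio samples per video frame
    n_frames = len(audio) // hop
    # Drop the trailing partial frame, then split into equal chunks.
    frames = audio[: n_frames * hop].reshape(n_frames, hop)
    return np.sqrt(np.mean(frames ** 2, axis=1))

def loudness_to_force(rms, f_min=1.0, f_max=10.0):
    """Linearly map loudness to a force target in newtons.

    f_min and f_max are assumed bounds, not values from the source.
    """
    rms = np.asarray(rms, dtype=float)
    if rms.size == 0:
        return rms
    span = rms.max() - rms.min()
    if span == 0:
        # Constant loudness: fall back to the lowest force target.
        return np.full_like(rms, f_min)
    normalized = (rms - rms.min()) / span
    return f_min + normalized * (f_max - f_min)

# Synthetic example: silence, a medium tone level, then a loud one.
audio = np.concatenate([np.zeros(100), 0.5 * np.ones(100), np.ones(100)])
rms = frame_rms(audio, sample_rate=3000, fps=30)
force = loudness_to_force(rms)
```

In practice the resulting profile would be evaluated only on frames with confirmed contact and then passed to the force controller as its reference.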