ICRA Workshop 2026

Dreaming the Sound of Contact

Leveraging video and audio generation for zero-shot force-aware manipulation.

Guanhua Ji*, Tianyu Li*, Dayoon Suh, Nadia Figueroa (*equal contribution)

arXiv: coming soon · Code: coming soon
Concept overview: three stacked panels showing generated video and audio, a robot contact point on a whiteboard with force-direction and audio-magnitude annotations, and three real-world execution tasks — wiping, peeling, and lamp pressing.

Experiments

Video for motion, audio for force.

Pick a task and a run. Each row plays the generated reference video + audio, our force-aware execution on a Franka arm that tracks the audio-derived force profile, and a kinematic-only baseline for comparison. Turn the sound on — the contact audio is the force signal.

Ours: 6 / 6 · Baseline: 0 / 6
Generated Video + Audio from Seedance 2.0
Audio loudness over time — whiteboard, run 1.

The video plays the full generated audio. The plot keeps only the frames where SAM-audio and distance both confirm real contact, and only those frames drive execution.
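As a rough illustration of that gating step, the sketch below computes a per-frame loudness value from a mono waveform and blanks out frames without confirmed contact. The function names, the use of RMS as the loudness measure, and the boolean contact flags are assumptions made for illustration; this page does not specify how loudness is computed or how the SAM-audio and distance checks are combined.

```python
import math

def frame_rms(samples, sample_rate, fps):
    """Per-video-frame RMS loudness of a mono waveform (assumed loudness measure)."""
    hop = sample_rate // fps  # audio samples per video frame
    n_frames = len(samples) // hop
    out = []
    for i in range(n_frames):
        chunk = samples[i * hop:(i + 1) * hop]
        out.append(math.sqrt(sum(x * x for x in chunk) / len(chunk)))
    return out

def gate_by_contact(loudness, in_contact):
    """Keep loudness only where contact is confirmed; None elsewhere."""
    return [value if ok else None for value, ok in zip(loudness, in_contact)]

# Toy example: 4 video frames, 8 audio samples per frame.
audio = [0.0] * 8 + [0.5, -0.5] * 4 + [1.0, -1.0] * 4 + [0.0] * 8
loud = frame_rms(audio, sample_rate=80, fps=10)
contact = [False, True, True, False]  # hypothetical contact flags
gated = gate_by_contact(loud, contact)
```

Frames outside contact are set to None here rather than zero, so a later stage can tell "no contact" apart from "quiet contact".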

Baseline · kinematic Same trajectory, no force regulation
Ours · force-aware Franka arm tracks the audio-derived force profile
Measured contact force over time — whiteboard, run 1.

Abstract

Audio is the force signal video can't see.

Recent advances in video generation enable learning robot manipulation trajectories from generated videos. However, these approaches produce purely kinematic trajectories that lack force information, leading to failure in contact-rich tasks where appropriate contact forces are essential for success. Generated audio carries a complementary and underexplored signal: contact sounds encode force dynamics that video alone cannot capture.

We present a pipeline that jointly leverages generated video and audio to recover both motion trajectories and contact force profiles from a single task description. We execute these force-aware trajectories on a Franka Panda robot using a closed-loop force regulator that tracks the audio-derived force profile during contact. Real-robot experiments on whiteboard wiping, carrot peeling, and lamp button pressing demonstrate that our force-aware pipeline enables successful contact-rich manipulation from video generation where a kinematic-only baseline fails.

Pipeline

Audio as the force signal.

Our vision pipeline segments objects (SAM 2), estimates depth (Depth Pro), and tracks 3D points (SpaTracker) to detect contacts and infer force directions. The audio pipeline extracts loudness as a force-magnitude proxy. A 1 kHz impedance controller closes the loop on the audio-derived force profile.
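One way to picture how the two branches combine: rescale contact-frame loudness into a force-magnitude range, then scale the force direction from the vision branch by that magnitude. The sketch below assumes a linear mapping and hypothetical bounds `f_min` and `f_max` in newtons; it is not the calibration used in this work.

```python
import math

def loudness_to_force(loudness, f_min=2.0, f_max=10.0):
    """Linearly map loudness values to force magnitudes in newtons (assumed mapping)."""
    lo, hi = min(loudness), max(loudness)
    span = hi - lo
    if span == 0:
        # Constant loudness: fall back to the middle of the force range.
        return [0.5 * (f_min + f_max)] * len(loudness)
    return [f_min + (value - lo) / span * (f_max - f_min) for value in loudness]

def force_vectors(magnitudes, directions):
    """Normalize each direction and scale it by the matching magnitude."""
    out = []
    for mag, direction in zip(magnitudes, directions):
        norm = math.sqrt(sum(c * c for c in direction))
        out.append(tuple(mag * c / norm for c in direction))
    return out

loud = [0.1, 0.3, 0.5]            # contact-frame loudness (toy values)
dirs = [(0.0, 0.0, -1.0)] * 3     # hypothetical: pressing into the surface along -z
mags = loudness_to_force(loud)
forces = force_vectors(mags, dirs)
```

With these toy inputs the quietest frame maps to `f_min` and the loudest to `f_max`, so the force profile follows the shape of the loudness curve.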

Pipeline diagram: vision branch segments, tracks, and estimates force direction; audio branch extracts loudness; the combined force-aware trajectory runs on a Franka Panda with closed-loop force regulation.
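To make the closed-loop idea concrete, here is a toy one-dimensional simulation of force tracking at a 1 kHz update rate. It is a sketch under strong assumptions, not the controller used on the robot: the surface is modeled as a linear spring, and the commanded penetration depth is adjusted by integral action on the force error. The names `track_force`, `stiffness`, and `gain` are hypothetical.

```python
def track_force(f_ref, stiffness=1000.0, gain=0.2, dt=0.001):
    """Toy 1-D force regulation against a spring-like surface.

    Assumptions: contact force = stiffness * penetration depth (N/m * m),
    and the commanded depth changes at gain * force_error metres per second.
    """
    depth = 0.0  # commanded penetration into the surface (m)
    measured = []
    for target in f_ref:
        force = stiffness * max(depth, 0.0)  # simulated force reading (N)
        measured.append(force)
        depth += gain * (target - force) * dt  # integral action on force error
    return measured

# Hold a 5 N reference for 0.1 s at 1 kHz (100 steps of 1 ms).
profile = [5.0] * 100
forces = track_force(profile)
final_error = abs(profile[-1] - forces[-1])
```

With these values the force error shrinks by a factor of 0.8 per step, so the simulated force settles on the reference well within the 0.1 s window. A kinematic-only replay corresponds to never updating `depth` from the force error.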