Clink! Chop! Thud! — Learning Object Sounds from Real-World Interactions

1Georgia Institute of Technology, 2Carnegie Mellon University
Teaser image showing object sound learning

Humans handle a wide variety of objects throughout the day, and many of these interactions produce sounds. We introduce a multimodal object-aware framework that learns the relationship between the objects in an interaction and the resulting sounds. This enables our model to detect the sounding objects from a set of candidates in a scene.

Sounding Object Detection

While an environment may contain many objects, only a few are directly involved in producing sound during an interaction. Our model detects the sounding object given a video of an object interaction.

Sounding Object Detection Benchmark

We introduce a manually labelled benchmark to evaluate sounding object detection. Ground truth objects are highlighted with red and green segmentation masks.

Ego4D

Epic Kitchens

Abstract

Can a model distinguish between the sound of a spoon hitting a hardwood floor versus a carpeted one? Everyday object interactions produce sounds unique to the objects involved. We introduce the sounding object detection task to evaluate a model's ability to link these sounds to the objects directly involved. Inspired by human perception, our multimodal object-aware framework learns from in-the-wild egocentric videos. To encourage an object-centric approach, we first develop an automatic pipeline that computes segmentation masks of the objects involved, guiding the model's focus during training towards the most informative regions of the interaction. A slot attention visual encoder further enforces an object prior. We demonstrate state-of-the-art performance on our new task as well as on existing multimodal action understanding tasks.
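To give a rough feel for the object-centric encoder, here is a minimal slot-attention-style grouping step in NumPy. This is an illustrative sketch, not the paper's implementation: the slot count, iteration count, and the simplified mean-pooling update (no learned projections or GRU) are assumptions made for clarity.

```python
import numpy as np

def slot_attention(features, num_slots=4, iters=3, seed=0):
    """Simplified slot attention: slots compete for input features
    via a softmax over slots, then each slot is updated to the
    weighted mean of the features it attends to."""
    rng = np.random.default_rng(seed)
    n, d = features.shape
    # Randomly initialized slots, one per candidate object.
    slots = rng.normal(scale=1.0 / np.sqrt(d), size=(num_slots, d))
    for _ in range(iters):
        # Similarity between every feature and every slot.
        logits = features @ slots.T / np.sqrt(d)          # (n, num_slots)
        # Softmax over the slot axis: features are divided among slots.
        attn = np.exp(logits - logits.max(axis=1, keepdims=True))
        attn /= attn.sum(axis=1, keepdims=True)
        # Normalize per slot, then pool features into each slot.
        weights = attn / attn.sum(axis=0, keepdims=True)  # (n, num_slots)
        slots = weights.T @ features                      # (num_slots, d)
    return slots
```

Each returned slot can be read as one candidate object's pooled representation; the actual model learns the attention projections and slot updates end to end.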

Sounding Action Discovery

Our object-aware framework also achieves state-of-the-art performance on sounding action discovery, first introduced by Chen et al.

Table of results for sounding action discovery benchmark

BibTeX

@inproceedings{yang2025clink,
    title = {Clink! Chop! Thud! --- Learning Object Sounds from Real-World Interactions},
    author = {Mengyu Yang and Yiming Chen and Haozheng Pei and Siddhant Agarwal and Arun Balajee Vasudevan and James Hays},
    year = {2025},
    booktitle = {ICCV},
}