Interactive Audio is a project designed to advance the capabilities of robotic systems in identifying and classifying objects based on their material composition using both visual and audio cues. We utilized Grounded SAM for object detection and masking, while incorporating custom-designed tools to capture accurate audio signatures of objects. The ultimate goal is for the robot to differentiate objects not just by appearance but also by their unique acoustic properties when tapped.
One of my primary contributions to this project was the design of a specialized “object tapper,” a tool equipped with a directional microphone. This tapper provided a prime point of contact for obtaining clear audio signatures by tapping objects and recording their unique sound responses.
Design Consideration: The tool needed to ensure sufficient “give” to prevent objects from tipping over during taps while still allowing firm enough contact for effective sound capture. This required a balance of impact force and material flexibility. The design also considered the physics of impulse and momentum transfer, ensuring that the tap delivered just enough force without creating excessive vibrations that could skew the audio recording. Using principles from Newton’s Second Law (F = ma), the design optimized force application to ensure controlled taps.
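To make the role of “give” concrete, here is a minimal sketch of the impulse-momentum relation behind it: for the same change in momentum, a longer contact time means a lower average force. The function name and all numbers are hypothetical and chosen for illustration only; they are not measurements from the actual tapper.

```python
def average_tap_force(mass_kg, v_before, v_after, contact_time_s):
    """Average contact force from impulse-momentum: F_avg = m * (v_after - v_before) / dt."""
    if contact_time_s <= 0:
        raise ValueError("contact_time_s must be positive")
    return mass_kg * (v_after - v_before) / contact_time_s


# Hypothetical values: a 20 g tapper tip arriving at 0.5 m/s and rebounding at 0.3 m/s.
# Velocities are signed, so the approach velocity is negative.
stiff = average_tap_force(0.02, -0.5, 0.3, contact_time_s=0.001)      # rigid tip, short contact
compliant = average_tap_force(0.02, -0.5, 0.3, contact_time_s=0.004)  # flexible tab, longer contact

print(f"stiff tip:     {stiff:.1f} N")
print(f"compliant tab: {compliant:.1f} N")
```

Under these assumed numbers the compliant tab delivers the same impulse with a quarter of the average force, which is the trade-off the design had to balance: enough force for a clear sound, but not enough to tip the object.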
Challenges: Earlier versions that incorporated elastics often wore down with repeated use, losing both elasticity and reliability. The final design replaces the elastics with a PLA “plastic tab” that gives more consistent collision feedback, and adds a custom mount designed to integrate comfortably with the Franka Emika Panda, allowing for smoother trajectory execution. A camera mount is also placed so that the camera can identify the object before contact without ever colliding with it. Transitioning to PLA provided a more consistent tap mechanism, greatly improving the longevity and consistency of our recordings.
Figure 2: Third revision of the final CAD design, with custom mounts and a PLA tab (right side) to enable consistent contact and audio signatures, as well as mounting locations for microphones and cameras.
Figure 3: Third revision of the final CAD design, with custom mounts and a PLA tab (right side) to enable consistent contact and audio signatures, as well as mounting locations for microphones and cameras.
The directional microphone mounted on the tapper was used to gather clean and isolated audio samples of various objects. These audio signatures were then used to correlate sound to material composition, forming the foundation for material classification in the system.
Figure 4: The CAD design, as built, in action during robot data collection
Physics Principles: The tapping mechanism was designed around sound wave propagation and the relationship between material density and the speed of sound. Keeping the tapping force uniform made each object’s sound response repeatable from tap to tap, so that differences between recordings reflected the material rather than the tap itself. This helped in classifying materials based on their acoustic signatures.
Generative Ensemble Model: I helped develop a Generative Ensemble model that linked these audio recordings to specific material properties. The model used the distinct frequencies, amplitudes, and decay rates of each object’s sound to differentiate between materials such as wood, plastic, and metal. By integrating this into our system, the robot could classify objects more effectively by considering not only their appearance but also their sound profile.
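As an illustration of the kind of features mentioned above (frequency, amplitude, and decay rate), the following is a minimal NumPy sketch that extracts them from a single tap recording. It is not the project's actual pipeline: the function name `tap_features`, the frame size, and the synthetic test signal are all assumptions made for the example.

```python
import numpy as np


def tap_features(signal, sample_rate, frame=256):
    """Return dominant frequency (Hz), peak amplitude, and exponential decay rate (1/s)."""
    signal = np.asarray(signal, dtype=float)

    # Dominant frequency: largest peak of the magnitude spectrum, ignoring the DC bin.
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / sample_rate)
    dominant_hz = float(freqs[1:][np.argmax(spectrum[1:])])

    peak_amplitude = float(np.max(np.abs(signal)))

    # Decay rate: fit a straight line to log(RMS) per frame; the slope is -decay_rate.
    n_frames = signal.size // frame
    if n_frames < 2:
        raise ValueError("signal too short for the chosen frame size")
    frames = signal[: n_frames * frame].reshape(n_frames, frame)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    times = (np.arange(n_frames) + 0.5) * frame / sample_rate
    keep = rms > 1e-8  # skip silent frames so the logarithm stays finite
    slope, _ = np.polyfit(times[keep], np.log(rms[keep]), 1)

    return {
        "dominant_hz": dominant_hz,
        "peak_amplitude": peak_amplitude,
        "decay_rate": float(-slope),
    }


# Synthetic stand-in for a tap: a 1 kHz tone decaying as exp(-20 t), sampled at 16 kHz.
sample_rate = 16000
t = np.arange(8000) / sample_rate
tap = np.exp(-20.0 * t) * np.sin(2 * np.pi * 1000.0 * t)

features = tap_features(tap, sample_rate)
print(features)
```

A feature vector like this could then feed any classifier. Real tap recordings contain several resonant modes and noise, so a practical version would likely track multiple spectral peaks rather than a single one.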
In addition to audio sampling, the project aimed to predict an object’s sound signature based on its visual appearance using object rendering techniques. This involved feeding visual data into our predictive model to forecast the acoustic properties of an object even before tapping it.
Figure 5: Object renderer in action for a soup spoon
Physics Integration: By applying principles of acoustic impedance and resonance frequency, the system could predict how different materials would respond to a tap based on their visual structure. This required understanding how sound waves behave when encountering materials of different densities and stiffness.
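The two quantities involved can be sketched with their textbook definitions: the longitudinal wave speed in a thin rod, c = sqrt(E / ρ), and the characteristic acoustic impedance, Z = ρ · c. The material values below are rough, order-of-magnitude figures included only as assumptions for illustration; they are not measurements from this project, and real values vary with grade, grain direction, and print settings.

```python
import math

# Approximate density (kg/m^3) and Young's modulus (Pa). Illustrative assumptions only.
MATERIALS = {
    "steel": {"density": 7850.0, "youngs_modulus": 200e9},
    "wood (oak, along grain)": {"density": 700.0, "youngs_modulus": 11e9},
    "PLA plastic": {"density": 1240.0, "youngs_modulus": 3.5e9},
}


def rod_wave_speed(density, youngs_modulus):
    """Longitudinal wave speed in a thin rod: c = sqrt(E / rho), in m/s."""
    return math.sqrt(youngs_modulus / density)


def acoustic_impedance(density, youngs_modulus):
    """Characteristic acoustic impedance: Z = rho * c, in Pa*s/m."""
    return density * rod_wave_speed(density, youngs_modulus)


for name, props in MATERIALS.items():
    c = rod_wave_speed(**props)
    z = acoustic_impedance(**props)
    print(f"{name:26s} c = {c:7.0f} m/s   Z = {z:.2e} Pa*s/m")
```

Stiffer, denser materials such as metals end up with a much higher impedance than plastics or wood, which is one reason they sound so different when tapped.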
Computer Science Techniques: Using machine learning techniques, the model was trained to simulate the physical interaction between the object tapper and the materials, using visual cues like texture, shape, and material thickness to predict the object’s acoustic response. By combining physics-based models with visual data, we aimed to predict sound signatures without needing physical contact in all cases.
Material Durability: Earlier designs, using elastics and rubber bands, lacked longevity and wore out quickly. Switching to flexible PLA ensured both durability and the right amount of flexibility for repeated taps.
Audio Consistency: Tapping with too much or too little force could result in inconsistent sound signatures. By refining the tapper design and calibrating the tapping force, we ensured that audio samples were consistent across different trials, improving the accuracy of material classification.
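One simple way to quantify this kind of consistency is the spread of peak amplitude across repeated taps on the same object. The sketch below uses the coefficient of variation for that purpose. The metric, the function names, and the sample values are assumptions for illustration; the source does not specify how consistency was actually measured.

```python
from statistics import mean, pstdev


def peak_amplitude(samples):
    """Largest absolute sample value in one recording."""
    return max(abs(s) for s in samples)


def tap_consistency(trials):
    """Coefficient of variation of peak amplitude across taps (lower is more consistent)."""
    peaks = [peak_amplitude(trial) for trial in trials]
    avg = mean(peaks)
    if avg == 0:
        raise ValueError("all trials are silent")
    return pstdev(peaks) / avg


# Hypothetical recordings: three taps on the same object, before and after calibration.
erratic = [[0.0, 0.80, -0.40, 0.10], [0.0, 0.35, -0.20, 0.05], [0.0, 1.00, -0.60, 0.20]]
steady = [[0.0, 0.80, -0.40, 0.10], [0.0, 0.82, -0.41, 0.10], [0.0, 0.79, -0.39, 0.09]]

print(f"erratic taps: CV = {tap_consistency(erratic):.3f}")
print(f"steady taps:  CV = {tap_consistency(steady):.3f}")
```

A threshold on a score like this could flag a recording session for re-calibration before the samples are used for classification.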
Sound Prediction: Predicting an object’s sound signature based solely on its visual appearance required sophisticated models and the integration of both physics-based simulations and machine learning. While the model’s accuracy improved over time, balancing computational complexity with real-time performance remained a challenge.
Interactive Audio represents a significant step forward for robot perception, allowing robotic systems to perceive and classify objects based not only on their visual features but also on their acoustic properties. The object tapper tool, together with the audio analysis and material prediction models, gives the robot a more complete basis for identifying objects, opening new possibilities for robotic manipulation and material classification tasks.