Imagine being at a cocktail party with your friends. Although you are surrounded by other people having their own conversations and a band is playing in the background, you are still able to engage with your friends and understand what they are saying. The reason is that humans are very good at focusing their attention on a specific sound source while ignoring the others. Unfortunately, this is not true for everyone. People with hearing loss find a cocktail party very challenging for socializing: loud competing speakers, the clinking of glasses, background music, and even reverberation are all sources of disturbance for a hearing-impaired listener.
In the scenario described above, it is important to design algorithms that can separate the sounds of an acoustically complex environment and deliver the speech of interest to the listener. We refer to the task of extracting the target speech signal from a mixture of sounds as speech enhancement. If we are interested in extracting multiple, possibly overlapping, speech signals from a mixture, we call the task speech separation.
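To make the enhancement task concrete, here is a minimal sketch (not taken from our session material) of the common time-frequency masking view of speech enhancement. It uses an oracle ideal ratio mask computed from the known clean signal; a real system would have to estimate such a mask from the noisy input alone with a learned model. The signals, sample rate, and STFT settings are illustrative assumptions.

```python
# Illustrative sketch: speech enhancement as time-frequency masking.
# The mask below is an *oracle* ideal ratio mask (it uses the clean signal);
# a real enhancement system would estimate the mask from the noisy mixture.
import numpy as np
from scipy.signal import stft, istft, chirp

fs = 16000                                   # sample rate in Hz (assumed)
t = np.arange(0, 2.0, 1 / fs)
speech = chirp(t, f0=100, t1=2.0, f1=3000)   # stand-in for a target speech signal
noise = 0.5 * np.random.randn(len(t))        # stand-in for background noise
mixture = speech + noise                     # what the microphone records

# Short-time Fourier transforms of the clean speech, the noise, and the mixture
_, _, S = stft(speech, fs=fs, nperseg=512)
_, _, N = stft(noise, fs=fs, nperseg=512)
_, _, X = stft(mixture, fs=fs, nperseg=512)

# Ideal ratio mask: how much of each time-frequency bin belongs to the target
mask = np.abs(S) / (np.abs(S) + np.abs(N) + 1e-8)

# Apply the mask to the mixture and return to the time domain
_, enhanced = istft(mask * X, fs=fs, nperseg=512)
```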
Traditionally, these tasks have been addressed with techniques applied only to acoustic signals. However, this approach fails to deliver intelligible and high-quality speech when the acoustic background noise level is very high and the frequency content of the noise is similar to that of the speech of interest. One way to overcome this issue is to exploit visual cues, such as facial expressions and lip movements, since these cues are not affected by acoustic background noise. Audio-visual speech enhancement and separation systems are able to outperform their audio-only counterparts and can produce a good estimate of a speech signal even in challenging situations. In particular, deep learning has contributed significantly to advancing the field by providing learned representations of the acoustic and visual modalities at several levels of abstraction and by flexibly combining them.
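As a rough illustration of this idea, and not of any specific published architecture, the following PyTorch sketch fuses an audio stream and a visual stream: one encoder processes frames of the noisy spectrogram, another processes per-frame lip features (assumed to come from some visual front end), the two embeddings are concatenated, and a recurrent layer predicts a time-frequency mask for the target speaker. All layer sizes and names are illustrative assumptions.

```python
# Hypothetical audio-visual fusion sketch (not a published model):
# separate encoders for audio and visual features, late fusion, mask prediction.
import torch
import torch.nn as nn

class AudioVisualEnhancer(nn.Module):
    def __init__(self, n_freq=257, visual_dim=512, hidden=256):
        super().__init__()
        # Audio encoder: embeds each frame of the noisy magnitude spectrogram
        self.audio_enc = nn.Sequential(nn.Linear(n_freq, hidden), nn.ReLU())
        # Visual encoder: embeds per-frame lip features (assumed precomputed)
        self.visual_enc = nn.Sequential(nn.Linear(visual_dim, hidden), nn.ReLU())
        # Fusion and temporal modelling over the concatenated streams
        self.fusion = nn.LSTM(2 * hidden, hidden, batch_first=True)
        # Mask decoder: one gain in [0, 1] per time-frequency bin
        self.mask_head = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, noisy_spec, lip_feats):
        # noisy_spec: (batch, time, n_freq); lip_feats: (batch, time, visual_dim)
        a = self.audio_enc(noisy_spec)
        v = self.visual_enc(lip_feats)
        fused, _ = self.fusion(torch.cat([a, v], dim=-1))
        mask = self.mask_head(fused)
        return mask * noisy_spec       # enhanced spectrogram estimate

# Example usage with random tensors standing in for real features
model = AudioVisualEnhancer()
noisy = torch.rand(1, 100, 257)        # 100 frames of a noisy magnitude spectrogram
lips = torch.rand(1, 100, 512)         # 100 frames of visual (lip) features
enhanced = model(noisy, lips)
```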
The use of the visual modality comes at a cost. Processing high-dimensional data, like a video feed, is computationally expensive and requires powerful chips. As a consequence, power consumption increases, which can be an issue for a battery-driven device. Furthermore, including a camera in embedded devices can be challenging because of the additional space and cost of this component. Despite these limitations, audio-visual speech enhancement and separation offer several opportunities, not only for hearing assistive devices, but also for video conference systems, surveillance applications, and noise reduction in video editing software.
Editor’s note: Daniel and Zheng-Hua are presenting at ODSC Europe 2021. Be sure to check out their half-day training session, “Audio-Visual Speech Enhancement and Separation Based on Deep Learning,” there!
About the authors/ODSC Europe 2021 speakers on Audio-Visual Speech Enhancement:
Daniel Michelsanti is an Industrial Postdoctoral Researcher at Demant and Aalborg University. He holds a PhD in Electrical and Electronic Engineering from Aalborg University. Currently, he is investigating cutting-edge technologies for next-generation hearing assistive devices, with the goal of improving the quality of life of people with hearing loss.
Zheng-Hua Tan is a Professor in the Department of Electronic Systems and a Co-Head of the Centre for Acoustic Signal Processing Research (CASPR) at Aalborg University. His research interests include machine learning, deep learning, pattern recognition, speech and speaker recognition, noise-robust speech processing, multimodal signal processing, and social robotics. He has authored/co-authored over 200 refereed publications.