Introduction
Imagine a world where machines understand what you want and how you are feeling when you call customer care. If you are unhappy about something, you get to speak to a person quickly. If you are looking for specific information, you may not need to talk to a person at all (unless you want to!).
This is going to be the new order of the world, and you can already see it happening to a good degree. Check out the highlights of 2017 in the data science industry: you can see the breakthroughs that deep learning has brought to fields that were previously difficult to tackle. One such field where deep learning has the potential to help is audio/speech processing, especially given the unstructured nature of audio data and its vast impact.
So for the curious ones out there, I have compiled a list of tasks that are worth getting your hands dirty with when starting out in audio processing. I’m sure there will be a few more breakthroughs using deep learning in time to come.
The article is structured to explain each task and its importance. For each task, there is also a research paper that goes into the details, along with a case study that will help you get started in solving it.
So let’s get cracking!
1. Audio Classification
Audio classification is a fundamental problem in the field of audio processing. The task is essentially to extract features from the audio, and then identify which class the audio belongs to. Many useful applications pertaining to audio classification can be found in the wild – such as genre classification, instrument recognition and artist identification.
This task is also the most explored topic in audio processing. Plenty of papers were published in this field in the last year. In fact, we have also hosted a practice hackathon for community collaboration for solving this particular task.
Whitepaper – http://ieeexplore.ieee.org/document/5664796/?reload=true
A common approach to solving an audio classification task is to pre-process the audio inputs to extract useful features, and then apply a classification algorithm to them. For example, in the case study below we are given a 5-second excerpt of a sound, and the task is to identify which class it belongs to, such as a dog barking or a drilling sound. As mentioned in the article, one way to deal with this is to extract an audio feature called MFCC (Mel-frequency cepstral coefficients) and then pass it through a neural network to get the appropriate class.
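The "extract features, then classify" pattern can be sketched with nothing but the Python standard library. This is only a toy illustration: it swaps in two much simpler features (RMS energy and zero-crossing rate) for MFCCs and a nearest-centroid rule for the neural network, and the function names and the synthetic "tone" and "noise" clips are invented for the example.

```python
import math
import random

def extract_features(samples):
    """Return (RMS energy, zero-crossing rate) for a list of samples."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return (rms, crossings / (len(samples) - 1))

def train_centroids(labelled_clips):
    """Average the feature vectors of each class (nearest-centroid 'training')."""
    grouped = {}
    for label, samples in labelled_clips:
        grouped.setdefault(label, []).append(extract_features(samples))
    return {
        label: tuple(sum(col) / len(feats) for col in zip(*feats))
        for label, feats in grouped.items()
    }

def classify(samples, centroids):
    """Pick the class whose centroid is closest in feature space."""
    feats = extract_features(samples)
    return min(centroids, key=lambda label: math.dist(feats, centroids[label]))

# Synthetic stand-ins for real recordings (assumption: 8 kHz mono clips).
def make_tone(freq_hz, n=2000, rate=8000):
    return [math.sin(2 * math.pi * freq_hz * i / rate) for i in range(n)]

def make_noise(n=2000, seed=0):
    rng = random.Random(seed)
    return [rng.uniform(-1, 1) for _ in range(n)]

training = [
    ("tone", make_tone(200)),
    ("tone", make_tone(300)),
    ("noise", make_noise(seed=1)),
    ("noise", make_noise(seed=2)),
]
centroids = train_centroids(training)

print(classify(make_tone(250), centroids))
print(classify(make_noise(seed=3), centroids))
```

In a real project the feature extractor would typically be replaced by MFCCs from an audio library and the centroid rule by a trained model, but the overall shape of the pipeline stays the same.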
Case Study – https://www.geeksforgeeks.org/blog/2017/08/audio-voice-processing-deep-learning/
2. Audio Fingerprinting
The aim of audio fingerprinting is to compute a compact digital “summary” of the audio, so that a recording can be identified from a short audio sample. Shazam is an excellent example of an application of audio fingerprinting. It recognises the music on the basis of just two to five seconds of a song. However, there are still situations where the system fails, especially where there is a high amount of background noise.
Whitepaper – http://www.cs.toronto.edu/~dross/ChandrasekharSharifiRoss_ISMIR2011.pdf
To solve this problem, an approach could be to represent the audio in a different manner, so that it is easily deciphered. Then, we can find out the patterns that differentiate the audio from the background noise. In the case study below, the author converts raw audio to spectrograms and then uses peak finding and fingerprint hashing algorithms to define the fingerprints of that audio file.
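To make the spectrogram, peak finding and hashing steps concrete, here is a heavily simplified sketch in plain Python. It is not the case study's implementation: it keeps only one peak per frame, uses a naive DFT on tiny non-overlapping frames, and assumes the query excerpt is aligned to frame boundaries. All names and the synthetic "songs" are made up for illustration.

```python
import cmath
import hashlib
import math
import random

FRAME = 64  # samples per frame (kept tiny because the DFT below is naive)

def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2, computed naively in O(N^2)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]

def spectral_peaks(samples):
    """Strongest frequency bin of each non-overlapping frame."""
    peaks = []
    for start in range(0, len(samples) - FRAME + 1, FRAME):
        mags = magnitude_spectrum(samples[start:start + FRAME])
        peaks.append(max(range(len(mags)), key=mags.__getitem__))
    return peaks

def fingerprint(samples, fan_out=3):
    """Hash each peak together with the next few peaks and their time offset."""
    peaks = spectral_peaks(samples)
    hashes = set()
    for i, anchor in enumerate(peaks):
        for delta in range(1, fan_out + 1):
            if i + delta < len(peaks):
                key = f"{anchor}|{peaks[i + delta]}|{delta}".encode()
                hashes.add(hashlib.sha1(key).hexdigest()[:10])
    return hashes

def match_score(query, reference):
    """Fraction of the query's hashes that also occur in the reference."""
    return len(query & reference) / len(query) if query else 0.0

def tone_sequence(bins):
    """Synthetic 'song': one pure tone per frame, given as DFT bin numbers."""
    return [math.sin(2 * math.pi * k * t / FRAME)
            for k in bins for t in range(FRAME)]

song_a = tone_sequence([4, 9, 6, 12, 4, 15, 9, 3])
song_b = tone_sequence([5, 5, 10, 7, 13, 2, 8, 11])

# A noisy excerpt taken from the middle of song A.
rng = random.Random(0)
noisy_excerpt = [s + rng.uniform(-0.2, 0.2) for s in song_a[2 * FRAME:]]

query = fingerprint(noisy_excerpt)
print(match_score(query, fingerprint(song_a)))
print(match_score(query, fingerprint(song_b)))
```

Because the hashes depend only on which frequency bins dominate and how far apart they are in time, moderate background noise leaves them unchanged. Production systems use many peaks per frame and overlapping windows to cope with louder noise and arbitrary alignment.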
Case Study – http://willdrevo.com/fingerprinting-and-audio-recognition-with-python/
3. Automatic Music Tagging
Music tagging is a more complex version of audio classification. Here, each audio clip may belong to multiple classes at once, i.e. it is a multi-label classification problem. A potential application of this task is to create metadata for the audio so that it can be searched later on. Deep learning has helped solve this task to a certain extent, as can be seen in the case study below.
Whitepaper – https://link.springer.com/article/10.1007/s10462-012-9362-y
As with most of the tasks, the first step is to extract features from the audio sample. Then, tag it according to the nuances of the audio (for example, if the audio contains more instrumental sound than the singer’s voice, the tag could be “instrumental”). This can be done with either classical machine learning or deep learning methods. The case study mentioned below uses deep learning to solve the problem, specifically a convolutional recurrent neural network along with mel-frequency feature extraction.
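What makes tagging "multi-label" is mostly visible at the output of the model. The sketch below assumes some tagging model has already produced one raw score (logit) per tag; the tag list and the scores are invented. Each tag gets its own sigmoid and threshold, in contrast to single-label classification, where only the top class is kept.

```python
import math

TAGS = ["rock", "instrumental", "guitar", "female vocals"]  # hypothetical tag set

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_tags(logits, threshold=0.5):
    """Multi-label: keep every tag whose independent probability clears the threshold."""
    return [tag for tag, z in zip(TAGS, logits) if sigmoid(z) >= threshold]

def predict_class(logits):
    """Single-label, for contrast: keep only the highest-scoring class."""
    return TAGS[max(range(len(logits)), key=logits.__getitem__)]

logits = [2.1, 0.4, 1.3, -3.0]  # made-up outputs of a tagging model

print(predict_tags(logits))
print(predict_class(logits))
```

The threshold of 0.5 is an arbitrary choice here; in practice it is often tuned per tag.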
Case Study – https://github.com/keunwoochoi/music-auto_tagging-keras
4. Audio Segmentation
Segmentation literally means dividing a particular object into parts (or segments) based on a defined set of characteristics. Segmentation, especially for audio data analysis, is an important pre-processing step. This is because we can segment a noisy and lengthy audio signal into short homogeneous segments (handy short sequences of audio) which are used for further processing. An application of the task is heart sound segmentation, i.e. to identify sounds specific to the heart.
Whitepaper – http://www.mecs-press.org/ijitcs/ijitcs-v6-n11/IJITCS-V6-N11-1.pdf
We can convert this into a supervised learning problem, where each time step is classified according to the segment it belongs to. Then we can apply an audio classification approach to solve the problem. In the case study below, the task is to segment the heart sound into two segments (lub and dub), so that we can identify an anomaly in each segment. It can be solved by extracting audio features and then applying deep learning for classification.
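The idea of classifying each time step and then reading segments off the labels can be sketched as follows. As a loud simplification, a fixed energy threshold stands in for the trained classifier, and the labels are just "sound" and "silence" rather than lub and dub; the function names and the synthetic signal are assumptions made for the example.

```python
import math

def frame_energies(samples, frame_size):
    """Mean energy of each non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_size]) / frame_size
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]

def label_frames(energies, threshold):
    """Stand-in for a trained per-frame classifier."""
    return ["sound" if e >= threshold else "silence" for e in energies]

def merge_segments(labels, frame_size, sample_rate):
    """Collapse runs of identical frame labels into (label, start_s, end_s)."""
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((
                labels[start],
                start * frame_size / sample_rate,
                i * frame_size / sample_rate,
            ))
            start = i
    return segments

# Synthetic one-second signal at 1 kHz: two bursts of a 50 Hz tone.
SAMPLE_RATE = 1000
FRAME_SIZE = 100
signal = [
    math.sin(2 * math.pi * 50 * t / SAMPLE_RATE)
    if 300 <= t < 500 or 800 <= t < 1000 else 0.0
    for t in range(1000)
]

labels = label_frames(frame_energies(signal, FRAME_SIZE), threshold=0.1)
segments = merge_segments(labels, FRAME_SIZE, SAMPLE_RATE)
for segment in segments:
    print(segment)
```

Swapping `label_frames` for a model that predicts a class per frame from richer features gives the supervised version described above; the merging step stays unchanged.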
Case Study – https://www.geeksforgeeks.org/blog/2017/11/heart-sound-segmentation-deep-learning/
5. Audio Source Separation
Audio source separation consists of isolating one or more source signals from a mixture of signals. One of the most common applications is identifying the lyrics in the audio for simultaneous translation (in karaoke, for instance). A classic example is shown in Andrew Ng’s machine learning course, where he separates the sound of the speaker from the background music.
Whitepaper – http://ijcert.org/ems/ijcert_papers/V3I1103.pdf
A typical usage scenario involves:
- loading an audio file
- computing a time-frequency transform to obtain a spectrogram, and
- using a source separation algorithm (such as non-negative matrix factorization) to obtain a time-frequency mask
The mask is then multiplied with the spectrogram and the result is converted back to the time domain.
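The masking step on its own can be sketched in a few lines. The sketch deliberately leaves out the time-frequency transform and the separation algorithm: it assumes magnitude estimates for two sources are already available (for instance from NMF) and uses small hard-coded matrices with rows as frequency bins and columns as time frames. The soft "ratio" mask shown is one common choice, not the only one.

```python
def soft_mask(target_mag, other_mag, eps=1e-12):
    """Ratio mask: the share of each time-frequency bin attributed to the target."""
    return [
        [t / (t + o + eps) for t, o in zip(t_row, o_row)]
        for t_row, o_row in zip(target_mag, other_mag)
    ]

def apply_mask(mask, mixture_mag):
    """Element-wise product of the mask and the mixture spectrogram."""
    return [
        [m * x for m, x in zip(m_row, x_row)]
        for m_row, x_row in zip(mask, mixture_mag)
    ]

# Made-up magnitude estimates for two sources (2 frequency bins x 3 frames).
vocals_est = [[4.0, 0.0, 2.0],
              [1.0, 3.0, 0.0]]
music_est = [[0.0, 2.0, 2.0],
             [1.0, 1.0, 5.0]]

# Assumption for the toy example: the mixture magnitude is the sum of the two.
mixture = [
    [v + m for v, m in zip(v_row, m_row)]
    for v_row, m_row in zip(vocals_est, music_est)
]

mask = soft_mask(vocals_est, music_est)
separated_vocals = apply_mask(mask, mixture)
print(separated_vocals)
```

Every mask value lies between 0 and 1, so the masked spectrogram never exceeds the mixture in any bin. To get audio back, the masked magnitudes would be combined with the mixture's phase and passed through the inverse transform.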
Case Study – https://github.com/IoSR-Surrey/untwist
6. Beat Tracking
As the name suggests, the goal here is to track the location of each beat in a collection of audio files. Beat tracking can be utilized to automate time-consuming tasks that must be completed in order to synchronize events with music. It is useful in various applications, such as video editing, audio editing, and human-computer improvisation.
One approach to beat tracking is to parse the audio file and use an onset detection algorithm to track the beats. Although the techniques used for onset detection rely heavily on audio feature engineering and machine learning, deep learning can also be applied here to improve the results.
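Once onsets are available, a very crude way to turn them into beats is to estimate the tempo from the typical gap between onsets and lay a regular grid over the track. The sketch below does exactly that; it is a simple heuristic for illustration, not the algorithm used in the case study, and the onset times are invented.

```python
import statistics

def estimate_tempo(onset_times):
    """Tempo in beats per minute from the median inter-onset interval (seconds)."""
    intervals = [b - a for a, b in zip(onset_times, onset_times[1:])]
    return 60.0 / statistics.median(intervals)

def beat_grid(onset_times, duration):
    """Regular grid of beat times from the first onset to the end of the track."""
    period = 60.0 / estimate_tempo(onset_times)
    beats = []
    n = 0
    while onset_times[0] + n * period < duration:
        beats.append(onset_times[0] + n * period)
        n += 1
    return beats

# Made-up onset times in seconds; 1.75 is an off-beat note between two beats.
onsets = [0.5, 1.0, 1.5, 1.75, 2.0, 2.5, 3.0]

print(estimate_tempo(onsets))
print(beat_grid(onsets, duration=3.0))
```

Using the median rather than the mean keeps a few off-beat notes from skewing the tempo estimate. Real beat trackers also handle tempo changes and decide which onsets are actually beats.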
Case Study – https://github.com/adamstark/BTrack
7. Music Recommendation
Thanks to the internet, we now have millions of songs we can listen to anytime. Ironically, this has made it even harder to discover new music because of the plethora of options out there. Music recommendation systems help deal with this information overload by automatically recommending new music to listeners. Content providers like Spotify and Saavn have developed highly sophisticated music recommendation engines. These models leverage the user’s past listening history among many other features to build customized recommendation lists.
Whitepaper – https://pdfs.semanticscholar.org/7442/c1ebd6c9ceafa8979f683c5b1584d659b728.pdf
We can tackle the challenge of customizing listening preferences by training a regression/deep learning model to predict, from the audio itself, the latent representations of songs that were obtained from a collaborative filtering model. This way, we can predict the representation of a song in the collaborative filtering space even if no usage data is available for it.
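The payoff of predicting latent vectors from audio is that a brand-new song can be recommended like any other. The sketch below assumes the vectors already exist: three come from a hypothetical collaborative filtering model and the one for "new_release" stands for the output of an audio regression model. All names and numbers are invented.

```python
import math

def cosine(u, v):
    """Cosine similarity between two latent vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def user_profile(history, song_factors):
    """Average the latent vectors of the songs a user has listened to."""
    vectors = [song_factors[song] for song in history]
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def recommend(history, song_factors, top_n=2):
    """Rank unheard songs by similarity to the user's profile."""
    profile = user_profile(history, song_factors)
    candidates = [song for song in song_factors if song not in history]
    return sorted(
        candidates,
        key=lambda song: cosine(profile, song_factors[song]),
        reverse=True,
    )[:top_n]

song_factors = {
    "song_a": [0.9, 0.1],         # from collaborative filtering (hypothetical)
    "song_b": [0.8, 0.2],         # from collaborative filtering (hypothetical)
    "song_c": [0.1, 0.9],         # from collaborative filtering (hypothetical)
    "new_release": [0.85, 0.15],  # predicted from audio, no usage data yet
}

print(recommend(["song_a"], song_factors))
```

The ranking code never needs to know which vectors were learned from listening data and which were predicted from audio, which is what makes the approach useful for cold-start songs.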
Case Study – http://benanne.github.io/2014/08/05/spotify-cnns.html
8. Music Retrieval
One of the most difficult tasks in audio processing, Music Retrieval essentially aims to build a search engine based on audio. Although we can do this by solving sub-tasks like audio fingerprinting, this task encompasses much more than that. For example, we also have to solve different smaller tasks for different types of music retrieval (timbre detection would be great for gender identification). Currently, no system has been developed that matches industry-expected standards.
Whitepaper – http://www.nowpublishers.com/article/Details/INR-042
The task of music retrieval is divided into smaller and simpler steps, which include tonal analysis (e.g. melody and harmony) and rhythm or tempo analysis (e.g. beat tracking). Then, on the basis of these individual analyses, information is extracted which is used to retrieve similar audio samples.
Case Study – https://youtu.be/oGGVvTgHMHw
9. Music Transcription
Music Transcription is another challenging audio processing task. It consists of annotating audio and creating a kind of “sheet” from which the music can be regenerated at a later point in time. The manual effort involved in transcribing music from recordings can be vast. It varies enormously depending on the complexity of the music, how good our listening skills are and how detailed we want our transcription to be.
Whitepaper – http://ieeexplore.ieee.org/abstract/document/7955698
The approach for music transcription is similar to that of speech recognition: instead of transcribing speech into words, the audio is transcribed into the notes played by the instruments.
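The last step of such a pipeline, turning detected pitches into written notes, is simple enough to sketch. The example assumes pitch detection has already been done and produced (onset time, fundamental frequency) pairs, which are invented here. It uses the standard equal-temperament mapping to MIDI note numbers with A4 tuned to 440 Hz.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def frequency_to_midi(freq_hz, a4_hz=440.0):
    """Nearest MIDI note number for a frequency (A4 = 69 in the MIDI convention)."""
    return round(69 + 12 * math.log2(freq_hz / a4_hz))

def midi_to_name(midi):
    """Note name with octave, using the convention that MIDI 60 is C4."""
    return f"{NOTE_NAMES[midi % 12]}{midi // 12 - 1}"

def transcribe(events):
    """Map (onset_seconds, frequency_hz) pairs to (onset_seconds, note_name)."""
    return [(onset, midi_to_name(frequency_to_midi(freq))) for onset, freq in events]

# Made-up output of a pitch detector: onset time and fundamental frequency.
detected = [(0.0, 261.63), (0.5, 329.63), (1.0, 440.0)]

print(transcribe(detected))
```

Rounding to the nearest note absorbs small tuning errors. The hard parts of transcription, such as detecting several simultaneous notes and their durations, happen before this step.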
Case Study – https://youtu.be/9boJ-Ai6QFM
10. Onset Detection
Onset detection is the first step in analysing an audio/music sequence. For most of the tasks mentioned above, it is somewhat necessary to perform onset detection, i.e. detecting the start of an audio event. Onset detection was essentially the first task that researchers intended to solve in audio processing.
Whitepaper – http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.332.989&rep=rep1&type=pdf
Onset detection is typically done by:
- computing a spectral novelty function
- finding peaks in the spectral novelty function
- backtracking from each peak to a preceding local minimum. Backtracking can be useful for finding segmentation points such that the onset occurs shortly after the beginning of the segment
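The three steps above can be sketched in plain Python. One substitution to note: the novelty function here is based on the rise in frame energy rather than on a spectrum, which keeps the example short but is cruder than a true spectral novelty function. The function names, threshold and synthetic signal are assumptions made for the example.

```python
import math

def novelty_curve(samples, frame_size):
    """Half-wave rectified rise in frame energy (stand-in for spectral novelty)."""
    energies = [
        sum(s * s for s in samples[i:i + frame_size])
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]
    return [0.0] + [max(0.0, b - a) for a, b in zip(energies, energies[1:])]

def pick_peaks(curve, threshold):
    """Indices of local maxima of the novelty curve that reach the threshold."""
    return [
        i for i in range(1, len(curve) - 1)
        if curve[i] >= threshold and curve[i] > curve[i - 1] and curve[i] >= curve[i + 1]
    ]

def backtrack(curve, peak):
    """Walk left from a peak to the preceding local minimum."""
    i = peak
    while i > 0 and curve[i - 1] < curve[i]:
        i -= 1
    return i

# Synthetic signal: a 50 Hz tone at 1 kHz with a per-frame amplitude envelope
# describing two notes that fade in over two frames.
FRAME_SIZE = 100
envelope = [0.0, 0.0, 0.3, 1.0, 0.6, 0.2, 0.0, 0.5, 1.0, 0.3]
signal = [
    envelope[t // FRAME_SIZE] * math.sin(2 * math.pi * 50 * t / 1000)
    for t in range(len(envelope) * FRAME_SIZE)
]

curve = novelty_curve(signal, FRAME_SIZE)
peaks = pick_peaks(curve, threshold=10.0)
onsets = [backtrack(curve, p) for p in peaks]
print(peaks)
print(onsets)
```

The peaks land where the energy rises fastest, while the backtracked indices point to the quiet frame just before each note starts, which is the better place to cut if the audio is to be split into segments.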
Case Study – https://musicinformationretrieval.com/onset_detection.html
End Notes
In this article, I have mentioned a few tasks that are worth looking at when solving audio processing problems. I hope you find the article insightful in dealing with audio/speech-related projects.