State-of-the-Art Machine Learning Algorithms for Speech and Audio Problems

Audio Deep Learning Made Simple: Sound Classification, Step-by-Step

An end-to-end example and architecture for audio deep learning's foundational application scenario, in plain English.

Photo by bruce mars on Unsplash

Sound Classification is one of the most widely used applications in Audio Deep Learning. It involves learning to classify sounds and to predict the category of that audio. This type of problem can be applied to many practical scenarios, e.g. classifying music clips to identify the genre of the music, or classifying short utterances by a set of speakers to identify the speaker based on the voice.

In this article, we will walk through a simple demo application so as to understand the approach used to solve such audio classification problems. My goal throughout will be to understand not just how something works but why it works that way.

I have a few more articles in my audio deep learning series that you might find useful. They explore other fascinating topics in this space, including how we prepare audio data for deep learning, why we use Mel Spectrograms for deep learning models, and how they are generated and optimized.

  1. State-of-the-Art Techniques (What sound is and how it is digitized. What problems audio deep learning is solving in our daily lives. What Spectrograms are and why they are all-important.)
  2. Why Mel Spectrograms perform better (Processing audio data in Python. What are Mel Spectrograms and how to generate them)
  3. Data Preparation and Augmentation (Enhance Spectrogram features for optimal performance by hyper-parameter tuning and data augmentation)
  4. Automatic Speech Recognition (Speech-to-Text algorithm and architecture, using CTC Loss and Decoding for aligning sequences.)
  5. Beam Search (Algorithm commonly used by Speech-to-Text and NLP applications to enhance predictions)

Audio Classification

Just as classifying hand-written digits using the MNIST dataset is considered a 'Hello World'-type problem for Computer Vision, we can think of this application as the introductory problem for audio deep learning.

We will start with audio files, convert them into spectrograms, input them into a CNN plus Linear Classifier model, and produce predictions about the class to which the sound belongs.

Audio Classification application (Image by Author)

There are many suitable datasets available for sounds of different types. These datasets contain a large number of audio samples, along with a class label for each sample that identifies what type of sound it is, based on the problem you are trying to address.

These class labels can often be obtained from some part of the filename of the audio sample or from the sub-folder name in which the file is located. Alternatively, the class labels are specified in a separate metadata file, usually in TXT, JSON, or CSV format.

Example problem — Classifying ordinary city sounds

For our demo, we will use the Urban Sound 8K dataset, which consists of a corpus of ordinary sounds recorded from day-to-day city life. The sounds are taken from 10 classes such as drilling, dogs barking, and sirens. Each sound sample is labeled with the class to which it belongs.

After downloading the dataset, we see that it consists of two parts:

  • Audio files in the 'audio' folder: It has 10 sub-folders named 'fold1' through 'fold10'. Each sub-folder contains a number of '.wav' audio samples, e.g. 'fold1/103074-7-1-0.wav'
  • Metadata in the 'metadata' folder: It has a file 'UrbanSound8K.csv' that contains information about each audio sample in the dataset, such as its filename, its class label, the 'fold' sub-folder location, and so on. The class label is a numeric Class ID from 0–9 for each of the 10 classes, e.g. the number 0 means air conditioner, 1 is a car horn, and so on.

The samples are around 4 seconds in length. Here's what one sample looks like:

An audio sample of a drill (Image by Author)

Sample Rate, Number of Channels, Bits, and Audio Encoding

The recommendation of the dataset creators is to use the folds for doing 10-fold cross-validation to report metrics and evaluate the performance of your model. However, since our goal in this article is primarily to serve as a demo of an audio deep learning example rather than to obtain the best metrics, we will ignore the folds and treat all the samples simply as one large dataset.

Prepare training data

As with most deep learning problems, we will follow these steps:

Deep Learning Workflow (Image by Author)

The training data for this problem will be fairly simple:

  • The features (X) are the audio file paths
  • The target labels (y) are the class names

Since the dataset has a metadata file that contains this information already, we can use that directly. The metadata contains information about each audio file.

Since it is a CSV file, we can use Pandas to read it. We can then prepare the feature and label data from the metadata.
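A minimal sketch of this step, assuming the standard UrbanSound8K metadata columns ('slice_file_name', 'fold', 'classID') and a hypothetical download location:

```python
import pandas as pd
from pathlib import Path

# Hypothetical location of the downloaded dataset
download_path = Path('UrbanSound8K')

# Read the metadata file
df = pd.read_csv(download_path / 'metadata' / 'UrbanSound8K.csv')

# Construct the relative path of each audio file from the 'fold' and file name columns
df['relative_path'] = '/fold' + df['fold'].astype(str) + '/' + df['slice_file_name']

# Keep only the columns we need: the file path (feature) and the class ID (label)
df = df[['relative_path', 'classID']]
df.head()
```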

This gives us the information we need for our training data.

Training data with audio file paths and class IDs

Scan the audio file directory when metadata isn't available

Having the metadata file made things easy for us. How would we prepare our data for datasets that do not contain a metadata file?

Many datasets consist of only audio files arranged in a folder structure from which class labels can be derived. To prepare our training data in this format, we would do the following (see the sketch after the list):

Preparing Training Data when metadata isn't available (Image by Author)
  • Scan the directory and prepare a list of all the audio file paths.
  • Extract the class label from each file name, or from the name of the parent sub-folder
  • Map each class name from text to a numeric class ID
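A hypothetical sketch of that approach, assuming a layout with one sub-folder per class (e.g. 'data/dog_bark/xxx.wav'):

```python
from pathlib import Path
import pandas as pd

data_path = Path('data')   # hypothetical dataset root

# Scan the directory and collect all audio file paths along with the parent folder name
rows = [{'relative_path': str(f), 'class_name': f.parent.name}
        for f in data_path.glob('*/*.wav')]
df = pd.DataFrame(rows)

# Map each class name from text to a numeric class ID
class_to_id = {name: idx for idx, name in enumerate(sorted(df['class_name'].unique()))}
df['classID'] = df['class_name'].map(class_to_id)
```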

With or without metadata, the result would be the same: features consisting of a list of audio file names and target labels consisting of class IDs.

Audio Pre-processing: Define Transforms

This training data with audio file paths cannot be input directly into the model. We have to load the audio data from the file and process it so that it is in a format that the model expects.

This audio pre-processing will all be done dynamically at runtime, when we read and load the audio files. This approach is similar to what we would do with image files. Since audio data, like image data, can be fairly large and memory-intensive, we don't want to read the entire dataset into memory all at once, ahead of time. So we keep only the audio file names (or image file names) in our training data.

Then, at runtime, as we train the model one batch at a time, we will load the audio data for that batch and process it by applying a series of transforms to the audio. That way we keep audio data for only one batch in memory at a time.

With image data, we might have a pipeline of transforms where we first read the image file as pixels and load it. Then we might apply some image processing steps to reshape and resize the data, crop it to a fixed size, and convert it from RGB to grayscale. We might also apply some image augmentation steps like rotations, flips, and so on.

The processing for audio data is very similar. Right now we're just defining the functions; they will be run a little later when we feed data to the model during training.

Pre-processing the training data for input to our model (Image by Author)

Read audio from a file

The first thing we need is to read and load the audio file in '.wav' format. Since we are using PyTorch for this example, the implementation below uses torchaudio for the audio processing, but librosa will work just as well.
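The article gathers its audio helpers into an AudioUtil class; a standalone sketch of the loading step with torchaudio could look like this:

```python
import torchaudio

def open_audio(audio_file):
    # Returns the signal as a (num_channels, num_samples) tensor plus the sample rate
    sig, sr = torchaudio.load(audio_file)
    return (sig, sr)
```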

Audio wave loaded from a file (Image by Author)

Convert to two channels

Some of the audio files are mono (i.e. one audio channel) while most of them are stereo (i.e. two audio channels). Since our model expects all items to have the same dimensions, we will convert the mono files to stereo by duplicating the first channel to the second.
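A sketch of such a rechannel helper (shown standalone rather than as an AudioUtil method):

```python
import torch

def rechannel(aud, new_channel):
    sig, sr = aud
    if sig.shape[0] == new_channel:
        return aud                      # already has the desired number of channels
    if new_channel == 1:
        resig = sig[:1, :]              # keep only the first channel
    else:
        resig = torch.cat([sig, sig])   # duplicate the mono channel to make stereo
    return (resig, sr)
```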

Standardize sampling rate

Some of the audio files are sampled at a rate of 48000Hz, while most are sampled at 44100Hz. This means that one second of audio will have an array size of 48000 for some audio files, while it will have a smaller array size of 44100 for the others. Again, we must standardize and convert all audio to the same sampling rate so that all arrays have the same dimensions.
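A resampling sketch using torchaudio's Resample transform:

```python
import torchaudio

def resample(aud, new_sr):
    sig, sr = aud
    if sr == new_sr:
        return aud
    # Resample along the time axis; works on (num_channels, num_samples) tensors
    resig = torchaudio.transforms.Resample(sr, new_sr)(sig)
    return (resig, new_sr)
```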

Resize to the same length

We then resize all the audio samples to have the same length by either extending the duration by padding with silence, or by truncating it. We add that method to our AudioUtil class.
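A pad-or-truncate sketch; splitting the silence padding randomly between the beginning and end of the clip is an assumption:

```python
import random
import torch

def pad_trunc(aud, max_ms):
    sig, sr = aud
    num_channels, sig_len = sig.shape
    max_len = sr // 1000 * max_ms

    if sig_len > max_len:
        # Truncate to the fixed length
        sig = sig[:, :max_len]
    elif sig_len < max_len:
        # Pad with silence at both the beginning and the end
        pad_begin_len = random.randint(0, max_len - sig_len)
        pad_end_len = max_len - sig_len - pad_begin_len
        pad_begin = torch.zeros((num_channels, pad_begin_len))
        pad_end = torch.zeros((num_channels, pad_end_len))
        sig = torch.cat((pad_begin, sig, pad_end), dim=1)

    return (sig, sr)
```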

Data Augmentation: Time Shift

Next, we can do data augmentation on the raw audio signal by applying a Time Shift to shift the audio to the left or the right by a random amount. I go into a lot more detail about this and other data augmentation techniques in this article.
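A sketch of the time-shift augmentation; rolling the signal with wrap-around is an assumption:

```python
import random

def time_shift(aud, shift_limit):
    sig, sr = aud
    _, sig_len = sig.shape
    # Shift left or right by up to shift_limit (a fraction of the clip length)
    shift_amt = int(random.uniform(-shift_limit, shift_limit) * sig_len)
    return (sig.roll(shift_amt, dims=1), sr)
```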

Time Shift of the audio wave (Image by Author)

Mel Spectrogram

We then convert the augmented audio to a Mel Spectrogram. Mel Spectrograms capture the essential features of the audio and are often the most suitable way to input audio data into deep learning models. For more background on this, you might want to read my articles (here and here), which explain in simple words what a Mel Spectrogram is, why it is crucial for audio deep learning, as well as how it is generated and how to tune it for getting the best performance from your models.
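A conversion sketch using torchaudio's MelSpectrogram and AmplitudeToDB transforms; 64 Mel bands and an FFT size of 1024 are assumptions:

```python
from torchaudio import transforms

def spectro_gram(aud, n_mels=64, n_fft=1024, hop_len=None):
    sig, sr = aud
    # Mel Spectrogram of shape (num_channels, n_mels, time_steps)
    spec = transforms.MelSpectrogram(sr, n_fft=n_fft, hop_length=hop_len, n_mels=n_mels)(sig)
    # Convert to a decibel (log) scale
    spec = transforms.AmplitudeToDB(top_db=80)(spec)
    return spec
```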

Mel Spectrogram of an audio wave (Image by Author)

Data Augmentation: Time and Frequency Masking

Now we can do another round of augmentation, this time on the Mel Spectrogram rather than on the raw audio. We will use a technique called SpecAugment that uses these two methods (a sketch follows the list):

  • Frequency mask — randomly mask out a range of consecutive frequencies by adding horizontal bars on the spectrogram.
  • Time mask — similar to frequency masks, except that we randomly block out ranges of time from the spectrogram by using vertical bars.
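A sketch of this SpecAugment-style masking using torchaudio's FrequencyMasking and TimeMasking transforms; the mask sizes and counts are assumptions:

```python
from torchaudio import transforms

def spectro_augment(spec, max_mask_pct=0.1, n_freq_masks=2, n_time_masks=2):
    _, n_mels, n_steps = spec.shape
    mask_value = spec.mean()
    aug_spec = spec

    # Horizontal bars: mask out random bands of consecutive Mel frequency bins
    for _ in range(n_freq_masks):
        aug_spec = transforms.FrequencyMasking(int(max_mask_pct * n_mels))(aug_spec, mask_value)

    # Vertical bars: mask out random ranges of time steps
    for _ in range(n_time_masks):
        aug_spec = transforms.TimeMasking(int(max_mask_pct * n_steps))(aug_spec, mask_value)

    return aug_spec
```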

Mel Spectrogram after SpecAugment. Notice the horizontal and vertical mask bands (Image by Author)

Define Custom Data Loader

Now that we have defined all the pre-processing transform functions, we will define a custom PyTorch Dataset object.

To feed our data to a model with PyTorch, we need two objects (a Dataset sketch follows the list):

  • A custom Dataset object that uses all the audio transforms to pre-process an audio file and prepares one data item at a time.
  • A built-in DataLoader object that uses the Dataset object to fetch individual data items and packages them into a batch of data.
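A minimal Dataset sketch that chains the helper functions defined earlier; the target duration, sample rate, channel count, and shift fraction are assumptions:

```python
from torch.utils.data import Dataset

class SoundDS(Dataset):
    def __init__(self, df, data_path):
        self.df = df.reset_index(drop=True)
        self.data_path = str(data_path)
        self.duration = 4000    # target clip length in milliseconds
        self.sr = 44100         # target sample rate
        self.channel = 2        # target number of channels
        self.shift_pct = 0.4    # maximum time shift as a fraction of the clip length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        audio_file = self.data_path + self.df.loc[idx, 'relative_path']
        class_id = int(self.df.loc[idx, 'classID'])

        aud = open_audio(audio_file)
        aud = resample(aud, self.sr)
        aud = rechannel(aud, self.channel)
        aud = pad_trunc(aud, self.duration)
        aud = time_shift(aud, self.shift_pct)
        sgram = spectro_gram(aud, n_mels=64, n_fft=1024)
        aug_sgram = spectro_augment(sgram, max_mask_pct=0.1, n_freq_masks=2, n_time_masks=2)

        return aug_sgram, class_id
```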

Prepare Batches of Data with the Data Loader

All of the functions we need to input our data to the model have now been defined.

We use our custom Dataset to load the Features and Labels from our Pandas dataframe and split that data randomly in an 80:20 ratio into training and validation sets. We then use them to create our training and validation Data Loaders.
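A sketch of the split and the Data Loaders; the batch size of 16 matches the shapes quoted later, and 'data_path' is assumed to point at the dataset's 'audio' folder:

```python
from torch.utils.data import DataLoader, random_split

data_path = 'UrbanSound8K/audio'   # hypothetical root of the 'fold' sub-folders
myds = SoundDS(df, data_path)

# Random 80:20 split between training and validation sets
num_items = len(myds)
num_train = round(num_items * 0.8)
train_ds, val_ds = random_split(myds, [num_train, num_items - num_train])

train_dl = DataLoader(train_ds, batch_size=16, shuffle=True)
val_dl = DataLoader(val_ds, batch_size=16, shuffle=False)
```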

Split our data for training and validation (Image by Author)

When we start training, the Data Loader will randomly fetch one batch of input Features containing the list of audio file names and run the pre-processing audio transforms on each audio file. It will also fetch a batch of the corresponding target Labels containing the class IDs. Thus it will output one batch of training data at a time, which can be fed directly as input to our deep learning model.

Data Loader applies transforms and prepares one batch of data at a time (Image by Author)

Let's walk through the steps as our data gets transformed, starting with an audio file:

  • The audio from the file gets loaded into a Numpy array of shape (num_channels, num_samples). Most of the audio is sampled at 44.1kHz and is about 4 seconds in duration, resulting in 44,100 * 4 = 176,400 samples. If the audio has 1 channel, the shape of the array will be (1, 176,400). Similarly, audio of 4 seconds duration with 2 channels and sampled at 48kHz will have 192,000 samples and a shape of (2, 192,000).
  • Since the channels and sampling rates of each audio differ, the next two transforms resample the audio to a standard 44.1kHz and to a standard 2 channels.
  • Since some audio clips might be more or less than 4 seconds, we also standardize the audio duration to a fixed length of 4 seconds. Now arrays for all items have the same shape of (2, 176,400)
  • The Time Shift data augmentation now randomly shifts each audio sample forward or backward. The shapes are unchanged.
  • The augmented audio is now converted into a Mel Spectrogram, resulting in a shape of (num_channels, Mel freq_bands, time_steps) = (2, 64, 344)
  • The SpecAugment data augmentation now randomly applies Time and Frequency Masks to the Mel Spectrograms. The shapes are unchanged.

Thus, each batch will have two tensors, one for the X feature data containing the Mel Spectrograms and the other for the y target labels containing numeric Class IDs. The batches are picked randomly from the training data for each training epoch.

Each batch has a shape of (batch_sz, num_channels, Mel freq_bands, time_steps)

A batch of (X, y) data

We can visualize one item from the batch. We see the Mel Spectrogram with vertical and horizontal stripes showing the Frequency and Time Masking data augmentation.

The data is now ready for input to the model.

Create Model

The data processing steps that we just did are the most unique aspects of our audio classification problem. From here on, the model and training procedure are quite similar to what is commonly used in a standard image classification problem and are not specific to audio deep learning.

Since our data now consists of Spectrogram images, we build a CNN classification architecture to process them. It has four convolutional blocks which generate the feature maps. That data is then reshaped into the format we need so it can be input into the linear classifier layer, which finally outputs the predictions for the 10 classes.

The model takes a batch of pre-processed data and outputs class predictions (Image by Author)

A few more details about how the model processes a batch of data (a model sketch follows the list):

  • A batch of images is input to the model with shape (batch_sz, num_channels, Mel freq_bands, time_steps), i.e. (16, 2, 64, 344).
  • Each CNN layer applies its filters to step up the image depth, i.e. the number of channels. The image width and height are reduced as the kernels and strides are applied. Finally, after passing through the four CNN layers, we get the output feature maps, i.e. (16, 64, 4, 22).
  • This gets pooled and flattened to a shape of (16, 64) and then input to the Linear layer.
  • The Linear layer outputs one prediction score per class, i.e. (16, 10).
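A model sketch consistent with the shapes listed above; the exact kernel sizes, strides, and batch-norm placement are assumptions:

```python
import torch.nn as nn

class AudioClassifier(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv = nn.Sequential(
            # (2, 64, 344) -> (8, 32, 172)
            nn.Conv2d(2, 8, kernel_size=5, stride=2, padding=2),
            nn.ReLU(), nn.BatchNorm2d(8),
            # (8, 32, 172) -> (16, 16, 86)
            nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(), nn.BatchNorm2d(16),
            # (16, 16, 86) -> (32, 8, 43)
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(), nn.BatchNorm2d(32),
            # (32, 8, 43) -> (64, 4, 22)
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(), nn.BatchNorm2d(64),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)      # (64, 4, 22) -> (64, 1, 1)
        self.linear = nn.Linear(64, num_classes)

    def forward(self, x):
        x = self.conv(x)                 # four convolutional blocks
        x = self.pool(x).flatten(1)      # pooled and flattened to (batch, 64)
        return self.linear(x)            # (batch, num_classes)
```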

Training

We are now ready to create the training loop to train the model.

We define the functions for the optimizer, loss, and scheduler to dynamically vary our learning rate as training progresses, which usually allows training to converge in fewer epochs.

We train the model for several epochs, processing a batch of data in each iteration. We keep track of a simple accuracy metric which measures the percentage of correct predictions.
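A minimal training-loop sketch; the Adam optimizer, cross-entropy loss, one-cycle learning-rate schedule, and epoch count are assumptions:

```python
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = AudioClassifier().to(device)

num_epochs = 10
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.001, steps_per_epoch=len(train_dl), epochs=num_epochs)

for epoch in range(num_epochs):
    model.train()
    correct, total, running_loss = 0, 0, 0.0
    for inputs, labels in train_dl:
        inputs, labels = inputs.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        scheduler.step()

        running_loss += loss.item()
        correct += (outputs.argmax(dim=1) == labels).sum().item()
        total += labels.size(0)

    print(f'Epoch {epoch}: loss={running_loss / len(train_dl):.3f}, '
          f'accuracy={correct / total:.3f}')
```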

Inference

Ordinarily, as part of the training loop, we would also evaluate our metrics on the validation data. We would then do inference on unseen data, perhaps by keeping aside a test dataset from the original data. However, for the purposes of this demo, we will use the validation data for this purpose.

We run an inference loop, taking care to disable the gradient updates. The forward pass is executed with the model to get predictions, but we do not need to backpropagate or run the optimizer.
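A minimal inference sketch over the validation Data Loader:

```python
model.eval()
correct, total = 0, 0
with torch.no_grad():
    for inputs, labels in val_dl:
        inputs, labels = inputs.to(device), labels.to(device)
        # Forward pass only; no backpropagation or optimizer step
        predictions = model(inputs).argmax(dim=1)
        correct += (predictions == labels).sum().item()
        total += labels.size(0)

print(f'Validation accuracy: {correct / total:.3f} ({total} items)')
```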

Conclusion

We have now seen an end-to-end example of sound classification, which is one of the most foundational problems in audio deep learning. Not only is this used in a wide range of applications, but many of the concepts and techniques that we covered here will be relevant to more complicated audio problems such as automatic speech recognition, where we start with human speech, understand what people are saying, and convert it to text.

And finally, if you liked this article, you might also enjoy my other series on Transformers, Geolocation Machine Learning, and Image Caption architectures.

Let's keep learning!

Source: https://towardsdatascience.com/audio-deep-learning-made-simple-sound-classification-step-by-step-cebc936bbe5
