Speech Recognizer

Made by SeanvonB | Source

This was my final project for Udacity's Natural Language Processing Nanodegree, which I completed in late 2020. It's really a summative project for all of my School of AI Nanodegrees – AI Programming and Computer Vision included – because it relies heavily on knowledge from all three courses.

In data science terms, I should actually call this an Automatic Speech Recognition (ASR) pipeline. An ASR pipeline receives spoken audio as input and returns a text transcript of the speech as output, so you'll often find this at the heart of speech recognition or dictation software. The end result should look something like this:

I'll explain the process across the following three sections:

  1. Preprocessing
  2. Models
  3. Prediction

Let's get started!

1.0 Preprocessing

As always, I begin by examining the dataset. For this project, Udacity provided us with the LibriSpeech Corpus, which is about 1,000 hours of English speech samples compiled from public domain audio books. The full corpus can be found here; however, for this project, Udacity selected a small subset to help reduce the training burden.

The following loads the dataset, which returns these variables:

  • vis_text - transcribed text (label) for the training example.
  • vis_raw_audio - raw audio waveform for the training example.
  • vis_mfcc_feature - mel-frequency cepstral coefficients (MFCCs) for the training example.
  • vis_spectrogram_feature - spectrogram for the training example.
  • vis_audio_path - the file path to the training example.
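In sketch form, that loading step boils down to a single call. The helper name and its home in utils.py are assumptions on my part here, but the unpacked variables are the ones listed above:

```python
# Sketch of the loading cell; vis_train_features is an assumed helper name.
from utils import vis_train_features

vis_text, vis_raw_audio, vis_mfcc_feature, vis_spectrogram_feature, vis_audio_path = (
    vis_train_features()
)

print(vis_text)        # ground-truth transcript for the first training example
print(vis_audio_path)  # path to the underlying audio file
```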

1.1 Sample the Data

Next, the following will import the tools necessary for visualizing and playing audio samples. The embedded IPython audio player should allow you to listen to the first sample from the dataset:
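In sketch form, playback is just IPython's Audio widget wrapped around the raw waveform loaded above (LibriSpeech audio is sampled at 16 kHz):

```python
# Minimal sketch: inline playback of the first raw waveform from the dataset.
from IPython.display import Audio

Audio(data=vis_raw_audio, rate=16000)  # embedded player for the first sample
```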

1.2 Choosing Feature Representation

The above sample isn't yet useful to me, because computers can't hear – at least, not readily. I've read about some deep learning architectures that can read raw audio data, but they're a different beast. However, from projects like my Facial Keypoint Detector and Image Captioner, I've learned that computers can see pretty well. Images are just matrices, so I know the first step to preprocessing this data could simply be to make the audio samples just as matrix-y as possible.

Udacity provided two suggestions: spectrograms and mel-frequency cepstral coefficients (MFCCs).

So, let's look at both...

1.3 Spectrograms

You know 'em, you love 'em, and you didn't know how to calculate them until you borrowed the calculation from this repository for utils.py.

These bad boys are 3D representations of an audio signal over time. On the x-axis, you have time; on the y-axis, frequency; and, represented in the third dimension by color, amplitude. Those are all definitely physics words that I learned at one point and briefly re-learned for this project.

To speed up calculations without impacting performance, spectrograms can also be normalized to fall within the range of -1 to 1.
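Here's a minimal sketch of that kind of scaling – illustrative only, since the actual normalization in utils.py may differ in detail:

```python
import numpy as np

def normalize_spectrogram(spec, eps=1e-14):
    """Min-max scale a spectrogram into the range [-1, 1]."""
    spec = np.asarray(spec, dtype=np.float64)
    spec_min, spec_max = spec.min(), spec.max()
    return 2.0 * (spec - spec_min) / (spec_max - spec_min + eps) - 1.0
```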

Here's an example of one:

1.4 Mel-Frequency Cepstral Coefficients (MFCCs)

This time, the calculation was boosted from this repository for utils.py.

As I'll show in the example below, MFCCs look like simplified versions of spectrograms. Their calculation involves some "linear cosine transform of a log power spectrum" stuff, but the Mel part caught my attention. The Mel scale is a scale of pitches that human listeners hear as being equidistant from each other, sort of like a scientific solfège. So, while they are simplified spectrograms in some respect, they've been simplified to favor human experience. That sounds like a way to help a model hear as we do. An MFCC feature is also much lower-dimensional than a spectrogram feature, which could help a model generalize – interesting!
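For reference, here's a sketch of how MFCCs can be computed with the python_speech_features package, which is the kind of calculation that ended up in utils.py (13 coefficients per frame, matching the mfcc_dim used later):

```python
# Sketch: compute 13 MFCCs per frame for one training example.
import soundfile as sf
from python_speech_features import mfcc

samples, sample_rate = sf.read(vis_audio_path)                   # waveform + 16 kHz rate
mfcc_feature = mfcc(samples, samplerate=sample_rate, numcep=13)  # shape: (frames, 13)
```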

Here's an example of an MFCC; again, normalized:

2.0 Models

Now that the audio data is preprocessed into a format that a model can receive, I'll begin playing with some neural network architectures for acoustic modeling. Just like I did with the Language Translator, I'll begin simple and add new features after each successful training session.

Here are the steps this stage will take:

  1. Simple RNN
  2. RNN + TimeDistributed Dense
  3. CNN + RNN + TimeDistributed Dense
  4. Deeper RNN + TimeDistributed Dense
  5. Bidirectional RNN + TimeDistributed Dense
  6. Final Model

And, of course, you know the final_model will just be a mash of whatever worked well together.

Let's begin with some workspace utility, provided by Udacity:

2.1 Simple RNN

Again, there isn't anything simple about a Recurrent Neural Network, but this one will be the vanilla flavor of the day. Because I'm working with sequential data, all of the models in this notebook will prominently feature RNNs, and this model will serve as the baseline.

As you can see in this example, this model (and all of the others) will take the acoustic features of the audio sequentially, one time step at a time:

Then, for each time step, the model will choose from 28 possible outputs: the 26 letters of the English alphabet, the space character, or an apostrophe. Well, technically the model will produce a vector of probabilities over all of the above, but I'll just use that vector to select the highest probability for now.

The following example is the same model in what's called the unrolled format:

The simple RNN is specified in Keras as follows:
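Roughly like this, that is – the real version lives in models.py, so the names and arguments below are illustrative. Note that the output layer is 29 wide: the 28 characters above plus the blank token that CTC needs.

```python
from keras.models import Model
from keras.layers import Input, GRU, Activation

def simple_rnn_model(input_dim, output_dim=29):
    """Baseline acoustic model: one recurrent layer, softmax per time step."""
    input_data = Input(name='the_input', shape=(None, input_dim))
    # A single GRU maps each time step directly to character scores
    rnn = GRU(output_dim, return_sequences=True, name='rnn')(input_data)
    # Softmax turns those scores into a probability distribution per time step
    y_pred = Activation('softmax', name='softmax')(rnn)
    return Model(inputs=input_data, outputs=y_pred)
```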

My acoustic models will train with Connectionist Temporal Classification (CTC) as their loss function. Note: using custom loss functions like CTC with Keras required some tinkering in 2020, but I don't know whether this is still true. Udacity helped me implement this criterion as add_ctc_loss in utils.py.
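The tinkering is the usual lambda-layer pattern: because a Keras loss function only sees (y_true, y_pred), the CTC loss gets computed inside the graph instead. Whether add_ctc_loss is implemented exactly this way is an assumption, but the sketch below captures the idea:

```python
from keras import backend as K
from keras.models import Model
from keras.layers import Input, Lambda

def add_ctc_loss(acoustic_model):
    """Wrap an acoustic model so its 'output' is the CTC loss itself."""
    labels = Input(name='the_labels', shape=(None,), dtype='float32')
    input_length = Input(name='input_length', shape=(1,), dtype='int64')
    label_length = Input(name='label_length', shape=(1,), dtype='int64')
    # K.ctc_batch_cost runs inside a Lambda layer; if a conv front end
    # downsamples the sequence, input_length must be adjusted to match.
    loss_out = Lambda(lambda args: K.ctc_batch_cost(*args), name='ctc')(
        [labels, acoustic_model.output, input_length, label_length])
    return Model(inputs=[acoustic_model.input, labels, input_length, label_length],
                 outputs=loss_out)
```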

Finally, training involves a number of optional arguments that can fine-tune the process:

  • minibatch_size - the size of the minibatches that are generated while training the model (default: 20).
  • spectrogram - Boolean value for whether spectrograms (True) or MFCCs (False) are used for training (default: True).
  • mfcc_dim - the size of the feature dimension to use when generating MFCC features (default: 13).
  • optimizer - the Keras optimizer used to train the model (default: SGD).
  • epochs - the number of epochs to use to train the model (default: 20).
  • verbose - controls the verbosity of the training output in the model.fit_generator method (default: 1).
  • sort_by_duration - Boolean value dictating whether the training and validation sets are sorted by (increasing) duration before the start of the first epoch (default: False).

For all hyperparameters, I chose what I determined to be commonly accepted "best practice" or "proof of concept" defaults after some searching on Stack Overflow and Reddit.

I mentioned it before, but you might also notice input_dim=13 appearing frequently. This indicates that I chose to use MFCCs over spectrograms, which would instead require input_dim=161.

The following cell will train model_0:
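In sketch form, that call looks something like this – the helper name train_model and the path arguments follow the Udacity workspace utilities, so treat them as assumptions:

```python
from utils import train_model            # assumed location of the training helper
from models import simple_rnn_model      # assumed name of the baseline model

model_0 = simple_rnn_model(input_dim=13)     # 13 = MFCC features per time step

train_model(input_to_softmax=model_0,
            pickle_path='model_0.pickle',    # where the loss history gets saved
            save_model_path='model_0.h5',    # where the best weights get saved
            spectrogram=False)               # False -> train on MFCCs, not spectrograms
```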

2.2 RNN + TimeDistributed Dense

The primary change for this model will be the addition of a TimeDistributed Dense layer along with BatchNormalization. Generally, batch normalization refers to a collection of strategies that allow the network to train faster by safely using a higher learning rate; paired with the TimeDistributed layer, which applies the dense output layer to every time step, the network can use that advantage to find more complex relationships in the dataset.
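A sketch of the change (unit counts and layer names are illustrative): batch normalization after the recurrent layer, then a Dense layer applied to every time step via the TimeDistributed wrapper:

```python
from keras.models import Model
from keras.layers import (Input, GRU, BatchNormalization,
                          TimeDistributed, Dense, Activation)

def rnn_model(input_dim, units=200, output_dim=29):
    input_data = Input(name='the_input', shape=(None, input_dim))
    rnn = GRU(units, return_sequences=True, name='rnn')(input_data)
    # Normalize recurrent activations so training tolerates a higher learning rate
    bn_rnn = BatchNormalization(name='bn_rnn')(rnn)
    # Apply the same Dense classifier independently at every time step
    time_dense = TimeDistributed(Dense(output_dim))(bn_rnn)
    y_pred = Activation('softmax', name='softmax')(time_dense)
    return Model(inputs=input_data, outputs=y_pred)
```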

Here is the new model, rolled:

But I think the unrolled model illustrates this one much better:

The following cells will train model_1:

2.3 CNN + RNN + TimeDistributed Dense

Expanding on the previous model, this one includes a 1D Convolutional layer, which borrows a technique from computer vision to hopefully extract more useful feature maps: sliding a filter over the initial feature representation to enhance important features while muting others.
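Here's a sketch of that addition – the filter count, kernel size, and stride are illustrative choices, not the tuned values:

```python
from keras.models import Model
from keras.layers import (Input, Conv1D, BatchNormalization, GRU,
                          TimeDistributed, Dense, Activation)

def cnn_rnn_model(input_dim, filters=200, kernel_size=11, conv_stride=2,
                  units=200, output_dim=29):
    input_data = Input(name='the_input', shape=(None, input_dim))
    # Slide a bank of 1D filters over the time axis of the features
    conv_1d = Conv1D(filters, kernel_size, strides=conv_stride, padding='valid',
                     activation='relu', name='conv1d')(input_data)
    bn_cnn = BatchNormalization(name='bn_conv1d')(conv_1d)
    # The recurrent + TimeDistributed stack from the previous model follows
    rnn = GRU(units, return_sequences=True, name='rnn')(bn_cnn)
    bn_rnn = BatchNormalization(name='bn_rnn')(rnn)
    time_dense = TimeDistributed(Dense(output_dim))(bn_rnn)
    y_pred = Activation('softmax', name='softmax')(time_dense)
    return Model(inputs=input_data, outputs=y_pred)
```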

The resulting model will look like this:

The following cells will train model_2:

2.4 Deeper RNN + TimeDistributed Dense

Until now, the models used a single recurrent layer, but I want to see what happens if I adjust the model to accept a variable number of RNN layers. On one hand, this could simply be overkill that only serves to extend training time; on the other, maybe the depth provided by multiple RNNs will be what cracks this problem – I'll be looking for a strong performance uplift to justify the added layers.
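The idea in sketch form is just a loop over recurrent layers, where recur_layers is the knob being tested (names and unit counts illustrative):

```python
from keras.models import Model
from keras.layers import (Input, GRU, BatchNormalization,
                          TimeDistributed, Dense, Activation)

def deep_rnn_model(input_dim, units=200, recur_layers=2, output_dim=29):
    input_data = Input(name='the_input', shape=(None, input_dim))
    layer = input_data
    # Stack as many GRU + BatchNormalization blocks as requested
    for i in range(recur_layers):
        layer = GRU(units, return_sequences=True, name='rnn_{}'.format(i))(layer)
        layer = BatchNormalization(name='bn_rnn_{}'.format(i))(layer)
    time_dense = TimeDistributed(Dense(output_dim))(layer)
    y_pred = Activation('softmax', name='softmax')(time_dense)
    return Model(inputs=input_data, outputs=y_pred)
```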

Since I'm worried about training time, I'll temporarily remove the CNN, and the model will look like this:

The following cells will train model_3:

2.5 Bidirectional RNN + TimeDistributed Dense

Deeper RNNs provide more of the same benefit, but a Bidirectional layer provides something new: the ability to see future contexts, which can make a difference when working with language, where quirks like phrasal verbs or split clauses might otherwise confuse the model.
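In sketch form, the only new piece is the Bidirectional wrapper, which runs one GRU forward and one backward over the sequence and concatenates their outputs:

```python
from keras.models import Model
from keras.layers import (Input, GRU, Bidirectional,
                          TimeDistributed, Dense, Activation)

def bidirectional_rnn_model(input_dim, units=200, output_dim=29):
    input_data = Input(name='the_input', shape=(None, input_dim))
    # Forward and backward GRUs give each time step both past and future context
    bidir_rnn = Bidirectional(GRU(units, return_sequences=True),
                              name='bidir_rnn')(input_data)
    time_dense = TimeDistributed(Dense(output_dim))(bidir_rnn)
    y_pred = Activation('softmax', name='softmax')(time_dense)
    return Model(inputs=input_data, outputs=y_pred)
```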

As you can see, this model looks a lot like the last one, but the RNN layers now feed in both directions:

The following cells will train model_4:

2.6 Compare the Models

I didn't talk about the performance of each model, because I was waiting for this step. The following cell will plot the change in training and validation loss per epoch for each model, so we can see how each one performed. It may also be possible to see when models begin to overfit or suffer from exploding/vanishing gradients. Everything is better with graphs!
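Assuming each training run pickled its Keras history the way the earlier train_model sketch did, the plot is straightforward:

```python
import pickle
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
for name in ['model_0', 'model_1', 'model_2', 'model_3', 'model_4']:
    with open('{}.pickle'.format(name), 'rb') as f:
        history = pickle.load(f)               # dict with 'loss' and 'val_loss'
    ax1.plot(history['loss'], label=name)
    ax2.plot(history['val_loss'], label=name)
ax1.set(title='Training loss', xlabel='Epoch', ylabel='CTC loss')
ax2.set(title='Validation loss', xlabel='Epoch', ylabel='CTC loss')
ax1.legend()
ax2.legend()
plt.show()
```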

Obviously, TimeDistributed had a pretty substantial impact; otherwise, the models performed more or less the same. There does, however, appear to be a clear benefit to including the CNN for feature extraction, so my final_model will certainly make use of this architecture.

2.7 Final Model

My final model for this project can be found with the others in models.py and includes the Convolutional, BatchNormalization, Bidirectional, and TimeDistributed layers as they were shown in the steps above.

It also includes Dropout layers per this research paper, which required some help from this repository to implement for recurrent layers. I did not include Dropout thus far, because the training sessions were relatively short and overfitting wasn't as likely. I also experimented with adding MaxPool layers to the CNN, but my implementation actually performed substantially worse with Max Pooling for some reason.
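Here's a sketch of how those pieces fit together – unit counts, dropout rates, and layer names are illustrative; the real final_model is in models.py:

```python
from keras.models import Model
from keras.layers import (Input, Conv1D, BatchNormalization, Bidirectional,
                          GRU, TimeDistributed, Dense, Activation, Dropout)

def final_model(input_dim=13, filters=200, kernel_size=11, conv_stride=2,
                units=200, output_dim=29):
    input_data = Input(name='the_input', shape=(None, input_dim))
    # Convolutional front end for local feature extraction
    conv_1d = Conv1D(filters, kernel_size, strides=conv_stride, padding='valid',
                     activation='relu', name='conv1d')(input_data)
    bn_cnn = BatchNormalization(name='bn_conv1d')(conv_1d)
    # Bidirectional recurrent layer with dropout on inputs and recurrent state
    bidir_rnn = Bidirectional(GRU(units, return_sequences=True,
                                  dropout=0.3, recurrent_dropout=0.3),
                              name='bidir_rnn')(bn_cnn)
    bn_rnn = BatchNormalization(name='bn_rnn')(bidir_rnn)
    # Per-time-step classifier, with extra dropout before the softmax
    time_dense = TimeDistributed(Dense(output_dim))(Dropout(0.3)(bn_rnn))
    y_pred = Activation('softmax', name='softmax')(time_dense)
    return Model(inputs=input_data, outputs=y_pred)
```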

Whew, the following cells will train the final_model:

3.0 Prediction

We've reached the best part: watching a computer almost transcribe human speech. Yeah, unfortunately, this project was never destined to be a perfect speech-to-text device. That would require substantial training time on expensive cloud computing hardware, and Udacity provided only so much time and so much GPU power.

But let's see how close this model comes! The following will retrieve an audio sample, run it through the network, and print the outcome:
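The notebook uses the project's prediction utilities for this; the sketch below swaps in a simple greedy decode instead (index_map, the character lookup, and the blank index are assumptions), just to show the shape of the process:

```python
import numpy as np

def greedy_decode(model, features, index_map):
    """Run one example through the acoustic model and collapse the per-step
    argmax predictions into text by dropping repeats and CTC blanks."""
    probs = model.predict(np.expand_dims(features, axis=0))[0]  # (time, 29)
    best_path = np.argmax(probs, axis=1)
    blank = probs.shape[1] - 1          # assumes the blank is the last index
    chars, prev = [], None
    for idx in best_path:
        if idx != prev and idx != blank:
            chars.append(index_map[idx])
        prev = idx
    return ''.join(chars)
```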

That's definitely not perfect. But I do understand exactly what was said. What do you think?

This one might be slightly worse. But I find it interesting that the model struggles most with linking (in speech terms) or spacing (in text terms), because that's also the most difficult part of language acquisition and comprehension for non-native English speakers.

Clearly, there remains work to be done. But I'm honestly pretty proud of what was accomplished with relatively few resources, which nicely highlights the power of machine learning. Given a dedicated, pre-trained language model and/or 4-6 weeks of low-learning-rate training on Amazon or Google Cloud servers, I think my architecture might just hold up!

Thanks for reading!

Made by SeanvonB | Source