This was my final for Udacity's Natural Language Processing Nanodegree, which I completed in late 2020. It's really a summative project for all of my School of AI Nanodegrees, AI Programming and Computer Vision included, because it relies heavily on knowledge from all three courses.
In data science terms, I should actually call this an Automatic Speech Recognition (ASR) pipeline. An ASR pipeline receives spoken audio as input and returns a text transcript of the speech as output, so you'll often find this at the heart of speech recognition or dictation software. The end result should look something like this:

I'll explain the process across the following three sections:
Let's get started!
As always, I begin by examining the dataset. For this project, Udacity provided us with the LibriSpeech Corpus, which is about 1,000 hours of English speech samples compiled from public domain audio books. The full corpus can be found here; however, for this project, Udacity selected a small subset to help reduce the training burden.
The following loads the dataset, which returns these variables:
vis_text - transcribed text (label) for the training example.
vis_raw_audio - raw audio waveform for the training example.
vis_mfcc_feature - mel-frequency cepstral coefficients (MFCCs) for the training example.
vis_spectrogram_feature - spectrogram for the training example.
vis_audio_path - the file path to the training example.
from data_generator import vis_train_features
# Extract label and audio features for single training sample
vis_text, vis_raw_audio, vis_mfcc_feature, vis_spectrogram_feature, vis_audio_path = vis_train_features()
There are 2023 total training examples.
Next, the following will import the tools necessary for visualizing and playing audio samples. The embedded IPython audio player should allow you to listen to the first sample from the dataset:
from IPython.display import Markdown, display
from data_generator import vis_train_features, plot_raw_audio
from IPython.display import Audio
%matplotlib inline
# Plot audio signal
plot_raw_audio(vis_raw_audio)
# Print length of audio signal
display(Markdown('**Shape of Audio Signal** : ' + str(vis_raw_audio.shape)))
# Print corresponding transcript
display(Markdown('**Transcript** : ' + str(vis_text)))
# Play audio file
Audio(vis_audio_path)
Shape of Audio Signal : (84231,)
Transcript : her father is a most remarkable person to say the least
The above sample isn't yet useful to me, because computers can't hear – at least, not readily. I've read about some deep learning architectures that can read raw audio data, but they're a different beast. However, from projects like my Facial Keypoint Detector and Image Captioner, I've learned that computers can see pretty well. Images are just matrices, so I know the first step to preprocessing this data could simply be to make the audio samples just as matrix-y as possible.
Udacity provided two suggestions: spectrograms and mel-frequency cepstral coefficients (MFCCs).
So, let's look at both...
You know 'em, you love 'em, and you didn't know how to calculate them until you borrowed the calculation from
this repository for utils.py.
These bad boys are three-dimensional representations of an audio signal: on the x-axis, you have time; on
the y-axis, frequency; and, represented in the third dimension by color, amplitude.
Those are all definitely physics words that I learned at one point and briefly re-learned for this project.
To speed up training without impacting performance, spectrograms can also be normalized to
fall within the range of -1 to 1.
Here's an example of one:
from data_generator import plot_spectrogram_feature
# Plot normalized spectrogram
plot_spectrogram_feature(vis_spectrogram_feature)
# Print spectrogram shape
display(Markdown('**Shape of Spectrogram** : ' + str(vis_spectrogram_feature.shape)))
Shape of Spectrogram : (381, 161)
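If you're curious how a spectrogram like this can be computed from the raw waveform, here's a minimal sketch using scipy. The window length, overlap, and log-scaling choices below are my own assumptions rather than the exact parameters behind the shape above; the project's real calculation lives in utils.py.
import numpy as np
from scipy.signal import spectrogram
# vis_raw_audio is the 1-D waveform loaded earlier; LibriSpeech audio is sampled at 16 kHz
freqs, times, spec = spectrogram(vis_raw_audio, fs=16000, nperseg=320, noverlap=160)
# Work in log space, then squash into roughly [-1, 1] as described above
log_spec = np.log(spec + 1e-10)
normalized = 2 * (log_spec - log_spec.min()) / (log_spec.max() - log_spec.min()) - 1
print(normalized.T.shape)  # (time steps, frequency bins)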
This time, the calculation was boosted from this
repository for utils.py.
As I'll show in the example below, MFCCs look like simplified versions of spectrograms. Their calculation involves some "linear cosine transform of a log power spectrum" stuff, but the Mel part caught my attention. The Mel scale is a scale of pitches that human listeners hear as being equidistant from each other, sort of like a scientific solfège. So, while they are simplified spectrograms in some respect, they've been simplified to favor human experience. That sounds like a way to help a model hear as we do. An MFCC feature is also much lower-dimensional than a spectrogram feature, which could help a model generalize – interesting!
Here's an example of an MFCC; again, normalized:
from data_generator import plot_mfcc_feature
# Plot normalized MFCC
plot_mfcc_feature(vis_mfcc_feature)
# Print MFCC shape
display(Markdown('**Shape of MFCC** : ' + str(vis_mfcc_feature.shape)))
Shape of MFCC : (381, 13)
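For reference, MFCCs can be computed in a couple of lines with a library like python_speech_features; again, this is only a stand-in for the borrowed calculation in utils.py, and the standardization step is my own assumption.
import numpy as np
from python_speech_features import mfcc
# vis_raw_audio is the waveform from earlier; LibriSpeech audio is sampled at 16 kHz
mfcc_feat = mfcc(vis_raw_audio, samplerate=16000, numcep=13)
# Standardize each coefficient so the model sees well-scaled inputs
normalized_mfcc = (mfcc_feat - np.mean(mfcc_feat, axis=0)) / (np.std(mfcc_feat, axis=0) + 1e-14)
print(normalized_mfcc.shape)  # (time steps, 13)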
Now that the audio data is preprocessed into a format that a model can receive, I'll begin playing with some neural network architectures for acoustic modeling. Just like I did with the Language Translator, I begin simple and add new features after each successful training session.
Here are the steps this stage will take:
model_0 - a simple RNN baseline
model_1 - an RNN with BatchNormalization and a TimeDistributed dense layer
model_2 - a CNN + RNN
model_3 - a deeper RNN with a variable number of recurrent layers
model_4 - a bidirectional RNN
And, of course, you know the final_model will just be a mash of whatever worked well together.
Let's begin with some workspace utility, provided by Udacity:
#############################################################
# RUN THIS CODE CELL IF RESUMING NOTEBOOK AFTER A BREAK #
#############################################################
from keras.backend.tensorflow_backend import set_session
import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.75
set_session(tf.Session(config=config))
# Watch for changes in `models.py` and reload automatically
%load_ext autoreload
%autoreload 2
# Import NN architectures for speech recognition
from models import *
# Import function for training acoustic model
from utils import train_model
Using TensorFlow backend.
Again, there isn't anything simple about a Recurrent Neural Network, but this one will be the vanilla flavor of the day. Because I'm working with sequential data, all of the models in this notebook will prominently feature RNNs, and this model will serve as the baseline.
As you can see in this example, this model (and all of the others) will take the acoustic features of the audio sequentially, one time step at a time:

Then, for each time step, the model will choose from 28 possible output characters: the 26 letters of the English alphabet, the space character, and the apostrophe (plus a blank token used by the CTC loss, which is why the summaries below show 29 outputs). Well, technically the model will produce a vector of probabilities over all of the above, but I'll just use that vector to select the highest probability for now.
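To make that concrete, here's a toy sketch of picking the highest-probability character at each time step; the index_map and blank index below are hypothetical stand-ins for the project's actual character map and int_sequence_to_text helper.
import numpy as np
# Hypothetical character map: 0 = space, 1-26 = a-z, 27 = apostrophe, 28 = CTC blank
index_map = {0: ' ', **{i: chr(ord('a') + i - 1) for i in range(1, 27)}, 27: "'"}
BLANK = 28

def greedy_decode(probabilities):
    """probabilities: array of shape (time steps, 29) from the acoustic model."""
    best = np.argmax(probabilities, axis=1)  # highest-probability index per time step
    chars, previous = [], BLANK
    for idx in best:
        # Collapse repeated indices and drop blanks, CTC-style
        if idx != BLANK and idx != previous:
            chars.append(index_map[idx])
        previous = idx
    return ''.join(chars)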
The following example is the same model in what's called the unrolled format:

The simple RNN is specified in Keras as follows:
model_0 = simple_rnn_model(input_dim=13) # The `input_dim` = 13 for MFCC features
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
the_input (InputLayer) (None, None, 13) 0
_________________________________________________________________
rnn (GRU) (None, None, 29) 3741
_________________________________________________________________
softmax (Activation) (None, None, 29) 0
=================================================================
Total params: 3,741
Trainable params: 3,741
Non-trainable params: 0
_________________________________________________________________
None
My acoustic models will train with CTC (Connectionist Temporal Classification) as their loss function. Note: using custom loss functions, like CTC, with Keras required some tinkering in 2020, but I don't know whether this is still true. Udacity helped me implement this criterion as add_ctc_loss in utils.py.
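For the curious, the usual pattern in that era of Keras was to wrap K.ctc_batch_cost in a Lambda layer and then compile against a dummy loss that simply passes the layer's output through. This is only a sketch of that general pattern, not the exact contents of add_ctc_loss (which also converts input lengths to account for any striding in the acoustic model).
from keras import backend as K
from keras.layers import Input, Lambda
from keras.models import Model

def ctc_lambda_func(args):
    y_pred, labels, input_length, label_length = args
    # K.ctc_batch_cost expects (labels, y_pred, input_length, label_length)
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)

def add_ctc_loss_sketch(acoustic_model):
    """Wrap an acoustic model so that its training output is the CTC loss itself."""
    labels = Input(name='the_labels', shape=(None,), dtype='float32')
    input_length = Input(name='input_length', shape=(1,), dtype='int64')
    label_length = Input(name='label_length', shape=(1,), dtype='int64')
    loss_out = Lambda(ctc_lambda_func, output_shape=(1,), name='ctc')(
        [acoustic_model.output, labels, input_length, label_length])
    return Model(inputs=[acoustic_model.input, labels, input_length, label_length],
                 outputs=loss_out)
The wrapped model is then compiled with something like loss={'ctc': lambda y_true, y_pred: y_pred}, since the Lambda layer has already computed the loss.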
Finally, training involves a number of optional arguments that can fine-tune the process:
minibatch_size - the size of the minibatches that are generated while training the model (default: 20).
spectrogram - Boolean value for whether spectrograms (True) or MFCCs (False) are used for training (default: True).
mfcc_dim - the size of the feature dimension to use when generating MFCC features (default: 13).
optimizer - the Keras optimizer used to train the model (default: SGD).
epochs - the number of epochs to use to train the model (default: 20).
verbose - controls the verbosity of the training output in the model.fit_generator method (default: 1).
sort_by_duration - Boolean value dictating whether the training and validation sets are sorted by (increasing) duration before the start of the first epoch (default: False).
For all hyperparameters, I chose what I determined to be some commonly accepted "best practice" or "proof of concept" defaults after some searching on Stack Overflow and Reddit.
I mentioned it before, but you might also notice input_dim=13 appearing frequently. This indicates
that I chose to use MFCCs over spectrograms, the latter of which are input_dim=161.
The following cell will train model_0:
train_model(input_to_softmax=model_0,
pickle_path='model_0.pickle',
save_model_path='model_0.h5',
spectrogram=False)
Epoch 1/20
101/101 [==========] 278s - loss: 844.6487 - val_loss: 756.6623
Epoch 2/20
101/101 [==========] 256s - loss: 779.3528 - val_loss: 762.4975
Epoch 3/20
101/101 [==========] 254s - loss: 779.1221 - val_loss: 754.2807
Epoch 4/20
101/101 [==========] 255s - loss: 779.4376 - val_loss: 760.7240
Epoch 5/20
101/101 [==========] 260s - loss: 779.1932 - val_loss: 754.1836
...
Epoch 16/20
101/101 [==========] 260s - loss: 779.4956 - val_loss: 764.2389
Epoch 17/20
101/101 [==========] 257s - loss: 779.2465 - val_loss: 758.1152
Epoch 18/20
101/101 [==========] 254s - loss: 779.3760 - val_loss: 750.4402
Epoch 19/20
101/101 [==========] 259s - loss: 779.3372 - val_loss: 764.4173
Epoch 20/20
101/101 [==========] 260s - loss: 779.0816 - val_loss: 753.9043
The primary changes for this model will be the inclusion of BatchNormalization and a TimeDistributed dense layer. Generally, Batch Normalization
refers to a collection of strategies that allow the network to train faster by safely using a higher
learning rate. The TimeDistributed layer applies the same dense transformation at every time step, which lets the network use that speedup
to find more complex relationships in the dataset.
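A minimal sketch of that stack – GRU, then BatchNormalization, then a TimeDistributed dense output – might look like the following; the real rnn_model lives in models.py, so treat this as illustrative rather than a copy of it.
from keras.layers import Input, GRU, BatchNormalization, TimeDistributed, Dense, Activation
from keras.models import Model

def rnn_model_sketch(input_dim=13, units=200, output_dim=29):
    input_data = Input(name='the_input', shape=(None, input_dim))
    # The recurrent layer returns its output at every time step
    rnn = GRU(units, activation='relu', return_sequences=True, name='rnn')(input_data)
    bn = BatchNormalization()(rnn)
    # The same dense layer is applied independently at each time step
    dense = TimeDistributed(Dense(output_dim))(bn)
    y_pred = Activation('softmax', name='softmax')(dense)
    return Model(inputs=input_data, outputs=y_pred)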
Here is the new model, rolled:

But I think the unrolled model illustrates this one much better:

The following cells will train model_1:
model_1 = rnn_model(input_dim=13,
units=200,
activation='relu')
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
the_input (InputLayer) (None, None, 13) 0
_________________________________________________________________
rnn (GRU) (None, None, 200) 128400
_________________________________________________________________
batch_normalization_1 (Batch (None, None, 200) 800
_________________________________________________________________
time_distributed_1 (TimeDist (None, None, 29) 5829
_________________________________________________________________
softmax (Activation) (None, None, 29) 0
=================================================================
Total params: 135,029
Trainable params: 134,629
Non-trainable params: 400
_________________________________________________________________
None
train_model(input_to_softmax=model_1,
pickle_path='model_1.pickle',
save_model_path='model_1.h5',
spectrogram=False)
Epoch 1/20
101/101 [==========] 261s - loss: 315.2228 - val_loss: 364.9819
Epoch 2/20
101/101 [==========] 260s - loss: 224.1788 - val_loss: 213.9861
Epoch 3/20
101/101 [==========] 256s - loss: 201.0221 - val_loss: 199.6651
Epoch 4/20
101/101 [==========] 251s - loss: 186.1088 - val_loss: 192.8514
Epoch 5/20
101/101 [==========] 260s - loss: 175.0228 - val_loss: 179.3166
...
Epoch 16/20
101/101 [==========] 259s - loss: 123.1043 - val_loss: 135.9445
Epoch 17/20
101/101 [==========] 259s - loss: 121.1695 - val_loss: 134.6261
Epoch 18/20
101/101 [==========] 261s - loss: 119.6366 - val_loss: 133.9350
Epoch 19/20
101/101 [==========] 261s - loss: 117.5602 - val_loss: 135.0312
Epoch 20/20
101/101 [==========] 262s - loss: 116.4232 - val_loss: 134.2130
Expanding on the previous model, this one includes a 1D Convolutional layer, which borrows a technique from computer vision to hopefully extract more useful feature maps: sliding a filter over the initial feature representation to enhance important features while muting others.
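Here's a quick sketch of just that convolutional front end (illustrative, not the exact cnn_rnn_model code): the length-11 filter slides along the time axis of the (time steps, MFCC features) input, and the stride of 2 roughly halves the number of time steps the RNN has to process.
from keras.layers import Input, Conv1D, BatchNormalization

input_data = Input(name='the_input', shape=(None, 13))  # (time steps, MFCC features)
conv = Conv1D(filters=200, kernel_size=11, strides=2, padding='same',
              activation='relu', name='conv1d')(input_data)
conv_bn = BatchNormalization(name='bn_conv_1d')(conv)
# conv_bn then feeds the recurrent layer exactly as in the models above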
The resulting model will look like this:

The following cells will train model_2:
model_2 = cnn_rnn_model(input_dim=13,
filters=200,
kernel_size=11,
conv_stride=2,
conv_border_mode='same',
units=200)
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
the_input (InputLayer) (None, None, 13) 0
_________________________________________________________________
conv1d (Conv1D) (None, None, 200) 28800
_________________________________________________________________
bn_conv_1d (BatchNormalizati (None, None, 200) 800
_________________________________________________________________
rnn (SimpleRNN) (None, None, 200) 80200
_________________________________________________________________
batch_normalization_2 (Batch (None, None, 200) 800
_________________________________________________________________
time_distributed_2 (TimeDist (None, None, 29) 5829
_________________________________________________________________
softmax (Activation) (None, None, 29) 0
=================================================================
Total params: 116,429
Trainable params: 115,629
Non-trainable params: 800
_________________________________________________________________
None
train_model(input_to_softmax=model_2,
pickle_path='model_2.pickle',
save_model_path='model_2.h5',
spectrogram=False)
Epoch 1/20
101/101 [==========] 114s - loss: 253.1224 - val_loss: 213.2807
Epoch 2/20
101/101 [==========] 111s - loss: 182.4856 - val_loss: 175.2272
Epoch 3/20
101/101 [==========] 109s - loss: 157.5329 - val_loss: 154.6968
Epoch 4/20
101/101 [==========] 112s - loss: 145.0075 - val_loss: 147.8111
Epoch 5/20
101/101 [==========] 111s - loss: 137.8364 - val_loss: 140.9015
...
Epoch 16/20
101/101 [==========] 110s - loss: 108.8807 - val_loss: 128.9220
Epoch 17/20
101/101 [==========] 109s - loss: 107.7632 - val_loss: 129.8091
Epoch 18/20
101/101 [==========] 109s - loss: 106.1927 - val_loss: 128.8591
Epoch 19/20
101/101 [==========] 110s - loss: 105.3344 - val_loss: 126.7256
Epoch 20/20
101/101 [==========] 111s - loss: 103.9792 - val_loss: 130.1807
Until now, the model used a single recurrent layer, but I want to see what happens if I adjust the model to accept a variable number of RNN layers. On one hand, this could simply be overkill that only serves to extend training time; on the other, maybe the depth provided by multiple RNNs will be what cracks this problem – I'll be looking for a strong performance uplift to justify the added layers.
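A variable-depth stack like that is easy to build with a loop; here's a rough sketch of the idea (not the exact deep_rnn_model code from models.py).
from keras.layers import Input, GRU, BatchNormalization, TimeDistributed, Dense, Activation
from keras.models import Model

def deep_rnn_sketch(input_dim=13, units=200, recur_layers=2, output_dim=29):
    input_data = Input(name='the_input', shape=(None, input_dim))
    x = input_data
    # Stack recur_layers GRU + BatchNormalization blocks
    for i in range(recur_layers):
        x = GRU(units, return_sequences=True, name='rnn{}'.format(i))(x)
        x = BatchNormalization()(x)
    x = TimeDistributed(Dense(output_dim))(x)
    y_pred = Activation('softmax', name='softmax')(x)
    return Model(inputs=input_data, outputs=y_pred)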
Since I'm worried about training time, I'll temporarily remove the CNN, and the model will look like this:

The following cells will train model_3:
model_3 = deep_rnn_model(input_dim=13,
units=200,
recur_layers=2)
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
the_input (InputLayer) (None, None, 13) 0
_________________________________________________________________
rnn0 (GRU) (None, None, 200) 128400
_________________________________________________________________
batch_normalization_3 (Batch (None, None, 200) 800
_________________________________________________________________
rnn1 (GRU) (None, None, 200) 240600
_________________________________________________________________
batch_normalization_4 (Batch (None, None, 200) 800
_________________________________________________________________
time_distributed_3 (TimeDist (None, None, 29) 5829
_________________________________________________________________
softmax (Activation) (None, None, 29) 0
=================================================================
Total params: 376,429
Trainable params: 375,629
Non-trainable params: 800
_________________________________________________________________
None
train_model(input_to_softmax=model_3,
pickle_path='model_3.pickle',
save_model_path='model_3.h5',
spectrogram=False)
Epoch 1/20
101/101 [==========] 420s - loss: 294.5375 - val_loss: 326.4501
Epoch 2/20
101/101 [==========] 426s - loss: 228.3170 - val_loss: 218.7167
Epoch 3/20
101/101 [==========] 424s - loss: 192.3747 - val_loss: 193.8009
Epoch 4/20
101/101 [==========] 427s - loss: 171.3817 - val_loss: 169.5791
Epoch 5/20
101/101 [==========] 426s - loss: 157.6423 - val_loss: 159.3954
...
Epoch 16/20
101/101 [==========] 428s - loss: 116.7533 - val_loss: 132.6016
Epoch 17/20
101/101 [==========] 423s - loss: 114.8657 - val_loss: 131.2636
Epoch 18/20
101/101 [==========] 427s - loss: 113.8633 - val_loss: 134.3393
Epoch 19/20
101/101 [==========] 422s - loss: 112.4764 - val_loss: 131.3335
Epoch 20/20
101/101 [==========] 423s - loss: 114.2621 - val_loss: 134.7672
Deeper RNNs provide more of the same benefit, but a Bidirectional layer provides something new:
the ability to see future contexts, which can make a difference when working with language, where quirks like
phrasal verbs or split clauses might otherwise confuse the model.
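On the Keras side this is just a wrapper around the recurrent layer; by default the forward and backward outputs are concatenated, which is why the summary below shows 400 features for 200 units. A minimal sketch (again, not the exact bidirectional_rnn_model code):
from keras.layers import Input, Bidirectional, GRU, TimeDistributed, Dense, Activation
from keras.models import Model

def bidirectional_rnn_sketch(input_dim=13, units=200, output_dim=29):
    input_data = Input(name='the_input', shape=(None, input_dim))
    # merge_mode defaults to 'concat', so the output has 2 * units features per time step
    bidir = Bidirectional(GRU(units, return_sequences=True))(input_data)
    dense = TimeDistributed(Dense(output_dim))(bidir)
    y_pred = Activation('softmax', name='softmax')(dense)
    return Model(inputs=input_data, outputs=y_pred)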
As you can see, this model looks a lot like the last one, but the RNN layers now feed in both directions:

The following cells will train model_4:
model_4 = bidirectional_rnn_model(input_dim=13,
units=200)
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
the_input (InputLayer) (None, None, 13) 0
_________________________________________________________________
bidirectional_1 (Bidirection (None, None, 400) 256800
_________________________________________________________________
time_distributed_4 (TimeDist (None, None, 29) 11629
_________________________________________________________________
softmax (Activation) (None, None, 29) 0
=================================================================
Total params: 268,429
Trainable params: 268,429
Non-trainable params: 0
_________________________________________________________________
None
train_model(input_to_softmax=model_4,
pickle_path='model_4.pickle',
save_model_path='model_4.h5',
spectrogram=False)
Epoch 1/20
101/101 [==========] 417s - loss: 286.0226 - val_loss: 214.4835
Epoch 2/20
101/101 [==========] 421s - loss: 211.8139 - val_loss: 196.9271
Epoch 3/20
101/101 [==========] 420s - loss: 199.4706 - val_loss: 190.5670
Epoch 4/20
101/101 [==========] 411s - loss: 191.0267 - val_loss: 181.3404
Epoch 5/20
101/101 [==========] 422s - loss: 182.7790 - val_loss: 178.2257
...
Epoch 16/20
101/101 [==========] 423s - loss: 131.3034 - val_loss: 139.9685
Epoch 17/20
101/101 [==========] 426s - loss: 128.0395 - val_loss: 141.6328
Epoch 18/20
101/101 [==========] 422s - loss: 125.3457 - val_loss: 137.0013
Epoch 19/20
101/101 [==========] 426s - loss: 122.5610 - val_loss: 137.2083
Epoch 20/20
101/101 [==========] 422s - loss: 119.9599 - val_loss: 135.8749
I didn't talk about the performance of each model, because I was waiting for this step. The following cell will plot the change in training and validation loss per epoch for each model, so we can see how each one performed. It may also be possible to see when models begin to overfit or when their gradients explode or vanish. Everything is better with graphs!
from glob import glob
import numpy as np
import _pickle as pickle
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set_style(style='white')
# Obtain saved models
all_pickles = sorted(glob("results/*.pickle"))
model_names = [item[8:-7] for item in all_pickles]
valid_loss = [pickle.load( open( i, "rb" ) )['val_loss'] for i in all_pickles]
train_loss = [pickle.load( open( i, "rb" ) )['loss'] for i in all_pickles]
num_epochs = [len(valid_loss[i]) for i in range(len(valid_loss))]
fig = plt.figure(figsize=(16,5))
# Plot training loss vs. epoch for each model
ax1 = fig.add_subplot(121)
for i in range(len(all_pickles)):
    ax1.plot(np.linspace(1, num_epochs[i], num_epochs[i]),
             train_loss[i], label=model_names[i])
# Clean up plot
ax1.legend()
ax1.set_xlim([1, max(num_epochs)])
plt.xlabel('Epoch')
plt.ylabel('Training Loss')
# Plot validation loss vs. epoch for each model
ax2 = fig.add_subplot(122)
for i in range(len(all_pickles)):
    ax2.plot(np.linspace(1, num_epochs[i], num_epochs[i]),
             valid_loss[i], label=model_names[i])
# Clean up plot
ax2.legend()
ax2.set_xlim([1, max(num_epochs)])
plt.xlabel('Epoch')
plt.ylabel('Validation Loss')
plt.show()
Obviously, TimeDistributed had a pretty substantial impact; otherwise, the models performed more or less the same. There does, however, appear to be a clear benefit to including the CNN for feature extraction – it reached a slightly lower validation loss in less than half the training time per epoch – so my final_model will certainly make use of that architecture.
My final model for this project can be found with the others in models.py and includes the Convolutional, BatchNormalization, Bidirectional, and TimeDistributed layers as they were shown in the steps above. It also includes Dropout layers per this research paper, which required some help from this repository to implement for recurrent layers. I hadn't included Dropout until now, because the training sessions were relatively short and overfitting wasn't as likely. I also experimented with adding MaxPool layers to the CNN, but my implementation actually performed substantially worse with Max Pooling for some reason.
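For reference, here's a rough sketch of a stack consistent with the summary printed below – Conv1D, BatchNormalization, two Bidirectional GRUs, Dropout, and a TimeDistributed output. The dropout rate and exactly where dropout is applied inside the GRUs are my assumptions; the real final_model is in models.py.
from keras.layers import (Input, Conv1D, BatchNormalization, Bidirectional, GRU,
                          Dropout, TimeDistributed, Dense, Activation)
from keras.models import Model

def final_model_sketch(input_dim=13, filters=200, kernel_size=11, conv_stride=2,
                       conv_border_mode='same', units=200, dropout_rate=0.3):
    input_data = Input(name='the_input', shape=(None, input_dim))
    x = Conv1D(filters, kernel_size, strides=conv_stride, padding=conv_border_mode,
               activation='relu', name='conv1d')(input_data)
    x = BatchNormalization()(x)
    # Two bidirectional GRU blocks, each with dropout on its inputs
    x = Bidirectional(GRU(units, return_sequences=True, dropout=dropout_rate))(x)
    x = BatchNormalization()(x)
    x = Bidirectional(GRU(units, return_sequences=True, dropout=dropout_rate))(x)
    x = BatchNormalization()(x)
    x = Dropout(dropout_rate)(x)
    x = TimeDistributed(Dense(29))(x)  # 28 characters + CTC blank
    y_pred = Activation('softmax', name='softmax')(x)
    return Model(inputs=input_data, outputs=y_pred)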
Whew, the following cells will train the final_model:
model_end = final_model(input_dim = 13,
filters = 200,
kernel_size = 11,
conv_stride = 2,
conv_border_mode='same',
units = 200)
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
the_input (InputLayer) (None, None, 13) 0
_________________________________________________________________
conv1d (Conv1D) (None, None, 200) 28800
_________________________________________________________________
batch_normalization_5 (Batch (None, None, 200) 800
_________________________________________________________________
bidirectional_2 (Bidirection (None, None, 400) 481200
_________________________________________________________________
batch_normalization_6 (Batch (None, None, 400) 1600
_________________________________________________________________
bidirectional_3 (Bidirection (None, None, 400) 721200
_________________________________________________________________
batch_normalization_7 (Batch (None, None, 400) 1600
_________________________________________________________________
dropout_2 (Dropout) (None, None, 400) 0
_________________________________________________________________
time_distributed_5 (TimeDist (None, None, 29) 11629
_________________________________________________________________
softmax (Activation) (None, None, 29) 0
=================================================================
Total params: 1,246,829
Trainable params: 1,244,829
Non-trainable params: 2,000
_________________________________________________________________
None
train_model(input_to_softmax=model_end,
pickle_path='model_end.pickle',
save_model_path='model_end.h5',
spectrogram=False)
Epoch 1/20
101/101 [==========] 413s - loss: 261.3609 - val_loss: 223.9081
Epoch 2/20
101/101 [==========] 413s - loss: 196.2226 - val_loss: 181.6405
Epoch 3/20
101/101 [==========] 414s - loss: 166.5231 - val_loss: 149.9131
Epoch 4/20
101/101 [==========] 414s - loss: 149.5870 - val_loss: 142.6551
Epoch 5/20
101/101 [==========] 415s - loss: 136.9037 - val_loss: 128.4056
...
Epoch 16/20
101/101 [==========] 412s - loss: 73.4453 - val_loss: 113.9058
Epoch 17/20
101/101 [==========] 409s - loss: 69.8473 - val_loss: 113.9859
Epoch 18/20
101/101 [==========] 408s - loss: 66.1584 - val_loss: 114.8722
Epoch 19/20
101/101 [==========] 409s - loss: 63.2991 - val_loss: 116.1796
Epoch 20/20
101/101 [==========] 409s - loss: 60.3402 - val_loss: 118.9479
We've reached the best part: watching a computer almost transcribe human speech. Yeah, unfortunately, this project was never destined to be a perfect speech-to-text device. That would require substantial training time on expensive cloud computing hardware, and Udacity provided only so much time and so much GPU power.
But let's see how close this model comes! The following will retrieve an audio sample, run it through the network, and print the outcome:
import numpy as np
from data_generator import AudioGenerator
from keras import backend as K
from utils import int_sequence_to_text
from IPython.display import Audio
def get_predictions(index, partition, input_to_softmax, model_path):
    """ Print a model's decoded predictions
    Params:
        index (int): The example you would like to visualize
        partition (str): One of 'train' or 'validation'
        input_to_softmax (Model): The acoustic model
        model_path (str): Path to saved acoustic model's weights
    """
    # Load train and test data
    data_gen = AudioGenerator()
    data_gen.load_train_data()
    data_gen.load_validation_data()
    # Obtain true transcription and audio features
    if partition == 'validation':
        transcr = data_gen.valid_texts[index]
        audio_path = data_gen.valid_audio_paths[index]
        data_point = data_gen.normalize(data_gen.featurize(audio_path))
    elif partition == 'train':
        transcr = data_gen.train_texts[index]
        audio_path = data_gen.train_audio_paths[index]
        data_point = data_gen.normalize(data_gen.featurize(audio_path))
    else:
        raise Exception('Invalid partition! Must be "train" or "validation"')
    # Obtain and decode acoustic model predictions
    input_to_softmax.load_weights(model_path)
    prediction = input_to_softmax.predict(np.expand_dims(data_point, axis=0))
    output_length = [input_to_softmax.output_length(data_point.shape[0])]
    pred_ints = (K.eval(K.ctc_decode(
        prediction, output_length)[0][0])+1).flatten().tolist()
    # Play audio file, and display true and predicted transcriptions
    print('-'*80)
    Audio(audio_path)
    print('True transcription:\n' + '\n' + transcr)
    print('-'*80)
    print('Predicted transcription:\n' + '\n' + ''.join(int_sequence_to_text(pred_ints)))
    print('-'*80)
get_predictions(index=0,
partition='train',
input_to_softmax=final_model(input_dim = 13,
filters = 200,
kernel_size = 11,
conv_stride = 2,
conv_border_mode='same',
units = 200),
model_path="./results/model_end.h5")
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
the_input (InputLayer) (None, None, 13) 0
_________________________________________________________________
conv1d (Conv1D) (None, None, 200) 28800
_________________________________________________________________
batch_normalization_17 (Batc (None, None, 200) 800
_________________________________________________________________
bidirectional_10 (Bidirectio (None, None, 400) 481200
_________________________________________________________________
batch_normalization_18 (Batc (None, None, 400) 1600
_________________________________________________________________
bidirectional_11 (Bidirectio (None, None, 400) 721200
_________________________________________________________________
batch_normalization_19 (Batc (None, None, 400) 1600
_________________________________________________________________
dropout_10 (Dropout) (None, None, 400) 0
_________________________________________________________________
time_distributed_9 (TimeDist (None, None, 29) 11629
_________________________________________________________________
softmax (Activation) (None, None, 29) 0
=================================================================
Total params: 1,246,829
Trainable params: 1,244,829
Non-trainable params: 2,000
_________________________________________________________________
None
True transcription:
her father is a most remarkable person to say the least
Predicted transcription:
her father is a most r markcabe persont to sey the least
That's definitely not perfect. But I do understand exactly what was said. What do you think?
get_predictions(index=100,
partition='validation',
input_to_softmax=final_model(input_dim = 13,
filters = 200,
kernel_size = 11,
conv_stride = 2,
conv_border_mode='same',
units = 200),
model_path="./results/model_end.h5")
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
the_input (InputLayer) (None, None, 13) 0
_________________________________________________________________
conv1d (Conv1D) (None, None, 200) 28800
_________________________________________________________________
batch_normalization_44 (Batc (None, None, 200) 800
_________________________________________________________________
bidirectional_28 (Bidirectio (None, None, 400) 481200
_________________________________________________________________
batch_normalization_45 (Batc (None, None, 400) 1600
_________________________________________________________________
bidirectional_29 (Bidirectio (None, None, 400) 721200
_________________________________________________________________
batch_normalization_46 (Batc (None, None, 400) 1600
_________________________________________________________________
dropout_28 (Dropout) (None, None, 400) 0
_________________________________________________________________
time_distributed_18 (TimeDis (None, None, 29) 11629
_________________________________________________________________
softmax (Activation) (None, None, 29) 0
=================================================================
Total params: 1,246,829
Trainable params: 1,244,829
Non-trainable params: 2,000
_________________________________________________________________
None
True transcription:
i was absent rather more than an hour
Predicted transcription:
i was apsen other morthen an hour
This one might be slightly worse. But I find it interesting that the model struggles most with linking (in speech terms) or spacing (in text terms), because that's also the most difficult part of language acquisition and comprehension for non-native English speakers.
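To put a number on impressions like that, one option (not part of the original notebook) is a quick character error rate based on Levenshtein distance:
def levenshtein(a, b):
    """Minimum number of single-character edits that turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,               # deletion
                               current[j - 1] + 1,            # insertion
                               previous[j - 1] + (ca != cb))) # substitution
        previous = current
    return previous[-1]

truth = 'i was absent rather more than an hour'
guess = 'i was apsen other morthen an hour'
print('Character error rate: {:.2f}'.format(levenshtein(truth, guess) / len(truth)))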
Clearly, there remains work to be done. But I'm honestly pretty proud of what was accomplished with relatively limited resources, which nicely highlights the power of machine learning. Given a dedicated, pre-trained language model and/or 4-6 weeks of low-learning-rate training on Amazon or Google Cloud servers, I think my architecture might just hold up!
Thanks for reading!