This project was part of my Natural Language Processing Nanodegree, which I completed in late 2020. This particular Nanodegree – in fact, this particular project – had been my goal throughout my studies of machine learning. I was so excited to work on it back then, and I'm still excited to share the work with you now. Machine translation has a long and fascinating history spanning many different approaches, culminating in the widespread commercial adoption of Neural Machine Translation (NMT) around 2016. The following NMT pipeline, which I built with Keras on a TensorFlow backend, reflects some of the state-of-the-art practices of that era – though it was already somewhat dated when I built it in 2020, thanks largely to Google Brain's attention-based Transformer architecture.
This notebook includes three main sections: preprocessing the data, building and comparing several model architectures, and generating final predictions.
Let's get started with a whole bunch of workspace helpers and imports:
%load_ext autoreload
%aimport helper, tests
%autoreload 1
import collections
import helper
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model, Sequential
from keras.layers import GRU, Input, Dense, TimeDistributed, Activation, RepeatVector, Bidirectional, Dropout
from keras.layers.embeddings import Embedding
from keras.optimizers import Adam
from keras.losses import sparse_categorical_crossentropy
Using TensorFlow backend.
Ain't nobody got time for training networks on CPU, so this cell simply confirms that the running workspace has access to a GPU, whether through a Udacity Workspace, Amazon Web Services, Google Cloud Platform, or an onboard device. As you can see below, this notebook did:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
[name: "/cpu:0" device_type: "CPU" memory_limit: 268435456 locality { } incarnation: 8195025083738673623 , name: "/gpu:0" device_type: "GPU" memory_limit: 357564416 locality { bus_id: 1 } incarnation: 12650583625369465834 physical_device_desc: "device: 0, name: Tesla K80, pci bus id: 0000:00:04.0" ]
Gotta start with the data!
Language datasets are some of the oldest, largest, and best-maintained datasets available to data science, and the most commonly used translation sets are those from WMT. However, these sets are enormous, so Udacity provided truncated versions with small vocabularies that can train simple networks much faster. These files, for English and French, are located in the data directory and are loaded below using the provided helper.py package:
# Load English data
english_sentences = helper.load_data('data/small_vocab_en')
# Load French data
french_sentences = helper.load_data('data/small_vocab_fr')
print('Dataset Loaded')
Dataset Loaded
Each index of small_vocab_en and small_vocab_fr contains the same sentence in its respective language.
The following simply prints the first two pairs:
for sample_i in range(2):
print('small_vocab_en Line {}: {}'.format(sample_i + 1, english_sentences[sample_i]))
print('small_vocab_fr Line {}: {}'.format(sample_i + 1, french_sentences[sample_i]))
... Line 1: new jersey is sometimes quiet during autumn , and it is snowy in april . ... Line 1: new jersey est parfois calme pendant l' automne , et il est neigeux en avril . ... Line 2: the united states is usually chilly during july , and it is usually freezing in november . ... Line 2: les états-unis est généralement froid en juillet , et il gèle habituellement en novembre .
Obviously, this data has already undergone some preprocessing: everything is lowercase and the punctuation is delimited with spaces. That isn't surprising, since these samples come from established research datasets, but cleaning and normalizing the text would otherwise have been Steps 1 and 2 of this pipeline.
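Just for illustration, here's a minimal sketch of what that kind of normalization might look like if the text weren't already cleaned – lowercasing and splitting punctuation off with spaces. The normalize helper below is hypothetical and isn't part of the project; the provided data already arrives in this form:
import re
def normalize(text):
    """Lowercase text and pad punctuation with spaces (hypothetical Steps 1 and 2)."""
    text = text.lower()
    text = re.sub(r"([.,!?])", r" \1 ", text)  # make each punctuation mark its own token
    return re.sub(r"\s+", " ", text).strip()   # collapse any doubled-up whitespace
print(normalize("New Jersey is sometimes quiet during autumn, and it is snowy in April."))
# new jersey is sometimes quiet during autumn , and it is snowy in april .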
In this instance, "complexity" refers to the size of the vocabulary – the number of unique words the dataset contains. You can probably intuit that more "complex" problems require more complex solutions, so the following will provide some insight into the complexity of what Udacity selected:
english_words_counter = collections.Counter([word for sentence in english_sentences for word in sentence.split()])
french_words_counter = collections.Counter([word for sentence in french_sentences for word in sentence.split()])
print('{} English words.'.format(len([word for sentence in english_sentences for word in sentence.split()])))
print('{} unique English words.'.format(len(english_words_counter)))
print('10 Most common words in the English dataset:')
print('"' + '" "'.join(list(zip(*english_words_counter.most_common(10)))[0]) + '"')
print()
print('{} French words.'.format(len([word for sentence in french_sentences for word in sentence.split()])))
print('{} unique French words.'.format(len(french_words_counter)))
print('10 Most common words in the French dataset:')
print('"' + '" "'.join(list(zip(*french_words_counter.most_common(10)))[0]) + '"')
1823250 English words. 227 unique English words. 10 Most common words in the English dataset: "is" "," "." "in" "it" "during" "the" "but" "and" "sometimes" 1961295 French words. 355 unique French words. 10 Most common words in the French dataset: "est" "." "," "en" "il" "les" "mais" "et" "la" "parfois"
For comparison, Lewis Carroll's Alice's Adventures in Wonderland has 15,500 total words and 2,766 unique words.
So, there isn't that much complexity to this dataset.
There are many steps involved in assembling a computer vision pipeline that a natural language processing pipeline can thankfully skip. However, there's one significant difference that must be addressed: unlike image data, language data isn't already numerical. Networks can't perform massive matrix maths on letters.
That's where tokenization comes in. Tokenizing can occur at the character level, but for this application I'll tokenize at the word level, which creates a library of word IDs in which each ID represents one word. Fortunately, this process is very easy with the Keras Tokenizer object. I'll also print the outcome as an example:
def tokenize(x):
"""
Tokenize x
:param x: List of sentences/strings to be tokenized
:return: Tuple of (tokenized x data, tokenizer used to tokenize x)
"""
x_tk = Tokenizer(char_level = False)
x_tk.fit_on_texts(x)
return x_tk.texts_to_sequences(x), x_tk
# Test function and print results
text_sentences = [
'The quick brown fox jumps over the lazy dog .',
'By Jove , my quick study of lexicography won a prize .',
'This is a short sentence .']
text_tokenized, text_tokenizer = tokenize(text_sentences)
print(text_tokenizer.word_index)
print()
for sample_i, (sent, token_sent) in enumerate(zip(text_sentences, text_tokenized)):
print('Sequence {} in x'.format(sample_i + 1))
print(' Input: {}'.format(sent))
print(' Output: {}'.format(token_sent))
{'the': 1, 'quick': 2, 'a': 3, 'brown': 4, 'fox': 5, 'jumps': 6, 'over': 7, 'lazy': 8, 'dog': 9, 'by': 10, 'jove': 11, 'my': 12, 'study': 13, 'of': 14, 'lexicography': 15, 'won': 16, 'prize': 17, 'this': 18, 'is': 19, 'short': 20, 'sentence': 21} Sequence 1 in x Input: The quick brown fox jumps over the lazy dog . Output: [1, 2, 4, 5, 6, 7, 1, 8, 9] Sequence 2 in x Input: By Jove , my quick study of lexicography won a prize . Output: [10, 11, 12, 2, 13, 14, 15, 16, 3, 17] Sequence 3 in x Input: This is a short sentence . Output: [18, 19, 3, 20, 21]
As you can see, the Tokenizer assigns IDs by word frequency – the most common words get the lowest IDs – breaking ties by order of first appearance.
The network will expect every batch of word ID sequences (an abstract way of saying "batch of sentences") to be the same length, but that doesn't naturally occur in either dimension: length varies between different sentences within each language and between the same sentence in different languages. Since sentences/sequences are fully dynamic in length, padding must be added to the end of each sequence to make them all as long as the longest sample in the batch.
Keras provides another function, pad_sequences, for just this purpose:
def pad(x, length=None):
"""
Pad x
:param x: List of sequences.
:param length: Length to pad the sequence to. If None, use length of longest sequence in x.
:return: Padded numpy array of sequences
"""
if length is None:
length = max([len(sentence) for sentence in x])
return pad_sequences(x, maxlen = length, padding = "post")
# Test function and print results
test_pad = pad(text_tokenized)
for sample_i, (token_sent, pad_sent) in enumerate(zip(text_tokenized, test_pad)):
print('Sequence {} in x'.format(sample_i + 1))
print(' Input: {}'.format(np.array(token_sent)))
print(' Output: {}'.format(pad_sent))
Sequence 1 in x Input: [1 2 4 5 6 7 1 8 9] Output: [1 2 4 5 6 7 1 8 9 0] Sequence 2 in x Input: [10 11 12 2 13 14 15 16 3 17] Output: [10 11 12 2 13 14 15 16 3 17] Sequence 3 in x Input: [18 19 3 20 21] Output: [18 19 3 20 21 0 0 0 0 0]
Here's the full preprocessing pipeline, which includes the above tokenize and pad functions, plus a .reshape() of the labels to accommodate how Keras implements sparse_categorical_crossentropy, the loss function I've chosen for this project (it expects the labels in 3D). Finally, the vocabulary sizes must be increased by 1 to account for the new <PAD> token, which gets ID 0 and therefore isn't counted in the tokenizer's word_index – this dumb thing had me stumped for a while.
def preprocess(x, y):
"""
Preprocess x and y
:param x: Feature List of sentences
:param y: Label List of sentences
:return: Tuple of (Preprocessed x, Preprocessed y, x tokenizer, y tokenizer)
"""
preprocess_x, x_tk = tokenize(x)
preprocess_y, y_tk = tokenize(y)
preprocess_x = pad(preprocess_x)
preprocess_y = pad(preprocess_y)
# Loss function requires labels to be in 3D
preprocess_y = preprocess_y.reshape(*preprocess_y.shape, 1)
return preprocess_x, preprocess_y, x_tk, y_tk
preproc_english_sentences, preproc_french_sentences, english_tokenizer, french_tokenizer =\
preprocess(english_sentences, french_sentences)
max_english_sequence_length = preproc_english_sentences.shape[1]
max_french_sequence_length = preproc_french_sentences.shape[1]
# Add 1 for <PAD> token
english_vocab_size = len(english_tokenizer.word_index) + 1
french_vocab_size = len(french_tokenizer.word_index) + 1
print('Data Preprocessed')
print("Max English sentence length:", max_english_sequence_length)
print("Max French sentence length:", max_french_sequence_length)
print("English vocabulary size:", english_vocab_size)
print("French vocabulary size:", french_vocab_size)
Data Preprocessed Max English sentence length: 15 Max French sentence length: 21 English vocabulary size: 200 French vocabulary size: 345
And that's all for data preprocessing!
This section showcases some experimentation with neural network architectures. From the start, I was pretty certain that the final architecture would use all of the tested features; really, I was mostly just curious how much of an impact each would have on performance.
Here are the four architectures that will be shown in this section:
1. A simple RNN
2. An RNN with word embeddings
3. A bidirectional RNN
4. A final model that combines embeddings, a bidirectional RNN, and dropout
But, first, there's an issue with what all of these models will output...
Everything that was done to preprocess the data was done to help the network handle it. But, regardless of the architecture, every model's raw output – a sequence of probability distributions over word IDs – must be converted back into sentences that humans can understand. That's what the following logits_to_text function does:
Note: the logits here are the network's output scores over the French vocabulary at each position in the sequence; taking the argmax at each position gives the highest-probability word ID.
def logits_to_text(logits, tokenizer):
"""
Turn logits from a neural network into text using the tokenizer
:param logits: Logits from a neural network
:param tokenizer: Keras Tokenizer fit on the labels
:return: String that represents the text of the logits
"""
index_to_words = {id: word for word, id in tokenizer.word_index.items()}
index_to_words[0] = '<PAD>'
return ' '.join([index_to_words[prediction] for prediction in np.argmax(logits, 1)])
print('`logits_to_text` function loaded.')
`logits_to_text` function loaded.
It feels pedantic to say "a simple RNN" – y'know, just your run-of-the-mill Recurrent Neural Network. There isn't anything simple about RNNs, which I used previously in my Image Captioner project. What RNNs added that previous neural networks lacked is memory between steps. As you can see in the following diagram, each step passes information both out of the network and forward to the next step, which allows the network to handle sequential data, like language, where subsequent outputs are determined as much by previous outputs as they are by current inputs.
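To make "memory between steps" a little more concrete, here's a toy sketch of the recurrence in plain NumPy – a vanilla RNN cell with made-up weights, not the GRU layers used below – showing how each step's output depends on both the current input and the hidden state carried over from the previous step:
# Toy vanilla RNN cell (illustration only; the real models below use Keras GRU layers)
hidden_size, input_size = 4, 3
W_x = np.random.randn(hidden_size, input_size) * 0.1   # input-to-hidden weights
W_h = np.random.randn(hidden_size, hidden_size) * 0.1  # hidden-to-hidden weights (the "memory")
h = np.zeros(hidden_size)                               # initial hidden state
for t, x_t in enumerate(np.random.randn(5, input_size)):  # a sequence of 5 fake inputs
    h = np.tanh(W_x @ x_t + W_h @ h)  # new state mixes the current input with the previous state
    print('step', t, 'hidden state:', np.round(h, 2))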
But this project will build upon this foundation with some new twists; so, for this notebook, I'll start with a simple RNN:
def simple_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
"""
Build and train a basic RNN on x and y
:param input_shape: Tuple of input shape
:param output_sequence_length: Length of output sequence
:param english_vocab_size: Number of unique English words in the dataset
:param french_vocab_size: Number of unique French words in the dataset
:return: Keras model built, but not trained
"""
learning_rate = 0.01
input_seq = Input(input_shape[1:])
rnn = GRU(256, return_sequences = True)(input_seq)
logits = TimeDistributed(Dense(french_vocab_size))(rnn)
model = Model(input_seq, Activation("softmax")(logits))
model.compile(loss = sparse_categorical_crossentropy,
optimizer = Adam(learning_rate),
metrics = ['accuracy'])
return model
# Reshape input to work with base Keras RNN
tmp_x = pad(preproc_english_sentences, max_french_sequence_length)
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2], 1))
# Train network
simple_rnn_model = simple_model(
tmp_x.shape,
max_french_sequence_length,
english_vocab_size,
french_vocab_size)
simple_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, epochs=10, validation_split=0.2)
# Print prediction(s)
print(logits_to_text(simple_rnn_model.predict(tmp_x[:1])[0], french_tokenizer))
Train on 110288 samples, validate on 27573 samples Epoch 1/10 110k/110k [==========] - 13s 118us/step - loss: 1.6598 - acc: 0.5866 - val_loss: 1.1817 - val_acc: 0.6496 Epoch 2/10 110k/110k [==========] - 11s 99us/step - loss: 1.0830 - acc: 0.6646 - val_loss: 1.0070 - val_acc: 0.6741 Epoch 3/10 110k/110k [==========] - 11s 101us/step - loss: 0.9677 - acc: 0.6830 - val_loss: 0.9412 - val_acc: 0.6812 Epoch 4/10 110k/110k [==========] - 11s 100us/step - loss: 0.8949 - acc: 0.6983 - val_loss: 0.8732 - val_acc: 0.7052 Epoch 5/10 110k/110k [==========] - 11s 100us/step - loss: 0.8490 - acc: 0.7108 - val_loss: 0.8361 - val_acc: 0.7141 Epoch 6/10 110k/110k [==========] - 11s 101us/step - loss: 0.8062 - acc: 0.7235 - val_loss: 0.8042 - val_acc: 0.7195 Epoch 7/10 110k/110k [==========] - 11s 100us/step - loss: 0.7834 - acc: 0.7300 - val_loss: 0.7361 - val_acc: 0.7583 Epoch 8/10 110k/110k [==========] - 11s 100us/step - loss: 0.7515 - acc: 0.7456 - val_loss: 0.6995 - val_acc: 0.7733 Epoch 9/10 110k/110k [==========] - 11s 100us/step - loss: 0.6924 - acc: 0.7684 - val_loss: 0.6767 - val_acc: 0.7760 Epoch 10/10 110k/110k [==========] - 11s 100us/step - loss: 0.6928 - acc: 0.7638 - val_loss: 0.6751 - val_acc: 0.7690 new jersey est parfois chaud en l' de l' est il en avril <PAD> ... <PAD>
Well, that's an actual sentence... with essentially the opposite of the intended meaning.
Word IDs are a pretty basic way to represent a word for the network; there's a better way: word embeddings. Unlike word IDs, which represent words as integers, word embeddings represent words as vectors in n-dimensional space – a big cloud of words in which similar words can cluster closer to each other. Word embeddings can help the network understand nuances in language, like how hot can be closer to cold in one dimension and closer to sexy in another. In the example below, you can see the word the – with the word ID 8 – being embedded as the vector [0.2, 4, 2.4, 1.1, ...], which continues for n dimensions.
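As a rough mental model – and this is just a sketch with made-up numbers, not the project's actual learned embeddings – an Embedding layer is little more than a lookup table: a matrix with one row of n learned values per word ID:
# Hypothetical embedding table: a vocabulary of 10 words, n = 4 dimensions
embedding_matrix = np.random.randn(10, 4)
word_ids = [8, 2, 5]                  # a "sentence" of word IDs
vectors = embedding_matrix[word_ids]  # an Embedding layer is essentially this row lookup
print(vectors.shape)                  # (3, 4): one n-dimensional vector per word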
The following uses a Keras Embedding layer with n set to 256:
def embed_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
"""
Build and train a RNN model using word embedding on x and y
:param input_shape: Tuple of input shape
:param output_sequence_length: Length of output sequence
:param english_vocab_size: Number of unique English words in the dataset
:param french_vocab_size: Number of unique French words in the dataset
:return: Keras model built, but not trained
"""
learning_rate = 0.01
model = Sequential()
model.add(Embedding(english_vocab_size, 256, input_length = output_sequence_length))
model.add(GRU(256, return_sequences = True))
model.add(TimeDistributed(Dense(french_vocab_size, activation = "softmax")))
model.compile(loss = sparse_categorical_crossentropy,
optimizer = Adam(learning_rate),
metrics = ['accuracy'])
return model
# Reshape input
tmp_x = pad(preproc_english_sentences, max_french_sequence_length)
# Train network
embed_rnn_model = embed_model(
tmp_x.shape,
max_french_sequence_length,
english_vocab_size,
french_vocab_size)
embed_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, epochs=10, validation_split=0.2)
# Print prediction(s)
print(logits_to_text(embed_rnn_model.predict(tmp_x[:1])[0], french_tokenizer))
Train on 110288 samples, validate on 27573 samples Epoch 1/10 110k/110k [==========] - 14s 123us/step - loss: 1.3673 - acc: 0.7013 - val_loss: 0.4086 - val_acc: 0.8722 Epoch 2/10 110k/110k [==========] - 13s 120us/step - loss: 0.3158 - acc: 0.8982 - val_loss: 0.2705 - val_acc: 0.9118 Epoch 3/10 110k/110k [==========] - 13s 120us/step - loss: 0.2431 - acc: 0.9190 - val_loss: 0.2315 - val_acc: 0.9227 Epoch 4/10 110k/110k [==========] - 13s 120us/step - loss: 0.2142 - acc: 0.9268 - val_loss: 0.2139 - val_acc: 0.9277 Epoch 5/10 110k/110k [==========] - 13s 120us/step - loss: 0.2022 - acc: 0.9299 - val_loss: 0.2071 - val_acc: 0.9285 Epoch 6/10 110k/110k [==========] - 13s 120us/step - loss: 0.1968 - acc: 0.9313 - val_loss: 0.2080 - val_acc: 0.9292 Epoch 7/10 110k/110k [==========] - 13s 120us/step - loss: 0.1939 - acc: 0.9321 - val_loss: 0.2000 - val_acc: 0.9301 Epoch 8/10 110k/110k [==========] - 13s 120us/step - loss: 0.1929 - acc: 0.9326 - val_loss: 0.2032 - val_acc: 0.9305 Epoch 9/10 110k/110k [==========] - 13s 119us/step - loss: 0.1917 - acc: 0.9328 - val_loss: 0.2039 - val_acc: 0.9299 Epoch 10/10 110k/110k [==========] - 13s 120us/step - loss: 0.1923 - acc: 0.9325 - val_loss: 0.2087 - val_acc: 0.9290 new jersey est parfois calme en l' automne et il est neigeux en avril <PAD> ... <PAD>
Now that's pretty good!
An RNN allows the model to handle sequential data, like language; but a bidirectional RNN allows the model to handle language better. That's because a bidirectional RNN can also see future inputs! That might not be necessary for rote and inflexible sentence structures, but most instances of English will feature split, subordinate, or conditional clauses, phrasal verb tenses, or prepositional phrases – these can cause all manner of unusual splices and inversions of sentence structure. And that's just English – I have no idea what linguistic chicanery French gets up to!
This time, the model features a Keras Bidirectional layer:
def bd_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
"""
Build and train a bidirectional RNN model on x and y
:param input_shape: Tuple of input shape
:param output_sequence_length: Length of output sequence
:param english_vocab_size: Number of unique English words in the dataset
:param french_vocab_size: Number of unique French words in the dataset
:return: Keras model built, but not trained
"""
learning_rate = 0.001
model = Sequential()
model.add(Bidirectional(GRU(256, return_sequences = True), input_shape = input_shape[1:]))
model.add(TimeDistributed(Dense(french_vocab_size, activation = "softmax")))
model.compile(loss = sparse_categorical_crossentropy,
optimizer = Adam(learning_rate),
metrics = ['accuracy'])
return model
# Train network
tmp_x = pad(preproc_english_sentences, max_french_sequence_length)
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2], 1))
bd_rnn_model = bd_model(
tmp_x.shape,
max_french_sequence_length,
english_vocab_size,
french_vocab_size)
bd_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, epochs=10, validation_split=0.2)
# Print prediction(s)
print(logits_to_text(bd_rnn_model.predict(tmp_x[:1])[0], french_tokenizer))
Train on 110288 samples, validate on 27573 samples Epoch 1/10 110k/110k [==========] - 18s 165us/step - loss: 2.1304 - acc: 0.5489 - val_loss: 1.4903 - val_acc: 0.6112 Epoch 2/10 110k/110k [==========] - 18s 159us/step - loss: 1.3656 - acc: 0.6257 - val_loss: 1.2708 - val_acc: 0.6424 Epoch 3/10 110k/110k [==========] - 18s 159us/step - loss: 1.2151 - acc: 0.6505 - val_loss: 1.1648 - val_acc: 0.6636 Epoch 4/10 110k/110k [==========] - 18s 160us/step - loss: 1.1233 - acc: 0.6713 - val_loss: 1.0811 - val_acc: 0.6800 Epoch 5/10 110k/110k [==========] - 18s 160us/step - loss: 1.0499 - acc: 0.6846 - val_loss: 1.0170 - val_acc: 0.6919 Epoch 6/10 110k/110k [==========] - 18s 159us/step - loss: 0.9913 - acc: 0.6938 - val_loss: 0.9668 - val_acc: 0.6986 Epoch 7/10 110k/110k [==========] - 18s 159us/step - loss: 0.9465 - acc: 0.7005 - val_loss: 0.9258 - val_acc: 0.7059 Epoch 8/10 110k/110k [==========] - 18s 159us/step - loss: 0.9087 - acc: 0.7067 - val_loss: 0.8895 - val_acc: 0.7108 Epoch 9/10 110k/110k [==========] - 18s 159us/step - loss: 0.8744 - acc: 0.7128 - val_loss: 0.8581 - val_acc: 0.7167 Epoch 10/10 110k/110k [==========] - 18s 159us/step - loss: 0.8452 - acc: 0.7185 - val_loss: 0.8360 - val_acc: 0.7205 new jersey est parfois calme en mois et il il il en en en <PAD> ... <PAD>
Uh-oh, that's somehow worse... Oh, of course! The Bidirectional layer doubles the recurrent work, and I also dropped the learning rate from 0.01 to 0.001, so this model simply needs more training time to converge.
At this point, the architecture is becoming a little complicated, and its training needs are becoming a little less reasonable. But you know I'm still gonna mash the three previous approaches together with some Dropout and see what happens. Clearly, the Embedding layer had by far the most significant impact; however, I'm curious whether the Bidirectional layer will perform better on word embeddings. The following model begs for more training time, but I gave this one the same 10 epochs that the previous models had.
Here's the final model:
def model_final(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
"""
Build and train a model that combines word embeddings and a bidirectional RNN on x and y
:param input_shape: Tuple of input shape
:param output_sequence_length: Length of output sequence
:param english_vocab_size: Number of unique English words in the dataset
:param french_vocab_size: Number of unique French words in the dataset
:return: Keras model built, but not trained
"""
learning_rate = 0.001
model = Sequential()
model.add(Embedding(english_vocab_size, 256,
input_length = output_sequence_length,
input_shape = input_shape[1:]))
model.add(Bidirectional(GRU(256, return_sequences = True)))
model.add(Dropout(0.5))
model.add(TimeDistributed(Dense(french_vocab_size, activation = "softmax")))
model.compile(loss = sparse_categorical_crossentropy,
optimizer = Adam(learning_rate),
metrics = ['accuracy'])
return model
print('Final Model Loaded')
# Train network
tmp_x = pad(preproc_english_sentences, preproc_french_sentences.shape[1])
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2]))
final_rnn_model = model_final(
tmp_x.shape,
max_french_sequence_length,
english_vocab_size,
french_vocab_size)
final_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, epochs=10, validation_split=0.2)
# Print prediction(s)
print(logits_to_text(final_rnn_model.predict(tmp_x[:1])[0], french_tokenizer))
Final Model Loaded Train on 110288 samples, validate on 27573 samples Epoch 1/10 110k/110k [==========] - 25s 228us/step - loss: 2.6778 - acc: 0.4926 - val_loss: 1.5954 - val_acc: 0.6197 Epoch 2/10 110k/110k [==========] - 24s 221us/step - loss: 1.2212 - acc: 0.6969 - val_loss: 0.8890 - val_acc: 0.7729 Epoch 3/10 110k/110k [==========] - 24s 222us/step - loss: 0.7875 - acc: 0.7865 - val_loss: 0.6014 - val_acc: 0.8299 Epoch 4/10 110k/110k [==========] - 24s 222us/step - loss: 0.5741 - acc: 0.8341 - val_loss: 0.4452 - val_acc: 0.8688 Epoch 5/10 110k/110k [==========] - 24s 222us/step - loss: 0.4457 - acc: 0.8671 - val_loss: 0.3665 - val_acc: 0.8900 Epoch 6/10 110k/110k [==========] - 24s 222us/step - loss: 0.3657 - acc: 0.8900 - val_loss: 0.2882 - val_acc: 0.9145 Epoch 7/10 110k/110k [==========] - 24s 221us/step - loss: 0.3122 - acc: 0.9057 - val_loss: 0.2441 - val_acc: 0.9273 Epoch 8/10 110k/110k [==========] - 25s 222us/step - loss: 0.2725 - acc: 0.9177 - val_loss: 0.2136 - val_acc: 0.9358 Epoch 9/10 110k/110k [==========] - 25s 222us/step - loss: 0.2425 - acc: 0.9269 - val_loss: 0.1866 - val_acc: 0.9441 Epoch 10/10 110k/110k [==========] - 24s 222us/step - loss: 0.2171 - acc: 0.9347 - val_loss: 0.1684 - val_acc: 0.9505 new jersey est parfois calme pendant l' automne et il est neigeux en avril <PAD> ... <PAD>
That's almost dead on, with only an errant space in l'automne in the printed sample.
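That space is just an artifact of word-level tokenization, where l' and automne are separate tokens. If I wanted prettier output, a tiny post-processing step could re-join French elisions – a rough sketch (the regex and the fix_elision name are my own, not part of the graded project):
import re
def fix_elision(text):
    """Re-attach elided articles like l', d', and qu' to the following word."""
    return re.sub(r"\b([ldjnmstcLDJNMSTC]|qu|Qu)' ", r"\1'", text)
print(fix_elision("new jersey est parfois calme pendant l' automne"))
# new jersey est parfois calme pendant l'automne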
This final function was provided by Udacity to assess my work on the Nanodegree assignment, which passed review:
def final_predictions(x, y, x_tk, y_tk):
"""
Gets predictions using the final model
:param x: Preprocessed English data
:param y: Preprocessed French data
:param x_tk: English tokenizer
:param y_tk: French tokenizer
"""
x = pad(x, max_french_sequence_length)
model = model_final(
x.shape,
y.shape[1],
english_vocab_size,
french_vocab_size)
model.fit(x, y, batch_size=1024, epochs=10, validation_split=0.2)
y_id_to_word = {value: key for key, value in y_tk.word_index.items()}
y_id_to_word[0] = '<PAD>'
sentence = 'he saw a old yellow truck'
sentence = [x_tk.word_index[word] for word in sentence.split()]
sentence = pad_sequences([sentence], maxlen=x.shape[-1], padding='post')
sentences = np.array([sentence[0], x[0]])
predictions = model.predict(sentences, len(sentences))
print('Sample 1:')
print(' '.join([y_id_to_word[np.argmax(x)] for x in predictions[0]]))
print('Il a vu un vieux camion jaune')
print('Sample 2:')
print(' '.join([y_id_to_word[np.argmax(x)] for x in predictions[1]]))
print(' '.join([y_id_to_word[np.max(x)] for x in y[0]]))
final_predictions(preproc_english_sentences, preproc_french_sentences, english_tokenizer, french_tokenizer)
Train on 110288 samples, validate on 27573 samples Epoch 1/10 110k/110k [==========] - 25s 228us/step - loss: 2.6723 - acc: 0.4934 - val_loss: 1.5712 - val_acc: 0.6215 Epoch 2/10 110k/110k [==========] - 24s 221us/step - loss: 1.2008 - acc: 0.7022 - val_loss: 0.8650 - val_acc: 0.7777 Epoch 3/10 110k/110k [==========] - 24s 221us/step - loss: 0.7601 - acc: 0.7923 - val_loss: 0.5802 - val_acc: 0.8351 Epoch 4/10 110k/110k [==========] - 24s 221us/step - loss: 0.5550 - acc: 0.8388 - val_loss: 0.4283 - val_acc: 0.8733 Epoch 5/10 110k/110k [==========] - 24s 221us/step - loss: 0.4326 - acc: 0.8706 - val_loss: 0.3353 - val_acc: 0.8992 Epoch 6/10 110k/110k [==========] - 24s 221us/step - loss: 0.3529 - acc: 0.8936 - val_loss: 0.2799 - val_acc: 0.9176 Epoch 7/10 110k/110k [==========] - 24s 221us/step - loss: 0.3020 - acc: 0.9088 - val_loss: 0.2375 - val_acc: 0.9304 Epoch 8/10 110k/110k [==========] - 24s 221us/step - loss: 0.2648 - acc: 0.9197 - val_loss: 0.2078 - val_acc: 0.9385 Epoch 9/10 110k/110k [==========] - 24s 222us/step - loss: 0.2378 - acc: 0.9284 - val_loss: 0.1859 - val_acc: 0.9447 Epoch 10/10 110k/110k [==========] - 24s 221us/step - loss: 0.2131 - acc: 0.9359 - val_loss: 0.1665 - val_acc: 0.9512 Sample 1: il a vu un vieux camion jaune <PAD> ... <PAD> Il a vu un vieux camion jaune Sample 2: new jersey est parfois calme pendant l' automne et il est neigeux en avril <PAD> ... <PAD> new jersey est parfois calme pendant l' automne et il est neigeux en avril <PAD> ... <PAD>
I'm still floored by achieving 95% validation accuracy after only 10 epochs, especially since this model could still benefit from so much more training. Further enhancements to this architecture could also be made, like the encoder-decoder arrangement I used for the Image Captioner. I'm so proud to have reached this point, and I hope you found the journey interesting.
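For the curious, here's a rough sketch of what that encoder-decoder arrangement might look like in the same Keras style: an encoder GRU squeezes the whole input sentence into a single state vector, RepeatVector (imported at the top but unused until now) copies that vector once per output step, and a decoder GRU unrolls the translation from it. This is just an illustration I'm including for completeness, not a model that was built or trained for the assignment:
def encdec_sketch(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    # Sketch only: encoder GRU -> repeated context vector -> decoder GRU
    learning_rate = 0.001
    model = Sequential()
    model.add(Embedding(english_vocab_size, 256, input_length=input_shape[1]))
    model.add(GRU(256, return_sequences=False))       # encoder: sentence -> one context vector
    model.add(RepeatVector(output_sequence_length))   # repeat the context for every output step
    model.add(GRU(256, return_sequences=True))        # decoder: unroll the French sequence
    model.add(TimeDistributed(Dense(french_vocab_size, activation="softmax")))
    model.compile(loss=sparse_categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])
    return model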
Thanks for reading!