Language Translator

Made by SeanvonB | Source

This project was part of my Natural Language Processing Nanodegree, which I completed in late 2020. This particular Nanodegree – in fact, this particular project – had been my goal throughout my studies of machine learning. I was just so excited to work on it back then, and I'm still excited to share the work with you now. Machine translation has a long and fascinating history that involved many different approaches before the widespread commercial adoption of Neural Machine Translation (NMT) around 2016 or so. The following NMT pipeline, which I created with TensorFlow via Keras, reflects some of the state-of-the-art practices of that period, but it was already somewhat outdated when I built it in 2020, thanks largely to Google Brain's attention-based Transformer model.

This notebook includes three main sections:

  1. Preprocessing, where I examine, tokenize, and pad the dataset.
  2. Models, where I showcase three different network features on their own before combining them into the final model.
  3. Prediction, where I show how the trained model performs.

Let's get started with a whole bunch of workspace helpers and imports:

Ain't nobody got time for training networks on CPU, so this cell simply confirms that the running workspace has access to a GPU, whether through a Udacity Workspace, Amazon Web Services, Google Cloud Platform, or an onboard device. As you can see below, this notebook did:
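As a rough illustration, here's one way that check can be done in a recent TensorFlow 2.x environment; the exact call used in the original notebook may have differed:

```python
import tensorflow as tf

# List any GPUs visible to TensorFlow; an empty list means training will fall back to CPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    print(f"GPU(s) available: {[gpu.name for gpu in gpus]}")
else:
    print("No GPU found -- expect very slow training.")
```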

1.0 Preprocessing

Gotta start with the data!

1.1 Dataset

Language datasets are some of the oldest, largest, and best-maintained datasets available to data science, and the most commonly used translation sets are apparently those from WMT. However, these sets are enormous, so Udacity provided truncated versions of these datasets as vocabulary subsets that can train simple networks much faster. These files, for English and French, are located in the data directory and will be loaded below using the provided helper.py package:
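Here's a minimal sketch of what that loading step might look like; I'm assuming helper.py exposes a load_data function that returns one sentence per line, so treat the exact call as illustrative:

```python
import helper

# Load the truncated English and French corpora from the data directory.
english_sentences = helper.load_data("data/small_vocab_en")
french_sentences = helper.load_data("data/small_vocab_fr")

print(f"Loaded {len(english_sentences)} English and {len(french_sentences)} French sentences.")
```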

1.2 Sample the Data

Each index of small_vocab_en and small_vocab_fr contains the same sentence in its respective language.

The following simply prints the first two pairs:
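Something along these lines, assuming the sentence lists loaded above:

```python
# Show the first two English/French pairs side by side.
for i in range(2):
    print(f"small_vocab_en line {i + 1}: {english_sentences[i]}")
    print(f"small_vocab_fr line {i + 1}: {french_sentences[i]}")
```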

Obviously, this data has already undergone some preprocessing, because everything is lowercase and the punctuation is delimited with spaces. This isn't surprising, as these samples come from established datasets that are used for research, but those steps would otherwise have been Steps 1 and 2.

1.3 Vocabulary Complexity

In this instance, "complexity" refers to the total number of words in the dataset and the number of unique words among them. You can probably intuit that more "complex" problems require more complex solutions, so the following will provide some insight into the complexity of what Udacity selected:
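A quick word count, sketched here with Python's collections.Counter, is enough to get those numbers:

```python
import collections

# Flatten each corpus into a list of words, then count totals and unique entries.
english_words = [word for sentence in english_sentences for word in sentence.split()]
french_words = [word for sentence in french_sentences for word in sentence.split()]

english_counter = collections.Counter(english_words)
french_counter = collections.Counter(french_words)

print(f"English: {len(english_words)} total words, {len(english_counter)} unique words")
print(f"French:  {len(french_words)} total words, {len(french_counter)} unique words")
print("Ten most common English words:", [word for word, _ in english_counter.most_common(10)])
```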

For comparison, Lewis Carroll's Alice's Adventures in Wonderland has 15,500 total words and 2,766 unique words.

So, there isn't that much complexity to this dataset.

1.4 Tokenize the Vocabulary

There are many steps involved in assembling a computer vision pipeline that a natural language processing pipeline can thankfully skip. However, there's one significant difference that must be addressed: unlike image data, language data isn't already numerical. Networks can't perform massive matrix maths on letters.

That's where tokenizing comes in. Tokenizing can occur at the character level; but, for this application, I'll tokenize at the word level. This will create a lookup table of word IDs, where each ID represents one word. Fortunately, this process is very easy with the Keras Tokenizer object.

I'll also print the outcome as an example:
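Here's a minimal sketch of that tokenize step; the sample sentences are made up for illustration:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

def tokenize(sentences):
    """Fit a word-level Tokenizer on the sentences and return (ID sequences, tokenizer)."""
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(sentences)
    return tokenizer.texts_to_sequences(sentences), tokenizer

# Tokenize a couple of sample sentences and print the mapping alongside the result.
sample_sentences = ["the quick brown fox jumps over the lazy dog", "the dog sleeps by the fire"]
sample_sequences, sample_tokenizer = tokenize(sample_sentences)
print(sample_tokenizer.word_index)
for sentence, sequence in zip(sample_sentences, sample_sequences):
    print(f"{sentence}  ->  {sequence}")
```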

As you can see, the Tokenizer assigns an ID to each word – more frequent words get lower IDs, with ties broken by order of appearance.

1.5 Pad the Inputs

The network will expect every batch of word ID sequences (an abstract way of saying "batch of sentences") to be the same length, but that doesn't naturally occur in either dimension: length varies between different sentences within each language and between the same sentence in different languages. Since sentences/sequences are fully dynamic in length, padding must be added to the end of each sequence to make them all as long as the longest sample in the batch.

Keras provides another function, pad_sequences, for just this purpose:
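A sketch of the padding helper, reusing the sample sequences from above:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

def pad(sequences, length=None):
    """Pad every sequence with trailing zeros so they all match the longest one (or `length`)."""
    return pad_sequences(sequences, maxlen=length, padding="post")

# The shorter sample sentence gains trailing zeros to match the longer one.
print(pad(sample_sequences))
```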

1.6 Preprocess Pipeline

Here's the full preprocessing pipeline, which includes the above tokenize and pad functions, plus a .reshape() of the data to accommodate how Keras implements sparse categorical crossentropy, the loss function that I've chosen for this project. Finally, the vocabulary sizes must be increased by 1 to account for the new <PAD> token – this dumb thing had me stumped for a while.
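Sketched out, the pipeline looks roughly like this, building on the tokenize and pad helpers above; the reshape gives the sparse labels the explicit final dimension that Keras expects:

```python
def preprocess(x_sentences, y_sentences):
    """Tokenize and pad both corpora, then reshape the labels for sparse categorical crossentropy."""
    x_sequences, x_tokenizer = tokenize(x_sentences)
    y_sequences, y_tokenizer = tokenize(y_sentences)

    x_padded = pad(x_sequences)
    y_padded = pad(y_sequences)

    # Keras wants the sparse labels to carry an explicit final dimension of 1.
    y_padded = y_padded.reshape(*y_padded.shape, 1)

    return x_padded, y_padded, x_tokenizer, y_tokenizer

preproc_english, preproc_french, english_tokenizer, french_tokenizer = preprocess(
    english_sentences, french_sentences
)

# Add 1 to each vocabulary size to account for the <PAD> token, which occupies ID 0.
english_vocab_size = len(english_tokenizer.word_index) + 1
french_vocab_size = len(french_tokenizer.word_index) + 1
```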

And that's all for data preprocessing!

2.0 Models

This section showcases some experimentation with neural network architectures. From the start, I was pretty certain that the final architecture would use all of the tested features, so I was mostly just curious how much of an impact each would have on performance.

Here are the four architectures that will be shown in this section:

  1. Simple RNN
  2. RNN with Embedding
  3. Bidirectional RNN
  4. Final Model

But, first, there's an issue with what all of these models will output...

2.1 IDs to Text

Everything that was done to preprocess the data was done to help the network handle it. But, regardless of the architecture, every model must end with a function that converts the base output – a sequence of word IDs – back into sentences that humans can understand. That's what the following logits_to_text function does:

Note: the word logit, in this context, refers to the vector of scores that the network outputs over the French vocabulary for each position in the sequence; logits_to_text simply keeps the highest-scoring word ID at each index.
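Here's roughly how such a conversion can be written, assuming the tokenizer produced during preprocessing:

```python
import numpy as np

def logits_to_text(logits, tokenizer):
    """Turn one sequence of per-word output scores back into a readable sentence."""
    # Invert the tokenizer's word -> ID mapping, reserving ID 0 for the <PAD> token.
    index_to_words = {index: word for word, index in tokenizer.word_index.items()}
    index_to_words[0] = "<PAD>"
    # Keep the highest-scoring word at each position in the sequence.
    return " ".join(index_to_words[prediction] for prediction in np.argmax(logits, axis=1))
```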

2.2 Model #1: Simple RNN

It feels pedantic to say "a simple RNN" – y'know, just your run-of-the-mill Recurrent Neural Network. There isn't anything simple about RNNs, which I used previously in my Image Captioner project. What RNNs added that previous neural networks lacked is memory between steps. As you can see in the following diagram, each step passes information both out of the network and forward to the next step, which allows the network to handle sequential data, like language, where subsequent outputs are determined as much by previous outputs as they are by current inputs.

But this project will build upon this foundation with some new twists; so, for this notebook, I'll start with a simple RNN:
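Here's a minimal sketch of what such a model can look like in Keras; the GRU cells, layer sizes, learning rate, and training settings are illustrative assumptions rather than the notebook's exact values:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, Dense, TimeDistributed
from tensorflow.keras.optimizers import Adam

def simple_model(input_shape, french_vocab_size):
    """A plain recurrent model: padded English word IDs in, per-step French word probabilities out."""
    model = Sequential([
        GRU(256, input_shape=input_shape[1:], return_sequences=True),
        TimeDistributed(Dense(1024, activation="relu")),
        TimeDistributed(Dense(french_vocab_size, activation="softmax")),
    ])
    model.compile(
        loss="sparse_categorical_crossentropy",
        optimizer=Adam(learning_rate=0.001),
        metrics=["accuracy"],
    )
    return model

# Pad the English IDs to the French sequence length and give each step a single feature.
tmp_x = pad(preproc_english, preproc_french.shape[1])
tmp_x = tmp_x.reshape((-1, preproc_french.shape[1], 1))

simple_rnn_model = simple_model(tmp_x.shape, french_vocab_size)
simple_rnn_model.fit(tmp_x, preproc_french, batch_size=1024, epochs=10, validation_split=0.2)
print(logits_to_text(simple_rnn_model.predict(tmp_x[:1])[0], french_tokenizer))
```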

Well, that's an actual sentence... with essentially the opposite of the intended meaning.

2.3 Model #2: RNN with Embedding

Word IDs are a pretty basic way to represent a word for the network; there's a better way: word embeddings. Unlike word IDs, which represent each word as a single integer, word embeddings represent words as vectors in n-dimensional space, i.e. a big cloud of words, where similar words can cluster closer to each other. Word embeddings can help the network understand nuances in language, like how hot can be closer to cold in one dimension and closer to sexy in another. In the example below, you can see the word the – with the word ID 8 – being embedded as the vector [0.2, 4, 2.4, 1.1, ...], which continues for n dimensions.

The following uses a Keras Embedding layer with n set to 256:
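A sketch of that version, again with assumed layer sizes; the Embedding layer takes the raw 2-D word IDs directly, so there's no need for the extra feature dimension this time:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GRU, Dense, TimeDistributed
from tensorflow.keras.optimizers import Adam

def embed_model(input_shape, english_vocab_size, french_vocab_size):
    """Like the simple model, but each word ID is first mapped to a 256-dimensional vector."""
    model = Sequential([
        Embedding(english_vocab_size, 256, input_length=input_shape[1]),
        GRU(256, return_sequences=True),
        TimeDistributed(Dense(1024, activation="relu")),
        TimeDistributed(Dense(french_vocab_size, activation="softmax")),
    ])
    model.compile(
        loss="sparse_categorical_crossentropy",
        optimizer=Adam(learning_rate=0.001),
        metrics=["accuracy"],
    )
    return model

# Feed 2-D (batch, sequence) word IDs, padded to the French sequence length.
tmp_x = pad(preproc_english, preproc_french.shape[1])
embed_rnn_model = embed_model(tmp_x.shape, english_vocab_size, french_vocab_size)
embed_rnn_model.fit(tmp_x, preproc_french, batch_size=1024, epochs=10, validation_split=0.2)
```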

Now that's pretty good!

2.4 Model #3: Bidirectional RNN

An RNN allows the model to handle sequential data, like language; but a bidirectional RNN allows the model to handle language better. That's because a bidirectional RNN can also see future inputs! That might not be necessary for rote and inflexible sentence structures, but most instances of English will feature split, subordinate, or conditional clauses, phrasal verb tenses, or prepositional phrases – these can cause all manner of unusual splices and inversions of sentence structure. And that's just English – I have no idea what linguistic chicanery French gets up to!

This time, the model features a Keras Bidirectional layer:
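A sketch of the bidirectional variant, kept deliberately parallel to the simple model so that only the wrapped recurrent layer changes; sizes are again assumptions:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, GRU, Dense, TimeDistributed
from tensorflow.keras.optimizers import Adam

def bd_model(input_shape, french_vocab_size):
    """The simple model again, but the recurrent layer also reads each sequence right-to-left."""
    model = Sequential([
        Bidirectional(GRU(256, return_sequences=True), input_shape=input_shape[1:]),
        TimeDistributed(Dense(1024, activation="relu")),
        TimeDistributed(Dense(french_vocab_size, activation="softmax")),
    ])
    model.compile(
        loss="sparse_categorical_crossentropy",
        optimizer=Adam(learning_rate=0.001),
        metrics=["accuracy"],
    )
    return model

# Same 3-D input as the simple model: padded word IDs with a single feature per step.
tmp_x = pad(preproc_english, preproc_french.shape[1]).reshape((-1, preproc_french.shape[1], 1))
bd_rnn_model = bd_model(tmp_x.shape, french_vocab_size)
bd_rnn_model.fit(tmp_x, preproc_french, batch_size=1024, epochs=10, validation_split=0.2)
```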

Uh-oh, that's somehow worse... Oh, of course! Bidirectional must take twice as long to train!

2.5 Model #4: Final Model

At this point, the architecture is becoming a little complicated, and its training needs are becoming a little less reasonable. But you know I'm still gonna mash the three previous approaches together with some Dropout and see what happens. Clearly, the Embedding layer had by far the most significant impact; however, I'm curious whether the Bidirectional layer will perform better on word embeddings. The following model begs for more training time, but I gave this one the same 10 epochs that the previous models had.

Here's the final model:
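A sketch of how the pieces can be combined: embedding in front, a bidirectional recurrent layer in the middle, and some Dropout before the output. As before, the exact sizes and dropout rate are illustrative:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, GRU, Dense, TimeDistributed, Dropout
from tensorflow.keras.optimizers import Adam

def final_model(input_shape, english_vocab_size, french_vocab_size):
    """Embedding + bidirectional GRU + dropout, trained end to end like the previous models."""
    model = Sequential([
        Embedding(english_vocab_size, 256, input_length=input_shape[1]),
        Bidirectional(GRU(256, return_sequences=True)),
        TimeDistributed(Dense(1024, activation="relu")),
        Dropout(0.5),
        TimeDistributed(Dense(french_vocab_size, activation="softmax")),
    ])
    model.compile(
        loss="sparse_categorical_crossentropy",
        optimizer=Adam(learning_rate=0.001),
        metrics=["accuracy"],
    )
    return model

# Same 2-D embedded input as Model #2, trained for the same 10 epochs.
tmp_x = pad(preproc_english, preproc_french.shape[1])
final_rnn_model = final_model(tmp_x.shape, english_vocab_size, french_vocab_size)
final_rnn_model.fit(tmp_x, preproc_french, batch_size=1024, epochs=10, validation_split=0.2)
print(logits_to_text(final_rnn_model.predict(tmp_x[:1])[0], french_tokenizer))
```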

That's almost dead on, with only an errant space in l'automne from the printed sample.

3.0 Prediction

The following was provided by Udacity to assess my work on the Nanodegree assignment, which passed review:

I'm still floored by achieving 95% validation accuracy after only 10 epochs, because this model could still benefit from so much more training. Further enhancements to this architecture could also be made, like the encoder-decoder arrangement I used for the Image Captioner. I'm so proud to have reached this point, and I hope you found the journey interesting.

Thanks for reading!

Made by SeanvonB | Source