Image Captioner

Made by SeanvonB | Source

This is another project that was part of my Computer Vision Nanodegree from 2020. In this notebook, I cover the process of developing an image captioner: a network that receives an image and returns a written description of that image. It combines Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) cells to create a network with both feedforward and feedback connections. Beyond the obvious but incredible difference an image captioner can make for accessibility, this project demonstrates a network that's capable of inferring contextual nuance, which has countless other applications and is simply fascinating. On the other hand, this project sometimes demonstrates a network that's incapable of inferring contextual nuance, which, while disappointing, can also be pretty funny.

Per Udacity's instruction, the project is broken down into four steps:

  1. Understand the data
  2. Preprocess the data
  3. Train the model
  4. Test the model

Let's get started!

1.0 Understand the Data

Image and text data for this project is provided by the Microsoft Common Objects in Context dataset, or COCO. In addition to captioning algorithms, this dataset would be ideal for any project that relies on contextual recognition, object detection, object segmentation, and pattern recognition. You can read more about COCO here.

Here's an example of the dataset that's provided by COCO itself:

As you can see, there's a pretty wide variety of contexts, as well as some objects that seemingly lack any context. Further, in addition to specialized data, like the color-segmented examples shown, COCO provides 5 captions per image, which will be the primary association that I expect this network to infer. Plus, the COCO dataset can be accessed through the COCO API, which significantly reduces local file space and just feels like good data science – hopefully, that will reduce the likelihood that I get an email from GitHub about their recommended repo size limit.

1.1 Initialize the COCO API

The following code will establish the connection to COCO and retrieve the dataset for storage in memory. Note that two different types of files are being imported: instance annotations and caption annotations; there will be more on this later. Finally, a list of IDs is created, so that dataset samples can be individually accessed if needed.
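In outline, that setup amounts to something like this (a sketch with assumed file paths; the COCO class comes from pycocotools):

```python
from pycocotools.coco import COCO

# Assumed local paths to the COCO 2014 annotation files -- adjust as needed.
instances_file = "annotations/instances_train2014.json"
captions_file = "annotations/captions_train2014.json"

coco = COCO(instances_file)        # instance annotations (objects, segmentation)
coco_caps = COCO(captions_file)    # caption annotations (5 captions per image)

ids = list(coco.anns.keys())       # annotation IDs, for accessing samples individually
print(f"{len(ids)} annotations loaded")
```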

1.2 Plot a Sample

Next, it's wise to check a few samples before moving on to preprocessing, which I do with two objectives in mind: (1) learn what format/shape a sample has, and (2) learn how to access each component of a sample as needed. So, I'll plot a sample image and print the corresponding captions.

If you rerun this cell, a new image and its captions will be chosen randomly each time.

I think there may already be some interesting details to note: the captions apparently don't follow a rigid pattern – verb tense isn't consistent, and neither are punctuation, capitalization, or article usage. I'm curious how this will affect the outcome and whether it may actually improve generalization.

2.0 Preprocess the Data

This section will cover all of the steps between acquiring the data and actually training the network, which will include transforming the data, preparing the data loader, and determining settings like batch size and how many times words must be seen by the network before they're added to the vocabulary. Lastly, this section will include importing the network.

Rather than use PyTorch's DataLoader as before, Udacity provided their own data loader in data_loader.py (which can be initialized with get_loader) and stated we were not permitted to change this file or use an alternative.

As mentioned in previous projects, data transforms solve two problems: first, they conform inputs to match what pre-trained models expect, which is typically a 224x224 image Tensor with color channels normalized according to ImageNet standards; second, they improve generalization by allowing us to subtly mess with the images, thereby functionally expanding the dataset. For this project, I simply resize the image to a minimum of 256 pixels, randomly extract a 224x224 segment, then horizontally mirror the image 50% of the time.
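For reference, a transform along those lines might look like this (a sketch; the exact transform_train lives in the notebook, but it follows the same pattern):

```python
import torchvision.transforms as transforms

transform_train = transforms.Compose([
    transforms.Resize(256),                  # shorter side resized to 256 pixels
    transforms.RandomCrop(224),              # randomly extract a 224x224 segment
    transforms.RandomHorizontalFlip(),       # mirror the image 50% of the time
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406),    # ImageNet channel means
                         (0.229, 0.224, 0.225)),   # ImageNet channel stds
])
```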

Next, for batch size, I typically start with 32; but, after the first run, I will drop it to 16 – or, in this case, 10 – as smaller batch sizes have been shown to help the network generalize.

Then there's the vocabulary threshold, which is a new variable for this kind of project. Setting a minimum number of times that words must be seen before being added to the vocabulary helps ensure that the network will take fewer gambles on words that it doesn't particularly understand. Raising this threshold will produce a network that's more correct but less descriptive, while lowering it will produce a network that's more fun.
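As a rough illustration of what the threshold does (a sketch only; the real vocabulary is built inside Udacity's data loader code, and the special token names here are assumptions):

```python
from collections import Counter

def build_vocab(tokenized_captions, vocab_threshold):
    """Keep only words that appear at least vocab_threshold times."""
    counts = Counter(word for caption in tokenized_captions for word in caption)
    kept = [word for word, count in counts.items() if count >= vocab_threshold]
    # Special tokens are always included, regardless of frequency.
    return {token: idx for idx, token in enumerate(["<start>", "<end>", "<unk>"] + kept)}

captions = [["a", "dog", "runs"], ["a", "cat", "sits"], ["a", "dog", "sleeps"]]
print(build_vocab(captions, vocab_threshold=2))  # only "a" and "dog" make the cut
```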

Finally, the transformed data must be loaded into the data loader from data_loader.py.

2.1 Udacity's Data Loader

. . .

2.2 Load the Data

Another important observation: the captions vary pretty wildly in length. The following output will show that the overwhelming majority of captions are about 10 words long, but there are captions with as few as 6 words or as many as 57. I suspect this is due to crowdsourcing the captions and plain human variance, which makes keeping these rule-breaking captions in the mix all the more important: neither humans nor realistic AI obey the rules! Yikes.

Here's the breakdown:

Rather than restrict the network's learning to the shortest captions or slow down training with a static approach that fits the longest, this research paper suggests an interesting solution: draw batches of image-caption pairs that all feature captions of the same length, and choose that length randomly but proportionately to the number of samples with that length. According to the paper, this approach is computationally optimal without impacting the network's ability to generalize.
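Here's a sketch of that sampling idea, assuming a list of caption lengths for the whole training set (the real version is wired into Udacity's data loader):

```python
import numpy as np

def sample_batch_indices(caption_lengths, batch_size):
    """Pick a caption length at random (weighted by how common it is),
    then draw a batch of samples that all share that length."""
    lengths = np.array(caption_lengths)
    chosen_length = np.random.choice(lengths)            # frequent lengths are chosen more often
    candidates = np.where(lengths == chosen_length)[0]   # every caption with that length
    return list(np.random.choice(candidates, size=batch_size))

caption_lengths = [10, 10, 9, 11, 10, 12, 10, 11, 10, 10]
print(sample_batch_indices(caption_lengths, batch_size=4))
```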

The following creates and fills the data loader, then produces a batch, which will be printed below to confirm format/shape:

2.3 Assemble the Network

For this project, the network architecture consists of two main components: a CNN encoder, and an RNN decoder.

The exact architecture for mine can be found in model.py, but here's an example:

First, I'll import EncoderCNN and DecoderRNN from model.py, then I'll explain what they each do.

2.4 Implement the CNN Encoder

First, let's talk encoder – in the above example, it's the blue part.

I used CNNs in previous projects, and those CNNs extracted features and created feature maps, which were then passed to fully-connected layers for classification or regression. This time, however, the fully-connected layer has been removed; instead, the feature maps are flattened into a vector, run through a Linear layer that resizes the vector to embed_size dimensions, then passed to a whole second network: the decoder. Think of the encoder as a machine that reduces images to their "informational essence" and the decoder as a separate machine that turns "informational essence" into English text. Does this mean that you could swap in a different decoder that was trained on French? Yeah, I think you could!

It wasn't a required part of the assignment, but I also included batch normalization as described in this research paper, which makes the following claim: "Batch Normalization allows us to use much higher learning rates and be less careful about initialization" – you had me at "less careful"!
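The real EncoderCNN lives in model.py; the sketch below just shows the general shape described above, assuming a frozen, pre-trained ResNet-50 backbone (the exact backbone in model.py may differ):

```python
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """Pre-trained CNN with its classifier removed, plus a trainable embedding layer."""

    def __init__(self, embed_size):
        super().__init__()
        resnet = models.resnet50(pretrained=True)
        for param in resnet.parameters():
            param.requires_grad_(False)               # freeze the pre-trained backbone
        self.resnet = nn.Sequential(*list(resnet.children())[:-1])  # drop the final FC layer
        self.embed = nn.Linear(resnet.fc.in_features, embed_size)   # trainable resize to embed_size
        self.bn = nn.BatchNorm1d(embed_size)          # batch norm, per Ioffe & Szegedy

    def forward(self, images):
        features = self.resnet(images)                    # (batch, 2048, 1, 1)
        features = features.view(features.size(0), -1)    # flatten the feature maps
        return self.bn(self.embed(features))              # (batch, embed_size)
```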

The following assembles the encoder and confirms that my sizes are still correct:

2.5 Implement the RNN Decoder

Now, we can talk decoder – in the above example, it's the teal part.

The main difference between a CNN and an RNN is that the RNN doesn't just feed information forward; it also feeds information back into itself for use on the next pass. In other words, the RNN has memory and will remember the last few things it saw. This allows the network to handle sequential data, like language, where subsequent outputs are determined as much by previous outputs as they are by the next input.

For my decoder, I implemented the one described in this research paper. They seem like they know what they're doing. Additionally, the outputs must be a Tensor with the following shape: [batch_size, captions.shape[1], vocab_size], where outputs[i,j,k] can be interpreted as the likelihood that the i-th caption's j-th token is the k-th token in the vocabulary.
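Again, the real DecoderRNN is in model.py; here's a sketch of the embed → LSTM → Linear layout, using teacher forcing during training (the ground-truth caption is fed in as input, shifted by one step):

```python
import torch
import torch.nn as nn

class DecoderRNN(nn.Module):
    """Word embedding -> LSTM -> Linear, producing a score for every word in the vocabulary."""

    def __init__(self, embed_size, hidden_size, vocab_size, num_layers=1):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Drop the final token; the image features stand in as the first "word".
        embeddings = self.word_embeddings(captions[:, :-1])
        inputs = torch.cat((features.unsqueeze(1), embeddings), dim=1)
        hiddens, _ = self.lstm(inputs)
        return self.fc(hiddens)   # (batch_size, captions.shape[1], vocab_size)
```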

So, the following assembles the decoder and again confirms that my sizes are correct:

3.0 Train the Model

With the data preprocessed and the network assembled, training can just about begin. All that's left to do is tweak the hyperparameters!

Here's a brief summary of hyperparameters provided by Udacity:

Begin by setting the following variables:

  • batch_size - the batch size of each training batch. It is the number of image-caption pairs used to amend the model weights in each training step.
  • vocab_threshold - the minimum word count threshold. Note that a larger threshold will result in a smaller vocabulary, whereas a smaller threshold will include rarer words and result in a larger vocabulary.
  • vocab_from_file - a Boolean that decides whether to load the vocabulary from file.
  • embed_size - the dimensionality of the image and word embeddings.
  • hidden_size - the number of features in the hidden state of the RNN decoder.
  • num_epochs - the number of epochs to train the model. We recommend that you set num_epochs=3, but feel free to increase or decrease this number as you wish. This paper trained a captioning model on a single state-of-the-art GPU for 3 days, but you'll soon see that you can get reasonable results in a matter of a few hours! (But of course, if you want your model to compete with current research, you will have to train for much longer.)
  • save_every - determines how often to save the model weights. We recommend that you set save_every=1, to save the model weights after each epoch. This way, after the ith epoch, the encoder and decoder weights will be saved in the models/ folder as encoder-i.pkl and decoder-i.pkl, respectively.
  • print_every - determines how often to print the batch loss to the Jupyter notebook while training. Note that you will not observe a monotonic decrease in the loss function while training - this is perfectly fine and completely expected! You are encouraged to keep this at its default value of 100 to avoid clogging the notebook, but feel free to change it.
  • log_file - the name of the text file containing - for every step - how the loss and perplexity evolved during training.

Udacity also recommended the following research papers as sources for initial values:

Show and Tell: A Neural Image Caption Generator by Vinyals, et al.
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention by Xu, et al.
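For reference, a typical set of initial values might look like the following (illustrative only, loosely based on those papers and the choices discussed earlier; the actual cell is in the notebook):

```python
import torch.nn as nn

batch_size = 10          # small batches, per Section 2.0
vocab_threshold = 5      # minimum appearances before a word joins the vocabulary
vocab_from_file = False  # build the vocabulary from scratch on the first run
embed_size = 256         # dimensionality of the image and word embeddings
hidden_size = 512        # features in the LSTM hidden state
num_epochs = 1           # see Section 3.2 -- one epoch for this exercise
save_every = 1           # checkpoint the weights after every epoch
print_every = 100        # log loss and perplexity every 100 batches
log_file = "training_log.txt"

criterion = nn.CrossEntropyLoss()   # standard choice for per-token classification
```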

Finally, Udacity asked the following questions as part of the Nanodegree to challenge our decisions:

3.1 Nanodegree Questions

Question #1

Question: Describe your CNN-RNN architecture in detail. With this architecture in mind, how did you select the values of the variables in Task 1? If you consulted a research paper detailing a successful implementation of an image captioning model, please provide the reference.

Answer: As a whole, my architecture is fairly simple, because debugging is easier if you start with the basics and only increase the complexity when the basics fall short of the goal. First, a CNN encoder extracts object features from an image; then, an LSTM-Linear decoder turns those features into a caption. I didn't overthink my Task 1 variables, either; I simply chose examples that were given in the research, tried them, and raised or lowered them if the initial values didn't produce the results I expected – I am a machine, learning machine learning. For clarity, I cited the research I used in each of the sections where I used it.

Question #2

Question: How did you select the transform in transform_train? If you left the transform at its provided value, why do you think that it is a good choice for your CNN architecture?

Answer: I left them the same, because they seem like pretty typical defaults – they're exactly what I used in the previous projects, at least, so they're the default for me. I considered transforms like color jitter and random rotation, but I was concerned that such transforms might clash with the semantics of the captions associated with each image – in fact, I nearly removed horizontal flip for that reason, too.

Question #3

Question: How did you select the trainable parameters of your architecture? Why do you think this is a good choice?

Answer: Everything needs to be trained except for the pre-trained ResNet portion of the encoder. So, the latter portion of the encoder (embed) and all of the decoder are set to be trainable. Perhaps I'm confused by the question, but I think this must be a good choice, because – without setting them to be trainable – the network would just produce random garbage. Right?

Question #4

Question: How did you select the optimizer used to train your model?

Answer: Adam was a good friend in high school, so I always give him a chance first. Yeah, boi! If he can't handle it, I guess my other friends from high school, SGD and ASGD, will have a go.

3.2 Train

I chose to train for a single epoch, because performance wasn't a graded requirement of this assignment – in fact, the final segment of this project actually required the model to produce both hits and misses. However, thanks to batch normalization, a single epoch with an increased learning rate should hopefully produce a much more coherent model than such little training normally would.
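In outline, the cycle looks something like this (a sketch only; the actual cell draws its batches through Udacity's data loader, which differs a bit from a plain loop over a DataLoader):

```python
import torch
import torch.nn as nn

def train(encoder, decoder, data_loader, optimizer, num_epochs, print_every, device):
    """Rough shape of the training cycle: encode the image, decode a caption, compare to the real one."""
    criterion = nn.CrossEntropyLoss()
    encoder.to(device).train()
    decoder.to(device).train()

    for epoch in range(1, num_epochs + 1):
        for step, (images, captions) in enumerate(data_loader, start=1):
            images, captions = images.to(device), captions.to(device)

            encoder.zero_grad()
            decoder.zero_grad()

            features = encoder(images)               # (batch, embed_size)
            outputs = decoder(features, captions)    # (batch, seq_len, vocab_size)

            # CrossEntropyLoss expects (N, C) scores against (N,) targets.
            loss = criterion(outputs.view(-1, outputs.size(-1)), captions.view(-1))
            loss.backward()
            optimizer.step()

            if step % print_every == 0:
                print(f"Epoch {epoch}, step {step}: "
                      f"loss {loss.item():.4f}, perplexity {torch.exp(loss).item():.4f}")
```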

The following defines and executes the training cycle:

4.0 Test the Model

A short training cycle like that didn't take too long, but let's see how well the model performs! If testing goes well, then I'd continue training from here.

Now, on to the most exciting part: seeing whether this model can actually do anything! But, before I can demo some predictions, I need to assemble a whole new pipeline for testing and inference – feel free to skip down to Section 4.4 if you just wanna see this model take a few swings.

The following cells replicate my data transforms from earlier for use in testing, then print out a new image from the data loader in test mode:

4.1 Load the Models

The model can easily be loaded from the pickle files that were saved at the end of training; these files are essentially checkpoints that allow the model to be quickly rebuilt, with all of the same settings, exactly as it was at the moment training ended. The only settings that weren't saved were embed_size and hidden_size, so those must be re-defined explicitly.
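Roughly speaking, the reload looks like this (a sketch; it assumes the .pkl files hold state dicts saved with torch.save, that the constructors take the same arguments as the sketches above, and that the test data loader exposes its vocabulary as shown):

```python
import torch
from model import EncoderCNN, DecoderRNN

# These two weren't stored in the checkpoints, so they must match the training run.
embed_size = 256
hidden_size = 512
vocab_size = len(data_loader.dataset.vocab)   # assumption about how the loader exposes its vocabulary

encoder = EncoderCNN(embed_size)
decoder = DecoderRNN(embed_size, hidden_size, vocab_size)

# Rebuild each network from its saved weights, then switch to inference mode.
encoder.load_state_dict(torch.load("./models/encoder-1.pkl"))
decoder.load_state_dict(torch.load("./models/decoder-1.pkl"))
encoder.eval()
decoder.eval()
```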

The following does just that:

4.2 Create the Sampler

The actual work from this section can be found in the DecoderRNN class of model.py as the sample method. This method receives the Tensor features from the model and outputs a Python list that represents the predicted caption sentence, with each index of that list containing the next word of the full predicted sentence.
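Conceptually, sample is a greedy decoding loop: feed in the image features, take the most likely next word, and feed that word back in as the next input. Here's a sketch, written as a free function against the decoder sketch from Section 2.5 (in the project it's a method of DecoderRNN, and the index assumed for <end> is illustrative):

```python
def sample(decoder, features, end_index=1, max_len=20):
    """Greedy decoding: features is (1, 1, embed_size); returns a list of word indices."""
    caption = []
    inputs = features          # the image embedding stands in for the first word
    states = None              # the LSTM hidden and cell states start empty
    for _ in range(max_len):
        hiddens, states = decoder.lstm(inputs, states)      # one LSTM step
        scores = decoder.fc(hiddens.squeeze(1))             # (1, vocab_size)
        predicted = scores.argmax(dim=1)                    # index of the most likely word
        caption.append(predicted.item())
        if predicted.item() == end_index:                   # stop once <end> is predicted
            break
        inputs = decoder.word_embeddings(predicted).unsqueeze(1)  # feed the word back in
    return caption
```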

The following code, provided by Udacity, just checks whether my implementation of sample breaks anything:

4.3 Clean the Captions

The following function simply takes the output of sample, removes the <start> and <end> tokens, and combines the list elements into a single string; in other words, it cleans it:

Seeing whether this step works, in a way, confirms whether the whole pipeline works – oh, the suspense! Here goes:

What a profound thing!

4.4 Generate Predictions

Finally, we have arrived! The following get_prediction function grabs the next image from the loader, runs it through the network, cleans the output, and prints the caption!

Here's the definition of the get_prediction function:

I'll share some hits and misses in a moment; but, for now, here's a cell where you can simply mash get_prediction() to your heart's content. If you find any particularly funny ones, you know I'd love to see them!

Note: if you are viewing this project on GitHub Pages, this cell will not be interactive. Before you can "mash get_prediction() to your heart's content", you will need to clone the repo. ☹️

4.5 The Model Performed Well

Here are some selected examples of the model producing accurate captions for the given image:

4.6 The Model Could Perform Better...

And here are some selected examples of the model not totally understanding what's happening in the images:

Hey, you win some, and you lose some.

Thanks for reading!

Made by SeanvonB | Source