This is another project that was part of my Computer Vision Nanodegree from 2020. In this notebook, I cover the process of developing an image captioner: a network that receives an image and returns a written description of the image. It combines Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) cells to create a network with both feedforward and feedback connections. Beyond the obvious but incredible difference an image captioner can make in terms of user accessibility, this project demonstrates a network that's capable of inferring contextual nuance, which has countless other applications and is simply fascinating. On the other hand, this project sometimes demonstrates a network that's incapable of inferring contextual nuance; which, while disappointing, can also be pretty funny.
Per Udacity's instruction, the project is broken down into four steps:
Let's get started!
Image and text data for this project is provided by the Microsoft Common Objects in Context dataset, or COCO. In addition to captioning algorithms, this dataset would be ideal for any project that relies on contextual recognition, object detection, object segmentation, and pattern recognition. You can read more about COCO here.
Here's an example of the dataset that's provided by COCO itself:
As you can see, there's a pretty wide variety of contexts, as well as some objects that seemingly lack any context. Further, in addition to specialized data, like the color segmented examples shown, the images from COCO include 5 captions per image, which will be the primary association that I will expect this network to infer. Plus, the COCO dataset can be accessed through the COCO API, which significantly reduces local file space and just feels like good data science – hopefully, that will reduce the likelihood that I get an email from GitHub about their recommended repo size limit.
The following code will establish the connection to COCO and retrieve the dataset for storage in memory. Note that two different types of files are being imported: instance annotations and caption annotations; there will be more on this later. Finally, a list of IDs is created, so that dataset samples can be individually accessed if needed.
import os
import sys
sys.path.append('/opt/cocoapi/PythonAPI')
from pycocotools.coco import COCO
# Initialize COCO API for instance annotations
dataDir = '/opt/cocoapi'
dataType = 'val2014'
instances_annFile = os.path.join(dataDir, 'annotations/instances_{}.json'.format(dataType))
coco = COCO(instances_annFile)
# Initialize COCO API for caption annotations
captions_annFile = os.path.join(dataDir, 'annotations/captions_{}.json'.format(dataType))
coco_caps = COCO(captions_annFile)
# Get image IDs
ids = list(coco.anns.keys())
loading annotations into memory... Done (t=6.34s) creating index... index created! loading annotations into memory... Done (t=0.97s) creating index... index created!
Next, it's wise to check a few samples before moving on to preprocessing, which I do with two objectives in mind: (1) learn what format/shape a sample has, and (2) learn how to access each component of a sample as needed. So, I'll plot a sample image and print the corresponding captions.
If you'd like to rerun this cell, a new image and its captions will be chosen randomly each time.
import numpy as np
import skimage.io as io
import matplotlib.pyplot as plt
%matplotlib inline
# Get URL for random image
ann_id = np.random.choice(ids)
img_id = coco.anns[ann_id]['image_id']
img = coco.loadImgs(img_id)[0]
url = img['coco_url']
# Print URL and plot image
print(url)
I = io.imread(url)
plt.axis('off')
plt.imshow(I)
plt.show()
# Load and print captions
annIds = coco_caps.getAnnIds(imgIds=img['id'])
anns = coco_caps.loadAnns(annIds)
coco_caps.showAnns(anns)
http://images.cocodataset.org/val2014/COCO_val2014_000000219820.jpg
A small boat is going down the river in front of colorful trees. A terraced hill in fall colors going down to the water with a boat on it. a red blue and yellow boat and some red trees Boat on water above trees with fall foliage. A series of steep stairs lay next to a lake
I think there may already be some interesting details to note: the captions apparently don't follow a rigid pattern – e.g., verb tense isn't consistent, nor are punctuation, capitalization, or article inclusion. I'm curious how this will affect the outcome and whether it may actually improve generalization.
This section will cover all of the steps between acquiring the data and actually training the network, which will include transforming the data, preparing the data loader, and determining settings like batch size and how many times words must be seen by the network before they're added to the vocabulary. Lastly, this section will include importing the network.
Rather than use PyTorch's DataLoader as before, Udacity provided their own data loader in `data_loader.py` (which can be initialized with `get_loader`) and stated we were not permitted to change this file or use an alternative.
As mentioned in previous projects, data transforms solve two problems: first, they conform inputs to match what pre-trained models expect, which is typically a 224x224 image Tensor with color channels normalized according to ImageNet standards; second, they improve generalization by allowing us to subtly mess with the images, thereby functionally expanding the dataset. For this project, I simply resize the image to a minimum of 256 pixels, randomly extract a 224x224 segment, then horizontally mirror the image 50% of the time.
Next, for batch size, I typically start with 32; but, after the first run, I will drop it to 16 – or, in this case, 10 – as smaller batch sizes have been shown to help the network generalize.
Finally, vocabulary threshold is a new variable for this kind of project. Setting a minimum number of times that words must be seen before being added to the vocabulary helps ensure that the network will take fewer gambles on words that it doesn't particularly understand. Raising this threshold will produce a network that's more correct but less descriptive, while lowering it will produce a network that's more fun.
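As a rough sketch of what the threshold does, here's the idea in a few lines. The actual filtering happens inside Udacity's data loader when the vocabulary is built, so the function name below is made up for illustration:

```python
from collections import Counter

def build_vocab(tokenized_captions, vocab_threshold):
    """Keep only words that appear at least vocab_threshold times."""
    counts = Counter(word for caption in tokenized_captions for word in caption)
    return sorted(word for word, n in counts.items() if n >= vocab_threshold)

captions = [["a", "dog", "runs"], ["a", "cat", "sits"], ["a", "dog", "sits"]]
print(build_vocab(captions, vocab_threshold=2))  # ['a', 'dog', 'sits']
```

Words below the threshold ("runs", "cat") get mapped to an unknown-word token instead of earning their own vocabulary entry, so the network never gambles on them.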
With all that decided, the transformed data must be loaded into the data loader from `data_loader.py`.
import sys
sys.path.append('/opt/cocoapi/PythonAPI')
from pycocotools.coco import COCO
%pip install nltk
import nltk
nltk.download('punkt')
from data_loader import get_loader
from torchvision import transforms
# Define a transform to pre-process the training images
transform_train = transforms.Compose([
transforms.Resize(256), # Resize smallest dim to 256 pixels
transforms.RandomCrop(224), # Randomly crop 224x224 segment
transforms.RandomHorizontalFlip(), # Horizontal mirror with probability=0.5
transforms.ToTensor(), # Convert PIL to Tensor
transforms.Normalize((0.485, 0.456, 0.406), # Normalize image for pre-trained model
(0.229, 0.224, 0.225))])
# Specify the batch size
batch_size = 10
# Set the minimum word count threshold
vocab_threshold = 5
# Obtain the data loader
data_loader = get_loader(transform=transform_train,
mode='train',
batch_size=batch_size,
vocab_threshold=vocab_threshold,
vocab_from_file=False)
Requirement already satisfied: nltk in /opt/conda/lib/python3.6/site-packages (3.2.5) Requirement already satisfied: six in /opt/conda/lib/python3.6/site-packages (from nltk) (1.11.0) [nltk_data] Downloading package punkt to /root/nltk_data... [nltk_data] Unzipping tokenizers/punkt.zip. loading annotations into memory... Done (t=1.06s) creating index... index created! [0/414113] Tokenizing captions... [100000/414113] Tokenizing captions... [200000/414113] Tokenizing captions... [300000/414113] Tokenizing captions... [400000/414113] Tokenizing captions... loading annotations into memory... Done (t=0.97s) creating index...
0%| | 865/414113 [00:00<01:34, 4351.31it/s]
index created! Obtaining caption lengths...
100%|██████████| 414113/414113 [01:36<00:00, 4289.59it/s]
. . .
Another important observation to make is that the captions vary pretty wildly in length. The following output will show that the overwhelming majority of captions are about 10 words long, but there are captions with as few as 6 words or as many as 57. I suspect this is due to crowdsourcing the captions and human variance, which makes accommodating this possible inability to follow instructions all the more important: neither humans nor realistic AI obey the rules! Yikes.
Here's the breakdown:
from collections import Counter
counter = Counter(data_loader.dataset.caption_lengths)
lengths = sorted(counter.items(), key=lambda pair: pair[1], reverse=True)
for value, count in lengths:
print('value: %2d --- count: %5d' % (value, count))
value: 10 --- count: 86334 value: 11 --- count: 79948 value: 9 --- count: 71934 value: 12 --- count: 57637 value: 13 --- count: 37645 ... value: 56 --- count: 2 value: 6 --- count: 2 value: 53 --- count: 2 value: 55 --- count: 2 value: 57 --- count: 1
Rather than restrict the network's learning to the shortest captions or slow down training with a static approach that fits the longest, this research paper suggests an interesting solution: draw batches of image-caption pairs that all feature captions of the same length, and choose that length randomly but proportionately to the number of samples with that length. According to the paper, this approach is computationally optimal without impacting the network's ability to generalize.
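Here's a hedged sketch of that sampling scheme. The real version is `data_loader.dataset.get_train_indices()`; this standalone function just illustrates the idea:

```python
import numpy as np

def get_train_indices(caption_lengths, batch_size):
    """Pick a caption length in proportion to its frequency, then sample
    batch_size caption indices that all share that length."""
    lengths = np.array(caption_lengths)
    # Picking a random caption and reading off its length is equivalent to
    # sampling a length proportionally to how common that length is
    target = lengths[np.random.randint(len(lengths))]
    candidates = np.where(lengths == target)[0]
    return list(np.random.choice(candidates, size=batch_size))

indices = get_train_indices([9, 10, 10, 10, 11, 10], batch_size=3)
print(indices)  # three indices, all pointing at captions of the same length
```

Because every caption in a batch has the same length, no padding is needed, which is where the computational savings come from.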
The following creates and fills the data loader, then produces a batch, which will be printed below to confirm format/shape:
import numpy as np
import torch.utils.data as data
# Randomly choose caption length, then sample indices with that length
indices = data_loader.dataset.get_train_indices()
print('sampled indices:', indices)
# Create and assign batch sampler to retrieve batch with sampled indices
new_sampler = data.sampler.SubsetRandomSampler(indices=indices)
data_loader.batch_sampler.sampler = new_sampler
# Obtain and print batch
images, captions = next(iter(data_loader))
print('images.shape:', images.shape)
print('captions.shape:', captions.shape)
# Print preprocessed images and captions
print('images:', images)
print('captions:', captions)
sampled indices: [399799, 210848, 364086, 368244, 239616, 63840, 326666, 301245, 271280, 128451] images.shape: torch.Size([10, 3, 224, 224]) captions.shape: torch.Size([10, 12]) images: tensor(...) captions: tensor(...)
For this project, the network architecture consists of two main components: a CNN encoder, and an RNN decoder.
The exact architecture for mine can be found in `model.py`, but here's an example:
First, I'll import `EncoderCNN` and `DecoderRNN` from `model.py`, then I'll explain what they each do.
import torch
from model import EncoderCNN, DecoderRNN
# Watch for changes in `model.py`, and re-load automatically
%load_ext autoreload
%autoreload 2
# Determine which device will be active
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
First, let's talk encoder – in the above example, it's the blue part. I used CNNs in previous projects, and those CNNs extracted features and created feature maps, which were then passed to fully-connected layers for classification or regression. This time, however, the fully-connected layer has been removed; instead, the feature maps are flattened into a vector, run through a `Linear` layer that resizes the vector to `embed_size` dimensions, then passed to a whole second network: the decoder. Think of the encoder as a machine that reduces images to their "informational essence" and the decoder as a separate machine that turns "informational essence" into English text. Does this mean that you could swap in a different decoder that was trained on French? Yeah, I think you could!
It wasn't a required part of the assignment, but I also included batch normalization as described in this research paper, which makes the following claim: "Batch Normalization allows us to use much higher learning rates and be less careful about initialization" – you had me at "less careful"!
The following assembles the encoder and confirms that my sizes are still correct:
# Image embedding size
embed_size = 256
# Initialize encoder
encoder = EncoderCNN(embed_size)
# Move encoder to GPU if CUDA available
encoder.to(device)
# Move images to GPU if CUDA available.
images = images.to(device)
# Pass images to encoder
features = encoder(images)
print('type(features):', type(features))
print('features.shape:', features.shape)
# Check if I broke any pipes
assert type(features)==torch.Tensor, "Encoder output needs to be a PyTorch Tensor."
assert (features.shape[0]==batch_size) & (features.shape[1]==embed_size), "The shape of the encoder output is incorrect."
Downloading: "https://download.pytorch.org/models/resnet50-19c8e357.pth" to /root/.torch/models/resnet50-19c8e357.pth 100%|██████████| 102502400/102502400 [00:01<00:00, 59702705.38it/s]
type(features): <class 'torch.Tensor'> features.shape: torch.Size([10, 256])
Now, we can talk decoder - in the above example, it's the teal part.
The main difference between a CNN and an RNN is that the RNN doesn't just feed information forward; the RNN also feeds information backward for use by the next pass. In other words, the RNN has memory and will remember the last few things it saw. This allows the network to handle sequential data, like language, where subsequent outputs are determined as much by previous outputs as they are by the next input.
For my decoder, I implemented the one described in this research paper. They seem like they know what they're doing.
Finally, the outputs must also be a Tensor with the following shape: `[batch_size, captions.shape[1], vocab_size]`, where the final `outputs[i,j,k]` can be interpreted as the likelihood that the `i`-th caption's `j`-th token is the `k`-th token in the vocabulary.
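As a hedged sketch of a decoder that satisfies that contract (my real `DecoderRNN` is in `model.py` and may differ in detail), the forward pass embeds the caption tokens, prepends the image features as the first "word", and runs the whole sequence through the LSTM:

```python
import torch
import torch.nn as nn

class DecoderRNN(nn.Module):
    """Sketch: Embedding -> LSTM -> Linear, with image features as step zero."""
    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Embed every caption token except the last (<end> needs no successor)
        embeddings = self.embed(captions[:, :-1])       # [batch, T-1, embed]
        # Prepend the image features as the first input step
        inputs = torch.cat((features.unsqueeze(1), embeddings), dim=1)
        hiddens, _ = self.lstm(inputs)                  # [batch, T, hidden]
        return self.fc(hiddens)                         # [batch, T, vocab]
```

Dropping the final caption token before embedding is what keeps the output length equal to `captions.shape[1]` after the image features are prepended.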
So, the following assembles the decoder and again confirms that my sizes are correct:
# RNN hidden state size
hidden_size = 512
# Get size of vocabulary
vocab_size = len(data_loader.dataset.vocab)
# Initialize decoder
decoder = DecoderRNN(embed_size, hidden_size, vocab_size)
# Move decoder to GPU if CUDA available.
decoder.to(device)
# Move captions to GPU if CUDA available
captions = captions.to(device)
# Pass both encoder output and captions to decoder
outputs = decoder(features, captions)
print('type(outputs):', type(outputs))
print('outputs.shape:', outputs.shape)
# Check if I broke any pipes
assert type(outputs)==torch.Tensor, "Decoder output needs to be a PyTorch Tensor."
assert (outputs.shape[0]==batch_size) & (outputs.shape[1]==captions.shape[1]) & (outputs.shape[2]==vocab_size), "The shape of the decoder output is incorrect."
type(outputs): <class 'torch.Tensor'> outputs.shape: torch.Size([10, 12, 8855])
With the data preprocessed and the network assembled, training can just about begin. All that's left to do is tweak the hyperparameters!
Here's a brief summary of hyperparameters provided by Udacity:
Begin by setting the following variables:

- `batch_size` - the batch size of each training batch. It is the number of image-caption pairs used to amend the model weights in each training step.
- `vocab_threshold` - the minimum word count threshold. Note that a larger threshold will result in a smaller vocabulary, whereas a smaller threshold will include rarer words and result in a larger vocabulary.
- `vocab_from_file` - a Boolean that decides whether to load the vocabulary from file.
- `embed_size` - the dimensionality of the image and word embeddings.
- `hidden_size` - the number of features in the hidden state of the RNN decoder.
- `num_epochs` - the number of epochs to train the model. We recommend that you set `num_epochs=3`, but feel free to increase or decrease this number as you wish. This paper trained a captioning model on a single state-of-the-art GPU for 3 days, but you'll soon see that you can get reasonable results in a matter of a few hours! (But of course, if you want your model to compete with current research, you will have to train for much longer.)
- `save_every` - determines how often to save the model weights. We recommend that you set `save_every=1`, to save the model weights after each epoch. This way, after the `i`th epoch, the encoder and decoder weights will be saved in the `models/` folder as `encoder-i.pkl` and `decoder-i.pkl`, respectively.
- `print_every` - determines how often to print the batch loss to the Jupyter notebook while training. Note that you will not observe a monotonic decrease in the loss function while training - this is perfectly fine and completely expected! You are encouraged to keep this at its default value of `100` to avoid clogging the notebook, but feel free to change it.
- `log_file` - the name of the text file containing - for every step - how the loss and perplexity evolved during training.
Udacity also recommended the following research papers as sources for initial values:
Show and Tell: A Neural Image Caption Generator by Vinyals, et al.
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention by Xu, et al.
Finally, Udacity asked the following questions as part of the Nanodegree to challenge our decisions:
Question: Describe your CNN-RNN architecture in detail. With this architecture in mind, how did you select the values of the variables in Task 1? If you consulted a research paper detailing a successful implementation of an image captioning model, please provide the reference.
Answer: As a whole, my architecture is fairly simple, because debugging is easier if you start with the basics and only increase the complexity when the basics fall short of the goal. First, a CNN encoder extracts object features from an image; then, an LSTM-Linear decoder gives those objects a caption. I did not overthink my Task 1 variables, either; I simply chose examples that were given in the research, tried them, and raised or lowered them if the initial values didn't produce the results I expected – I am a machine-learning machine. For clarity, I cited the research I used in each of the sections where I used it.
Question: How did you select the transform in `transform_train`? If you left the transform at its provided value, why do you think that it is a good choice for your CNN architecture?
Answer: I left them the same, because they seem like pretty typical defaults – they're exactly what I used in the previous projects, at least; so they're the default for me. I considered transforms like color jitter and random rotation, but I was concerned that such transforms might clash with the semantics of the captions associated with each image – in fact, I nearly removed horizontal flip for that reason, too.
Question: How did you select the trainable parameters of your architecture? Why do you think this is a good choice?
Answer: Everything needs to be trained except for the pre-trained ResNet portion of the encoder. So, the latter portion of the encoder (`embed`) and all of the decoder are set to be trainable. Perhaps I'm confused by the question, but I think this must be a good choice, because – without setting them to be trainable – the network would just produce random garbage. Right?
Question: How did you select the optimizer used to train your model?
Answer: Adam was a good friend in high school, so I always give him a chance first. Yeah, boi! If he can't handle it, I guess my other friends from high school, SGD and ASGD, will have a go.
import torch
import torch.nn as nn
from torchvision import transforms
import sys
sys.path.append('/opt/cocoapi/PythonAPI')
from pycocotools.coco import COCO
from data_loader import get_loader
from model import EncoderCNN, DecoderRNN
import math
# Hyperparameters
batch_size = 64
vocab_threshold = 5
vocab_from_file = False
embed_size = 256
hidden_size = 512
num_epochs = 1
save_every = 1
print_every = 100
log_file = 'training_log.txt'
# Transforms
transform_train = transforms.Compose([
transforms.Resize(256),
transforms.RandomCrop(224),
transforms.RandomHorizontalFlip(),
transforms.ToTensor(),
transforms.Normalize((0.485, 0.456, 0.406),
(0.229, 0.224, 0.225))])
# Data Loader
data_loader = get_loader(transform=transform_train,
mode='train',
batch_size=batch_size,
vocab_threshold=vocab_threshold,
vocab_from_file=vocab_from_file)
# Vocabulary
vocab_size = len(data_loader.dataset.vocab)
# Initialize networks
encoder = EncoderCNN(embed_size)
decoder = DecoderRNN(embed_size, hidden_size, vocab_size)
# Move models to GPU if CUDA available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
encoder.to(device)
decoder.to(device)
# Loss function
criterion = nn.CrossEntropyLoss().cuda() if torch.cuda.is_available() else nn.CrossEntropyLoss()
# Specify learnable parameters
params = list(decoder.parameters()) + list(encoder.embed.parameters())
# Optimizer
optimizer = torch.optim.Adam(params)
# Steps per epoch
total_step = math.ceil(len(data_loader.dataset.caption_lengths) / data_loader.batch_sampler.batch_size)
loading annotations into memory... Done (t=1.04s) creating index... index created! [0/414113] Tokenizing captions... [100000/414113] Tokenizing captions... [200000/414113] Tokenizing captions... [300000/414113] Tokenizing captions... [400000/414113] Tokenizing captions... loading annotations into memory... Done (t=0.97s) creating index...
0%| | 903/414113 [00:00<01:31, 4533.09it/s]
index created! Obtaining caption lengths...
100%|██████████| 414113/414113 [01:33<00:00, 4441.47it/s]
I chose to train for a single epoch, because performance wasn't a graded requirement of this assignment – in fact, the final segment of this project actually required the model to produce both hits and misses. However, thanks to batch normalization, a single epoch with an increased learning rate should hopefully produce a much more coherent model than such little training normally would.
The following defines and executes the training cycle:
import torch.utils.data as data
import numpy as np
import os
import requests
import time
# Open training log file
f = open(log_file, 'w')
old_time = time.time()
response = requests.request("GET",
"http://metadata.google.internal/computeMetadata/v1/instance/attributes/keep_alive_token",
headers={"Metadata-Flavor":"Google"})
for epoch in range(1, num_epochs+1):
for i_step in range(1, total_step+1):
if time.time() - old_time > 60:
old_time = time.time()
requests.request("POST",
"https://nebula.udacity.com/api/v1/remote/keep-alive",
headers={'Authorization': "STAR " + response.text})
# Randomly sample caption length, then sample indices with that length
indices = data_loader.dataset.get_train_indices()
# Create and assign batch sampler to retrieve batch with sampled indices
new_sampler = data.sampler.SubsetRandomSampler(indices=indices)
data_loader.batch_sampler.sampler = new_sampler
# Obtain batch
images, captions = next(iter(data_loader))
# Move images and captions to GPU if CUDA available.
images = images.to(device)
captions = captions.to(device)
# Zero gradients
decoder.zero_grad()
encoder.zero_grad()
# Pass inputs through CNN-RNN model
features = encoder(images)
outputs = decoder(features, captions)
# Calculate batch loss
loss = criterion(outputs.view(-1, vocab_size), captions.view(-1))
# Backward pass
loss.backward()
# Update parameters in optimizer
optimizer.step()
# Get training statistics
stats = 'Epoch [%d/%d], Step [%d/%d], Loss: %.4f, Perplexity: %5.4f' % (epoch, num_epochs, i_step, total_step, loss.item(), np.exp(loss.item()))
# Print training statistics (on same line)
print('\r' + stats, end="")
sys.stdout.flush()
# Print training statistics to file
f.write(stats + '\n')
f.flush()
# Print training statistics (on different line)
if i_step % print_every == 0:
print('\r' + stats)
# Save weights
if epoch % save_every == 0:
torch.save(decoder.state_dict(), os.path.join('./models', 'decoder-%d.pkl' % epoch))
torch.save(encoder.state_dict(), os.path.join('./models', 'encoder-%d.pkl' % epoch))
# Close training log file
f.close()
Epoch [1/1], Step [100/6471], Loss: 3.9504, Perplexity: 51.9548 Epoch [1/1], Step [200/6471], Loss: 3.6203, Perplexity: 37.34829 Epoch [1/1], Step [300/6471], Loss: 3.4357, Perplexity: 31.05318 Epoch [1/1], Step [400/6471], Loss: 3.0031, Perplexity: 20.1469 Epoch [1/1], Step [500/6471], Loss: 2.8526, Perplexity: 17.3336 ... Epoch [1/1], Step [6100/6471], Loss: 2.0174, Perplexity: 7.518914 Epoch [1/1], Step [6200/6471], Loss: 2.8182, Perplexity: 16.7471 Epoch [1/1], Step [6300/6471], Loss: 2.2008, Perplexity: 9.03214 Epoch [1/1], Step [6400/6471], Loss: 2.1075, Perplexity: 8.22739 Epoch [1/1], Step [6471/6471], Loss: 2.0801, Perplexity: 8.00519
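As a quick sanity check on those numbers, the perplexity column is just the exponentiated cross-entropy loss, which is exactly how the training loop computes it:

```python
import numpy as np

loss = 2.0801               # the final step's loss from the log above
perplexity = np.exp(loss)   # perplexity = e^loss
print(round(perplexity, 3)) # ≈ 8.005, matching the printed value
```

Intuitively, a perplexity of ~8 means the model is about as uncertain at each step as if it were choosing uniformly among 8 words – a big improvement over the ~52 it started with.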
A short training cycle like that didn't take too long, but let's see how well the model performs! If testing goes well, then I'd continue training from here.
Now, on to the most exciting part: seeing whether this model can actually do anything! But, before I can demo some predictions, I need to assemble a whole new pipeline for testing and inference – feel free to skip down to Section 4.4 if you just wanna see this model take a few swings.
The following cells replicate my data transforms from earlier for use in testing, then print out a new image from the data loader in `test` mode:
import sys
sys.path.append('/opt/cocoapi/PythonAPI')
from pycocotools.coco import COCO
from data_loader import get_loader
from torchvision import transforms
# Re-define transforms for testing
transform_test = transforms.Compose([
transforms.Resize(256),
transforms.RandomCrop(224),
transforms.ToTensor(),
transforms.Normalize((0.485, 0.456, 0.406),
(0.229, 0.224, 0.225))])
# Create data loader
data_loader = get_loader(transform=transform_test,
mode="test")
Vocabulary successfully loaded from vocab.pkl file!
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Obtain sample image before and after pre-processing.
orig_image, image = next(iter(data_loader))
# Visualize sample image, before pre-processing.
plt.imshow(np.squeeze(orig_image))
plt.title('example image')
plt.show()
The model can easily be loaded with the pickle files that were saved at the end of training; these files are essentially checkpoints that allow the model to be quickly rebuilt as it was at the moment training ended, with all of the same settings. The only settings that weren't saved were `embed_size` and `hidden_size`, so those must be re-defined explicitly.
The following does just that:
import os
import torch
from model import EncoderCNN, DecoderRNN
# Load the following pickles
encoder_file = "encoder-1.pkl"
decoder_file = "decoder-1.pkl"
# Re-define size
embed_size = 256
hidden_size = 512
# Vocabulary
vocab_size = len(data_loader.dataset.vocab)
# Initialize encoder and decoder; set each to INFERENCE
encoder = EncoderCNN(embed_size)
encoder.eval()
decoder = DecoderRNN(embed_size, hidden_size, vocab_size)
decoder.eval()
# Load trained weights
encoder.load_state_dict(torch.load(os.path.join('./models', encoder_file)))
decoder.load_state_dict(torch.load(os.path.join('./models', decoder_file)))
# Move models to GPU if CUDA available.
encoder.to(device)
decoder.to(device)
DecoderRNN( (embed): Embedding(8855, 256) (lstm): LSTM(256, 512, batch_first=True) (fc): Linear(in_features=512, out_features=8855, bias=True) )
The actual work from this section can be found in the `DecoderRNN` class of `model.py` as the `sample` method. This method receives the Tensor features from the model and outputs a Python list that represents the predicted caption sentence, with each index of that list containing the next predicted word of the full predicted sentence.
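As a hedged sketch of how `sample` can work (written here as a standalone function rather than a method, reusing the decoder's `embed`, `lstm`, and `fc` layers), greedy decoding runs the LSTM one step at a time and feeds each prediction back in as the next input:

```python
import torch

def sample(decoder, inputs, states=None, max_len=20):
    """Greedy decoding sketch: start from the embedded image features, then
    feed each predicted word's embedding back in as the next input."""
    output = []
    with torch.no_grad():
        for _ in range(max_len):
            hiddens, states = decoder.lstm(inputs, states)  # one LSTM step
            scores = decoder.fc(hiddens.squeeze(1))         # [1, vocab_size]
            predicted = scores.argmax(dim=1)                # most likely token
            output.append(int(predicted.item()))
            inputs = decoder.embed(predicted).unsqueeze(1)  # [1, 1, embed_size]
    return output
```

The `clean_sentence` function shown later handles stripping the `<start>` and `<end>` tokens out of this list.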
The following code, provided by Udacity, just checks whether my implementation of `sample` breaks anything:
# Move image Pytorch Tensor to GPU if CUDA available
image = image.to(device)
# Obtain embedded image features
features = encoder(image).unsqueeze(1)
# Pass embedded image features through model to get predicted caption
output = decoder.sample(features)
print('example output:', output)
assert (type(output)==list), "Output needs to be a Python list"
assert all([type(x)==int for x in output]), "Output should be a list of integers."
assert all([x in data_loader.dataset.vocab.idx2word for x in output]), "Each entry in the output needs to correspond to an integer that indicates a token in the vocabulary."
example output: [0, 3, 33, 30, 21, 3, 33, 30, 39, 46, 18, 1, 1, 18, 1, 1, 18, 1, 1, 18]
The following function simply takes the output of `sample`, removes the `<start>` and `<end>` tokens, and combines the list elements into a single string; in other words, it cleans it:
def clean_sentence(output):
sentence = ""
for i in output:
word = data_loader.dataset.vocab.idx2word[i]
if i == 0: # 0 = START
continue
elif i == 1: # 1 = END
break
else:
sentence = sentence + " " + word
return sentence.strip()
Seeing whether this step works, in a way, confirms whether the whole pipeline works – oh, the suspense! Here goes:
sentence = clean_sentence(output)
print('example sentence:', sentence)
assert type(sentence)==str, 'Sentence needs to be a Python string!'
example sentence: a street sign with a street sign on it .
What a profound thing!
Finally, we have arrived! The following `get_prediction` function grabs the next image from the loader, runs it through the network, cleans the output, and prints the caption!

Here's the definition of the `get_prediction` function:
def get_prediction():
orig_image, image = next(iter(data_loader))
plt.imshow(np.squeeze(orig_image))
plt.title('Sample Image')
plt.show()
image = image.to(device)
features = encoder(image).unsqueeze(dim=1)
output = decoder.sample(features)
sentence = clean_sentence(output)
print(sentence)
I'll share some hits and misses in a moment; but, for now, here's a cell where you can simply mash `get_prediction()` to your heart's content. If you find any particularly funny ones, you know I'd love to see them!

Note: if you are viewing this project on GitHub Pages, this cell will not be interactive. Before you can "mash `get_prediction()` to your heart's content", you will need to clone the repo. ☹️
get_prediction()
a woman is holding a box of doughnuts .
Here are some selected examples of the model producing accurate captions for the given image:
get_prediction()
a zebra is eating hay from a feeder .
get_prediction()
a man is surfing on a wave in the ocean .
And here are some selected examples of the model not totally understanding what's happening in the images:
get_prediction()
a group of people standing around a table with a cake .
get_prediction()
a red stop sign sitting on top of a pole .
Hey, you win some, and you lose some.
Thanks for reading!