AI Image Classifier

Made by SeanvonB | Source

This was the final project of my AI Programming Nanodegree from Udacity, which I completed in 2019. The purpose of this project was to train an image classifier for use in a hypothetical smartphone app – in our case, an app that could identify the name of a flower simply by looking at it with the phone's camera. For training such a network, we were given the 102 Category Flower Dataset from the University of Oxford's Visual Geometry Group, or VGG, which contains between 40 and 258 images for each of the 102 categories. Here are a few examples of these images and their associated class-to-name identities:

Per Udacity's instruction, the project was broken down into three steps:

It's also worth mentioning that this project uses flowers as an example, but the model could theoretically classify images according to any dataset that can be made available to it. Replace the data folder with VGG's Pet Dataset and tweak a couple links, re-run the training cycle, then – bam! – this project becomes a pet breed classifier!

Loading Packages

This project uses PyTorch plus Torchvision and PIL to preprocess images, assemble the model architecture, and train the network. It also relies on Matplotlib to present data and NumPy for matrix multiplication.

The training cycles will default to CUDA for GPU training when available. With CUDA, the network can reach passable accuracy in about an hour; without CUDA, I didn't even try – CPU training would likely take many hours, if not days.

Loading the Data

The first step was separating small test and valid subsets from the bulk of the dataset, which will remain as the train subset. The first two represent data that the model doesn't see during the training step, so they can be used to measure performance between (valid) or after (test) the training cycles. This step was already completed but could have been done programmatically as well.

Next, all three sets must be resized and cropped to a size of 224x224 pixels, which are the dimensions required by most (including VGG's) pre-trained networks. But, to help the network generalize better and overfit less, the train set can also be subjected to a selection of Torchvision transformations, like random rotation, random cropping, random flipping, and color jittering – these functionally expand the set by creating random variations of the same images.

Finally, the pre-trained networks available to me were all trained on ImageNet data, which normalizes each color channel separately. After converting the images to tensors, this normalization happens simply by passing them the mean and standard deviation of each color channel as calculated from ImageNet, which shifts and scales each channel's values to be roughly centered on zero.

Mapping Labels to Names

This step is a minor but extremely necessary formality. The dataset isn't labeled in a way that's helpful to humans: there isn't a class labeled daffodil – but there is a class labeled 42 that contains a whole bunch of daffodil images. The file class_to_name.json contains all such associations and can be used to interpret the network's results.
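The mapping itself is just a plain JSON object of label strings to names. A minimal sketch – the inline string below stands in for the real class_to_name.json file:

```python
import json

# In the project this comes from class_to_name.json; the inline string
# here just illustrates the shape of the mapping.
mapping_json = '{"42": "daffodil"}'
class_to_name = json.loads(mapping_json)

# Keys are the numeric class labels (as strings), values are flower names
print(class_to_name["42"])  # → daffodil
```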

Building the Classifier

The network will be built upon a pre-trained network, but it will feature a new and untrained feed-forward network as the classifier. I tested AlexNet, DenseNet, and ResNet, but I ultimately found that the first one I tried, VGG, worked equally well. So I stuck with it.

My classifier is a pretty simple little layer cake of linear, ReLU, and dropout layers, each gradually reducing the number of features until it reaches the number of discrete classes within the dataset. The dropout chance starts low but increases, as suggested in Dropout: A Simple Way to Prevent Neural Networks from Overfitting.

Training the Classifier

All that remains to be done before baking this cake is defining the hyperparameters and training cycle. Hyperparameters – like the number of epochs, testing frequency, batch size, and learning rate – mostly determine the resources, primarily time, that will be allocated to training.

Udacity expected a final accuracy of > 70%, but I hit 86.5% without too much time investment. Higher accuracy is easily attainable, but it might require cloud computing time to be realistic.

Testing the Network

The network will see versions of the training set images many thousands of times, and without proper precautions, it will begin learning how to identify those specific images rather than learning how to generally identify what those images contain. The same is true for the validation set, despite the network seeing it less frequently. This behavior is called overfitting.

To confirm whether the accuracy seen during training is indeed accurate, the trained network should be tested with the unseen images in the testing set. Fortunately, this network classifies the testing set as accurately as the validation set, so we can say that this network isn't overfitting.

Saving and Loading the Checkpoint

If testing went well enough, the checkpoint deserves to be saved and compared to future checkpoints, until an acceptable winner emerges. This checkpoint can then be reloaded for further inference.

A "checkpoint" includes everything needed to completely rebuild the model as it stood at the end of testing: the classifier, the hyperparameters, the optimizer state, etc.

Inference for Classification

Now to start classifying stuff! Once a high-performing checkpoint emerges, it can be loaded into another pipeline: a simplified version of the preprocessing and feed-forward stages of training, with a single image as the input and the top K output probabilities as the result, each roughly the likelihood of that image belonging to that class.

Image Preprocessing

Just like training, inference also requires some preprocessing, because the network was trained with a narrow expectation of input – namely, a 224x224 image with colors normalized according to ImageNet. To be safe, I resize the images such that the shortest dimension is 256 pixels, then I crop a 224x224 section from the middle; this helps the network focus on the center of the image.

Two more things: PyTorch expects the color channel to be the first matrix dimension, while PIL and NumPy put it third – but a quick NumPy transpose can reorder those. Likewise, the network expects the image to be transformed and normalized in specific ways, but our eyeballs expect the images to be as they were – better to undo that as well.
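A sketch of that inference preprocessing with PIL and NumPy – the function name is mine, but the steps follow the description above: resize the shortest side to 256, center-crop to 224x224, normalize with the ImageNet statistics, and transpose the color channel to the front:

```python
import numpy as np
from PIL import Image

def process_image(image):
    """Turn a PIL image into a normalized (3, 224, 224) NumPy array."""
    # Resize so the shortest side is 256 pixels, preserving aspect ratio
    w, h = image.size
    if w < h:
        image = image.resize((256, int(256 * h / w)))
    else:
        image = image.resize((int(256 * w / h), 256))
    # Crop a 224x224 section from the middle
    w, h = image.size
    left, top = (w - 224) // 2, (h - 224) // 2
    image = image.crop((left, top, left + 224, top + 224))
    # Scale to 0-1, then normalize with the ImageNet mean and std
    array = np.array(image) / 255.0
    array = (array - [0.485, 0.456, 0.406]) / [0.229, 0.224, 0.225]
    # PIL/NumPy put the color channel last; PyTorch expects it first
    return array.transpose(2, 0, 1)
```

Displaying the image again is just these steps in reverse: transpose the channel back to last, multiply by the std and add the mean, and clip to the 0-1 range.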

Class Prediction

Finally, the full pipeline is complete: everything from inference to label-name mapping to results presentation can be combined, an image can be given as input, and a classification of the image can be expected as output.

Sanity Checking

It's wise to return more than just the highest single probability, however, because the added information can confirm how certain the network is about the classification it has made. For instance, in the following inference, there are strong runners-up, which means there's some significant uncertainty:

Now, let's see whether mine can do better...

Booyah! That's a wild pansy, bay bee!
Thank you for reading!
