Facial Keypoint Detector

Made by SeanvonB | Source

This is a project from my Computer Vision Nanodegree, which I completed in 2020. It's an exploration of using Convolutional Neural Networks (CNNs) with Haar Cascades to plot facial keypoints (also called "landmarks") on images containing one or more faces. This work provides the foundation for any number of other hypothetical projects, like facial filters, facial tracking, facial pose recognition, emotion recognition, occupancy summation, and so on.

Per Udacity's instruction, the project was broken down into four steps:

  1. Load and visualize data to become familiar with the dataset
  2. Define network architecture and train model
  3. Deploy the model as a facial keypoint detector
  4. Do important data science, i.e. put sunglasses and moustaches on faces

1.0 Examine Data

Image data for this project consists of 5,770 full-color stills taken from the YouTube Faces Database, each paired with its associated known keypoints. A random selection of 3,462 images forms the training set, i.e. the ones used to train new models, and the remaining 2,308 form the testing set, i.e. the ones used to test trained models. The keypoints are stored in .csv files that can also be found in the ./data/ directory. Here are a couple of examples with their associated keypoints:

Oh no, Priyanka – you've got something on your face! Kit, you look fine.

Those pink dots are the keypoints. There are 68 of them, each with their own accessible coordinates, and they identify important facial structures. Each keypoint is individually numbered, and ranges of numbers can be used to select specific facial regions.

They generally increase left-to-right, and they identify regions as follows:

This will be important later.

1.1 Types of Data

Now, the image data must be retrieved, loaded, and repackaged into a new form that maintains association with the keypoint data.

First, the image data is retrieved and unzipped.

The information about the images and keypoints in this dataset is summarized in .csv files, which can be read in using pandas (🐼). From the training CSV, I collect the keypoint annotations as an (N, 2) array, where N is the number of keypoints and 2 is the dimension of each keypoint's coordinates: (x, y).

Note: when I mention the dimensions of something, think of those dimensions as the columns and rows of a table. Because that's what they are. If an object has 3 dimensions, you can think of them as a stack of 2-dimensional tables or maybe as a Rubik's cube. Again, that's roughly what they are.

The following will retrieve the image data and unzip it into the ./data/ directory:

Next, the .csv files must be read in by pandas (🐼); and, while that's imported, everything else may as well be imported too.

Finally, images and known keypoints can be bundled together into the arrays mentioned above.
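A minimal sketch of that bundling step, assuming the CSV layout described above (the file name here is an assumption; check ./data/ for the actual names):

    import pandas as pd

    # File name is assumed; the training and testing CSVs live in ./data/
    key_pts_frame = pd.read_csv('data/training_frames_keypoints.csv')

    # Column 0 is the image file name; the remaining 136 columns are the 68 (x, y) pairs
    n = 0
    image_name = key_pts_frame.iloc[n, 0]
    key_pts = key_pts_frame.iloc[n, 1:].to_numpy().astype('float').reshape(-1, 2)

    print('Image name:', image_name)
    print('Keypoints shape:', key_pts.shape)  # -> (68, 2)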

1.2 Check Images

Below, the function show_keypoints takes in an image and keypoints and displays them.
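A sketch of what such a helper can look like (the notebook's version may differ in details):

    import matplotlib.pyplot as plt

    def show_keypoints(image, key_pts):
        """Show an image with its keypoints plotted on top."""
        plt.imshow(image)
        plt.scatter(key_pts[:, 0], key_pts[:, 1], s=20, marker='.', c='m')
        plt.show()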

This is a good moment to just check out a whole bunch of different images and develop a feel for the dataset and the objective. Important takeaways from this step are that the images are all different shapes and sizes, so they'll have to be normalized in at least a couple different ways.

1.3 Dataset Class

The following is a modified version of the Dataset class from PyTorch and directions for using it, both provided by Udacity.

torch.utils.data.Dataset is an abstract class representing a dataset. This class will allow us to load batches of image/keypoint data, and uniformly apply transformations to our data, such as rescaling and normalizing images for training a neural network.

Your custom dataset should inherit Dataset and override the following methods:

  • __len__ so that len(dataset) returns the size of the dataset.
  • __getitem__ to support the indexing such that dataset[i] can be used to get the i-th sample of image/keypoint data.

Let's create a dataset class for our face keypoints dataset. We will read the CSV file in __init__ but leave the reading of images to __getitem__. This is memory efficient because the images are not all stored in memory at once but read as required.

A sample of our dataset will be a dictionary {'image': image, 'keypoints': key_pts}. Our dataset will take an optional argument transform so that any required processing can be applied on the sample. We will see the usefulness of transform in the next section.
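Following those directions, the dataset class can look roughly like this (the class name and the CSV column layout are assumptions based on the description above):

    import os
    import matplotlib.image as mpimg
    import pandas as pd
    from torch.utils.data import Dataset

    class FacialKeypointsDataset(Dataset):
        """Bundles each image with its keypoints as {'image': ..., 'keypoints': ...}."""

        def __init__(self, csv_file, root_dir, transform=None):
            # Read the CSV once, up front; images are read lazily in __getitem__
            self.key_pts_frame = pd.read_csv(csv_file)
            self.root_dir = root_dir
            self.transform = transform

        def __len__(self):
            return len(self.key_pts_frame)

        def __getitem__(self, idx):
            image_name = os.path.join(self.root_dir, self.key_pts_frame.iloc[idx, 0])
            image = mpimg.imread(image_name)
            key_pts = self.key_pts_frame.iloc[idx, 1:].to_numpy().astype('float').reshape(-1, 2)
            sample = {'image': image, 'keypoints': key_pts}
            if self.transform:
                sample = self.transform(sample)
            return sample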

Awesome.

The previous code cell instantiates the dataset, while the following fires off a few random requests to see what the dataset returns.

1.4 Transforms

Likewise, Udacity provided some transform functions for preprocessing the images – but I think these may just be copies of torchvision functions? Regardless, images need to be preprocessed, because networks expect inputs of a certain size (224x224 seems popular) and a particular color normalization. Also, as required by PyTorch, everything created thus far must be converted from NumPy lists and arrays into Tensors.

Here's what Udacity provides:

  • Normalize: to convert a color image to grayscale values with a range of [0, 1] and normalize the associated keypoints to be within a range of about [-1, 1].
  • Rescale: to rescale an image to a desired size.
  • RandomCrop: to crop an image randomly.
  • ToTensor: to convert numpy images to torch images.

We will write them as callable classes instead of simple functions so that parameters of the transform need not be passed every time it's called. For this, we just need to implement the __call__ method and (if we require parameters to be passed in) the __init__ method. We can then use a transform like this:

tx = Transform(params)
transformed_sample = tx(sample)

Observe below how these transforms are generally applied to both the image and its keypoints.
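For example, a Rescale along these lines scales the keypoints by the same factors as the image (a sketch, not the notebook's exact implementation):

    import cv2

    class Rescale:
        """Rescale the image so its smaller side equals output_size, keypoints included."""

        def __init__(self, output_size):
            self.output_size = output_size

        def __call__(self, sample):
            image, key_pts = sample['image'], sample['keypoints']
            h, w = image.shape[:2]
            # Preserve aspect ratio: the smaller side becomes output_size
            if h > w:
                new_h, new_w = self.output_size * h / w, self.output_size
            else:
                new_h, new_w = self.output_size, self.output_size * w / h
            new_h, new_w = int(new_h), int(new_w)
            image = cv2.resize(image, (new_w, new_h))
            # Keypoints move with the image
            key_pts = key_pts * [new_w / w, new_h / h]
            return {'image': image, 'keypoints': key_pts}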

Fabulous.

1.5 Test the Transforms

This just fires off some transforms to see what they do and whether they do what was expected. They do.

1.6 Create the Transformed Dataset

Finally, the tools are assembled, and the actual dataset (that will be used momentarily for training) can be created. The image is rescaled such that the smaller dimension is 256 pixels, a 224x224 segment is randomly cropped from somewhere within bounds, color normalization and the grayscale filter are applied, and finally the image can be converted to a Tensor.
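Roughly, the composition looks like this (the CSV and directory names are assumptions; Rescale, RandomCrop, Normalize, and ToTensor are the callable classes described above):

    from torchvision import transforms

    data_transform = transforms.Compose([
        Rescale(256),     # smaller side scaled to 256 px
        RandomCrop(224),  # random 224x224 crop within bounds
        Normalize(),      # grayscale, pixels in [0, 1], keypoints roughly in [-1, 1]
        ToTensor()        # NumPy arrays -> PyTorch tensors
    ])

    transformed_dataset = FacialKeypointsDataset(
        csv_file='data/training_frames_keypoints.csv',
        root_dir='data/training/',
        transform=data_transform
    )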

This data_transform pipeline will be used repeatedly throughout this project.

2.0 Define the Convolutional Neural Network

This section will progress through assembling the model; training the network; benchmarking its performance; and, if it doesn't perform well, repeating the process until it does.

2.1 CNN Architecture

Convolutional Neural Networks (CNNs) are composed of a few specific layers:

I also use dropout layers, which help prevent overfitting, i.e. when the model memorizes the training set and can't generalize beyond it. Dropout layers do slow down the training process, but – within the safety of this personal project – I can just increase the learning rate and lower the number of epochs. Reckless training won't crash any autonomous cars here.

The model I build in model.py is just a copy of NaimishNet from this scholarly paper, which sounds like I'm being academic, but I'm really just letting someone smarter think for me – thanks, gang! So, the architecture looks something like this:

2.2 PyTorch Neural Networks

Udacity provides the following notes on using PyTorch:

To define a neural network in PyTorch, you define the layers of a model in the function __init__ and define the feedforward behavior of a network that employs those initialized layers in the function forward, which takes in an input image tensor, x. The structure of this Net class is shown below and left for you to fill in.

Note: During training, PyTorch will be able to perform backpropagation by keeping track of the network's feedforward behavior and using autograd to calculate the update to the weights in the network.

Define the Layers in __init__

As a reminder, a conv/pool layer may be defined like this (in __init__):

  # 1 input image channel (for grayscale images), 32 output channels/feature maps, 3x3 square convolution kernel
  self.conv1 = nn.Conv2d(1, 32, 3)
  # maxpool that uses a square window of kernel_size=2, stride=2
  self.pool = nn.MaxPool2d(2, 2)

Refer to Layers in forward

Then referred to in the forward function like this, in which the conv1 layer has a ReLU activation applied to it before maxpooling is applied:

x = self.pool(F.relu(self.conv1(x)))

Best practice is to place any layers whose weights will change during the training process in __init__ and refer to them in the forward function; any layers or functions that always behave in the same way, such as a pre-defined activation function, should appear only in the forward function.
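Put together, the skeleton of such a class looks something like this (a stripped-down sketch, not the NaimishNet variant actually defined for this project; layer sizes are illustrative):

    import torch.nn as nn
    import torch.nn.functional as F

    class Net(nn.Module):
        def __init__(self):
            super().__init__()
            # Layers with learnable weights belong in __init__
            self.conv1 = nn.Conv2d(1, 32, 5)            # 224x224 -> 220x220
            self.pool = nn.MaxPool2d(2, 2)              # 220x220 -> 110x110
            self.fc1 = nn.Linear(32 * 110 * 110, 136)   # 136 = 68 keypoints * (x, y)

        def forward(self, x):
            # Stateless operations like F.relu appear only here
            x = self.pool(F.relu(self.conv1(x)))
            x = x.view(x.size(0), -1)
            return self.fc1(x)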

Import model

You are tasked with defining the network in the models.py file so that any models you define can be saved and loaded by name in different notebooks in this project directory. For example, by defining a CNN class called Net in models.py, you can then create that same architecture in this and other notebooks by simply importing the class and instantiating a model:

    from models import Net
    net = Net()

Originally, it was important that model.py was a separate file, because this used to be multiple Jupyter notebooks and each needed access to it. However, I've since merged all the notebooks, and model.py remains more as a vestige of that workflow than as an example of good separation of concerns.

2.3 Transform the Data

Again, this appears to already have been done above, but this notebook used to be multiple notebooks and there was some overlap. It doesn't hurt to see the transformation pipeline again; and, this time, it calls from data_load.py, which is a better workflow.

What follows are the transforms I defined:

2.4 Batch the Data

Batch size is another one of those hyperparameters that (a) I theoretically understand, but (b) I still just randomly twist the knob up and down. Smaller batch sizes can help a network generalize better, but larger batch sizes will train faster. In my reading, I see many people suggesting 32 as an acceptable all-around size; but, in this case, I had some spare time and dropped the size to 10 – testing improved by an amount so small that it was probably just variance.

Just pop the transformed dataset into PyTorch's DataLoader.
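Something like this (variable names carried over from the earlier cells):

    from torch.utils.data import DataLoader

    batch_size = 10

    train_loader = DataLoader(transformed_dataset,
                              batch_size=batch_size,
                              shuffle=True,
                              num_workers=4)   # Windows users: see the note below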

From Udacity:

Note for Windows users: Please change the num_workers to 0 or you may face some issues with your DataLoader failing.

Good note, thanks.

2.5 Load the Test Data

While loading the training set, why not also load the testing set? The test data isn't shown to the model during training, as the name implies. Unseen data like this will be the only accurate way to confirm whether the network can accomplish the objective, because the network will have seen the contents of the training set many thousands of times by the time training finishes.
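Same recipe, pointed at the test CSV and directory (names assumed):

    test_dataset = FacialKeypointsDataset(csv_file='data/test_frames_keypoints.csv',
                                          root_dir='data/test/',
                                          transform=data_transform)

    test_loader = DataLoader(test_dataset,
                             batch_size=batch_size,
                             shuffle=True,
                             num_workers=4)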

2.6 Apply Model to Test Sample

Currently, the model has random starting weights and isn't remotely "intelligent" yet; but that doesn't mean it can't be given samples!

For some sanity checking, I assembled this pipeline earlier than necessary and passed a sample through it, so I could see the base performance of the untrained model. Plus, this step will also help detect leaks in the pipe.
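In code, that first pass looks roughly like this (net and test_loader come from the cells above):

    import torch

    # Pull one batch from the test loader and run it through the untrained net
    sample = next(iter(test_loader))
    images = sample['image'].type(torch.FloatTensor)   # (batch, 1, 224, 224)

    with torch.no_grad():
        output_pts = net(images)

    # 136 outputs per face -> reshape into 68 (x, y) pairs
    output_pts = output_pts.view(output_pts.size(0), 68, -1)
    print(output_pts.shape)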

2.7 Debugging

At this point, I was seeing size errors, because the NaimishNet architecture didn't transfer verbatim. The following was helpful for debugging:
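Something in this spirit, pushing a dummy batch through the network to surface size mismatches (a sketch, not the notebook's exact cell):

    import torch

    # A fake grayscale 224x224 batch; a mis-sized fully-connected layer
    # will raise a RuntimeError naming the offending shapes
    dummy = torch.randn(1, 1, 224, 224)
    out = net(dummy)
    print(out.shape)   # expect torch.Size([1, 136])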

2.8 Visualize Predicted Keypoints

Udacity provided this show_all_keypoints function to visualize both the predicted keypoints (pink) and the actual keypoints (green):

But that's not all. Once the model completes the feed-forward process on a sample and returns an inference, the sample remains transformed – it'll need to be un-transformed before it can be appreciated by human eyeballs.

That's where the following helper function comes in:

Alright, cool – so, when the model doesn't yet know any better, it just places all the keypoints in the center. That's interesting.

2.9 Loss Function and Optimizer

All that remains before training the network is setting the loss function and optimizer and defining the hyperparameters.

In my image classifier, cross entropy loss was the obvious choice for a classification task, i.e. one that produces many outputs each equal to the probability of something being true. However, in a regression task like this one, there will only be two outputs per keypoint: an x coordinate and a y coordinate. According to this documentation, regression can use mean squared error (or MSE) instead.

As for the optimizer, Adam was a good friend in high school, so I always give him a chance first. Yee, boi!
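The setup itself is two lines (the learning rate here is illustrative):

    import torch.nn as nn
    import torch.optim as optim

    criterion = nn.MSELoss()                             # regression over (x, y) coordinates
    optimizer = optim.Adam(net.parameters(), lr=0.001)   # Adam gets the first shot, as always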

2.10 Training

At this point, everything has been explained – probably more than once! But some guesswork does remain. I always start with just a few epochs to save time and determine whether everything works before finally jumping up to 20 epochs for the real run. See you on the other side!
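The training loop is the standard PyTorch pattern (a sketch, using the names defined above):

    import torch

    def train_net(n_epochs):
        net.train()
        for epoch in range(n_epochs):
            running_loss = 0.0
            for data in train_loader:
                images = data['image'].type(torch.FloatTensor)
                key_pts = data['keypoints'].type(torch.FloatTensor)
                key_pts = key_pts.view(key_pts.size(0), -1)   # flatten to match the 136 outputs

                optimizer.zero_grad()
                output_pts = net(images)
                loss = criterion(output_pts, key_pts)
                loss.backward()
                optimizer.step()

                running_loss += loss.item()
            print(f'Epoch {epoch + 1}, avg. loss: {running_loss / len(train_loader):.4f}')

    train_net(n_epochs=20)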

2.11 Testing

The previous training routine took about 2 hours to complete, but I ran some variation of the last couple steps several times before accepting this model. Below, you'll find the results of this test – remember: the pink points are predicted by the model, while the green points are the actual, known keypoints.

Uh-oh, I still see some signs of overfitting. Notice how the predicted points tend to be slightly more centralized than the actual keypoints, or how the predicted points don't deform their structure quite as much when the face isn't looking directly into the camera?

This was a pretty good showing, however; I definitely saw worse on previous runs. So, I'll move on to...

2.12 Save the Model
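Nothing fancy here: just the learned weights, so the architecture can be re-created from the class definition when loading (directory and file name below are assumptions):

    import torch

    model_dir = 'saved_models/'
    model_name = 'keypoints_model.pt'

    torch.save(net.state_dict(), model_dir + model_name)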

2.13 Nanodegree Questions

The following are questions that I had to answer when submitting this project to Udacity for my Computer Vision Nanodegree:

Question 1: What optimization and loss functions did you choose and why?

Answer: I used Adam, because it seems to be the latest hotness that all the gals down in Stack Overflow are talking about. For loss function, I used MSE, but I don't know whether it was an impactful choice – all I know is that it had to be a function suitable to a regression problem.

Question 2: What kind of network architecture did you start with and how did it change as you tried different architectures? Did you decide to add more convolutional layers or any layers to avoid overfitting the data?

Answer: My network is based on the NaimishNet example, because I doubt that I can do better than people who have dedicated their lives to the study of this subject. Oh, wait, I actually did try to do better! I added fully-connected layers to NaimishNet, but they only increased overfitting. Today I Learned.

Question 3: How did you decide on the number of epochs and batch_size to train your model?

Answer: From what I can tell, 20 epochs seems to be an accepted standard for a proof of concept. I chose a batch size of 10, because smaller batches have been shown to improve a model's ability to generalize. But I've since observed that batch sizes that are divisible by 8 seem to be more fashionable; so, in the future, I'll probably default to 16 or 32.

Question 4: Choose one filter from your trained CNN; what purpose do you think it plays? What kind of feature do you think it detects?

Answer: The first filter (index=0) doesn't do anything dramatic – it's really just a gentle blur. However, I wonder whether mild filters like this one are also beneficial; perhaps they preserve certain features for extraction by subsequent layers.

2.14 Feature Visualization

Now that I have a "working" model, the first thing I wanna do is jump into those guts and see what's happening inside. Aren't you curious what these images look like after they've been transformed, convolved, and maxpooled? Let's take a look at what's going on inside a CNN.

First, an example of what a convolution filter looks like:

This little 5x5 block slides left-to-right, top-to-bottom across the image, multiplying the pixels underneath by its weights and summing the results. That seemingly random pattern might highlight important features and/or reduce everything else to a featureless gray soup. The end result of this filtering process is called a feature map and can look something like this:

This particular feature map might function as edge detection for a light source originating from the top-left corner, but feature maps aren't created intentionally or with such clear purpose. Basically, the network values this feature map for some reason, because the filter weights were incrementally shifted until this map was created – but I can only speculate why.
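A sketch of how a filter and its feature map can be pulled out and inspected (assumes net and test_loader from the earlier cells; the filter index is arbitrary):

    import cv2
    import matplotlib.pyplot as plt

    # First conv layer's weights: (out_channels, in_channels, k, k)
    weights = net.conv1.weight.data.numpy()
    filter_0 = weights[0][0]                 # one filter, single grayscale input channel

    plt.imshow(filter_0, cmap='gray')        # the little block described above
    plt.show()

    # Apply that filter to one transformed image to produce its feature map
    sample = next(iter(test_loader))
    gray_image = sample['image'][0][0].numpy().astype('float32')
    feature_map = cv2.filter2D(gray_image, -1, filter_0)
    plt.imshow(feature_map, cmap='gray')
    plt.show()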

3.0 Face and Facial Keypoint Detection

Now that the network has been trained and proven (somewhat) effective through testing, it can apply facial keypoints to any image – for proper function, that image ought to include at least one human face, but I suppose you're free to confuse the network by passing something else through.

However, the network currently only knows how to detect facial keypoints; it doesn't know how to detect faces yet. The dataset had already been preprocessed in such a way that the faces contained within were more or less the focal point of the image. And what about images with more than one face?

In section 3, the pipeline will be extended to include the following:

  1. Detect every face within an image using a face detector, so those faces can be cropped out of the original image
  2. Preprocess those cropped samples into the format the network expects, i.e. grayscale Tensors of size 224x224
  3. Apply the trained model and be amazed!

Let's get back into it...

The following will display the selected image (again, you could replace this with any image at this point), completely unprocessed:
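That's just an OpenCV read plus a colorspace conversion (the file name is an assumption):

    import cv2
    import matplotlib.pyplot as plt

    # OpenCV loads images as BGR; flip to RGB before displaying with matplotlib
    image = cv2.imread('images/obamas.jpg')
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    plt.imshow(image)
    plt.show()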

Looking good, Obamas!

3.1 Detect All Faces

In order to detect where exactly their faces are, or even how many faces there are, I'm using a pre-trained Haar Cascade detector via OpenCV. The detector I'm using is located in the ./detectors/ directory and works by using tiny 2x2 and 3x3 filters like these, which are called Haar features:

You'll notice that, unlike the filter shown above, Haar features are strictly 100% or 0% – there are no shades of gray. They also have specific, prescribed patterns, which are useful for locating the edges of shapes in images.

It's called a Haar Cascade detector, because it uses a cascade of classifiers that all use Haar features. So, rather than constantly running and re-running thousands of "is this a face?" checks, the cascade begins by running a few "could this region possibly contain a face?" checks. If the region can't, then the detector tosses out that region and moves on to the next. But, if the region can, the cascade continues through a series of other classifiers that check for increasingly specific indicators that a face is present – if any of these fail, the detector tosses it out and moves on again.
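In code, the whole cascade is a couple of calls (the .xml file name inside ./detectors/ is an assumption):

    import cv2

    # Load the pre-trained frontal-face cascade and run it on a grayscale copy
    face_cascade = cv2.CascadeClassifier('detectors/haarcascade_frontalface_default.xml')
    gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)

    # Returns one (x, y, w, h) bounding box per detected face
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.2, minNeighbors=2)

    # Draw the boxes for a quick visual check
    image_with_detections = image.copy()
    for (x, y, w, h) in faces:
        cv2.rectangle(image_with_detections, (x, y), (x + w, y + h), (255, 0, 0), 3)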

Here's an example of successful Haar Cascade detection:

3.2 Load Trained Model

As before, this project used to be multiple notebooks. So, loading the model again isn't necessary anymore, but it's always good to see the full process.

3.3 Keypoint Detection

This time, when I loop over the faces in the sample image, I will transform each selection according to what the network expects; once again, that's a grayscale Tensor of size 224x224 with normalized pixel values. The transforms will be the same ones from data_load.py that I have already used several times.

Udacity provides this helpful hint for how to handle faces that vary significantly from the size the model expects:

Hint: The sizes of faces detected by a Haar detector and the faces your network has been trained on are of different sizes. If you find that your model is generating keypoints that are too small for a given face, try adding some padding to the detected roi before giving it as input to your model.
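Putting the hint and the transforms together, the per-face loop looks roughly like this (the padding amount and variable names are assumptions; faces, net, and image come from the earlier steps):

    import cv2
    import numpy as np
    import torch

    image_copy = np.copy(image)
    pad = 50   # extra context around the Haar box, per the hint above

    for (x, y, w, h) in faces:
        # Crop a padded region of interest around the detected face
        roi = image_copy[max(y - pad, 0):y + h + pad, max(x - pad, 0):x + w + pad]

        # Match the training format: grayscale, [0, 1] pixels, 224x224, (1, 1, 224, 224) tensor
        roi = cv2.cvtColor(roi, cv2.COLOR_RGB2GRAY)
        roi = roi / 255.0
        roi = cv2.resize(roi, (224, 224))
        roi_tensor = torch.from_numpy(roi).type(torch.FloatTensor).unsqueeze(0).unsqueeze(0)

        # Predict the 68 keypoints for this face; remember that the outputs are still
        # in the normalized space used during training and need un-normalizing before
        # plotting, as in section 2.8
        with torch.no_grad():
            output_pts = net(roi_tensor)
        key_pts = output_pts.view(68, 2).numpy()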

Finally, the image can be passed through the network, and hopefully something like this will emerge:

That looks pretty good!

But I do see further examples of the model overfitting, like how the points aren't elastic enough to accommodate Barack's beautiful chin.

4.0 Facial Filters

Alright, friends – the network can detect facial keypoints, and it can now detect faces! There's only one thing left to do before the singularity can begin: the network must be capable of putting "Deal With It" sunglasses on any detected face! Here's what I'm shooting for:

First, the sunglasses...

4.1 Overlay Sunglasses

Like me, you might recognize the concept of an alpha channel from Photoshop. But, now, that channel has a novel use: it can be used to decide which pixels from the sunglasses image (all pixels with alpha > 0) should be overlaid on the face image and, once scaled, where they should be placed relative to the image anchor.

The last puzzle to solve will be how to place that anchor...

Let's flashback to the example of numbered facial keypoints:

Using this example as the key, everything from eye width to nose length can be calculated programmatically! It will require some calibration to figure out exactly how sunglasses fit on someone's face; but, once the variables have been nailed down, no face will be safe!

But there will be one important hitch to remember: the numbers above are off by one, because the keypoint array will begin with keypoint[0].

Now, I gotta grab some keypoints and start calibrating...
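The calibration boils down to picking anchor keypoints and scaling the overlay to fit. A sketch of the overlay step, in which the file name and the specific keypoint indices are assumptions to be tuned against the numbered map above, and key_pts is assumed to already be in the original image's pixel coordinates:

    import cv2

    # Load the sunglasses with the alpha channel intact (4 channels)
    sunglasses = cv2.imread('images/sunglasses.png', cv2.IMREAD_UNCHANGED)

    # Anchor and size the overlay from a few keypoints (indices are illustrative)
    x = int(key_pts[17, 0])                           # outer edge of one eyebrow
    y = int(key_pts[17, 1])
    w = int(abs(key_pts[17, 0] - key_pts[26, 0]))     # eyebrow-to-eyebrow width
    h = int(abs(key_pts[27, 1] - key_pts[34, 1]))     # roughly the length of the nose

    new_sunglasses = cv2.resize(sunglasses, (w, h), interpolation=cv2.INTER_CUBIC)
    roi_color = image_copy[y:y + h, x:x + w]

    # Copy only the pixels where alpha > 0; leave the face showing everywhere else
    alpha = new_sunglasses[:, :, 3]
    roi_color[alpha > 0] = new_sunglasses[alpha > 0][:, :3]
    image_copy[y:y + h, x:x + w] = roi_color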

Heck yeah! Data science!

Made by SeanvonB | Source