Transfer learning lost to my small CNN on FER2013

For my second PyTorch project I wanted something that would push back harder than MNIST, and FER2013 fit the description nicely: 35,887 grayscale 48×48 face crops sorted into the seven emotions (Angry, Disgust, Fear, Happy, Sad, Surprise and Neutral) that the dataset insists every face falls into. It is a famously rough dataset, with human accuracy estimated at around 65% and published state of the art sitting in the low 70s, and I was not trying to beat any of that. What I wanted was to carry the infrastructure I had built for the MNIST classifier into a problem that would actually resist me, and along the way to test an assumption I had absorbed somewhere without ever quite checking, which is that transfer learning is always the easy win. The code is on GitHub. The result that came out the other end was that my hand-rolled CNN beat a frozen VGG16 backbone, clearing 49% test accuracy by its second epoch where the VGG16 model peaked at 44.7% over fifty.

The imbalance problem

FER2013 is badly lopsided, with something like 7,200 training images of Happy faces against 436 of Disgust, and if you train a model naively on a split like that it learns to be confidently happy at everything it sees. The genuinely nasty part is that the accuracy on paper still looks reasonable, because Happy so dominates the distribution that a model can coast on it and the headline number quietly hides the fact that the model has given up on the rare classes entirely. I corrected for this in two places at once, which with a 16× imbalance felt earned rather than redundant. At the dataloader a WeightedRandomSampler weighted each sample by the inverse of its class frequency, so that every batch came out roughly balanced no matter what the underlying ratio was, and for the transfer-learning model I also passed class weights into CrossEntropyLoss so that a missed Disgust example cost the network more than a missed Happy one.
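The two corrections can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper names make_balanced_loader and make_weighted_loss, the batch size, and the way the loss weights are rescaled are all assumptions. Only WeightedRandomSampler, inverse class frequency, and a weighted CrossEntropyLoss come from the description above.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler


def make_balanced_loader(dataset, labels, num_classes, batch_size=64):
    """Hypothetical helper: returns (loader, inverse-frequency class weights)."""
    labels = torch.as_tensor(labels, dtype=torch.long)
    class_counts = torch.bincount(labels, minlength=num_classes).float()
    # Inverse class frequency; the clamp only guards against division by zero.
    class_weights = 1.0 / class_counts.clamp(min=1.0)
    # Every sample inherits the weight of its class.
    sample_weights = class_weights[labels]
    sampler = WeightedRandomSampler(
        weights=sample_weights,
        num_samples=len(labels),  # draw as many samples per epoch as the dataset holds
        replacement=True,         # rare-class images have to be drawn repeatedly
    )
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
    return loader, class_weights


def make_weighted_loss(class_weights):
    """Hypothetical helper: CrossEntropyLoss that penalizes rare-class misses more."""
    # Assumed normalization: rescale so the weights average to 1.
    return nn.CrossEntropyLoss(weight=class_weights / class_weights.mean())
```

Because the sampler draws with replacement, an epoch no longer visits every image exactly once: rare-class images repeat and some common-class images are skipped. That is the price of batches that come out roughly balanced.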

The self-made model

FER2013V1 is three convolutional blocks and a small classifier head. Each block has the same shape, a pair of Conv2d → BatchNorm → GELU layers followed by a MaxPool(2), and the head runs Flatten → Dropout(0.25) → Linear(2304, 64) → GELU → Dropout(0.25) → Linear(64, 64) → GELU → Linear(64, 7). The one genuinely unusual choice was the channel counts. The textbook instinct is to widen as you go deeper, 64 then 128 then 256, and I went the other way entirely, 256 then 128 then 64. The reasoning was that a 48×48 input does not have much spatial structure to expand into, and what it does have is fine-grained pixel patterns, the curve of a mouth and the angle of a brow, which benefit from a wide early layer that can capture all that texture before the maxpools start compressing it away. By the time the feature map is down to 6×6, 64 channels is plenty to describe what is left of it. It is a choice that only makes sense for genuinely tiny inputs, and I would not carry it over to anything larger. I trained it with AdamW(lr=0.001, weight_decay=1e-2) and CrossEntropyLoss, at about 1.3M parameters in total.
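Written out, the architecture looks something like the sketch below. The layer order, channel counts, head sizes, optimizer and loss follow the description above. The 3×3 kernels with padding 1, and the choice to change the channel count in the first convolution of each pair, are assumptions. With those assumptions the parameter count lands at about 1.3M, which is at least consistent with the figure quoted.

```python
import torch
from torch import nn


def conv_block(in_ch, out_ch):
    # Two Conv2d -> BatchNorm -> GELU layers, then a 2x2 max-pool that halves H and W.
    # kernel_size=3 with padding=1 is an assumption; it leaves H and W unchanged
    # until the pool.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.GELU(),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.GELU(),
        nn.MaxPool2d(2),
    )


class FER2013V1(nn.Module):
    def __init__(self, num_classes=7):
        super().__init__()
        # Channels narrow with depth: 256 -> 128 -> 64.
        self.features = nn.Sequential(
            conv_block(1, 256),    # 48x48 -> 24x24
            conv_block(256, 128),  # 24x24 -> 12x12
            conv_block(128, 64),   # 12x12 -> 6x6
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),          # 64 channels * 6 * 6 = 2304 features
            nn.Dropout(0.25),
            nn.Linear(2304, 64),
            nn.GELU(),
            nn.Dropout(0.25),
            nn.Linear(64, 64),
            nn.GELU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))


model = FER2013V1()
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=1e-2)
loss_fn = nn.CrossEntropyLoss()
num_params = sum(p.numel() for p in model.parameters())
```

Most of those parameters sit in the second convolution of the first block (256 → 256 channels), which is the direct cost of putting the width at the front of the network.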

Transfer learning was supposed to be the easy win

The second model was the textbook recipe: take VGG16 pretrained on ImageNet, freeze the feature extractor, bolt on a fresh 7-class head, and train only that head. ImageNet has fourteen million images behind it, so VGG16’s filters surely had to be better than anything I could train from scratch on 28k tiny grayscale faces, or at least that was the assumption I was going in to test. Making it actually run took some massaging on the input side: converting the grayscale to three channels; upscaling the 48×48 to 64×64, which is smaller than VGG’s native 224 to keep training tractable but still large enough that the conv stack would not collapse to a single pixel; applying ImageNet’s normalization statistics; and replacing the 1000-class head with Linear(25088, 512) → BatchNorm → Dropout(0.5) → Linear(512, 256) → BatchNorm → Dropout(0.3) → Linear(256, 7). I added a ReduceLROnPlateau scheduler too, since I was expecting long, slow training to be the order of the day. I trained it for 50 epochs, roughly 36 minutes on a Colab GPU, and the best test accuracy it reached was 44.7%, hit at epoch 32 and never bettered after. Meanwhile the self-made CNN had cleared 49% by epoch 2, before my Colab session disconnected on me.

Why the bigger model lost

It made sense once I had sat with it for a while. VGG16’s early filters are tuned for natural-image statistics, the color edges and ImageNet textures and the object-scale gradients you find in a photograph of a dog, and none of that is what separates “sad” from “neutral” in a 48×48 grayscale face. The signal there lives in small geometric differences around the eyes and the mouth, and the pretrained filters have no particular reason to be sensitive to it. The resolution only made things worse: even after upscaling, a 64×64 input is 3.5 times smaller per side than the 224×224 images VGG was trained on, so after the network’s five pooling stages the feature map has collapsed to 2×2, leaving the classifier almost no spatial information to work with. And because the backbone is frozen, the network cannot adapt its way out of any of this. The training loss kept dropping, which is the head memorizing whatever signal does make it through, but the test loss never followed, which is the classic shape of a head overfitting on top of a base network locked into the wrong inductive bias. It is the thing the transfer-learning tutorials never quite say out loud, that pretrained features only help you when your problem actually looks like the pretraining task, and 48×48 grayscale faces do not look like 224×224 ImageNet photos.
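The arithmetic behind the collapsing feature map is short enough to check by hand or in a few lines of plain Python. The function name is made up for illustration. It relies on the standard VGG16 layout, in which the 3×3 convolutions use padding 1 and only the five 2×2 max-pools shrink the map.

```python
def final_feature_map_side(input_side, num_pools=5):
    """Side length of the feature map after VGG16's five 2x2, stride-2 max-pools.

    The padded 3x3 convolutions leave the size unchanged, so each pool
    simply halves the side length, rounding down.
    """
    side = input_side
    for _ in range(num_pools):
        side //= 2
    return side


for side in (224, 64, 48):
    out = final_feature_map_side(side)
    print(f"{side}x{side} input -> {out}x{out} final feature map")
```

The native 224×224 input ends at 7×7, the upscaled 64×64 input at 2×2, and the raw 48×48 crops would end at a single 1×1 cell.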

What carried over from MNIST

The entire helper_functions.py module from the MNIST project dropped in almost unchanged: train, set_seeds, save_best_model, create_writer and the Cloudflare-tunnel TensorBoard helper all worked as they were. That was the whole thesis of the last post, that the model is the small part and the workflow is the thing that transfers, and two projects in, it is still holding up. The only extension I made was adding an optional scheduler argument to train so that ReduceLROnPlateau could step on the test loss each epoch, which is a small change and one worth keeping.
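The scheduler hook might look something like this. The real train in helper_functions.py is not shown in this post, so the version below is a stripped-down, hypothetical stand-in: its signature, the history dictionary, and the toy data are all assumptions, and it leaves out device handling, checkpointing and TensorBoard logging. The one detail taken from the text is that the scheduler steps on the test loss once per epoch.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def train(model, train_loader, test_loader, optimizer, loss_fn, epochs, scheduler=None):
    """Hypothetical minimal training loop with an optional plateau scheduler."""
    history = {"train_loss": [], "test_loss": [], "lr": []}
    for _ in range(epochs):
        model.train()
        running = 0.0
        for X, y in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(X), y)
            loss.backward()
            optimizer.step()
            running += loss.item()
        train_loss = running / len(train_loader)

        model.eval()
        running = 0.0
        with torch.no_grad():
            for X, y in test_loader:
                running += loss_fn(model(X), y).item()
        test_loss = running / len(test_loader)

        # ReduceLROnPlateau needs the monitored metric, unlike most schedulers.
        if scheduler is not None:
            scheduler.step(test_loss)

        history["train_loss"].append(train_loss)
        history["test_loss"].append(test_loss)
        history["lr"].append(optimizer.param_groups[0]["lr"])
    return history


# Toy usage on random data, only to show the call shape.
torch.manual_seed(0)
X, y = torch.randn(32, 4), torch.randint(0, 3, (32,))
loader = DataLoader(TensorDataset(X, y), batch_size=8)
toy_model = nn.Linear(4, 3)
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)
toy_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    toy_optimizer, mode="min", factor=0.5, patience=2  # assumed values
)
history = train(toy_model, loader, loader, toy_optimizer,
                nn.CrossEntropyLoss(), epochs=3, scheduler=toy_scheduler)
```

Keeping the argument optional means existing callers, such as the MNIST script, keep working without passing a scheduler at all.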

What’s next

The self-made model is clearly the right base, so the next round is about pushing it further rather than swapping it out. There is a clean 50-epoch run of it to finish end to end, since the disconnect cut me off at epoch 2, and a confusion matrix to plot, because I suspect Disgust and Fear are bleeding into each other even with the sampler doing its job. I want to try unfreezing the last block of VGG rather than freezing the whole feature extractor, to see whether partial fine-tuning closes any of the gap, and to swap the plain conv blocks for a ResNet-style residual block written from scratch. There is also data augmentation to tune for faces specifically, small rotations and horizontal flips but no vertical flip, since a person upside down is not really a person.
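For the partial fine-tuning experiment, one way to unfreeze only the last convolutional block is sketched below. Nothing here has been run against FER2013, and the helper name unfreeze_last_block is hypothetical. It treats everything after the second-to-last MaxPool2d as the last block, which for torchvision's VGG16 feature stack should pick out the three convolutions of the fifth block.

```python
from torch import nn


def unfreeze_last_block(features: nn.Sequential) -> int:
    """Freeze a conv stack, then unfreeze the layers after its second-to-last pool.

    Returns the index of the first unfrozen layer.
    """
    for param in features.parameters():
        param.requires_grad = False

    pool_indices = [i for i, layer in enumerate(features)
                    if isinstance(layer, nn.MaxPool2d)]
    if len(pool_indices) < 2:
        raise ValueError("expected at least two MaxPool2d layers")

    start = pool_indices[-2] + 1
    for layer in features[start:]:
        for param in layer.parameters():
            param.requires_grad = True
    return start


# Toy stand-in for a VGG-style stack, to show the effect without downloading weights.
toy_features = nn.Sequential(
    nn.Conv2d(1, 4, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(4, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(8, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)
first_trainable = unfreeze_last_block(toy_features)
```

The optimizer would then need to be built from the parameters that still have requires_grad set, so that the unfrozen block actually receives updates.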

Takeaway

The cliché I walked in with was that you always start with transfer learning, and the result I walked out with is more careful than that. Pretrained features help when the source and target tasks share an inductive bias, and mostly do not otherwise. A network designed for the actual problem and trained from scratch on the actual data can comfortably beat a much larger pretrained one that was never going to fit the shape of what you were asking it to do.