Robustness and Security in ML Systems, Spring 2021
Jonathan Soma
January 19, 2021
Y. Le Cun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel
a.k.a. LeCun90c
Chief AI Scientist (and several other titles) at Facebook, “founding father of convolutional nets.”
All kinds of badly programmed computers thought that “Le” was my middle name. Even the science citation index knew me as “Y. L. Cun”, which is one of the reasons I now spell my name “LeCun”.
From Yann’s Fun Stuff page
How to turn handwritten ZIP codes from envelopes into numbers
ONLY concerned with converting a single digit’s image into a number
Inspiration for CNNs, based on the relationship between the human eye and brain. A large difference is that LeCun used backprop, which makes the paper much simpler to read and the output more effective!
After segmentation, ~40x60 pixel greyscale image.
How much of this information do we need?
What is the information we need?
16x16 grid of greyscale values, from -1 to 1.
Normalized from ~40x60px original, preserving aspect ratio. Network needs consistent size!
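A minimal sketch of that resizing step, assuming Pillow and NumPy and a hypothetical digit.png crop; the paper uses a linear transformation, so this is only meant to show the idea:

```python
import numpy as np
from PIL import Image

# Load a segmented digit crop (hypothetical file) as greyscale.
digit = Image.open("digit.png").convert("L")

# Shrink to fit inside 16x16 while preserving the aspect ratio.
digit.thumbnail((16, 16))

# Paste onto a 16x16 canvas filled with the background colour (white paper = 255).
canvas = Image.new("L", (16, 16), 255)
offset = ((16 - digit.width) // 2, (16 - digit.height) // 2)
canvas.paste(digit, offset)

# Map [0, 255] into [-1, 1] so the white background becomes -1 and dark ink +1
# (assumed polarity, matching the -1 padding used later).
x = 1.0 - np.asarray(canvas, dtype=np.float32) / 127.5
print(x.shape)  # (16, 16)
```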
To a human: Potential classes of 0, 1, 2 … 9.
To the computer: Ten nodes, activated from -1 to +1. Higher value means higher probability of it being that digit. More or less one-hot encoding.
Given the output
[0 0 0 0.5 0 -0.3 -0.5 0 0 0.75]
The network’s prediction is nine because 0.75 is the highest value. Next most probable is a three with a score of 0.5.
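In code, picking the prediction is just an argmax over the ten output activations; a NumPy sketch using the example vector above:

```python
import numpy as np

# One output node per digit class 0-9, each activated in [-1, 1].
output = np.array([0, 0, 0, 0.5, 0, -0.3, -0.5, 0, 0, 0.75])

prediction = int(np.argmax(output))      # index of the highest activation
runner_up = int(np.argsort(output)[-2])  # second-highest activation

print(prediction, runner_up)  # 9 3
```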
Not fully-connected. “A fully connected network with enough discriminative power for the task would have far too many parameters to be able to generalize correctly.”
A convolution is used to “see” patterns around a pixel like horizontal, vertical or diagonal edges.
It’s just linear algebra: a kernel is applied to create a new version of a pixel dependent on the pixels around it. The kernel (or convolution matrix) is just a small matrix that is multiplied element-wise against each pixel and its surroundings, then summed.
Edges of the image are padded with -1 to allow the kernel to be applied to the outermost pixels. The result is called a feature map.
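A rough NumPy sketch of how one feature map is produced, assuming the 16x16 input is padded out to 28x28 with -1 (which is how a 5x5 kernel yields a 24x24 map); the vertical-edge kernel here is a hand-written placeholder, since the paper’s kernels are learned by backprop:

```python
import numpy as np

def feature_map(image, kernel, pad_to=28, pad_value=-1.0):
    """Pad the image with pad_value, then slide the kernel over every position."""
    pad = (pad_to - image.shape[0]) // 2
    padded = np.pad(image, pad, constant_values=pad_value)
    k = kernel.shape[0]
    out = pad_to - k + 1                      # 28 - 5 + 1 = 24
    fmap = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            patch = padded[i:i + k, j:j + k]  # a pixel and its surroundings
            fmap[i, j] = np.sum(patch * kernel)
    return fmap

image = np.random.uniform(-1, 1, (16, 16))    # stand-in normalized digit
# Hypothetical vertical-edge kernel; the real kernels are learned, not hand-picked.
kernel = np.tile([[-1.0, -1.0, 0.0, 1.0, 1.0]], (5, 1))
print(feature_map(image, kernel).shape)       # (24, 24) -> 576 nodes
```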
The -1 to +1 range of each feature map highlights a specific type of feature at a specific location.
Four different 5x5 kernels are applied, creating four different 576-node feature maps that each highlight a different type of feature.
We don’t need all that detail, though! Layer H2 averages the 24x24 feature maps down to 12x12, converting local sets of 4 nodes in H1 to a single node in H2.
H3 is another feature layer, operating just like H1 but generating twelve 8x8 feature maps. Each kernel is again 5x5.
Note that not all H3 kernels are applied to all H2 layers. Selection is “guided by prior knowledge of shape recognition.” This simplifies the network.
H4 is similar to H2, in that it averages the previous layer.
This reduces H3’s 8x8 size to 4x4.
10 nodes, fully connected to H4. Each activates between -1 and +1, with a higher score meaning a more likely prediction for that digit.
Robust model that generalizes very well when presented with unusual representations of digits.
Throughput is mainly limited by the normalization step! Reaches 10-12 classifications per second.
But then they went to sleep for one of the many AI winters. For successful deep learning you generally need lots of labeled data and lots of compute:
ImageNet: 15 million labeled high-res images, belonging to ~22,000 categories. Labeled by people on Mechanical Turk.
ILSVRC: ImageNet Large-Scale Visual Recognition Challenge - subset of ImageNet with 1,000 images in each of 1,000 categories
Up until 2012, entrants mostly relied on fast, manually engineered features fed into SVMs.
Best performance error rates:
Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton
Another CNN! But bigger, deeper, and much more optimized.
But also remarkably similar to LeCun90c!
Input was variable-resolution natural images, downsampled to 256x256.
Why are we keeping RGB values? What are downsides of using RGB?
Not the 256x256 image! 224x224 crops instead.
Top left, top right, bottom left, bottom right, and center. Reflected, too. Can actually create these on CPU while the GPU is working.
What’s the point of doing this? They’re basically the same image!
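A minimal NumPy sketch of that crop-and-reflect augmentation, assuming a 256x256 RGB array; the helper name ten_crops is made up:

```python
import numpy as np

def ten_crops(image, size=224):
    """Five 224x224 crops (four corners + center) plus their horizontal reflections."""
    h, w, _ = image.shape
    corners = [(0, 0), (0, w - size), (h - size, 0), (h - size, w - size),
               ((h - size) // 2, (w - size) // 2)]
    crops = [image[y:y + size, x:x + size] for y, x in corners]
    crops += [np.fliplr(c) for c in crops]    # mirrored versions
    return np.stack(crops)                    # (10, 224, 224, 3)

image = np.zeros((256, 256, 3), dtype=np.uint8)  # stand-in photo
print(ten_crops(image).shape)
```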
Very deep - LeCun’s only had two convolutional layers. Why do we suddenly have all these extra layers?
What is missing compared to LeCun’s digit analysis?
LeCun90c: “Averaging layer,” take 2x2 pixel area and condense to 1 pixel
AlexNet: “Pooling” - take every other pixel and pool (take the max) over it and its surrounding 8 pixels. This resizes the same amount, but with overlap!
Happens on layers 1, 2 and 5. Why wouldn’t you do this on every layer? Why do this at all?
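A small NumPy sketch contrasting the two schemes, with window/stride values as described above (2x2 stride-2 averaging vs. 3x3 stride-2 overlapping max pooling):

```python
import numpy as np

def pool(fmap, window, stride, reduce_fn):
    """Generic pooling: slide a window with the given stride, reduce each patch."""
    h, w = fmap.shape
    out_h = (h - window) // stride + 1
    out_w = (w - window) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = fmap[i * stride:i * stride + window,
                         j * stride:j * stride + window]
            out[i, j] = reduce_fn(patch)
    return out

fmap = np.random.uniform(-1, 1, (24, 24))                         # e.g. one feature map
lecun_style = pool(fmap, window=2, stride=2, reduce_fn=np.mean)   # non-overlapping average
alexnet_style = pool(fmap, window=3, stride=2, reduce_fn=np.max)  # overlapping max
print(lecun_style.shape, alexnet_style.shape)
```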
Force competition between kernels at the same location.
If a lot of kernels have high levels of activity, adjust so only the most active ones express themselves.
a.k.a. if an area has a few features, the network mostly notices the most obvious ones.
Occurs after convolution but before max pooling.
Why not just focus on all of the features?
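A hedged sketch of local response normalization following the formula in the AlexNet paper (k=2, n=5, alpha=1e-4, beta=0.75), on activations shaped (channels, height, width):

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Divide each activation by the summed squares of its n neighboring kernel maps."""
    channels = a.shape[0]
    b = np.empty_like(a)
    for i in range(channels):
        lo = max(0, i - n // 2)
        hi = min(channels, i + n // 2 + 1)
        denom = (k + alpha * np.sum(a[lo:hi] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b

activations = np.random.rand(96, 55, 55)   # e.g. ReLU outputs of the first conv layer
print(local_response_norm(activations).shape)
```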
Uses ReLUs instead of hyperbolic tangent for neuron model.
Why is training speed important?
Can do in 5 epochs what would have taken ~37 epochs with tanh!
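The two activations side by side in NumPy; ReLU is cheaper to compute and doesn’t saturate for positive inputs, which is where the training speedup comes from:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)   # non-saturating for x > 0

def tanh_unit(x):
    return np.tanh(x)           # saturates toward -1/+1, so gradients shrink

x = np.linspace(-3, 3, 7)
print(relu(x))
print(tanh_unit(x))
```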
Limiting factor for the model is training time. Decreasing training time = room for a bigger dataset or more training epochs.
Wrote a GPU implementation of 2D convolutions. GPUs are excellent at running computations in parallel, which allowed a much larger CNN than previous work.
In the end, the network’s size is limited mainly by the amount of memory available on current GPUs and by the amount of training time that we are willing to tolerate. Our network takes between five and six days to train on two GTX 580 3GB GPUs. All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available.
Feature maps are spread across two GPUs; the GPUs only talk to each other between the 2nd and 3rd layers and after the last conv layer.
Parallelization = speedup = more training epochs
Similar to a “columnar” CNN.
It would be nice to combine multiple prediction models, but too expensive even with the speedups!
Instead, DROPOUT: during training, randomly set half of the neurons to 0 and don’t let them participate. Almost like having multiple models, and only roughly doubles training time.
It’s like breaking your hand and having to write with your non-dominant one.
Only used in the first two fully-connected layers.
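A minimal dropout sketch, assuming the paper’s p=0.5 and its “scale outputs at test time” convention (modern libraries usually use inverted dropout instead):

```python
import numpy as np

def dropout(activations, p=0.5, training=True, rng=np.random.default_rng()):
    """Randomly zero out neurons during training; scale outputs at test time."""
    if training:
        mask = rng.random(activations.shape) >= p   # keep each neuron with prob 1-p
        return activations * mask
    return activations * (1 - p)   # test time: use all neurons, halve their output

hidden = np.random.rand(4096)      # e.g. one of the fully-connected layers
print(np.count_nonzero(dropout(hidden)))   # roughly half survive
```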
Amazing performance! Blows the competition out of the water!
Also tested against ImageNet 2009, top-1 and top-5 error rates were 67.4% and 40.9% compared to the best published results of 78.1% and 60.9%.
After 2017, the ImageNet challenge stopped being hosted because the models had gotten too good, beating humans!
7.3% top-5 error! 19 layers.
Stacking convolutional layers: Use multiple 3x3 layers instead of larger layers. Smaller filters, but deeper network!
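A quick parameter count showing why the stacking trick saves weights, assuming a hypothetical layer with C input and C output channels:

```python
# Two stacked 3x3 conv layers cover the same 5x5 receptive field as one 5x5 layer,
# but with fewer weights (and an extra non-linearity in between).
C = 256                          # hypothetical channel count
one_5x5 = 5 * 5 * C * C          # 1,638,400 weights
two_3x3 = 2 * (3 * 3 * C * C)    # 1,179,648 weights
print(one_5x5, two_3x3)
```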
6.7% top-5 error! 22 layers.
Inception module: What size convolution should we use? All of them! Then let the network figure out which one to pay attention to. Far fewer parameters.
3.57% top-5 error! 152 layers!
Residual blocks: More layers aren’t always better! In a residual block, the output of layer n feeds into layer n+1 but also skips ahead to layer ~n+3.
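A bare-bones sketch of that skip connection, with a hypothetical layer_fn standing in for the block’s conv layers:

```python
import numpy as np

def residual_block(x, layer_fn):
    """Output = transformed input + the input itself (the shortcut connection)."""
    return layer_fn(x) + x   # gradients can flow straight through the identity path

# Hypothetical stand-in for a couple of layers with matching input/output shapes.
layer_fn = lambda x: np.maximum(0.0, x @ np.random.randn(64, 64) * 0.01)

x = np.random.randn(32, 64)                # a batch of 32 feature vectors
print(residual_block(x, layer_fn).shape)   # (32, 64)
```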