Robustness and Security in ML Systems, Spring 2021
Jonathan Soma
January 19, 2021
Y. Le Cun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel
a.k.a. Le Cun 1990
Chief AI Scientist (and several other titles) at Facebook, “founding father of convolutional nets.”
All kinds of badly programmed computers thought that “Le” was my middle name. Even the science citation index knew me as “Y. L. Cun”, which is one of the reasons I now spell my name “LeCun”.
From Yann’s Fun Stuff page
How to turn handwritten ZIP codes from envelopes into numbers
ONLY concerned with converting a single digit’s image into a number
Input: 16x16 grid of greyscale values, from -1 to 1
Normalized from the ~40x60px original, preserving aspect ratio. The network needs a consistent input size!
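The greyscale rescaling can be sketched in a couple of lines. This assumes 8-bit source pixels (0–255), which the slides don't specify; the paper only fixes the target range of -1 to +1.

```python
# Hypothetical rescaling: map an 8-bit greyscale value (0..255) into
# the [-1, +1] range the network expects. The 8-bit source depth is
# an assumption, not from the paper.
def scale_pixel(value):
    return value / 127.5 - 1.0

print(scale_pixel(0), scale_pixel(255))  # -> -1.0 1.0
```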
To a human: Potential classes of 0, 1, 2…9
To the computer: Ten nodes, activated from -1 to +1. Higher value means higher probability of it being that digit. More or less one-hot encoding.
Given the output
[0 0 0.5 1 0 -0.3 -0.5 0 0.75 0]
The network’s prediction is three, because that node has the highest score (1.0). The next most probable digit is an eight with a score of 0.75.
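Reading a prediction off the output layer is just an argmax over the ten node activations. A minimal sketch using the vector above:

```python
# The predicted digit is the index of the highest-activated output
# node; the runner-up is the second-highest.
output = [0, 0, 0.5, 1, 0, -0.3, -0.5, 0, 0.75, 0]

ranked = sorted(range(10), key=lambda i: output[i], reverse=True)
prediction, runner_up = ranked[0], ranked[1]
print(prediction, runner_up)  # -> 3 8
```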
Not fully-connected. “A fully connected network with enough discriminative power for the task would have far too many parameters to be able to generalize correctly.”
A convolution is used to “see” patterns around a pixel, such as horizontal, vertical, or diagonal edges.
It’s just linear algebra: a kernel is applied to create a new version of each pixel dependent on the pixels around it. The kernel (or convolution matrix) is a small matrix of weights that is multiplied element-wise against each pixel and its surroundings, then summed.
Edges of the image are padded with -1 so the kernel can be applied to the outermost pixels. The result is called a feature map.
The -1 +1 range of each feature map highlights a specific type of feature at a specific location.
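The convolution-plus-padding step can be sketched in pure Python. The -1 padding and 5x5 kernel size come from the slides; the padding amount (6 per side, growing the 16x16 input to 28x28) is inferred from the 24x24 H1 maps, and the uniform kernel weights below are purely illustrative — the real kernels are learned.

```python
# Sketch of the convolution step. Assumes the 16x16 input is padded
# out to 28x28 with -1 before a "valid" 5x5 convolution, which yields
# the 24x24 feature maps described for H1.
def pad(image, amount, value=-1.0):
    """Surround the image with a border of `value` pixels."""
    width = len(image[0]) + 2 * amount
    top = [[value] * width for _ in range(amount)]
    body = [[value] * amount + row + [value] * amount for row in image]
    bottom = [[value] * width for _ in range(amount)]
    return top + body + bottom

def convolve(image, kernel):
    """Each output pixel is the weighted sum of the kernel-sized
    neighborhood of input pixels ("valid" convolution, no overhang)."""
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    return [[sum(kernel[i][j] * image[r + i][c + j]
                 for i in range(k) for j in range(k))
             for c in range(cols)]
            for r in range(rows)]

digit = [[0.0] * 16 for _ in range(16)]       # stand-in 16x16 input
kernel = [[0.04] * 5 for _ in range(5)]       # illustrative weights
feature_map = convolve(pad(digit, 6), kernel)
print(len(feature_map), len(feature_map[0]))  # -> 24 24
```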
Four different 5x5 kernels are applied, creating four different 576-node feature maps that each highlight a different type of feature.
We don’t need all that detail, though! Layer H2 averages the 24x24 feature maps down to 12x12, converting each local 2x2 block of four nodes in H1 to a single node in H2.
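The averaging step can be sketched as 2x2 average pooling over non-overlapping blocks. Note this shows only the averaging; in the paper the subsampling layers also apply a trainable coefficient, bias, and squashing function.

```python
# 2x2 average pooling: each output node is the mean of a
# non-overlapping 2x2 block of the input feature map.
def average_pool(fmap):
    out = []
    for r in range(0, len(fmap), 2):
        row = []
        for c in range(0, len(fmap[0]), 2):
            block = (fmap[r][c] + fmap[r][c + 1]
                     + fmap[r + 1][c] + fmap[r + 1][c + 1])
            row.append(block / 4.0)
        out.append(row)
    return out

fmap = [[1.0] * 24 for _ in range(24)]  # stand-in 24x24 H1 map
pooled = average_pool(fmap)
print(len(pooled), len(pooled[0]))      # -> 12 12
```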
H3 is another feature layer, operating just like H1 but with 12 8x8 feature maps. Each kernel is again 5x5.
Note that not all H3 kernels are applied to all H2 feature maps. The selection is “guided by prior knowledge of shape recognition.” This simplifies the network.
H4 is similar to H2, in that it averages the previous layer. This reduces H3’s 8x8 size to 4x4.
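The size bookkeeping across all four hidden layers can be checked with a few lines, assuming (as above) a 28x28 padded input: a “valid” 5x5 convolution shrinks each side by 4, and 2x2 pooling halves it.

```python
# Layer-size arithmetic for the network: convolve, pool, convolve, pool.
def conv_size(n, k=5):
    return n - k + 1   # 5x5 valid convolution shrinks each side by 4

def pool_size(n):
    return n // 2      # 2x2 averaging halves each side

h1 = conv_size(28)     # 24x24 feature maps
h2 = pool_size(h1)     # 12x12
h3 = conv_size(h2)     # 8x8
h4 = pool_size(h3)     # 4x4
print(h1, h2, h3, h4)  # -> 24 12 8 4
```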
10 nodes, fully connected to H4. Each activates between -1 and +1 with a higher score meaning a more likely prediction for that digit.
Robust model that generalizes very well when presented with unusual representations of digits.
Throughput is mainly limited by the normalization step! Reaches 10-12 classifications per second.