
Afaik weight decay is inspired by L2 regularisation, which goes back to linear regression, where L2 regularisation is equivalent to placing a zero-mean Gaussian prior on the weights.
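A minimal sketch of the connection (hypothetical function names, plain SGD only; the equivalence famously breaks for adaptive optimizers like Adam):

```python
# Sketch: for plain SGD, adding 0.5*lam*w**2 to the loss ("L2 regularisation")
# and shrinking the weight by lr*lam*w each step ("weight decay") coincide.

def step_l2_penalty(w, grad, lr, lam):
    # gradient of loss + 0.5*lam*w**2 w.r.t. w is grad + lam*w
    return w - lr * (grad + lam * w)

def step_weight_decay(w, grad, lr, lam):
    # apply the raw gradient, then decay the weight multiplicatively
    return (1.0 - lr * lam) * w - lr * grad

w, grad, lr, lam = 0.8, 0.25, 0.1, 0.01
print(step_l2_penalty(w, grad, lr, lam))   # same value from both steps
print(step_weight_decay(w, grad, lr, lam))
```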

Note that L1 regularisation produces much more sparsity, but it usually doesn't perform as well.



This. Weight decay is just a method of shrinking weights toward zero, a standard technique statisticians have used for regularization for decades. As far as I understand, it goes back at least to Tikhonov, and in the regression context it was popularized as ridge regression (Hoerl and Kennard, 1970). Ordinary least squares minimizes the sum of squared residuals, i.e. the squared L2 norm of the residual vector. Adding a penalty on the squared L2 norm of the weights (which amounts to adding a scalar multiple of the identity matrix to the normal equations) biases the model toward small weights. This helps with underdetermined or ill-conditioned systems and gives a better-conditioned model matrix that can actually be solved numerically in a stable way.
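A sketch of those normal equations for the two-feature case (hypothetical function name `ridge_2feature`; the 2x2 system is solved by Cramer's rule just to keep it dependency-free):

```python
# Ridge regression: w = (X^T X + lam*I)^{-1} X^T y.
# With lam > 0 the matrix stays invertible even when the columns of X
# are perfectly collinear, which is where plain OLS breaks down.

def ridge_2feature(X, y, lam):
    # Accumulate the entries of X^T X + lam*I and of X^T y
    a = sum(x[0] * x[0] for x in X) + lam   # [0][0]
    b = sum(x[0] * x[1] for x in X)         # off-diagonal
    d = sum(x[1] * x[1] for x in X) + lam   # [1][1]
    p = sum(x[0] * yi for x, yi in zip(X, y))
    q = sum(x[1] * yi for x, yi in zip(X, y))
    det = a * d - b * b                     # nonzero whenever lam > 0
    return ((p * d - q * b) / det, (q * a - p * b) / det)

# Perfectly collinear columns: OLS (lam = 0) is singular, ridge is not,
# and ridge splits the weight evenly between the duplicated features.
X = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
y = [1.0, 2.0, 3.0]
w = ridge_2feature(X, y, 0.1)
print(w)
```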

It's kind of amazing to watch this from the sidelines: a process of engineers getting ridiculously impressive results from some combination of sheer hackery and ingenuity, great data pipelining and engineering, extremely large datasets, extremely fast hardware, and computational methods that scale very well, while at the same time gradually relearning lessons and reinventing techniques that statisticians worked out over half a century ago.


L1 drops weights to exactly zero; L2 shrinks them toward zero (the Gaussian-prior bias).
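A tiny one-dimensional illustration of that difference (hypothetical function names; the closed forms are the standard shrinkage and soft-thresholding solutions):

```python
import math

# Minimise 0.5*(w - a)**2 + penalty(w) in one dimension.
# L2 penalty 0.5*lam*w**2 gives w = a/(1 + lam): shrinks, but never exactly zero.
# L1 penalty lam*abs(w) soft-thresholds: w = sign(a)*max(abs(a) - lam, 0): exact zeros.

def l2_solution(a, lam):
    return a / (1.0 + lam)

def l1_solution(a, lam):
    return math.copysign(max(abs(a) - lam, 0.0), a)

lam = 0.5
print(l2_solution(0.3, lam))  # small but nonzero (~0.2)
print(l1_solution(0.3, lam))  # exactly 0.0 -- this is where sparsity comes from
print(l1_solution(2.0, lam))  # large inputs survive, shifted by lam
```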

It's not always relearning lessons, or people blindly trying things, either: many researchers use the underlying math to inform decisions about network optimization. If you're seeing that, it's probably a corner of the field where people are newer to some of the math behind it, and that will change as things get more established.

The underlying mathematics behind these kinds of systems is what has motivated a lot of the improvements in hlb-CIFAR10, for example. I don't think I would have been able to get there without sitting down with the fundamentals, planning, thinking, and working a lot, and then executing. There is a good place for blind empirical research too, but it loses its utility past a certain point of overuse.


this comment is off base. first, L2 does not drive weights to zero (that's L1); second, this isn't relearning: everyone already knew what L1/L2 penalties are



