Hacker News
The State of Sparsity in Neural Networks (arxiv.org)
4 points by ekelsen on Feb 28, 2019 | 3 comments



This is an illuminating (and notably rigorous) read for anyone interested in neural network sparsity and compression. But - equally importantly - it's a valuable read for anyone interested in the replicability of neural network research in general. The authors make clear the urgent need to evaluate research (and reevaluate received wisdom) on networks of the scale and complexity used in practice. I hope this paper will spark some important conversations in the community about our standards for assessing new ideas (mine included). As this paper makes exceedingly clear, plenty of techniques and behaviors observed on MNIST and CIFAR10 manifest differently (if at all) in industrial-scale settings.

My biggest question coming out of this work was as follows: which small-scale (or, at the very least, inexpensive) benchmarks share enough properties with these large-scale networks that we should expect results to scale with reasonable fidelity? Resnet50 is still far too slow and expensive to use as a day-to-day research network in academia, let alone Transformer. Personally, I've found resnet18 on CIFAR10 to pretty reliably predict the behavior of resnet50 on ImageNet, but that's anecdotal. For academics who can't drop hundreds of thousands of dollars (or more) on each paper but still want to contribute to research progress, we should carefully assess (or design) benchmarks with this property in mind.

(With respect to the lottery ticket hypothesis, we have a complementary ICML submission about its behavior on large-scale networks coming shortly!)


I think the goal should be to use the smallest dense network possible as the baseline. For MNIST, this might be a LeNet-style convnet with widths [3, 9, 50] instead of the standard [20, 50, 500] network (which is way overkill).
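To make that concrete, here's a rough PyTorch sketch of what such a slimmed-down LeNet-style baseline could look like (widths are configurable; this is just an illustration, not code from the paper):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SmallLeNet(nn.Module):
        # LeNet-style convnet for 28x28 MNIST with configurable layer widths.
        # widths=(20, 50, 500) is the usual (overparameterized) baseline;
        # widths=(3, 9, 50) is the much smaller dense baseline suggested above.
        def __init__(self, widths=(3, 9, 50), num_classes=10):
            super().__init__()
            c1, c2, h = widths
            self.conv1 = nn.Conv2d(1, c1, kernel_size=5)   # 28x28 -> 24x24
            self.conv2 = nn.Conv2d(c1, c2, kernel_size=5)  # 12x12 -> 8x8
            self.fc1 = nn.Linear(c2 * 4 * 4, h)            # after two 2x2 max-pools
            self.fc2 = nn.Linear(h, num_classes)

        def forward(self, x):
            x = F.max_pool2d(F.relu(self.conv1(x)), 2)
            x = F.max_pool2d(F.relu(self.conv2(x)), 2)
            x = torch.flatten(x, 1)
            x = F.relu(self.fc1(x))
            return self.fc2(x)

With widths [3, 9, 50] this comes to roughly 8.5K parameters, versus roughly 430K for the standard [20, 50, 500] configuration.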

I haven't explored CIFAR, but my guess is that using a more efficient architecture like mobilenetv2 would yield results more likely to transfer.

The general theme is that you should be using the smallest dense model you possibly can as a baseline.


We rigorously evaluate three state-of-the-art techniques for inducing sparsity in deep neural networks on two large-scale learning tasks: Transformer trained on WMT 2014 English-to-German, and ResNet-50 trained on ImageNet. Across thousands of experiments, we demonstrate that complex techniques (Molchanov et al., 2017; Louizos et al., 2017b) shown to yield high compression rates on smaller datasets perform inconsistently, and that simple magnitude pruning approaches achieve comparable or better results. Additionally, we replicate the experiments performed by Frankle & Carbin (2018) and Liu et al. (2018) at scale and show that unstructured sparse architectures learned through pruning cannot be trained from scratch to the same test set performance as a model trained with joint sparsification and optimization. Together, these results highlight the need for large-scale benchmarks in the field of model compression. We open-source our code, top performing model checkpoints, and results of all hyperparameter configurations to establish rigorous baselines for future work on compression and sparsification.
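For context, the "simple magnitude pruning" baseline referenced above is conceptually straightforward: zero out the weights with the smallest absolute values. A minimal one-shot, unstructured sketch in PyTorch (the paper's baselines prune gradually over the course of training, so treat this purely as an illustration of the core idea):

    import torch

    def global_magnitude_prune(model, sparsity):
        # Zero out the smallest-magnitude fraction `sparsity` of weights,
        # pooled globally across all weight matrices (biases left alone).
        weights = [p for p in model.parameters() if p.dim() > 1]
        scores = torch.cat([p.detach().abs().flatten() for p in weights])
        k = int(sparsity * scores.numel())
        if k == 0:
            return
        threshold = torch.kthvalue(scores, k).values
        with torch.no_grad():
            for p in weights:
                p.mul_((p.abs() > threshold).float())

A real implementation would also keep the binary masks around and reapply them after every optimizer step; otherwise the pruned weights become nonzero again as soon as training continues.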



