Are the results actually good? Table 2 reports 3.40 bits/dim on CIFAR-10, but PixelRNN in 2016 got 3.06 bits/dim (Table 3 in https://arxiv.org/abs/1601.06759). I would also like to compare the MNIST results, but I'm having trouble converting between bits/dim and nats in a way that gives a sensible result. It's a bit annoying that the paper does not compare to previously reported numbers on these benchmarks.
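For what it's worth, the conversion is just nats = bits × ln 2, scaled by the number of dimensions if one number is per-image and the other per-dimension. A minimal sketch, assuming MNIST images are treated as 28×28 = 784 dimensions (the function names and the example value are mine):

```python
import math

MNIST_DIMS = 28 * 28  # 784, assuming the full image is modelled

def bits_per_dim_to_nats_per_image(bpd, num_dims=MNIST_DIMS):
    # 1 bit = ln(2) nats; multiply by the number of dimensions to go
    # from a per-dimension quantity to a per-image quantity.
    return bpd * num_dims * math.log(2)

def nats_per_image_to_bits_per_dim(nats, num_dims=MNIST_DIMS):
    return nats / (num_dims * math.log(2))

# e.g. a hypothetical NLL of 80 nats/image:
print(nats_per_image_to_bits_per_dim(80.0))  # ~0.147 bits/dim
```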
IMO the theoretical insight w.r.t. transformers as RNNs through the kernel formulation of self-attention is more interesting than the experimental results.
I wonder how much of a limitation it poses that the gradient requires some massaging to preserve efficient training; does that only work for some kernels, or can it be automated for arbitrary kernels?
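For anyone who hasn't read the paper yet, the part I find elegant is that the causal case reduces to a constant-size recurrent state. A minimal NumPy sketch of the recurrence as I understand it (the elu(x)+1 feature map is the one the authors use; the function and variable names are mine, and this omits heads, batching, and the gradient trick discussed in the paper):

```python
import numpy as np

def elu_plus_one(x):
    # feature map phi(x) = elu(x) + 1, which keeps attention weights positive
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention(Q, K, V, eps=1e-6):
    """Causal linear attention computed as an RNN over the sequence.

    Q, K: (seq_len, d_k), V: (seq_len, d_v). The state is a d_k x d_v
    matrix S plus a d_k vector z, i.e. constant memory in sequence length.
    """
    d_k, d_v = Q.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))
    z = np.zeros(d_k)
    outputs = []
    for q, k, v in zip(elu_plus_one(Q), elu_plus_one(K), V):
        S = S + np.outer(k, v)   # accumulate phi(k_j) v_j^T
        z = z + k                # accumulate phi(k_j)
        outputs.append((q @ S) / (q @ z + eps))
    return np.stack(outputs)
```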
They are also efficient at inference time. On GPUs the difference is noticeable only for sequences longer than 1024 tokens (2048 for Reformer, since it adds some extra operations for hashing), because the massive parallelism of GPUs amortizes the quadratic cost of the "usual" self-attention mechanism.
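If you want to sanity-check the crossover point on your own hardware, here is a rough timing sketch (assuming PyTorch and ideally a CUDA device; the non-causal linear-attention variant below is my own simplification, not the authors' kernel, and absolute numbers will vary):

```python
import time
import torch

def softmax_attention(q, k, v):
    # standard O(N^2) self-attention
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v, eps=1e-6):
    # non-causal linear attention with the elu(x)+1 feature map, O(N)
    q = torch.nn.functional.elu(q) + 1
    k = torch.nn.functional.elu(k) + 1
    kv = k.transpose(-2, -1) @ v                           # (d_k, d_v)
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + eps
    return (q @ kv) / z

def bench(fn, n, d=64, iters=10,
          device="cuda" if torch.cuda.is_available() else "cpu"):
    q, k, v = (torch.randn(n, d, device=device) for _ in range(3))
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        fn(q, k, v)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.time() - start) / iters

for n in (256, 1024, 4096, 16384):
    print(n, bench(softmax_attention, n), bench(linear_attention, n))
```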
What is the performance of Reformer, Linformer, or any of these other new models in practical applications (not the benchmarks that researchers game)? Is it better than BERT?
I actively follow the state-of-the-art pre-trained models on paperswithcode.com and NLP-progress.
The state of the art (often outperforming BERT by far) is XLNet, and sadly it is from 2019.
2020 has been stagnating (except for the special case of generative tasks with GPT-3).
I have observed that zero researchers have tried to improve on top of XLNet, while BERT has had ~20 alternative implementations that improve upon it.
Researchers are often unaware of what the current state of the art is, and this induces a lag in research progress.
https://linear-transformers.com/