Are the results actually good? Table 2 reports 3.40 bits/dim on CIFAR-10, but PixelRNN in 2016 got 3.06 bits/dim (Table 3 in https://arxiv.org/abs/1601.06759). I'd also like to compare the MNIST results, but I'm having trouble converting between bits/dim and nats in a way that gives a sensible number. It's a bit annoying that the paper doesn't compare against previously reported numbers on these benchmarks.
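For reference, the conversion itself is straightforward; here's a quick sketch (assumes MNIST's 28x28 = 784 dimensions and that reported nats are totals per image; function names are mine):

```python
import math

def bpd_to_nats(bpd, num_dims):
    """Total nats per example: nats = bits/dim * D * ln(2)."""
    return bpd * num_dims * math.log(2)

def nats_to_bpd(nats, num_dims):
    """Inverse: bits/dim = nats / (D * ln(2))."""
    return nats / (num_dims * math.log(2))

# MNIST is 28x28 = 784 dims; 1.0 bits/dim is just a placeholder value
print(bpd_to_nats(1.0, 784))  # ~543.4 nats per image
```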
IMO the theoretical insight, that transformers can be viewed as RNNs via the kernel formulation of self-attention, is more interesting than the experimental results.
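For anyone who hasn't read the paper: once a feature map φ replaces the softmax, attention becomes a running sum over keys and values, which is exactly an RNN update. A minimal sketch of that recurrence, assuming the elu(x)+1 feature map the paper proposes (this numpy reference implementation and its names are mine, not the authors' code):

```python
import numpy as np

def elu_plus_one(x):
    # phi(x) = elu(x) + 1, the positive-valued feature map from the paper
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_recurrent(Q, K, V, eps=1e-6):
    """Causal linear attention computed as an RNN over time steps.

    Q, K: (N, d_k); V: (N, d_v). State S is (d_k, d_v), z is (d_k,).
    Per-step cost is O(d_k * d_v), independent of sequence length N.
    """
    phi_Q, phi_K = elu_plus_one(Q), elu_plus_one(K)
    d_k, d_v = Q.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))   # running sum of outer products phi(k_i) v_i^T
    z = np.zeros(d_k)          # running sum of phi(k_i), for normalization
    out = np.empty_like(V)
    for i in range(Q.shape[0]):
        S += np.outer(phi_K[i], V[i])
        z += phi_K[i]
        out[i] = (phi_Q[i] @ S) / (phi_Q[i] @ z + eps)
    return out
```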
I wonder how much of a limitation it is that the gradient needs some massaging to keep training efficient: does that trick only work for certain kernels, or can it be automated for arbitrary ones?
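For context, my reading is that the causal case can be written with cumulative sums, but a naive autodiff-friendly version materializes all N intermediate (d_k, d_v) states, which is the memory blowup their hand-derived gradient sidesteps. A rough sketch of the naive formulation (illustrative only, not the paper's CUDA kernel):

```python
import numpy as np

def causal_linear_attention_naive(phi_Q, phi_K, V, eps=1e-6):
    """Naive cumsum formulation of causal linear attention.

    Stores all N intermediate (d_k, d_v) states, i.e. O(N * d_k * d_v)
    memory, which is why the backward pass needs special handling to
    stay memory-efficient.
    """
    S = np.cumsum(np.einsum('nk,nv->nkv', phi_K, V), axis=0)  # (N, d_k, d_v)
    z = np.cumsum(phi_K, axis=0)                              # (N, d_k)
    num = np.einsum('nk,nkv->nv', phi_Q, S)
    den = np.einsum('nk,nk->n', phi_Q, z)[:, None] + eps
    return num / den
```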
They are also efficient at inference time. On GPUs the difference only becomes noticeable for sequences longer than ~1024 tokens (~2048 for Reformer, which adds extra operations for hashing), because the massive parallelism of GPUs amortizes the quadratic cost of "usual" self-attention at shorter lengths.
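Concretely, generation only needs a constant-size state update per token instead of attending over the whole prefix. A toy decode step under the same recurrence as above (the `step` helper is hypothetical):

```python
import numpy as np

def step(state, phi_q, phi_k, v, eps=1e-6):
    """O(1)-per-token decode: fold the new key/value into the running
    state, then read out with the query. state = (S, z) with shapes
    (d_k, d_v) and (d_k,), fixed regardless of how many tokens came before."""
    S, z = state
    S = S + np.outer(phi_k, v)
    z = z + phi_k
    out = (phi_q @ S) / (phi_q @ z + eps)
    return (S, z), out
```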
https://linear-transformers.com/