Are the results actually good? Table 2 reports 3.40 bits/dim on CIFAR-10, but PixelRNN in 2016 got 3.06 bits/dim (Table 3 in https://arxiv.org/abs/1601.06759). I'd also like to compare the MNIST results, but I'm having trouble converting between bits/dim and nats in a way that gives a sensible number. It's a bit annoying that the paper doesn't compare against previously reported numbers on these benchmarks.
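For reference, the conversion itself is straightforward; here's a quick sketch (assumes MNIST's 28x28 = 784 dimensions and that reported nats are totals per image; function names are mine):

```python
import math

def bpd_to_nats(bpd, num_dims):
    """Total nats per example: nats = bits/dim * D * ln(2)."""
    return bpd * num_dims * math.log(2)

def nats_to_bpd(nats, num_dims):
    """Inverse: bits/dim = nats / (D * ln(2))."""
    return nats / (num_dims * math.log(2))

# MNIST is 28x28 = 784 dims; 1.0 bits/dim is just a placeholder value
print(bpd_to_nats(1.0, 784))  # ~543.4 nats per image
```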
IMO the theoretical insight, that transformers can be viewed as RNNs via the kernel formulation of self-attention, is more interesting than the experimental results.
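For anyone who hasn't read the paper: once a feature map φ replaces the softmax, attention becomes a running sum over keys and values, which is exactly an RNN update. A minimal sketch of that recurrence, assuming the elu(x)+1 feature map the paper proposes (this numpy reference implementation and its names are mine, not the authors' code):

```python
import numpy as np

def elu_plus_one(x):
    # phi(x) = elu(x) + 1, the positive-valued feature map from the paper
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_recurrent(Q, K, V, eps=1e-6):
    """Causal linear attention computed as an RNN over time steps.

    Q, K: (N, d_k); V: (N, d_v). State S is (d_k, d_v), z is (d_k,).
    Per-step cost is O(d_k * d_v), independent of sequence length N.
    """
    phi_Q, phi_K = elu_plus_one(Q), elu_plus_one(K)
    d_k, d_v = Q.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))   # running sum of outer products phi(k_i) v_i^T
    z = np.zeros(d_k)          # running sum of phi(k_i), for normalization
    out = np.empty_like(V)
    for i in range(Q.shape[0]):
        S += np.outer(phi_K[i], V[i])
        z += phi_K[i]
        out[i] = (phi_Q[i] @ S) / (phi_Q[i] @ z + eps)
    return out
```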
I wonder how much of a limitation it is that the gradient needs some massaging to keep training efficient: does that trick only work for certain kernels, or can it be automated for arbitrary ones?
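For context, my reading is that the causal case can be written with cumulative sums, but a naive autodiff-friendly version materializes all N intermediate (d_k, d_v) states, which is the memory blowup their hand-derived gradient sidesteps. A rough sketch of the naive formulation (illustrative only, not the paper's CUDA kernel):

```python
import numpy as np

def causal_linear_attention_naive(phi_Q, phi_K, V, eps=1e-6):
    """Naive cumsum formulation of causal linear attention.

    Stores all N intermediate (d_k, d_v) states, i.e. O(N * d_k * d_v)
    memory, which is why the backward pass needs special handling to
    stay memory-efficient.
    """
    S = np.cumsum(np.einsum('nk,nv->nkv', phi_K, V), axis=0)  # (N, d_k, d_v)
    z = np.cumsum(phi_K, axis=0)                              # (N, d_k)
    num = np.einsum('nk,nkv->nv', phi_Q, S)
    den = np.einsum('nk,nk->n', phi_Q, z)[:, None] + eps
    return num / den
```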
They are also efficient at inference time. On GPUs the difference only becomes noticeable for sequences longer than ~1024 tokens (~2048 for Reformer, which adds extra operations for hashing), because the massive parallelism of GPUs amortizes the quadratic cost of "usual" self-attention at shorter lengths.
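Concretely, generation only needs a constant-size state update per token instead of attending over the whole prefix. A toy decode step under the same recurrence as above (the `step` helper is hypothetical):

```python
import numpy as np

def step(state, phi_q, phi_k, v, eps=1e-6):
    """O(1)-per-token decode: fold the new key/value into the running
    state, then read out with the query. state = (S, z) with shapes
    (d_k, d_v) and (d_k,), fixed regardless of how many tokens came before."""
    S, z = state
    S = S + np.outer(phi_k, v)
    z = z + phi_k
    out = (phi_q @ S) / (phi_q @ z + eps)
    return (S, z), out
```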
https://linear-transformers.com/