
An alternative to this has recently been published at ICML that claims to be faster. The website and tutorial video are very nice, too.

https://linear-transformers.com/




Are the results actually good? Table 2 reports 3.40 bits/dim on CIFAR-10, but PixelRNN in 2016 got 3.06 bits/dim (Table 3 in https://arxiv.org/abs/1601.06759). I would also like to compare the MNIST results, but I'm having trouble converting between bits/dim and nats in a way that gives a sensible result. It's a bit annoying that the paper doesn't compare against previously reported numbers on these benchmarks.
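For reference, the conversion I'm attempting is just a change of log base plus a normalization by the number of dimensions. A minimal Python sketch, assuming the MNIST numbers are nats per image over 28x28 = 784 dimensions (which may be exactly where I'm going wrong):

    import math

    def nats_per_image_to_bits_per_dim(nats, num_dims=28 * 28):
        # bits/dim = (nats per image) / (num_dims * ln 2)
        return nats / (num_dims * math.log(2))

    def bits_per_dim_to_nats_per_image(bits_per_dim, num_dims=28 * 28):
        # Inverse: nats per image = bits/dim * num_dims * ln 2
        return bits_per_dim * num_dims * math.log(2)

    # Purely hypothetical value, just to show the arithmetic:
    print(nats_per_image_to_bits_per_dim(80.0))  # ~0.147 bits/dim for a 784-dim image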


IMO the theoretical insight w.r.t. transformers as RNNs through the kernel formulation of self-attention is more interesting than the experimental results.
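Concretely, the kernel trick lets causal self-attention be written as a recurrence over a fixed-size state instead of a T x T attention matrix. A rough NumPy sketch of that idea (not the authors' code; the elu(x)+1 feature map is the one from the paper, everything else is illustrative):

    import numpy as np

    def elu_plus_one(x):
        # Feature map phi(x) = elu(x) + 1, which keeps features positive.
        return np.where(x > 0, x + 1.0, np.exp(x))

    def causal_linear_attention(Q, K, V, eps=1e-6):
        """Causal linear attention computed as a recurrence over time steps.

        Q, K: (T, d_k), V: (T, d_v). Returns (T, d_v).
        Similarity is phi(q)·phi(k) instead of exp(q·k / sqrt(d_k)).
        """
        Qf, Kf = elu_plus_one(Q), elu_plus_one(K)
        T, d_v = V.shape
        d_k = Qf.shape[1]

        S = np.zeros((d_k, d_v))   # running sum of phi(k_t) v_t^T -- the "RNN state"
        z = np.zeros(d_k)          # running sum of phi(k_t) for normalization
        out = np.zeros((T, d_v))
        for t in range(T):
            S += np.outer(Kf[t], V[t])
            z += Kf[t]
            out[t] = Qf[t] @ S / (Qf[t] @ z + eps)
        return out

    rng = np.random.default_rng(0)
    Q, K, V = rng.normal(size=(3, 16, 8))  # hypothetical T=16, d=8
    print(causal_linear_attention(Q, K, V).shape)  # (16, 8)

The loop makes the RNN view explicit; the paper's point is that the same computation can also be parallelized over time steps during training, while inference runs as this constant-memory recurrence.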


Thanks for sharing. This was great.

I wonder how much of a limitation it is that the gradient requires some massaging to preserve efficient training: does that only work for certain kernels, or can it be automated for arbitrary kernels?


Are Reformer/Linformer mostly space-efficient, or do they also improve inference runtime?


They are also efficient at inference time. On GPUs the difference only becomes noticeable for sequences of length > 1024 (2048 for Reformer, since it adds some operations for hashing), because the massive parallelism of GPUs amortizes the quadratic cost of the "usual" self-attention mechanism.
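To make the asymptotics concrete, here is a toy CPU/NumPy comparison of the two formulations: softmax attention at O(T^2 d) versus (non-causal) linear attention at O(T d^2). It is illustrative only and won't reproduce the GPU crossover points mentioned above; the sequence lengths and dimensions are arbitrary.

    import time
    import numpy as np

    def softmax_attention(Q, K, V):
        # Standard attention: materializes the full T x T matrix -> O(T^2 * d).
        A = Q @ K.T / np.sqrt(Q.shape[1])
        A = np.exp(A - A.max(axis=1, keepdims=True))
        A /= A.sum(axis=1, keepdims=True)
        return A @ V

    def linear_attention(Q, K, V, eps=1e-6):
        # Non-causal linear attention: O(T * d^2), never forms the T x T matrix.
        phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1
        Qf, Kf = phi(Q), phi(K)
        KV = Kf.T @ V                   # (d_k, d_v)
        z = Qf @ Kf.sum(axis=0) + eps   # (T,)
        return (Qf @ KV) / z[:, None]

    for T in (256, 1024, 4096):         # hypothetical sequence lengths
        Q, K, V = np.random.randn(3, T, 64)
        for name, fn in (("softmax", softmax_attention), ("linear", linear_attention)):
            t0 = time.perf_counter()
            fn(Q, K, V)
            print(f"T={T:5d} {name:8s} {time.perf_counter() - t0:.4f}s")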

[edit] Linformer (https://arxiv.org/pdf/2006.04768.pdf) is a different project from the one linked in https://linear-transformers.com/ (Transformers are RNNs https://arxiv.org/pdf/2006.16236.pdf).


Thanks for the edit and the heads-up, too! I missed that.


This is amazing!



