Are the results actually good? Table 2 reports 3.40 bits/dim on CIFAR-10, but PixelRNN in 2016 got 3.06 bits/dim (Table 3 in https://arxiv.org/abs/1601.06759). I would also like to compare the MNIST results, but I'm having trouble converting between bits/dim and nats in a way that gives a sensible result. It's a bit annoying that the paper does not compare to previously reported numbers on these benchmarks.
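For what it's worth, the conversion is just nats = bits × ln 2, scaled by the number of dimensions if one number is per-image and the other per-dimension. A minimal sketch, assuming MNIST images are treated as 28×28 = 784 dimensions (the function names and the example value are mine):

```python
import math

MNIST_DIMS = 28 * 28  # 784, assuming the full image is modelled

def bits_per_dim_to_nats_per_image(bpd, num_dims=MNIST_DIMS):
    # 1 bit = ln(2) nats; multiply by the number of dimensions to go
    # from a per-dimension quantity to a per-image quantity.
    return bpd * num_dims * math.log(2)

def nats_per_image_to_bits_per_dim(nats, num_dims=MNIST_DIMS):
    return nats / (num_dims * math.log(2))

# e.g. a hypothetical NLL of 80 nats/image:
print(nats_per_image_to_bits_per_dim(80.0))  # ~0.147 bits/dim
```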
IMO the theoretical insight w.r.t. transformers as RNNs through the kernel formulation of self-attention is more interesting than the experimental results.
I wonder how much of a limitation it poses that the gradient requires some massaging to preserve efficient training; does that only work for some kernels, or can it be automated for arbitrary kernels?
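For anyone who hasn't read the paper yet, the part I find elegant is that the causal case reduces to a constant-size recurrent state. A minimal NumPy sketch of the recurrence as I understand it (the elu(x)+1 feature map is the one the authors use; the function and variable names are mine, and this omits heads, batching, and the gradient trick discussed in the paper):

```python
import numpy as np

def elu_plus_one(x):
    # feature map phi(x) = elu(x) + 1, which keeps attention weights positive
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention(Q, K, V, eps=1e-6):
    """Causal linear attention computed as an RNN over the sequence.

    Q, K: (seq_len, d_k), V: (seq_len, d_v). The state is a d_k x d_v
    matrix S plus a d_k vector z, i.e. constant memory in sequence length.
    """
    d_k, d_v = Q.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))
    z = np.zeros(d_k)
    outputs = []
    for q, k, v in zip(elu_plus_one(Q), elu_plus_one(K), V):
        S = S + np.outer(k, v)   # accumulate phi(k_j) v_j^T
        z = z + k                # accumulate phi(k_j)
        outputs.append((q @ S) / (q @ z + eps))
    return np.stack(outputs)
```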
They are also efficient at inference time. On GPUs the difference is noticeable only for sequences longer than 1024 tokens (2048 for Reformer, since it adds some extra operations for hashing), because the massive parallelism of GPUs amortizes the quadratic cost of the "usual" self-attention mechanism.
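If you want to sanity-check the crossover point on your own hardware, here is a rough timing sketch (assuming PyTorch and ideally a CUDA device; the non-causal linear-attention variant below is my own simplification, not the authors' kernel, and absolute numbers will vary):

```python
import time
import torch

def softmax_attention(q, k, v):
    # standard O(N^2) self-attention
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v, eps=1e-6):
    # non-causal linear attention with the elu(x)+1 feature map, O(N)
    q = torch.nn.functional.elu(q) + 1
    k = torch.nn.functional.elu(k) + 1
    kv = k.transpose(-2, -1) @ v                           # (d_k, d_v)
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + eps
    return (q @ kv) / z

def bench(fn, n, d=64, iters=10,
          device="cuda" if torch.cuda.is_available() else "cpu"):
    q, k, v = (torch.randn(n, d, device=device) for _ in range(3))
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        fn(q, k, v)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.time() - start) / iters

for n in (256, 1024, 4096, 16384):
    print(n, bench(softmax_attention, n), bench(linear_attention, n))
```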
What is the performance of Reformer, Linformer, or any of these other new models in practical applications (not the benchmarks that researchers game)? Is it better than BERT?
I actively follow the state-of-the-art pre-trained models on paperswithcode.com and NLP-progress.
The state of the art (often outperforming BERT by far) is XLNet, and sadly it is from 2019.
2020 has been stagnating (except for the special case of generative tasks with GPT-3).
I have observed that zero researchers have tried to improve on top of XLNet, while BERT has had ~20 alternative implementations that improve upon it.
Researchers are often unaware of what the current state of the art is, and this induces a lag in research progress.
https://linear-transformers.com/