The Reformer – Pushing the limits of language modeling (colab.research.google.com)
62 points by datashrimp on July 2, 2020 | hide | past | favorite | 13 comments



An alternative to this was recently published at ICML that claims to be faster. The website and tutorial video are very nice, too.

https://linear-transformers.com/


Are the results actually good? Table 2 reports 3.40 bits/dim on CIFAR-10, but PixelRNN got 3.06 bits/dim back in 2016 (Table 3 in https://arxiv.org/abs/1601.06759). I would like to compare the MNIST results as well, but I'm having trouble converting between bits/dim and nats in a way that gives a sensible result. It's a bit annoying that the paper doesn't compare to previously reported numbers on these benchmarks.
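For what it's worth, the conversion itself is just arithmetic: bits/dim is the negative log2-likelihood averaged over dimensions, so multiplying by the number of dimensions and ln(2) gives total nats per example. A minimal sketch (the function name and the example score are my own, not from either paper):

```python
import math

def bits_per_dim_to_nats(bits_per_dim, num_dims):
    """Convert a bits/dim score to total nats per example.

    bits/dim = NLL in bits averaged over dimensions, so:
    total nats = bits_per_dim * num_dims * ln(2).
    """
    return bits_per_dim * num_dims * math.log(2)

# MNIST images are 28x28 = 784 dimensions.
# A hypothetical score of 1.0 bits/dim corresponds to:
print(bits_per_dim_to_nats(1.0, 28 * 28))  # ~543.4 nats per image
```

The usual source of confusion is whether a paper reports nats per image or nats per dimension; off by a factor of 784, the numbers look nonsensical.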


IMO the theoretical insight w.r.t. transformers as RNNs through the kernel formulation of self-attention is more interesting than the experimental results.
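The kernel trick in question is easy to sketch: replace softmax(QK^T)V with phi(Q)(phi(K)^T V) / (phi(Q) phi(K)^T 1), so the cost is linear in sequence length. A minimal NumPy illustration using the elu(x)+1 feature map from the linked paper (function names are mine):

```python
import numpy as np

def elu_plus_one(x):
    # Positive feature map phi(x) = elu(x) + 1, as used in the
    # "Transformers are RNNs" paper behind linear-transformers.com.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V, eps=1e-6):
    """O(N) attention: softmax(Q K^T) V is approximated by
    phi(Q) (phi(K)^T V) / (phi(Q) phi(K)^T 1)."""
    Qp, Kp = elu_plus_one(Q), elu_plus_one(K)  # (N, d) each
    KV = Kp.T @ V                              # (d, d_v), computed once
    Z = Qp @ Kp.sum(axis=0)                    # (N,) normalizer
    return (Qp @ KV) / (Z[:, None] + eps)

# Toy check with a length-5 sequence, d = d_v = 4.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 4)) for _ in range(3))
out = linear_attention(Q, K, V)
print(out.shape)  # (5, 4)
```

The RNN view in the paper falls out of computing KV and Z as running sums over positions instead of all at once.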


Thanks for sharing. This was great.

I wonder how much of a limitation it is that the gradient requires some massaging to preserve efficient training; does that only work for certain kernels, or can it be automated for arbitrary kernels?


Are these Reformer/Linformer models mostly space-efficient, or is inference runtime also improved?


They are also efficient at inference time. On GPUs the difference is only noticeable for sequences of length > 1024 (2048 for Reformer, since it adds some operations for hashing), because the massive parallelism of GPUs amortizes the quadratic cost of the "usual" self-attention mechanism.

[edit] Linformer (https://arxiv.org/pdf/2006.04768.pdf) is a different project from the one linked in https://linear-transformers.com/ (Transformers are RNNs https://arxiv.org/pdf/2006.16236.pdf).
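A back-of-envelope cost model (my own rough sketch, counting only the attention matmuls and ignoring projections, softmax, and memory effects) shows why the gap only opens up at longer sequence lengths:

```python
def attn_flops(n, d):
    # Standard self-attention: Q K^T costs n*n*d multiply-adds,
    # and (attention weights) @ V costs another n*n*d.
    return 2 * n * n * d

def linear_attn_flops(n, d):
    # Kernelized attention: phi(K)^T V costs n*d*d, and
    # phi(Q) @ (that) costs another n*d*d; linear in n.
    return 2 * n * d * d

# With head dimension d = 64, the ratio grows as n/d:
for n in (512, 1024, 2048, 4096):
    print(n, attn_flops(n, 64) / linear_attn_flops(n, 64))
# prints ratios 8.0, 16.0, 32.0, 64.0
```

At short lengths the quadratic term is small enough that GPU parallelism hides it, which matches the >1024 crossover mentioned above.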


Thanks for the edit and the heads-up! I missed that.


This is amazing!


This could be a transformative way to share research papers - it seems you can run all their models inside the paper!


What is the performance of Reformer, Linformer, or any of these other new models in practical applications (not the benchmarks that researchers game)? Is it better than BERT?


I actively follow the state-of-the-art pre-trained models on paperswithcode.com and NLP-progress. The state of the art (often outperforming BERT by far) is XLNet, and sadly it is from 2019. 2020 has been stagnating (except for the special case of generative tasks with GPT-3). I have observed that zero researchers have tried to improve on top of XLNet, while BERT has had ~20 alternative implementations that improve upon it. Researchers are often unaware of the current state of the art; this induces a lag in research progress.


"zero researchers have tried to improve on top of XLnet" I question this assertion.

In particular at least the Roberta model by Facebook is already improving significantly upon XLNet.


Are Reformer/Linformer only more space-efficient, or is inference runtime also improved?

For me the greatest trick is the improved runtime compared to older sequential/RNN techniques.



