Memorizing Transformers (arxiv.org)
179 points by silencedogood3 on May 20, 2022 | 32 comments



I have an implementation of this over at https://github.com/lucidrains/memorizing-transformers-pytorc..., for any researcher exploring retrieval and memory with attention networks.


Dude, your repos are great, marvellous code quality too for cutting-edge papers. Keep it up!


hey thanks! :^) hope someone makes the next big discovery with them


Neat! Can you explain what the kNN is doing? I can’t quite follow the paper.


It's a sparse attention scheme. They store and reuse activations, thus "memorising" the past without the need for retraining. To keep the sequence short enough to fit into memory, they only recall the k most similar memories from a much larger log.
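
The lookup step is conceptually just attention restricted to the k nearest cached (key, value) pairs. A brute-force sketch in plain PyTorch (hypothetical names; the paper uses an approximate-kNN index and keeps the cached keys/values out of the gradient path):

    import torch
    import torch.nn.functional as F

    def knn_memory_attention(q, mem_k, mem_v, top_k=32):
        # q:     (batch, heads, q_len, dim)   queries for the current chunk
        # mem_k: (batch, heads, mem_len, dim) keys cached from earlier chunks
        # mem_v: (batch, heads, mem_len, dim) values cached from earlier chunks
        sims = torch.einsum('bhqd,bhmd->bhqm', q, mem_k)      # every query vs. the whole log
        top_sims, idx = sims.topk(top_k, dim=-1)              # keep only the k best per query
        idx = idx.unsqueeze(-1).expand(*idx.shape, mem_v.shape[-1])
        top_v = mem_v.unsqueeze(2).expand(-1, -1, q.shape[2], -1, -1).gather(3, idx)
        attn = F.softmax(top_sims / q.shape[-1] ** 0.5, dim=-1)
        return torch.einsum('bhqk,bhqkd->bhqd', attn, top_v)  # (batch, heads, q_len, dim)

The brute-force topk over the full log is exactly what the approximate index avoids at scale; the point is that only k memories per query ever enter the softmax.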



External memory with pretrained models (or more generally, external not-necessarily-differentiable memory) is one of the most exciting areas of ML right now. It opens up models to external things like facts and databases.


Can you explain what the big deal is? I’m still in the early learning stages.


As an example: if you want to encode all of the data in Wikipedia with embeddings and train a model to answer questions with that information, historically that would mean a model that encodes all of Wikipedia, encodes the question, uses all of encoded Wikipedia to decode an answer, then does backprop through all of that and updates the weights. Then it re-encodes all of Wikipedia with the new weights and does it all over again at each training step, while also somehow holding all of that in GPU memory. Meaning you basically couldn’t do it that way.

Today, we’re seeing big models that can encode all of Wikipedia in useful ways. If the encodings are “good enough”, you can encode all of Wikipedia once, then train another model that only has to encode a question, use the encoded Wikipedia to decode an answer, and do backprop through just the question and answer. If Wikipedia changes in the meantime, you can probably just update your database of encoded passages and your learned QA model will be able to incorporate the new information.
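
A rough sketch of that pattern, with toy stand-ins for the real models (everything here is hypothetical; a real setup would use pretrained text encoders and a proper answer decoder):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    VOCAB = {"what": 0, "is": 1, "a": 2, "sandcat": 3, "feline": 4}

    def bow_embed(text):  # crude bag-of-words stand-in for a frozen encoder
        ids = [VOCAB.get(w, 0) for w in text.lower().split()]
        return F.one_hot(torch.tensor(ids), len(VOCAB)).float().mean(0)

    passages = ["a sandcat is a feline", "what is a feline"]

    # 1. Offline, once: embed the whole corpus and freeze it.
    with torch.no_grad():
        passage_embs = torch.stack([bow_embed(p) for p in passages])  # (N, dim)

    # 2. Only the question side is trainable; backprop never touches the corpus.
    question_encoder = nn.Linear(len(VOCAB), len(VOCAB))  # untrained placeholder

    def retrieve(question, k=1):
        q = question_encoder(bow_embed(question))
        scores = F.cosine_similarity(passage_embs, q.unsqueeze(0), dim=-1)
        return [passages[i] for i in scores.topk(k).indices.tolist()]

    # 3. If the corpus changes, re-embed just the changed passages; the trained
    #    question encoder (and downstream answer decoder) stays as it is.

The retrieved passages would then be handed to a generator that decodes the answer; gradients only ever flow through the question encoder and that generator.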


Replace Wikipedia by the internet, and you can replace Google Search by some (hopefully) soon to be discovered algorithm based on these principles. Exciting times.


The basic idea is to have a q, k, v cache of all the previously seen tokens that gets updated over time. The transformer can decide to do self-attention (and ignore the cache) or focus on elements from the cache (enabling it to attend to previously seen tokens). They mainly apply this to long documents; I'd be very curious to see a follow-up on time-dependent tasks like video.
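
Very schematically, that "decide" part is a learned per-head gate that mixes ordinary local attention with attention over the retrieved cache entries. A minimal sketch, assuming both attention outputs have already been computed elsewhere (hypothetical class name):

    import torch
    import torch.nn as nn

    class GatedMemoryMix(nn.Module):
        # local_out:  result of ordinary self-attention over the current chunk
        # memory_out: result of attention over the top-k retrieved cache entries
        # both of shape (batch, heads, seq_len, dim)
        def __init__(self, num_heads):
            super().__init__()
            self.gate = nn.Parameter(torch.zeros(num_heads))  # one learnable gate per head

        def forward(self, local_out, memory_out):
            g = torch.sigmoid(self.gate).view(1, -1, 1, 1)    # squash to (0, 1)
            return g * memory_out + (1 - g) * local_out

After each chunk, the current keys and values are simply appended to the cache without gradients, so "updating the memory" is just a non-differentiable write.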


Top of my head: Rodimus, Bumblebee, Ratchet, Optimus Prime, Laserbeak, Megatron, Astro Train, Jazz


People here don't deserve you :)


This is what I was expecting when I clicked on this submission.


Could there be any merit in training this on a common-sense dataset such as Cyc?

https://www.lesswrong.com/tag/cyc


Probably not; most common facts (a sand cat is a type of feline) are already known by transformers. Maybe some obscure ones.


Love it! It seems like a lot of the ideas from reinforcement learning are making their way into transformer land and NLP.


> On benchmarks including code and mathematics, we find that the model is capable of making use of newly defined functions and theorems during test time.

Train on test, improved performance on test. Wow.


> Wow.

Transformers are very limited in the size of their attention window. They can take a few thousand tokens at most. But your data might not fit into the window, and you also don't want to have to fine-tune the model. This paper offers a solution.


It isn't being trained on the test set. That's kind of the point of memory: you can change the memory at will and don't need to train on new information you have never seen before.


The ‘ethics’ section seems surprisingly cursory and lacking in references.

“The ability to memorize large databases of facts could have potential ramifications for society, especially if those databases include sensitive personal information or copyrighted works. However, one advantage of using an external memory is that the memory can be easily cleared of all such information”

That’s it? Just ‘may have ramifications’?

No concern that this enables ‘Tay’-like failure modes where a system can be manipulated through input into generating particular output?

Or even just grappling with whether adding ‘memory of experiences’ to a language model might open the door to creating a system that has beliefs, or opinions…? and that maybe there might be some ethical concerns with just wiping that out?


That'd be a waste of space. Most transformer models have the same ethical concerns, which have been addressed in countless other papers. Why bother copy pasting the same essays in every minor tweak of transformers?


The ethics sections for ML papers almost always seem extremely superfluous. It's like asking a CPU designer to talk about the danger that their CPU can run code for computing firing trajectories. It's a paper about providing memory to ML models, it'll have all the possible applications that require memory, what else does one need?


The ethics section is a tacked-on thing which is required by some large ML conferences. They're essentially a PR stunt. No ML researcher I know cares about it, or devotes more than the 5 minutes it takes to write some platitudes to the task. There are simply no incentives to write this properly. And quite frankly, I don't think there should be. We are educated, paid and motivated to push the boundaries of research, not to think about all potential fallout (which, let's face it, would usually require a whole additional paper for most meaningful contributions). I don't really see how we could change this.

TL;DR: as a general rule, you can ignore the ethics section of ML papers.


> We are educated, paid and motivated to push the boundaries of research, not to think about all potential fallout

That’s the whole problem that led to the introduction of these sections.


That's debatable; would an "ethics" section on the original deepfake paper have changed anything?

ML research isn't as inaccessible as genetics research; if there's something idiotic that people can do with DL, they will eventually do it. Acting as if having people add a paragraph to their paper where they "reflect" on the consequences will change anything only shows how disconnected you are from reality.

Research is research, there shouldn't be any "forbidden knowledge", we have laws for a reason.


> not to think about all potential fallout

You're doing it wrong then.

Ignoring ethics is lazy.


Yep, this is correct.


>> TL;DR: as a general rule, you can ignore the ethics section of ML papers.

More generally still, you can ignore the ethics of ML researchers- pretty much for the same reasons that you can ignore the Great Turnip of Justice in the sky.


I'm not sure it's scientific or helpful to include the risk that a program develops "beliefs" or "opinions", or that terminating the program is "wiping [someone] out".


> No concern that this enables ‘Tay’-like failure modes where a system can be manipulated through input into generating particular output?

Isn't that the core idea in prompting and few shot learning for large language models?


My feeling is that those topics would be best addressed in a separate paper by authors who have more of a background in ethics.



