Memorizing Transformers (arxiv.org)
179 points by silencedogood3 on May 20, 2022 | 32 comments



I have an implementation of this over at https://github.com/lucidrains/memorizing-transformers-pytorc..., for any researcher exploring retrieval and memory with attention networks.


Dude, your repos are great, marvellous code quality too for cutting-edge papers. Keep it up!


hey thanks! :^) hope someone makes the next big discovery with them


Neat! Can you explain what the kNN is doing? I can’t quite follow the paper.


It's a sparse attention scheme. They store and reuse activations, thus "memorising" the past without the need for retraining. To keep the sequence short enough to fit into memory, they only recall the k most similar memories from a much larger log.
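
The lookup step is conceptually just attention restricted to the k nearest cached (key, value) pairs. A brute-force sketch in plain PyTorch (hypothetical names; the paper uses an approximate-kNN index and keeps the cached keys/values out of the gradient path):

    import torch
    import torch.nn.functional as F

    def knn_memory_attention(q, mem_k, mem_v, top_k=32):
        # q:     (batch, heads, q_len, dim)   queries for the current chunk
        # mem_k: (batch, heads, mem_len, dim) keys cached from earlier chunks
        # mem_v: (batch, heads, mem_len, dim) values cached from earlier chunks
        sims = torch.einsum('bhqd,bhmd->bhqm', q, mem_k)      # every query vs. the whole log
        top_sims, idx = sims.topk(top_k, dim=-1)              # keep only the k best per query
        idx = idx.unsqueeze(-1).expand(*idx.shape, mem_v.shape[-1])
        top_v = mem_v.unsqueeze(2).expand(-1, -1, q.shape[2], -1, -1).gather(3, idx)
        attn = F.softmax(top_sims / q.shape[-1] ** 0.5, dim=-1)
        return torch.einsum('bhqk,bhqkd->bhqd', attn, top_v)  # (batch, heads, q_len, dim)

The brute-force topk over the full log is exactly what the approximate index avoids at scale; the point is that only k memories per query ever enter the softmax.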



External memory with pretrained models (or more generally, external not-necessarily-differentiable memory) is one of the most exciting areas of ML right now. It opens up models to external things like facts and databases.


Can you explain what the big deal is? I’m still in the early learning stages.


As an example: if you want to encode all of the data in Wikipedia with embeddings and train a model to answer questions with that information, historically that would mean a model that encodes all of Wikipedia, encodes the question, uses all of encoded Wikipedia to decode an answer, then does backprop through all of that and updates the weights. Then it re-encodes all of Wikipedia with the new weights and does it all over again at each training step, while also somehow holding all of that in GPU memory. Meaning you basically couldn’t do it that way.

Today, we’re seeing big models that can encode all of Wikipedia in useful ways. If the encodings are “good enough”, you can encode all of Wikipedia once, then train another model that only has to encode a question, use the encoded Wikipedia to decode an answer, and do backprop through just the question and answer. If Wikipedia changes in the meantime, you can probably just update your database of encoded passages and your learned QA model will be able to incorporate the new information.
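
A rough sketch of that pattern, with toy stand-ins for the real models (everything here is hypothetical; a real setup would use pretrained text encoders and a proper answer decoder):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    VOCAB = {"what": 0, "is": 1, "a": 2, "sandcat": 3, "feline": 4}

    def bow_embed(text):  # crude bag-of-words stand-in for a frozen encoder
        ids = [VOCAB.get(w, 0) for w in text.lower().split()]
        return F.one_hot(torch.tensor(ids), len(VOCAB)).float().mean(0)

    passages = ["a sandcat is a feline", "what is a feline"]

    # 1. Offline, once: embed the whole corpus and freeze it.
    with torch.no_grad():
        passage_embs = torch.stack([bow_embed(p) for p in passages])  # (N, dim)

    # 2. Only the question side is trainable; backprop never touches the corpus.
    question_encoder = nn.Linear(len(VOCAB), len(VOCAB))  # untrained placeholder

    def retrieve(question, k=1):
        q = question_encoder(bow_embed(question))
        scores = F.cosine_similarity(passage_embs, q.unsqueeze(0), dim=-1)
        return [passages[i] for i in scores.topk(k).indices.tolist()]

    # 3. If the corpus changes, re-embed just the changed passages; the trained
    #    question encoder (and downstream answer decoder) stays as it is.

The retrieved passages would then be handed to a generator that decodes the answer; gradients only ever flow through the question encoder and that generator.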


Replace Wikipedia by the internet, and you can replace Google Search by some (hopefully) soon to be discovered algorithm based on these principles. Exciting times.


The basic idea is to have a q, k, v cache of all the previously seen tokens that gets updated over time. The transformer can decide to do self-attention (and ignore the cache) or focus on elements from the cache (enabling it to attend to previously seen tokens). They mainly apply this to long documents; I'd be very curious to see a follow-up on time-dependent tasks like video.
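
Very schematically, that "decide" part is a learned per-head gate that mixes ordinary local attention with attention over the retrieved cache entries. A minimal sketch, assuming both attention outputs have already been computed elsewhere (hypothetical class name):

    import torch
    import torch.nn as nn

    class GatedMemoryMix(nn.Module):
        # local_out:  result of ordinary self-attention over the current chunk
        # memory_out: result of attention over the top-k retrieved cache entries
        # both of shape (batch, heads, seq_len, dim)
        def __init__(self, num_heads):
            super().__init__()
            self.gate = nn.Parameter(torch.zeros(num_heads))  # one learnable gate per head

        def forward(self, local_out, memory_out):
            g = torch.sigmoid(self.gate).view(1, -1, 1, 1)    # squash to (0, 1)
            return g * memory_out + (1 - g) * local_out

After each chunk, the current keys and values are simply appended to the cache without gradients, so "updating the memory" is just a non-differentiable write.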


Top of my head: Rodimus, Bumblebee, Ratchet, Optimus Prime, Laserbeak, Megatron, Astro Train, Jazz


People here don't deserve you :)


This is what I was expecting when I clicked on this submission.


Could there be any merit in training this on a common-sense dataset such as Cyc?

https://www.lesswrong.com/tag/cyc


Probably not; most common facts (a sand cat is a type of feline) are already known by transformers. Maybe some obscure ones.


Love it! It seems like a lot of the ideas from reinforcement learning are making their way into transformer land and NLP.


> On benchmarks including code and mathematics, we find that the model is capable of making use of newly defined functions and theorems during test time.

Train on test, improved performance on test. Wow.


> Wow.

Transformers are very limited in the size of their attention window. They can take a few thousand tokens at most. But your data might not fit into the window, and you also don't want to have to fine-tune the model. This paper offers a solution.


It isn't being trained on the test set. That's kind of the point of memory: you can change the memory at will and don't need to train on new information you have never seen before.


The ‘ethics’ section seems surprisingly cursory and lacking in references.

“The ability to memorize large databases of facts could have potential ramifications for society, especially if those databases include sensitive personal information or copyrighted works. However, one advantage of using an external memory is that the memory can be easily cleared of all such information”

That’s it? Just ‘may have ramifications’?

No concern that this enables ‘Tay’-like failure modes where a system can be manipulated through input into generating particular output?

Or even just grappling with whether adding ‘memory of experiences’ to a language model might open the door to creating a system that has beliefs, or opinions…? and that maybe there might be some ethical concerns with just wiping that out?


That'd be a waste of space. Most transformer models have the same ethical concerns, which have been addressed in countless other papers. Why bother copy pasting the same essays in every minor tweak of transformers?


The ethics sections for ML papers almost always seem extremely superfluous. It's like asking a CPU designer to talk about the danger that their CPU can run code for computing firing trajectories. It's a paper about providing memory to ML models, it'll have all the possible applications that require memory, what else does one need?


The ethics section is a tacked-on thing which is required by some large ML conferences. They're essentially a PR stunt. No ML researcher I know cares about it, or devotes more than the 5 minutes it takes to write some platitudes to the task. There are simply no incentives to write this properly. And quite frankly, I don't think there should be. We are educated, paid and motivated to push the boundaries of research, not to think about all potential fallout (which, let's face it, would usually require a whole additional paper for most meaningful contributions). I don't really see how we could change this.

TL;DR: as a general rule, you can ignore the ethics section of ML papers.


> We are educated, paid and motivated to push the boundaries of research, not to think about all potential fallout

That’s the whole problem that led to the introduction of these sections.


That's debatable; would an "ethics" section on the original deepfake paper have changed anything?

ML research isn't as inaccessible as genetics research; if there's something idiotic that people can do with DL, they will eventually do it. Acting as if having people add a paragraph to their paper where they "reflect" on the consequences will change anything only shows how disconnected you are from reality.

Research is research, there shouldn't be any "forbidden knowledge", we have laws for a reason.


> not to think about all potential fallout

You're doing it wrong then.

Ignoring ethics is lazy.


Yep, this is correct.


>> TL;DR: as a general rule, you can ignore the ethics section of ML papers.

More generally still, you can ignore the ethics of ML researchers- pretty much for the same reasons that you can ignore the Great Turnip of Justice in the sky.


I'm not sure it's scientific or helpful to include the risk that a program develops "beliefs" or "opinions", or that terminating the program is "wiping [someone] out".


> No concern that this enables ‘Tay’-like failure modes where a system can be manipulated through input into generating particular output?

Isn't that the core idea in prompting and few shot learning for large language models?


My feeling is that those topics would be best addressed in a separate paper by authors who have more of a background in ethics.



