I don't think that is true? As far as I know, the grokking phenomenon was first observed (and the name coined) in this paper, not in any blog post:

https://arxiv.org/abs/2201.02177




That's true, and I probably should have backed that up and clarified it better. I remember when that paper came out; it rubbed me the wrong way then too, because it's people rediscovering double descent from a different perspective and not recognizing it as such.

What it would be better described as is "a sudden phase change after a long period of metastability". Even then, that framing ignores that those sharp inflections indicate a large KL divergence between some of the inductive priors and the data at hand.
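
Just to put a rough number on "large KL" (a back-of-the-envelope sketch with made-up values, not anything from the paper): the closed-form KL between two Gaussians grows quadratically in the separation of the means and inversely in the variance, so narrow plus far apart means the divergence is astronomical until something moves.

    import math

    def kl_gauss(mu0, sigma0, mu1, sigma1):
        # Closed-form KL( N(mu0, sigma0^2) || N(mu1, sigma1^2) )
        return (math.log(sigma1 / sigma0)
                + (sigma0 ** 2 + (mu0 - mu1) ** 2) / (2 * sigma1 ** 2)
                - 0.5)

    # Same separation of means, different widths: the narrow case is ~100x worse.
    print(kl_gauss(0.0, 0.1, 10.0, 0.1))  # ~5000 nats
    print(kl_gauss(0.0, 1.0, 10.0, 1.0))  # ~50 nats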

You can think about it as the loss signal from the overlap of two Gaussians extremely far apart with narrow standard deviations. Sure, they technically have support everywhere, but in a noisy regime you're going to see nothing... nothing... nothing... and then suddenly something as you hit that point of support.
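
Here's a toy version of that "nothing... then something" picture (a minimal sketch with arbitrary numbers, nothing more): score a fixed data point under a narrow Gaussian that starts far away and drifts toward it. The log-likelihood is hugely negative and essentially flat -- indistinguishable from nothing in any noisy, finite-precision regime -- until the mean gets within a few standard deviations of the data, and then it jumps all at once.

    import math

    def log_pdf(x, mu, sigma):
        # Log-density of N(mu, sigma^2) evaluated at x
        return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

    # Data sits at x = 0; a narrow Gaussian "hypothesis" walks toward it.
    sigma = 0.05
    for mu in [5.0, 4.0, 3.0, 2.0, 1.0, 0.5, 0.1, 0.0]:
        print(f"mu={mu:4.1f}  log-likelihood={log_pdf(0.0, mu, sigma):10.1f}")
    # Roughly -5000, -3200, -1800, -800, -200, -50, 0, +2:
    # nothing, nothing, nothing... and then suddenly something.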

Little of the literature, or the definitions that have grown up around the word, really takes this into account, which feeds the widespread illusion that this is not a double descent phenomenon, when in fact it is.

Hopefully this is a more appropriate elaboration; I appreciate your comment pointing out my mistake.


Singular learning theory explains the sudden phase changes in generalization in terms of resolution of singularities. Alas, it's still associated with the LW crowd.

https://www.lesswrong.com/s/mqwA5FcL6SrHEQzox/p/fovfuFdpuEwQ...


If it's any consolation, that post is... hot word-salad garbage. It reads like they learned the words from Wikipedia and then tried to cram as many of them as possible into a post. It's a good litmus test for experience vs. armchair observers -- scanning the article without decoding the phrasing to see how silly the argument is would certainly seem impressive, because "oooooh, fancy math". It's part of why LW is more popular: it's basically white-collar flat-earthery, and many of the relevant topics discussed there have already been hashed out ad infinitum in the academic world and are accepted as general fact. We're generally not dwelling on silly arguments like that.

One of the most common things I see is people assuming that something that came from LW is novel and "was discovered through research published there", and that's because the incentive over there is to make a lot of noise and sound plausible. Arxiv papers, while there is some battle for popularity, are inherently more "boring" and formal.

For example, the LW post as I understand it completely ignores existing work and just... doesn't cite anything that has been rigorously reviewed and prepared. How about this paper from five years ago, part of a long string of research on generalization and loss basins? https://papers.nips.cc/paper_files/paper/2018/hash/be3087e74...

If someone earnestly tried to present the post you linked at a conference workshop, they would not be laughed out of the room; they would have to deal with the long, draining, muffled silence of walking to the back of the room without any applause when it was over. It's not going to fly with academics or professionals who are academia-adjacent.

This whole thing is not terribly complicated either, in my opinion -- a little information theory, the basics, and some time studying and working on it gets someone 50% of the way there. I feel frustrated that this kind of low-quality content is parasitically supplanting actual research with meaning and a well-documented history. This is flashy nonsense that goes nowhere, and while I hesitate to call it drivel, it is nigh-worthless. It barely passes muster as a college essay on the subject, if even that. If I were their professor, I would pull them aside to see whether there is a more productive way for them to channel their interests in the Deep Learning space, and how we could better accomplish that.


I appreciate the thoughts. In such a fast-moving field, it's difficult for a layman without a heavy math background to navigate. There's more academic research I should have pointed to, like https://arxiv.org/abs/2010.11560



