Recent Advances in Natural Language Processing (deponysum.com)
278 points by saadalem on Aug 16, 2020 | 176 comments



I think the point about language being a model of reality was interesting. I have an MA in linguistics including some NLP from about a decade ago and was looking at a career in academic NLP. I ultimately left to become a programmer because (of life circumstances and the fact that) I didn't see much of a future for the field, precisely because it was ignoring the (to me) obvious issues of written language bias, ignorance of multi-modality and situatedness etc. that are brought up in this post.

All of these results are very interesting, but I'm not really feeling like we've been proved wrong yet. There is a big question of scalability here, at least as far as the goal of AGI goes, which the author also admits:

> Of course everyday language stands in a woolier relation to sheep, pine cones, desire and quarks than the formal language of chess moves stands in relation to chess moves, and the patterns are far more complex. Modality, uncertainty, vagueness and other complexities enter but the isomorphism between world and language is there, even if inexact.

This woolly relation between language and reality is well-known. It has been studied in various ways in linguistics and the philosophy of language, for instance by Frege and not least Foucault and everything after. I also think many modern linguistic schools take a very different view of "uncertainty and vagueness" than I sense in the author here, but they are obviously writing for a non-specialist audience and trying not to dwell on this subject here.

My point is, when making and evaluating these NLP methods and the tools they are used to construct, it is extremely important to understand that language models social realities rather than any single physical one. It seems to me all too easy, coming from formal grammar or pure stats or computer science, to rush into these things with naive assumptions about what words are or how they mean things to people. I dread to think what will happen if we base our future society on tools made in that way.


> My point is, when making and evaluating these NLP methods and the tools they are used to construct, it is extremely important to understand that language models social realities rather than any single physical one.

I'd claim actual language involves both a general shared reality and quite concrete and specific discussions of single physical and logical facts/models. Some portion of language certainly looks mostly like a stream of associations. But within it there are also references to physical reality and a world model, and the two are complexly intertwined (making logical sense is akin to a "parity check" - you can go for a while without it, but then you have to look at the whole to get it). I believe one can see this in a GPT paragraph, where the first two sentences seem intelligent and well written, but the third sentence contradicts the first two sufficiently that one's mind isn't sure "what's being said" (and even here, our "need for logic" might be loose enough that we only notice the "senselessness" after the third logical error).


In human language I think physical reality is always a few layers out. Language is social, first and foremost, and naming something is not neutral. We can hardly refer to single objects directly; we mostly do it through their class membership, which always, always includes a whole range of associations, reductions, metaphors, etc. that are cultural.


> In human language I think physical reality is always a few layers out.

Yes, the point is you can neglect physical (and logical) reality for a while in a stream of sentences. But not forever, and that's where current NLP's output has its limits. At a simple level, a stream of glittering adjectives can be added to a thing and just add up to "desirable", unless those adjectives go over a threshold and contradict each other, and then the description can get tagged by the brain as a bit senseless.


Are you an NLP bot?


Oh Gawd, I suppose that quip's topical.

It seems like at this point, there's no way to distinguish the coherent comments of a bot from a person. The bot could have written sentence X for just about any X. It's just that bots can't sustain a stream of logical claims that are consistent with each other.

So it's easier to demonstrate that a paragraph was written by a bot, i.e. that a paragraph makes no sense, than it is to demonstrate that a paragraph was not written by a bot - since both humans and bots write some sensible paragraphs.

Still, I'm egotistic enough to think a bot couldn't come up with that argument though I could be wrong.


Much as I'm assuming the person you're responding to was joking, I've encountered a number of comments/commenters where I felt the same way I felt about GPT output.

The best way I can describe the feeling is that it reminds me of conversations and friendships I've had with schizophrenics, people in the process of having a psychotic breakdown, and people with Alzheimer's.

There's a feeling that what they're saying is not entirely non-sensical, a feeling of 'catching up' to what they're trying to say (akin to translating in a language one isn't too proficient in). But reflecting on the conversation, I find myself wondering how much I managed to understand what they were trying to convey, and how much it was just my brain trying to make sense of something that ultimately doesn't.

'Understanding' or 'communication' aside, I've often valued these kinds of conversations because they tickle the more free-associative side of my own thinking, and the results, however I/we got there, were useful to me.

As a result, I'm much more interested in how these developments in 'AI' might augment this creative process than I am in how they might convincingly appear human. Not that the latter isn't interesting too, though.


These tools seem to be getting pretty good at fiction. In particular, playing around with AI Dungeon, it doesn't believe anything, or alternately you could say it believes everything at once. It's similar to a library that contains books by different authors, some fact, some fiction. Contradictions don't matter. Only local consistency matters, and even then, not that much.

Unfortunately, many people want to believe that they are being understood. But, on the bright side, this stuff seems artistically useful? Entertainment is a big business.


> many people want to believe that they are being understood

Not only that, they also want to understand. I read some computer poetry a few years ago that worked. The computer did not intend any meaning, but I found it anyway.

And of course I did. This is how language works. We assume that people (and by extension, the texts they write) follow a set of rules which structures how we engage with them in a successful manner. Paul Grice formulated this as the cooperative principle in his study of everyday language, but I believe it is exactly what is at play when people encounter NLP output:

https://en.wikipedia.org/wiki/Cooperative_principle


It's probably worse than that.

It's a model of reality as filtered through the human brain, with all of its neuroses, most of which we have only begun to model.

Aren't we stuck on things like sarcasm? How are you going to model everything from confirmation bias to undisclosed PTSD, in order to have a prayer of filtering the noise from actionable information?


> it is extremely important to understand that language models social realities rather than any single physical one...

Oh gosh yes. Gorge the ML tool with the right streams of text and you'd have a glib anti-vax, white supremacist, flat-earth generator pouring out endless logorrheic twaddle.

Oh wait... could that already... it would explain so much...


I found the science exam results interesting and skimmed the paper [1]. They report an accuracy of >90% on the questions. What I found puzzling was that they have a section in the experimental results part where they test the robustness of the results using adversarial answer options; more specifically, they used some simple heuristic to choose 4 additional answer options from the set of other questions which maximized 'confusion' for the model. This resulted in a drop of more than 40 percentage points in the accuracy of the model. I find this extremely puzzling: what do these models actually learn? Clearly they don't actually learn any scientific principles.

[1] https://arxiv.org/pdf/1909.01958.pdf
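For a rough illustration, here is a minimal Python sketch of what an adversarial answer-option heuristic of this kind might look like. The `model_score(question, option)` function is a placeholder assumption standing in for the model's confidence that an option answers the question; this is only a reading of the description above, not the paper's actual procedure:

    def adversarial_distractors(question, correct_option, candidate_pool, model_score, k=4):
        # Score every answer option harvested from *other* questions with the
        # model under test, and keep the k options it finds most convincing.
        # These become the extra "confusing" choices added to the question.
        scored = [(model_score(question, opt), opt)
                  for opt in candidate_pool if opt != correct_option]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [opt for _, opt in scored[:k]]

If accuracy collapses once distractors like these are added, that suggests the original wrong options were simply easy to rule out on surface statistics.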


I would be interested in hearing the results from humans presented with adversarial answer options. You may say that a machine learning correlations between words isn’t really learning science, but I wonder how many human students aren’t either, just pretty much learning correlations between words to pass tests...


They do give an example of a question, in which the model chose an incorrect answer in the adversarial setting:

"The condition of the air outdoors at a certain time of day is known as (A) friction (B) light (C) force (D) weather [correct] (Q) joule (R) gradient [selected] (S) trench (T) add heat"

I assume this might be characteristic for other questions as well, although I don't know anything about the Regents Science Exam and whether there are multiple questions about closely related topics.


That’s a terribly worded question anyway. Of the original answers, ‘weather’ is the least worst but it’s still vague.


It is a well-worded question for its purpose. The whole point is that, of all the options given, only one is justifiable (and it does not require a tendentious stretch to justify it, either.) Even “light” (which was not chosen) only applies half the time, on average. This is a valid test of natural language understanding.


Remember when IBM went on Jeopardy? There was a question about an Egyptian pharaoh. A human with some knowledge of history might mix up Ramses and Seti, or whatever, or just not know the answer, but know that they didn't know. Watson answered "What are trousers?" with supreme confidence.

Jeopardy is fun and games and it was great for the blooper reel, but they're trying to sell this stuff to diagnose cancer and guide police efforts. Failure modes are kind of important.


Most multiple choice math problems can be completely circumvented by simply finding the digital root of the expression in the problem. I was surprised to find this to be true, even on college entrance exams.


Seriously? Can you explain the mechanism? 'cos that sounds like - well, numerology, to be honest.


The digital root of the problem expression will be the digital root of the answer, because the answer IS the problem expression. They're the same number, just with all the pieces jumbled up. Comparing against digital roots of answer choices will usually eliminate most of the wrong answers. Any remaining wrong answers tend to be obviously wrong. 2 + 2 != 13 obviously.


Can you give an example? I am intrigued, although suspect you are making a joke!


You can check the properties of digital roots on wikipedia:

https://en.wikipedia.org/wiki/Digital_root#Properties

Example from ASVAB practice math test:

(x+4)(x+4) =

A. x^2+16x+8

B. x^2+16x+16

C. x^2+8x+16

D. x^2+8x+8

Since we don't have to solve for X in this problem, we can just assume x is 1, which would make the digital root of the problem expression 7. Assuming x = 1, the digital roots of A, B, C and D are 7, 6, 7 and 8 respectively. C is the answer because its last term is the square of 4.
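For anyone who wants to play with this, here is a minimal Python sketch of the trick as described above (evaluate at x = 1 and compare digital roots); the numbers are just the ASVAB example:

    def digital_root(n):
        # Repeatedly sum the decimal digits until a single digit remains
        # (equivalently, 1 + (n - 1) % 9 for positive n).
        n = abs(n)
        while n > 9:
            n = sum(int(d) for d in str(n))
        return n

    x = 1  # we don't need to solve for x, so pick something easy
    question = (x + 4) * (x + 4)
    options = {'A': x**2 + 16*x + 8, 'B': x**2 + 16*x + 16,
               'C': x**2 + 8*x + 16, 'D': x**2 + 8*x + 8}

    target = digital_root(question)  # 7
    survivors = [k for k, v in options.items() if digital_root(v) == target]
    print(target, survivors)  # 7 ['A', 'C'] -- and A is obviously wrong (8 is not 4 squared)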


Perhaps I am being dumb, but I don't see what the digital root brings to this.

You can do the same thing without the digital root. Substituting 1 into all the possible answers gives: A: 25, B: 33, C: 25, D: 17.

Substituting 1 into the question gives 25. Therefore only A and C are possible answers, and it must be C because 4*4 = 16.

This is the same process as your answer but without the digital root!


Agreed. The real trick here is knowing that, since we don't need to find x, we can sub in an arbitrary x and do the comparison numerically instead of algebraically.

Of course there's a tiny chance that one of the other equations would be equal at your arbitrary x value. If so and you get two 'right' answers just try again with a different x.


It is the DeMillo-Lipton-Schwartz–Zippel lemma

If two polynomials evaluate to equal numbers at random variable values, the polynomials are almost certainly equal themselves.

Ofc, x = 0 and x = 1 are hardly random
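A tiny Python sketch of that idea, picking a handful of large random points rather than 0 or 1 (this assumes the answer options really are polynomials, as in the example above):

    import random

    def probably_equal(p, q, trials=5, bound=10**9):
        # Two distinct polynomials of degree d can agree on at most d points,
        # so agreement at several random points makes equality overwhelmingly likely.
        return all(p(x) == q(x)
                   for x in (random.randint(-bound, bound) for _ in range(trials)))

    lhs = lambda x: (x + 4) * (x + 4)
    print(probably_equal(lhs, lambda x: x**2 + 8*x + 16))   # True  (option C above)
    print(probably_equal(lhs, lambda x: x**2 + 16*x + 16))  # almost certainly False (option B)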


I suppose such is possible without digital roots. Let's try the same problem, but this time we'll multiply our constants by 3,571:

(x + 14284) (x + 14284)=

A: x^2 + 204032656x + 28568

B: x^2 + 204032656x + 204032656

C: x^2 + 28568x + 204032656

D: x^2 + 204032656x + 204032656

The digital root of the problem expression is 4. The digital roots of A, B, C and D are 4, 6, 4 and 3 respectively. The digital root of the square of 14284 is 1, and the digital roots of the last term of A and C are 2 and 1, respectively. The answer is C.


Of course, it is quicker just to multiply and add the last digit of the constant to find the only possible answer (or count digits to get an order of magnitude estimate, which will also show C as the only possible answer).

When a problem of this form uses big enough numbers to warrant a shortcut, digital root is almost always a suboptimal shortcut.


Yes you are the heuristics master!


Maybe in the case of large numbers you can compute the digital root more easily than the answer.

For example, 25 * 26 = 650, digital root 2. Or take digital root of 25 and 26 first, get 7 * 8 = 56, same digital root 2.


I can't decide if I've just seen magic, or conjuring. I'm gonna have to go away and think about this.


This is true, but in many cases amounts to reducing the problem to a more time consuming problem, though one that requires less knowledge. On timed tests, this can be a big negative.


Adversarial training means that they specifically search for answers that the network would misunderstand. If this only leads to a 40 percentage point drop (the network still answers correctly about 50% of the time), I still consider that remarkable.

Choosing the best from 8 answers where some of them were adversarially derived should be equivalent to choosing the best from all possible answers, of which there could be tens of thousands. How would a human do in that situation?

Although the kinds of mistakes the network makes seem like mistakes you would never make (i.e. you wouldn't call the condition of the air outdoors a gradient), the opposite could also be true - that it would easily answer questions you would have a problem with.


I admire the way that you have reinterpreted the issue, but I don’t buy it.

For one thing, I can imagine a knowledgeable human working through ten thousand alternatives and picking the best (if, objectively, there is one) - it would just take a long time. On the other hand, it does not seem obvious to me that current NLP systems would do better than humans on such a task, nor is it obvious to me that one can conclude that from the fact that the machine sometimes ignores adversarial examples (if the assumption is that most of the ten thousand would be de facto adversarial, that is a lot not to make one mistake on; that's not so much of a problem for a human with an understanding of the issue, who might well come up with the right answer unprompted.)

This is probably all moot, however, as the most obvious way to compare the machine results to those of humans is to give both the actual tests in question, rather than substitute a dubious “equivalent” test.


That’s the thing. Machine Learning is a misnomer, the models don’t “learn” anything about the domain they operate in. It’s just statistical inference.

A dog can learn to turn left or to turn right for treats. But they don’t understand the concept of “direction”, their brain isn’t wired that way.

Machine learning models perform tricks for treats. The tricks they do get more impressive by the day. But don’t be deceived, they aren’t wired to gain knowledge.


My god if I hear this argument one more time I'm going to pop. What on earth gives you the idea that you are something more than statistical inference?


Yes, it’s widely believed that statistical inference is a part of how the brain operates. But we have barely scratched the surface in our understanding how the human brain works.

Do you honestly believe statistical inference completely explains a human’s ability to learn?


Of course.


I think that the reason your mind pops is that you are not trying to understand what the person saying that actually means.

Stating that we are no more than statistical inference is no different from saying that we are no more than math - really, you can build the deepest mathematical truths on statistical inference.

The thing is that the math/s.i. that ML models do is too trivial in comparison to the math that would be required to describe a human identity. It can detect faces very well, it can calculate fifth roots very quickly, it can solve discrete problems quickly, yes, but its achievements are, in comparison to actual human understanding (and purpose), closer to those of another tool, like a hammer, than to those of humans.

edit: typos/grammar


I hear what they are saying and I reject it as trivially incorrect. Of course all we are is math. What else do you think there is?

"The thing is that the math/s.i. that ML models do is too trivial in comparison to the math that would be required to describe a human identity" - why do you think this? How do you know the expressive power of neural networks? How have you gauged the expressive power of biological neural networks? How have you drawn any of these conclusions?

I am exasperated.


You are exasperated because when I wrote "that ML models do is too trivial..." you did not realize that by ML I meant "the current state of the art of ML"...

I never assumed we are more than math. Just that there is a lot still to learn about the structure in which those building blocks are organized in order to fully tame what is understood as human intelligence. If we could simulate all the neurons of a real adult living brain accurately enough, it would be intelligent from a human perspective, and still, both the model and us would be equally far, very far, from understanding how to arrive at that point of neural organization to achieve that kind of intelligence.


For a moment I had myself convinced that your response was written by GPT-3. Part of me hopes that is the actual fact.


Because statistical inference does not create complex operating models designed to fit a subset of reality from the larger encompassing model of reality on which we humans base our daily decisions and life philosophies. Just because we do not dive deep into the complexities of existential questions every waking moment does not mean we're not operating scaffolds which depend upon prior existential examinations. We're complex, to say the least, and we require reasons - and that very idea, reason, is what is not present in statistical inference. Without a prior body of human conversation, statements, and larger expressions of philosophy based upon human reason, statistical inference goes nowhere.


>reason is what is not present in statistical inference.

But why think reason isn't a kind of statistical inference? GPT-3 demonstrates a rudimentary capacity for reasoning, and it's "merely" a statistical inference engine.


Because it is possible to invent or imagine new things without resorting to endless random walks or blind trial and error.

Induction isn’t probabilistic. It’s the origin of all discovery and based in sussing out what’s important in patterns, selectively proposing causal mechanisms and testing those hypotheses by following priors and implications — an essential basis for original and creative thought that no purely probabilistic engine can employ.


You think a probabilistic engine can't model a human doing induction? At a certain level of capability, mere pattern imitation closes upwards.


Doesn't your brain derive knowledge through statistical inference? How else? I can imagine the AGI of the future (having perfected knowledge gathering through statistical inference) some day saying: "Humans perform tricks for treats. The tricks they do are impressive. But don’t be deceived, they aren’t wired to gain knowledge."


I didn’t claim that the way brains work didn’t include statistical inference. That’s why the dog metaphor works. Both the dog and the machine learning model use statistical inference to perform tricks.

Dogs haven’t been to the moon, however. There’s more to the brain we don’t understand.


I'm curious why you think a "difference in kind" rather than simply "difference in degree" must exist between us and dogs.


Adversarial examples probably exist for humans too, they are just less easy to find since we can't easily backprop through human judgements like we can a bunch of array multiplications and dot products.

Doesn't mean that they are not "actually" learning any more than we are.
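For what it's worth, the asymmetry is concrete: for a differentiable model, a single backward pass gives you the input direction that most increases the loss. A minimal sketch (assuming PyTorch, which is my choice of framework here, purely as an illustration):

    import torch

    def fgsm_example(model, x, y, loss_fn, eps=0.01):
        # Fast Gradient Sign Method: backprop once through the model to get the
        # gradient of the loss with respect to the *input*, then step slightly in
        # the direction that increases the loss. There is no equivalent way to
        # "backprop through" a human judge, which is why adversarial examples
        # for humans are much harder to search for.
        x = x.clone().detach().requires_grad_(True)
        loss_fn(model(x), y).backward()
        return (x + eps * x.grad.sign()).detach()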


These are "adversarial answers" - additional answers to a multi-choice test. If you know the correct answer to the test, adding more possible answers shouldn't make you change your selection.


> Models are transitive- if x models y, and y models z, then x models z. The upshot of these facts are that if you have a really good statistical model of how words relate to each other, that model is also implicitly a model of the world.

This right here is a great way of putting the success of GPT-3 into context. We think GPT is smart, because when it says something eerily human-like we apply our model of the world onto what it is saying. A conversation like this:

> Me: So, what happened when you fell off the balance beam?

> GPT: It hurt.

> Me: Why'd it hurt so bad?

> GPT: The beam was high up and I fell awkwardly.

> Me: Wow, that sounds awful.

In this conversation, one of us is thinking far harder than the other. GPT can have conversations like this now, which is impressive. But only I can model the beam, the fall, and the physical reality. When I say "that sounds awful," I actually do a miniature physics simulation in my head, imagining losing my balance and falling off a high beam, landing, the physical pain, etc. GPT does none of that. In either case, when it asks the question or when it answers it, it is entirely ignorant of this sort of "shadow" model that's being constructed.

Generalizing a bit, our "shadow" model of reality in every single domain is far more powerful than language's approximation. That's why we won't be able to use GPT to do a medical diagnosis or create a piece of architecture or whatever else people are saying it's going to do now.


> In this conversation, one of us is thinking far harder than the other. GPT can have conversations like this now, which is impressive. But only I can model the beam, the fall, and the physical reality. When I say "that sounds awful," I actually do a miniature physics simulation in my head, imagining losing my balance and falling off a high beam, landing, the physical pain, etc. GPT does none of that. In either case, when it asks the question or when it answers it, it is entirely ignorant of this sort of "shadow" model that's being constructed.

GPT-3 could presumably write a paragraph like that one. You can claim to have a working physics model in your head, but why should I believe that unless it becomes evident from the things that you communicate to me? I've certainly met humans who could have a superficially legitimate conversation about objects in motion while harbouring enormous misconceptions about the physics involved.

Maybe the biggest takeaway from GPT-3 should be that we should raise our standards for human conversation, demanding more precise language and giving less credit to flourishes that make the meaning ambiguous.


I mean the miniature physics simulation of me actually imagining another embodied human falling off of a beam. There is a huge knowledge of the physical world encoded into that. Think about it, you can viscerally imagine balancing on a beam, losing your footing, falling, and the subsequent forces and accelerations on all parts of your body during and after the fall, in real time. All of that information is in our model, and none of it is in GPT's. Everyone has this, it's builtin. I'm not talking about Newton's Laws here.

The physical mechanics of being an embodied thing isn't the only example of this. Any experiential knowledge, any knowledge with feedback suffers from this. And any knowledge that is difficult to put into words, like emotional knowledge, interpersonal knowledge, etc., will be off-limits.

That's my point.


While I don't disagree, if I were to have the same conversation I would probably be spending more time thinking about the appropriate response than trying to remember and simulate an experience even remotely similar. So I'd be more like GPT, perhaps?


Wow, that last point is profound; it would almost solve the problem of misinformation, bogus headlines, doublespeak, etc. Simply raising the standard for what good speech is.


The author's quote is asserting the opposite of what you're saying here.

Filling in "GPT" for x, "text" for y, and "world" for z, the author is stating "if GPT models text, and text models the world, then GPT models the world."

In particular, the author directly addresses your point that "Generalizing a bit, our 'shadow' model of reality in every single domain is far more powerful than language's approximation."

> Modality, uncertainty, vagueness and other complexities enter but the isomorphism between world and language is there, even if inexact.

My own personal view tends to lean towards the author's. GPT-3, despite its many astonishing achievements, has many, many limitations. Yet I don't see any indication as of yet that any of those limitations are fundamental ones inherent to the approach behind the GPT series, which is why GPT-3 both excites and frightens me.


Yes, fundamentally I agree, of course. If you could find a way to get all of natural knowledge into a sufficiently large network, you would probably have a very intelligent network. But getting text in is easy. It's the most manipulable proxy model. Getting other data in, data that represents what I'm talking about, is much much more difficult.


Getting in, say, video frames instead of words is more difficult, but possible imo. After all, they demonstrated it’s possible to do with raw audio: https://openai.com/blog/jukebox


GPT models the world through text*

There are many other things that go into modelling the world, like drawing pictures, visualising things, communicating in non-text ways essentially. GPT can only model the world in the ways that text can model the world, which is limited.


Although a differentiating factor in the near-term, in the grand scheme of things that is merely an implementation detail.

The rise of computing and its myriad encoders is an existence proof that the world we humans know can be, to an arbitrary degree of precision, modeled by 0s and 1s and what are 0s and 1s but the purest essence of text?

Although fraught with many large implementation difficulties, a GPT trained on binary data does not represent a fundamental difficulty vs one trained on text.


No, and not even close. Text cannot model many things. Try learning surgery using a textbook without pictures. Try driving a racecar competitively or learning a foreign language after only reading a book, even one with pictures.

Models of all kinds are always poor surrogates for reality, and models that cannot employ logic or causality CLEARLY cannot model a world in which mechanisms cause all change.

Statistical models can indeed describe many observations, but they can never break out from the echo chamber of copy-catting patterns they have already seen. If a probabilistic engine like a deep net hasn't been exposed to a concept during training, it will never induce its existence. Imagination requires initiative, the proposal of an unknown and unfamiliar outcome via logical inference or causal induction. Until deep nets can employ both of these skills, they will never master many skills humans use routinely to explain the world or extend our understanding of how it works.


If logic helps you predict the next word in the text, then a text based model will learn logic.


What makes you think it’s limited to text? See for example https://openai.com/blog/jukebox


Right. I recently saw a podcast with Dileep George of Vicarious AI. I don't know if his specific techniques will work fully generally but at the high level he is talking about something quite similar to what you are saying in terms of grounding language understanding and having real models of the world. So I am definitely following him.

More broadly there is a growing group of researchers who are working on trying to achieve better world modeling (although of course there have been many for decades, it just seems that the number who are coming from deep learning towards AGI with more emphasis on world modeling is increasing).

Such as Ferran Alet. Sort of the Tenenbaum school I guess. Or Melanie Mitchell has been saying similar for quite awhile. They have recent talks on Good AI's YouTube channel.


Not a single mention of whether this is only applicable to English or also to other natural languages. Afaict this mostly lists advancements in ELP (English language processing). Especially the Winograd schema (or at least the given example) seems to be heavily focused on English.

Relevant article for this problem: https://news.ycombinator.com/item?id=24026511


But there's no reason the models are English-specific...


Yes, there is. A "model" is a set of parameters optimised by some algorithm or system trained on the data in a specific dataset. Thus, a language model trained on a dataset of English language examples is only capable of representing English language utterances, not, e.g., French, or Greek, or Gujarati utterances. Diagrammatically:

  data --> system --> model
What is not necessarily English-specific are the systems used to train different language models, at least in theory. In practice, systems are typically hand-crafted and fine-tuned to specific datasets, to such a degree that most of the work has to be done anew to train on a different dataset.


This is very interesting, and I would be interested in learning more about the language-specific accommodations that are made. My minimal understanding of the GPT systems is that they are initially trained on a large corpus of English-language text, and then often, though not necessarily, given few-shot training on example tasks before being tested with questions that are scored.

It would be unremarkable if the same system would have to be retrained from scratch to work with a different language, but I take it that you mean more than that. One guess is that the training data is annotated with grammatical information, in which case I would wonder if this is just a shortcut, or whether it solves a fundamental problem for such systems. Another guess is that the training set includes disambiguation, but that would seem to render meaningless the results on Winograd schema. (Update: I withdraw this last point, as presumably the Winograd-schema questions themselves are not disambiguated. Disambiguation of the training corpus would be a quite significant language-specific accommodation, if that is what is happening.)


> It would be unremarkable if the same system would have to be retrained from scratch to work with a different language, but I take it that you mean more than that.

It really is that unremarkable, despite the GP's insinuations to the contrary.


If training a language model was "unremarkable" we wouldn't have a hypefest every time anyone releases a new model, with a catchy name no less. BERT, RoBERTa, GPT-2/3/... etc. are remarkable enough for their creators to write papers exclusively for the purpose of announcing the model and describing the architecture used to train it etc.

In any case, I don't remember anyone ever naming their n-gram models or HMMs etc. The reason for this of course is that training a language model with a deep neural network architecture is anything but unremarkable, not least because the costs in data, human-hours and compute are beyond the reach of most entities other than large corporate teams. If training a new model from scratch were truly "unremarkable" we'd all have our own.


> If training a language model was "unremarkable" we wouldn't have a hypefest everytime anyone releases a new model, with a catchy name no less. BERT, ROBERTA, GPT-2/3/... etc. are remarkable enough for their creators to write papers exclusively for the purpose of announcing the model and describing the architecture used to train it etc.

Actual NLP practitioners are not having a hype-fest every time someone spends a shitton to train a transformer LM on a new dataset. It's definitely cool, but I wouldn't call it remarkable. What was cool about GPT-3 was the demonstration of the effectiveness of the "few-shot"/"conditioning" approach. I didn't find it surprising, but it was cool to push it to the next level.

> In any case, I don't remember anyone ever naming their n-gram models or HMMs etc

That's because they are literally different model architectures. The 3-gram model is the same everywhere, it doesn't depend on dataset. A 3-gram "trained" on Russian would still be a 3-gram. Similarly, BERT trained on a German corpus would still be BERT, it wouldn't need a "new catchy name." That's my point.

As a side remark, you are vastly overestimating the amount of dataset specific tuning that you need - you can often just use the exact same hyperparams and run it on a new dataset - there's been some recent literature showing that a lot of these architectures are actually quite resilient to different datasets and hyperparameter choices and arrive at similar local minima. Sure, you might be off by 1-2% from SOTA accuracy (let's say on a classification task) but in production systems you're generally not shooting for the SOTA that was released in the last 2 months but rather something close-ish to those results.

Honestly, quite confused this is the hill you want to die on.


I'm unsure what you mean by "actual NLP practitioners". In any case does it really make a difference? We've been having GPT-3 related articles on HN and in the techie press for weeks after GPT-3 was published.

>> That's because they are literally different model architectures. The 3-gram model is the same everywhere, it doesn't depend on dataset. A 3-gram "trained" on Russian would still be a 3-gram. Similarly, BERT trained on a German corpus would still be BERT, it wouldn't need a "new catchy name." That's my point.

N-gram models are not "the same everywhere". An n-gram model is a set of probabilities assigned to, well, n-grams. Different sets of n-grams, or probabilities trained from different datasets, would be different. The n-grams would be different and the probabilities would be different and the structure of the model would be different. A Russian n-gram model would represent the probabilities of Russian n-grams, an English one would represent the probabilities of English n-grams and so on. The type of model would be the same, an n-gram model; but the model itself would be different. Otherwise, why bother training more models?

Same goes for a model trained with a transformer architecture. BERT is an English language model trained with a transformer architecture and representing the probabilities of tokens in an English corpus. A model trained with a transformer architecture and representing the probabilities of tokens in a German corpus would be, again, a different model. You could also call it "BERT" but that would eventually become very confusing.

>> Honestly, quite confused this is the hill you want to die on.

I'm not dying on any hill. Could you please not do that?


I disagree with how you define "model" to mean the parameters, rather than the form of the relationship between input and output. I don't think every time we make a new estimate of the electron mass, we are creating a "new" Standard Model.

I don't really think it's super worth quibbling over, since I think we're generally on the same page. It's worth noting that BERT trained on different corpora is generally still called BERT (you can look at the huggingface/transformers library, for instance).

I'm sorry that my message came across as more adversarial than intended.


>> I'm sorry that my message came across as more adversarial than intended.

Thanks. I enjoyed our conversation.


You have hung a lot on the single word “unremarkable”, so perhaps I did not make it completely clear what I thought was unremarkable. The only thing that I meant was that if a given architecture for NLP would require retraining from scratch in order to work with a new language, I would not think that fact would be surprising, or be indicative of some fundamental flaw or limitation in the approach taken.

If, on the other hand, a given architecture could not be retrained to perform well on a different language, or only on a group of related languages, that might be indicative of a fundamental limitation in how far it could be developed, depending on what sort of accommodations have to be made for new languages.


Thanks for clarifying. Part of the reason why training a new model from scratch is so difficult is the amount of dataset-specific tweaking and fine-tuning that goes on and that is rarely discussed in publications. For example, papers will describe the architecture of a system but will rarely go into much detail about how a particular component was selected over another (why this cost function over that one, etc). This fine-tuning is essential to achieving state-of-the-art results, and without it you can expect to produce a model that is simply not the same as the original, certainly in terms of performance metrics.

Basically, retraining the same system on a different dataset is a bit like stepping into the same river at two points in time. Yes, in theory it's "the same" system and only the dataset changes. In practice, it takes so much architectural tweaking and fine-tuning that it's a different system and you've done almost all the work from scratch. This is also part and parcel of why so much fanfare surrounds the publication of a new model: because it's really, really hard to train a new model on a new dataset (assuming one wants to get state-of-the-art results).


I was using model to mean the architecture, ie. the way you're modeling the data flows - not the specific params, which are ofc language specific.


Is it as "simple" as retraining a model on an identical (but translated) dataset, if such a dataset were to exist?


In a certain way, they could be, simply because their structure may work better on the type of language that English is. It may not work as well for languages that exhibit other grammatical patterns (e.g. morphologically rich languages, or those that exhibit superficially more flexible word order which nonetheless conveys information about topic/focus).


Some languages like Chinese have a large corpus of information that can be studied, and there has been a continuous effort to standardize the language. It wouldn't surprise me if it turns out to be easier than English. I would expect it to be more difficult when you're talking about languages that have regional differences in usage of words, perhaps northern Vietnamese vs. southern Vietnamese, or the Spanish of Spain versus the Spanish of Mexico.


Chinese has large regional differences, too. Cantonese has different grammar, not just word usage. Modern Chinese grammar is somewhat similar to English, while Cantonese follows ancient Chinese, which places the verb at the end of the sentence.


Cantonese is, like English and Mandarin, SVO.

E.g. I bought this book: ngóh(S) maáihjó(V) nìbún syù(O).

Contrast with Japanese: watashi wa(S) kono hon o(O) kaimashita(V).

Differences from Mandarin Grammar are listed here: https://en.wikipedia.org/wiki/Cantonese_grammar#Differences_...

Classical Chinese is also SVO: https://en.wikipedia.org/wiki/Classical_Chinese_grammar


True, and Mandarin would be the better word to use here. Mandarin and Cantonese are for all intents and purposes separate languages. They are more different than English and Spanish.


Chinese is the written language, based on Mandarin. Mandarin and Cantonese are separate, though closely related, spoken languages. They're a lot closer to each other than English and Spanish (more like French and Spanish), which is why the same written language works for both sets of speakers: most of the vocabulary is shared.


This is only because almost all native Cantonese speakers are trained to read and write in Standard Chinese from the very beginning.

Standard Chinese can be spoken with Cantonese pronunciation, but while understandable, it is never done, in contrast to Mandarin.

In recent years written online conversations (like blogs, Facebook posts, instant messages) by Cantonese speakers are increasingly written in Cantonese instead, and this is slowly spreading to more "formal" media such as books.

Source: I'm a native Cantonese speaker.


Yes, it seems I underestimated the difference: "In spoken Cantonese the frequency of use of Cantonese dialect words is close to 50 percent (Li, et. al., 1995, p236)." (from https://www.aclweb.org/anthology/P98-2238.pdf).

They're still a lot closer than English and Spanish or even French and Spanish (https://www.ezglot.com/most-similar-languages.php).


If we're going down that path, then it should more specifically be Standard Mandarin, since Standard Mandarin and e.g. Southwestern Mandarin are only partially mutually intelligible like French and Spanish are partially mutually intelligible.


All languages have significant regional differences if they cover enough regions.


For some of them, there is. Russian (my mother tongue), for example, has the kind of morphology that makes linguists' hair stand on end. Also, verbs and nouns are gendered (and therefore they must agree, and you must know what everything refers to at all times, often pretty far away from where the noun was first mentioned), and there are declensions on various parts of speech, and they must agree as well. And words can be freely formed out of several roots, prefixes, and suffixes. For any of this to be understood or generated, your model has to be able to model all of that, which is much harder than modeling English.


I think that modern BPE and related techniques are more than sophisticated enough to handle things like this.
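To make that concrete, here is a minimal sketch of the classic BPE merge loop on a toy corpus (the standard textbook-style example, not a production tokenizer). The point is that frequent stems and affixes end up as single tokens, while rarer inflected forms still decompose into known subword pieces, which is what helps with morphologically rich languages:

    import re
    from collections import Counter

    def pair_counts(vocab):
        # Count adjacent symbol pairs across the (word -> frequency) vocabulary.
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        return pairs

    def merge_pair(pair, vocab):
        # Replace every occurrence of the pair with its concatenation.
        pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
        return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

    # Words as space-separated characters with an end-of-word marker.
    vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
             'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

    for _ in range(10):
        pairs = pair_counts(vocab)
        if not pairs:
            break
        vocab = merge_pair(max(pairs, key=pairs.get), vocab)

    print(vocab)  # frequent subwords like 'est</w>' and 'low' have become single units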


Could be, but I've yet to see an example of any NLP task that performs anywhere near as well on Russian as it does on English. The trouble with NLP in general is that humans, armed with their imperceptible error correction and general intelligence, expect unreasonable things from systems which fundamentally have neither. An error rate in single-digit percentages is commonly perceived as terrible, and because of the cardinality of the input space, "plausible" errors don't really happen and failure modes are mostly catastrophic.

This applies to all languages BTW. There are only very, very few NLP systems that actually work, and they are all in a setting which tolerates a fairly high error rate (e.g. ranking).


> any NLP task that performs anywhere near as well on Russian as it does on English

This has a lot more to do with smaller Russian corpora than anything to do with peculiarities of the Russian language.

> This applies to all languages BTW. There are only very, very few NLP systems that actually work, and they are all in a setting which tolerates a fairly high error rate (e.g. ranking).

I've trained NLP models for over 15 different languages that are used in production.


I'm 100% sure NLP will make the same mistake with non-English languages as face recognition did with faces of black people.

I'm taking bets.


ie. a dataset/corpus problem?


Darn, based on the title, I was hoping for an overview of recent research.

Lots of people are having fun playing with GPT-3 or AI Dungeon, myself included, but it seems like there is other interesting research going on like the REALM paper [1], [2]. What should I be reading? Why aren't people talking about REALM more? I'm no expert, but it seems like keeping the knowledge base outside the language model has a lot going for it?

[1] https://ai.googleblog.com/2020/08/realm-integrating-retrieva... [2] https://arxiv.org/abs/2002.08909


REALM is amazing, and Google's PEGASUS[1] taught itself to count between 2 and 5 (which is mindblowing).

Basically the NLP groups from Google and FB[2][3] are always worth watching.

[1] https://ai.googleblog.com/2020/06/pegasus-state-of-art-model...

[2] https://ai.facebook.com/blog/covost-v2-expanding-the-largest...

[3] https://ai.facebook.com/blog/introducing-a-new-large-scale-d...


I think there's a pretty large gap between what people in academic NLP/industry labs find interesting and what the bitcoin/VC/whatever people on twitter seem to find interesting.

REALM is definitely cool, and I agree that techniques like this are the way forward.


Separate comment because separate point.

To my mind, all this benchmark-beating is just pointless faffing about, and it is not a little frustrating to see that this is where the current state of the art is. NLP systems keep getting better and better at beating benchmarks, and yet we still don't have artificial agents that can hold an intelligent conversation, even in a restricted context.

Or rather, we have had such programs since the late 1960s, with none other than Terry Winograd's own doctoral thesis system, SHRDLU [1]; but it seems that in the race to build powerful classifiers, we forgot how to build systems that can interact intelligently with a human.

Here's a small excerpt from SHRDLU (taken from the wikipedia page on SHRDLU, originally in an appendix in Terry Winograd's thesis):

    Person: Is there a large block behind a pyramid?
    Computer: YES, THREE OF THEM: A LARGE RED ONE, A LARGE GREEN CUBE, AND THE BLUE ONE.
    Person: Put a small one onto the green cube which supports a pyramid.
    Computer: OK.
    Person: Put the littlest pyramid on top of it.
    Computer: OK.
    Person: Does the shortest thing the tallest pyramid's support supports support anything green?
    Computer: YES, THE GREEN PYRAMID.
    Person: What color is it?
    Computer: BY "IT", I ASSUME YOU MEAN THE SHORTEST THING THE TALLEST PYRAMID'S SUPPORT SUPPORTS.
    Computer: RED
Note that this is not just mindless text generation. The conversation is held in the context of a "blocks world" where a robot with a single hand and eye ("Computer" in the excerpt above) moves blocks of various shapes and colours around, as directed by a human user in free-form natural language. When the Computer says "OK" after it's directed to "put the littlest pyramid on top of it", it's because it really has grabbed the smallest pyramid in the blocks world and placed it on top of the small block from the earlier sentence, as the Person asked. The program has a memory module to keep track of what referring expressions like "it", "one" etc. refer to throughout the conversation.

SHRDLU was a traditional program hand-crafted by a single PhD student - no machine learning, no statistical techniques. It included, among other things, a context-free grammar (!) of natural English and a planner (to control the robot's hand), all written in Lisp and PLANNER. In its limited domain, it was smarter than anything ever created with statistical NLP methods.

______________________

[1] https://en.wikipedia.org/wiki/SHRDLU


We knew a long time ago that hand-crafted programs in limited domains can work for NLP, computer vision and voice recognition. The challenge is always that the limited domain can be extremely limited, and getting anything practically interesting requires a lot of human involvement to encode the world (expert systems).

Statistical methods traded that away. With data - some labelled, some unlabelled and some weakly labelled - we can generate these models with much more efficient human involvement (refining the statistical models and labelling data).

I honestly don't see the frustration. Yes, current NLP models may not yet be the "intelligent agent" everyone is looking for, to any extent. But claiming it is all faffing and no better than the 1960s is quite a stretch.


>> We knew hand-crafted program in limited domains can work for NLP, computer vision and voice recognition long time ago.

Yes, we did. So- where are the natural language interfaces by which we can communicate with artificial agents in such limited domains? Where are the applications, today, that exhibit behaviour as seemingly intelligent as SHRDLU in the '60s? I mean, have you personally seen and interacted with one? Can you show me an example of such a modern system? Edit: Note again that SHRDLU was created by a single PhD student with all the resources of ... a single PhD student. It's no stretch to imagine that an entity of the size of Google or Facebook could achieve something considerably more useful, still in a limited domain. But this has never even been attempted.

Yes, it is faffing about. Basically NLP gave up on figuring out how language works and switched to a massive attempt to model large datasets evaluated by contrived benchmarks that serve no other purpose than to show how well modern techniques can model large datasets.


> It's no stretch to imagine that an entity of the size of Google or Facebook could achieve something considerably more useful, still in a limited domain. But this has never even been attempted.

How do you know if no one has attempted it, or if all attempts so far have failed to achieve their goals? One of the claimed downsides of "hand-crafted systems in limited domains" (i.e., something like the SHRDLU approach) is that they would take too much effort to create when the domain is expanded to something even slightly bigger than SHRDLU's domain, so a lack of successful systems could be evidence of no one trying, or it could be evidence that the claimed downside is indeed true.

The fact that a working system for the limited domains of e.g. customer support or medical diagnosis would be worth a lot of money suggests to me that they must have been tried, but that nothing useful could be built, and we didn't hear about the failed attempts, meaning that those domains (at least) are too big for hand-crafted systems to work.

> Yes, it is faffing about. Basically NLP gave up on figuring out how language works and switched to a massive attempt to model large datasets evaluated by contrived benchmarks that serve no other purpose than to show how well modern techniques can model large datasets.

It is inaccurate to say that all benchmarks are useless. For instance, language translation (as in Google Translate) is a benchmark NLP task, but it is also something I personally use at least every week, and deep learning based solutions beat handcrafted systems by a lot for this particular task (speaking as an end user who has used systems based on both approaches). The same comments apply to audio transcription (e.g. generating subtitles on YouTube) as well.


>> The fact that a working system for the limited domains of e.g. customer support or medical diagnosis would be worth a lot of money suggests to me that they must have been tried, but that nothing useful could be built, and we didn't hear about the failed attempts, meaning that those domains (at least) are too big for hand-crafted systems to work.

That's actually a good point and one I've made myself a few times in the past- negative results never see the light of day. So, yes, I agree we can't know for sure that such systems haven't been attempted, despite the certainty with which I assert this in my comment above.

On the other hand, there are some strong hints. First of all, there was an AI winter in the 1980's after which most of the field turned very hard towards statistical techniques and away from the kind of work in SHRDLU. This kind of work became radioactive for many years and it would have been very hard to justify putting a PhD student, or ten, to work trying to even reproduce it, let alone extend it. That's in academia. In the industry, it's clear that nowadays at least companies like the FANGs strongly champion statistical machine learning and anyone proposing spending actual money on such a research program ("Hey, let's go back to the 1960's and start all over again!") would be laughed out of the building. That is, I believe there are strong hints that the political climate in academia and the culture in large companies, has suppressed any attempt to do work of this kind. But that's only my conjecture, so there you have it.

>> It is inaccurate to say that all benchmarks are useless.

Of course. My point is that current benchmarks are useless.

The fact that you find Google Translate useful is unrelated to how well Google Translate scores on benchmarks, which are not designed to measure user satisfaction but instead are supposed to tell us something about the formal properties of the system. In any case, for translation in particular, it's not controversial that there are no good benchmarks and metrics, and many people in NLP will tell you that is the case. In fact, I'm saying this myself because I was told during my Master's, by our NLP tutor who is a researcher in the field. Also see the following article, which includes a discussion of commonly used metrics in machine translation and the difficulty of evaluating machine translation systems:

https://www.skynettoday.com/editorials/state_of_nmt


Yeah, I know about how current automatic MT benchmarks don't reflect "user satisfaction" very accurately, and that it's an open problem to get one that is serviceable. However, you make it sound like all deep learning solutions to tasks perform well on the task benchmark but poorly on the real-world task the benchmarks try to approximate, whereas that's not true for MT - they are bad at the benchmarks, but outperform non-deep-learning-based translation approaches at the real-world task.

On the subject of benchmarks, how about speech transcription? I was under the impression that those benchmarks are pretty reliable indicators of "real-world accuracy" (or about as reliable as benchmarks are in general)


I don't know much about speech transcription, sorry.

How do you mean that machine translation outperforms non deep learning based approaches at the real world task? How is this evaluated?


> How do you mean that machine translation outperforms non deep learning based approaches at the real world task? How is this evaluated?

There were two ways of evaluating performance that I had in mind:

1. Commercial success - what approach do popular sites like Google Translate use?

2. Human evaluation - in a research setting, ask humans to score translations - this is mentioned in your link


OK, thanks for clarifying. The problem with such metrics is that they're not objective results. It doesn't really help us learn much to say that a system outperforms another based on subjective evaluations like that. You might as well try to figure out which is the best team in football by asking the fans of Arsenal and Manchester United.


The subjective human evaluations used in research are blinded - the humans rate the accuracy of translation without knowing what produced the translation (whether a NMT system, a non-ML MT system, or a human translator), whereas the football fans in your scenario are most definitely not blinded. There are some criticisms you could make about human evaluation, but as far as how well they correspond to the real-world task, I think they're pretty much the best we can do. I'm very curious to know if you actually think they're a bad target to optimize for.

More to the point, you still have yet to show that NMT "serves no other purpose than to show how well modern techniques can model large datasets", given that they do well on human evaluations and they're actually producing value by serving actual production traffic (you know, things humans actually want to translate) in Google Translate. If serving production traffic like this is not "serving a purpose", what is?


Sorry for the late reply.

Regarding whether human evaluations are a good target to optimise for, no, I certainly don't think so. That's not very different from calculating BLEU scores, except that instead of comparing a machine-generated translation with one reference text, it's compared with people's subjective criteria, which actually serves to muddy the waters even more - because who knows why different people thought the same translation was good or bad? Are they all using the same criteria? Doubtful! But if they're not, then what have we learned? That a bunch of humans agreed that some translation was good, or bad, each for their own reasons. So what? It doesn't make any difference that the human evaluators are blinded; you can run the same experiment with human translations only and you'll still have not learned anything about the quality of the translation - just the subjective opinions of a particular group of humans about it.

See, the problem is not just with machine translation. Evaluating human translation results is also very hard to do because translation itself is a very poorly characterised task. The question "what is a good translation?" is very difficult to answer. We don't have, say, a science of translation to tell us how a text should be translated between two languages. So in machine translation people try to approximate not only the task of translation, but also its evaluation - without understanding either. That's a very bad spot to be in.

In fact, a "science of translation" would be a useful goal for AI research, but it's the kind of thing that I complain is not done anymore, having been replaced with beating meaningless benchmarks.

Regarding the fact that neural machine translation "generates value", you mean that it's useful because it's deployed in production and people use it? Well, "even a broken clock is right twice a day" so that's really not a good criterion of quality at all. In fact, as a criterion for an AI approach it's very disappointing. Look at the promise of AI: "machines that think like humans!". Look at the reality: "We're generating value!". OK. Or we could make an app that adds bunny ears and cat noses to peoples' selfies (another application of AI). People buy the app- so it's "generating value". Or we can generate value by selling canned sardines. Or selling footballs. Or selling foot massages. Or in a myriad other ways. So why do we need AI? It's just another useless trinket that is sold and bought while remaining completely useless. And that for me is a big shame.


> Look at the promise of AI: "machines that think like humans!". Look at the reality: "We're generating value!". OK.

OK, I hadn't realised we had such different implicit views on what the "goal" of AI / AI research was. Of course, I agree that the goal of "having machines think like humans" is a valid goal, and "generates value by serving production traffic" is not a good subgoal for that. However, this is not the only goal of AI research, nor is it clear to me that, e.g., public funding bodies see it as the only goal.

I use MT (at least) every week for my job and for my hobbies, mostly translating stuff I want to read in another language. I love learning languages but I could not learn all the languages I need to a high enough level to read the stuff I want to read. The old non-NMT approaches produced translations that were often useless, whereas the NMT-based translations I use now (mostly deepl.com), while not perfect, are often quite good, and definitely enough for my needs. Without NMT, realistically speaking, there is no alternative for me (i.e., I can't afford to pay a human translator, and I can't afford to wait until I'd learned the language well enough). So how can you say that AI "remains completely useless"?

Basically, you have implicitly assumed that "make machines that think like humans" is the only valid goal of AI research. And, from that point of view, it is understandable that evaluating NMT systems with human evaluations, as a measure of how well they approach that goal, has many downsides. However, while some people working on NMT do have that goal, many of them also have the goal of "help people (like zodiac) translate stuff", and in the context of that goal, human evaluation is a much better benchmark target.


In general, yes, that's it. But to be honest I'm actually not that interested in making "machines that think like humans". I say that's the "promise of AI" because it was certainly the goal at the beginning of the field, specifically at the Dartmouth workshop where John McCarthy coined the term "Artificial Intelligence" [1]. Researchers in the field have varying degrees of interest in that lofty goal but the public certainly has great expectations, as seen every time OpenAI releases a language model and people start writing or tweeting stuff about how AGI is right around the corner etc.

Personally, I came into AI (I'm a PhD research student) because I got (really, really) interested in logic programming languages and well, to be frank, there's no other place than in academia that I can work on them. On the other hand, my interest in logic programming is very much an interest in ways to make computers not be so infuriatingly dumb as they are right now.

This explains why I dislike neural machine translation and similar statistical NLP approaches: while they can model the structure of language well, they do nothing for the meaning carried by those structures, which they completely ignore by design. My favourite example is treating sentences as a "bag of words", as if order doesn't make a difference - and yet this is a popular technique... because it improves performance on benchmarks (by approximately 1.5 fired linguists).
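
To make the "bag of words" point concrete, here is a toy sketch (my own example, not any particular system's code): once word order is thrown away, two sentences with opposite meanings become indistinguishable.

    # Toy bag-of-words representation: word order is discarded, so these two
    # sentences with different meanings collapse into the same "bag".
    from collections import Counter

    s1 = "the dog bit the man"
    s2 = "the man bit the dog"

    print(Counter(s1.split()) == Counter(s2.split()))  # True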

The same goes for Google Translate. While I'd have to be more stubborn than I am to deny that people use it and like it, I find it depends on the use case - and on the willingness of users to accept its dumbness. For me, it's good where I don't need it and bad where I do. For example, it translates well enough between languages I know and can translate to some extent myself, say English and French. But if I want to translate between a language that is very far from the ones I know - say I want to translate from Hungarian to my native Greek - that's just not going to work, not least because the translation goes through English (because of a dearth of parallel texts, and despite the fact that Google laughably claims its model actually has an automatically learned "interlingua") and the result is mangled twice, so I get gibberish on the other end.

I could talk at length about why and how this happens, but the gist of it is that Google Translate chooses among the many possible translations of an expression by looking at the frequencies of token collocations and nothing else; it stubbornly refuses to use any other information. So for example, if I ask it to translate a single word, "χελιδόνι", meaning the bird swallow, from Greek to French, I get back "avaler", which is the word for the verb to swallow - because translation goes through English, where "swallow" has two meanings and the verb happens to be more common than the bird. The information that "χελιδόνι" is a noun and "avaler" is a verb exists, but Google Translate will just not use it. Why? Well, because the current trend in AI is to learn everything end-to-end from raw data and without prior knowledge. And that's because prior knowledge doesn't help to beat benchmarks, which are not designed to test world-knowledge in the first place. It's a vicious circle.
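
If it helps, here is a toy sketch of the pivot problem - the mini-dictionaries and frequency counts below are invented for illustration, not Google's actual data:

    # Toy model of pivot translation through English. Dictionaries and
    # frequencies are invented for illustration only.
    el_to_en = {"χελιδόνι": "swallow"}                 # Greek noun (the bird)
    en_to_fr = {"swallow": ["avaler", "hirondelle"]}   # verb vs. noun in French
    corpus_frequency = {"avaler": 9000, "hirondelle": 2000}

    def pivot_translate(greek_word):
        english = el_to_en[greek_word]
        # Choose purely by frequency; the noun/verb information present in
        # the source language is never consulted.
        return max(en_to_fr[english], key=corpus_frequency.get)

    print(pivot_translate("χελιδόνι"))  # "avaler" - the wrong sense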

So, yes, to some extent it's what you say- I don't quite expect "machines that think like humans", but I do want machines that can interact with a human user in a slightly more intelligent manner than now. I gave the example of SHRDLU above because it was such a system. I'm sad the effort to reproduce such results, even in a very limited domain, has been abandoned.

P.S. Sorry, this got a bit too long, especially for the late stages of an HN conversation :)

___________

[1] That was in 1956: https://en.wikipedia.org/wiki/Dartmouth_workshop


I hear you about those language pairs that have to be round-tripped through a third language. I completely agree, too, that the big open questions in NLP are all about understanding meaning, semantic content, pragmatics, etc rather than just syntax.

I don't think that "NMT and similar techniques" ignore meaning by design though. What they do do by design, compared to expert systems etc, is avoid having explicitly encoded knowledge (of the kind SHRDLU had). Take word2vec for instance, it's not NMT but fits into the "statistical NLP" description - its purpose is to find encodings of words that carry some semantic content. Now, of course it's very little semantic content compared to what an expert could plausibly encode, but it is some semantic content, and this content improves the (subjective human) evaluation of NMT systems that do use word2vec or something similar.

Also, we should carefully distinguish "prior knowledge" as in "prior common-sense knowledge" and "prior linguistic knowledge". The end-to-end trend eschews "prior linguistic knowledge", while current NLP systems tend to lack "common-sense" knowledge, for rather different reasons.

End-to-end training tends to eschew prior linguistic knowledge because it improves (subjectively evaluated) performance on real-world tasks - I believe this is true for MT as well, but an easier example, if you want to look into it, is audio transcription. I don't think there's a consensus about why this happens, but I think it is something like: the way people were previously encoding linguistic knowledge was too fragile / simplified (think about how complicated traditional linguistic grammars are), and if that information can somehow be learned in the end-to-end process, that performs better.

Lacking "common-sense" knowledge - that's more in the realm of AGI, so there's a valid debate about to what extent neural networks can learn such knowledge, but the other side of that debate is that expressing common-sense knowledge in today's formal systems is really hard and expensive, and AIUI this is also something that attempts to generalize SHRDLU run into. But it is definitely incorrect to say that it's ignored by anyone by design...

BTW, the biggest improvements (as subjectively evaluated by me) I've seen in MT on "dissimilar languages" have come from black box neural nets and throwing massive amounts of (monolingual or bilingual) data at it, rather than anything from formal systems. I use deepl.com for Japanese-English translation of some technical CS material, and that language pair used to be really horrible in the pre-deep-learning days (and it's still not that good on google translate for some reason).


Sorry for the late reply again.

I agree about word2vec and embeddings in general- they're meant to represent meaning or capture something of it anyway. I'm just not convinced that they work that well in that respect. Maybe I can say how king and queen are analogous to man and woman etc, but that doesn't help me if I don't know what king, queen, man or woman mean. I don't think it's possible to represent the meaning of words by looking at their collocation with other words- whose meaning is also supposedly represented by their collocation with other words etc.
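
For what it's worth, the analogy arithmetic itself is easy to reproduce with a pretrained vector set, e.g. with gensim (a rough sketch; the model name is just one of gensim's downloadable sets, and results vary with the vectors used):

    # Rough sketch of the king/queen analogy with pretrained word vectors,
    # via gensim's downloader (downloads the vectors on first run).
    import gensim.downloader as api

    vectors = api.load("glove-wiki-gigaword-50")

    # vector("king") - vector("man") + vector("woman") lands close to "queen"
    print(vectors.most_similar(positive=["king", "woman"],
                               negative=["man"], topn=3))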

I confess I haven't used any machine translation systems other than google translate. For instance, I've never used deepl.com. I'll give it a try since you recommend it although my use case would be to translate technical terms that I only know in English to my native Greek and I don't think anything can handle that use case very well at all. Not even humans!

Out of curiosity, you say neural machine translation is better than earlier techniques, which I think is not controversial. But have you tried such earlier systems? I've never had the chance.


Good question!

As you say SHRDLU was the best program of that time.

I've read Winograd's book and looked at the SHRDLU source code (not that I'm much of a Lisp hacker, but I had some time). It's built on a parser and a planner (a logic program, pre-Prolog). And it's built the old-fashioned way, rewriting the source code, with the parser rewriting input and then re-running, and other hairy things. I think this achieves the parsing of idiomatic constructions at a high level. I believe the "raw Lisp" of the day was both incredibly powerful, since you could do anything, and incredibly hard to scale ... you could do anything.

Winograd wrote it himself, but I think that's because he had to write it himself. In a sense, a programmer is always most productive when they are writing something by themselves, because they don't have to explain anything they are doing (until the fix-ups and complexity overwhelm them, but the base fact remains). And in the case of SHRDLU, Winograd would have had an especially hard time explaining what he was doing. I mean, there was a theory behind it - I've read Winograd's book. But there was lots and lots of brilliant spaghetti/glue code to actually make it work, code that jumped module and function boundaries. And the final program had a reputation of being very buggy: sometimes it worked and sometimes it didn't. And Winograd was a brilliant programmer and widely read in linguistics and other fields.

The software industry is an industry. No company wants to depend on the brilliance of its workers. A company needs to produce based on some sort of average, and a person working with just average skills isn't going to do SHRDLU.

So, yeah, I think that's why actual commercial programs never reached the level of SHRDLU.


So what you're saying is that we really shouldn't be relying on industry for good AI?


Well, an "industrial model" is a model of a factory and it seems unlikely you could product a fully-functional, from scratch GFAI program like SHRDLU in something like a factory.

Perhaps one could create a different kind of enterprise for this but it's kind of an open problem.


Winograd made changes to the Lisp assembly to make SHRDLU work and he never back-ported them to his SHRDLU code, but his original version worked fine and was stable. The experience of breaking refers to later versions that were expanded by his doctoral students and others and to ports to java and I think C. The original code was written in 1969 but inevitably suffered from bit rot in the intervening years so it's true that there is no stable version today that can reliably do what Winograd's original code could do... but Winograd's original code was rock solid, according to the people who saw it working.

There's some information about all that here:

http://maf.directory/misc/shrdlu.html

[Dave McDonald] (davidmcdonald@alum.mit.edu) was Terry Winograd's first research student at MIT. Dave reports rewriting "a lot" of SHRDLU ("a combination of clean up and a couple of new ideas") along with Andee Rubin, Stu Card, and Jeff Hill. Some of Dave's interesting recollections are: "In the rush to get [SHRDLU] ready for his thesis defense [Terry] made some direct patches to the Lisp assembly code and never back propagated them to his Lisp source... We kept around the very program image that Terry constructed and used it whenever we could. As an image, [SHRDLU] couldn't keep up with the periodic changes to the ITS, and gradually more and more bit rot set in. One of the last times we used it we only got it to display a couple of lines. In the early days... that original image ran like a top and never broke. Our rewrite was equally so... The version we assembled circa 1972/1973 was utterly robust... Certainly a couple of dozen [copies of SHRDLU were distributed]. Somewhere in my basement is a file with all the request letters... I've got hard copy of all of the original that was Lisp source and of all our rewrites... SHRDLU was a special program. Even today its parser would be competitive as an architecture. For a recursive descent algorithm it had some clever means of jumping to anticipated alternative analyses rather than doing a standard backup. It defined the whole notion of procedural semantics (though Bill Woods tends to get the credit), and its grammar was the first instance of Systemic Functional Linguistics applied to language understanding and quite well done." Dave believes the hardest part of getting a complete SHRDLU to run again will be to fix the code in MicroPlanner since "the original MicroPlanner could not be maintained because it had hardwired some direct pointers into the state of ITS (as actual numbers!) and these 'magic numbers' were impossible to recreate circa 1977 when we approached Gerry Sussman about rewriting MicroPlanner in Conniver."

Regarding the advantage of a lone programmer- that's real, but large teams have built successful software projects before, very often. I guess you don't even need a big team, just a dozen people who all know what they're doing. That shouldn't be hard to put together given FANG-level resources. Hell, that shouldn't be hard to do given a pool of doctoral students from a top university... but nowadays even AI PhD students would have no idea how to recreate something like SHRDLU.

Edit: I got interested in SHRDLU recently (hence the comments in this thread) and I had a look at Winograd's thesis to see if there was any chance of recreating it. The article above includes a link to a bunch of flowcharts of SHRDLU's CFG, but even deciphering those hand-drawn and occasionally vague plans would take a month or two of doing nothing else, something for which I absolutely do not have the time. And that's only the grammar - the rest of the program would have to be reverse-engineered from Winograd's thesis, examples of output from the original code or later clones, etc. That's a project for a team of digital archaeologists, not software developers.


"... his original version worked fine and was stable. The experience of breaking refers to later versions that were expanded by his doctoral students and others and to ports to java and I think C."

I can believe this but I think your details overall reinforce my points above.

For a recursive descent algorithm it had some clever means of jumping to anticipated alternative analyses rather than doing a standard backup.

Yeah, fabulous but extremely hard to extend or reproduce. The aim of companies was to scale something like this. It seems like the fundamental problem was that only a few really smart people could program to this level and no one could take them beyond it (this is where the old saw comes in, that a person has to be twice as smart to debug a program as to write it, etc).


My impression is that the systems never progressed much after SHRDLU even though there were attempts at larger scale "expert systems". But adding more advanced rules and patterns proved extremely difficult and did not always have the expected effect of making the systems more general.

There was the whole AI winter thing, of course, but that was as much a result of things not living up to the hype as a cause.


This doesn't directly address your question, though it perhaps can give you some pointers if you want to read about the history of AI and the AI winter of the '80s, but in a way SHRDLU featured prominently in the AI winter, at least in Europe, particularly in the UK.

So, in the UK at least, the AI winter was precipitated by the Lighthill Report, a report published in 1973, compiled by Sir James Lighthill and commissioned by the British Research Council, i.e. the people who held all the research money at the time in the UK. The report was furiously damning of AI research of the time, mostly because of grave misunderstandings, e.g. with respect to combinatorial explosion, and basically accused researchers of, well, faffing about and not doing anything useful with their grant money. The only exception to this was SHRDLU, which Lighthill praised as an example of how AI should be done.

Anyway, if you have time, you can watch the televised debate between Lighthill and three luminaries of AI: John McCarthy (the man who named the field, created Lisp and did a few other notable things), Donald Michie (known for his MENACE reinforcement-learning program running on... matchboxes, and for basically setting up AI research in the UK) and Richard Gregory (a cognitive scientist about whom I confess I don't know much). The (short) Wikipedia article on the Lighthill Report has links to all the YouTube videos:

https://en.wikipedia.org/wiki/Lighthill_report

It's interesting to see in the videos the demonstration of the Freddy robot from Edinburgh, which was capable of constructing objects by detecting their components with early machine vision techniques. In the 1960s. Incidentally:

Even with today's knowledge, methodology, software tools, and so on, getting a robot to do this kind of thing would be a fairly complex and ambitious project.

http://www.aiai.ed.ac.uk/project/freddy/

The above was written sometime in the '90s, I reckon, but it is still true today. Unfortunately, Lighthill's report killed the budding robotics research sector in the UK and it has literally never recovered since. This is typical of the AI winter of the '80s. Promising avenues of research were abandoned not because of any scientific reasons, as is sometimes assumed ("expert systems didn't scale" etc.) but, rather, because the pencil pushers in charge of disbursing public money didn't get the science.

Edit: A couple more pointers. John McCarthy's review of the Lighthill Report:

http://www-formal.stanford.edu/jmc/reviews/lighthill/lighthi...

An article on the AI winter of the '80s by the editor of IEEE Intelligent Systems:

https://www.computer.org/csdl/magazine/ex/2008/02/mex2008020...


Interesting, thank you for the clarifications.


The modern natural language interfaces with limited domains are Alexa and Siri. Yes, they’re limited. But they are far more impressive and useful than SHRDLU.


Alexa and Siri (and friends) are completely incapable of interacting with a user with the precision of SHRDLU. You can ask them to retrieve information from a Google search, but e.g. they have no memory of the anaphora in earlier sentences of the same conversation. If you say "it" a few times to refer to different entities, they completely lose the plot.

They are also completely incapable of reasoning about their environment, not least because they don't have any concept of an "environment" - which was represented by the planner and the PROGRAMMAR language in SHRDLU.

And of course, systems like Siri and Alexa can't do anything even remotely like correctly disambiguating the "support support supports" show-off sentence in the excerpt above. Not even close.

Edit: Sorry, there's a misunderstanding about "limited domain" in your comment. Alexa and Siri don't operate in a limited domain. A "limited domain" would be something like being in charge of your music collection and nothing else. Alexa and Siri etc. are supposed to be general-use agents. I mean, they are, it's just that they suck at it... and they would still suck in a limited domain.


It’s not meaningful to compare SHRDLU with today’s verbal search interfaces. The world SHRDLU manipulated were only stackable blocks and the only relations it knew were ‘above’ and ‘below’. The entire scope of its endeavor was describing the state of block stacks and basic ways to reorder stacks to satisfy the single relation of above-ness and below-ness.

Time and causality and even basic probability were all absent from SHRDLU’s model. Not surprisingly the work was a dead end that even Winograd was quick to abandon, as he subsequently exited the field of experimental AI for the more conceptual models of cognitive science and HCI.


>> The world SHRDLU manipulated were only stackable blocks and the only relations it knew were ‘above’ and ‘below’.

If that were true then SHRDLU would have operated in a one-dimensional world. Instead, it moved blocks around a 3-dimensional world. It could understand more spatial relations than "above" and "below", as in the example I quote above where it is asked "Is there a large block behind a pyramid?". It stacked blocks, but in doing so it also had to move others out of the way, etc. That is no big mystery; like I say, SHRDLU used a planner, and even in the 1960s there were planners capable of solving block-stacking problems in 3D environments.
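
For what it's worth, the kind of search such a planner does is easy to sketch. The toy code below is my own illustration of "clear off X, then stack Y on X", and has nothing to do with SHRDLU's actual MicroPlanner code:

    # Toy blocks-world planner (breadth-first search over states). Just an
    # illustration of the kind of search behind "clear off X, then stack Y
    # on X"; not SHRDLU's MicroPlanner.
    from collections import deque

    def clear_blocks(state):
        """Blocks with nothing on top of them. state: dict block -> support."""
        supports = set(state.values())
        return [b for b in state if b not in supports]

    def legal_moves(state):
        """Move a clear block onto the table or onto another clear block."""
        clear = clear_blocks(state)
        for b in clear:
            for target in ["table"] + [c for c in clear if c != b]:
                if state[b] != target:
                    new_state = dict(state)
                    new_state[b] = target
                    yield (b, target), new_state

    def plan(start, goal):
        """Breadth-first search for a sequence of moves from start to goal."""
        key = lambda s: frozenset(s.items())
        queue = deque([(start, [])])
        seen = {key(start)}
        while queue:
            state, path = queue.popleft()
            if state == goal:
                return path
            for move, nxt in legal_moves(state):
                if key(nxt) not in seen:
                    seen.add(key(nxt))
                    queue.append((nxt, path + [move]))

    # "Put the red cube on the green cube" while a pyramid sits on the green cube:
    start = {"red_cube": "table", "green_cube": "table", "pyramid": "green_cube"}
    goal = {"red_cube": "green_cube", "green_cube": "table", "pyramid": "table"}
    print(plan(start, goal))
    # -> [('pyramid', 'table'), ('red_cube', 'green_cube')]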

Time and causality were certainly not "absent" from SHRDLU! Like I say, it had a memory that allowed it to keep track of what it had done previously and it could explain the reasoning behind its actions, like in this excerpt from the demonstration quoted in wikipedia:

    Person: Had you touched any pyramid before you put the green one on the little cube?
    Computer: YES, THE GREEN ONE.
    Person: When did you pick it up?
    Computer: WHILE I WAS STACKING UP THE RED CUBE, A LARGE RED BLOCK AND A LARGE GREEN CUBE.
    Person: Why?
    Computer: TO GET RID OF IT.
    Person: Why did you do that?
    Computer: TO CLEAN OFF THE RED CUBE.
    Person: Why did you clear off that cube?
    Computer: TO PUT IT ON A LARGE GREEN CUBE.
    Person: Why did you do that?
    Computer: TO STACK UP THE RED CUBE , A LARGE RED BLOCK AND A LARGE GREEN CUBE.
    Person: Why did you do that?
    Computer: BECAUSE YOU ASKED ME TO.
    Person: How did you do it?
    Computer: BY PUTTING A LARGE RED BLOCK ON THE TABLE ; THEN LETTING GO OF IT ; THEN PUTTING A LARGE GREEN CUBE ON IT ; THEN LETTING GO OF THAT CUBE ; THEN PUTTING THE RED CUBE ON THAT CUBE; THEN LETTING GO OF THAT CUBE.
This is a question about time and causality: the user asks why the program performed a set of operations at an earlier time.

>> Not surprisingly the work was a dead end that even Winograd was quick to abandon, as he subsequently exited the field of experimental AI for the more conceptual models of cognitive science and HCI.

Regarding Winograd's subsequent work direction, this is what he had to say about it himself:

How would you say SHRDLU influenced your subsequent work and/or philosophy in AI?

Having insight into the limitations I encountered in trying to extend SHRDLU beyond micro-worlds was the key opening to the philosophical views that I developed in the work with Flores. The closest thing I have online is the paper Thinking machines: Can there be? Are we?

How would you characterize AI since SHRDLU? Why do you think no one took SHRDLU or SHRDLU-like applications to the next level?

There are fundamental gulfs between the way that SHRDLU and its kin operate, and whatever it is that goes on in our brains. I don't think that current research has made much progress in crossing that gulf, and the relevant science may take decades or more to get to the point where the initial ambitions become realistic. In the meantime AI took on much more doable goals of working in less ambitious niches, or accepting less-than-human results (as in translation).

What future do you see for natural language computing and/or general AI?

Continued progress in limited domain and approximate approaches (including with speech). Very long term research is needed to get a handle on human-level natural language.

http://maf.directory/misc/shrdlu.html

My reading of this is he realised that natural language understanding is not an easy thing. I don't disagree one bit and I don't think for a moment that SHRDLU could "understand" anything at all. But it was certainly capable of much more intelligent-looking behaviour than modern statistical machine-learning based systems. Winograd's reply above says that it's hard to extend SHRDLU outside of its limited domain, but my point is that a program that can operate this well in a limited domain is still useful and much more useful than a program that can operate in arbitrary domains but is dumb as bricks, like modern conversational agents that have nothing like a context of their world that they can refer to, to choose appropriate contributions to a conversation. He also hints to the shift of AI research targets from figuring out how natural language works to "less ambitious niches" and "less than human results", which I also point out in my comments in this thread. This is condensed - I'm happy to elaborate if you wish.

I have to say I was very surprised by your comment, particularly the certainty with which you asserted SHRDLU's limitations. May I ask- where did you come across information that SHRDLU could only understand "above and below" relations and that time and causality were absent from its "model"?


The thing that qualifies as "faffing", in my opinion, isn't the statistical NLP programs, which "are what they are", but claims of progress based primarily on benchmarks, as YeGoblynQueene rightly states.

And "limited domain" is relative. A program that gets many aspects of language right talking about a small world might be said to have a larger domain than a program that outputs stream of semi-plausible, semi-gibberish involving the whole of English language. Which again isn't saying modern NLP is nothing but rather we should a somewhat better to talk of it (and machine learning generally) than "hitting benchmarks".


I read grandparent's post as saying that despite all the research and untold amounts of compute power poured into NLP over the decades, its practitioners have yet to address the original real-world goals that led us to study it in the first place. Missing the forest for the trees, if you will.

(I don't assert that it has or hasn't. I know just enough about the topic to see how little I know. But it seems, from the outside, a valid criticism, and not one unique to the field.)


Since this post is receiving downvotes, I'd like to know which part of the HN guidelines it is violating.


Why is it surprising that a CFG can approximate a subset of English grammar?


> Why is it surprising that a CFG can approximate a subset of English grammar?

"Colorless green ideas sleep furiously" is a famous example of a sentence that is grammatical, but meaningless. The goal of SHRDLU was far more ambitious than approximating English grammar.


>> The Winograd schema test was originally intended to be a more rigorous replacement for the Turing test, because it seems to require deep knowledge of how things fit together in the world, and the ability to reason about that knowledge in a linguistic context. Recent advances in NLP have allowed computers to achieve near human scores:(https://gluebenchmark.com/leaderboard/).

The "Winograd schema" in Glue/SuperGlue refers to the Winograd-NLI benchmark which is simplified with respect to the original Winograd Schema Challenge [1], on which the state-of-the-art still significantly lags human performance:

The Winograd Schema Challenge is a dataset for common sense reasoning. It employs Winograd Schema questions that require the resolution of anaphora: the system must identify the antecedent of an ambiguous pronoun in a statement. Models are evaluated based on accuracy.

WNLI is a relaxation of the Winograd Schema Challenge proposed as part of the GLUE benchmark and a conversion to the natural language inference (NLI) format. The task is to predict if the sentence with the pronoun substituted is entailed by the original sentence. While the training set is balanced between two classes (entailment and not entailment), the test set is imbalanced between them (35% entailment, 65% not entailment). The majority baseline is thus 65%, while for the Winograd Schema Challenge it is 50% (Liu et al., 2017). The latter is more challenging.

https://nlpprogress.com/english/common_sense.html

There is also a more recent adversarial version of the Winograd Schema Challenge called WinoGrande. I can't say I'm on top of the various results and so I don't know the state of the art, but it's not yet "near human", not without caveats (for example, Wikipedia reports 70% accuracy on 70 problems manually selected from the original WSC).
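
For concreteness, here is the classic schema from Levesque's paper and a rough sketch of how the WNLI-style recasting turns it into an entailment decision (my paraphrase of the format, not the exact GLUE data):

    # The canonical Winograd schema and a rough sketch of its WNLI-style
    # recasting as entailment. This paraphrases the format; it is not the
    # actual GLUE/WNLI data.
    schema = {
        "sentence": "The trophy doesn't fit in the suitcase because it is too big.",
        "question": "What is too big?",
        "answers": ["the trophy", "the suitcase"],  # change "big" to "small" and the answer flips
    }

    wnli_style = {
        "premise": "The trophy doesn't fit in the suitcase because it is too big.",
        "hypothesis": "The trophy is too big.",     # pronoun replaced by a candidate referent
        "label": "entailment",                      # model predicts entailment / not entailment
    }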

__________

[1] https://www.aaai.org/ocs/index.php/KR/KR12/paper/view/4492


I know that there are allegedly NLP algorithms for generating things like articles about sports games. I assume they have something more like the type signature (timeline of events) -> (narrative about said events)

What this article is about is more (question/prompt) -> (answer/continuation of prompt)

Does anyone know if there is progress in the (timeline of events) -> (narrative about said events) space?


For an intermediate goal on the way to sports games, the financial press version of (timeline of events) -> (narrative about said events) could be tackled as a memoryless system.


> A lot of the power of the thought experiment hinges on the fact that the room solves questions using a lookup table, this stacks the deck. Perhaps we be more willing to say that the room as a whole understood language if it formed an (implicit) model of how things are, and of the current context, and used those models to answer questions.

Some define intelligence (entirely separately from consciousness) precisely as the ability to develop an internal model. Coupled to a regulatory feedback the system can then modify itself in response to some set of internal and/or external conditions (Joscha Bach for instance suggests consciousness is a consequence of extremely complex self-models)


> In my head- and maybe this was naive- I had thought that, in order to attempt these sorts of tasks with any facility, it wouldn’t be sufficient to simply feed a computer lots of text.

(Tasks here referring to questions in the New York Regent’s science exam)

Same for me.

But it makes sense of course that learning from text only is entirely possible. I certainly have not directly observed the answer to eg. 'Which process in an apple tree primarily results from cell division? (1) growth (2) photosynthesis (3) gas exchange (4) waste removal', I have been taught, from text books, what the answer should be.

I do have a much better grounding of what growth is, what apples and apple trees are though.


A bit I found rather strange, on the language-side:

> This is to say the patterns in language use mirror the patterns of how things are(1).

> (1)- Strictly of course only the patterns in true sentences mirror, or are isomorphic to, the arrangement of the world, but most sentences people utter are at least approximately true.

Presumably this should really say something like "...but most sentences people utter are at least approximately true of their mental representation of the world."


NLP is great for many things but, from my own experience as an NLP developer, machines are not even close to understanding human language. They can interpret some kinds of written text well, but they will struggle to grasp two humans speaking to each other. The progress we are making on building chatbots and voice assistants is mainly due to the fact that we are learning how to speak to the machines, not the contrary.


I find it a little bit strange that there is an unspoken assumption in almost all natural language processing: that speech and text are perfectly equivalent.

All of the examples in the article work on English text, not spoken English. I would consider spoken English to be a much better "Gold standard" of natural language.

I'm really looking forward to machine translation operating purely on a speech in/speech out basis, instead of converting to text as an intermediate step.


the thing is humans have most efficiently encoded (in detail) reality in text. humans already highlight what is worth encoding about reality.

for example, you can finetune gpt-2 to have an idea of sexual biology by having it read erotica. just like how you can have a model learn the same by watching porn. but it is much more efficient to read the text, since there is much less information that is "useless"


Note this is pre-GPT-3. In fact I expect GPT-4 will be where interesting things start happening in NLP.


I honestly don't get where the big deal is with NLP. So far the most useful application has been customer support chatbots and those still don't rise to the level of having an actual human that can understand the intricacies of your special request.


I work on such a bot @Intercom.

A lot of support requests aren't actually intricate and special: there's almost always loads of simple requests that come in again and again.

When you ask a request like that and get the answer instantly, that really is ML delivering value.

You mightn't think it, but a lot of people work in customer support, and spend a lot of time answering rote questions again and again.

People talk about how much hype there is, or the chance of another AI winter. Yes there's a ton of hype - but I think they aren't considering the real value already being delivered this time around.

Everyone is excited about GPT-3, but there has already been amazing progress in practical NLP over the last few years.


This past week, I used this new edit distance library to identify quasi-duplicates in a large dataset:

https://github.com/Bergvca/string_grouper

Saved me hours of work because all the Levenshtein implementations are pretty slow and I’m going to need to rerun the analysis as the dataset grows.

I don’t know about consumer-facing tools, but NLP stuff has helped me solve all kinds of tedious data problems at work.
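
For anyone curious, the trick behind this kind of fast quasi-duplicate detection is roughly "character n-gram TF-IDF plus cosine similarity" rather than pairwise Levenshtein. A toy sketch of the idea with scikit-learn (the general approach, not string_grouper's actual API):

    # Rough sketch of fast quasi-duplicate detection: vectorise strings as
    # character n-grams and compare with cosine similarity, instead of
    # computing pairwise Levenshtein distances. Toy example only.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    names = ["Acme Corp.", "ACME Corporation", "Acme Corp", "Globex Inc."]

    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
    matrix = vectorizer.fit_transform(names)

    # Pairs with high cosine similarity are candidate quasi-duplicates.
    print(cosine_similarity(matrix).round(2))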


Current NLP is bad. Still useful (Google search increasingly feels like it is doing NLP to change what I asked for into what it thinks I meant) but bad. A hypothetical future “perfect” NLP can demonstrate any skill that a human could learn by reading, and computers can read so much more than any given human.


Is reading enough to understand the real world without direct experience of the real world? Is there any research that tries to answer this question?


Is there any research that tries to answer this question?

That's the whole point of the experiment called GPT-3.


As of about ten years ago when I received my degree in linguistics, I understood there were two schools on this issue:

A: Of course not, let's do something else.
B: What? You put text in the maths and stuff comes out.


That's just not even close to the only or most useful application.

I use NLP and associated s2s techniques every day. I struggle to see how so many people don't see the obvious benefits deep inference is bringing to stuff all around them.


Could be a more convincing statement with examples...


Speech recognition on any device? Translation?


those still don't rise to the level of having an actual human

What if they did? Do you see where the big deal is now?


If a chatbot was as good as a human then would you notice it was a chatbot?


Rather than a generator, I could use a good verifier, i.e., an accurate grammar checker


Has it ever happened that a "thought experiment" has become a real experiment?


Most historians think that this was actually a thought experiment:

https://en.wikipedia.org/wiki/Galileo's_Leaning_Tower_of_Pis...

An equivalent experiment was famously carried out for real in 1971 on the surface of the Moon.


Not sure exactly what you mean, but Einstein developed relativity largely through thought experiments. And relativity has been verified by real experiments.


In a sense every experiment starts as a thought experiment.


I'd go one step further: Humans themselves don't understand anything, we are just good at constructing logical-sounding (plausible, testable) stories about things. These are mental models, and it's the only way we can make reasonable predictions to within error tolerances of our day-to-day experience, but they are flat-out lies and stories we tell ourselves not based on a high-fidelity understanding of anything.

Rumination, deep thinking, etc is simply actor-critic learning of these mental models for story-telling.


Mental models are not lies.

"The car is red" is not a lie just because I didn't phrase it internally as "The car reflects photons of a frequency of around 700nm".

We have to be able to simplify and internalize simplified models in order to make any sense of anything. It's the same reason your eyes only have a focal point in the dead center: attention requires vast amounts of processing power.

To reiterate, a simplification is not a lie. Especially not a flat-out lie.


Yes. Bridges are designed using old-school Newtonian mechanics, without (in general) considering relativistic or quantum effects at all, but they stand up nonetheless.

That wouldn't be the case if they were built based on "flat-out lies".

Edit: the sticking point here is the use of the word "lie", I think.

A lie is a falsehood told with intent to deceive.

Thus, novels are not "lies", despite being works of the imagination. Everyone knows that the people in Moby-Dick or Wuthering Heights were made up.

Neither is a simplified model a "lie" if it is close enough.

All engineers know that the steel and concrete components of a bridge have quantum and relativistic effects going on inside them, so ignoring those effects isn't a "lie" in any meaningful sense.

It just doesn't matter for the purpose at hand.


This is sort of a digression, sort of a return to the original topic. As I was reading the sentence about Moby-Dick, I wondered if perhaps someday this comment section would become part of a training corpus for GPT-9001 or whatever. Then a researcher might read it a paragraph from Moby-Dick and ask it whether those events actually happened or whether they were fictional. And it might answer "Of course they were made up. Everyone knows that the people in Moby-Dick or Wuthering Heights were made up."

And if the researcher asked it to explain how it could know anything about what is fiction and what is not, it might be able to use the contents of these comments to figure out how to converse intelligently about its own limitations.


Do current NLP systems understand arithmetic, and can they do it with unfamiliar numbers they've never seen? If not, I'd think that your theory is demonstrably false, as a child can extrapolate mathematical axioms from just a few example problems, whereas NLP models are not able to do so.


>Do current NLP systems understand arithmetic, and can they do it with unfamiliar numbers they've never seen?

I don't know if that question is rhetorical or not, but GPT-3 can do basic math for problems it has not been directly trained on, and there's been a fair amount of debate, including right here at hn, about what the takeaway is supposed to be.


GPT-3 can't do arithmetic very well at all. There is a big, fat, extraordinary claim that it can in the GPT-3 paper, but it's only based on perfect accuracy on two-digit addition and subtraction, ~90% accuracy on three-digit addition and subtraction, and... around 20% accuracy on addition and subtraction with four or five digits, and on multiplication between two measly digits. Note: no division at all, and no arithmetic with more than five digits. And very poor testing to ensure that the solved problems don't just happen to be in the model's training dataset to begin with, which is the simplest explanation of the reported results, given that the arithmetic problems GPT-3 solves correctly are the ones that are most likely to be found in a corpus of natural language (i.e. two- and three-digit addition and subtraction).

tl;dr, GPT-3 can't do basic math for problems it has not been directly trained on.

____________

[1] https://arxiv.org/abs/2005.14165

See section 3.9.1 and Figure 3.10. There is an additional category of problems of combinations of addition, subtraction and multiplication between three single-digit numbers. Performance is poor.


This is incorrect. The paper shows that it can do basic math for problems it has not been directly trained on. The only way to dispute that is to twist that very simple claim into something it's not, as you do here by talking about all the extra math that it can't do. If I say I can cut carrots, it doesn't mean I'm claiming to be a world class chef. I am disappointed that a claim as simple as this can be completely misunderstood.


How ironic you claim that the paper overstates it, when you very carefully leave out every single qualifier about BPEs and how GPT-3's arithmetic improves massively when numbers are reformatted to avoid BPE problems. Pot, kettle.


“Byte pair encoding”: more discussion at https://www.gwern.net/GPT-3#bpes
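
For anyone who hasn't run into the BPE issue: the tokenizer doesn't see digits uniformly, so multi-digit numbers get chopped into uneven chunks. A quick way to look at this is with the GPT-2 tokenizer from the transformers library (GPT-3 reuses essentially the same BPE vocabulary):

    # Peek at how the BPE tokenizer splits numbers. Multi-digit numbers are
    # often broken into uneven chunks rather than single digits, which is the
    # "BPE problem" for arithmetic discussed above.
    from transformers import GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    for text in ["17 + 25 =", "2354 + 1234 =", "2,354 + 1,234 ="]:
        print(text, "->", tok.tokenize(text))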


>but it's only based on perfect accuracy on two-digit addition and subtraction, ~90% accuracy on three digit addition and subtraction and ...

Which grade of school does that correspond to? It would have been astonishing to have an AI system of this capability level a mere few years ago.


Performing arithmetic per se is not impressive. A calculator can do it, and the rules of arithmetic are not so complex that they can't be hand-coded, as they are, routinely. The extraordinary claim in the GPT-3 paper is that a language model is capable of performing arithmetic operations, rather than simply memorising their results [1]. Language models compute the probability of a token following a sequence of tokens, and in particular have no known ability to perform any arithmetic, so if GPT-3, which is a language model, were capable of doing something that it is not designed to do, then that would be very interesting indeed. Unfortunately, such an extraordinary claim is backed up with very poor evidence, and so amounts to nothing more than invoking magick.
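
To be concrete about what "computing the probability of a token following a sequence" means, here is a minimal sketch using the small public GPT-2 model as a stand-in (GPT-3's weights aren't available, so this is purely illustrative):

    # Minimal sketch of what a language model computes: a probability
    # distribution over the next token, given a prefix. Uses GPT-2 small as
    # a stand-in for GPT-3, whose weights are not public.
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    inputs = tok("Twenty one plus twenty one is", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits              # (1, seq_len, vocab_size)

    probs = torch.softmax(logits[0, -1], dim=-1)     # distribution for the next token
    top = torch.topk(probs, 5)
    for p, idx in zip(top.values, top.indices):
        print(repr(tok.decode([int(idx)])), round(p.item(), 3))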

__________

[1] From the paper linked above:

In addition, inspection of incorrect answers reveals that the model often makes mistakes such as not carrying a “1”, suggesting it is actually attempting to perform the relevant computation rather than memorizing a table.

Now that I re-read this I'm struck by the extent to which the authors are willing to pull their results this way and that to force their preferred interpretation on them. Their model answers two-digit addition problems correctly? It's learned addition! Their model is making mistakes? It's because it's actually trying to compute the answer and failing! The much simpler explanation, that their model has memorised solutions to a few problems but there are many more it hasn't even seen, seems to be assigned a very, very low prior. Who cares about such a boring interpretation? Language models are few-shot learners! (once trained with a few billion examples, that is).


The funny thing with GPT-3 is that it comes up with sentences that seem to make correct deductions using arithmetic or other submodels of reality but it will blithely generate further sentences that contradict these apparent understandings. It's impressive and incoherent all at the same time.


I question the premise here. Can young children (say <5) who don't know arithmetic at all learn the axioms from just a bunch of examples? This isn't how children are taught arithmetic; they are taught the rules/algorithm not just a bunch of data-points. Do you have a source on the claim?


I get what you're saying, but I think this is a side-effect of saving on thinking by working in abstractions. We use abstractions to plug holes, to avoid having to drill down on every concept we're exposed to.

That can look like a surface-level linguistic understanding, but it's not, it's a surface-level abstraction. It's not arbitrary, it has structure, and when you flesh it out you're fleshing it out with actual abstract structure, not just painting over the gaps with arbitrary language.


But you claim to understand right in this comment how humans understand other things.

Isn't that self-contradictory?


He's just not very good at constructing logical-sounding (plausible, testable) stories about things :)


no he did not claim so. His claim is just as much a lie as all other claims. Lies can be valuable, even if they are not strictly true.


What about your comment? Is that a lie too? :)


of course, yes. Not in the sense that it's 0 percent true, but rather in the sense that it's not 100 percent true.



