
It's even worse. AI is a really smart but inexperienced person who also lies frequently. Because AI is not accountable to anything, it'll always come up with a reasonable-sounding answer to any question, whether it is correct or not.



To put it another way: it is not clear when and how they hallucinate. With a person, you can come to understand their competence and its limits. But an LLM can happily give different answers based on trivial changes in the question, with no warning.


In a conversation (conversation and attached pictures at https://bsky.app/profile/liotier.bsky.social/post/3ldxvutf76...), I deleted a spurious "de" ("Produce de two-dimensional chart [..]" to "Produce two-dimensional [..]") and ChatGPT generated a new version of the graph, illustrating a different function, although nothing else had changed and the whole preceding conversation suggested that ChatGPT held a firm model of the problem. This confirmed my current doctrine: use the LLM to surface concepts from a huge messy corpus, then check those against sources from said corpus.


LLMs are non-deterministic: they'll happily give different answers to the same prompt based on nothing at all. This is actually great if you want to use them for "creative" content generation tasks, which is IMHO what they're best at. (Along with processing of natural language input.)

Expecting them to do non-trivial amounts of technical or mathematical reasoning, or even something as simple as code generation (other than "translate these complex natural-language requirements into a first sketch of viable computer code") is a total dead end; these will always be language systems first and foremost.


This confuses me. You have your model, you have your tokens.

If the tokens are bit-for-bit-identical, where does the non-determinism come in?

If the tokens are only roughly-the-same-thing-to-a-human, sure I guess, but convergence on roughly the same output for roughly the same input should be inherently a goal of LLM development.


Most any LLM has a "temperature" setting: a knob controlling how much randomness is injected when sampling the next token from the model's otherwise fixed output distribution, intentionally causing exactly this nondeterministic behavior. Good for creative tasks, bad for repeatability. If you're running one of the open models, set the temperature down to 0 and it suddenly becomes perfectly consistent.
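A minimal sketch of what that knob does, using made-up logits rather than a real model: temperature rescales the model's scores before they become probabilities, so high values flatten the distribution and values near 0 approach greedy decoding.

    import numpy as np

    def softmax_with_temperature(logits, temperature):
        # Scale logits by 1/temperature, then softmax. Higher temperature flattens
        # the distribution (more randomness); temperature near 0 approaches greedy
        # selection of the single most likely token.
        scaled = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
        exp = np.exp(scaled - scaled.max())   # subtract max for numerical stability
        return exp / exp.sum()

    logits = [4.0, 3.0, 2.0, 1.0]             # made-up scores for four candidate tokens
    for t in (0.1, 0.7, 1.5):
        print(t, softmax_with_temperature(logits, t).round(3))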


You can get deterministic output even with a high temp.

Whatever "random" seed was used can be reused.


The model outputs probabilities, which you then have to sample from. Choosing the highest-probability token every time leads to poor results in practice, such as the model tending to repeat itself. It's a sort of Monte Carlo approach.


The trained model is just a bunch of statistics. To use those statistics to generate text you need to "sample" from the model. If you always sampled by taking the model's #1 token prediction that would be deterministic, but more commonly a random top-K or top-p token selection is made, which is where the randomness comes in.
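Roughly what that sampling step looks like: a toy top-k sampler over a made-up five-token distribution, not any particular library's implementation.

    import numpy as np

    rng = np.random.default_rng()              # unseeded: different choices every run

    def sample_top_k(probs, k):
        # Keep only the k most likely tokens, renormalize, and pick one at random.
        probs = np.asarray(probs, dtype=float)
        top = np.argsort(probs)[-k:]           # indices of the k highest-probability tokens
        p = probs[top] / probs[top].sum()      # renormalize over the shortlist
        return int(rng.choice(top, p=p))

    probs = [0.50, 0.25, 0.15, 0.07, 0.03]     # made-up next-token distribution
    print([sample_top_k(probs, k=3) for _ in range(10)])   # varies from run to run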


It is technically possible to make it fully deterministic if you have complete control over the model, quantization and sampling processes. The GP probably meant to say that most commercially available LLM services don't usually give you such control.


Actually, you just have to set the temperature to zero.


> If the tokens are bit-for-bit-identical, where does the non-determinism come in?

By design, most LLMs have a randomization factor in their sampling. Some use the concept of "temperature", which makes them randomly choose the 2nd- or 3rd-highest-ranked next token; the higher the temperature, the more often (and the lower-ranked) the non-best tokens they pick. OpenAI described this in their papers around the GPT-2 timeframe, IIRC.


Computers are deterministic. LLMs run on computers. If you use the same seed for the random number generator, you'll see that it produces the same output for the same input.
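A toy illustration of that point, assuming you actually control the sampler and can pin the seed (hosted APIs don't always expose one):

    import numpy as np

    def sample_sequence(seed, probs, length=5):
        # With a fixed seed, the "random" token choices come out identical every run.
        rng = np.random.default_rng(seed)
        return [int(rng.choice(len(probs), p=probs)) for _ in range(length)]

    probs = [0.5, 0.3, 0.2]                    # made-up next-token distribution
    print(sample_sequence(seed=42, probs=probs))
    print(sample_sequence(seed=42, probs=probs))   # identical to the line above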


The unreliability of LLMs is mostly unrelated to their (artificially injected) non-determinism.


There's no need for there to be any change to the question. LLMs have an RNG factor built into the algorithm. They can happily give you the right answer and then the wrong one.


> trivial changes in the question

I love how those changes are often just a different seed for the randomness... just chance.

I ran some repeated tests requiring "deeper than surface knowledge" of some niche subjects and was impressed that it gave the right answer... about 20% of the time.

(on earlier OpenAI models)


Ask survey designers how “trivial” changes to questions impact results from humans. It’s a huge thing in the field.


"AI is a really smart but inexperienced person who also lies frequently." Careful. Here "smart" means "amazing at pattern-matching and incredibly well-read, but has zero understanding of the material."


And how is what humans do any different? What does it mean to understand? Are we pattern matching as well?


Sure, we're also pattern matching, but additionally (among other things):

1) We're continually learning so we can update our predictions when our pattern matching is wrong

2) We're autonomous - continually interacting with the environment, and learning how it responds to our interactions

3) We have built in biases such as curiosity and boredom that drive us to experiment, gain new knowledge, and succeed in cases where "pre-training to date" would have failed us


For one, a brain can’t do anything without irreversibly changing itself in the process; our reasoning is not a pure function.

For a person to truly understand something they will have a well-refined (as defined by usefulness and correctness), malleable internal model of a system that can be tested against reality, and they must be aware of the limits of the knowledge this model can provide.

Alone, our language-oriented mental circuits are a thin, faulty conduit to our mental capacities; we make sense of words as they relate to mutable mental models, and not simply in latent concept-space. These models can exist in dedicated but still mutable circuitry such as the cerebellum, or they can exist as webs of association between sense-objects (which can be of the physical senses or of concepts, sense-objects produced by conscious thought).

So if we are pattern-matching, it is not simply of words, or of their meanings in relation to the whole text, or even of their meanings relative to all language ever produced. We translate words into problems, and match problems to models, and then we evaluate these internal models to produce perhaps competing solutions, and then we are challenged with verbalizing these solutions. If we were only reasoning in latent-space, there would be no significant difficulty in this last task.


At the end of the day, we're machines, too. I wrote a piece a few months ago with an intentionally provocative title, questioning whether we're truly on a different cognitive level.

https://acjay.com/2024/09/09/llms-think/


The difference is less about noticing patterns than it is knowing when to discard them.


Humans can extrapolate as well as interpolate.

AI can only interpolate. We may perceive it as extrapolation, but all LLM architectures are fundamentally cleverly designed lossy compression.


I asked ChatGPT to help out:

"The distinction between AI and humans often comes down to the concept of understanding. You’re right to point out that both humans and AI engage in pattern matching to some extent, but the depth and nature of that process differ significantly." "AI, like the model you're chatting with, is highly skilled at recognizing patterns in data, generating text, and predicting what comes next in a sequence based on the data it has seen. However, AI lacks a true understanding of the content it processes. Its "knowledge" is a result of statistical relationships between words, phrases, and concepts, not an awareness of their meaning or context"


Anyone downvoting, please be aware that you are downvoting the AI's answer!

:)


No, they're downvoting you for posting an AI answer.


That AI answer is not spam, though. It's literally the topic under discussion.


Yeah, it's just the fact that you pasted in an AI answer, regardless of how on point it is. I don't think people want this site to turn into an AI chat session.

I didn't downvote, I'm just saying why I think you were downvoted.


People are downvoting because they don't want to see walls of text generated by LLMs on HN.


That's reasonable. I cut back the text. On the other hand I'm hoping downvoters have read enough to see that the AI-generated comment (and your response) are completely on-topic in this thread.


It's on topic indeed. But is it insightful?

I use LLMs as tools to learn about things I don't know, and it works quite well in that domain.

But so far I haven't found that it helps advance my understanding of topics I'm an expert in.

I'm sure this will improve over time. But for now, I like that there are forums like HN where I may stumble upon an actual expert saying something insightful.

I think that the value of such forums will be diminished once they get flooded with AI generated texts.

(Fwiw I didn't down vote)


Of course the AI's comment was not insightful. How could it be? It's autocomplete.

That was the point. If you back up to the comment I was responding to, you can see the claim was: "maybe people are doing the same thing LLMs are doing". Yet, for whatever reason, many users seemed to be able to pick out the LLM comment pretty easily. If I were to guess, I might say those users did not find the LLM output to be human-quality.

That was exactly the topic under discussion. Some folks seem to have expressed their agreement by downvoting. Ok.


I think human brains are a combination of many things. Some part of what we do looks quite a lot like an autocomplete from our previous knowledge.

Other parts of what we do look more like a search through the space of possibilities.

And then we act and collaborate and test the ideas that stand against scrutiny.

All of that is in principle doable by machines. The things we currently have and call LLMs seem to mostly address the autocomplete part, although they are beginning to be augmented with various extensions that allow them to take baby steps on the other fronts. Will they still be called large language models once they have so many other mechanisms beyond mere token prediction?


If we wanted to talk to an LLM, we would go there and do it; this place is for humans to put in effort and use their brains to think for themselves.


With respect, can I ask you to please read the thread?


Completely missing the point.

We don't care what LLMs have to say; whether you cut back some of it or not, it's a low-effort waste of space on the page.

This is a forum for humans.

You regurgitating something you had no hand in producing, which we can prompt for ourselves, provides no value here. We could all spam LLM slop in the replies if we wanted, but that would make this site worthless.


I think you're saying that reading the thread is completely pointless, because we're all committed to having a high-quality discussion.


It's actually even worse than that: the current trend of AI is transformer-based deep learning models that use self-attention mechanisms to generate token probabilities, predicting sequences based on training data.

If only it were something we could ontologically map onto existing categories like servants or liars...


It is a wonderful irony that AI makes competence all the more important.

It’s almost like all the thought leadership that proclaimed the death of software eng was nothing but self-promotional noise. Huh, go figure.


Don't count it out yet as being problematic for software engineering, but not in the way you probably intend with your comment.

Where I see software companies using it most is as a replacement for interns and junior devs. That replacement means we're not training up the next generation to be the senior or expert engineers with real world experience. The industry will feel that badly at some point unless it gets turned around.


It’s also already becoming an issue for open-source projects that are being flooded with low-quality (= anything from “correct but pointless” to “actually introduces functional issues that weren’t there before”) LLM-generated PRs and even security reports; for examples, see Daniel Stenberg’s recent writing on this.


Agree. I think we are already seeing a hollowing out effect on tech hiring at the lower end. They’ve always been squeezed a bit, but it seems much worse now.


I agree with this irony.

That said, combining multiple AIs and multiple programs together may mitigate this.


Hallucinations can be mostly eliminated with RAG and tools. I use NotebookLM all the time to research through our internal artifacts; it includes citations/references from your documents.

Even with ChatGPT, you can ask it to find web citations, and if it uses the Python runtime to find answers, you can look at the code.

And to preempt the typical responses: my company uses GSuite, so Google already has our IP; NotebookLM is specifically approved by my company; and no, Google doesn’t train on your documents.
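The basic retrieve-then-answer shape behind that workflow, sketched with hypothetical answer_with_citations(), search_index() and llm() stand-ins (not NotebookLM's actual API):

    def answer_with_citations(question, search_index, llm, k=5):
        # Minimal RAG loop: retrieve passages, answer only from them, cite sources.
        # search_index and llm are placeholders for whatever retriever and model you use;
        # search_index(question, top_k=k) is assumed to return [(doc_id, text), ...].
        passages = search_index(question, top_k=k)
        context = "\n\n".join(f"[{doc_id}] {text}" for doc_id, text in passages)
        prompt = (
            "Answer using ONLY the sources below and cite them as [doc_id]. "
            "If the sources do not contain the answer, say so.\n\n"
            f"Sources:\n{context}\n\nQuestion: {question}"
        )
        return llm(prompt)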


Even with RAG you’re bounded at some 93% accuracy; it’s not a panacea.


How are you bounded? When you can easily check the sources? Also you act as if humans without LLMs have a higher success rate?

There is an entire “reproducibility crisis” with research.


False equivalence. Humans fail in predictable ways that can be mitigated through process and training.

Try training an LLM.


How are the results any different? You can’t completely trust human output. There are all sorts of cognitive biases that come into play.


Facts can be checked with RAG, but the real value of AI isn't as a search replacement, but for reasoning/problem-solving where the answer isn't out there.

How do you, in general, fact check a chain of reasoning?


It’s not just a search engine though.

I can’t tell a search engine to summarize a text for a technical audience and then produce another summary for a non-technical audience.

I recently came into the middle of a cloud consulting project where a lot of artifacts, transcripts of discovery sessions, requirement docs, etc had already been created.

I asked NotebookLM all of the questions I would have asked a customer at the beginning of a project.

What it couldn’t answer, I then went back and asked the customer.

I was even able to get it to create a project plan with work streams and epics. Yes, it wouldn’t have been effective if I didn’t already know project management and AWS and have two-plus decades of development experience.

Despite what people think, LLMs can also do a pretty good job at coding when well trained on the APIs. Fortunately, ChatGPT is well trained on the AWS CLI and the SDKs in various languages, and you can ask it to verify the SDK functions on the web.

I’ve been deep into AWS-based development since LLMs have been a thing. My opinion may change if I get back into more traditional development.


> I can’t tell a search engine to summarize text for a technical audience and then another summary for a non technical audience.

No, but, as amazing as that is, don't put too much trust in those summaries!

It's not summarizing based on grokking the key points of the text, but rather based on text vs summary examples found in the training set. The summary may pass a surface level comparison to the source material, while failing to capture/emphasize the key points that would come from having actually understood it.


I write the original content, or I was in the meeting whose transcript I’m giving it. I know what points I need to get across to both audiences.

Just like I’m not randomly depending on it to do an Amazon style PRFAQ (I was indoctrinated as an Amazon employee for 3.5 years), create a project plan, etc, without being a subject matter expert in the areas. It’s a tool for an experienced writer, halfway decent project manager, AWS cloud application architect and developer.


If I had a senior member of the team who was incredibly knowledgeable but occasionally lied, but in a predictable way, I would still find that valuable. Talking to people is a very quick and easy way to get information about a specific subject in a specific context, so I could ask them targeted questions that are easy to verify; the worst thing that happens is I 'waste' a conversation with them.


Sure, but LLMs don't lie in a predictable way. It's just their nature that they output statistical sentence continuations, with a complete disregard for the truth. Everything that they output is suspect, especially the potentially useful stuff where you don't know whether it's true or false.


They do lie in a predictable way: if you ask them for a widely available fact, you have a very high probability of getting the correct answer; if you ask them for something novel, you have a very high probability of getting something made up.

If I'm trying to use some tool that just got released or just got a big update, I won't use AI; if I want to check the syntax of a for loop in a language I don't know, I will. Whenever you ask it a question, you should have an idea in your mind of how likely you are to get a good answer back.


I suppose, but they can still be wrong on common facts, like the number of R's in strawberry, where the answer is counter-intuitive.

I saw an interesting example yesterday of the type "I have 3 apples, my dad has 2 more than me ...", where, of the top 10 predicted tokens, about half led to the correct answer and about half didn't. It wasn't the most confident predictions that led to the right answer; it was pretty much random.

The trouble with LLMs vs humans is that humans learn to predict facts (as reflected in feedback from the environment, and checked by experimentation, etc), whereas LLMs only learn to predict sentence soup (training set) word statistics. It's amazing that LLM outputs are coherent as often as they are, but entirely unsurprising that they are often just "sounds good" flow-based BS.


I think maybe this is where the polarisation between those who find ChatGPT useful and those who don't comes from. In this context, the number of r's in strawberry is not a fact: it's a calculation. I would expect AI to be able to spell a common word 100% of the time, but not to be able to count letters. I don't think that in the summary of human knowledge that has been digitised there are that many people asking 'how many r's are there in strawberry', and if there are, I think the common reply would be '2', since the context is based on the second r. (People confuse strawbery and strawberry, not strrawberry and strawberry.)

Your apples question is the same: it's not knowledge, it's a calculation, it's intelligence. The only time you're going to get intelligence from AI at the moment is when you ask a question that a significantly large number of people have already answered.


True, but that just goes to show how brittle these models are - how shallow the dividing line is between primary facts present (hopefully consistently so) in the training set, and derived facts that are potentially more suspect.

To make things worse, I don't think we can even assume that primary facts are always going to be represented in abstract semantic terms independent of source text. The model may have been trained on a fact but still fail to reliably recall/predict it because of "lookup failure" (model fails to reduce query text to necessary abstract lookup key).


Lying means stating things as facts despite knowing or believing that they are false. I don’t think this accurately characterizes LLMs. It’s more like a fever dream where you might fabulate stuff that appears plausibly factual in your dream world.


Saying that they "lie" creates the impression that they know they are making false statements and that they intend to deceive.

They're not that capable. They're just bullshit artists.

LLM = LBM (large bullshit models).


That sounds mostly like an incentives problem. If OpenAI, Anthropic, etc. decide their LLMs need to be accurate, they will find some way of better catching hallucinations. It will probably end up (already is?) being yet another LLM acting as a control structure, trying to fact-check responses before they are sent to users, though, so who knows if it will work well.

Right now there's no incentive, though. People keep paying good money to use these tools despite their hallucinations, aka lies/gaslighting/fake information. As long as users don't stop paying and LLM companies don't face business pressure to lean on accuracy as a market differentiator, no one is going to bother fixing it.


Believe me, if they could use another LLM to audit an LLM, they would have done that already.

It's inherent to transformers that they predict the next most likely token; it's not possible to change that behavior without making them useless at generalizing tasks (overfitting).

LLMs run on statistics, not logic. There is no fact checking, period. There is just the next most likely token based on the context provided.


Yes, most people who disagree with this have no clear understanding of how an LLM works. It is just a prediction mechanism for the next token. The implementation is very fancy and takes a lot of training, but it's not doing anything more than next-token prediction. That's why it is incapable of doing any reasoning.


Yeah, it's an interesting question, and I'm a little surprised I got downvoted here.

I wouldn't expect them to add an additional LLM layer unless hallucinations from the underlying LLM aren't acceptable, and in this case that means it is unacceptable enough to cost them users and money.

Adding a check/audit layer, even if it would work, is expensive both financially and computationally. I'm not sold that it would actually work, but I just don't think they've had enough reason to really give it a solid effort yet either.

Edit: as far as fact-checking goes, I'm not sure why it would be impossible. An LLM likely wouldn't be able to run a check against a pre-trained model of "truth," but that isn't the only option. An LLM should be able to mimic what a human would do: interpret the response and search a live dataset of sources considered believable. Throw a budget of resources at processing the search results and have the LLM decide if the original response isn't backed up, or contradicts the sources entirely.
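Roughly the shape being described, with extract_claims(), web_search() and llm() as hypothetical stand-ins for whatever stack you'd actually use:

    def audit_response(draft, extract_claims, web_search, llm, budget=5):
        # Second pass over a draft answer: pull out factual claims and check each one
        # against live search results before the answer is shown to the user.
        # All three callables are placeholders, not a real API.
        verdicts = []
        for claim in extract_claims(draft)[:budget]:        # stay within a fixed budget
            evidence = web_search(claim, top_k=3)
            verdict = llm(
                "Does the evidence support, contradict, or not address the claim?\n"
                f"Claim: {claim}\nEvidence: {evidence}\n"
                "Answer with one word: supports / contradicts / unclear."
            )
            verdicts.append((claim, verdict.strip().lower()))
        ok = all(v == "supports" for _, v in verdicts)       # flag anything not supported
        return ok, verdicts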



