
It's a very strange result.

I believe that NVIDIA is overvalued, but if DeepSeek really is as great as has been claimed, then it'll be even greater when scaled up to OpenAI sizes. When you get more out, you have more reason to pay, so if it pans out this should lead to more demand for GPUs: basically the Jevons paradox.
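
To make the Jevons point concrete, here's a toy calculation (all the numbers are made up, purely illustrative):

    # Toy Jevons-paradox arithmetic: a 10x efficiency gain cuts the cost per
    # model, but if demand is elastic enough, total GPU spend still rises.
    def gpu_spend(efficiency_gain: float, demand_elasticity: float,
                  baseline_spend: float = 1.0) -> float:
        """Relative GPU spend after an efficiency gain, assuming demand
        scales as efficiency_gain ** demand_elasticity."""
        demand = efficiency_gain ** demand_elasticity     # how much more usage
        return baseline_spend * demand / efficiency_gain  # each unit costs 1/gain

    print(gpu_spend(10.0, 0.5))  # ~0.32x: inelastic demand, GPU spend shrinks
    print(gpu_spend(10.0, 1.5))  # ~3.16x: elastic demand, spend grows anyway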






If the top-tier premium GPUs aren't the difference-maker they were thought to be, then that will hurt NVIDIA's margins, even if they make some of it up on volume.

You need to be prepared for the reality that naive scaling no longer works for LLMs.

Simple question: where is GPT-5?


It is a possibility, but my understanding of what OpenAI has said is that GPT-5 is delayed because of the apparent promise of RL-trained models like o1, and that they've simply decided to train those instead of training a bigger base model on better data. I think this is plausible.

OpenAI has an incentive to make people believe that the scaling laws are still alive, to justify their enormous capex if nothing else.

I wouldn't give what they say too much credence, and will only believe the results I see.


Yes, I think I agree that it seems unlikely that the spending they're doing can be recouped.

But it can still make sense for a state, even if it doesn't make sense for investors.


If we expect GPT-5 to need 100x the training compute of GPT-4, and GPT-4 was trained in months on 10k H100s, then you would need years with 100k H100s, or perhaps months again with 100k GB200s.

See, there is your answer. The issue is that GPU compute is still way too low for GPT-5 if they continue parameter scaling the way they used to.

GPT-3 took months on 10k A100s; 10k H100s would have done it in a fraction of the time. Blackwell could train GPT-4 in 10 days with the same number of GPUs where Hopper took months.

Don't forget GPT-3 is just 2.5 years old. Training is obviously waiting for the next step up in training speed from large clusters. Don't be fooled: the 2x Blackwell-vs-Hopper figure is only chip vs. chip. 10k Blackwells, including all the networking speedup, are easily 10x or more faster than the same number of Hoppers. So building a 1-million-Blackwell cluster means 100x more training compute compared to a 100k Hopper cluster.
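
Rough arithmetic with those ballpark figures (nothing here is a measured benchmark, just the assumptions above):

    # Assumed: a GPT-4-scale run took ~3 months on 10k H100s, a GPT-5-scale run
    # needs ~100x the compute, and a Blackwell cluster (incl. networking) is
    # ~10x a Hopper cluster of the same size.
    GPT4_MONTHS = 3
    GPT4_GPUS = 10_000
    TARGET_COMPUTE_MULT = 100

    def training_months(gpus: int, per_gpu_speedup: float) -> float:
        """Months to run TARGET_COMPUTE_MULT times the GPT-4 workload."""
        effective = (gpus / GPT4_GPUS) * per_gpu_speedup
        return GPT4_MONTHS * TARGET_COMPUTE_MULT / effective

    print(training_months(100_000, 1.0))     # 30.0 -> years on 100k Hopper
    print(training_months(100_000, 10.0))    # 3.0  -> months on 100k GB200
    print(training_months(1_000_000, 10.0))  # 0.3  -> ~100x a 100k Hopper cluster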

Nobody starts a model training run if it takes years to finish... too much risk in that.

The transformer was introduced in 2017, and ChatGPT only came out in 2022. Why? Because they would have needed millions of Volta GPUs instead of thousands of Ampere GPUs to train it.


There is a theory, hinted at by DeepSeek's distillation process, that o1 is really a distillation of a bigger GPT (GPT-5?).

Some consider this to be spurious or a conspiracy theory.


There is a big model from NVIDIA that I assume is for this purpose, i.e. Megatron 530b, so it doesn't sound too unreasonable.

Edit: I assumed that the model was intended for distillation; that is apparently not true.


The architecture doesn't keep yielding better results as it scales, so Jevons paradox doesn't apply.

But surely it can be scaled up? Or is this compression thing something that makes the approach good only for small models? (I haven't read the DeepSeek papers; I can't allocate time to them.)

Anyhow, if you can deliver more with less, that is hugely good news for the AI industry.

After some readjustment we can expect AI companies to start using the new method to deliver more. Science fiction might happen sooner than expected.

Buy the dip.


The limit is high quality data, not compute.

RL doesn't need that much static data; it needs a lot of "good" tasks/challenges and computation.

Right, and LLMs will not be able to generate their own high quality training data.

There are no perpetual motion machines.


> LLMs will not be able to generate their own high quality training data.

Humans certainly did. We did not inherit our physics and poetry books from some aliens.


Humans and LLMs are different things.

LLMs cannot reason - many people seem to believe that they can.


I can't prove that we did, but I don't know that we /didn't/.

LLMs are not humans, nowhere near.

Have you read about this specific model we're talking about?

My understanding is that the whole point of R1 is that it was surprisingly effective to train on synthetic data AND to reinforce on the output rather than the whole chain of thought, which does not require so much human-curated data and is a big part of where the efficiency gain came from.
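
Very roughly, that recipe looks something like this. This is a simplified sketch, not DeepSeek's actual pipeline; `model.generate` and `model.policy_update` are stand-ins for whatever training stack is used:

    import re

    def final_answer(completion: str) -> str:
        """Grab whatever follows the last 'Answer:' marker; the chain of
        thought before it is never graded directly."""
        return re.split(r"Answer:", completion)[-1].strip()

    def outcome_reward(completion: str, ground_truth: str) -> float:
        # Reward depends only on the verifiable final output, not on how
        # the model reasoned its way there.
        return 1.0 if final_answer(completion) == ground_truth else 0.0

    def rl_step(model, tasks):
        rewards = []
        for prompt, ground_truth in tasks:
            completion = model.generate(prompt)   # model writes its own CoT
            rewards.append(outcome_reward(completion, ground_truth))
        model.policy_update(rewards)              # e.g. a GRPO/PPO-style update
        return sum(rewards) / len(rewards)        # fraction of tasks solved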


They already do. All the current leading edge models are heavily trained on synthetic data. It's called textbook learning.

> The limit is high quality data

If, as some companies claim, these models truly possess emergent reasoning, their ability to handle imperfect data should serve as proof of that capability.


For Oracle (another Stargate recipient) it was reversion to the mean. For Nvidia, it's a big loss - I imagine they might have predicated their revenue projections on the continued need for compute - and now that's in question.


