
It's a very strange result.

I believe that NVIDIA is overvalued, but if DeepSeek really is as great as has been claimed, then it'll be even greater when scaled up to OpenAI sizes. When you get more out, you have more reason to pay, so if it pans out this should lead to more demand for GPUs: basically the Jevons paradox.
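
To make the Jevons point concrete, here's a toy calculation (all the numbers are made up, purely illustrative):

    # Toy Jevons-paradox arithmetic: a 10x efficiency gain cuts the cost per
    # model, but if demand is elastic enough, total GPU spend still rises.
    def gpu_spend(efficiency_gain: float, demand_elasticity: float,
                  baseline_spend: float = 1.0) -> float:
        """Relative GPU spend after an efficiency gain, assuming demand
        scales as efficiency_gain ** demand_elasticity."""
        demand = efficiency_gain ** demand_elasticity     # how much more usage
        return baseline_spend * demand / efficiency_gain  # each unit costs 1/gain

    print(gpu_spend(10.0, 0.5))  # ~0.32x: inelastic demand, GPU spend shrinks
    print(gpu_spend(10.0, 1.5))  # ~3.16x: elastic demand, spend grows anyway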






If the top-tier premium GPUs aren't the difference-maker they were thought to be, then that will hurt NVIDIA's margins, even if they make some of it up on volume.

You need to be prepared for the reality that naive scaling no longer works for LLMs.

Simple question: where is GPT-5?


It is a possibility, but my understanding of what OpenAI has said is that GPT-5 is delayed because of the apparent promise of RL-trained models like o1, and that they've simply decided to train those instead of training a bigger base model on better data. I think this is plausible.

OpenAI has an incentive to make people believe that the scaling laws are still alive, to justify their enormous capex if nothing else.

I wouldn't give what they say too much credence, and will only believe the results I see.


Yes, I think I agree that it seems unlikely that the spending they're doing can be recouped.

But it can still make sense for a state, even if it doesn't make sense for investors.


If we expect GPT-5 to need 100x the training compute of GPT-4, and GPT-4 was trained in months on 10k H100s, then you would need years with 100k H100s, or perhaps months again with 100k GB200s.

See, there is your answer. The issue is that GPU compute is still way too low for GPT-5 if they continue parameter scaling the way they used to.

GPT-3 took months on 10k A100s; 10k H100s would have done it in a fraction of the time. Blackwell could train GPT-4 in 10 days with the same number of GPUs where Hopper took months.

Don't forget GPT-3 is just 2.5 years old. Training is obviously waiting for the next step up in training speed from large clusters. Don't be fooled: the 2x Blackwell-vs-Hopper figure is only chip vs. chip. 10k Blackwells, including all the networking speedup, are easily 10x or more faster than the same number of Hoppers. So building a 1-million-Blackwell cluster means 100x more training compute compared to a 100k Hopper cluster.
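
Rough arithmetic with those ballpark figures (nothing here is a measured benchmark, just the assumptions above):

    # Assumed: a GPT-4-scale run took ~3 months on 10k H100s, a GPT-5-scale run
    # needs ~100x the compute, and a Blackwell cluster (incl. networking) is
    # ~10x a Hopper cluster of the same size.
    GPT4_MONTHS = 3
    GPT4_GPUS = 10_000
    TARGET_COMPUTE_MULT = 100

    def training_months(gpus: int, per_gpu_speedup: float) -> float:
        """Months to run TARGET_COMPUTE_MULT times the GPT-4 workload."""
        effective = (gpus / GPT4_GPUS) * per_gpu_speedup
        return GPT4_MONTHS * TARGET_COMPUTE_MULT / effective

    print(training_months(100_000, 1.0))     # 30.0 -> years on 100k Hopper
    print(training_months(100_000, 10.0))    # 3.0  -> months on 100k GB200
    print(training_months(1_000_000, 10.0))  # 0.3  -> ~100x a 100k Hopper cluster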

Nobody starts a model training run if it takes years to finish... too much risk in that.

The transformer was introduced in 2017, and ChatGPT only came out in 2022. Why? Because they would have needed millions of Volta GPUs instead of thousands of Ampere GPUs to train it.


There is a theory, hinted at by DeepSeek's distillation process, that o1 is really a distillation of a bigger GPT (GPT-5?).

Some consider this to be spurious or a conspiracy theory.


There is a big model from NVIDIA that I assume is for this purpose, i.e. Megatron 530b, so it doesn't sound too unreasonable.

Edit: I assumed that the model was intended for distillation; that is apparently not true.


The architecture doesn't keep yielding better results as it scales, so Jevons paradox doesn't apply.

But surely it can be scaled up? Or is this compression thing something that makes the approach good only for small models? (I haven't read the DeepSeek papers; I can't allocate time to them.)

Anyhow, if you can deliver more with less, that is hugely good news for the AI industry.

After some readjustment we can expect AI companies to start using the new method to deliver more. Science fiction might happen sooner than expected.

Buy the dip.


The limit is high quality data, not compute.

RL doesn't need that much static data; it needs a lot of "good" tasks/challenges and computation.

Right, and LLMs will not be able to generate their own high quality training data.

There are no perpetual motion machines.


> LLMs will not be able to generate their own high quality training data.

Humans certainly did. We did not inherit our physics and poetry books from some aliens.


Humans and LLMs are different things.

LLMs cannot reason - many people seem to believe that they can.


I can't prove that we did, but I don't know that we /didn't/.

LLMs are not humans, nowhere near.

Have you read about this specific model we're talking about?

My understanding is that the whole point of R1 is that it was surprisingly effective to train on synthetic data AND to reinforce on the output rather than the whole chain of thought, which does not require so much human-curated data and is a big part of where the efficiency gain came from.
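
Very roughly, that recipe looks something like this. This is a simplified sketch, not DeepSeek's actual pipeline; `model.generate` and `model.policy_update` are stand-ins for whatever training stack is used:

    import re

    def final_answer(completion: str) -> str:
        """Grab whatever follows the last 'Answer:' marker; the chain of
        thought before it is never graded directly."""
        return re.split(r"Answer:", completion)[-1].strip()

    def outcome_reward(completion: str, ground_truth: str) -> float:
        # Reward depends only on the verifiable final output, not on how
        # the model reasoned its way there.
        return 1.0 if final_answer(completion) == ground_truth else 0.0

    def rl_step(model, tasks):
        rewards = []
        for prompt, ground_truth in tasks:
            completion = model.generate(prompt)   # model writes its own CoT
            rewards.append(outcome_reward(completion, ground_truth))
        model.policy_update(rewards)              # e.g. a GRPO/PPO-style update
        return sum(rewards) / len(rewards)        # fraction of tasks solved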


They already do. All the current leading edge models are heavily trained on synthetic data. It's called textbook learning.

> The limit is high quality data

If, as some companies claim, these models truly possess emergent reasoning, their ability to handle imperfect data should serve as proof of that capability.


For Oracle (another Stargate recipient) it was reversion to the mean. For Nvidia, it's a big loss - I imagine they might have predicated their revenue projections on the continued need for compute - and now that's in question.


