
They do sell shovels, you can get Google TPUs on Google Cloud.



Exactly, and they are still only about 1/18th as good at training LLMs as an H100.

Maybe they are less than 1/18th the cost, so Google technically has a marginally better unit cost, but I doubt it once you factor in the R&D cost. They are less bad at inference, but still much worse than even an A100.


Given that Google invented the Transformer architecture (and Google AI continues to do foundational R&D on ML architecture), and that Google's TPUs don't even support the CUDA stack most ML tooling targets but instead require their own XLA-based training and inference frameworks, I would assume that "the point" of TPUs from Google's perspective has less to do with running LLMs and more to do with running weird experimental custom model architectures that don't even exist as journal papers yet.

I would bet money that TPUs are at least better at doing AI research than anything Nvidia will sell you. That alone might be enough for Google to keep getting some new ones fabbed each year. The TPUs you can rent on Google Cloud might very well just be hardware requisitioned by the AI team, for the AI team, that they aren't always using to capacity, and so is "earning out" its CapEx through public rentals.

TPUs may also be better at other things Google does internally. Running inference on YouTube's audio+video-input, timecoded-captions-output model, say.
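To illustrate the framework point with a minimal sketch (generic JAX, nothing Google-internal): the same program compiles for whichever XLA backend is present, TPU or GPU, which is exactly the kind of portability a research team cares about more than raw LLM throughput.

    # Minimal JAX sketch: the same code targets whatever XLA backend
    # is available (TPU, GPU, or CPU). Shapes and values are arbitrary.
    import jax
    import jax.numpy as jnp

    @jax.jit
    def layer(x, w):
        return jax.nn.relu(x @ w)

    k1, k2 = jax.random.split(jax.random.PRNGKey(0))
    x = jax.random.normal(k1, (128, 512))
    w = jax.random.normal(k2, (512, 512))

    print(jax.devices()[0].platform)  # "tpu", "gpu", or "cpu"
    print(layer(x, w).shape)          # (128, 512)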


If you're interested in a peer-reviewed scientific comparison, Google writes retrospective papers after contemporary TPUs and GPUs have been deployed, rather than speculating about future products. The most recent compares TPU v4 and the A100 (a TPU v5 vs. H100 comparison is left for a future paper). Here is a quote from the abstract:

"Deployed since 2020, TPU v4 outperforms TPU v3 by 2.1x and improves performance/Watt by 2.7x. ... For similar sized systems, it is ~4.3x--4.5x faster than the Graphcore IPU Bow and is 1.2x--1.7x faster and uses 1.3x--1.9x less power than the Nvidia A100. TPU v4s inside the energy-optimized warehouse scale computers of Google Cloud use ~2--6x less energy and produce ~20x less CO2e than contemporary DSAs in typical on-premise data centers."

Here is a link to the paper: https://dl.acm.org/doi/pdf/10.1145/3579371.3589350


That quote is referring to the A100... the H100 used ~75% more power to deliver "up to 9x faster AI training and up to 30x faster AI inference speedups on large language models compared to the prior generation A100."[0]

Which sure makes the H100 sound both faster and more efficient (per unit of compute) than the TPU v4, given what was in your quote. I don't think your quote does anything to support the position that TPUs are noticeably better than Nvidia's offerings for this task.

Complicating this is that the TPU v5 generation has already come out, and the Nvidia B100 generation is imminent within a couple of months. (So, no, a comparison of TPUv5 to H100 isn't for a future paper... that future paper should be comparing TPUv5 to B100, not H100.)

[0]: https://developer.nvidia.com/blog/nvidia-hopper-architecture...
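Back-of-envelope, using only the headline numbers quoted above (Nvidia's 9x is a best-case marketing figure, so treat this as an upper bound rather than a matched-workload measurement):

    # Rough perf/W relative to the A100, from the quoted figures only.
    h100_speedup = 9.0           # "up to 9x faster AI training"
    h100_power_ratio = 1.75      # ~75% more power than the A100

    tpu_v4_speedup = (1.2, 1.7)        # 1.2x-1.7x faster than the A100
    tpu_v4_power_saving = (1.3, 1.9)   # uses 1.3x-1.9x less power

    h100_perf_per_watt = h100_speedup / h100_power_ratio
    tpu_v4_perf_per_watt = (tpu_v4_speedup[0] * tpu_v4_power_saving[0],
                            tpu_v4_speedup[1] * tpu_v4_power_saving[1])

    print(f"H100:   ~{h100_perf_per_watt:.1f}x A100 perf/W (best case)")
    print(f"TPU v4: ~{tpu_v4_perf_per_watt[0]:.1f}x-{tpu_v4_perf_per_watt[1]:.1f}x A100 perf/W")
    # -> H100 ~5.1x (best case) vs. TPU v4 ~1.6x-3.2x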


As someone unfamiliar with this area, can one of the downvoters explain why they chose to downvote this? Is it wrong?


I'm sure it probably is faster for their own workloads (which they are choosing to benchmark on); why bother making it if not? But that is clearly not universally true, and a GPU is clearly more versatile. The benchmark means nothing to most people if they can't, for example, train an LLM on TPUs.


I don't see how you can evaluate better and worse for training without doing so on a cost basis. If it costs less and eventually finishes, then it's better.
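A toy sketch of what "on a cost basis" means here; every number is made up for illustration, and either side can win depending on the actual prices and runtimes:

    # Total cost of one training run: chips x hours x hourly price.
    def training_run_cost(num_chips, hours_to_finish, price_per_chip_hour):
        return num_chips * hours_to_finish * price_per_chip_hour

    # Hypothetical: many cheaper, slower chips vs. fewer pricier, faster ones.
    slow_cheap = training_run_cost(num_chips=256, hours_to_finish=90, price_per_chip_hour=1.20)
    fast_pricey = training_run_cost(num_chips=64, hours_to_finish=60, price_per_chip_hour=4.00)

    print(f"slow/cheap:  ${slow_cheap:,.0f}")   # $27,648
    print(f"fast/pricey: ${fast_pricey:,.0f}")  # $15,360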


This assumes that you can linearly scale up the number of TPUs to get equal performance to Nvidia cards for less cost. Like most things distributed, this is unlikely to be the case.


This is absolutely the case; TPUs scale very well: https://github.com/google/maxtext


The repo mentions a Karpathy tweet from Jan 2023. Andrej has since created llm.c, and the same model trained about 32x faster on the same Nvidia hardware mentioned in the tweet. I don't think the performance estimate that the repo used (based on that early tweet) was accurate for the performance of the Nvidia hardware itself.


Time is money. You might be a lab with long queues to train, leaving expensive staff twiddling their thumbs.


Also energy cost. 18 chips vs. 1: it's probably costing a lot more to run 18.
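Rough arithmetic, with both wattages as ballpark assumptions rather than spec-sheet numbers:

    # The "18 chips vs 1" framing, with assumed power draws.
    tpu_chip_watts = 200   # assumed per-TPU-chip draw (ballpark)
    h100_watts = 700       # assumed H100 SXM draw (ballpark)

    tpu_total = 18 * tpu_chip_watts
    print(tpu_total, "W vs", h100_watts, "W")            # 3600 W vs 700 W
    print(f"~{tpu_total / h100_watts:.1f}x the power")   # ~5.1x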


Google claims the opposite in "TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings": https://arxiv.org/abs/2304.01433

Details aside, I don't think this is an area where Facebook is very different from Google. Both have terrifying amounts of datacenter capacity to play with. Both have long experience making reliable products out of unreliable subsystems. Both have innovative orchestration and storage stacks. Meta hasn't published much, if anything, about things like reconfigurable optical switches, but that doesn't mean they don't have such a thing.


Wouldn't that be renting a shovel vs selling a shovel?


NVIDIA sells subscriptions...


I'm only aware of Nvidia AI Enterprise and that isn't required to run the GPU.

I think it's aimed at medium to large corporations.

Massive corporations such as Meta and OpenAI would build their own cloud and not rely on this.

The GPU really is a shovel, and can be used without any subscription.

Don't get me wrong, I want there to be competition with Nvidia; I want more access for open source and small players to run and train AI on competitive hardware at our own sites.

But no one is competing; no one has any idea what they're doing. Nvidia has no competition whatsoever, and no one is even close.

This lets Nvidia get away with adding more VRAM onto an AI-specific GPU and increasing the price by 10x.

This lets Nvidia remove NVLink from current gen consumer cards like the 4090.

This lets Nvidia use their driver license terms to prevent cloud platforms from offering consumer cards as a choice in datacenters.

If Nvidia had a shred of competition things would be much better.


I'm not sure why you're getting downvoted. It's very clear that Nvidia is moving towards directly offering cloud services[0].

[0] - https://www.nvidia.com/en-us/data-center/dgx-cloud/



