Google’s Cloud TPU Pods are now publicly available in beta

pd0wm · on May 8, 2019

I wonder if they will apply the same terms of service as with their Cloud Machine Learning offerings (Auto ML, Cloud Vision, etc).

A snippet from https://cloud.google.com/terms/service-terms#12-google-cloud...:

  Customer will not, and will not allow third parties to: (i) use these Services
  to create, train, or improve (directly or indirectly) a similar or competing
  product or service or (ii) integrate these Services with any applications for
  any embedded devices such as cars, TVs, appliances, or speakers without Google's
  prior written permission. These Services can only be integrated with
  applications for the following personal computing devices: smartphones, tablets,
  laptops, and desktops. In addition to any other available remedies, Google may
  immediately suspend or terminate Customer's use of these Services based on any
  suspected violation of these terms, and violation of these terms is deemed
  violation of Google's Intellectual Property Rights. Customer will provide Google
  with any assistance Google requests to reasonably confirm compliance with these
  terms (including interviews with Customer employees and inspection of Customer
  source code, model training data, and engineering documentation). These terms
  will survive termination or expiration of the Agreement.

bduerst · on May 8, 2019

This is the TOS specifically for two of the managed services - it says so right in the link "Google Cloud Machine Learning Group and Google Cloud Machine Learning Engine"

These specific ML-as-a-service products are pretty easy to plug into and resell to third parties, which is why the wording is so restrictive in the TOS.

Many of Google Clouds other ML products, which typically are more customizeable and powerful, do not have these restrictions. The way your comment is worded makes it seem like this restriction is on all of Google Clouds ML products, but it's not.

sytelus · on May 8, 2019

I don’t understand these restrictions. Why is reselling bad? I would think people would still prefer first party services and even if they don’t you still get utilization in any case. Sure, there can be some outlier bad cases but these kind of restrictions can be a complete turnoff for enterprise users who just don’t want to take any legal risk of breaking TOS.

pd0wm · on May 8, 2019

It might have not been clear from my comment, but I was specifically referring to those ML as a service. The TPUs seem to fall under GCP which has a less restrictive ToS.

I do understand they don’t want to make it too easy to resell their research. But having access to all your source code, training data and documentation is a bit extreme. Why can’t they just tell you to stop selling your product?

ahjones · on May 8, 2019

From the link:

12.1 The following terms apply only to current and future Google Cloud Platform Machine Learning Services specifically listed in the "Google Cloud Platform Machine Learning Services Group" category on the Google Cloud Platform Services Summary page

Those services are (from https://cloud.google.com/terms/services):

  - Cloud AutoML
  - Cloud Text-to-Speech
  - Dialogflow Enterprise Edition
  - Google Cloud Data Labeling
  - Google Cloud Natural Language
  - Google Cloud Speech-to-Text
  - Google Cloud Video Intelligence
  - Google Cloud Vision

bduerst · on May 8, 2019

Right - that is the Google Cloud Machine Learning Group, which are managed ML-as-a-service products.

If you read the list, you will see they're highly specialized, out of the box ML products and super easy to resell, hence the TOS restrictions.

grej · on May 8, 2019

Sincerely thank you for the public service of posting that. You’ve probably saved a lot of folks on here from a big mistake.

I had no idea GCP had such terms. I had been considering alternative cloud hosting for a ML SaaS but will definitely not consider GCP.

dodobirdlord · on May 8, 2019

You'll probably find that all ML SaaS EULAs look about the same.

sk0g · on May 8, 2019

That is only in the terms for the managed offerings though, which I'd assume, is not what you'd be looking into.

aashu_dwivedi · on May 8, 2019

It's not for GCP but the ML products.

_hyn3 · on May 8, 2019

.. We will force you to admit that we OWN machine learning for all consumer use cases and will happily sue you just for renting our servers, or force you to turn over your Intellectual Property Rights to us. Thanks.

penagwin · on May 8, 2019

> These terms will survive termination or expiration of the Agreement.

Is this legal? Forgive me because I actually don't know if it could be?

geofft · on May 8, 2019

It's standard, even though it's a bit of a weird concept when you think about it: https://www.adamsdrafting.com/survival/

A common case is an employment contract that has an NDA or non-compete or non-poaching section survive the employment itself. I can't say, "I quit, therefore my employment contract doesn't apply any more, therefore I'm going to push all the company secrets to GitHub and you can't stop me."

Kurtz79 · on May 8, 2019

I guess that if someone doesn't think the terms are legal, they are fully entitled to take it to court against an army of Google lawyers...

Does someone knows of some instance of ToS from a big company being challenged in court?

solomatov · on May 8, 2019

IANAL, but I seen many agreements with the language like this.

penagwin · on May 8, 2019

To be fair, "Warranty Void if Removed" stickers are common but in many cases have no legal standing both in the US, EU, etc. They're just a common "boilerplate" kinda item.

inapis · on May 8, 2019

>(ii) integrate these Services with any applications for any embedded devices such as cars, TVs, appliances, or speakers without Google's prior written permission.

Genuinely curious about the restriction regarding embedded devices.

If there's a mishap, it is the responsibility of the owning company, certainly not Google.

So what gives? Why would they do this?

reilly3000 · on May 8, 2019

If they have patents for technology applied to those items, they would lose the patents if the didn’t enforce them. I’m sure they could arrange a license if you had a use case, and that may be free. Again, if they demonstrate they don’t enforce their IP (which is possibly incredibly valuable) at least in writing they could lose all rights to enforce their patent. I’m no IP lawyer but I paid one a bunch of money once.

grandmczeb · on May 8, 2019

This isn’t true. Are you thinking of trademarks?

reilly3000 · on May 8, 2019

Eh, you’re right, I was confusing it with trademarks. I’ll dust off that pitchfork, there isn’t any good excuse for them naming the industries they would like to own and barring their customers from competing with them.

m0zg · on May 8, 2019

This doesn't seem plausible. They have tens of thousands of patents on all sorts of stuff, and people "infringe" on them all day long with no repercussions. Some of those patents will probably be violated when I click the "reply" button under this message.

jstanley · on May 8, 2019

> violation of these terms is deemed violation of Google's Intellectual Property Rights

Oh, it's deemed? I guess that's settled then.

The word "deemed" is almost always used by somebody who doesn't actually have the authority to decide something, but wants you to think they do.

amelius · on May 8, 2019

Regardless of the ToS, my clients don't allow sending data to third parties so this service is completely useless to me.

kitotik · on May 8, 2019

Wow! That is even more aggressive than I would have guessed, and I’m cynical.

Do they have actual business customers that agree to those terms?

azurezyq · on May 8, 2019

I think this only applies with ML products.

jcims · on May 8, 2019

ML is about the only GCP service that’s interesting to large business customers.

manigandham · on May 8, 2019

It's definitely not, but the terms are indeed aggressive for using it.

jcims · on May 8, 2019

Ok one large business customer then. :). I kind of want to see the lawyers go at it over this.

crazysim · on May 8, 2019

Does GM's Cruise use GCP? They might just be using the GPUs/CPUs parts though.

snaky · on May 8, 2019

Customers of GM size have a chance to renegotiate the terms.

m0zg · on May 8, 2019

I was taken aback by this one as well. Basically this is an ineffectual attempt at preventing giving away their AI advantage by letting others use their proprietary models (and thus, indirectly, the gigantic dataset that make those models so good) as a "teacher" for their own models, using the fairly standard model refinement techniques popularized by their own Geoff Hinton. Keeping the cards close to their vest, as it were.

See e.g. https://www.quora.com/What-is-hiybbprqag for an example where Microsoft trained their ranker on Google's long tail and substantially reduced the relevance gap on the cheap, and in a way that makes it impossible for Google to trivially regain the advantage. People misinterpreted this as "Microsoft is copying the results", but that's not what it was. Google was unwillingly teaching Bing how to rank. Google responded with personalized search.

I don't think they can realistically prevent this, though. And it's not going to be anywhere near as traceable as "hiybbprqag" was.

mamon · on May 8, 2019

Google has been using different trick for the same reason with their own employees: that famous 20% of working time that can be dedicated to personal projects. You might think they did that to increase employee satisfaction, but the real reason is that this way they own copyright to whatever their employees create in their spare time, and avoid Facebook-WhatsApp scenario (i.e. having to spent $14B on something you could have aquired for free).

Now I guess they apply the same principle to their GCP customers.

EpicEng · on May 8, 2019

>but the real reason is that this way they own copyright to whatever their employees create in their spare time

Yeah it's not their spare time; it's work time. I would enjoy being able to spend one day a week working on whatever I like _for the company that employs me_. It's not a vacation day you spend in the office.

manigandham · on May 8, 2019

It was never for personal projects. The actual IPO quote is "We encourage our employees, in addition to their regular projects, to spend 20% of their time working on what they think will most benefit Google".

It's often done in other companies to a lesser degree with hackdays/weeks and is a good way to clean up tech debt or kickstart new product ideas.

paulddraper · on May 8, 2019

> create, train, or improve (directly or indirectly) a similar or competing product or service

Talking about Google/Alphabet? Good luck with finding something not similar or competing.

rryan · on May 8, 2019

Cloud TPU pods are seriously amazing. I'm a researcher at Google working on speech synthesis, and they allow me to flexibly trade off resource usage vs. time to results with nearly linear scaling due to the insanely fast interconnect. TPUs are already fast (non-pods, i.e. 8 TPU cores are 10x faster for my task than 8 V100s) but having pods open up new possibilities I couldn't build easily with GPUs. As a silly example, I can easily train on a batch size of 16k (typical batch size on one GPU is 32) if I want to by using one of the larger pod sizes, and it's about as fast as my usual batch size as long as the batch size per TPU core stays constant. Getting TPU pod quota was easily the single biggest productivity speedup my team has ever had.

totoglazer · on May 8, 2019

How, if at all, do you have to tweak model architecture or hyperparameters for pods vs tpu vs gpu?

rytill · on May 8, 2019

16k batch size going that fast... Damn. Use it for good.

sytelus · on May 8, 2019

Are TPUs drop-in replacement for CUDA if you were using TF?Can I simply change device from CUDA to TPU and run any TF code? Last I heard, TPUs still had long way to go towards making this happen...

nl · on May 8, 2019

If you are using TF, you can look in Tensorboard at your model graph and it will show you any incompatible operations.

These days it's pretty good. Not perfect, but you can do RNNs on it now. The FAQ has decent docs: https://cloud.google.com/tpu/docs/faq

p1esk · on May 8, 2019

I thought TPUs are comparable in speed to V100. What makes them 10x faster for your task?

sdenton4 · on May 8, 2019

IIUC, the difference is that the TPU allows insane data parallelism - exactly the giant batch sizes that rryan mentions. Google carried out extensive research on batch size, demonstrating that you can greatly increase training efficiency with giant batches. (And with tweaks to optimizers or some regularization, can push the "efficient" batch size domain even further.)

https://ai.googleblog.com/2019/03/measuring-limits-of-data-p...

p1esk · on May 8, 2019

Batch size is only limited by memory. For comparison Nvidia DGX-2 has 512GB.

sdenton4 · on May 8, 2019

Here's a writeup on tpu vs got throughput: https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transfor...

The underlying principle is that even if the silicon isn't strictly faster, you can shove a dumb amount of data through with the right architecture. And efficient training with data parallelism means that it's a worthwhile strategy.

p1esk · on May 8, 2019

I prefer actual benchmark results, which indicate TPU is similar in performance to v100.

Unless you have some data that proves otherwise?

sdenton4 · on May 8, 2019

This manifests in the training time + cost for imagenet in the stanford DAWN benchmark. The (partial) TPU pods win on training time, with a clear exponential speedup as the pod fraction increases. (How do you use more cores? Increase the batch size.)

https://dawn.cs.stanford.edu//2018/04/30/dawnbench-v1-result...

p1esk · on May 8, 2019

I’m looking at Imagenet training time for Resnet-50 [1] and it appears that 32 V100 chips are faster than half of a TPUv2 pod (64 chips). Are we looking at different benchmarks?

[1] https://dawn.cs.stanford.edu//benchmark/ImageNet/train.html

sdenton4 · on May 8, 2019

Note the submission dates; the faster times have an extra six months.

p1esk · on May 9, 2019

Here are more recent results, showing that V100 is competitive with TPUv3: https://www.google.com/amp/s/cloudblog.withgoogle.com/produc...

I’m having a hard time imagining a task where TPU would be 10x faster than V100.

64738 · on May 8, 2019

If you want to view their header image at larger size but right-clicking doesn't give you the option to "Open image in new tab", the direct link is below. Not a big deal, but it might save a few clicks for some:

https://storage.googleapis.com/gweb-cloudblog-publish/origin...

oooshha · on May 8, 2019

Is this pretty bad news for Nvidia?

p1esk · on May 8, 2019

Nvidia could have released a DL specific chip a long time ago, if they wanted to. I’m not sure why they haven’t (market not big enough?), but they probably will at some point.

sdenton4 · on May 8, 2019

They release a "data center" specific version of their gpu with slightly improved stats compared to the consumer models, and 10x the price... (And include a "no data centers" clause in the consumer model terms of use.)

jlebar · on May 8, 2019

(I work at Google on compilers for ml, including compilers for Nvidia gpus.)

Devices like the v100 and t4 are ml-specific chips. You can do graphics stuff on them, but that doesn't mean that Nvidia if leaving a ton of ml performance on the table by including that capability. Indeed there may be economies of scale for them in having fewer architectures to support.

They aren't dumb. :)

p1esk · on May 8, 2019

V100 has 640 tensor cores, and 5k general FP32/64 cores. Most of DL computation is done by tensor cores. Can you imagine how much faster it would get if they released a chip with say 10k tensor cores?

jlebar · on May 8, 2019

> Can you imagine how much faster it would get if they released a chip with say 10k tensor cores?

I can, actually. :) Adding 10k tensor cores to a GPU would not make it run much faster, and would be prohibitive in terms of die space. Moreover getting rid of the 5,000 FP32 cores would slow down DL workloads significantly.

The 640 Tensor Cores vs 5,000 F32 cores comparison is misleading, because they are not measuring the same thing.

An "FP32/64 core" corresponds to a functional unit on the GPU streaming multiprocessor (SM) which is capable of doing one scalar FP32 operation. One FLOP, or maybe two if you are doing an FMA. V100 has 5120 FP32 units and 2560 FP64 units.

In contrast, a "Tensor Core" corresponds to a functional unit on the SM which is capable of doing 64 FMAs per clock. That is, a Tensor Core does 64 times as much work as an FP32 core. Integrated circuits aren't magic, if you're doing 64 times as much work, you need more die space.

Moreover, there is nothing to say that nvidia isn't able to use some of the same circuits for both the fp32 and tensor core operations. If they are able to do this (I expect they are) then reducing one does not necessarily make space for the other.

Increasing the number of tensor core flops by a factor of 20 (500 -> 10,000) would not make the GPU 20x faster, probably not even 2x faster. This is because you will quickly run into GPU memory bandwidth limitations. This is not a simple problem to solve, nvidia GPUs are already pushing what is possible with HBM.

Lastly, although you're correct that, in terms of number of flops, most DL computation is done by tensor cores (if you've written your application in fp16), that doesn't mean we could get rid of the f32 compute units, or even that significantly reducing their number would have minimal effect on our models. Recall Amdahl's law. We usually think about it in terms of speedups, but it applies equally well in terms of slowdowns. If even 10% of our time is spent doing f32 compute, and we make it 10x slower...well, you can do the math. https://en.wikipedia.org/wiki/Amdahl%27s_law

Indeed, I was just looking at an fp16 tensor-core cudnn kernel yesterday, and even it did a significant amount of fp32 compute.

The implicit argument I read in parent post is that nvidia could build a significantly better DL chip "simply" by changing the quantities of different functional units on the GPU. This is predicated on nvidia being quite bad at their core competency of designing hardware, despite their being the market leader in DL hardware. It's kind of staggering to me how quickly nonexperts jump to this conclusion.

Here's a talk I gave at cppcon about much of this (note that it's pre Volta). https://www.youtube.com/watch?v=KHa-OSrZPGo&t=1s

p1esk · on May 8, 2019

Thank you for the detailed answer.

I think your main point is that memory bandwidth would prevent the performance speedup. Are V100s memory bound when executing F16 ops on tensor cores?

Second, do we really need dedicated FP32 cores for DL? Tensor cores accumulate in FP32 (is that what you meant when you said they did a significant amount of FP32 compute?), and recent papers indicate we’re moving towards 8 bit training [1]. Besides, do TPUs use dedicated FP32 hw?

Finally, if the memory bandwidth is indeed the bottleneck, perhaps all that die area from FP32 and especially FP64 cores could be used for massive amount of cache.

[1] https://arxiv.org/abs/1805.11046

jlebar · on May 9, 2019

V100s are often memory bound when using tensor cores, yes. But I guess my point is broader than that. There is a "right shape" for hardware that wants to excel at a particular workload, depending on the arithmetic intensity, degree of temporal locality, and so on. The point is that you usually can't just turn up one dimension to eleven, it's not usually that simple.

For example, massively increasing the GPU last level cache size would not have the effect of increasing memory bw much on most workloads, because cache only helps when you have temporal locality and gpus like to stream through many GB of data.

This is covered in Hennessy and Patterson if you're curious to learn more. I also talk about it some in the video I linked above.

(Also I doubt that getting rid of f64 support would be a significant die size win. I notice that v100 has, in their marketing speak, twice the fp32 cores as fp64 cores. What do you think are the chances that Nvidia decided a priori this is the optimal ratio? What if instead they are sharing resources between these functional units, at a ratio of two to one?)

To the question of, do you really need fp32 cores, I am not aware of any "widely deployed" GPU model today that does not do significant fp32 work. Perhaps there is research which suggests this isn't necessary! But that is a different thing than we were talking about here, that Nvidia could somehow make a much better chip for the things people are doing today.

I don't want to speak to the question of whether TPUs have f32 hardware, because I'm afraid of saying something that might not be public. But I think the answer to your question can easily be found by some searching and is probably even in the public docs.

sytelus · on May 8, 2019

I’d really like to know this as well. NVidia has became very AI centric company but this has been a huge blind spot. GPUs with power and chip real estate wasted on rendering hardware are unnecessary and relics of legacy for deep learning. Why they haven’t yet designed CUDA++ only chip yet?

dragandj · on May 8, 2019

Probably because most of their customers don’t work with Google-like problems, and will not buy chips that are 10x faster on paper but 10 times slower on the problems they DO have, at 10x the price...

lamchob · on May 8, 2019

No, because many companies run their own clusters with no access to TPU hardware, or are not keen on shipping all their models + data over to google.

harigov · on May 8, 2019

Can anyone comment on the reliability/availability of these pods?

reilly3000 · on May 8, 2019

They have been at this for multiple years internally; I have to imagine they are battle tested.

m0zg · on May 8, 2019

Unfortunately for Google, NVIDIA's offerings are very strong, and TPUs are a pain the rear to use and require TensorFlow, which in itself is a pain to use, making it doubly painful, to the extent that using their offering requires a significant degree of desperation or not knowing any better.

nl · on May 8, 2019

Well you'll be glad to know that PyTorch is available (in development form) on the TPU: https://github.com/pytorch/xla

m0zg · on May 8, 2019

If it's "available" but doesn't do anything useful, it's IMO not really "available". It's a good start though. I hope they don't throttle it in favor of TF because that'd only prolong TF's agony.

nl · on May 8, 2019

I think this is pretty unfair.

I know a bunch of PyTorch using people who have switched back to TF 2.0, and that is ignoring things like the Swift TF implementation.

The deployment story around TF is a lot better too.

akoumis · on May 8, 2019

Have you ever used Keras?

m0zg · on May 8, 2019

Yes I used Keras before PyTorch came out. IMO, Google needs to start over rather than try extensive wart removal on something that consists almost entirely of warts. These days Keras has subpar multi-GPU support (two GPUs are often slower than one), less flexibility, and more un-debuggable magic than PyTorch. And it doesn't look like things will be improving much (if at all) with TF 2.0.

And the TPU is a story of its own when it comes to usability. Take a look at their docs. TPU actually has to run on a separate machine, which you need to provision manually using their `ctpu` tool, and training of a trivial network on CIFAR looks like this: https://github.com/tensorflow/tpu/blob/master/models/experim...

so_tired · on May 8, 2019

Err - i don't get it. What alternative r u suggesting for development which is aimed at eventual production?

I am a complete noob, so i just did Keras (to TF) and GPUs.

If i switch to TPU - its still the same code.

When i get the budget to cluster, i think I am better off with one of the Deepmind libs which TF based.

No?

m0zg · on May 8, 2019

If you're just using off-the-GitHub models and pressing them into service, sure, Keras is adequate. If your work actually requires research and experimentation (as most practical problems on which actual cash money can be made do) you'll be much better served with PyTorch. The main cost in all this is not actually implementing or training your final model. That can be done in any framework. The cost is finding a model that works really well for a particular task. You will get there a lot sooner and with a lot fewer headaches if you use PyTorch for this more "researchy" part.

As to deployment, once you figure out what works, there's also libtorch, or if that doesn't work for you for whatever reason, implementing a model you know working hyperparameters for can be done in a day or two on whatever framework works for your backend, including TF.