> This is a rather strong claim not supported by real-world data on how Go’s GC ...

neonsunset · 2024-08-26T17:06:38 1724691998

binary-trees benchmark is using Go 1.23.0 which is the latest version. Go submissions do not do anything differently compared to other GC-based languages for the most part. The fastest Java submission has interestingly much lower execution time than other Java submissions, and I plan to look into it, but that would be pretty much the only caveat. I think it is relevant given the topic of the discussion is GC implementations and their characteristics.

I have constructed the following micro-benchmark to replicate your data: https://gist.github.com/neon-sunset/c6c35230e75c89a8f6592cac...

It tests

- Time to allocate 1M 32B/64B/128B arrays/slices and pass each to an empty outlined method in a loop (to ensure heap allocation)

- Same as above but 1M allocations are split between NumCPU Goroutines/Tasks.

The environment is macOS 15, M1 Pro (8C), Go 1.23.0 and .NET 8.

The Go results are 16.2ms, 18.2ms and 104.7ms for single-threaded execution and 11.1ms, 22.6ms and 48.3ms for multi-threaded.

The C# results are 4.7ms, 5.9ms and 8.2ms for single-threaded and 1.8ms, 2.6ms and 3.8ms for multi-threaded.

When selecting Workstation/Desktop GC instead of Server (Server is default for networked container workloads via asp.net core), the numbers don't exhibit multi-threaded scaling and are 4.7ms, 5.3ms and 7ms for single-threaded and 5.9ms, 9.4ms and 16.8ms for multi-threaded. This GC mode is used by default for console and light GUI application templates. Complex applications override it to Server GC. It is likely that it will be eventually deprecated because its target scenarios are now better addressed by the new mode of Server GC called DATAS, which is enabled by default in the upcoming .NET 9.

If you're interested, feel free to replicate the data on your hardware, add more languages, etc. My assumption that OpenJDK will perform even better since Java code is more allocation-heavy.

Regarding highly-loaded networked services with predictable or hand-tuned data lifetimes, in C# you are expected to completely bypass the allocations with the use of built-in ArrayPool<T>, stack-allocated buffers or NativeMemory.Alloc/Free which directly call into malloc/free. This is made possible by wrapping such buffers into Span<T> or Memory<T>. Things like Kestrel (web server) or Garnet (Redis server) do just that. I know Go has Arena API but am not sure regarding its current status. Are there newer/better alternatives?

lossolo · 2024-08-26T20:35:01 1724704501

> binary-trees benchmark is using Go 1.23.0 which is the latest version. Go submissions do not do anything differently compared to other GC-based languages for the most part. The fastest Java submission has interestingly much lower execution time than other Java submissions, and I plan to look into it, but that would be pretty much the only caveat. I think it is relevant given the topic of the discussion is GC implementations and their characteristics.

I think what you are still missing is the context that I've added in my last two replies. This is not the use case that I'm using Go for. My micro benchmark was only to refute your claim that it's hard for Go to do tens of millions of allocations per second in a multithreaded environment.

We are not CPU bound, we are IO bound, and we need low latency (in p95 and p99 too), low memory usage, and good throughput for our network services. We get all of that from Go. Would we get more throughput from Java or .NET Core? Probably, but what would we need to trade off? Memory usage or latency? Or both? Go is optimized for latency (not throughput) and always was. That works for us.

> Regarding highly-loaded networked services with predictable or hand-tuned data lifetimes, in C# you are expected to completely bypass the allocations with the use of built-in ArrayPool<T>, stack-allocated buffers or NativeMemory.Alloc/Free which directly call into malloc/free. This is made possible by wrapping such buffers into Span<T> or Memory<T>. Things like Kestrel (web server) or Garnet (Redis server) do just that.

The last time I used Java was probably 10 years ago. I was interested in C# when they introduced .NET Core with the possibility to compile it to a single binary and support for Linux. I actually got excited about Span<T> when they announced it. I'm not sure what the current status of Kestrel is, but back then (a few years ago) its latency, memory usage, and throughput were not so good. Garnet looks very good on paper. I added it to my bookmarks when they announced it. For now, we are happy with Redis, but maybe we'll consider Garnet one day.

I just googled some more recent real-life workload benchmarks and I see Go still does a lot better there than C#. It's probably not a super scientific benchmark. And benchmarks are just benchmarks especially micro benchmarks:

https://www.wwt.com/blog/performance-benchmarking-bun-vs-c-v...

Scroll down to http-server:

https://programming-language-benchmarks.vercel.app/go-vs-csh...

etc.

> I know Go has Arena API but am not sure regarding its current status.

I wouldn't use it in production unless you know how it's implemented and you are careful. The current implementation probably will not get into the language as it's on hold.

EDIT: It seems my love story with .NET Core, which began in 2016, didn't work out back then after all! https://news.ycombinator.com/item?id=12358370

neonsunset · 2024-08-26T23:17:00 1724714220

> https://www.wwt.com/blog/performance-benchmarking-bun-vs-c-v...

This uses .NET 7 which will soon go out of support as it is STS even. The "web server" template it uses (it's not really a template, it's a hand-rolled impl. that diverges from an outdated template) is much older still. It does not seem to be actively maintained. When choosing between the benchmarks game on salsa debian server and the other one, I chose the one which makes sure to use the latest versions and has the benchmarks to stress their respective languages to a greater degree (like binary-tree depths of 18 instead of 21 which substantially lengthens the execution time). The "first" benchmarksgame is already influenced by application startup latency and time-to-steady-state performance, but the other seems to put much more emphasis on it.

> https://www.wwt.com/blog/performance-benchmarking-bun-vs-c-v...

This doesn't seem to load outside US, huh.

Anyway, all these are micro-benchmarks, it's just interacting with the author of the first benchmarksgame left me with a good impression of focus on fairness and level of detail you can draw conclusions how the basic aspects of a particular langue perform from.

> EDIT: It seems my love story with .NET Core, which began in 2016, didn't work out back then after all!

Oh. Yeah, this makes a lot of sense. 2016 with its .NET Core 1 and up to Core 2.1 this was truly the roughest patch for the ecosystem.

Everything was extremely new, breaking changes were abound and a lot of code was just not ready to the new cross-platform conditions it was finding itself in, let alone containerized deployments. It was not for the faint of heart. Not to mention the sheer culture shock for the ecosystem that almost entirely comprised of libraries and developers that only ever knew "the old" .NET Framework (I'd like to mention Mono efforts and contributions, but as sad as it is they had relatively minor impact on the larger picture).

It was not until .NET Core 3.1 when it all became truly viable for companies looking to modernize their infrastructure (i.e. move to linux containers in k8s). Many did just that. And it was not until .NET 6 when it all became unconditionally optimal choice for companies begrudgingly getting dragged into an upgrade instead. By this point, the jump from legacy framework codebase to .NET 6 was akin to jumping languages or switching from debug to release builds, the performance and cloud bill impact was this big.

So when I make a case for it, I mean today, where the tooling and ecosystem had 8 years to mature and seen 8 years of platform improvements.

> We are not CPU bound, we are IO bound, and we need low latency (in p95 and p99 too), low memory usage, and good throughput for our network services. We get all of that from Go. Would we get more throughput from Java or .NET Core? Probably, but what would we need to trade off? Memory usage or latency? Or both? Go is optimized for latency (not throughput) and always was. That works for us.

Much like above, this depends on "when" or, well, "which version". .NET Core 3.1? Not likely. .NET 8 (esp. with DATAS)? Oh yeah, you're in for a good time.

Once you are in the realm of doing raw networking i.e. interacting with Socket, SslStream and similar APIs, you are much less beholden to the performance properties of frameworks and vendor SDKs (in many ways limited by lack of care, forced compatibility with .NET Framework which does not have new APIs or just historical choices). E.g. Socket itself acts like either a lightweight wrapper on top of send and recv or as an asynchronous runtime on top of epoll/kqueue much like what Go does for its own goroutine integration with it.

Ironically, these kinds of greenfield projects that involve a degree of systems programming is an area where .NET has become the strongest offering among all GC-based languages, for reasons I repeat here often: strong compiler, zero-cost FFI, SIMD API, very tunable GC, the entire C with structs and pointers (but not macro and computed goto, sorry), monomorhpized struct generics for zero-cost abstractions and more.

You can write a completely custom networking stack with io-uring in pure C# and integrate it with threadpool and tasks, provide custom task-like primitives or use existing valuetask mechanism for state machine box pooling etc (upcoming async2 project will further cut down on overhead). Something similar has already been done too: https://github.com/QRWells/LibUringSharp Go will offer you great performance of its channel primitive out of box. C# will give you all the tools to build one instead without having to change runtime internals. The sort of things you may want to do when writing a new server or a database.

> I actually got excited about Span<T> when they announced it.

This is an area where I consider the choice Go made with slices and using them prominently (despite their footguns) a strictly better solution. Almost every API works with spans and memory now, and they can wrap arbitrary sources, but codebases would still be plastered with string reallocations despite the analyzer raising a suggestion "you don't need a new string, slice it as span when passing to a Parse method instead". Takes time to change habits.

Responding to your sibling comment I can simply offer trying out `dotnet new console --aot`. Maybe add `ServerGarbageCollection` property to project manifest if you are going to put a lot of load on it. You will be surprised by binary size too :)

lossolo · 2024-08-27T03:02:32 1724727752

> it's just interacting with the author of the first benchmarksgame left me with a good impression of focus on fairness and level of detail you can draw conclusions how the basic aspects of a particular langue perform from

IMO, using SIMD intrinsics in the benchmarks game is a little bit of cheating, but that's just my opinion (probably because Go doesn't have any haha, but shhhh :))

> Once you are in the realm of doing raw networking i.e. interacting with Socket, SslStream and similar APIs, you are much less beholden to the performance properties of frameworks and vendor SDKs (in many ways limited by lack of care, forced compatibility with .NET Framework which does not have new APIs or just historical choices). E.g. Socket itself acts like either a lightweight wrapper on top of send and recv or as an asynchronous runtime on top of epoll/kqueue much like what Go does for its own goroutine integration with it.

Sounds good. We are doing a lot of different things, but they're almost all network-based and I/O bound. We have some services running on raw sockets with our custom binary protocol. We also use standard HTTP, WebSockets, gRPC, QUIC, etc., in other services.

> zero-cost FFI, SIMD API

When I saw SIMD intrinsics in .NET, I was like, 'I WANT THIS in Go!' It's such a pain to work with SIMD in Go; it's just easier to write it in C and use FFI. But then you have to have a workload that's big enough to justify the FFI overhead. This is actually less of a problem after optimizations in recent years than it was before, but it's still not free in Go.

> completely custom networking stack with io-uring in pure C# and integrate it with threadpool and tasks, provide custom task-like primitives or use existing valuetask mechanism for state machine box pooling

Reading this is like being a kid and looking at sweets through a closed glass door! :) When I first heard about introducing io_uring into the kernel, I was like: 'Userspace ring buffers without syscalls? Count me in!' I was hoping they would implement that in the Go runtime, but it's been in limbo for years with no progress on this issue. All current implementations without the runtime support have their own quirks and inefficiencies.

> This is an area where I consider the choice Go made with slices and using them prominently (despite their footguns) a strictly better solution.

Agreed.

> Responding to your sibling comment I can simply offer trying out `dotnet new console --aot`. Maybe add `ServerGarbageCollection` property to project manifest if you are going to put a lot of load on it. You will be surprised by binary size too :)

Thanks! I only wish we could trade things for time in real life like we do in virtual space :) there are so many things to try out.

I will also reply to your other comment here, to keep things in one place.

> I think to get the closest representation of the direction the GC is moving towards you could try `dotnet new webapiaot` template with .NET 9 preview SDK: https://dotnet.microsoft.com/en-us/download/dotnet/9.0

I wasn't aware that AOT had gotten so good in .NET. If I recall correctly, a few years ago it was still recommended to use JIT for performance reasons.

Thanks for all the suggestions. I'll definitely try it out, though I'm not sure if it'll be in my work environment. When I find some time, I'll tinker with it in some personal projects. As to why it probably won't be likely in our work: on one hand, the kid and tinkerer in me that likes to try out new things says "let's go!" On the other hand, the pragmatic side responsible for business costs and the whole tech stack holds me back. Knowing how much expertise, knowledge about edge cases, language internals, and experience we have - but also how deep we are already into the Go ecosystem and that in most cases it has worked for us - another GC'd language doesn't look as compelling. This pragmatic side tells me to behave. We have another service coming soon, one for which I'm not specifically looking at Go because of the requirements (mixed workload of 10+ GB heaps with deep hierarchies and a lot of small reads/writes on fast NVMes, low p99 network AND disk latency). I was thinking about jumping straight into Rust and using https://github.com/bytedance/monoio as it neatly fits our use case. I've done some things in Rust before, but not a lot. Long compile times I can handle, but I fear refactorings as I don't like when a language gets in my way. But we'll see - who knows!

igouy · 2024-09-01T21:12:28 1725225148

> IMO, using SIMD intrinsics in the benchmarks game is a little bit of cheating

otoh yes, otoh seems like that is viable for some people and of interest to others.

Perhaps a smarter, more knowledgeable, curator would have been able to draw the line and excluded those techniques; instead-of leaving it up to the reader to decide what comparisons interested them.