If you fit a gc on top of rust, the result is going to be less efficient than wh...

Aurornis · 2024-08-24T21:04:05 1724533445

This is a garbage collector written in Rust, not "on top of" Rust.

This isn't equivalent to adding garbage collection to the entire language. It's a garbage collected pointer type that can be employed for specific use cases.

The article and the repo explain why they developed it: Implementing VMs for garbage collected languages in Rust.

amelius · 2024-08-24T21:39:20 1724535560

> It's a garbage collected pointer type that can be employed for specific use cases.

You mean use cases where performance does not matter?

vlovich123 · 2024-08-24T23:56:18 1724543778

Out of curiosity, what language do you think the Java, JavaScript, and Python VMs and garbage collectors are written in? If you can understand why the VM is typically written in a systems programming language that doesn't itself have a VM or garbage collector, then you can start to think about why this is useful regardless of whether performance matters or not (& Java and C# are generally considered fairly high performance languages and VM implementations with efficient garbage collectors - the downsides may not matter to your problem domain).

neonsunset · 2024-08-25T01:00:56 1724547656

To expand on GPs point, I believe it implies that implementing a GC type for the Rust itself within its constraints (and even LLVM is not perfect, if we skip to LLVM-IR) is bound to be worse than in a language with bespoke precise+tracing+moving garbage collector which always requires deep compiler integration for "VM" to have exact information of where gcrefs are located at every safepoint (including registers!), be able to collect objects as soon as they are no longer referenced and not when they go out of scope later, determine whether write (or, worse, read) barriers are required or can be omitted and have the ability to suspend the execution to update the object references upon moving them to a different generation/heap/etc.

All GC implementations in Rust that I've seen so far relied on much more heavy handed techniques like having GC<T> to be a double indirection, pushing references to threadlocal queue, have GC pointers to be fat to pass around metadata inline, etc. They have been closer to modified RC with corresponding cost.

adrian17 · 2024-08-25T11:04:52 1724583892

> All GC implementations in Rust that I've seen so far relied on much more heavy handed techniques like having GC<T> to be a double indirection, pushing references to threadlocal queue, have GC pointers to be fat to pass around metadata inline, etc. They have been closer to modified RC with corresponding cost.

For reference, the gc-arena crate discussed in the blog post has no double indirection and no fat pointers (except for DSTs). Passing and reading the references is free, while assigning references to GC objects requires a write barrier, like in C#. The library is single-threaded, so there's no thread local state (and no global state).

But you're right that since the library is not _that_ invasive or integrated with the language runtime / allocators, you don't get things like cheap allocations, barrier omitting or generations. And most notably, without stack scanning you can't do collection while in a GC-aware scope - in particular, you can't automatically run GC if you run out of memory during an allocation. Piccolo (Lua VM) solves this by being stackless and repeatedly jumping out of the GC scope, while Ruffle (Flash/AS2/AS3 VM) bites the bullet and only runs collection between frames and hopes that it'll never hit OOM within a single frame.

binary132 · 2024-08-25T11:03:32 1724583812

I read it as something like “GC plus safety checks is costlier than just GC”

twiss · 2024-08-24T22:34:15 1724538855

Do you think Lua (for example, or any other GC'd language) has valid use cases? If so, it needs an implementation. This blog post shows (part of) one way to do that.

remexre · 2024-08-24T22:43:32 1724539412

Use-cases like "implementing a garbage-collected language."

amelius · 2024-08-25T03:57:57 1724558277

High performance garbage collectors are typically written in assembly because of the intricacies of the cache hierarchy.

tinco · 2024-08-25T09:26:51 1724578011

I spent some time porting the Go garbage collector to Rust, it was most definitely not written in assembly, and was (at the time) known as one of the most high performance garbage collectors for its particular use case.

And even if there was some assembly at some deep level I hadn't got to yet, you can easily embed assembly in Rust so it wouldn't be an issue.

Also, what sort of problem would there be with cache hierarchies that assembly would be a good solution for? Do you mean just guaranteeing the collection loop runs in L1 cache?

tsimionescu · 2024-08-25T12:14:07 1724588047

To be fair, the Go garbage collector is the most primitive GC used in any mainstream language, and lacks most of the advanced optimizations you'd expect from a modern GC. It can't move objects around, it doesn't have generations, it doesn't have good diagnostics (it can't even show you what are the GC roots for its hierarchy in a crash dump), and so on.

On the other hand, I'm not sure I've seen ASM used in many GCs, they're not that low level in my experience.

lossolo · 2024-08-25T15:37:51 1724600271

> And lacks most of the advanced optimizations you'd expect from a modern GC. It can't move objects around, it doesn't have generations, it doesn't have good diagnostics (it can't even show you what are the GC roots for its hierarchy in a crash dump), and so on.

Due to the choices and tradeoffs they made, Go doesn't require most of these 'advanced optimizations.' The way allocation works in Go, along with its use of value types etc, eliminates the need to move objects around or use generations. You can read more about this here [2][3].

> it doesn't have good diagnostics

I'm not sure that's true for me personally; I find that it has good enough diagnostics to quickly locate all GC problems using the profiler and GC tracer. You can read more about this here [1].

1. https://tip.golang.org/doc/gc-guide#Identiying_costs

2. https://go.dev/blog/ismmkeynote

3. https://itnext.io/go-does-not-need-a-java-style-gc-ac99b8d26...

tsimionescu · 2024-08-25T21:29:27 1724621367

Those articles claim that these problems don't exist in Go, but they don't really explain why, for many of them.

The one part that is clear indeed is that the pervasive use of value types in Go reduces the amount of garbage that gets generated compared to Java or C#, and, of course, less garbage means less time spent on GC.

The claims about memory fragmentation are less clear though: the first article just says that it's not a problem; and the second one does have a segment dedicated to it, but that segment gets fragmentation completely mixed up with memory locality, an entirely unrelated concept. It later claims that certain allocators are known to not suffer from fragmentation issues, but given the previous confusion, I'm not sure how seriously to take this. It also doesn't say if Go actually uses those allocators or not.

As for the diagnostics, I gave a specific example of a very commonly needed diagnostic, identifying GC roots to understand why a memory leak is occurring, that Go simply doesn't provide (someone else suggested a third party package that might help). I am well aware of the basics provided in the article you linked, and they don't even discuss this. For whatever bizarre reason, the Go memory tools don't provide this info (that the GC obviously needs to determine in its operation) - probably another victim in their quest of making it easier to implement the runtime.

lossolo · 2024-08-25T22:55:10 1724626510

> The claims about memory fragmentation are less clear though

As to memory fragmentation, memory allocations are grouped by size, so when objects of the same size are allocated and freed, the memory can be reused efficiently without causing fragmentation. So Go's memory allocator organizes memory into size classes, which are predefined blocks of different sizes. When an object is allocated, it fits into the smallest available size class that can accommodate it. This reduces internal fragmentation (unused space within allocated blocks).

> As for the diagnostics, I gave a specific example of a very commonly needed diagnostic, identifying GC roots to understand why a memory leak is occurring, that Go simply doesn't provide (someone else suggested a third party package that might help). I am well aware of the basics provided in the article you linked, and they don't even discuss this. For whatever bizarre reason, the Go memory tools don't provide this info (that the GC obviously needs to determine in its operation) - probably another victim in their quest of making it easier to implement the runtime.

Can't you just use pprof to find memory leaks?[1] You can even get a diagram showing where allocations occur and see which lines of code allocate the most memory. In all the cases where I was investigating memory leaks, pprof was sufficient. Do you have any common examples where this wouldn't work?

1. https://www.codereliant.io/memory-leaks-with-pprof/

tsimionescu · 2024-08-26T01:29:05 1724635745

Pprof shows where allocations happen, which doesn't tell you why an object is still present in memory, who is holding onto it. I've had to trawl through lots of third-party code to figure out who was holding onto some strings, for example. "Who originally allocated X" and "who is holding onto the last reference to X" are not directly linked concepts. Especially in a language that is happy to leak an entire array if a slice is holding onto three elements of it.

lossolo · 2024-08-26T02:21:59 1724638919

I see, thanks. Well, I've had similar problems, though not many. Maybe we just design things more carefully, but whenever we did encounter issues, knowing where the allocation happened was 90% of the investigation. The rest was just using the IDE features to see all possible references to that variable and go through the graph. At least that's my experience.

neonsunset · 2024-08-25T22:59:40 1724626780

A lot of structs or, say, slices in Go escape just as much. Or the interface implementations which get boxed. You have to pick any workload that is even remotely allocation/collection throughput sensitive and limitations of Go's GC become immediately apparent. What is not immediately apparent is that Go offers limited avenues to address this.

It's a good design for its purpose especially given that it is much simpler and self-hosted when compared to Java's and .NET's GC impls, but assigning the attributes to it that never were there nor were pursued by its authors is just bad, but sadly common with some here, look.

lossolo · 2024-08-26T00:38:34 1724632714

Is your opinion grounded in theory or practice? As a practitioner running Go in a high-volume, low-latency environment with tens of millions of allocations per second, we haven't found GC to be a problem. The median pause time is actually around 130 microseconds per collection. We're using Go in network services. Maybe your use case differs from ours, but at the end of the day, you pick the right tool for the job, and programming languages are just tools. We believe we picked the right one.

neonsunset · 2024-08-26T06:21:04 1724653264

This is a rather strong claim not supported by real-world data on how Go’s GC behaves under load. Which, in fact, isn’t the first time for your account as search across the discussion history indicates so you do you.

Sustaining very small allocations with median period of say 66ns (and I’m going to be charitable here) is not that difficult with single-threaded workload. Things get interesting as it becomes multithreaded and object graph becomes more complex.

As I said before, on any microbenchmark that is meant to evaluate allocation+collection throughput Go performs worse than the competition and there is no workaround to this because that’s just how its GC is designed: https://benchmarksgame-team.pages.debian.net/benchmarksgame/... And this is without even involving any sort of cross-thread data sharing between the allocations.

It’s okay to praise things you like and criticize those you don’t, but it can be kept within the realm of actual pros and cons of a particular technology.

lossolo · 2024-08-26T13:56:50 1724680610

> This is a rather strong claim not supported by real-world data on how Go’s GC behaves under load

The only thing you linked is an old binary trees benchmark from the benchmarks game, which is not equivalent to the workload in a typical networked service. For that kind of workload, I would not pick Go.

> It’s okay to praise things you like and criticize those you don’t, but it can be kept within the realm of actual pros and cons of a particular technology.

Result of basic allocation micro-benchmark for 64 byte slices, one million per op, 4 cores:

BenchmarkAllocation/procs-4-10 115 10469109 ns/op 64000517 B/op 1000007 allocs/op

One operation makes 1000007 allocations.

Each operation takes 10469109 nanoseconds (0.010469109 seconds).

Per second 1 / 0.010469109 ≈ 95.52 operations.

So, allocations per second = 95.52 * 1000007 ≈ 95.5 million

You can keep your ad-hominem comments to yourself.

neonsunset · 2024-08-26T17:06:38 1724691998

binary-trees benchmark is using Go 1.23.0 which is the latest version. Go submissions do not do anything differently compared to other GC-based languages for the most part. The fastest Java submission has interestingly much lower execution time than other Java submissions, and I plan to look into it, but that would be pretty much the only caveat. I think it is relevant given the topic of the discussion is GC implementations and their characteristics.

I have constructed the following micro-benchmark to replicate your data: https://gist.github.com/neon-sunset/c6c35230e75c89a8f6592cac...

It tests

- Time to allocate 1M 32B/64B/128B arrays/slices and pass each to an empty outlined method in a loop (to ensure heap allocation)

- Same as above but 1M allocations are split between NumCPU Goroutines/Tasks.

The environment is macOS 15, M1 Pro (8C), Go 1.23.0 and .NET 8.

The Go results are 16.2ms, 18.2ms and 104.7ms for single-threaded execution and 11.1ms, 22.6ms and 48.3ms for multi-threaded.

The C# results are 4.7ms, 5.9ms and 8.2ms for single-threaded and 1.8ms, 2.6ms and 3.8ms for multi-threaded.

When selecting Workstation/Desktop GC instead of Server (Server is default for networked container workloads via asp.net core), the numbers don't exhibit multi-threaded scaling and are 4.7ms, 5.3ms and 7ms for single-threaded and 5.9ms, 9.4ms and 16.8ms for multi-threaded. This GC mode is used by default for console and light GUI application templates. Complex applications override it to Server GC. It is likely that it will be eventually deprecated because its target scenarios are now better addressed by the new mode of Server GC called DATAS, which is enabled by default in the upcoming .NET 9.

If you're interested, feel free to replicate the data on your hardware, add more languages, etc. My assumption that OpenJDK will perform even better since Java code is more allocation-heavy.

Regarding highly-loaded networked services with predictable or hand-tuned data lifetimes, in C# you are expected to completely bypass the allocations with the use of built-in ArrayPool<T>, stack-allocated buffers or NativeMemory.Alloc/Free which directly call into malloc/free. This is made possible by wrapping such buffers into Span<T> or Memory<T>. Things like Kestrel (web server) or Garnet (Redis server) do just that. I know Go has Arena API but am not sure regarding its current status. Are there newer/better alternatives?

lossolo · 2024-08-26T20:35:01 1724704501

> binary-trees benchmark is using Go 1.23.0 which is the latest version. Go submissions do not do anything differently compared to other GC-based languages for the most part. The fastest Java submission has interestingly much lower execution time than other Java submissions, and I plan to look into it, but that would be pretty much the only caveat. I think it is relevant given the topic of the discussion is GC implementations and their characteristics.

I think what you are still missing is the context that I've added in my last two replies. This is not the use case that I'm using Go for. My micro benchmark was only to refute your claim that it's hard for Go to do tens of millions of allocations per second in a multithreaded environment.

We are not CPU bound, we are IO bound, and we need low latency (in p95 and p99 too), low memory usage, and good throughput for our network services. We get all of that from Go. Would we get more throughput from Java or .NET Core? Probably, but what would we need to trade off? Memory usage or latency? Or both? Go is optimized for latency (not throughput) and always was. That works for us.

> Regarding highly-loaded networked services with predictable or hand-tuned data lifetimes, in C# you are expected to completely bypass the allocations with the use of built-in ArrayPool<T>, stack-allocated buffers or NativeMemory.Alloc/Free which directly call into malloc/free. This is made possible by wrapping such buffers into Span<T> or Memory<T>. Things like Kestrel (web server) or Garnet (Redis server) do just that.

The last time I used Java was probably 10 years ago. I was interested in C# when they introduced .NET Core with the possibility to compile it to a single binary and support for Linux. I actually got excited about Span<T> when they announced it. I'm not sure what the current status of Kestrel is, but back then (a few years ago) its latency, memory usage, and throughput were not so good. Garnet looks very good on paper. I added it to my bookmarks when they announced it. For now, we are happy with Redis, but maybe we'll consider Garnet one day.

I just googled some more recent real-life workload benchmarks and I see Go still does a lot better there than C#. It's probably not a super scientific benchmark. And benchmarks are just benchmarks especially micro benchmarks:

https://www.wwt.com/blog/performance-benchmarking-bun-vs-c-v...

Scroll down to http-server:

https://programming-language-benchmarks.vercel.app/go-vs-csh...

etc.

> I know Go has Arena API but am not sure regarding its current status.

I wouldn't use it in production unless you know how it's implemented and you are careful. The current implementation probably will not get into the language as it's on hold.

EDIT: It seems my love story with .NET Core, which began in 2016, didn't work out back then after all! https://news.ycombinator.com/item?id=12358370

neonsunset · 2024-08-26T23:17:00 1724714220

> https://www.wwt.com/blog/performance-benchmarking-bun-vs-c-v...

This uses .NET 7 which will soon go out of support as it is STS even. The "web server" template it uses (it's not really a template, it's a hand-rolled impl. that diverges from an outdated template) is much older still. It does not seem to be actively maintained. When choosing between the benchmarks game on salsa debian server and the other one, I chose the one which makes sure to use the latest versions and has the benchmarks to stress their respective languages to a greater degree (like binary-tree depths of 18 instead of 21 which substantially lengthens the execution time). The "first" benchmarksgame is already influenced by application startup latency and time-to-steady-state performance, but the other seems to put much more emphasis on it.

> https://www.wwt.com/blog/performance-benchmarking-bun-vs-c-v...

This doesn't seem to load outside US, huh.

Anyway, all these are micro-benchmarks, it's just interacting with the author of the first benchmarksgame left me with a good impression of focus on fairness and level of detail you can draw conclusions how the basic aspects of a particular langue perform from.

> EDIT: It seems my love story with .NET Core, which began in 2016, didn't work out back then after all!

Oh. Yeah, this makes a lot of sense. 2016 with its .NET Core 1 and up to Core 2.1 this was truly the roughest patch for the ecosystem.

Everything was extremely new, breaking changes were abound and a lot of code was just not ready to the new cross-platform conditions it was finding itself in, let alone containerized deployments. It was not for the faint of heart. Not to mention the sheer culture shock for the ecosystem that almost entirely comprised of libraries and developers that only ever knew "the old" .NET Framework (I'd like to mention Mono efforts and contributions, but as sad as it is they had relatively minor impact on the larger picture).

It was not until .NET Core 3.1 when it all became truly viable for companies looking to modernize their infrastructure (i.e. move to linux containers in k8s). Many did just that. And it was not until .NET 6 when it all became unconditionally optimal choice for companies begrudgingly getting dragged into an upgrade instead. By this point, the jump from legacy framework codebase to .NET 6 was akin to jumping languages or switching from debug to release builds, the performance and cloud bill impact was this big.

So when I make a case for it, I mean today, where the tooling and ecosystem had 8 years to mature and seen 8 years of platform improvements.

> We are not CPU bound, we are IO bound, and we need low latency (in p95 and p99 too), low memory usage, and good throughput for our network services. We get all of that from Go. Would we get more throughput from Java or .NET Core? Probably, but what would we need to trade off? Memory usage or latency? Or both? Go is optimized for latency (not throughput) and always was. That works for us.

Much like above, this depends on "when" or, well, "which version". .NET Core 3.1? Not likely. .NET 8 (esp. with DATAS)? Oh yeah, you're in for a good time.

Once you are in the realm of doing raw networking i.e. interacting with Socket, SslStream and similar APIs, you are much less beholden to the performance properties of frameworks and vendor SDKs (in many ways limited by lack of care, forced compatibility with .NET Framework which does not have new APIs or just historical choices). E.g. Socket itself acts like either a lightweight wrapper on top of send and recv or as an asynchronous runtime on top of epoll/kqueue much like what Go does for its own goroutine integration with it.

Ironically, these kinds of greenfield projects that involve a degree of systems programming is an area where .NET has become the strongest offering among all GC-based languages, for reasons I repeat here often: strong compiler, zero-cost FFI, SIMD API, very tunable GC, the entire C with structs and pointers (but not macro and computed goto, sorry), monomorhpized struct generics for zero-cost abstractions and more.

You can write a completely custom networking stack with io-uring in pure C# and integrate it with threadpool and tasks, provide custom task-like primitives or use existing valuetask mechanism for state machine box pooling etc (upcoming async2 project will further cut down on overhead). Something similar has already been done too: https://github.com/QRWells/LibUringSharp Go will offer you great performance of its channel primitive out of box. C# will give you all the tools to build one instead without having to change runtime internals. The sort of things you may want to do when writing a new server or a database.

> I actually got excited about Span<T> when they announced it.

This is an area where I consider the choice Go made with slices and using them prominently (despite their footguns) a strictly better solution. Almost every API works with spans and memory now, and they can wrap arbitrary sources, but codebases would still be plastered with string reallocations despite the analyzer raising a suggestion "you don't need a new string, slice it as span when passing to a Parse method instead". Takes time to change habits.

Responding to your sibling comment I can simply offer trying out `dotnet new console --aot`. Maybe add `ServerGarbageCollection` property to project manifest if you are going to put a lot of load on it. You will be surprised by binary size too :)

lossolo · 2024-08-27T03:02:32 1724727752

> it's just interacting with the author of the first benchmarksgame left me with a good impression of focus on fairness and level of detail you can draw conclusions how the basic aspects of a particular langue perform from

IMO, using SIMD intrinsics in the benchmarks game is a little bit of cheating, but that's just my opinion (probably because Go doesn't have any haha, but shhhh :))

> Once you are in the realm of doing raw networking i.e. interacting with Socket, SslStream and similar APIs, you are much less beholden to the performance properties of frameworks and vendor SDKs (in many ways limited by lack of care, forced compatibility with .NET Framework which does not have new APIs or just historical choices). E.g. Socket itself acts like either a lightweight wrapper on top of send and recv or as an asynchronous runtime on top of epoll/kqueue much like what Go does for its own goroutine integration with it.

Sounds good. We are doing a lot of different things, but they're almost all network-based and I/O bound. We have some services running on raw sockets with our custom binary protocol. We also use standard HTTP, WebSockets, gRPC, QUIC, etc., in other services.

> zero-cost FFI, SIMD API

When I saw SIMD intrinsics in .NET, I was like, 'I WANT THIS in Go!' It's such a pain to work with SIMD in Go; it's just easier to write it in C and use FFI. But then you have to have a workload that's big enough to justify the FFI overhead. This is actually less of a problem after optimizations in recent years than it was before, but it's still not free in Go.

> completely custom networking stack with io-uring in pure C# and integrate it with threadpool and tasks, provide custom task-like primitives or use existing valuetask mechanism for state machine box pooling

Reading this is like being a kid and looking at sweets through a closed glass door! :) When I first heard about introducing io_uring into the kernel, I was like: 'Userspace ring buffers without syscalls? Count me in!' I was hoping they would implement that in the Go runtime, but it's been in limbo for years with no progress on this issue. All current implementations without the runtime support have their own quirks and inefficiencies.

> This is an area where I consider the choice Go made with slices and using them prominently (despite their footguns) a strictly better solution.

Agreed.

> Responding to your sibling comment I can simply offer trying out `dotnet new console --aot`. Maybe add `ServerGarbageCollection` property to project manifest if you are going to put a lot of load on it. You will be surprised by binary size too :)

Thanks! I only wish we could trade things for time in real life like we do in virtual space :) there are so many things to try out.

I will also reply to your other comment here, to keep things in one place.

> I think to get the closest representation of the direction the GC is moving towards you could try `dotnet new webapiaot` template with .NET 9 preview SDK: https://dotnet.microsoft.com/en-us/download/dotnet/9.0

I wasn't aware that AOT had gotten so good in .NET. If I recall correctly, a few years ago it was still recommended to use JIT for performance reasons.

Thanks for all the suggestions. I'll definitely try it out, though I'm not sure if it'll be in my work environment. When I find some time, I'll tinker with it in some personal projects. As to why it probably won't be likely in our work: on one hand, the kid and tinkerer in me that likes to try out new things says "let's go!" On the other hand, the pragmatic side responsible for business costs and the whole tech stack holds me back. Knowing how much expertise, knowledge about edge cases, language internals, and experience we have - but also how deep we are already into the Go ecosystem and that in most cases it has worked for us - another GC'd language doesn't look as compelling. This pragmatic side tells me to behave. We have another service coming soon, one for which I'm not specifically looking at Go because of the requirements (mixed workload of 10+ GB heaps with deep hierarchies and a lot of small reads/writes on fast NVMes, low p99 network AND disk latency). I was thinking about jumping straight into Rust and using https://github.com/bytedance/monoio as it neatly fits our use case. I've done some things in Rust before, but not a lot. Long compile times I can handle, but I fear refactorings as I don't like when a language gets in my way. But we'll see - who knows!

igouy · 2024-09-01T21:12:28 1725225148

> IMO, using SIMD intrinsics in the benchmarks game is a little bit of cheating

otoh yes, otoh seems like that is viable for some people and of interest to others.

Perhaps a smarter, more knowledgeable, curator would have been able to draw the line and excluded those techniques; instead-of leaving it up to the reader to decide what comparisons interested them.

tinco · 2024-08-26T08:51:25 1724662285

I think you're missing the parents' point. I think you and the parent agree would that for these general purpose benchmarks a more advanced garbage collector would win out over Go's garbage collector. But the point that has to be made is that those simply don't apply to the sort of application Go is built for.

If your Go application is not a network service, and frankly, is over 10.000 lines of code, you're not using Go for its intended purpose. An application with a complex object graph seems exactly like the sort of application that shouldn't be written in Go.

As an aside, it's a bit weird that you'd react to their citing of real-world data with a dismissal and a sort of ad hominem, and then proceed to cite a microbenchmark of precisely the sort of thing people shouldn't be building in Go. Do you have a microbenchmark that disproves that the Go GC can get 130 microsecond collect times while under tens of millions allocations per second load?

neonsunset · 2024-08-26T18:34:30 1724697270

> If your Go application is not a network service, and frankly, is over 10.000 lines of code, you're not using Go for its intended purpose. An application with a complex object graph seems exactly like the sort of application that shouldn't be written in Go.

It seems the industry is moving to adopt Go for this exact use case, unfortunately.

Not sure if I read the parent's argument the same way however, as what I'm objecting to is the continuous assignment of capabilities to language implementations that they don't have.

On 130us - pause time is frequently quoted in Go discourse but it is somewhat misleading as this omits the "front-loaded" nature of its design. If you spend >5x CPU time executing GC code than more, err, pause-focused designs and have lower throughput, then those 130us no longer exist in a vacuum. It is more accurate to specify that Go favors the sweet spot of "smaller" amount of work performed by a single node and focuses on maintaining consistent latency for light to moderate allocation rates with low heap size. In a way, .NET's GC is evolving in a similar direction but attempting to maintain the throughput or ensure the loss is within single digit %.

lossolo · 2024-08-26T20:56:19 1724705779

> It is more accurate to specify that Go favors the sweet spot of "smaller" amount of work ... focuses on maintaining consistent latency ...

It seems like you wrote the same thing that I wrote, but you used different words for it. What you described is basically I/O-bound workloads with a particular emphasis on latency—in other words, the category of networked services, which is precisely what Go was created for and why we use it.

> In a way, .NET's GC is evolving in a similar direction but attempting to maintain the throughput or ensure the loss is within single digit %.

Let me know when it finally evolves! I would like to try it out for some of our smaller services then. When we started building our core services in Go many years ago, .NET Core wasn't as good in terms of network throughput, latency, and handling multiple I/O-bound tasks simultaneously. I was actually evaluating it back then too.

PS. From one of our production nodes from the last 24h:

GC Pause time:

Min: 85.1 µs

Max: 343 µs

Avg: 150 µs

https://imgur.com/0HjUyM7

neonsunset · 2024-08-27T00:08:24 1724717304

Mainly responded in a sibling comment.

I think to get the closest representation of the direction the GC is moving towards you could try `dotnet new webapiaot` template with .NET 9 preview SDK: https://dotnet.microsoft.com/en-us/download/dotnet/9.0

Can't say I'm a huge fan of it but it gets the job done. In general, ASP.NET Core and Kestrel are great, particularly when you look at just how much more CPU and RAM (and boilerplate) Spring and Node.js require to deliver the same amount of functionality (validation, rate limiting, caching, middlewares, etc.), but because the pipeline is quite a bit UTF-16 centric and feature-focused, it mostly manages to provide competitive performance through sheer performance of the foundation it is built on. .NET could use a custom implementation, much like what Garnet does under the hood, to fully take advantage of compiler and memory management improvements and reclaim top positions in techempower :)

You may notice that GC is more size-efficient on Linux and Windows than on macOS. There is an internal feature regarding its heap management (regions) that is not enabled for the latter due to some tooling integration issues. It will eventually get fixed but due to minor impact (people aren't using macs for hosting) of it, it hasn't been a priority.

Another project which I think is a good showcase even if unrelated to the domain you work with: https://github.com/codr7/sharpl - can be compiled with dotnet publish -o .

The sort of thing people often build in Go. Because it is completely custom it is really good at demonstrating just how much native AOT compilation tooling has improved and how low base application footprint gets wrt GC and runtime. Might take a bit to compile for the first time - it needs to pull ILC (IL AOT Compiler) from nuget.org. Not going to be as fast as Go build times (.NET 9 improves it) but you often don't need that - for quick iteration during dev there is dotnet watch for hot-reload instead.

_rlh · 2024-08-28T11:53:57 1724846037

Go's defrag techniques and why they work are discussed in the Hoard papers and have proven their value not only in Go but in most malloc implementations.

There is a relationship between cache locality, moving colocated objects to different cache lines to control fragmentation, value types, and interior pointers. Perhaps it is subtle but cache optimization is real important for performance and is not ignored by Go in the language spec or the runtime implementation.

tinco · 2024-08-25T14:27:31 1724596051

True, that's also why I picked that one. It's the only one I knew of that's both in a modern language, relatively small (just under 10k lines at that time) and that had serious production grade performance. It was a bit controversial because they apparently determined that adding the advanced optimizations of modern GC's would negatively impact GC times, so the GC was intentionally kept very simple. I don't know if they maintained that attitude since then.

bboreham · 2024-08-25T19:30:59 1724614259

See ‘viewcore’. (It may not work against the latest Go version, but in principle it shows roots, live objects, etc.)

https://pkg.go.dev/github.com/golang/debug/cmd/viewcore

remexre · 2024-08-25T04:36:24 1724560584

Which one(s) are you thinking of? The JVM's appear to be in C++, GHC's and SBCL's are in C, Go's is in Go, and I'm not familiar with other high-performance garbage-collected platforms.

e4m2 · 2024-08-25T11:56:50 1724587010

Just to add one more data point, the .NET GC is also written in C++.

mpawelski · 2024-08-25T20:12:14 1724616734

In C++ in one giant file: https://github.com/dotnet/runtime/blob/main/src/coreclr/gc/g...

hayley-patton · 2024-08-25T07:54:23 1724572463

Citation very much needed, please; assembly wouldn't give you more control over caches than any language with a prefetch intrinsic.

adgjlsfhk1 · 2024-08-25T14:46:47 1724597207

it also helps to have instructions that perform non-temporal reads and writes

the_mitsuhiko · 2024-08-24T20:58:36 1724533116

The answer to your question is in the linked post.