Speaking as a long-time watcher and enjoyer of his presentations: he has a long history of building products that are (often) superior to the mainstream alternatives, and of passionately explaining why they are superior. The alternatives still end up being more popular, and there is no denying that there's sometimes an air of "those fools with their eBPF, while we have dtrace/ZFS/zones/etc" that borders on arrogance.
I think it's pretty hard to have a career where you are constantly involved with software that is better technically, and have worse alternatives constantly "winning", and not be a little arrogant sometimes.
I think you're confusing arrogance with confidence. Directness and openness paired with underlying respect for others can reflect confidence, not arrogance.
Tackling difficult conversations or addressing challenges head on may be seen as "close to arrogance" but they are in fact a net positive for everyone.
He described his response to watching a video in which Gelsinger behaved as if his own work at Intel consisted of interesting successes while the company's failings were due to other people. What's, ahem, ironic about... okay, I'm just going to stop.
Definitely not a totally unfair criticism (though do keep in mind that I was not an executive at Sun, whereas Gelsinger was Intel's CTO), but I did try to be forgiving of that as confidence. It was his line about NVIDIA that I felt went (well) over the line. You certainly won't find me saying that "AWS would be a fraction of its size if Oracle hadn't cancelled Sun's cloud initiative" or whatever the equivalent might be.
What that code does is a per-byte-pair popcount, which is not what the POPCNT instruction does (it computes the popcount for the whole word).
On processors with BMI2 the whole algorithm reduces to a PDEP as mentioned in another comment, but if you don't have that, this is pretty much the best you can do (unless you use lookup tables, which have their own pros and cons).
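For concreteness, here's a sketch of the usual SWAR formulation of a per-byte-pair popcount (my own illustration of the general technique, not the exact code under discussion):

    #include <stdint.h>

    /* SWAR popcount stopped at the per-16-bit-lane ("byte pair") stage:
       each 16-bit lane of the result holds the popcount of the
       corresponding lane of x. A single POPCNT would instead collapse
       everything into one count for the whole 64-bit word. */
    static uint64_t popcount_per_pair(uint64_t x) {
        x = x - ((x >> 1) & 0x5555555555555555ull);               /* 2-bit sums */
        x = (x & 0x3333333333333333ull)
          + ((x >> 2) & 0x3333333333333333ull);                   /* 4-bit sums */
        x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0full;               /* per-byte sums */
        x = (x + (x >> 8)) & 0x00ff00ff00ff00ffull;               /* per-pair sums */
        return x;
    }

A whole-word popcount would keep folding from here (or just use one POPCNT), which is exactly the difference being pointed out.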
I broadly agree with the thesis of the post, which, if I understand correctly, is that frame pointers are a temporary compromise until the whole ecosystem gets its act together and manages to agree on some form of out-of-band tracking of frame pointers, and it seems that we'll eventually get there.
Some of the statements in the post seem odd to me though.
- 5% of system-wide cycles spent in function prologues/epilogues? That is wild, it can't be right.
- Is using the whole 8 bytes right for the estimate? Pushing the frame pointer is the first instruction in the prologue and it's literally 1 byte. The epilogue is symmetrical.
- Even if we're in the prologue we know that we're in a leaf call: we can still resolve the instruction pointer to the function, and we can read the return address to find the parent, so what information is lost?
When it comes to future alternatives, while frame pointers have their own problems, I think that there are still a few open questions:
- Shadow stacks are cool but aren't they limited to a fixed number of entries? What if you have a deeper stack?
- Is the memory overhead of lookup tables for very large programs acceptable?
> - Is using the whole 8 bytes right for the estimate? Pushing the frame pointer is the first instruction in the prologue and it's literally 1 byte. The epilogue is symmetrical.
I believe it's because of the landing pad for Control Flow Integrity which basically all functions now need. Grabbing main() from a random program on Fedora (which uses frame pointers):
0000000000007000 <main>:
7000: f3 0f 1e fa endbr64 ; landing pad
7004: 55 push %rbp ; set up frame pointer
7005: 48 89 e5 mov %rsp,%rbp
It's not much of an issue in practice as the stack trace will still be nearly correct, enough for you to identify the problematic area of the code.
> - Shadow stacks are cool but aren't they limited to a fixed number of entries? What if you have a deeper stack?
Yes, shadow stacks are limited to 32 entries on the most recent Intel CPUs (and as few as 4 entries on very old ones). However, they are basically cost-free, so that's a big advantage.
I think SFrame is a sensible middle ground here. It's saner than DWARF, and the same idea has a long history in the kernel in the form of the ORC unwinder, so we know the approach works.
> - 5% of system-wide cycles spent in function prologues/epilogues? That is wild, it can't be right.
TBH I wouldn't be surprised on x86. There are so many registers to be pushed and popped due to the ABI that every time I profile stuff I get depressed… AArch64 seems to be better; the prologues are generally shorter when I look at those. (There's probably a reason why Intel APX introduces push2/pop2 instructions.)
This sounds to me more like an inlining problem than an ABI problem. If the calls take as much time as the actual work, perhaps you just need a better language that doesn't arbitrarily prevent inlining at compilation boundaries (e.g. basically any modern language that isn't in the C/C++ family, before LTO).
I see this in LTO/PGO binaries as well. If a function is 20 instructions long, it's not like you can inline it uncritically, yet a five-cycle prologue and a five-cycle epilogue will hurt. (Also, recursive functions etc.)
> Shadow stacks are cool but aren't they limited to a fixed number of entries?
On currently available hardware, yes. But I think some of the future Intel stuff was going to allow for much larger depths.
> Is the memory overhead of lookup tables for very large programs acceptable?
I don't think SFrame is as "dense" a format as DWARF, so you trade a bit of memory for a much faster unwind experience. But you are definitely right that this adds memory pressure that could otherwise be avoided.
Especially if the anomalies are what they sound like, just account for them statistically. You get a PID for cost accounting in the perf_event frame anyway.
> temporary compromise until the whole ecosystem gets its act together and manages to agree on some form of out-of-band tracking of frame pointers,
Temporary solutions have a way of becoming permanent. I was against the recent frame pointer enablement on the grounds of moral hazard. I still think it would have been better to force the ecosystem to get its act together first.
Another factor nobody is talking about is JITed and interpreted languages. Whatever the long-term solution might be, it should enable stack traces that interleave accurate source-level frame information from native and managed code. The existing perf /tmp/perf-<pid>.map hack is inadequate in many ways, including security, performance, and compatibility with multiple language runtimes coexisting in a single process.
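For those unfamiliar: the hack has the runtime append one "start size symbol" line (addresses in hex) per JIT-compiled function to /tmp/perf-<pid>.map, which perf consults at symbolization time. A fabricated example (symbol names invented for illustration):

    3f7104a0 28 java.lang.String::hashCode
    3f710500 80 java.util.HashMap::get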
But, at least from the GNOME side of things, we've been complaining about it for roughly 15 years and kept getting push-back in the form of "we'll make something better".
Now that we have frame pointers enabled in Fedora, Ubuntu, Arch, etc., we're starting to see movement on realistic alternatives. So in many ways, I think the moral hazard was waiting until 2023 to enable them.
> Reason: locks that have the ability to put the thread to sleep on a queue must do compare-and-swap (or at least an atomic RMW) on `unlock`. But spinlocks can get away with just doing a store-release (or just a store with a compiler fence on X86) to `unlock`.
This is something I've been thinking about a lot over time: the CAS is only there to atomically determine, at unlock time, whether there are any sleeping waiters that need a futex_wake. I would really want some way to get away with non-fenced operations (at least on x86), but I don't know if it's just that nobody has figured out how, or whether there is a fundamental reason it's not possible.
You do need a fence in the unlock path though (at least a release fence).
I think the issue is that if you ask the CPU to just store something (like in a spin lock), whether or not there’s a fence, it’s an operation with limited data flow dependencies so it’s easy for the CPU to execute. Even the fence can be handled using wacky speculation tricks.
But if you want to do something like, “store this value but only if the old value satisfies some predicate”, then there’s a load and the whole thing is dependent on the load. So you’re asking the CPU to load, then run a predicate, then store, and for that to be fenced, and atomic.
Strictly more work. I don’t think there’s any trick to make it faster than just the store release.
> You do need a fence in the unlock path though (at least a release fence).
Well yes, but on x86 that comes for free. The overhead of the full fence brought in by lock cmpxchg or lock xchg is on the order of ~10ns, which for an uncontended lock means that a mutex is almost 2x as slow as a spinlock.
A load-acquire + store-release would be a couple of ns (assuming everything is in L1, etc.).
As far as I know it is a fundamental limitation. You need to release the mutex and then check that there were no waiters, in that order, to avoid missing wakeups. The mutex release must be globally visible before the load, and release ordering on the mutex is not sufficient since the load could be reordered before the unlock; hence you need a StoreLoad fence, which is always the most expensive barrier.
Consider the implementation of Dekker's algorithm for example.
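To make the ordering problem concrete, here is a sketch of the "cheap" unlock path in C11 atomics (names and structure are mine, not from this thread):

    #include <stdatomic.h>

    /* Broken cheap unlock: without a StoreLoad fence between the store
       and the load, the CPU may order the load of *waiters before the
       store to *locked becomes globally visible. A thread that checks
       the lock, sees it held, and parks itself in that window is then
       never woken. */
    void unlock_broken(_Atomic int *locked, _Atomic int *waiters)
    {
        atomic_store_explicit(locked, 0, memory_order_release);
        /* need #StoreLoad here; release ordering alone doesn't give it */
        if (atomic_load_explicit(waiters, memory_order_relaxed))
            ; /* futex_wake(locked, 1); */
    }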
As a more practical argument, let's suppose x86 had an atomic CAS that guarantees that the store is a release but the load is relaxed (i.e., unlike normal x86 loads but like non-temporal loads, it would carry no implied LoadLoad or LoadStore ordering).
This relaxed-load CAS, coupled with some form of control or data dependency, would be sufficient to implement your mutex. But would such a CAS be significantly cheaper than the existing CAS? If it were, you would be able to approximate the strong CAS semantics by adding lfence+sfence after the relaxed CAS. Those fences are cheap, so if the strong CAS could be implemented this way with significant gains, Intel would have already done it.
Finally, it is practically possible to implement the sequence store(A); #StoreLoad; load(B) without an explicit fence by using the colocation trick: have A and B be adjacent in memory (on the same cacheline), store to A, then do a wider load that covers both A and B. Intel does not give multi-thread guarantees for this, but my understanding is that it works: the wider load fails to be store-forwarded and stalls the pipeline waiting for the previous store to be flushed. In practice this costs about as much as a fence, so in addition to being undefined, it is not cheaper.
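A sketch of that trick, with my own layout and names (and, again, relying on behavior Intel does not guarantee; illustration only):

    #include <stdint.h>

    /* `lock` and `waiters` share a cacheline; the 16-bit load below
       overlaps the byte just stored, so it cannot be satisfied by
       store-forwarding and stalls until the store drains, roughly
       approximating store(A); #StoreLoad; load(B). */
    union lockword {
        struct { uint8_t lock; uint8_t waiters; } parts;
        uint16_t both;
    };

    static int unlock_and_check(volatile union lockword *w)
    {
        w->parts.lock = 0;                /* store A */
        return (w->both >> 8) & 0xff;     /* wide load covering A and B */
    }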
There are ways to avoid the atomic RMW on unlock; I've played with two (a sketch of the first follows after this list):

- On the wait side, do the CAS to set the "waiter present" bit. On unlock, do a (relaxed) read of the lock word, and if "waiter present" isn't set, just do a release store to unlock (and go down some slow CAS-y wake path if a waiter is present). On the wait side, never do an un-timed futex wait; just do a series of timed waits, with increasing wait times (so that you still eventually fix things if you hit the unlucky race with the previous holder's check+store sequence). (You can also do some tricks with counters to let waiters do an unconditional sleep once they get their wait acknowledged.)
- Split out the "waiter present" bit into its own byte, do a store-load sequence (with just a compiler reordering fence) to check for waiters, and have waiters either do a membarrier() syscall or wait "long enough" that they're sure they've gotten the same effect. (This gets tricky w.r.t. mutex lifetime though; you either need out of band lifetime knowledge or to use RCU or whatever and indirect through pointers).
Practically, neither was ever "better enough" for anything but microbenchmarks to be worth the complexity.
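For reference, a sketch of the first scheme (my reconstruction of the description above, not the author's code; the lock-word layout is assumed):

    #include <stdatomic.h>

    enum { LOCKED = 1, WAITER = 2 };  /* assumed bit layout */

    void unlock_sloppy(_Atomic unsigned *word)
    {
        if (!(atomic_load_explicit(word, memory_order_relaxed) & WAITER)) {
            /* No waiter announced: plain release store, no RMW. A waiter
               arriving right now may miss this wakeup, but its timed
               futex waits bound how long it can oversleep. */
            atomic_store_explicit(word, 0, memory_order_release);
            return;
        }
        /* Waiter present: slow path with CAS + futex_wake. */
    }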
> Split out the "waiter present" bit into its own byte, do a store-load sequence (with just a compiler reordering fence) to check for waiters, and have waiters either do a membarrier() syscall or wait "long enough" that they're sure they've gotten the same effect. (This gets tricky w.r.t. mutex lifetime though; you either need out of band lifetime knowledge or to use RCU or whatever and indirect through pointers).
If you are doing this, given the cost of membarrier, you are optimizing for almost always uncontended locks. At this point you can make your lock default-owned by the first thread to lock it and have the owner lock and unlock be basically free until it is contended. This is basically the biased locking optimization that Java implements (or used to).
It kinda depends; you only do the membarrier when you're about to sleep anyways, and the non-expedited membarrier() call is just a synchronize_rcu(), so it's not that drastically more expensive than a futex wait.
You don't necessarily want a biased lock for all this kind of stuff, because "sparsely contended" doesn't necessarily imply thread-associated. E.g. one place I was looking at this for was locks for pages of virtual memory in a heap; no thread "owns" any given heap page, but it was very uncommon to get unlucky and have two threads touching adjacent pages at the exact same time. These kind of "sloppy mutexes" get half the fast-path speedup of biased locks but without the heavily asymmetric performance costs. (At least, that was the theory; like I said it didn't really pan out to be that useful in practice).
Yeah, that specific benchmark is actually likely to reward undesirable behaviors, for example pathological unfairness: clearly the optimal scheduling of those threads runs all the increments from the first thread first, then all of the second thread, and so on, because this minimizes inter-processor traffic.
A mutex that sleeps for a fixed amount (for example 100us) on a failed lock acquisition will probably get very close to that behavior (since it almost always bunches), and "win" the benchmark. Still, that would be a terrible mutex for any practical application where there is any contention.
This is not to say that this mutex is not good (or that pthread mutexes are not bad), just that the microbenchmark in question does not measure anything that predicts performance in a real application.
But perhaps that's the interesting point: 1/x *cannot be* an exact bijection for floating-point numbers, because there's no one-to-one mapping between the two sets.
VByte is a weird baseline to compare to: it is a byte-aligned encoding scheme, so it trades off some space efficiency for speed of decoding. The proposed method is bit-aligned, so it should be compared against bit-aligned encoding schemes.
For large integers, it is hard to beat Elias Delta coding [1], which asymptotically uses log(n) + o(log(n)) bits, and in practice it breaks even with most other encoding schemes quite early on.
More importantly, Elias Gamma and Delta are complete, meaning that there is no redundancy (another way to look at it is that any sequence over the alphabet is decodable). VByte is complete over the byte alphabet.
Any complete code is optimal for some integer distribution (see "implied probability" in the Wikipedia page).
So if your integer distribution is heavier on the large integers, there are already plenty of complete encodings to pick from, and almost certainly one that fits well the distribution.
The scheme proposed here, instead, is not complete, as mentioned in the README ("this method does introduce some redundancy, particularly when sequences must be padded to prevent unintentional '101' patterns"), so it cannot be optimal for any distribution.
The Wikipedia page on the Kraft–McMillan inequality [2] explains this in more detail.
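For concreteness, here is a minimal sketch of Elias Gamma/Delta encoding (my own illustration, with bits printed as characters for readability; not code from the post):

    #include <stdint.h>
    #include <stdio.h>

    static void emit(int bit) { putchar('0' + bit); }

    /* Elias Gamma: floor(log2 n) zeros, then n in binary (n >= 1). */
    static void gamma_encode(uint64_t n) {
        int N = 0;
        while ((n >> (N + 1)) != 0) N++;        /* N = floor(log2 n) */
        for (int i = 0; i < N; i++) emit(0);
        for (int i = N; i >= 0; i--) emit((n >> i) & 1);
    }

    /* Elias Delta: Gamma-code the bit length, then the low N bits of n. */
    static void delta_encode(uint64_t n) {
        int N = 0;
        while ((n >> (N + 1)) != 0) N++;
        gamma_encode((uint64_t)(N + 1));
        for (int i = N - 1; i >= 0; i--) emit((n >> i) & 1);
    }

    int main(void) {
        delta_encode(9);   /* prints 00100001: gamma(4) = 00100, then 001 */
        return 0;
    }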
Yeah I think the paper could be better, too. Thank you for your suggestions — and for the information, it’s very interesting!
Although, the padding requirement for integers with bits ending in 10 can actually be dismissed: you join on only 101; then to decode you first split on 10101, re-attach the removed 10 to the left, then split the resulting parts on 101, removing the padding.
I guess you can consider that a spec bug in the draft? Hahahaha! :)
I don’t know what complete means, and I don’t know if this becomes complete, but anyway it’s really interesting.
Sounds like it would be a good idea to add these codings to the benchmark!
There’s another potential criticism I was thinking about: what if we encode the lengths with VByte and then concat the bits, just like we do with irradix to make the L1 variant? It’s not really fair to compare L1 with VByte when they’re operating in different modes like that. It’s possible that any advantage of our current scheme disappears if you do that, I don’t know.
I picked VByte tho because it’s so simple and very common, so just a question for you: why aren’t the codings you mentioned used as frequently as VByte? Are they more complex or slower?
> According to the Times, Shah wants CNET because it’s a “well-known industry brand” and still has an audience that’s large enough to be attractive to tech advertisers.