iamnotagenius's comments

Very strange, as I get 2 tok/sec with a 14B/q8 model on an old i5-12400 with DDR4.

The performance comes mostly from needing only a fraction of the memory bandwidth, as LLMs are mostly memory-constrained. Compute matters too, but usually far less than memory.
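For a rough sense of scale, a back-of-envelope sketch (the bandwidth and model-size numbers are illustrative assumptions, not measurements): each generated token has to stream roughly the whole weight set from RAM, so memory bandwidth divided by model size gives an upper bound on tok/sec.

    # Rough upper bound on decode speed for a dense, memory-bound LLM:
    # every generated token streams (almost) all weights from RAM.
    def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
        return bandwidth_gb_s / model_size_gb

    # Illustrative numbers (assumptions, not benchmarks):
    ddr4_dual_channel = 50.0   # ~GB/s for dual-channel DDR4-3200
    model_14b_q8 = 14.0        # ~GB for a 14B model at 8-bit

    print(f"{max_tokens_per_sec(ddr4_dual_channel, model_14b_q8):.1f} tok/s upper bound")
    # Real throughput lands below this (compute, KV cache, framework overhead).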

Would be great to have dynamic quants of the V3 (non-R1) version, as for some tasks it is good enough. It would also be very interesting to see the degradation from dynamic quants on small/medium-size MoEs, such as older DeepSeek models, Mixtrals, and IBM's tiny Granite MoE. It would be fun if the Granite 1B MoE still functioned at 1.58 bits.

Oh yes multiple people have asked me about this - I'll see what I can do :)

Imo the most "unhinged" cpus for AVX-512 are early batches of Alder Lakes which is the only cpu family that has nearly full coverage of all existing avx-512 subsets.
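For reference, a quick way to see which AVX-512 subsets a Linux box actually exposes (a minimal sketch; it just filters the kernel's /proc/cpuinfo flags, so it shows what the OS reports, not what is fused off in silicon):

    # List the AVX-512 feature flags the kernel reports for this CPU (Linux only).
    def avx512_flags(path: str = "/proc/cpuinfo") -> set[str]:
        with open(path) as f:
            for line in f:
                if line.startswith("flags"):
                    return {fl for fl in line.split() if fl.startswith("avx512")}
        return set()

    print(sorted(avx512_flags()) or "no AVX-512 flags reported")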

It's a shame that Intel seemed to really not want people to use it, given they started disabling the ability to use it in future microcode, and fused it off in later parts.

> It's a shame that Intel seemed to really not want people to use it

AVX-512 was never part of the specification for those CPUs. It was never advertised as a feature or selling point. You had to disable the E cores to enable AVX-512, assuming your motherboard even supported it.

Alder Lake AVX-512 has reached mythical status, but I think the number of people angry about it is far higher than the number of people who ever could have taken advantage of it and benefitted from it. For general purpose workloads, having the E cores enabled (and therefore AVX-512 disabled) was faster. You had to have an extremely specific workload that didn't scale well with additional cores and also had hot loops that benefitted from AVX-512, which was not very common.

So you're right: They never wanted people to use it. It wasn't advertised and wasn't usable without sacrificing all of the E cores and doing a lot of manual configuration work. I suspect they didn't want people using it because they never validated it. AVX-512 mode increased the voltages, which would impact things like failure rate and warranty returns. They probably meant to turn it off but forgot in the first versions.


They had to disable AVX-512 only because Microsoft was too lazy to rewrite their thread scheduler to handle heterogeneous CPU cores.

The Intel-AMD x86-64 architecture is full of horrible things, starting with System Management Mode (added in 1990), most of which Intel added only because Microsoft repeatedly refused to update Windows, expecting hardware vendors to do the work of keeping Windows running on newer hardware instead, even when that caused various disadvantages for customers.

Moreover, even if Intel had not said that Alder Lake would support AVX-512, they also had not said that the P-cores of Alder Lake would not support it.

Therefore everybody expected that Intel would continue to provide backward compatibility, as it always had before, so the P-cores of Alder Lake would continue to support every instruction subset supported by Rocket Lake, Tiger Lake, Ice Lake, and Cannon Lake.

The failure to stay compatible with their previous products was a surprise to everybody.


Windows can work without SMM, especially NT - the problem is that SMM was created for a world where the majority used DOS, and the idea of using OS services instead of exploiting every possible quirk of the IBM PC was anathema to developers.

Thus, SMM, because there was no other way to hook power management on a 386 laptop running "normal" DOS.


> Thus, SMM, because there was no other way to hook power management on a 386 laptop running "normal" DOS.

In theory, there was: you could have a separate microcontroller, accessed through some of the I/O ports, doing the power management; it's mostly how it's done nowadays, with the EC (Embedded Controller) on laptops (and nowadays there's also the PSP or ME, which is a separate processor core doing startup and power management for the main CPU cores). But back then, it would also be more expensive (a whole other chip) than simply adding an extra mode to the single CPU core (multiple cores back then usually required multiple CPU chips).


The problem is reliably interrupting the CPU in a way that didn't require extra OS support. SMM provided such a trigger, and in fact it is generally used as part of the scheme, with the EC cooperating.

If Windows could work without SMM, is there a historical reason why SMM didn't just die and fall into disuse after Windows became popular and nobody used DOS any more? There are plenty of features in x86 that are disused.

The feature turned out to be too useful for all sorts of things, including dealing with the fact that before NT loaded itself you still had to emulate being an IBM PC, down to the fiction of booting from cassette tape or jumping to ROM BASIC.

Also, it's been cheaper to implement various features through a small piece of code instead of adding a separate MCU to handle them, including prosaic things like NVRAM variable storage (instead of interacting with an external MCU or having separate NVRAM, you end up with SMM code being "trusted" to update the single flash chip that contains both NVRAM and boot code).


I don't know if I'd call Microsoft lazy. Are there any existing operating systems that allow preemptive scheduling across cores with different ISA subsets? I'd sort of assume Microsoft research has a proof of concept for something like that but putting it into a production OS is a different kettle of fish.

> the P-cores of Alder Lake would continue to support every instruction subset supported by Rocket Lake, Tiger Lake, Ice Lake, and Cannon Lake

Wait. I thought the article says only Tiger Lake supports the vp2intersect instruction. Is that not true then?


Tiger Lake is the only one with vp2intersect, but before Alder Lake there had already been 3 generations of consumer CPUs with AVX-512 support (Cannon Lake in 2018/2019, Ice Lake in 2019/2020 and Tiger Lake + Rocket Lake in 2020/2021).

So it was expected that any future Intel CPUs would remain compatible. Removing an important instruction subset had never happened before in Intel's history.

Only AMD had removed some instructions, when moving from the 32-bit ISA to the 64-bit ISA, most of which were obsolete (except that removing interrupt-on-overflow was bad, and it does not greatly simplify a CPU core, since there are many other sources of precise exceptions that must still be supported; the only important effect of removing INTO is that many instructions can be retired earlier than otherwise, which reduces the risk of filling up the retirement queue).


The reason you had to disable the E cores was... also an artificial barrier imposed by Intel. Enabling AVX-512 only looks like a problem when inside that false dichotomy. You can have both with a bit of scheduler awareness.

The problem with the validation argument is that the P-cores were advertising AVX-512 via CPUID with the E-cores disabled. If the AVX-512 support was not validated and not meant to be used, it would not have been a good idea to set that CPUID bit, or even to allow the instructions to execute without faulting. It's strange that it launched with any AVX-512 support at all, and there were rumors that the decision to officially drop AVX-512 support was made at the last minute.

As for the downsides of disabling the E-cores, there were Alder Lake SKUs that were P-core only and had no E-cores.

Not all workloads are easily parallelizable, and AVX-512 has features that are also useful for highly serial workloads such as decompression, even at narrower than 512-bit width. Part of the reason AVX-512 sees limited usage is that Intel set widespread adoption back by half a decade by dropping it from their consumer SKUs again, with AVX10/256 only due to return starting in ~2026.


I use AVX-512 on Alder Lake and it does not increase the voltage above AVX2 voltages, and even then power dissipation is considerably lower than with AVX2.


I think I have two of these sitting in a box, one prototype with avx512 and one retail without. Is it worth me breaking these out for ML experiments and such?

Do they cover anything the Sapphire Rapids Xeons don't? I thought they share the same arch (Golden Cove).

According to this [1] wikipedia article, the only feature Sapphire Rapids doesn't support is VP2INTERSECT.

[1]: https://en.wikipedia.org/wiki/Advanced_Vector_Extensions



Or Zen 5. :-p

Note that the article mentions using both outputs of the instruction, whereas the emulation is only able to compute one output efficiently.
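For anyone unfamiliar with the instruction: VP2INTERSECT produces two masks, one per input vector, marking which lanes have a match in the other vector. A plain-Python reference of the semantics, just to illustrate why an emulation that only produces one of the two masks covers fewer use cases:

    # Reference semantics of VP2INTERSECT (16 x 32-bit lanes in the real instruction):
    # k1 bit i is set if a[i] appears anywhere in b; k2 bit j is set if b[j] appears in a.
    def vp2intersect(a: list[int], b: list[int]) -> tuple[int, int]:
        k1 = k2 = 0
        for i, x in enumerate(a):
            for j, y in enumerate(b):
                if x == y:
                    k1 |= 1 << i
                    k2 |= 1 << j
        return k1, k2

    k1, k2 = vp2intersect([1, 2, 3, 4], [3, 4, 5, 6])
    print(bin(k1), bin(k2))  # 0b1100 0b11: lanes 2,3 of a match; lanes 0,1 of b match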

Yes, you are right; I meant "consumer grade cpu".

What about Zen5?

AMD advertised and enabled AVX-512 on all Zen 5 CPUs. You have to resort to workarounds to get AVX-512 working on Alder Lake.

Requires an obscene amount of memory for context.

It's on the model's huggingface README[1].

> For processing 1 million-token sequences:

> Qwen2.5-7B-Instruct-1M: At least 120GB VRAM (total across GPUs).

> Qwen2.5-14B-Instruct-1M: At least 320GB VRAM (total across GPUs).

[1]: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-1M


More than other models? I thought that context used a lot of memory on all models.

And I’d hardly call it obscene. You can buy a Mac Studio with 192GB of memory; that should allow you to max out the context window of the 7B model. Probably not going to be very fast though.
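Context really is where the memory goes, though. A rough KV-cache estimate as a sketch (the layer/head numbers below are illustrative assumptions for a 7B-class model, not the exact Qwen2.5 configuration, and the official figure also covers weights and activations):

    # Rough KV-cache size: 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes.
    def kv_cache_gb(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
        return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

    # Assumed 7B-class config with grouped-query attention, 1M-token context, fp16 cache:
    print(f"{kv_cache_gb(layers=28, kv_heads=4, head_dim=128, seq_len=1_000_000):.0f} GB")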


Not attainable to the working class, though. "Can" is doing a lot of heavy lifting here. It seems like after a brief period where technology was essentially class-agnostic, now only the wealthy can enjoy being part of development and everyone else gets to be just a consumer.

Not sure what you mean. Cutting edge computing has never been cheap. And a Mac Studio is definitely within the budget of a software developer in Norway. Not going to feel like a cheap investment, but definitely something that would be doable. Unlike a cluster of H100 GPUs, which would cost as much as a small apartment in Oslo.

And you can easily get a dev job in Norway without having to run an LLM locally on your computer.


The money would be better invested in a 2-4x 3090 x86 build than in a Mac Studio. While the Macs have a fantastic performance-per-watt ratio and decent memory support (both bus width and memory size), they are not great at compute. A multi-RTX-3090 build totally smokes a Mac at the same price point in inference speed.

Memory requirement for the 7B model with full context is 120GB, so you would need 5 3090 GPUs, not 2-4. Do you know if you can get a motherboard with space for 5 GPUs and a power supply to match?

I bet that 5 3090s will smoke a Mac Studio. Can't find anyone in Norway with any in stock though. Or any 4090s with 24GB of memory.

You can get an NVIDIA RTX 5000 with 32GB of memory; there are two webshops that have those in stock. You'll need to wait though, because it looks like there might be only one or two in stock in total. And they are 63 000 NOK, and you need 4 of them. At that price you could buy two Mac Studios.

I see people selling 3090s with 24GB secondhand for around 10 000 NOK each, but those have been running day in and day out for 3 years and don't come with a warranty.


If you search on r/localllama, there are people who have improvised builds with e.g. 8x GPUs. It takes multiple power supplies and server mainboards. And some let the GPUs sit openly on wooden racks - not sure that’s good for longevity?

BTW a Mac wouldn’t be able to run a model with 120GB requirements; 8GB for everything else is likely too tight a fit.


The Mac Studio has up to 192 GB of memory.

Agreed - it's probably not unreasonable. So are the M4 Macs becoming the de-facto solution to running an LLM locally? Due to the insane 800 GB/sec internal bandwidth of Apple Silicon at its best?

The advantage the Macs have is that they can share RAM between GPU and CPU, and GPU-accessible RAM is everything when you want to run a decent sized LLM.

The problem is that most ML models are released for NVIDIA CUDA. Getting them to work on macOS requires translating them, usually to either GGUF (the llama.cpp format) or MLX (using Apple's own MLX array framework).

As such, as a Mac user I remain envious of people with NVIDIA/CUDA rigs with decent amounts of VRAM.

The NVIDIA "Digits" product may change things when it ships: https://www.theverge.com/2025/1/6/24337530/nvidia-ces-digits... - it may become the new cheapest convenient way to get 128GB of GPU-accessible RAM for running models.


No, they lack the compute power to be great at inference.

Can you back that up?

One 3090 seems to be equivalent to one M3 Max at inference: https://www.reddit.com/r/LocalLLaMA/s/BaoKxHj8ww

There are many such threads on Reddit. M4 Max is incrementally faster, maybe 20%. Even if you factor in electricity costs, a 2x 3090 setup is IMO the sweet spot, cost/benefit wise.

And it’s maybe a zany line of argumentation, but 2x 3090 use 10x the power of an M4 Max. While the M4 is maybe the most efficient setup out there, it’s not nearly 10x as efficient. That’s IMO where the lack of compute power comes from.


What is the GPU memory on that 3090?

24GB VRAM. Using multiple ones scales well because models can be split by layers, and run in a pipelined fashion.
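A toy sketch of that layer split (assuming PyTorch and two CUDA devices; the 8-layer stack of Linear layers is purely hypothetical, and real stacks like llama.cpp handle the split for you):

    import torch
    import torch.nn as nn

    # Toy 8-layer "model" split across two GPUs: layers 0-3 on cuda:0, 4-7 on cuda:1.
    # Activations cross between cards only once per token, which is why this pipelines well.
    layers = [nn.Linear(4096, 4096) for _ in range(8)]
    half = len(layers) // 2
    for i, layer in enumerate(layers):
        layer.to("cuda:0" if i < half else "cuda:1")

    def forward(x: torch.Tensor) -> torch.Tensor:
        x = x.to("cuda:0")
        for i, layer in enumerate(layers):
            if i == half:
                x = x.to("cuda:1")   # the single hop between GPUs
            x = layer(x)
        return x

    print(forward(torch.randn(1, 4096)).device)  # cuda:1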

I am talking about the times when you were limited only by your imagination and skills. All you needed was a laptop and a few hundred bucks for servers. Now, to compete, you would need orders of magnitude more cash. You can still do some things, but you are at the mercy of AI providers who can cut you off on a whim.

Not much more than something like a used jetski, but possibly depreciates even faster.

I mean... when has this not been the case?

Technology has never been class-agnostic or universally accessible.

Even saying that, I would argue that there is more, not less, technology that is accessible to more people today than there ever has been.


People hate being caught in hypocrisy; you and I will be downvoted to oblivion.

10-15 t/s on a 12400 with DDR5.

No, but I use Llama 3.2 1B and Qwen2.5 1.5B as a bash one-liner generator, always running in the console.

Could you elaborate?

I think I know what he means. I use AI Chat. I load Qwen2.5-1.5B-Instruct with the llama.cpp server, fully offloaded to the CPU, and then I configure AI Chat to connect to the llama.cpp endpoint.

Check out the demo they have below:

https://github.com/sigoden/aichat#shell-assistant
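If you don't want another tool in the loop, the llama.cpp server also exposes an OpenAI-compatible chat endpoint you can hit directly. A minimal sketch (the port and prompt are assumptions; use whatever you started llama-server with):

    import json
    import urllib.request

    # Ask a locally served model (llama.cpp's OpenAI-compatible API) for a one-liner.
    # Assumes llama-server is already running on localhost:8080.
    payload = {
        "messages": [
            {"role": "system", "content": "Reply with a single bash one-liner, no prose."},
            {"role": "user", "content": "find all .log files older than 7 days and delete them"},
        ],
        "temperature": 0.2,
    }
    req = urllib.request.Request(
        "http://localhost:8080/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])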


I just run llama-cli with the model. Every time I want some "awk" or "find" trickery, I just ask the model. Good for throwaway Python scripts too.

Can it do 'sed'?

I think one major improvement for folks like me would be a human->regex LLM translator, ideally also respecting the different flavors/syntax of various languages and tools.

This has been a bane of mine - I run into a requirement to develop some complex regexes maybe every 2-3 years, so I dig deep into the specs, work on it, and eventually deliver if it's even possible, and then within a few months I almost completely forget all the details and start at almost the same place next time. It gets better over time, but clearly I will retire before this skill settles in.
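In the meantime, the workable pattern is to have the model propose the regex and then immediately validate it against a few known-good and known-bad strings, so hallucinated syntax fails fast. A minimal sketch (the pattern and test strings are made up for illustration; the pattern stands in for model output):

    import re

    # Validate an LLM-proposed regex against examples before trusting it anywhere.
    proposed = r"^\d{4}-\d{2}-\d{2}$"   # stand-in for the model's answer to "regex for ISO dates"
    should_match = ["2024-01-31", "1999-12-01"]
    should_reject = ["31-01-2024", "2024/01/31", "2024-1-31"]

    pattern = re.compile(proposed)
    ok = (all(pattern.fullmatch(s) for s in should_match)
          and not any(pattern.fullmatch(s) for s in should_reject))
    print("regex accepted" if ok else "regex rejected, re-prompt the model")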


Have not tried yet. Any specific query? I can try.

What's your workflow like? I use AI Chat. I load Qwen2.5-1.5B-Instruct with the llama.cpp server, fully offloaded to the CPU, and then I configure AI Chat to connect to the llama.cpp endpoint.

Coding problems are never solved with RAG.

90% of the time, all the LLM does is point me to the website where the solution is.

Well, this is not how coding models are normally used anyway.

Small models can be funny (12B model):

tell me funny story about techbro and real developer who actually ships code

Sure, here's a light-hearted story for you:

---

Once upon a time in Silicon Valley, there was a shiny new startup called "TechBro Inc." It was filled with eager young techbros, all wearing their "10x Developer" t-shirts, and sipping on their $8 cold-pressed juices. Among them was Chad, the self-proclaimed "Tech Bro King."

Chad was always posting his impressive code snippets on his social media profiles, bragging about how he could "crush" any coding challenge in no time. He was the epitome of a techbro, all talk and no walk.

Meanwhile, in the corner, there was Alex, the quiet, unassuming developer who actually shipped code. Alex didn't have flashy t-shirts or a massive online following, but the company's products worked, and that was thanks to Alex.

One day, TechBro Inc. decided to host a "Hackathon" to impress their investors. Chad saw this as his opportunity to shine. He spent the entire day tweeting about his progress, posting screenshots of his code, and even live-streaming his "epic coding session."

Alex, on the other hand, quietly sat at his desk, coding away. He didn't take any breaks for photoshoots or live streams. He was there to work, not to perform.

As the hackathon came to a close, Chad confidently declared, "I've just finished the most amazing feature ever! It's going to revolutionize our app!" The room erupted in applause, and Chad basked in the glory.

The next day, the team gathered to demo their new features. Chad confidently presented his "revolutionary" feature. He clicked the button to launch it, but nothing happened. He tried again, and again, but the app remained unchanged. Chad's face turned red, and sweat dripped down his forehead.

It was Alex's turn. He quietly stepped up, clicked a button, and suddenly, the app was filled with new, useful features. The team gasped in awe. The investors were impressed. The app actually worked, and that was thanks to Alex.

Chad, trying to save face, turned to Alex and said, "Well, Alex, it seems you've just... shipped some code. Not bad for a... real developer."

The room burst into laughter. From that day forward, Chad learned the difference between talking about coding and actually doing it. And Alex? He just kept shipping code, one line at a time.

---


That is beyond terrible. Are you joking? I feel sad that I spent a minute reading that.
