
I think this comment explains it: https://github.com/ggerganov/llama.cpp/discussions/4130#disc... As far as I understand (and mcharytoniuk can better confirm this), llama.cpp lets you chunk the context window of an LLM into independent blocks, so that multiple requests can be processed in a single inference pass. Due to the auto-regressive nature of LLMs, you also don't have to wait for all sequences to finish before returning them: as soon as one sequence finishes, its "slot" in the context window can be reused for another request.
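Here is a minimal conceptual sketch (not llama.cpp's actual code) of that slot-reuse idea: a fixed number of slots, each owning an equal share of the context window, where a finished request frees its slot immediately so a waiting request can take it over. The numbers and names are illustrative assumptions.

```python
# Conceptual sketch of slot-based scheduling: each request grabs a free slot,
# decodes token by token, and releases the slot as soon as it finishes,
# without waiting for the other slots to drain.
import queue
import threading

N_CTX = 8192                  # total context window (example value)
N_SLOTS = 4                   # number of parallel slots (example value)
CTX_PER_SLOT = N_CTX // N_SLOTS

free_slots = queue.Queue()
for slot_id in range(N_SLOTS):
    free_slots.put(slot_id)

def handle_request(request_id: int, n_tokens: int) -> None:
    slot_id = free_slots.get()            # block until a slot is available
    try:
        budget = min(n_tokens, CTX_PER_SLOT)
        for _ in range(budget):           # stand-in for one decode step per token
            pass
        print(f"request {request_id} finished on slot {slot_id} after {budget} tokens")
    finally:
        free_slots.put(slot_id)           # slot is reusable immediately

threads = [threading.Thread(target=handle_request, args=(i, 100 + 50 * i))
           for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```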



Yes, exactly. You can split the available context into "slots" (chunks) so the server can handle multiple requests concurrently. The number of slots is configurable.
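For a concrete picture, a minimal client-side sketch, assuming a llama.cpp server started with something like `llama-server -m model.gguf -c 8192 -np 4`, where `-c` is the total context and `-np` (`--parallel`) is the number of slots, giving each slot roughly 8192 / 4 = 2048 tokens of context. The endpoint and field names follow the server's /completion API as I understand it; treat them as assumptions and check the server README.

```python
# Fire several completion requests at a local llama.cpp server concurrently.
# With 4 slots the server can decode up to 4 of them at once; the rest queue
# until a slot frees up.
import json
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen

URL = "http://127.0.0.1:8080/completion"  # assumed default host/port

def complete(prompt: str) -> str:
    payload = json.dumps({"prompt": prompt, "n_predict": 64}).encode()
    req = Request(URL, data=payload, headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.loads(resp.read())["content"]

prompts = [f"Write a haiku about topic {i}." for i in range(8)]

with ThreadPoolExecutor(max_workers=8) as pool:
    for i, text in enumerate(pool.map(complete, prompts)):
        print(i, text.strip()[:60])
```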





