I discovered that in LLM inference, keys and values in the KV cache have very different quantization sensitivities. Keys need higher precision than values to maintain quality.
I patched llama.cpp to enable different bit-widths for keys vs. values on Apple Silicon. The results are surprising:
- K8V4 (8-bit keys, 4-bit values): 59% memory reduction with only 0.86% perplexity loss
- K4V8 (4-bit keys, 8-bit values): 59% memory reduction but 6.06% perplexity loss
- The configurations use the same number of bits, but K8V4 is 7× better for quality
This means you can run LLMs with 2-3× longer context on the same Mac. Memory usage scales with sequence length, so savings compound as context grows.
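To make the scaling concrete, here is a back-of-the-envelope sketch (not the project's code) of KV-cache size for a TinyLlama-1.1B-shaped model. The layer and head dimensions are assumptions taken from the published TinyLlama config, and the 8.5 / 4.5 bits-per-element figures include the Q8_0 / Q4_0 per-block scale overhead, which is roughly where the 59% number comes from:

```cpp
// Rough KV-cache size estimate for a TinyLlama-1.1B-like model.
// Assumed shape: 22 layers, 4 KV heads, head_dim 64 (check the GGUF metadata).
// Q8_0 packs 32 values into 34 bytes (~8.5 bits/elem); Q4_0 into 18 bytes (~4.5 bits/elem).
#include <cstdio>

static double cache_mib(int n_ctx, double bits_k, double bits_v) {
    const int n_layer = 22, n_kv_head = 4, head_dim = 64;
    const double elems = double(n_ctx) * n_layer * n_kv_head * head_dim; // per K (and per V)
    return elems * (bits_k + bits_v) / 8.0 / (1024.0 * 1024.0);
}

int main() {
    const int ctxs[] = {2048, 4096, 8192};
    for (int n_ctx : ctxs) {
        double fp16 = cache_mib(n_ctx, 16.0, 16.0);
        double k8v4 = cache_mib(n_ctx, 8.5, 4.5);   // Q8_0 keys, Q4_0 values
        std::printf("n_ctx=%5d  FP16=%6.1f MiB  K8V4=%6.1f MiB  (-%.0f%%)\n",
                    n_ctx, fp16, k8v4, 100.0 * (1.0 - k8v4 / fp16));
    }
    return 0;
}
```

Under these assumptions an 8K context works out to roughly 176 MiB of KV cache in FP16 versus about 72 MiB with K8V4, and the gap grows linearly with sequence length.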
Implementation was straightforward:
1. Added --kvq-key and --kvq-val flags to llama.cpp
2. Applied the existing quantization logic separately to the K and V tensors
3. Validated with perplexity metrics across context lengths
4. Used Metal for acceleration (with the -mlong-calls flag to avoid vectorization issues)
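For readers who want to experiment, here is a minimal sketch of configuring independent K and V cache types through llama.cpp's C API. This is not the KVSplit patch itself: recent mainline llama_context_params exposes type_k / type_v fields, the model path and context size are placeholders, and symbol names and signatures shift between llama.cpp versions.

```cpp
// Minimal sketch (not the KVSplit patch): recent llama.cpp exposes separate
// cache types for K and V via llama_context_params. Model path is a placeholder.
#include "llama.h"

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file("tinyllama.gguf", mparams);
    if (!model) return 1;

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx      = 8192;
    cparams.flash_attn = true;            // mainline requires flash attention for a quantized V cache
    cparams.type_k     = GGML_TYPE_Q8_0;  // 8-bit keys
    cparams.type_v     = GGML_TYPE_Q4_0;  // 4-bit values
    llama_context * ctx = llama_new_context_with_model(model, cparams);
    if (!ctx) return 1;

    // ... evaluate tokens as usual; the KV cache is now stored as Q8_0 keys / Q4_0 values ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```

The --kvq-key / --kvq-val flags described above presumably wrap this kind of configuration on the command line.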
Benchmarked on an M4 MacBook Pro running TinyLlama with 8K context windows. Compatible with Metal/MPS and optimized for Apple Silicon.
GitHub: https://github.com/dipampaul17/KVSplit
Comments URL: https://news.ycombinator.com/item?id=44009321
Points: 108
# Comments: 11