Show HN: KVSplit – Run 2-3x longer contexts on Apple Silicon

I discovered that in LLM inference, keys and values in the KV cache have very different quantization sensitivities. Keys need higher precision than values to maintain quality.

I patched llama.cpp to enable different bit-widths for keys vs. values on Apple Silicon. The results are surprising:

- K8V4 (8-bit keys, 4-bit values): 59% memory reduction with only 0.86% perplexity loss
- K4V8 (4-bit keys, 8-bit values): 59% memory reduction but 6.06% perplexity loss
- Both configurations use the same total number of bits, but K8V4 degrades perplexity about 7× less (0.86% vs. 6.06%)
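The 59% figure can be sanity-checked with simple arithmetic. The sketch below assumes the cache uses block formats in the style of llama.cpp's `q8_0` and `q4_0` (32 elements per block plus one 2-byte FP16 scale); the post does not name the formats, so treat that as an assumption. Under it, the effective costs are 8.5 and 4.5 bits per element against 16 bits for an FP16 baseline.

```python
# Back-of-the-envelope check of the memory reduction figure.
# Assumption: q8_0 / q4_0 style blocks of 32 elements, each with one 2-byte FP16 scale.
BLOCK = 32


def bits_per_element(payload_bits: int, scale_bytes: int = 2) -> float:
    """Effective bits per element, including the per-block scale."""
    block_bytes = BLOCK * payload_bits / 8 + scale_bytes
    return block_bytes * 8 / BLOCK


def kv_reduction(key_bits: float, val_bits: float, baseline_bits: float = 16.0) -> float:
    """Fractional saving versus storing both K and V at baseline_bits."""
    return 1 - (key_bits + val_bits) / (2 * baseline_bits)


q8 = bits_per_element(8)  # effective bits for the 8-bit format
q4 = bits_per_element(4)  # effective bits for the 4-bit format
print(f"K8V4: {kv_reduction(q8, q4):.1%}")
print(f"K4V8: {kv_reduction(q4, q8):.1%}")
```

Both orderings give the same saving because only the sum of key and value bits matters for memory; the quality difference comes entirely from which tensor gets the higher precision.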

This means you can run LLMs with 2-3× longer context on the same Mac. Memory usage scales with sequence length, so savings compound as context grows.
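To make the scaling concrete, here is a rough estimate of KV cache size as a function of context length. The model shape is an assumption based on the published TinyLlama-1.1B configuration (22 layers, 4 KV heads, head dimension 64), and the 8.5 and 4.5 bits per element assume `q8_0`/`q4_0` style blocks; none of these numbers come from the post itself.

```python
# Rough estimate of KV cache size and how far a fixed budget stretches.
# Assumed TinyLlama-1.1B shape (not stated in the post): 22 layers, 4 KV heads, head dim 64.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 22, 4, 64


def kv_cache_bytes(n_tokens: int, key_bits: float, val_bits: float) -> float:
    """Bytes needed to cache K and V for n_tokens across all layers."""
    elems_per_token = N_LAYERS * N_KV_HEADS * HEAD_DIM  # per tensor (K or V)
    return n_tokens * elems_per_token * (key_bits + val_bits) / 8


def max_tokens(budget_bytes: float, key_bits: float, val_bits: float) -> int:
    """Largest context that fits in budget_bytes of KV cache."""
    per_token = kv_cache_bytes(1, key_bits, val_bits)
    return int(budget_bytes // per_token)


fp16_8k = kv_cache_bytes(8192, 16, 16)
k8v4_8k = kv_cache_bytes(8192, 8.5, 4.5)
print(f"FP16 @ 8K: {fp16_8k / 2**20:.1f} MiB")
print(f"K8V4 @ 8K: {k8v4_8k / 2**20:.1f} MiB")
print("Tokens that fit in the FP16 8K budget with K8V4:", max_tokens(fp16_8k, 8.5, 4.5))
```

This counts only the KV cache. Model weights and activations are a fixed cost that quantizing the cache does not reduce, so the saving for the whole process is smaller than the saving on the cache alone.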

Implementation was straightforward:

1. Added --kvq-key and --kvq-val flags to llama.cpp
2. Applied existing quantization logic separately to K and V tensors
3. Validated with perplexity metrics across context lengths
4. Used Metal for acceleration (with -mlong-calls flag to avoid vectorization issues)
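The idea behind step 2 can be illustrated with a small sketch: one quantization routine, called with a different bit-width for each tensor. This is a simplified symmetric per-block scheme written for illustration. It is not llama.cpp's code, the function names are hypothetical, and the input is a synthetic stand-in rather than real K or V data.

```python
# Minimal sketch of symmetric per-block quantization at two bit-widths.
# Illustrative only: this is not llama.cpp's code, and the names are hypothetical.
import math


def quantize_block(block, bits):
    """Map floats to signed integers with one shared scale per block."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    amax = max(abs(x) for x in block)
    scale = amax / qmax if amax > 0 else 1.0
    return scale, [round(x / scale) for x in block]


def dequantize_block(scale, q):
    """Recover approximate floats from the integers and the block scale."""
    return [scale * v for v in q]


def rms_error(block, bits):
    """Root-mean-square round-trip error at the given bit-width."""
    scale, q = quantize_block(block, bits)
    restored = dequantize_block(scale, q)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(block, restored)) / len(block))


# Deterministic stand-in for one 32-element slice of a K or V tensor.
block = [math.sin(0.37 * i) for i in range(32)]
key_err = rms_error(block, bits=8)  # keys kept at 8 bits
val_err = rms_error(block, bits=4)  # values dropped to 4 bits
print(f"8-bit RMS error: {key_err:.5f}")
print(f"4-bit RMS error: {val_err:.5f}")
```

The sketch only shows that fewer bits mean a larger round-trip error per element. Why that error hurts more in keys than in values is the empirical finding reported above, and this toy example does not attempt to reproduce it.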

Benchmarked on an M4 MacBook Pro running TinyLlama with 8K context windows. Compatible with Metal/MPS and optimized for Apple Silicon.

GitHub: https://github.com/dipampaul17/KVSplit

Comments URL: https://news.ycombinator.com/item?id=44009321

Points: 108

# Comments: 11


Created 8h ago | May 16, 2025, 21:50:10

