Show HN: KVSplit – Run 2-3x longer contexts on Apple Silicon

I discovered that in LLM inference, keys and values in the KV cache have very different quantization sensitivities. Keys need higher precision than values to maintain quality.

I patched llama.cpp to enable different bit-widths for keys vs. values on Apple Silicon. The results are surprising:

- K8V4 (8-bit keys, 4-bit values): 59% memory reduction with only 0.86% perplexity loss
- K4V8 (4-bit keys, 8-bit values): 59% memory reduction but 6.06% perplexity loss
- The configurations use the same number of bits, but K8V4 is 7× better for quality
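As a rough sanity check on the 59% figure, here is a minimal sketch of the arithmetic. It assumes (this is not stated in the post) that K and V are stored in llama.cpp-style block formats, where every block of 32 elements carries one fp16 scale next to the packed integers, and that the baseline is an FP16 cache with equally sized K and V tensors. The function names are made up for illustration.

```python
# Assumption: llama.cpp-style block layout, 32 elements per block,
# each block storing one fp16 scale (2 bytes) plus the packed integers.
BLOCK_SIZE = 32
SCALE_BYTES = 2


def bits_per_element(quant_bits: int) -> float:
    """Effective storage cost per element, including the per-block scale."""
    block_bytes = SCALE_BYTES + BLOCK_SIZE * quant_bits / 8
    return block_bytes * 8 / BLOCK_SIZE


def kv_reduction(key_bits: int, val_bits: int, baseline_bits: float = 16.0) -> float:
    """Fraction of memory saved versus a baseline cache with equal-sized K and V."""
    avg_bits = (bits_per_element(key_bits) + bits_per_element(val_bits)) / 2
    return 1.0 - avg_bits / baseline_bits


for name, (k, v) in {"K8V4": (8, 4), "K4V8": (4, 8)}.items():
    print(f"{name}: {kv_reduction(k, v):.1%} smaller than FP16")
```

Under these assumptions both configurations come out at roughly 59%, which matches the numbers above: the memory cost is symmetric in K and V, so only the quality differs.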

This means you can run LLMs with 2-3× longer context on the same Mac. KV cache memory grows linearly with sequence length, so the absolute savings grow as the context gets longer.
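To make the context-length claim concrete, the sketch below estimates how many tokens fit in the memory that an 8K FP16 cache would occupy. The model shape (22 layers, 4 KV heads, head dimension 64) is an assumption modeled on TinyLlama-1.1B, and the effective 8.5 and 4.5 bits per element assume block formats with a per-block scale. None of these numbers come from the post itself.

```python
# Assumed model shape (TinyLlama-1.1B-like; not stated in the post).
N_LAYERS = 22
N_KV_HEADS = 4
HEAD_DIM = 64


def kv_bytes_per_token(key_bits: float, val_bits: float) -> float:
    """Bytes of KV cache that one token adds across all layers."""
    elems = N_LAYERS * N_KV_HEADS * HEAD_DIM  # elements per tensor (K or V)
    return elems * (key_bits + val_bits) / 8


def max_context(budget_bytes: float, key_bits: float, val_bits: float) -> int:
    """Largest token count whose KV cache fits in the given budget."""
    return int(budget_bytes // kv_bytes_per_token(key_bits, val_bits))


FP16 = (16.0, 16.0)
K8V4 = (8.5, 4.5)  # assumed effective bits, including per-block scales

# Budget: the memory an 8K-token FP16 cache would use for this model shape.
budget = 8192 * kv_bytes_per_token(*FP16)

print("FP16 tokens:", max_context(budget, *FP16))
print("K8V4 tokens:", max_context(budget, *K8V4))
```

With these assumed figures the same budget holds about 2.5× as many tokens under K8V4, which falls inside the 2-3× range quoted above. The estimate covers only the KV cache, not model weights or activations.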

Implementation was straightforward:

1. Added --kvq-key and --kvq-val flags to llama.cpp
2. Applied existing quantization logic separately to K and V tensors
3. Validated with perplexity metrics across context lengths
4. Used Metal for acceleration (with -mlong-calls flag to avoid vectorization issues)
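Step 2 can be illustrated with a toy example. This is a simplified symmetric block quantizer written for illustration, not llama.cpp's actual quantization code, and the random data stands in for real K and V tensors. It only shows the idea of running the same routine with a different bit-width per tensor; it says nothing about why keys are more sensitive in a real model.

```python
import random


def quantize_block(block, bits):
    """Symmetric quantization: one float scale plus signed integers."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    amax = max(abs(x) for x in block)
    scale = amax / qmax if amax > 0 else 1.0
    q = [max(-qmax, min(qmax, round(x / scale))) for x in block]
    return scale, q


def dequantize_block(scale, q):
    """Map the stored integers back to approximate float values."""
    return [scale * v for v in q]


def roundtrip_rmse(values, bits, block_size=32):
    """RMS error after quantizing and dequantizing in fixed-size blocks."""
    sq_err = 0.0
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        scale, q = quantize_block(block, bits)
        restored = dequantize_block(scale, q)
        sq_err += sum((a - b) ** 2 for a, b in zip(block, restored))
    return (sq_err / len(values)) ** 0.5


random.seed(0)
keys = [random.gauss(0, 1) for _ in range(1024)]  # stand-in for a K tensor
vals = [random.gauss(0, 1) for _ in range(1024)]  # stand-in for a V tensor

# Same routine, different precision per tensor (the K8V4 idea).
key_err = roundtrip_rmse(keys, bits=8)
val_err = roundtrip_rmse(vals, bits=4)
print(f"K (8-bit) RMSE: {key_err:.5f}")
print(f"V (4-bit) RMSE: {val_err:.5f}")
```

The 8-bit path reconstructs its input far more closely than the 4-bit path, so assigning the higher precision to the more sensitive tensor is the whole trick. Measuring the actual quality impact still requires perplexity runs on a real model, as in step 3.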

Benchmarked on an M4 MacBook Pro running TinyLlama with 8K context windows. Compatible with Metal/MPS and optimized for Apple Silicon.

GitHub: https://github.com/dipampaul17/KVSplit

Comments URL: https://news.ycombinator.com/item?id=44009321

Points: 108

# Comments: 11


Created 10h ago | 16 May 2025, 21:50:10

