Show HN: Speeding up LLM inference 2x (possibly)

Here's a project I've been working on for the last few months.

It's a new (I think) algorithm that lets you smoothly adjust, in real time, how many calculations you'd like to do during LLM inference.

It seems that it's possible to do just 20-25% of weight multiplications instead of all of them, and still get good inference results.
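To make the idea concrete, here is a toy sketch in plain Python. It is not the project's actual algorithm (that one is described at the link further down); it only illustrates the general idea of keeping the largest-magnitude weight-times-input products and dropping the rest. The names `approx_matvec` and `effort` are hypothetical, chosen for this illustration, and this naive version computes every product just to rank them, so it shows the accuracy trade-off rather than any speedup.

```python
def approx_matvec(weights, x, effort):
    """Approximate y = W @ x by summing only the largest-magnitude products.

    weights: list of rows (out_dim x in_dim)
    x:       list of in_dim floats
    effort:  fraction in (0, 1] of the individual multiplications to keep
    """
    if not 0.0 < effort <= 1.0:
        raise ValueError("effort must be in (0, 1]")

    # Illustration only: every product is computed here so it can be ranked.
    # A real implementation would need to pick the products without this step.
    products = [
        (w * x[col], row)
        for row, weight_row in enumerate(weights)
        for col, w in enumerate(weight_row)
    ]
    keep = max(1, round(effort * len(products)))
    products.sort(key=lambda p: abs(p[0]), reverse=True)

    y = [0.0] * len(weights)
    for value, row in products[:keep]:
        y[row] += value
    return y


W = [[10.0, 0.1],
     [0.2, -5.0]]
x = [1.0, 1.0]

# Keeping half of the four products retains only 10.0 and -5.0.
print(approx_matvec(W, x, effort=0.5))  # [10.0, -5.0]
# With effort=1.0 every product is kept and the result is the exact W @ x.
print(approx_matvec(W, x, effort=1.0))
```

In this tiny example the exact result is roughly [10.1, -4.8], so half of the multiplications already recover most of the output. Whether that holds for a real model at 20-25% is exactly the claim the post puts up for scrutiny.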

I implemented it to run on M1/M2/M3 GPUs. The matrix multiplication (mmul) approximation itself can be pushed to run 2x faster before the quality of output collapses.

The overall inference speed is only a bit faster than Llama.cpp's, because the rest of the implementation could be better, but with further development I think it can become a new method to speed up inference - in addition to quantization.

You could call it ad-hoc model distillation :)

You can change the speed / accuracy of a model at will, in real time.

Oh, and as a side effect, the data format also lets you choose how much of the model you want to load into memory. You can decide to skip, say, 10%, 20% or 40% of the least important weights.
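As a rough illustration of skipping the least important weights, the sketch below drops the smallest-magnitude entries of a matrix and keeps the rest in a sparse per-row form. This is an assumption-laden toy, not the project's data format: it treats "least important" as "smallest absolute value", and `prune_smallest` and `sparse_matvec` are hypothetical helper names.

```python
def prune_smallest(weights, skip_fraction):
    """Drop the smallest-magnitude weights; return one {col: weight} dict per row.

    Assumption for this sketch: importance == absolute value of the weight.
    """
    if not 0.0 <= skip_fraction < 1.0:
        raise ValueError("skip_fraction must be in [0, 1)")

    # Rank every entry by magnitude, smallest first.
    entries = sorted(
        (abs(w), row, col)
        for row, weight_row in enumerate(weights)
        for col, w in enumerate(weight_row)
    )
    n_skip = int(skip_fraction * len(entries))

    rows = [{} for _ in weights]
    for _, row, col in entries[n_skip:]:
        rows[row][col] = weights[row][col]
    return rows


def sparse_matvec(rows, x):
    """Multiply the pruned matrix by x, touching only the weights that were kept."""
    return [sum(w * x[col] for col, w in row.items()) for row in rows]


W = [[0.5, -0.01, 2.0],
     [0.03, -1.5, 0.2]]

pruned = prune_smallest(W, skip_fraction=0.5)   # drops -0.01, 0.03 and 0.2
print(pruned)                                   # [{0: 0.5, 2: 2.0}, {1: -1.5}]
print(sparse_matvec(pruned, [1.0, 1.0, 1.0]))   # [2.5, -1.5]
```

Only the kept entries need to be stored, which is where the memory saving would come from in a scheme like this.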

It's implemented for Mistral, and it was also lightly tested on Mixtral and Llama. It's FP16-only for now, but Q8 is in the works.

The algorithm is described here, and the implementation is open source.

https://kolinko.github.io/effort/

I know these are bold claims, but I hope they survive scrutiny :)

Comments URL: https://news.ycombinator.com/item?id=40067677

Points: 45

# Comments: 7

https://asciinema.org/a/piP22yYwcaohu5cA2gyuv1W61

Created 1y ago | 17 Apr 2024, 19:20:05
