TokenDagger is a drop-in replacement for OpenAI's tiktoken (the tokenizer behind Llama 3, Mistral, GPT-3.*, etc.). It's written in C++17 with thin Python bindings, keeps the exact same BPE vocab and special-token rules, and focuses on raw speed.
I'm teaching myself LLM internals by re-implementing the stack from first principles. Profiling tiktoken's Python/Rust implementation showed that a lot of time was spent on regex matching. Most of my perf gains come from (a) using a faster JIT-compiled regex engine and (b) simplifying the algorithm to forgo regex matching of special tokens altogether.
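To make point (b) concrete, here is a minimal sketch of splitting input on special tokens with plain substring search instead of a regex. It is not TokenDagger's actual code: the function name `split_on_special_tokens` and the `(segment, is_special)` output shape are assumptions made for illustration, and the real C++ implementation may work quite differently.

```python
def split_on_special_tokens(text, special_tokens):
    """Split text into (segment, is_special) pieces without using regex.

    Hypothetical helper for illustration; not part of TokenDagger or tiktoken.
    """
    # Longest tokens first, so that when two tokens match at the same
    # position the longer one wins.
    ordered = sorted(special_tokens, key=len, reverse=True)
    pieces = []
    pos = 0
    while pos < len(text):
        # Find the earliest occurrence of any special token at or after pos.
        best_idx, best_tok = -1, None
        for tok in ordered:
            idx = text.find(tok, pos)
            if idx != -1 and (best_idx == -1 or idx < best_idx):
                best_idx, best_tok = idx, tok
        if best_tok is None:
            # No special token left: the rest is ordinary text.
            pieces.append((text[pos:], False))
            break
        if best_idx > pos:
            pieces.append((text[pos:best_idx], False))
        pieces.append((best_tok, True))
        pos = best_idx + len(best_tok)
    return pieces


if __name__ == "__main__":
    for piece in split_on_special_tokens("hi<|endoftext|>there", {"<|endoftext|>"}):
        print(piece)
```

Ordinary segments would then go through the usual pre-tokenization regex and BPE merge, while special segments map straight to their reserved IDs.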
Benchmarking code is included. Notable results:
- 4x faster code sample tokenization on a single thread.
- 2-3x higher throughput when tested on a 1GB natural language text file.
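For readers who want to check throughput numbers of this kind themselves, a rough sketch of a timing harness follows. It is not the benchmark code shipped with the project: `measure_throughput` is a hypothetical helper, and the byte-level stand-in encoder is only there so the snippet runs without any tokenizer installed. Swap in whichever `encode` callable you want to compare.

```python
import time


def measure_throughput(encode, text, repeats=3):
    """Return best-of-N throughput in MB/s for a single encode(text) call.

    Hypothetical helper; `encode` is any callable taking a str.
    """
    n_bytes = len(text.encode("utf-8"))
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        encode(text)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading on very coarse timers.
    return n_bytes / best / 1e6 if best > 0 else float("inf")


if __name__ == "__main__":
    sample = "The quick brown fox jumps over the lazy dog. " * 20000
    # Stand-in encoder (assumption): one "token" per UTF-8 byte.
    byte_encoder = lambda s: list(s.encode("utf-8"))
    print(f"{measure_throughput(byte_encoder, sample):.1f} MB/s")
```

Taking the best of several runs reduces noise from caches and scheduling. For a fair comparison, both tokenizers should see the same input and the same thread count.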
Comments URL: https://news.ycombinator.com/item?id=44422480