TokenDagger is a drop-in replacement for OpenAI’s Tiktoken (the tokenizer behind Llama 3, Mistral, GPT-3.*, etc.). It’s written in C++17 with thin Python bindings, keeps the exact same BPE vocab/special-token rules, and focuses on raw speed.
I’m teaching myself LLM internals by re-implementing the stack from first principles. Profiling Tiktoken’s Python/Rust implementation showed that a lot of time was spent on regex matching. Most of my perf gains come from (a) using a faster JIT-compiled regex engine and (b) simplifying the algorithm to forgo regex matching of special tokens entirely.
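To make point (b) concrete, here is a minimal Python sketch of splitting text around special tokens with plain substring search instead of a regex. It is not TokenDagger's actual code: the function name `split_on_special_tokens` and its return shape are assumptions made for illustration, and the real implementation is in C++.

```python
def split_on_special_tokens(text, special_tokens):
    """Split text into (segment, is_special) pairs without using regex.

    Hypothetical helper for illustration only; not TokenDagger's API.
    """
    # Ignore empty strings so the scan below always advances.
    special_tokens = [tok for tok in special_tokens if tok]
    pieces = []
    pos = 0
    while pos < len(text):
        # Find the earliest special token at or after pos.
        # On a tie, prefer the longest token.
        best_idx, best_tok = -1, None
        for tok in special_tokens:
            idx = text.find(tok, pos)
            if idx == -1:
                continue
            if (best_tok is None or idx < best_idx
                    or (idx == best_idx and len(tok) > len(best_tok))):
                best_idx, best_tok = idx, tok
        if best_tok is None:
            # No special tokens remain: the rest is ordinary text.
            pieces.append((text[pos:], False))
            break
        if best_idx > pos:
            pieces.append((text[pos:best_idx], False))
        pieces.append((best_tok, True))
        pos = best_idx + len(best_tok)
    return pieces


parts = split_on_special_tokens("hi<|endoftext|>there", ["<|endoftext|>"])
```

Only the ordinary-text segments would then go through the pre-tokenization regex and BPE merges, while each special segment maps directly to its token id.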
Benchmarking code is included. Notable results:
- 4x faster code sample tokenization on a single thread.
- 2-3x higher throughput when tested on a 1GB natural language text file.
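For readers who want to sanity-check throughput numbers like these on their own data, a rough single-thread timing harness might look like the sketch below. It is not the benchmarking code shipped with the project: `measure_throughput` is a hypothetical helper, and `str.split` is only a placeholder for a real tokenizer's encode function.

```python
import time


def measure_throughput(encode, text, repeats=3):
    """Return best-of-N throughput in MB/s for a single encode(text) call.

    Hypothetical helper for illustration; `encode` is any callable
    that takes a str.
    """
    n_bytes = len(text.encode("utf-8"))
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        encode(text)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading from the timer on very small inputs.
    return n_bytes / max(best, 1e-9) / 1e6


# Placeholder encoder: swap in the tokenizer under test.
sample = "hello world " * 100_000
mb_per_s = measure_throughput(str.split, sample)
print(f"{mb_per_s:.1f} MB/s")
```

Taking the best of several runs reduces noise from caches and scheduling. Comparing two tokenizers fairly also means feeding both the same input and vocabulary.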
Comments URL: https://news.ycombinator.com/item?id=44422480
Points: 3
# Comments: 0