Show HN: TokenDagger – A tokenizer 2-4x faster than OpenAI's Tiktoken

TokenDagger is a drop-in replacement for OpenAI's Tiktoken (the tokenizer behind Llama 3, Mistral, GPT-3.*, etc.). It's written in C++17 with thin Python bindings, keeps the exact same BPE vocab/special-token rules, and focuses on raw speed.

I'm teaching myself LLM internals by re-implementing the stack from first principles. Profiling Tiktoken's Python/Rust implementation showed that a lot of time was spent on regex matching. Most of my perf gains come from a) using a faster JIT-compiled regex engine; and b) simplifying the algorithm to forgo regex matching for special tokens entirely.
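To make point b) concrete, here is a minimal Python sketch of splitting text on special tokens with plain substring search instead of a regex. It is only an illustration of the general idea, not TokenDagger's actual C++ code: the function name `split_on_special_tokens` and the example token list are assumptions made up for this sketch.

```python
from typing import Iterator, Sequence, Tuple


def split_on_special_tokens(
    text: str, special_tokens: Sequence[str]
) -> Iterator[Tuple[str, bool]]:
    """Yield (segment, is_special) pairs using str.find, with no regex involved.

    Hypothetical helper for illustration; not part of TokenDagger or Tiktoken.
    """
    # Ignore empty strings, which would match everywhere and never advance.
    tokens = [t for t in special_tokens if t]
    pos = 0
    while pos < len(text):
        # Locate the earliest special token at or after pos.
        best_start, best_token = -1, ""
        for token in tokens:
            start = text.find(token, pos)
            if start == -1:
                continue
            # Prefer the earliest match; on a tie, prefer the longer token.
            if (
                best_start == -1
                or start < best_start
                or (start == best_start and len(token) > len(best_token))
            ):
                best_start, best_token = start, token
        if best_start == -1:
            # No special token left: the rest is ordinary text.
            yield text[pos:], False
            return
        if best_start > pos:
            yield text[pos:best_start], False
        yield best_token, True
        pos = best_start + len(best_token)


segments = list(
    split_on_special_tokens("hello<|endoftext|>world", ["<|endoftext|>"])
)
```

In a real tokenizer, the ordinary segments would then go through the regex pre-tokenizer and BPE merges, while each special token maps directly to its reserved ID. This sketch rescans for every token on each step, which keeps it short but is not how a tuned implementation would do it.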

Benchmarking code is included. Notable results:
- 4x faster code sample tokenization on a single thread.
- 2-3x higher throughput when tested on a 1GB natural language text file.
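For readers who want to sanity-check throughput numbers like these on their own data, a rough timing harness could look like the sketch below. It is not the benchmark shipped in the repository: `throughput_mb_per_s` is a hypothetical helper, and `byte_encoder` is a stand-in encoder used only so the example runs without any tokenizer installed.

```python
import time
from typing import Callable, List


def throughput_mb_per_s(
    encode: Callable[[str], List[int]], text: str, repeats: int = 3
) -> float:
    """Return the best observed throughput in MB/s over `repeats` runs.

    Hypothetical helper for illustration; pass any encode(text) callable.
    """
    size_mb = len(text.encode("utf-8")) / 1e6
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        encode(text)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero elapsed time on very small inputs.
    return size_mb / max(best, 1e-9)


def byte_encoder(s: str) -> List[int]:
    """Stand-in 'tokenizer' that returns raw UTF-8 byte values."""
    return list(s.encode("utf-8"))


sample = "The quick brown fox jumps over the lazy dog. " * 2000
rate = throughput_mb_per_s(byte_encoder, sample)
```

To compare two tokenizers, call the helper once per encoder on the same text and divide the two rates. Taking the best of several runs reduces noise from caching and scheduling, but single-machine numbers remain only indicative.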


Comments URL: https://news.ycombinator.com/item?id=44422480

Points: 3

# Comments: 0

https://github.com/M4THYOU/TokenDagger

Created 1d ago | 30. 6. 2025, 12:50:06


