
perf(unigram): pre-size token map and replace per-node HashMap with Vec #2039

Open
taeyun16 wants to merge 2 commits into huggingface:main from taeyun16:perf/unigram-trie-vec

Conversation

@taeyun16

PR draft: huggingface/tokenizers — reduce Unigram loader heap by 39%

Target: huggingface/tokenizers (Rust crate, v0.22.2 / main)
Files: tokenizers/src/models/unigram/{model.rs, trie.rs}
Patch: .notes/tokenizers-pr.patch
Author: Taeyun Jang <taeyun16@pm.me>

Title

perf(unigram): pre-size token map and replace per-node HashMap with Vec

Body

While profiling Unigram::from for the 500 353-vocab
minishlab/potion-multilingual-128M tokenizer, two allocation sites
showed up as dominant in a dhat heap profile.

PR #1799 (v0.22.0, "Consolidated optimization ahash dary compact str")
already swapped the std HashMaps in this code path to
ahash::AHashMap, which addressed the hasher cost. The remaining
heap pressure is structural — the per-node AHashMap allocations
themselves and the missing capacity hint at the call site.

Issue 1 — token_to_ids: AHashMap::new()

models/unigram/model.rs:102 constructs token_to_ids with no
capacity hint, then immediately inserts vocab.len() entries:

let n = vocab.len();
let mut token_to_ids: TokenMap = AHashMap::new(); // no capacity hint
...
for (id, (token, score)) in vocab.iter().enumerate() {
    token_to_ids.insert(token.to_string(), id as u32); // n inserts follow
    builder.push(token.as_bytes());
}

For a 500 k-vocab model that's ~17 doubling rehashes; each rehash
allocates a fresh table, copies every entry over, and frees the
old table. dhat attributes ~30 MB of total allocations to this
single site — all redundant since n is already in scope.

Fix: AHashMap::with_capacity(n).

Issue 2 — Trie<Label> Node::children: AHashMap<Label, Node<Label>>

models/unigram/trie.rs stores an AHashMap on every trie node.
For the 500 k-vocab tokenizer this materializes ~2.6 M
empty/near-empty hashbrown tables (one per node, plus per-entry
buckets). dhat shows this as the single largest live-byte source
at peak: ~210 MB across millions of small blocks on v0.22.2.

But trie nodes have very low fan-out — the only "wide" node is the
root (≤ alphabet size, in practice ≤ 256 for byte-level tries),
and interior nodes typically have 1–4 children. At those sizes a
linear scan over a packed Vec<(Label, Node)> beats a HashMap on
both axes:

                         AHashMap                           Vec
Empty-node footprint     ~48 B (hashbrown header)           24 B (Vec header, no heap alloc)
4-entry node footprint   ~512 B (16-bucket table, padded)   ~104 B (4 entries)
Lookup at fan-out=4      hash + masking + branch            ≤4 byte compares
Lookup at fan-out=256    hash + masking + branch            ≤256 byte compares (cache-resident, branch-predictable)

Fix: children: Vec<(Label, Node<Label>)>, with linear scans in
push and TrieIterator::next. Public API of the trie is
unchanged.

Measurement

Profiled against v0.22.2 baseline. Workload: a 1 500-fact wiki
search bench using minishlab/potion-multilingual-128M, decoding
20 queries and embedding each.

dhat heap profile (release-with-debuginfo)

                        v0.22.2 baseline   both fixes applied   delta
Heap peak (At t-gmax)   540.2 MB           330.6 MB             -209 MB (-39%)
Heap total (Total)      1 161.7 MB         903.3 MB             -259 MB (-22%)
Blocks at gmax          2 640 084          2 640 084            same

Process-level (release build, peak_alloc + memory_stats)

Both columns measured on the same machine, same model, same workload:

                            v0.22.2 baseline   both fixes applied   delta
Process heap peak           515.2 MB           315.3 MB             -199.9 MB (-39%)
RSS peak                    840.8 MB           554.0 MB             -286.8 MB (-34%)
macOS phys_footprint peak   708.3 MB           421.4 MB             -286.9 MB (-41%)
Total CPU user time         1673.2 ms          1447.0 ms            -226 ms (-14%)
p50 search latency          23.27 ms           23.27 ms             same
p95 search latency          59.59 ms           59.32 ms             -0.3 ms (noise)

The CPU win is incidental — fewer allocator round-trips through
the global allocator and tighter inner loops in the trie scan.

For comparison, the same two fixes applied on v0.20.4 (pre-ahash)
landed at gmax 300 MB / total 874 MB. ahash made the hasher faster
but the per-node table overhead persisted; the per-node Vec
switch is what addresses that.

Test impact

cargo test -p tokenizers --lib unigram (20 tests) passes
unchanged on the patched build.

Encoding determinism verified externally via a Model2Vec round-trip:
cosine(encode_v0.22.2, encode_patched) > 0.9999 on real text.

Compatibility

  • Public API: unchanged.
  • Behavior: identical (deterministic encode produces same vectors).
  • Compile time: Vec is Default, no extra trait bounds needed.
  • New deps: none. The ahash import in trie.rs is removed since the
    trie itself no longer references it; the rest of the crate still
    uses ahash::AHashMap.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

While profiling Unigram::from for the 500 353-vocab
minishlab/potion-multilingual-128M tokenizer, two allocation
sites showed up as dominant in a dhat heap profile. PR huggingface#1799
already swapped the std HashMaps in this code path to
ahash::AHashMap, which addressed hasher cost; the remaining
heap pressure is structural.

1) models/unigram/model.rs: AHashMap::new() built without a
   capacity hint despite vocab.len() being known. Replaced
   with AHashMap::with_capacity(n) so the 500k-vocab map skips
   ~17 doubling rehashes on load.

2) models/unigram/trie.rs: Node::children switched from
   AHashMap<Label, Node> to Vec<(Label, Node)>. Trie nodes
   typically have 1–4 children; even the root maxes out at
   the alphabet size (≤256 for byte-level tries). At those
   fan-outs a packed Vec with linear scan is smaller and
   faster than a hashbrown table — and crucially, the empty
   Vec costs zero allocations vs ~48 B of hashbrown header
   per node × millions of nodes.

Measured on a 1500-fact wiki search bench using
minishlab/potion-multilingual-128M (decode 20 queries, embed
each), v0.22.2 base vs both fixes applied:

  Heap peak           515.2 MB → 315.3 MB    -39%
  RSS peak            840.8 MB → 554.0 MB    -34%
  phys_footprint      708.3 MB → 421.4 MB    -41%
  CPU user time        1673 ms → 1447 ms     -14%
  p50 latency          23.27 ms → 23.27 ms   no regression
  p95 latency          59.59 ms → 59.32 ms   noise
  dhat At t-gmax       540.2 MB → 330.6 MB   -39%

The CPU win is incidental — fewer allocator round-trips
through the global allocator, plus tighter inner loops in
the trie scan.

Public API unchanged; cargo test -p tokenizers --lib unigram
(20 tests) passes. Encoding determinism verified externally
via Model2Vec round-trip: cosine(encode_baseline, encode_patched)
> 0.9999 on real text.
@taeyun16 force-pushed the perf/unigram-trie-vec branch from 1b1e34f to b1436b3 on April 27, 2026 at 12:28