perf(unigram): pre-size token map and replace per-node HashMap with Vec #2039
Open

taeyun16 wants to merge 2 commits into huggingface:main
While profiling `Unigram::from` for the 500 353-vocab `minishlab/potion-multilingual-128M` tokenizer, two allocation sites showed up as dominant in a dhat heap profile. PR huggingface#1799 already swapped the std `HashMap`s in this code path to `ahash::AHashMap`, which addressed hasher cost; the remaining heap pressure is structural.

1. `models/unigram/model.rs`: `AHashMap::new()` is built without a capacity hint despite `vocab.len()` being known. Replaced with `AHashMap::with_capacity(n)` so the 500k-vocab map skips ~17 doubling rehashes on load.
2. `models/unigram/trie.rs`: `Node::children` switched from `AHashMap<Label, Node>` to `Vec<(Label, Node)>`. Trie nodes typically have 1–4 children; even the root maxes out at the alphabet size (≤256 for byte-level tries). At those fan-outs a packed Vec with linear scan is smaller and faster than a hashbrown table, and crucially, the empty Vec costs zero allocations vs ~48 B of hashbrown header per node × millions of nodes.

Measured on a 1500-fact wiki search bench using `minishlab/potion-multilingual-128M` (decode 20 queries, embed each), v0.22.2 base vs both fixes applied:

| Metric | v0.22.2 | Patched | Δ |
| --- | --- | --- | --- |
| Heap peak | 515.2 MB | 315.3 MB | -39% |
| RSS peak | 840.8 MB | 554.0 MB | -34% |
| phys_footprint | 708.3 MB | 421.4 MB | -41% |
| CPU user time | 1673 ms | 1447 ms | -14% |
| p50 latency | 23.27 ms | 23.27 ms | no regression |
| p95 latency | 59.59 ms | 59.32 ms | noise |
| dhat at t-gmax | 540.2 MB | 330.6 MB | -39% |

The CPU win is incidental: fewer allocator round-trips through the global allocator, plus tighter inner loops in the trie scan. Public API unchanged; `cargo test -p tokenizers --lib unigram` (20 tests) passes. Encoding determinism verified externally via Model2Vec round-trip: cosine(encode_baseline, encode_patched) > 0.9999 on real text.
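The "~17 doubling rehashes" figure can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes hashbrown-style growth (bucket count doubles, usable capacity ≈ 7/8 of the bucket count); the initial bucket count is an assumption, so the result is an estimate, not a measurement:

```rust
// Estimate how many doubling rehashes an initially-unsized hashbrown
// table performs before it can hold `entries` items, assuming a 7/8
// maximum load factor. The starting bucket count is a guess.
fn doublings_to_hold(entries: usize, mut buckets: usize) -> u32 {
    let mut doublings = 0;
    while buckets * 7 / 8 < entries {
        buckets *= 2;
        doublings += 1;
    }
    doublings
}

fn main() {
    let vocab = 500_353;
    for start in [4usize, 8, 16] {
        println!(
            "start={start:2} buckets -> {} doublings",
            doublings_to_hold(vocab, start)
        );
    }
    // Starting from 4 buckets: 18 doublings; from 8: 17 — in line with
    // the ~17 rehashes quoted above.
    assert_eq!(doublings_to_hold(vocab, 8), 17);
}
```

Each of those doublings allocates a new table and re-inserts every live entry, which is exactly the redundant work `with_capacity` eliminates.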
PR draft: huggingface/tokenizers — reduce Unigram loader heap by 39%

Target: `huggingface/tokenizers` (Rust crate, v0.22.2 / main)
Files: `tokenizers/src/models/unigram/{model.rs, trie.rs}`
Patch: `.notes/tokenizers-pr.patch`
Author: Taeyun Jang <taeyun16@pm.me>

Title

perf(unigram): pre-size token map and replace per-node HashMap with Vec

Body
While profiling `Unigram::from` for the 500 353-vocab `minishlab/potion-multilingual-128M` tokenizer, two allocation sites showed up as dominant in a dhat heap profile. PR #1799 (v0.22.0, "Consolidated optimization ahash dary compact str") already swapped the std `HashMap`s in this code path to `ahash::AHashMap`, which addressed the hasher cost. The remaining heap pressure is structural: the per-node `AHashMap` allocations themselves and the missing capacity hint at the call site.
Issue 1 — `token_to_ids: AHashMap::new()`

`models/unigram/model.rs:102` constructs `token_to_ids` with no capacity hint, then immediately inserts `vocab.len()` entries. For a 500 k-vocab model that's ~17 doubling rehashes; each rehash allocates a fresh table, copies every entry over, and frees the old table. dhat attributes ~30 MB of total allocations to this single site — all redundant since `n` is already in scope.

Fix: `AHashMap::with_capacity(n)`.
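A minimal sketch of the Issue-1 fix, using std `HashMap` as a stand-in for `ahash::AHashMap`; the function name and simplified `(token, score)` vocab type are illustrative, not the crate's actual signatures:

```rust
use std::collections::HashMap;

// Sketch of the pre-sizing fix (simplified; upstream uses ahash::AHashMap
// inside Unigram::from). Pre-sizing with the known vocab length means one
// allocation up front instead of a chain of doubling rehashes.
fn build_token_map(vocab: &[(String, f64)]) -> HashMap<String, u32> {
    let n = vocab.len();
    // Before: HashMap::new() — grows (and rehashes) repeatedly while the
    // build loop inserts ~500k entries.
    // After: sized for n entries from the start.
    let mut token_to_ids = HashMap::with_capacity(n);
    for (id, (token, _score)) in vocab.iter().enumerate() {
        token_to_ids.insert(token.clone(), id as u32);
    }
    token_to_ids
}

fn main() {
    let vocab: Vec<(String, f64)> = (0..10_000)
        .map(|i| (format!("tok{i}"), -1.0))
        .collect();
    let map = build_token_map(&vocab);
    assert_eq!(map.len(), 10_000);
    assert_eq!(map["tok42"], 42);
}
```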
Issue 2 — `Trie<Label>` `Node::children: AHashMap<Label, Node<Label>>`

`models/unigram/trie.rs` stores an `AHashMap` on every trie node. For the 500 k-vocab tokenizer this materializes ~2.6 M empty/near-empty hashbrown tables (one per node, plus per-entry buckets). dhat shows this as the single largest live-byte source at peak: ~210 MB across millions of small blocks on v0.22.2.

But trie nodes have very low fan-out — the only "wide" node is the root (≤ alphabet size, in practice ≤ 256 for byte-level tries), and interior nodes typically have 1–4 children. At those sizes a linear scan over a packed `Vec<(Label, Node)>` beats a HashMap on both axes: smaller footprint and faster lookups.

Fix: `children: Vec<(Label, Node<Label>)>`, with linear scans in `push` and `TrieIterator::next`. Public API of the trie is unchanged.
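A self-contained sketch of the Vec-children shape. This is simplified (`u8` labels, no iterator, hypothetical method names); the real `Trie`/`Node` types in `trie.rs` differ:

```rust
// Sketch of the Issue-2 shape change: children live in a packed Vec and
// are found by linear scan, which wins at fan-outs of 1–4 and costs zero
// heap allocations for leaf nodes (an empty Vec never allocates).
#[derive(Default)]
struct Node {
    is_leaf: bool,
    children: Vec<(u8, Node)>, // was: AHashMap<u8, Node>
}

impl Node {
    // Find or create the child carrying `label` (linear scan).
    fn child_mut(&mut self, label: u8) -> &mut Node {
        if let Some(i) = self.children.iter().position(|(l, _)| *l == label) {
            return &mut self.children[i].1;
        }
        self.children.push((label, Node::default()));
        let last = self.children.len() - 1;
        &mut self.children[last].1
    }

    fn push(&mut self, key: &[u8]) {
        let mut node = self;
        for &b in key {
            node = node.child_mut(b);
        }
        node.is_leaf = true;
    }

    fn contains(&self, key: &[u8]) -> bool {
        let mut node = self;
        for &b in key {
            match node.children.iter().find(|(l, _)| *l == b) {
                Some((_, child)) => node = child,
                None => return false,
            }
        }
        node.is_leaf
    }
}

fn main() {
    let mut root = Node::default();
    root.push(b"low");
    root.push(b"lower");
    assert!(root.contains(b"low"));
    assert!(root.contains(b"lower"));
    assert!(!root.contains(b"lo"));
}
```

Since `Vec` implements `Default`, the node type derives `Default` exactly as a HashMap-backed one would, so no call sites need new trait bounds.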
Measurement

Profiled against the v0.22.2 baseline. Workload: a 1 500-fact wiki search bench using `minishlab/potion-multilingual-128M`, decoding 20 queries and embedding each.

dhat heap profile (release-with-debuginfo):

| | v0.22.2 | Patched | Δ |
| --- | --- | --- | --- |
| At t-gmax | 540.2 MB | 330.6 MB | -39% |

Process-level (release build, peak_alloc + memory_stats), both columns measured on the same machine, same model, same workload:

| Metric | v0.22.2 | Patched | Δ |
| --- | --- | --- | --- |
| Heap peak | 515.2 MB | 315.3 MB | -39% |
| RSS peak | 840.8 MB | 554.0 MB | -34% |
| phys_footprint peak | 708.3 MB | 421.4 MB | -41% |
| CPU user time | 1673 ms | 1447 ms | -14% |
| p50 latency | 23.27 ms | 23.27 ms | no regression |
| p95 latency | 59.59 ms | 59.32 ms | noise |

The CPU win is incidental — fewer allocator round-trips through the global allocator and tighter inner loops in the trie scan.
For comparison, the same two fixes applied on v0.20.4 (pre-ahash)
landed at gmax 300 MB / total 874 MB. ahash made the hasher faster
but the per-node table overhead persisted; the per-node Vec
switch is what addresses that.
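For reference, a "Heap peak" number like the one in the process-level table can be collected with a peak_alloc-style counting allocator. The sketch below is an illustrative std-only version, not the `peak_alloc` crate's actual code:

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Minimal counting allocator in the spirit of the peak_alloc crate:
// wrap the system allocator and track live and peak heap bytes.
struct CountingAlloc;

static LIVE: AtomicUsize = AtomicUsize::new(0);
static PEAK: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let p = unsafe { System.alloc(layout) };
        if !p.is_null() {
            let live = LIVE.fetch_add(layout.size(), Ordering::Relaxed) + layout.size();
            PEAK.fetch_max(live, Ordering::Relaxed);
        }
        p
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        LIVE.fetch_sub(layout.size(), Ordering::Relaxed);
        unsafe { System.dealloc(ptr, layout) };
    }
}

#[global_allocator]
static A: CountingAlloc = CountingAlloc;

fn main() {
    // Any workload here; peak tracks the high-water mark of live bytes.
    let v: Vec<u64> = (0..1_000).collect();
    drop(v);
    println!("peak heap >= {} bytes", PEAK.load(Ordering::Relaxed));
    assert!(PEAK.load(Ordering::Relaxed) >= 8_000);
}
```

RSS and phys_footprint, by contrast, come from the OS (the memory_stats crate reads them per process), which is why they sit above the allocator-level heap numbers.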
Test impact

`cargo test -p tokenizers --lib unigram` (20 tests) passes unchanged on the patched build.

Encoding determinism verified externally: `cosine(encode_v0.22.2, encode_patched) > 0.9999` on real text via Model2Vec round-trip.

Compatibility
`Vec` is `Default`, so no extra trait bounds are needed. `Trie::push` no longer needs `ahash::AHashMap`; the `use ahash::…` import in `trie.rs` is removed since the trie itself no longer references it, while the rest of the tree keeps using ahash elsewhere.