Pull requests: huggingface/tokenizers
Pull requests list
perf(unigram): pre-size token map and replace per-node HashMap with Vec
#2039
opened Apr 26, 2026 by
taeyun16
feat(ByteLevel): skip per-byte transform for printable-ASCII tokens
#2038
opened Apr 26, 2026 by
KimBioInfoStudio
2 of 3 tasks
feat(NFC): skip Unicode pass for all-ASCII inputs
#2037
opened Apr 26, 2026 by
KimBioInfoStudio
2 of 3 tasks
feat: SIMD ASCII fast path for Lowercase normalizer (~30-49x)
#2036
opened Apr 26, 2026 by
KimBioInfoStudio
5 of 6 tasks
perf(byte_level): port GPT-2 split regex to logos FSM (−22% on GPT-2 encode)
#2031
opened Apr 23, 2026 by
ArthurZucker
Collaborator
Real-world (batch × input_length) tokenizer benchmark + cross-library leaderboard
#2030
opened Apr 23, 2026 by
ArthurZucker
Collaborator
4 tasks
Batch encode: lock-free work queue with dynamic window sizing
#2029
opened Apr 23, 2026 by
sebpop
Contributor
perf: skip alignment tracking in encode_fast normalization
#2022
opened Apr 10, 2026 by
ArthurZucker
Collaborator
feat: Normalizer::normalize_str — skip NormalizedString allocation
#2020
opened Apr 10, 2026 by
ArthurZucker
Collaborator
feat: open-addressing merge table with cache-line-local linear probing for BPE
#2012
opened Apr 8, 2026 by
ArthurZucker
Collaborator
feat: compact vocabulary — single-allocation id→token store for BPE
#2011
opened Apr 8, 2026 by
ArthurZucker
Collaborator
feat(pattern): parallel regex find_matches for large inputs
#2003
opened Mar 31, 2026 by
McPatate
Member
fix: skip serializing ByteLevel fields at their default value
#2001
opened Mar 30, 2026 by
ArthurZucker
Collaborator
feat: performance, adding pcre2 backend + regex-shards (5-15% speedup)
#1968
opened Mar 19, 2026 by
michaelfeil
Contributor
feat: Optimize BPE tokenization: sharded cache, packed merge keys, FxHash (10-15% speedup)
#1967
opened Mar 19, 2026 by
michaelfeil
Contributor
Fix type_ids not applied to overflow encodings
#1965
opened Mar 17, 2026 by
joaquinhuigomez
Add get_special_tokens and is_special_token methods
#1945
opened Feb 5, 2026 by
ArthurZucker
Collaborator
2 tasks done
Add post_process_tokens and post_process_ids methods
#1944
opened Feb 5, 2026 by
ArthurZucker
Collaborator
3 tasks done