feat(NFC): skip Unicode pass for all-ASCII inputs by KimBioInfoStudio · Pull Request #2037 · huggingface/tokenizers

KimBioInfoStudio · 2026-04-26T16:42:10Z

Summary

Adds a single-line ASCII gate to the NFC normalizer: when `NormalizedString::get().is_ascii()` is true, the normalizer returns immediately, skipping the per-`char` `unicode-normalization-alignments` pass and the `transform()` rebuild it triggers.
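The shape of the gate can be sketched without the crate. This is a minimal illustration, not the code in this PR: `normalize_with_ascii_gate` and its `slow_path` closure are hypothetical stand-ins for the normalizer entry point and the existing Unicode pass, and a plain `String` stands in for `NormalizedString`.

```rust
/// Hypothetical stand-in for the normalizer entry point.
/// `slow_path` plays the role of the existing Unicode pass.
/// Returns `true` when the slow path actually ran.
pub fn normalize_with_ascii_gate<F>(text: &mut String, slow_path: F) -> bool
where
    F: FnOnce(&mut String),
{
    // `str::is_ascii` only scans bytes; it does not decode chars.
    if text.is_ascii() {
        // All-ASCII input is already normalized, so there is nothing to do.
        return false;
    }
    // Any non-ASCII byte: fall through to the unchanged original path.
    slow_path(text);
    true
}
```

With this shape, an all-ASCII string never reaches the closure, while a string such as `"cafe\u{0301}"` is handed to it unchanged.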

Why this is safe

ASCII (U+0000..=U+007F) is NFC by Unicode invariant:

- no ASCII code point has a Decomposition_Mapping
- no two ASCII code points compose with each other (an ASCII base can only combine with a non-ASCII mark, and any such input fails the gate)
- no ASCII code point is the result of any composition

Running the existing NFC code on an all-ASCII `NormalizedString` therefore rebuilds `normalized` and `alignments` to the exact same bytes and tuples it already had; the gate just elides that work.

The gate is strictly conservative: any non-ASCII byte in the input falls through to the original code path with zero changes. Combining-mark sequences (e.g. e + U+0301 → é), CJK, Arabic, Cyrillic, Vietnamese, Hindi, Thai, etc. all keep their existing behavior.
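To make the fall-through behavior concrete, here is a small sketch of the predicate at byte level. The helper name `takes_fast_path` is hypothetical; the PR itself calls `is_ascii()` on the string, and this function is only meant to be equivalent to that check.

```rust
/// Hypothetical helper: byte-level form of the gate's predicate.
/// True only when every byte is in 0x00..=0x7F, which matches `str::is_ascii`.
pub fn takes_fast_path(text: &str) -> bool {
    text.bytes().all(|b| b <= 0x7F)
}
```

Under this predicate `"Hello, world!"` takes the fast path, while `"e\u{0301}"` does not: the base letter is ASCII, but the combining mark is encoded with bytes above 0x7F, so the whole string goes through the original code.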

Tests

Two unit tests in `normalizers::unicode::tests` pin this:

- `nfc_ascii_fast_path_is_no_op`: runs NFKD on ﬀ first to produce all-ASCII text with non-trivial alignments (each output byte still maps back to the 3-byte ligature), then asserts `NFC.normalize` leaves the entire `NormalizedString` byte-identical.
- `nfc_non_ascii_still_runs_unicode_path`: checks that e + combining acute is still composed to é.
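The equivalence that the first test pins can be illustrated with a toy model. Everything below is an assumption made for illustration: `ToyNormalized` and `rebuild_identity` are hypothetical and much simpler than the crate's `NormalizedString`, and the alignment layout (one original byte range per normalized byte) is taken from the test description above.

```rust
/// Toy stand-in for a normalized string with alignments (hypothetical type).
#[derive(Clone, Debug, PartialEq)]
pub struct ToyNormalized {
    pub normalized: String,
    /// One (start, end) byte range into the original text per byte of `normalized`.
    pub alignments: Vec<(usize, usize)>,
}

/// Rebuild both fields by emitting every char unchanged, which is what a
/// normalization pass amounts to when no char decomposes or composes.
pub fn rebuild_identity(input: &ToyNormalized) -> ToyNormalized {
    let mut normalized = String::with_capacity(input.normalized.len());
    let mut alignments = Vec::with_capacity(input.alignments.len());
    for (byte_idx, ch) in input.normalized.char_indices() {
        // The emitted char inherits the alignment of the char it replaces.
        let align = input.alignments[byte_idx];
        normalized.push(ch);
        // One alignment entry per UTF-8 byte of the emitted char.
        alignments.extend(std::iter::repeat(align).take(ch.len_utf8()));
    }
    ToyNormalized { normalized, alignments }
}
```

Starting from the state described in the test (`normalized` equal to `"ff"`, both bytes aligned to `(0, 3)`), `rebuild_identity` returns a value equal to its input, which is the work the gate skips.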

Test plan

- `cargo test --lib` (202 tests pass)
- New equivalence + non-ASCII regression tests
- CI on x86_64 / Linux / macOS

Notes

This is a minimal-risk performance gate — no new dependencies, no unsafe, no SIMD. Companion to #2036 (SIMD ASCII fast path for Lowercase), but fully independent of it.

Same-shape gates may apply to NFD, NFKC, NFKD (ASCII is also already NFD/NFKD/NFKC by the same invariant); happy to extend in a follow-up if maintainers prefer. Kept minimal here for review clarity.

🤖 Generated with Claude Code

ASCII (U+0000..=U+007F) is NFC by Unicode invariant: none of those code points have a `Decomposition_Mapping`, none combine with adjacent characters, and none are the target of any composition. Running the full `unicode-normalization-alignments` pass on an all-ASCII `NormalizedString` therefore rebuilds `normalized` and `alignments` to the exact same bytes and tuples it already had. We can return early and save the iterator allocation, the per-`char` UTF-8 decode loop, and the `transform` rebuild.

The gate is conservative: any non-ASCII byte in the input falls through to the original code path with zero changes, so combining-mark sequences, CJK, Arabic, Cyrillic, Vietnamese, etc. are unaffected.

Two unit tests pin the contract:

- `nfc_ascii_fast_path_is_no_op`: runs NFKD on `ﬀ` (producing all-ASCII text with non-trivial alignments) and asserts NFC leaves the entire `NormalizedString` byte-identical.
- `nfc_non_ascii_still_runs_unicode_path`: checks that "e" + combining acute is still composed to "é".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

HuggingFaceDocBuilderDev · 2026-04-27T06:04:30Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

KimBioInfoStudio mentioned this pull request Apr 26, 2026

feat(ByteLevel): skip per-byte transform for printable-ASCII tokens #2038

Open

3 tasks

Merge branch 'main' into feat/nfc-ascii-fast-path

e88a5c0

KimBioInfoStudio wants to merge 2 commits into huggingface:main from KimBioInfoStudio:feat/nfc-ascii-fast-path


2 participants
