feat(NFC): skip Unicode pass for all-ASCII inputs#2037

Open
KimBioInfoStudio wants to merge 2 commits into huggingface:main from KimBioInfoStudio:feat/nfc-ascii-fast-path

@KimBioInfoStudio

Summary

Adds a single-line ASCII gate to the NFC normalizer: when NormalizedString::get().is_ascii() is true the normalizer returns immediately, skipping the per-char unicode-normalization-alignments pass and the transform() rebuild it triggers.
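
For reviewers who want the shape of the gate without opening the diff, here is a minimal sketch. `ToyNormalized` and `nfc_with_gate` are hypothetical stand-ins, not the tokenizers types, and the slow path is passed in as a closure instead of calling `unicode-normalization-alignments`.

```rust
/// Hypothetical stand-in for `NormalizedString`: the normalized text plus
/// one (start, end) alignment per byte back into the original input.
#[derive(Debug, Clone, PartialEq)]
struct ToyNormalized {
    normalized: String,
    alignments: Vec<(usize, usize)>,
}

impl ToyNormalized {
    /// Plays the role of `NormalizedString::get()`: borrow the current text.
    fn get(&self) -> &str {
        &self.normalized
    }
}

/// Runs `slow_path` only when the text contains a non-ASCII byte.
/// Returns `true` if the slow path ran, `false` if the gate short-circuited.
fn nfc_with_gate<F>(n: &mut ToyNormalized, slow_path: F) -> bool
where
    F: FnOnce(&mut ToyNormalized),
{
    if n.get().is_ascii() {
        // All-ASCII text is already NFC, so there is nothing to rebuild.
        return false;
    }
    slow_path(n);
    true
}
```

The boolean return value exists only to make the sketch observable; the gate described above simply returns early.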

Why this is safe

ASCII (U+0000..=U+007F) is NFC by Unicode invariant:

  • no ASCII code point has a Decomposition_Mapping
  • no ASCII code point composes with another ASCII code point (every combining mark is non-ASCII)
  • no ASCII code point is the target of any composition

Running the existing NFC code on an all-ASCII NormalizedString therefore rebuilds normalized and alignments to the exact same bytes and tuples it already had — the gate just elides that work.

The gate is strictly conservative: any non-ASCII byte in the input falls through to the original code path with zero changes. Combining-mark sequences (e.g. e + U+0301 → é), CJK, Arabic, Cyrillic, Vietnamese, Hindi, Thai, etc. all keep their existing behavior.
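
The "any non-ASCII byte falls through" claim rests on a UTF-8 property: every code point above U+007F is encoded entirely with bytes in 0x80..=0xFF, so a byte-level scan cannot misclassify a string. A small sketch of that check follows; the helper name `takes_fast_path` is illustrative only and does not appear in the patch.

```rust
/// Illustrative helper (not part of the patch): true when every byte is in
/// the ASCII range, which is the condition `str::is_ascii` tests.
fn takes_fast_path(input: &str) -> bool {
    input.bytes().all(|b| b <= 0x7F)
}

/// Convenience check that the byte scan agrees with the standard library.
fn agrees_with_std(input: &str) -> bool {
    takes_fast_path(input) == input.is_ascii()
}
```

Under this sketch, `"hello, world"` takes the fast path, while `"e\u{0301}"` and CJK text do not, because the combining mark and each CJK character contribute bytes of 0x80 or above.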

Tests

Two unit tests in normalizers::unicode::tests pin this:

  • nfc_ascii_fast_path_is_no_op — runs NFKD on the ﬀ ligature (U+FB00) first to produce all-ASCII text with non-trivial alignments (each output byte still maps back to the 3-byte ligature), then asserts NFC.normalize leaves the entire NormalizedString byte-identical.
  • nfc_non_ascii_still_runs_unicode_path — checks that e + combining acute is still composed to é.
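
The contract these two tests pin can be modeled with a toy normalizer. This is a sketch under stated assumptions, not the tests in the patch: `ToyNormalized` and `toy_nfc` are hypothetical, the slow path composes only `e` + U+0301, and the alignment tuples are assumed values chosen to mirror the description above.

```rust
/// Hypothetical model of a normalized string: text plus one
/// (start, end) alignment per byte back into the original input.
#[derive(Debug, Clone, PartialEq)]
struct ToyNormalized {
    normalized: String,
    alignments: Vec<(usize, usize)>,
}

/// Toy NFC with the ASCII gate. The slow path here handles a single
/// sequence; a real NFC pass covers all of Unicode.
fn toy_nfc(n: &mut ToyNormalized) {
    if n.normalized.is_ascii() {
        return; // fast path: leave text and alignments untouched
    }
    if n.normalized == "e\u{0301}" {
        n.normalized = "\u{00e9}".to_string();
        // Assumed: both bytes of the composed char map to the 3 source bytes.
        n.alignments = vec![(0, 3); 2];
    }
}

/// Models the state after NFKD on the 3-byte ligature: "ff", with both
/// bytes assumed to map back to bytes 0..3 of the original.
fn fast_path_is_no_op() -> bool {
    let mut n = ToyNormalized {
        normalized: "ff".to_string(),
        alignments: vec![(0, 3), (0, 3)],
    };
    let before = n.clone();
    toy_nfc(&mut n);
    n == before
}

/// "e" followed by a combining acute accent must still be composed.
fn non_ascii_still_composes() -> bool {
    let mut n = ToyNormalized {
        normalized: "e\u{0301}".to_string(),
        alignments: vec![(0, 1), (1, 3), (1, 3)],
    };
    toy_nfc(&mut n);
    n.normalized == "\u{00e9}"
}
```

Comparing the whole struct, not just the text, is the point of the first check: a gate that preserved the bytes but reset the alignments would fail it.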

Test plan

  • cargo test --lib (202 tests pass)
  • New equivalence + non-ASCII regression tests
  • CI on x86_64 / Linux / macOS

Notes

This is a minimal-risk performance gate — no new dependencies, no unsafe, no SIMD. Companion to #2036 (SIMD ASCII fast path for Lowercase), but fully independent of it.

Same-shape gates may apply to NFD, NFKC, NFKD (ASCII is also already NFD/NFKD/NFKC by the same invariant); happy to extend in a follow-up if maintainers prefer. Kept minimal here for review clarity.

🤖 Generated with Claude Code

ASCII (U+0000..=U+007F) is NFC by Unicode invariant — none of those code
points have a `Decomposition_Mapping`, none combine with adjacent
characters, and none are the target of any composition. Running the full
`unicode-normalization-alignments` pass on an all-ASCII `NormalizedString`
therefore rebuilds `normalized` and `alignments` to the exact same bytes
and tuples it already had. We can return early and save the iterator
allocation, the per-`char` UTF-8 decode loop, and the `transform` rebuild.

The gate is conservative: any non-ASCII byte in the input falls through
to the original code path with zero changes, so combining-mark sequences,
CJK, Arabic, Cyrillic, Vietnamese, etc. are unaffected. Two unit tests
pin the contract:

  - `nfc_ascii_fast_path_is_no_op` — runs NFKD on the `ﬀ` ligature
    (producing all-ASCII text with non-trivial alignments) and asserts
    NFC leaves the entire `NormalizedString` byte-identical.
  - `nfc_non_ascii_still_runs_unicode_path` — checks that "e" + combining
    acute is still composed to "é".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.