feat(NFC): skip Unicode pass for all-ASCII inputs #2037
Open
KimBioInfoStudio wants to merge 2 commits into huggingface:main from
ASCII (U+0000..=U+007F) is NFC by Unicode invariant — none of those code
points have a `Decomposition_Mapping`, none combine with adjacent
characters, and none are the target of any composition. Running the full
`unicode-normalization-alignments` pass on an all-ASCII `NormalizedString`
therefore rebuilds `normalized` and `alignments` to the exact same bytes
and tuples it already had. We can return early and save the iterator
allocation, the per-`char` UTF-8 decode loop, and the `transform` rebuild.
The gate is conservative: any non-ASCII byte in the input falls through
to the original code path with zero changes, so combining-mark sequences,
CJK, Arabic, Cyrillic, Vietnamese, etc. are unaffected. Two unit tests
pin the contract:
- `nfc_ascii_fast_path_is_no_op`: runs NFKD on the `ﬀ` ligature (U+FB00),
producing all-ASCII text with non-trivial alignments, and asserts NFC
leaves the entire `NormalizedString` byte-identical.
- `nfc_non_ascii_still_runs_unicode_path`: checks that "e" + combining
acute is still composed to "é".
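
The contract those two tests pin can be illustrated without the crate. The sketch below is a toy model, not the PR's code: `ToyNormalized`, `toy_unicode_pass`, and `toy_nfc` are hypothetical names, the stand-in "Unicode pass" knows only the single composition `e` + U+0301 → `é`, and the per-byte `(start, end)` alignment layout is assumed from the description above.

```rust
// Toy stand-in for the crate's `NormalizedString`. The field names follow the
// PR text, but this type and every function below are hypothetical.
#[derive(Debug, Clone, PartialEq)]
struct ToyNormalized {
    normalized: String,
    // Assumed layout: one (start, end) byte range into the original text
    // per byte of `normalized`.
    alignments: Vec<(usize, usize)>,
}

// Stand-in for the full Unicode pass. It knows exactly one composition
// ("e" + U+0301 -> U+00E9) and rebuilds both fields from scratch.
fn toy_unicode_pass(s: &mut ToyNormalized) {
    let mut out = String::with_capacity(s.normalized.len());
    let mut out_align = Vec::with_capacity(s.alignments.len());
    let mut chars = s.normalized.char_indices().peekable();
    while let Some((i, c)) = chars.next() {
        if c == 'e' {
            if let Some(&(j, '\u{301}')) = chars.peek() {
                chars.next(); // consume the combining acute
                let end = j + '\u{301}'.len_utf8();
                // The composed character covers both source ranges.
                let merged = (s.alignments[i].0, s.alignments[end - 1].1);
                out.push('\u{e9}');
                out_align.extend(std::iter::repeat(merged).take('\u{e9}'.len_utf8()));
                continue;
            }
        }
        out.push(c);
        out_align.extend_from_slice(&s.alignments[i..i + c.len_utf8()]);
    }
    s.normalized = out;
    s.alignments = out_align;
}

// The gate: all-ASCII text is left untouched, everything else takes the
// original path unchanged.
fn toy_nfc(s: &mut ToyNormalized) {
    if s.normalized.is_ascii() {
        return;
    }
    toy_unicode_pass(s);
}
```

In this model, a value with `normalized == "ff"` and `alignments == [(0, 3), (0, 3)]` (both bytes pointing at a 3-byte ligature) comes back from `toy_nfc` unchanged, while `"e\u{301}"` is still composed to `"\u{e9}"`.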
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds a single-line ASCII gate to the `NFC` normalizer: when `NormalizedString::get().is_ascii()` is true the normalizer returns immediately, skipping the per-`char` `unicode-normalization-alignments` pass and the `transform()` rebuild it triggers.

Why this is safe
ASCII (U+0000..=U+007F) is NFC by Unicode invariant:
- no ASCII code point has a `Decomposition_Mapping`
- none combines with adjacent characters
- none is the target of any composition

Running the existing NFC code on an all-ASCII `NormalizedString` therefore rebuilds `normalized` and `alignments` to the exact same bytes and tuples it already had; the gate just elides that work.

The gate is strictly conservative: any non-ASCII byte in the input falls through to the original code path with zero changes. Combining-mark sequences (e.g. `e` + U+0301 → `é`), CJK, Arabic, Cyrillic, Vietnamese, Hindi, Thai, etc. all keep their existing behavior.

Tests
Two unit tests in `normalizers::unicode::tests` pin this:
- `nfc_ascii_fast_path_is_no_op`: runs `NFKD` on the `ﬀ` ligature (U+FB00) first to produce all-ASCII text with non-trivial alignments (each output byte still maps back to the 3-byte ligature), then asserts `NFC.normalize` leaves the entire `NormalizedString` byte-identical.
- `nfc_non_ascii_still_runs_unicode_path`: checks that `e` + combining acute is still composed to `é`.

Test plan
- `cargo test --lib` (202 tests pass)

Notes
This is a minimal-risk performance gate: no new dependencies, no `unsafe`, no SIMD. Companion to #2036 (SIMD ASCII fast path for `Lowercase`), but fully independent of it.

Same-shape gates may apply to `NFD`, `NFKC`, `NFKD` (ASCII is also already NFD/NFKD/NFKC by the same invariant); happy to extend in a follow-up if maintainers prefer. Kept minimal here for review clarity.

🤖 Generated with Claude Code
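
If the follow-up for `NFD`, `NFKC`, and `NFKD` goes ahead, one possible shape is a single helper that wraps whichever full pass applies. This is only a sketch on plain `&str`, not the crate's API: `with_ascii_gate` is a hypothetical name, and the `full_pass` closure stands in for the real normalization routine.

```rust
use std::borrow::Cow;

// Hypothetical helper, not part of the crate. It is only valid for passes
// whose output equals their input on all-ASCII text, which the PR states is
// the case for NFC, NFD, NFKC and NFKD.
fn with_ascii_gate<'a, F>(input: &'a str, full_pass: F) -> Cow<'a, str>
where
    F: FnOnce(&str) -> String,
{
    if input.is_ascii() {
        // Fast path: no allocation, no per-char decode, no rebuild.
        Cow::Borrowed(input)
    } else {
        // Any non-ASCII byte falls through to the untouched full pass.
        Cow::Owned(full_pass(input))
    }
}
```

Returning `Cow` makes the skipped work visible in the type: all-ASCII input comes back borrowed, and only non-ASCII input pays for a new `String`.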