feat: SIMD ASCII fast path for Lowercase normalizer (~30-49x)#2036

Open
KimBioInfoStudio wants to merge 4 commits into huggingface:main from KimBioInfoStudio:feat/simd-ascii-lower
Conversation

@KimBioInfoStudio KimBioInfoStudio commented Apr 26, 2026

Summary

  • Adds utils::simd::ascii_lower, a runtime-dispatched in-place ASCII lowercaser with AVX-512BW, AVX2, SSE2, NEON, and scalar back-ends.
  • Gates NormalizedString::lowercase on s.is_ascii() so all-ASCII inputs skip the per-char Unicode case-folding loop and the transform() alignments rebuild, mutating bytes in place instead.
  • Microbench shows ~30-49× over the previous Unicode chars() path on Apple/NEON, and up to ~117× on x86_64 AVX2 (Zen 4) at 16 KiB.
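The gate described above can be sketched as follows. This is a scalar stand-in, not the PR's actual code: the function name `lowercase_in_place` is hypothetical, the real change lives inside `NormalizedString::lowercase`, and the byte loop stands in for the dispatched SIMD kernel.

```rust
// Hypothetical sketch of the ASCII gate; the byte loop stands in for
// the runtime-dispatched utils::simd::ascii_lower kernel.
fn lowercase_in_place(s: &mut String) {
    if s.is_ascii() {
        // All-ASCII fast path: mutate bytes in place. OR-ing 0x20 into
        // A..=Z maps ASCII to ASCII, so the string stays valid UTF-8.
        // SAFETY: only ASCII bytes are rewritten to ASCII bytes.
        unsafe {
            for b in s.as_mut_vec() {
                if b.is_ascii_uppercase() {
                    *b |= 0x20;
                }
            }
        }
    } else {
        // Non-ASCII: the original per-char Unicode case-folding path.
        let lowered: String = s.chars().flat_map(char::to_lowercase).collect();
        *s = lowered;
    }
}
```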

Why this is safe

For any ASCII char, char::to_lowercase() returns exactly one ASCII char: b | 0x20 for A..=Z, and the byte unchanged otherwise (Unicode's SpecialCasing.txt has no ASCII entries). The mapping is therefore 1 byte → 1 byte, and transform() would have rebuilt alignments to the same values it already had; skipping transform() is observationally equivalent for ASCII inputs.
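The claim can be checked exhaustively over the ASCII range on stable Rust (a quick sketch, not code from the PR; the helper name is made up):

```rust
// For an ASCII byte, char::to_lowercase() yields exactly one ASCII char:
// b | 0x20 for A..=Z, the byte itself otherwise.
fn ascii_to_lower_via_unicode(b: u8) -> Option<u8> {
    let lowered: Vec<char> = (b as char).to_lowercase().collect();
    if lowered.len() == 1 && lowered[0].is_ascii() {
        Some(lowered[0] as u8)
    } else {
        None // never reached for ASCII input
    }
}

// Exhaustive check: 1 byte in, 1 byte out, b | 0x20 exactly on A..=Z.
fn verify_all_ascii() {
    for b in 0u8..=0x7F {
        let expected = if b.is_ascii_uppercase() { b | 0x20 } else { b };
        assert_eq!(ascii_to_lower_via_unicode(b), Some(expected));
    }
}
```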

Two new unit tests pin this down:

  • lowercase_ascii_fast_path_preserves_alignments — after a non-trivial nfkd() transform, asserts normalized, alignments, original, and original_shift are byte-identical between fast-path output and the expected values.
  • lowercase_ascii_matches_unicode_path_byte_for_byte — for every printable ASCII byte, checks the fast path output equals chars().flat_map(to_lowercase).

Non-ASCII inputs hit the original code path unchanged.

Implementation notes

  • Runtime dispatch via is_x86_feature_detected!, probing "avx512f"/"avx512bw"/"avx512fp16" and "avx2". SSE2 is always available in the x86_64 baseline; NEON is mandatory in the aarch64 stable ABI, so no detection is needed there.
  • AVX-512BW path is gated on avx512fp16 (not avx512bw) intentionally as a frequency-safety proxy: avx512fp16 lights up only on Sapphire Rapids+ / Granite Rapids / Zen 5, where heavy 512-bit integer ops do not meaningfully downclock the core. Pre-SPR Intel parts (SKX/CLX/ICX) and Zen 4 — which all have AVX-512BW — fall back to AVX2 to avoid the historical 512-bit clock penalty.
  • Each unsafe fn ascii_lower_* is #[target_feature(enable = "...")]-gated and has a scalar tail. The SSE2/AVX2 paths use signed cmpgt_epi8, which excludes bytes ≥ 0x80 naturally — defensive even though the gate filters them out.
  • Equivalence-tested across critical lengths (0, 1, 7, 15, 16, 17, 31, 32, 33, ...) and against inputs with high-bit bytes mixed in.
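The signed-compare trick can be mirrored in portable scalar code (a sketch of the per-lane logic only, not the actual intrinsics path): reinterpreting each byte as i8 makes values ≥ 0x80 negative, so the `'A'-1 < b < 'Z'+1` window rejects them for free.

```rust
// Scalar mirror of the SSE2/AVX2 lane logic: two signed compares against
// 'A'-1 and 'Z'+1, then a branchless OR of 0x20 into matching "lanes".
fn ascii_lower_scalar(buf: &mut [u8]) {
    for b in buf {
        let s = *b as i8; // bytes >= 0x80 become negative
        // Same predicate cmpgt_epi8 evaluates per lane; negative values
        // (high-bit bytes) can never satisfy s > 64.
        let is_upper = (s > b'A' as i8 - 1) & (s < b'Z' as i8 + 1);
        // true -> 0xFF, false -> 0x00, like a SIMD compare mask
        let mask = (is_upper as u8).wrapping_neg() & 0x20;
        *b |= mask;
    }
}
```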

Microbench

cargo bench --bench ascii_lower_benchmark.

The "unicode chars" stand-in mirrors what NormalizedString::lowercase did before this change (per-char UTF-8 decode + to_lowercase() + Vec<(char, isize)> accumulation, alignment bookkeeping omitted).

Apple Silicon (NEON)

| size    | SIMD       | scalar (auto-vec) | unicode chars (old) | SIMD vs old |
|---------|------------|-------------------|---------------------|-------------|
| 64 B    | 30.2 GiB/s | 5.9 GiB/s         | 1.0 GiB/s           | ~30×        |
| 1 KiB   | 55.4 GiB/s | 6.3 GiB/s         | 1.2 GiB/s           | ~48×        |
| 16 KiB  | 58.4 GiB/s | 6.3 GiB/s         | 1.2 GiB/s           | ~49×        |
| 256 KiB | 57.9 GiB/s | 6.3 GiB/s         | 1.2 GiB/s           | ~48×        |

x86_64 AVX2 — AMD Ryzen 7 7800X3D (Zen 4, Ubuntu 26.04, rustc 1.95 stable, criterion sample_size=50)

| size    | SIMD       | scalar (auto-vec) | unicode chars (old) | SIMD vs scalar | SIMD vs old |
|---------|------------|-------------------|---------------------|----------------|-------------|
| 64 B    | 17.5 GiB/s | 3.51 GiB/s        | 591 MiB/s           | ~5.0×          | ~30×        |
| 1 KiB   | 70.5 GiB/s | 3.61 GiB/s        | 723 MiB/s           | ~19.5×         | ~100×       |
| 16 KiB  | 83.0 GiB/s | 3.63 GiB/s        | 728 MiB/s           | ~22.9×         | ~117×       |
| 256 KiB | 65.6 GiB/s | 3.63 GiB/s        | 738 MiB/s           | ~18.1×         | ~91×        |

Notes on the AVX2 results:

  • AVX-512BW path is not taken on Zen 4 because avx512fp16 is absent — by design (see Implementation notes). On a Zen 5 / Sapphire Rapids+ host this path will engage and should improve the 1–16 KiB range further.
  • scalar (the simple is_ascii_uppercase() | 0x20 byte loop) tops out at ~3.6 GiB/s on this host because the autovectorizer does not vectorize it; SIMD still wins ~5–23× against this baseline.
  • 16 KiB peak (~83 GiB/s) is bounded by L1/L2; 256 KiB tail reflects working-set leaving L1.

Test plan

  • cargo test --lib (207 tests pass)
  • cargo build --release --lib
  • New unit tests in utils::simd::tests (5 tests covering empty input, critical lengths, mixed-case random, idempotence, and high-byte safety)
  • New equivalence tests in tokenizer::normalizer::tests
  • cargo bench --bench ascii_lower_benchmark (Apple/NEON + x86_64 AVX2 / Zen 4)
  • cargo fmt --check
  • CI: clippy + tests on x86_64 (running via repo CI)

Follow-ups (not in this PR)

  • Same pattern can extend to Strip, BertPreTokenizer byte classification, and ByteLevel ASCII pass-through.

🤖 Generated with Claude Code

KimBioInfoStudio and others added 2 commits April 27, 2026 00:18
Adds `utils::simd::ascii_lower`, a runtime-dispatched in-place ASCII
lowercaser with AVX2, SSE2, NEON, and scalar back-ends, and gates
`NormalizedString::lowercase` on `is_ascii()` so all-ASCII inputs skip the
per-`char` Unicode case-folding loop and the alignments rebuild.

For ASCII inputs the slow path produced byte-identical output and
byte-identical alignments (each `char::to_lowercase()` of an ASCII char is
a single ASCII char, so the `transform` rebuild was a no-op on
alignments); the fast path therefore just flips the `0x20` bit on bytes in
`A`..=`Z` in place. Two new unit tests in `normalizer.rs` lock that
equivalence in:

  - `lowercase_ascii_fast_path_preserves_alignments` — checks that after a
    non-trivial NFKD transform the fast path leaves `alignments`,
    `original`, and `original_shift` unchanged.
  - `lowercase_ascii_matches_unicode_path_byte_for_byte` — checks every
    printable ASCII byte against `char::to_lowercase`.

Microbench (`benches/ascii_lower_benchmark.rs`, Apple Silicon, NEON):

  | size   | SIMD       | scalar (auto-vec) | unicode chars (old) |
  |--------|------------|-------------------|---------------------|
  |   64 B | 30.2 GiB/s |      5.9 GiB/s    |       1.0 GiB/s     |
  |  1 KiB | 55.4 GiB/s |      6.3 GiB/s    |       1.2 GiB/s     |
  | 16 KiB | 58.4 GiB/s |      6.3 GiB/s    |       1.2 GiB/s     |
  |256 KiB | 57.9 GiB/s |      6.3 GiB/s    |       1.2 GiB/s     |

i.e. ~30-49x over the previous Unicode path on real-text-like ASCII.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a 64-byte-wide back-end (`_mm512_cmpgt_epi8_mask` + `_mm512_movm_epi8`)
to `utils::simd::ascii_lower`. The dispatcher only routes to it when
`avx512f`, `avx512bw`, and `avx512fp16` are all detected.

`avx512fp16` is used as a proxy for "AVX-512 without meaningful license-mode
downclock":

  - present on Intel Sapphire Rapids / Emerald Rapids / Granite Rapids and
    on AMD Zen 5 (Turin)
  - absent on Skylake-X, Cascade Lake, Cooper Lake, Ice Lake-SP, and
    Rocket Lake (exactly the generations where 512-bit ops cause
    measurable frequency throttling), and on AMD Zen 4 (Ryzen 7000 /
    EPYC Genoa)

Older AVX-512-capable hardware therefore stays on the AVX2 path, where
the 256-bit work is already memory-bandwidth-bound on long buffers.

Verified with `cargo check --target x86_64-unknown-linux-gnu` plus the
existing 207-test lib suite on aarch64. The AVX-512 path itself is
exercised at runtime only on capable hosts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

