feat: SIMD ASCII fast path for Lowercase normalizer (~30-49x)#2036

Open
KimBioInfoStudio wants to merge 4 commits into huggingface:main from KimBioInfoStudio:feat/simd-ascii-lower
Conversation

@KimBioInfoStudio KimBioInfoStudio commented Apr 26, 2026

Summary

  • Adds utils::simd::ascii_lower, a runtime-dispatched in-place ASCII lowercaser with AVX-512BW, AVX2, SSE2, NEON, and scalar back-ends.
  • Gates NormalizedString::lowercase on s.is_ascii() so all-ASCII inputs skip the per-char Unicode case-folding loop and the transform() alignments rebuild, mutating bytes in place instead.
  • Microbench shows ~30-49× over the previous Unicode chars() path on Apple/NEON, and up to ~117× on x86_64 AVX2 (Zen 4) at 16 KiB.
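The gate described above can be sketched as follows. This is a scalar stand-in, not the PR's actual code: the function name `lowercase_in_place` is hypothetical, the real change lives inside `NormalizedString::lowercase`, and the byte loop stands in for the dispatched SIMD kernel.

```rust
// Hypothetical sketch of the ASCII gate; the byte loop stands in for
// the runtime-dispatched utils::simd::ascii_lower kernel.
fn lowercase_in_place(s: &mut String) {
    if s.is_ascii() {
        // All-ASCII fast path: mutate bytes in place. OR-ing 0x20 into
        // A..=Z maps ASCII to ASCII, so the string stays valid UTF-8.
        // SAFETY: only ASCII bytes are rewritten to ASCII bytes.
        unsafe {
            for b in s.as_mut_vec() {
                if b.is_ascii_uppercase() {
                    *b |= 0x20;
                }
            }
        }
    } else {
        // Non-ASCII: the original per-char Unicode case-folding path.
        let lowered: String = s.chars().flat_map(char::to_lowercase).collect();
        *s = lowered;
    }
}
```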

Why this is safe

For any ASCII char, char::to_lowercase() returns exactly one ASCII char: b | 0x20 for A..=Z, and the byte unchanged otherwise (Unicode's SpecialCasing.txt has no ASCII entries). The mapping is therefore 1 byte → 1 byte, and transform() would have rebuilt alignments to the same values it already had; skipping transform() is observationally equivalent for ASCII inputs.
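The claim can be checked exhaustively over the ASCII range on stable Rust (a quick sketch, not code from the PR; the helper name is made up):

```rust
// For an ASCII byte, char::to_lowercase() yields exactly one ASCII char:
// b | 0x20 for A..=Z, the byte itself otherwise.
fn ascii_to_lower_via_unicode(b: u8) -> Option<u8> {
    let lowered: Vec<char> = (b as char).to_lowercase().collect();
    if lowered.len() == 1 && lowered[0].is_ascii() {
        Some(lowered[0] as u8)
    } else {
        None // never reached for ASCII input
    }
}

// Exhaustive check: 1 byte in, 1 byte out, b | 0x20 exactly on A..=Z.
fn verify_all_ascii() {
    for b in 0u8..=0x7F {
        let expected = if b.is_ascii_uppercase() { b | 0x20 } else { b };
        assert_eq!(ascii_to_lower_via_unicode(b), Some(expected));
    }
}
```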

Two new unit tests pin this down:

  • lowercase_ascii_fast_path_preserves_alignments — after a non-trivial nfkd() transform, asserts normalized, alignments, original, and original_shift are byte-identical between fast-path output and the expected values.
  • lowercase_ascii_matches_unicode_path_byte_for_byte — for every printable ASCII byte, checks the fast path output equals chars().flat_map(to_lowercase).

Non-ASCII inputs hit the original code path unchanged.

Implementation notes

  • Runtime dispatch via is_x86_feature_detected!, probing "avx512f"/"avx512bw"/"avx512fp16" and "avx2". SSE2 is always available in the x86_64 baseline; NEON is mandatory in the aarch64 stable ABI, so no detection is needed there.
  • AVX-512BW path is gated on avx512fp16 (not avx512bw) intentionally as a frequency-safety proxy: avx512fp16 lights up only on Sapphire Rapids+ / Granite Rapids / Zen 5, where heavy 512-bit integer ops do not meaningfully downclock the core. Pre-SPR Intel parts (SKX/CLX/ICX) and Zen 4 — which all have AVX-512BW — fall back to AVX2 to avoid the historical 512-bit clock penalty.
  • Each unsafe fn ascii_lower_* is #[target_feature(enable = "...")]-gated and has a scalar tail. The SSE2/AVX2 paths use signed cmpgt_epi8, which excludes bytes ≥ 0x80 naturally — defensive even though the gate filters them out.
  • Equivalence-tested across critical lengths (0, 1, 7, 15, 16, 17, 31, 32, 33, ...) and against inputs with high-bit bytes mixed in.
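The signed-compare trick can be mirrored in portable scalar code (a sketch of the per-lane logic only, not the actual intrinsics path): reinterpreting each byte as i8 makes values ≥ 0x80 negative, so the `'A'-1 < b < 'Z'+1` window rejects them for free.

```rust
// Scalar mirror of the SSE2/AVX2 lane logic: two signed compares against
// 'A'-1 and 'Z'+1, then a branchless OR of 0x20 into matching "lanes".
fn ascii_lower_scalar(buf: &mut [u8]) {
    for b in buf {
        let s = *b as i8; // bytes >= 0x80 become negative
        // Same predicate cmpgt_epi8 evaluates per lane; negative values
        // (high-bit bytes) can never satisfy s > 64.
        let is_upper = (s > b'A' as i8 - 1) & (s < b'Z' as i8 + 1);
        // true -> 0xFF, false -> 0x00, like a SIMD compare mask
        let mask = (is_upper as u8).wrapping_neg() & 0x20;
        *b |= mask;
    }
}
```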

Microbench

cargo bench --bench ascii_lower_benchmark.

The "unicode chars" stand-in mirrors what NormalizedString::lowercase did before this change (per-char UTF-8 decode + to_lowercase() + Vec<(char, isize)> accumulation, alignment bookkeeping omitted).

Apple Silicon (NEON)

| size    | SIMD       | scalar (auto-vec) | unicode chars (old) | SIMD vs old |
|---------|------------|-------------------|---------------------|-------------|
| 64 B    | 30.2 GiB/s | 5.9 GiB/s         | 1.0 GiB/s           | ~30×        |
| 1 KiB   | 55.4 GiB/s | 6.3 GiB/s         | 1.2 GiB/s           | ~48×        |
| 16 KiB  | 58.4 GiB/s | 6.3 GiB/s         | 1.2 GiB/s           | ~49×        |
| 256 KiB | 57.9 GiB/s | 6.3 GiB/s         | 1.2 GiB/s           | ~48×        |

x86_64 AVX2 — AMD Ryzen 7 7800X3D (Zen 4, Ubuntu 26.04, rustc 1.95 stable, criterion sample_size=50)

| size    | SIMD       | scalar (auto-vec) | unicode chars (old) | SIMD vs scalar | SIMD vs old |
|---------|------------|-------------------|---------------------|----------------|-------------|
| 64 B    | 17.5 GiB/s | 3.51 GiB/s        | 591 MiB/s           | ~5.0×          | ~30×        |
| 1 KiB   | 70.5 GiB/s | 3.61 GiB/s        | 723 MiB/s           | ~19.5×         | ~100×       |
| 16 KiB  | 83.0 GiB/s | 3.63 GiB/s        | 728 MiB/s           | ~22.9×         | ~117×       |
| 256 KiB | 65.6 GiB/s | 3.63 GiB/s        | 738 MiB/s           | ~18.1×         | ~91×        |

Notes on the AVX2 results:

  • AVX-512BW path is not taken on Zen 4 because avx512fp16 is absent — by design (see Implementation notes). On a Zen 5 / Sapphire Rapids+ host this path will engage and should improve the 1–16 KiB range further.
  • scalar (the simple is_ascii_uppercase() | 0x20 byte loop) tops out at ~3.6 GiB/s on this host because the autovectorizer does not vectorize it; SIMD still wins ~5–23× against this baseline.
  • 16 KiB peak (~83 GiB/s) is bounded by L1/L2; 256 KiB tail reflects working-set leaving L1.

Test plan

  • cargo test --lib (207 tests pass)
  • cargo build --release --lib
  • New unit tests in utils::simd::tests (5 tests covering empty input, critical lengths, mixed-case random, idempotence, and high-byte safety)
  • New equivalence tests in tokenizer::normalizer::tests
  • cargo bench --bench ascii_lower_benchmark (Apple/NEON + x86_64 AVX2 / Zen 4)
  • cargo fmt --check
  • CI: clippy + tests on x86_64 (running via repo CI)

Follow-ups (not in this PR)

  • Same pattern can extend to Strip, BertPreTokenizer byte classification, and ByteLevel ASCII pass-through.

🤖 Generated with Claude Code

KimBioInfoStudio and others added 2 commits April 27, 2026 00:18
Adds `utils::simd::ascii_lower`, a runtime-dispatched in-place ASCII
lowercaser with AVX2, SSE2, NEON, and scalar back-ends, and gates
`NormalizedString::lowercase` on `is_ascii()` so all-ASCII inputs skip the
per-`char` Unicode case-folding loop and the alignments rebuild.

For ASCII inputs the slow path produced byte-identical output and
byte-identical alignments (each `char::to_lowercase()` of an ASCII char is
a single ASCII char, so the `transform` rebuild was a no-op on
alignments); the fast path therefore just flips the `0x20` bit on bytes in
`A`..=`Z` in place. Two new unit tests in `normalizer.rs` lock that
equivalence in:

  - `lowercase_ascii_fast_path_preserves_alignments` — checks that after a
    non-trivial NFKD transform the fast path leaves `alignments`,
    `original`, and `original_shift` unchanged.
  - `lowercase_ascii_matches_unicode_path_byte_for_byte` — checks every
    printable ASCII byte against `char::to_lowercase`.

Microbench (`benches/ascii_lower_benchmark.rs`, Apple Silicon, NEON):

  | size   | SIMD       | scalar (auto-vec) | unicode chars (old) |
  |--------|------------|-------------------|---------------------|
  |   64 B | 30.2 GiB/s |      5.9 GiB/s    |       1.0 GiB/s     |
  |  1 KiB | 55.4 GiB/s |      6.3 GiB/s    |       1.2 GiB/s     |
  | 16 KiB | 58.4 GiB/s |      6.3 GiB/s    |       1.2 GiB/s     |
  |256 KiB | 57.9 GiB/s |      6.3 GiB/s    |       1.2 GiB/s     |

i.e. ~30-49x over the previous Unicode path on real-text-like ASCII.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a 64-byte-wide back-end (`_mm512_cmpgt_epi8_mask` + `_mm512_movm_epi8`)
to `utils::simd::ascii_lower`. The dispatcher only routes to it when
`avx512f`, `avx512bw`, and `avx512fp16` are all detected.

`avx512fp16` is used as a proxy for "AVX-512 without meaningful license-mode
downclock":

  - present on Intel Sapphire Rapids / Emerald Rapids / Granite Rapids and
    on AMD Zen 5 (Turin)
  - absent on Skylake-X, Cascade Lake, Cooper Lake, Ice Lake-SP, and
    Rocket Lake (exactly the generations where 512-bit ops cause
    measurable frequency throttling), and on AMD Zen 4 (Ryzen 7000 /
    EPYC Genoa)

Older AVX-512-capable hardware therefore stays on the AVX2 path, where
the 256-bit work is already memory-bandwidth-bound on long buffers.

Verified with `cargo check --target x86_64-unknown-linux-gnu` plus the
existing 207-test lib suite on aarch64. The AVX-512 path itself is
exercised at runtime only on capable hosts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

