
feat: add runtime batch_bool mask overloads for load_masked/store_masked #1332

Open
DiamonDinoia wants to merge 3 commits into xtensor-stack:master from DiamonDinoia:feat/dynamic-masks

Conversation

@DiamonDinoia
Contributor

Add runtime-mask overloads of xsimd::load_masked and xsimd::store_masked across AVX2, AVX-512, SSE, SVE, RVV, and NEON. The generic common-path fallback is collapsed to a whole-vector select, and the unaligned page-cross fast path is dropped since the underlying intrinsics suppress faults on masked-off lanes regardless of alignment.
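For orientation, here is a minimal usage sketch of what a runtime-mask load could look like from user code. The public `load_masked` signature below is an assumption on my part, modelled on the kernel overloads quoted later in this thread (pointer, runtime `batch_bool` mask, alignment tag); only the mask construction and the handling of the partial vector are the point.

```cpp
// Hypothetical usage sketch -- the public load_masked signature below is an
// assumption modelled on the kernel overloads quoted later in this thread
// (pointer, runtime batch_bool mask, alignment tag).
#include <xsimd/xsimd.hpp>
#include <cstddef>

using batch_f = xsimd::batch<float>;
using mask_f  = xsimd::batch_bool<float>;

float sum_first_n(const float* data, std::size_t n) // n <= batch_f::size
{
    // Build a runtime mask activating lanes [0, n): compare a lane-index
    // vector against the broadcast count.
    alignas(xsimd::default_arch::alignment()) float idx[batch_f::size];
    for (std::size_t i = 0; i < batch_f::size; ++i)
        idx[i] = static_cast<float>(i);
    mask_f mask = batch_f::load_aligned(idx) < batch_f(static_cast<float>(n));

    // Masked-off lanes are never read (no page-cross hazard) and come back
    // zeroed, per the maskload-style intrinsics and the zero-initialised
    // fallback buffer discussed below, so the partial vector can be reduced.
    batch_f v = xsimd::load_masked(data, mask, xsimd::unaligned_mode {});
    return xsimd::reduce_add(v);
}
```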

@DiamonDinoia force-pushed the feat/dynamic-masks branch 4 times, most recently from dda5de6 to 7484c4b on April 30, 2026 at 22:51
Add runtime-mask overloads of xsimd::load_masked and xsimd::store_masked
across AVX2, AVX-512, SSE, SVE, RVV, and NEON. The generic common-path
fallback is collapsed to a whole-vector select, and the unaligned
page-cross fast path is dropped since the underlying intrinsics suppress
faults on masked-off lanes regardless of alignment.

Also: forward SVE compile-time masked load/store through the
runtime path so the per-lane predicate is correct on SVE wider
than 128 bits (the previous svdupq_b* path replicates a 128-bit
chunk pattern across the vector).
Sugar over runtime-mask load/store for loop head/tail remainders.
Take ``n`` directly instead of a constructed batch_bool; only
``mem[0, n)`` is touched. ``head`` uses mask ``(1 << n) - 1``;
``tail`` uses ``((1 << n) - 1) << (size - n)`` with a base-pointer
offset (via uintptr_t to dodge -Warray-bounds), so every arch with
native predicated load/store inherits its intrinsic for free.
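As a side illustration of the bit patterns mentioned above (not the PR's actual helpers, which work in terms of `batch_bool`), the head/tail masks reduce to:

```cpp
// Illustration of the head/tail mask arithmetic described above: bit
// patterns only, the real helpers build a batch_bool / call the masked
// load or store directly.
#include <cstddef>
#include <cstdint>

// Lanes [0, n) active: the low n bits are set.
constexpr std::uint64_t head_bits(std::size_t n)
{
    return n == 0 ? 0u : (std::uint64_t(1) << n) - 1;
}

// Lanes [size - n, size) active: the same pattern shifted to the top,
// paired with a base-pointer offset of -(size - n) so the active lanes
// still address mem[0, n).
constexpr std::uint64_t tail_bits(std::size_t n, std::size_t size)
{
    return head_bits(n) << (size - n);
}

static_assert(head_bits(3) == 0b0111, "head(3) on a 4-lane batch");
static_assert(tail_bits(3, 4) == 0b1110, "tail(3) on a 4-lane batch");
```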

Tested on sse2/sse41/avx2/avx512f/emulated256 native and
neon64/rvv under qemu.
Mirror the AVX/AVX2 runtime-mask load_masked / store_masked overloads
on the new 128-bit SSE-register variants of those ISAs:

- avx_128: float / double via _mm_maskload_ps/pd, _mm_maskstore_ps/pd
- avx2_128: 32/64-bit integers via _mm_maskload_epi32/64, _mm_maskstore_epi32/64

8/16-bit integers continue to fall through to the scalar common path
(no native maskload/store intrinsic at those widths). Both alignment
modes route to the same intrinsic since masked-off lanes do not fault.
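To make the 128-bit mapping concrete, here is a standalone sketch of the underlying intrinsics (plain AVX code, not the PR's wrappers); the sign bit of each mask lane gates the memory access, which is why alignment does not affect fault behaviour:

```cpp
// Standalone sketch of the intrinsics the 128-bit variants map to (not the
// PR's code). The sign bit of each 32-bit mask lane decides whether that
// lane is read/written; inactive lanes never touch memory. Compile with AVX
// enabled (e.g. -mavx).
#include <immintrin.h>

__m128 load_first_n_floats(const float* p, int n) // n in [0, 4]
{
    const __m128i mask = _mm_setr_epi32(n > 0 ? -1 : 0, n > 1 ? -1 : 0,
                                        n > 2 ? -1 : 0, n > 3 ? -1 : 0);
    return _mm_maskload_ps(p, mask);              // inactive lanes read as 0
}

void store_first_n_floats(float* p, __m128 v, int n)
{
    const __m128i mask = _mm_setr_epi32(n > 0 ? -1 : 0, n > 1 ? -1 : 0,
                                        n > 2 ? -1 : 0, n > 3 ? -1 : 0);
    _mm_maskstore_ps(p, mask, v);                 // inactive lanes untouched
}
```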
@serge-sans-paille
Contributor

Could you split the head / tail part into another PR? This one is already quite dense...

// (AVX2 32/64-bit, AVX-512, SVE, RVV) override this with a single
// intrinsic that suppresses inactive-lane reads in hardware.
constexpr std::size_t size = batch<T, A>::size;
alignas(A::alignment()) std::array<T, size> buffer {};
Contributor

to make it worse, building a mask is not always a single operation depending on the target...

// (AVX2 32/64-bit, AVX-512, SVE, RVV) override this with a single
// intrinsic that suppresses inactive-lane reads in hardware.
constexpr std::size_t size = batch<T, A>::size;
alignas(A::alignment()) std::array<T, size> buffer {};
Contributor

This array assignment forces everything to zero, while some of those stores are not needed, and the compiler is not able to optimize this away in the generic case.

XSIMD_INLINE std::enable_if_t<std::is_integral<T>::value && (sizeof(T) == 4 || sizeof(T) == 8), batch<T, A>>
load_masked(T const* mem, batch_bool<T, A> mask, convert<T>, Mode, requires_arch<avx2>) noexcept
{
using int_t = std::conditional_t<sizeof(T) == 4, int32_t, long long>;
Contributor

Why long long and not int64_t? There's no guarantee that sizeof(long long) == 8.

}
else
{
_mm256_maskstore_epi64(reinterpret_cast<long long*>(mem), __m256i(mask), __m256i(src));
Contributor

OK, I guess that's a constraint of the Intel intrinsic; at least static_assert that sizeof(long long) == 8 and sizeof(int) == 4 if you're using this to distinguish between the two?
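A minimal sketch of the guard being suggested (illustration only, not the PR's code):

```cpp
// The Intel prototypes take int* / long long*, so pin down the sizes that
// the reinterpret_casts rely on.
static_assert(sizeof(int) == 4 && sizeof(long long) == 8,
              "AVX2 maskload/maskstore wrappers assume 32-bit int and 64-bit long long");
```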

// constructs a 128-bit chunk predicate (svdupq_b{8,16,32,64}), which
// is replication-based and does not correctly express a per-lane
// mask on SVE wider than 128 bits — going through ``as_batch_bool``
// gives the right predicate for every vector width. ``int32``/
Contributor

Do you know if the pmask approach would be faster? If so, we could still gate its usage with if constexpr when the SVE size allows it.
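For context on the replication issue referred to above, a small ACLE illustration (assumes an SVE toolchain; `svdupq_b32` is the overloaded name used in the quoted comment, the non-overloaded spelling is `svdupq_n_b32`):

```cpp
// svdupq_b32 builds a 4-lane pattern and replicates it across every 128-bit
// chunk of the vector, so it only expresses a prefix mask on 128-bit SVE.
#include <arm_sve.h>

svbool_t two_lane_pattern()
{
    svbool_t p = svdupq_b32(true, true, false, false);
    // 128-bit SVE: lanes {0,1}            active -- as intended
    // 256-bit SVE: lanes {0,1, 4,5}       active -- not a prefix mask
    // 512-bit SVE: lanes {0,1, 4,5, 8,9}  active
    // Going through as_batch_bool instead yields the correct per-lane
    // predicate at any vector length.
    return p;
}
```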

* so partial loads across a page boundary are safe. \c stream_mode is not
* supported.
*
* \warning Runtime-mask loads carry a significant performance penalty on
Contributor

I don't think we should go into details here:

  1. it's difficult to maintain this kind of documentation (what about newly added architectures?)
  2. we're already in this situation for other operations and we don't specify it there.

I think it's important to communicate that info, but until we have an automated way to do so, better not just throw documentation at it.

static_assert(std::is_same<Mode, aligned_mode>::value || std::is_same<Mode, unaligned_mode>::value,
"supported load mode");
constexpr uint64_t full_mask = details::full_mask(size);
const auto bits = mask.mask();
Contributor

I'm unsure we want that extra call to mask(), which may be costly, plus the extra tests... if masking is supported, is it beneficial? If it's not, we're already slow...
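For readers following along, the extra call and tests in question are presumably the all-active / all-inactive fast paths reconstructed below; this is my reading of the quoted context, not the PR's actual diff.

```cpp
// Reconstruction of the pattern under discussion (assumption, not the PR's
// exact code): mask() packs the predicate into an integer so the common
// all-active / all-inactive cases can short-circuit before the masked path.
#include <xsimd/xsimd.hpp>
#include <cstddef>
#include <cstdint>

template <class T, class A>
xsimd::batch<T, A> load_masked_sketch(const T* mem, xsimd::batch_bool<T, A> mask)
{
    constexpr std::size_t size = xsimd::batch<T, A>::size;
    constexpr std::uint64_t full_mask = (std::uint64_t(1) << size) - 1;
    const std::uint64_t bits = mask.mask();             // the extra call in question
    if (bits == full_mask)
        return xsimd::batch<T, A>::load_unaligned(mem); // all lanes active: plain load
    if (bits == 0)
        return xsimd::batch<T, A>(T(0));                // no lane active: nothing to read
    // Scalar stand-in for the per-arch masked path / generic fallback:
    alignas(A::alignment()) T buf[size] = {};
    for (std::size_t i = 0; i < size; ++i)
        if (bits & (std::uint64_t(1) << i))
            buf[i] = mem[i];
    return xsimd::batch<T, A>::load_aligned(buf);
}
```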

static_assert(std::is_same<Mode, aligned_mode>::value || std::is_same<Mode, unaligned_mode>::value,
"supported store mode");
constexpr uint64_t full_mask = details::full_mask(size);
const auto bits = mask.mask();
Contributor

same here
