
feat: add runtime batch_bool mask overloads for load_masked/store_masked #1332

Open
DiamonDinoia wants to merge 3 commits into xtensor-stack:master from DiamonDinoia:feat/dynamic-masks

Conversation

@DiamonDinoia
Contributor

Add runtime-mask overloads of xsimd::load_masked and xsimd::store_masked across AVX2, AVX-512, SSE, SVE, RVV, and NEON. The generic common-path fallback is collapsed to a whole-vector select, and the unaligned page-cross fast path is dropped since the underlying intrinsics suppress faults on masked-off lanes regardless of alignment.
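For orientation, here is a minimal usage sketch of what a runtime-mask load could look like from user code. The public `load_masked` signature below is an assumption on my part, modelled on the kernel overloads quoted later in this thread (pointer, runtime `batch_bool` mask, alignment tag); only the mask construction and the handling of the partial vector are the point.

```cpp
// Hypothetical usage sketch -- the public load_masked signature below is an
// assumption modelled on the kernel overloads quoted later in this thread
// (pointer, runtime batch_bool mask, alignment tag).
#include <xsimd/xsimd.hpp>
#include <cstddef>

using batch_f = xsimd::batch<float>;
using mask_f  = xsimd::batch_bool<float>;

float sum_first_n(const float* data, std::size_t n) // n <= batch_f::size
{
    // Build a runtime mask activating lanes [0, n): compare a lane-index
    // vector against the broadcast count.
    alignas(xsimd::default_arch::alignment()) float idx[batch_f::size];
    for (std::size_t i = 0; i < batch_f::size; ++i)
        idx[i] = static_cast<float>(i);
    mask_f mask = batch_f::load_aligned(idx) < batch_f(static_cast<float>(n));

    // Masked-off lanes are never read (no page-cross hazard) and come back
    // zeroed, per the maskload-style intrinsics and the zero-initialised
    // fallback buffer discussed below, so the partial vector can be reduced.
    batch_f v = xsimd::load_masked(data, mask, xsimd::unaligned_mode {});
    return xsimd::reduce_add(v);
}
```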

@DiamonDinoia force-pushed the feat/dynamic-masks branch 4 times, most recently from dda5de6 to 7484c4b on April 30, 2026 at 22:51
Add runtime-mask overloads of xsimd::load_masked and xsimd::store_masked
across AVX2, AVX-512, SSE, SVE, RVV, and NEON. The generic common-path
fallback is collapsed to a whole-vector select, and the unaligned
page-cross fast path is dropped since the underlying intrinsics suppress
faults on masked-off lanes regardless of alignment.

Also: forward SVE compile-time masked load/store through the
runtime path so the per-lane predicate is correct on SVE wider
than 128 bits (the previous svdupq_b* path replicates a 128-bit
chunk pattern across the vector).
Sugar over runtime-mask load/store for loop head/tail remainders.
Take ``n`` directly instead of a constructed batch_bool; only
``mem[0, n)`` is touched. ``head`` uses mask ``(1 << n) - 1``;
``tail`` uses ``((1 << n) - 1) << (size - n)`` with a base-pointer
offset (via uintptr_t to dodge -Warray-bounds), so every arch with
native predicated load/store inherits its intrinsic for free.
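As a side illustration of the bit patterns mentioned above (not the PR's actual helpers, which work in terms of `batch_bool`), the head/tail masks reduce to:

```cpp
// Illustration of the head/tail mask arithmetic described above: bit
// patterns only, the real helpers build a batch_bool / call the masked
// load or store directly.
#include <cstddef>
#include <cstdint>

// Lanes [0, n) active: the low n bits are set.
constexpr std::uint64_t head_bits(std::size_t n)
{
    return n == 0 ? 0u : (std::uint64_t(1) << n) - 1;
}

// Lanes [size - n, size) active: the same pattern shifted to the top,
// paired with a base-pointer offset of -(size - n) so the active lanes
// still address mem[0, n).
constexpr std::uint64_t tail_bits(std::size_t n, std::size_t size)
{
    return head_bits(n) << (size - n);
}

static_assert(head_bits(3) == 0b0111, "head(3) on a 4-lane batch");
static_assert(tail_bits(3, 4) == 0b1110, "tail(3) on a 4-lane batch");
```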

Tested on sse2/sse41/avx2/avx512f/emulated256 native and
neon64/rvv under qemu.
Mirror the AVX/AVX2 runtime-mask load_masked / store_masked overloads
on the new 128-bit SSE-register variants of those ISAs:

- avx_128: float / double via _mm_maskload_ps/pd, _mm_maskstore_ps/pd
- avx2_128: 32/64-bit integers via _mm_maskload_epi32/64, _mm_maskstore_epi32/64

8/16-bit integers continue to fall through to the scalar common path
(no native maskload/store intrinsic at those widths). Both alignment
modes route to the same intrinsic since masked-off lanes do not fault.
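To make the 128-bit mapping concrete, here is a standalone sketch of the underlying intrinsics (plain AVX code, not the PR's wrappers); the sign bit of each mask lane gates the memory access, which is why alignment does not affect fault behaviour:

```cpp
// Standalone sketch of the intrinsics the 128-bit variants map to (not the
// PR's code). The sign bit of each 32-bit mask lane decides whether that
// lane is read/written; inactive lanes never touch memory. Compile with AVX
// enabled (e.g. -mavx).
#include <immintrin.h>

__m128 load_first_n_floats(const float* p, int n) // n in [0, 4]
{
    const __m128i mask = _mm_setr_epi32(n > 0 ? -1 : 0, n > 1 ? -1 : 0,
                                        n > 2 ? -1 : 0, n > 3 ? -1 : 0);
    return _mm_maskload_ps(p, mask);              // inactive lanes read as 0
}

void store_first_n_floats(float* p, __m128 v, int n)
{
    const __m128i mask = _mm_setr_epi32(n > 0 ? -1 : 0, n > 1 ? -1 : 0,
                                        n > 2 ? -1 : 0, n > 3 ? -1 : 0);
    _mm_maskstore_ps(p, mask, v);                 // inactive lanes untouched
}
```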
@serge-sans-paille
Contributor

Could you split the head / tail part into another PR? This one is already quite dense...

// (AVX2 32/64-bit, AVX-512, SVE, RVV) override this with a single
// intrinsic that suppresses inactive-lane reads in hardware.
constexpr std::size_t size = batch<T, A>::size;
alignas(A::alignment()) std::array<T, size> buffer {};
Contributor

to make it worse, building a mask is not always a single operation depending on the target...

// (AVX2 32/64-bit, AVX-512, SVE, RVV) override this with a single
// intrinsic that suppresses inactive-lane reads in hardware.
constexpr std::size_t size = batch<T, A>::size;
alignas(A::alignment()) std::array<T, size> buffer {};
Contributor

This array assignment forces everything to zero, while some of those stores are not needed, and the compiler is not able to optimize this away in the generic case.

XSIMD_INLINE std::enable_if_t<std::is_integral<T>::value && (sizeof(T) == 4 || sizeof(T) == 8), batch<T, A>>
load_masked(T const* mem, batch_bool<T, A> mask, convert<T>, Mode, requires_arch<avx2>) noexcept
{
using int_t = std::conditional_t<sizeof(T) == 4, int32_t, long long>;
Contributor

Why long long and not int64_t? There's no guarantee that sizeof(long long) == 8.

}
else
{
_mm256_maskstore_epi64(reinterpret_cast<long long*>(mem), __m256i(mask), __m256i(src));
Contributor

OK, I guess that's a constraint of the Intel intrinsic; at least static_assert that sizeof(long long) == 8 and sizeof(int) == 4 if you're using this to distinguish between the two?
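A minimal sketch of the guard being suggested (illustration only, not the PR's code):

```cpp
// The Intel prototypes take int* / long long*, so pin down the sizes that
// the reinterpret_casts rely on.
static_assert(sizeof(int) == 4 && sizeof(long long) == 8,
              "AVX2 maskload/maskstore wrappers assume 32-bit int and 64-bit long long");
```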

// constructs a 128-bit chunk predicate (svdupq_b{8,16,32,64}), which
// is replication-based and does not correctly express a per-lane
// mask on SVE wider than 128 bits — going through ``as_batch_bool``
// gives the right predicate for every vector width. ``int32``/
Contributor

Do you know if the pmask approach would be faster? If so, we could still gate its usage with if constexpr when the SVE size allows it.
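For context on the replication issue referred to above, a small ACLE illustration (assumes an SVE toolchain; `svdupq_b32` is the overloaded name used in the quoted comment, the non-overloaded spelling is `svdupq_n_b32`):

```cpp
// svdupq_b32 builds a 4-lane pattern and replicates it across every 128-bit
// chunk of the vector, so it only expresses a prefix mask on 128-bit SVE.
#include <arm_sve.h>

svbool_t two_lane_pattern()
{
    svbool_t p = svdupq_b32(true, true, false, false);
    // 128-bit SVE: lanes {0,1}            active -- as intended
    // 256-bit SVE: lanes {0,1, 4,5}       active -- not a prefix mask
    // 512-bit SVE: lanes {0,1, 4,5, 8,9}  active
    // Going through as_batch_bool instead yields the correct per-lane
    // predicate at any vector length.
    return p;
}
```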

* so partial loads across a page boundary are safe. \c stream_mode is not
* supported.
*
* \warning Runtime-mask loads carry a significant performance penalty on
Contributor

I don't think we should go into details here:

  1. it's difficult to maintain this kind of documentation (what about newly added architectures?)
  2. we're already in this situation for other operations and we don't specify it there.

I think it's important to communicate that info, but until we have an automated way to do so, better not just throw documentation at it.

static_assert(std::is_same<Mode, aligned_mode>::value || std::is_same<Mode, unaligned_mode>::value,
"supported load mode");
constexpr uint64_t full_mask = details::full_mask(size);
const auto bits = mask.mask();
Contributor

I'm unsure we want that extra call to mask(), which may be costly, plus the extra tests... if masking is supported, is it beneficial? If it's not, we're already slow...
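For readers following along, the extra call and tests in question are presumably the all-active / all-inactive fast paths reconstructed below; this is my reading of the quoted context, not the PR's actual diff.

```cpp
// Reconstruction of the pattern under discussion (assumption, not the PR's
// exact code): mask() packs the predicate into an integer so the common
// all-active / all-inactive cases can short-circuit before the masked path.
#include <xsimd/xsimd.hpp>
#include <cstddef>
#include <cstdint>

template <class T, class A>
xsimd::batch<T, A> load_masked_sketch(const T* mem, xsimd::batch_bool<T, A> mask)
{
    constexpr std::size_t size = xsimd::batch<T, A>::size;
    constexpr std::uint64_t full_mask = (std::uint64_t(1) << size) - 1;
    const std::uint64_t bits = mask.mask();             // the extra call in question
    if (bits == full_mask)
        return xsimd::batch<T, A>::load_unaligned(mem); // all lanes active: plain load
    if (bits == 0)
        return xsimd::batch<T, A>(T(0));                // no lane active: nothing to read
    // Scalar stand-in for the per-arch masked path / generic fallback:
    alignas(A::alignment()) T buf[size] = {};
    for (std::size_t i = 0; i < size; ++i)
        if (bits & (std::uint64_t(1) << i))
            buf[i] = mem[i];
    return xsimd::batch<T, A>::load_aligned(buf);
}
```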

static_assert(std::is_same<Mode, aligned_mode>::value || std::is_same<Mode, unaligned_mode>::value,
"supported store mode");
constexpr uint64_t full_mask = details::full_mask(size);
const auto bits = mask.mask();
Contributor

same here
