feat: add runtime batch_bool mask overloads for load_masked/store_masked #1332
DiamonDinoia wants to merge 3 commits into xtensor-stack:master from
Conversation
Force-pushed from dda5de6 to 7484c4b
Add runtime-mask overloads of xsimd::load_masked and xsimd::store_masked across AVX2, AVX-512, SSE, SVE, RVV, and NEON. The generic common-path fallback is collapsed to a whole-vector select, and the unaligned page-cross fast path is dropped since the underlying intrinsics suppress faults on masked-off lanes regardless of alignment. Also: forward SVE compile-time masked load/store through the runtime path so the per-lane predicate is correct on SVE wider than 128 bits (the previous svdupq_b* path replicates a 128-bit chunk pattern across the vector).
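For readers skimming the diff, a minimal sketch of what the whole-vector-select fallback amounts to for the store direction (the function name and exact signature are illustrative, not the PR's kernel API):

```cpp
#include <xsimd/xsimd.hpp>

// Illustrative only: a generic masked store expressed as a full-width
// read-modify-write. Architectures with a native predicated store
// (AVX/AVX2 maskstore, AVX-512, SVE, RVV) override this with one intrinsic.
template <class T, class A = xsimd::default_arch>
void store_masked_sketch(T* mem, xsimd::batch<T, A> const& src,
                         xsimd::batch_bool<T, A> const& mask)
{
    auto previous = xsimd::batch<T, A>::load_unaligned(mem); // full-width read
    xsimd::select(mask, src, previous).store_unaligned(mem); // blend, full-width write
    // Unlike the native maskstore paths, this touches all batch<T, A>::size
    // elements, so the whole vector's memory range must be accessible.
}
```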
Sugar over runtime-mask load/store for loop head/tail remainders. Take ``n`` directly instead of a constructed batch_bool; only ``mem[0, n)`` is touched. ``head`` uses mask ``(1 << n) - 1``; ``tail`` uses ``((1 << n) - 1) << (size - n)`` with a base-pointer offset (via uintptr_t to dodge -Warray-bounds), so every arch with native predicated load/store inherits its intrinsic for free. Tested on sse2/sse41/avx2/avx512f/emulated256 native and neon64/rvv under qemu.
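A small sketch of the bit patterns described above (the helper names are hypothetical; the PR's actual helpers and their location differ):

```cpp
#include <cstddef>
#include <cstdint>

// head: lanes [0, n) active, e.g. n = 3 on an 8-lane batch -> 0b00000111
constexpr uint64_t head_mask(std::size_t n) noexcept
{
    return n >= 64 ? ~uint64_t(0) : (uint64_t(1) << n) - 1;
}

// tail: lanes [size - n, size) active, e.g. n = 3, size = 8 -> 0b11100000;
// the PR pairs this with a base-pointer offset so only the last n elements
// of the window are addressed.
constexpr uint64_t tail_mask(std::size_t n, std::size_t size) noexcept
{
    return head_mask(n) << (size - n);
}
```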
Mirror the AVX/AVX2 runtime-mask load_masked / store_masked overloads on the new 128-bit SSE-register variants of those ISAs:
- avx_128: float / double via _mm_maskload_ps/pd, _mm_maskstore_ps/pd
- avx2_128: 32/64-bit integers via _mm_maskload_epi32/64, _mm_maskstore_epi32/64

8/16-bit integers continue to fall through to the scalar common path (no native maskload/store intrinsic at those widths). Both alignment modes route to the same intrinsic since masked-off lanes do not fault.
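For context, the underlying 128-bit AVX intrinsics behave roughly as follows (raw intrinsics shown directly; the PR wraps them in xsimd kernels):

```cpp
#include <immintrin.h> // requires AVX (-mavx)

// Loads only the float lanes whose mask element has its most significant bit
// set; masked-off lanes are zeroed and never read from memory, which is why
// alignment and page boundaries do not matter for them.
inline __m128 maskload4f(float const* mem, __m128i mask)
{
    return _mm_maskload_ps(mem, mask);
}

// Stores only the active lanes; inactive lanes leave memory untouched.
inline void maskstore4f(float* mem, __m128i mask, __m128 src)
{
    _mm_maskstore_ps(mem, mask, src);
}
```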
Force-pushed from 7484c4b to d5f21c7
Could you split the head / tail part into another PR? This one is already quite dense...
// (AVX2 32/64-bit, AVX-512, SVE, RVV) override this with a single
// intrinsic that suppresses inactive-lane reads in hardware.
constexpr std::size_t size = batch<T, A>::size;
alignas(A::alignment()) std::array<T, size> buffer {};
to make it worse, building a mask is not always a single operation depending on the target...
this array assignment forces everything to zero, while some stores are not needed, and the compiler is not able to optimize this away in the generic case
XSIMD_INLINE std::enable_if_t<std::is_integral<T>::value && (sizeof(T) == 4 || sizeof(T) == 8), batch<T, A>>
load_masked(T const* mem, batch_bool<T, A> mask, convert<T>, Mode, requires_arch<avx2>) noexcept
{
    using int_t = std::conditional_t<sizeof(T) == 4, int32_t, long long>;
why long long and not int64_t? There's no guarantee that sizeof(long long) == 8
}
else
{
    _mm256_maskstore_epi64(reinterpret_cast<long long*>(mem), __m256i(mask), __m256i(src));
ok, I guess that's a constraint of the Intel intrinsic; at least static_assert that sizeof(long long) == 8 and sizeof(int) == 4 if you're using this to distinguish between the two?
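The assertion being asked for could look something like this (placement and message are illustrative):

```cpp
// Guard the long long / int distinction that the AVX2 maskload/maskstore
// intrinsics impose on the wrapper's type dispatch.
static_assert(sizeof(long long) == 8 && sizeof(int) == 4,
              "AVX2 maskload/maskstore wrappers assume 8-byte long long and 4-byte int");
```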
// constructs a 128-bit chunk predicate (svdupq_b{8,16,32,64}), which
// is replication-based and does not correctly express a per-lane
// mask on SVE wider than 128 bits — going through ``as_batch_bool``
// gives the right predicate for every vector width. ``int32``/
Do you know if the pmask approach would be faster? If so, we could still if constexpr its usage when the SVE size allows it.
* so partial loads across a page boundary are safe. \c stream_mode is not
* supported.
*
* \warning Runtime-mask loads carry a significant performance penalty on
I don't think we should go into details here:
- it's difficult to maintain this kind of documentation (what about newly added architectures?)
- the same applies to other operations and we don't document it there.
I think it's important to communicate that info, but until we have an automated way to do so, better not to just throw documentation at it.
static_assert(std::is_same<Mode, aligned_mode>::value || std::is_same<Mode, unaligned_mode>::value,
              "supported load mode");
constexpr uint64_t full_mask = details::full_mask(size);
const auto bits = mask.mask();
I'm unsure we want that extra call to mask(), which may be costly, plus the extra tests... if masking is supported, is it beneficial? If it's not, we're already slow...
static_assert(std::is_same<Mode, aligned_mode>::value || std::is_same<Mode, unaligned_mode>::value,
              "supported store mode");
constexpr uint64_t full_mask = details::full_mask(size);
const auto bits = mask.mask();