
PERF: Add SIMD instructions with xsimd to reduce moments#64905

Open
Alvaro-Kothe wants to merge 33 commits into pandas-dev:main from Alvaro-Kothe:perf/skew-kurt-omp-xsimd

Conversation

Member

@Alvaro-Kothe Alvaro-Kothe commented Mar 28, 2026


Overview

This PR introduces SIMD for the moment accumulation used by skewness and kurtosis, using the xsimd library to wrap SIMD instructions and handle runtime dispatch. A meson option is added to disable SIMD, and CI verifies both the SIMD and scalar-only paths.

The algorithm is basically the Welford algorithm, computing central moments in a single pass.
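For reference, a minimal scalar sketch of that one-pass update (the struct and field names here are illustrative, not the PR's actual `Moments` layout; the higher-order terms follow the Wikipedia page linked from the source):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Illustrative scalar accumulator: n plus the first four central moment sums.
struct Moments {
  int64_t n = 0;
  double mean = 0.0;
  double m2 = 0.0;  // sum of (x - mean)^2
  double m3 = 0.0;  // sum of (x - mean)^3
  double m4 = 0.0;  // sum of (x - mean)^4
};

// One Welford-style update; m2..m4 must be updated from highest to lowest
// order so each uses the previous values of the lower moments.
void moments_add(Moments &m, double x) {
  const double n1 = static_cast<double>(m.n);
  m.n += 1;
  const double n = static_cast<double>(m.n);
  const double delta = x - m.mean;
  const double delta_n = delta / n;
  const double delta_n2 = delta_n * delta_n;
  const double term1 = delta * delta_n * n1;
  m.mean += delta_n;
  m.m4 += term1 * delta_n2 * (n * n - 3 * n + 3) + 6 * delta_n2 * m.m2 -
          4 * delta_n * m.m3;
  m.m3 += term1 * delta_n * (n - 2) - 3 * delta_n * m.m2;
  m.m2 += term1;
}
```

Population skewness then falls out as `sqrt(n) * m3 / pow(m2, 1.5)` and excess kurtosis as `n * m4 / (m2 * m2) - 3`, with the usual bias corrections applied on top.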

The test test_stat_method had to be modified: the order in which the values are accumulated changed, and due to floating-point arithmetic the test can't assert strict equality anymore.

Benchmark

AVX2 Benchmark

Change Before [3cfeda2] After [1e2ccc0] <perf/skew-kurt-omp-xsimd> Ratio Benchmark (Parameter)
- 4.33±0.3ms 3.81±0.5ms 0.88 gil.ParallelRolling.time_rolling('skew')
- 2.97±0.3ms 2.40±0.04ms 0.81 rolling.ForwardWindowMethods.time_rolling('Series', 10, 'float', 'kurt')
- 3.02±0.3ms 2.45±0.04ms 0.81 rolling.ForwardWindowMethods.time_rolling('Series', 10, 'int', 'kurt')
- 23.0±0.9μs 18.0±2μs 0.78 series_methods.NanOps.time_func('skew', 1000, 'Int64')
- 19.4±1μs 14.0±3μs 0.72 series_methods.NanOps.time_func('skew', 1000, 'int32')
- 19.5±0.6μs 13.4±0.8μs 0.69 series_methods.NanOps.time_func('skew', 1000, 'int8')
- 22.8±1μs 15.6±2μs 0.68 series_methods.NanOps.time_func('kurt', 1000, 'boolean')
- 23.1±1μs 15.4±2μs 0.67 series_methods.NanOps.time_func('kurt', 1000, 'Int64')
- 19.1±0.6μs 12.8±1μs 0.67 series_methods.NanOps.time_func('skew', 1000, 'float64')
- 19.6±1μs 13.0±1μs 0.66 series_methods.NanOps.time_func('skew', 1000, 'int64')
- 18.8±0.6μs 11.9±1μs 0.63 series_methods.NanOps.time_func('kurt', 1000, 'int64')
- 19.3±1μs 12.0±2μs 0.62 series_methods.NanOps.time_func('kurt', 1000, 'float64')
- 19.1±1μs 11.9±2μs 0.62 series_methods.NanOps.time_func('kurt', 1000, 'int32')
- 18.0±3μs 11.2±2μs 0.62 series_methods.NanOps.time_func('kurt', 1000, 'int8')
- 3.75±0.06ms 2.00±0.04ms 0.53 stat_ops.FrameOps.time_op('kurt', 'Int64', None)
- 4.25±0.4ms 1.94±0.04ms 0.46 stat_ops.FrameOps.time_op('skew', 'Int64', None)
- 7.55±0.05ms 3.38±0.03ms 0.45 series_methods.NanOps.time_func('skew', 1000000, 'Int64')
- 7.46±0.06ms 3.34±0.04ms 0.45 series_methods.NanOps.time_func('skew', 1000000, 'int64')
- 4.21±0.5ms 1.85±0.1ms 0.44 stat_ops.FrameOps.time_op('skew', 'int', None)
- 7.32±0.05ms 3.11±0.03ms 0.42 series_methods.NanOps.time_func('skew', 1000000, 'int32')
- 4.53±0.1ms 1.92±0.07ms 0.42 stat_ops.FrameOps.time_op('kurt', 'int', None)
- 7.24±0.09ms 3.00±0.3ms 0.41 series_methods.NanOps.time_func('skew', 1000000, 'boolean')
- 3.99±0.5ms 1.62±0.02ms 0.41 stat_ops.FrameOps.time_op('skew', 'Int64', 0)
- 4.41±0.06ms 1.66±0.03ms 0.38 stat_ops.FrameOps.time_op('kurt', 'Int64', 0)
- 3.71±0.5ms 1.40±0.02ms 0.38 stat_ops.FrameOps.time_op('kurt', 'int', 0)
- 8.83±1ms 3.29±0.3ms 0.37 series_methods.NanOps.time_func('kurt', 1000000, 'Int64')
- 8.46±1ms 3.13±0.03ms 0.37 series_methods.NanOps.time_func('kurt', 1000000, 'boolean')
- 8.25±1ms 3.06±0.07ms 0.37 series_methods.NanOps.time_func('skew', 1000000, 'int8')
- 759±7μs 279±40μs 0.37 stat_ops.SeriesOps.time_op('kurt', 'int')
- 735±5μs 271±40μs 0.37 stat_ops.SeriesOps.time_op('skew', 'int')
- 6.58±0.06ms 2.29±0.02ms 0.35 series_methods.NanOps.time_func('skew', 1000000, 'float64')
- 4.00±0.03ms 1.39±0.03ms 0.35 stat_ops.FrameOps.time_op('skew', 'int', 0)
- 2.97±0.1ms 1.02±0.1ms 0.34 stat_ops.FrameOps.time_op('kurt', 'float', 0)
- 9.84±0.2ms 3.15±0.3ms 0.32 series_methods.NanOps.time_func('kurt', 1000000, 'int64')
- 3.45±0.4ms 1.11±0.1ms 0.32 stat_ops.FrameMultiIndexOps.time_op('kurt')
- 3.41±0.4ms 1.07±0.1ms 0.32 stat_ops.FrameMultiIndexOps.time_op('skew')
- 705±4μs 222±30μs 0.32 stat_ops.SeriesMultiIndexOps.time_op('kurt')
- 684±4μs 217±30μs 0.32 stat_ops.SeriesMultiIndexOps.time_op('skew')
- 704±3μs 222±30μs 0.32 stat_ops.SeriesOps.time_op('kurt', 'float')
- 9.41±0.08ms 2.89±0.3ms 0.31 series_methods.NanOps.time_func('kurt', 1000000, 'int8')
- 681±5μs 214±30μs 0.31 stat_ops.SeriesOps.time_op('skew', 'float')
- 9.75±0.2ms 2.95±0.3ms 0.3 series_methods.NanOps.time_func('kurt', 1000000, 'int32')
- 3.30±0.4ms 971±100μs 0.29 stat_ops.FrameOps.time_op('skew', 'float', 0)
- 7.80±1ms 2.17±0.3ms 0.28 series_methods.NanOps.time_func('kurt', 1000000, 'float64')
- 3.28±0.4ms 897±100μs 0.27 stat_ops.FrameOps.time_op('kurt', 'float', None)
- 3.57±0.02ms 964±10μs 0.27 stat_ops.FrameOps.time_op('skew', 'float', None)

@Alvaro-Kothe Alvaro-Kothe force-pushed the perf/skew-kurt-omp-xsimd branch from 7c00f3e to e78fb18 Compare March 28, 2026 13:37
@jbrockmendel
Member

What does this do to wheel size and import-time memory footprint?

@Alvaro-Kothe
Member Author

Alvaro-Kothe commented Mar 28, 2026

I didn't see any changes in the wheel size, and the memory footprint at import time increased by 4 KiB.

Edit: I was looking at the wrong file. The wheel size decreased by 20 KiB.

@Alvaro-Kothe Alvaro-Kothe force-pushed the perf/skew-kurt-omp-xsimd branch 3 times, most recently from a6fad44 to 05dbcbe Compare March 31, 2026 00:34
@Alvaro-Kothe Alvaro-Kothe force-pushed the perf/skew-kurt-omp-xsimd branch from 95bbc20 to b0966e8 Compare April 3, 2026 13:43
@jbrockmendel jbrockmendel added the Performance Memory or execution speed performance label Apr 6, 2026
@jbrockmendel
Member

@Alvaro-Kothe can you attend the dev call on the 22nd? I think that's the time+place to convince the rest of the team to move forward with SIMD.

@Alvaro-Kothe
Member Author

@Alvaro-Kothe can you attend the dev call on the 22nd?

Yeah, I can probably attend.

@jorisvandenbossche
Member

@Alvaro-Kothe on #64582 (comment) you said:

@WillAyd I did some experimentation with meson's simd module on #64905 and one of the problems that I found is that it doesn't detect neon support (mesonbuild/meson#11209) and it doesn't support AVX512 instructions (mesonbuild/meson#2085).

That makes it essentially a non-starter to use meson-simd?

BTW, as far as I understand, numpy essentially implemented their own simd meson module (presumably because the upstream one wasn't sufficient or marked as experimental). No idea how reusable that would be or if there are plans to separate it from numpy.

@jorisvandenbossche
Member

@Alvaro-Kothe can you attend the dev call on the 22nd? I think that's the time+place to convince the rest of the team to move forward with SIMD.

@jbrockmendel please also make the case for what you want to do on the issue you opened. It will definitely be useful to talk about the topic tomorrow, but for such an important topic it is also important to have a written account of that and async discussion.

@Alvaro-Kothe
Member Author

That makes it essentially a non-starter to use meson-simd?

I found meson's unstable-simd module limited (in terms of supported architectures) and risky due to its lack of backward compatibility; it even warns:

../../pandas/_libs/meson.build:59: WARNING: Module SIMD has no backwards or forwards compatibility and might not exist in future releases.

BTW, as far as I understand, numpy essentially implemented their own simd meson module (presumably because the upstream one wasn't sufficient or marked as experimental). No idea how reusable that would be or if there are plans to separate it from numpy.

In the NumPy fork of meson, they created a features module to handle their multiple SIMD architecture needs.

Here, at least for x86 and ARM, it seems possible to replace the built-in meson module with a dictionary and a foreach loop. This gives us full control over compiler flags and the necessary meta-programming, similar to the approach used by Krita to handle their xsimd targets.

SIMD Build Sketch

# SIMD architecture configuration: Map name to compiler flags
simd_arch_config = {}

if host_machine.cpu_family() in ['x86', 'x86_64']
    simd_arch_config += {
        'sse2':    {'flags': cxx.get_id() == 'msvc' ? ['/arch:SSE2'] : ['-msse2']},
        'avx2':    {'flags': cxx.get_id() == 'msvc' ? ['/arch:AVX2'] : ['-mavx2']},
        'avx512f': {'flags': cxx.get_id() == 'msvc' ? ['/arch:AVX512'] : ['-mavx512f']},
    }
elif host_machine.cpu_family() == 'aarch64'
    simd_arch_config += {
        'neon64':  {'flags': []} # Baseline for aarch64
    }
endif

simd_libs = []
simd_config = configuration_data()

foreach name, cfg : simd_arch_config
    # Create a unique source file for each instruction set to avoid linking errors
    src = fs.copyfile(
        'my_simd_module_instantiation_.cpp',
        'my_simd_module_@0@.cpp'.format(name),
    )

    # Compile a static library for this specific architecture
    lib = static_library(
        'my_simd_module_@0@'.format(name),
        src,
        include_directories: [templates_includes],
        dependencies: [xsimd_dep],
        cpp_args: cfg['flags'],
    )
    simd_libs += lib

    # Define preprocessor macro (e.g., PANDAS_HAVE_AVX2)
    simd_config.set('PANDAS_HAVE_@0@'.format(name.to_upper()), 1)
endforeach

configure_file(
    output: 'my_simdconfig.h',
    configuration: simd_config,
)

my_simd_libraries_dep = declare_dependency(
    link_with: simd_libs,
    include_directories: include_directories('.'),
    dependencies: [xsimd_dep],
)

@jorisvandenbossche
Member

BTW, it seems that xsimd is also currently adding more CPU feature detection (xtensor-stack/xsimd#1245, and a bunch of recent PRs that were merged)

@Alvaro-Kothe Alvaro-Kothe force-pushed the perf/skew-kurt-omp-xsimd branch from d680ec2 to 4389094 Compare May 2, 2026 18:33
@Alvaro-Kothe Alvaro-Kothe changed the title PERF: [POC] use xsimd with meson simd module to reduce moments PERF: [POC] Add SIMD instructions with xsimd to reduce moments May 2, 2026
@Alvaro-Kothe Alvaro-Kothe force-pushed the perf/skew-kurt-omp-xsimd branch 2 times, most recently from 2dbb13e to 3a1e345 Compare May 4, 2026 22:08
@Alvaro-Kothe Alvaro-Kothe changed the title PERF: [POC] Add SIMD instructions with xsimd to reduce moments PERF: Add SIMD instructions with xsimd to reduce moments May 4, 2026
@Alvaro-Kothe Alvaro-Kothe marked this pull request as ready for review May 4, 2026 23:01
@Alvaro-Kothe Alvaro-Kothe requested a review from mroeschke as a code owner May 4, 2026 23:01
@Alvaro-Kothe Alvaro-Kothe force-pushed the perf/skew-kurt-omp-xsimd branch 2 times, most recently from ab16e69 to 1d7d462 Compare May 5, 2026 03:17
Member

@WillAyd WillAyd left a comment

@Alvaro-Kothe is there a way we can split this into smaller chunks? As is, this is a huge change and will need a good deal of revision

Maybe it's best to get SIMD detection / xsimd integration done as precursors and then come back to the Moments impl?

Comment thread pandas/_libs/src/moments.cpp Outdated
/// https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics
void moments_merge(Moments *acc, const Moments *src, int max_moment) {
if (acc->n == 0) {
acc->n = src->n;
Member

This should be a memcpy not a bunch of individual assignments

Member Author

Changed.

Comment thread pandas/_libs/src/moments.cpp Outdated
return;
}

double n_a = (double)acc->n;
Member

Suggested change
double n_a = (double)acc->n;
double n_a = acc->n;

These are C-style casts that should not be used in cpp, but also are unnecessary

Member Author

This gives warnings when compiling with -Wconversion; changed to static_cast.

Comment thread pandas/_libs/src/moments.cpp Outdated
#include <math.h>
#include <stdint.h>

extern "C" {
Member

Suggested change
extern "C" {

The extern block belongs in the header

Comment thread pandas/_libs/meson.build Outdated
supported_simd_archs = {}
if get_option('simd').allowed()
foreach name, flags : simd_arch_flags
if host_machine.cpu_family() in ['x86', 'x86_64']
Member

Are you sure this is correct? I am wary of maintaining our own logic for which systems support which instruction sets

Member Author

For this initial PR I was trying to limit the architectures to the ones we have wheels for.

For the instruction sets: SSE2 is always available on 64-bit x86 but may not be available on 32-bit, so this is basically creating a minimum CPU requirement of SSE2 for x86 pandas builds.

Comment thread pandas/_libs/meson.build Outdated
moments_config.set('PANDAS_HAVE_SCALAR', 1)
endif

configure_file(
Member

I think you will want to install this file. I'd also suggest using a prefix of pandas/ - not that it's going to be installed onto a real system, but it's probably best to follow that convention so meson-python intercepts it consistently.

Member Author

I think you will want to install this file.

From what I've seen, this would only be included in the wheel, and since we are not installing header files, installing this config file doesn't bring any benefit IMO.

Here are the headers included in the wheel if I were to add this change:

$ unzip -l dist/pandas-3.1.0.dev0+840.g9b7a79ac4b.dirty-cp314-cp314-linux_x86_64.whl "*.h"
Archive:  dist/pandas-3.1.0.dev0+840.g9b7a79ac4b.dirty-cp314-cp314-linux_x86_64.whl
  Length      Date    Time    Name
---------  ---------- -----   ----
      197  05-05-2026 13:25   pandas/_libs/moments_simdconfig.h
---------                     -------
      197                     1 file

I'd also suggest using a prefix of pandas/

Specifying directories in configure_file is only available since meson 1.10.0. For prior versions, I'd have to move this configuration logic to another meson.build file.

I do agree it's better to have the configuration in another directory, so I will make this change.

Comment thread pandas/_libs/meson.build Outdated
m_dep = cc.find_library('m', required: false)
fast_float = subproject('fast_float')
fast_float_dep = fast_float.get_variable('fast_float_dep')
xsimd = subproject('xsimd')
Member

I would still prefer the dependency syntax we have used on other PRs. If we wanted it tied to a particular version we could do that

return moments_acc;
}

#define MOMENTS_EXTERN_TEMPLATE(ARCH) \
Member

I'm not sure I follow these patterns - using a macro to then instantiate a template seems to really mix up C / C++ patterns. Can't the template take care of this by itself?

Member

Yea let's get rid of these extern declarations and move what's in the module to the header. That's less performant from a compilation perspective, but I don't think that's a huge deal. This feels like a premature optimization.

Regarding the mix of macros/templates, let's prefer C++17 features. I think something like this in the config file:

constexpr bool kEnableAVX512CD = @PANDAS_HAVE_AVX512CD@;
constexpr bool kEnableAVX2 = @PANDAS_HAVE_AVX2@;
...

Could simplify the header a lot. To the effect of

constexpr bool kUseSimd = kEnableAVX512CD || kEnableAVX2;

struct accumulate_moments_simd {
  template <class Arch>
  Moments operator()([[maybe_unused]] Arch, const double *values, std::size_t n,
                     int skipna, const uint8_t *mask, int max_moment) {
    if constexpr(kUseSimd) {
      // SIMD implementation that forwards the ARCH type to xsimd
    } else {
      // Non-SIMD implementation (uses xsimd::common?)
    }
  }
};

Sorry if misreading - the layers of indirection here are hard to follow, so I think there's a more concise way of expressing

Member

C++20 might make the intent even clearer if we wanted separate methods:

struct accumulate_moments_simd {
  template <class Arch> requires (kUseSimd)
  Moments operator()(Arch, const double *values, std::size_t n,
                     int skipna, const uint8_t *mask, int max_moment) {
      ...
  }

  Moments operator()(xsimd::common, const double *values, std::size_t n,
                     int skipna, const uint8_t *mask, int max_moment) {
      ...
  }
};

There might even be some constexpr vector / enum tricks to be had instead of having to send the Arch type and maintain a constant for allowed SIMD types separately

Member Author

I am having a hard time understanding this one. How would the template be instantiated?

Just to clarify the purpose of MOMENTS_EXTERN_TEMPLATE:

The dispatch method tries to implicitly instantiate all architectures in pandas::moments::arch_list, but it won't compile them without the proper flags. So I created this macro to define external linkage and prevent implicit instantiation.

For MOMENTS_INSTANTIATE:

It's exactly what the name says: it instantiates the template for the target architecture. moments_simd.cpp is compiled multiple times with different compiler flags, selecting the proper instantiation with -DPANDAS_SIMD_<ARCH>; here is the relevant meson instruction:

moments_libs += static_library(
    'moments_simd_@0@'.format(arch_name),
    'src/moments_simd.cpp',
    include_directories: [inc_pd],
    dependencies: [xsimd_dep],
    cpp_args: arch_flags + [
        '-DPANDAS_SIMD_@0@'.format(arch_name.to_upper()),
    ],
)

Member

The dispatch method tries to implicitly instantiate all architectures in pandas::moments::arch_list, but it won't compile them without the proper flags. So I created this macro to define external linkage and prevent implicit instantiation.

It could be my lack of understanding but I don't think this is a very common pattern - is that documented by xsimd? I think changing the linkage is more of a compile-time performance optimization, which would barely register in our code base but adds a lot of indirection. In the general case of C++ I would expect the header file to contain the template declaration / instantiations

I am having a hard time to understand this one. How would the template be instantiated?

I don't think the pattern would be any different. If you wanted to lean into the constexpr pattern, you could replace:

#if PANDAS_HAVE_AVX512CD
    ::add<xsimd::avx512cd>
#endif

with

if constexpr (kEnableAVX512CD) {
    ::add<xsimd::avx512cd>
}

But I don't think that matters much for the rest of the implementation

Member

@WillAyd WillAyd May 6, 2026

Not familiar with Krita but on a quick scan I see that the xsimd implementation makes heavy use of macros in the header file to control the implementation:

https://invent.kde.org/kenoi/krita/-/blob/master/libs/pigment/KoOptimizedPixelDataScalerU8ToU16.h

That is not what's going on in this PR, which I think is in a middle state. I would prefer the Arrow approach as it's a closer project to pandas.

Member Author

I would prefer the Arrow approach as it's a closer project to pandas

Fair point. Changed.

Member

Cool thanks. FWIW it's also the documented approach for xsimd (I'm learning about this as we go, so appreciate your patience):

https://xsimd.readthedocs.io/en/latest/api/dispatching.html#arch-dispatching

I'd also suggest we get rid of the MOMENTS_INSTANTIATE macro in each file - you can just define the template with the appropriate architecture set

Member Author

Should I remove the MOMENTS_EXTERN_TEMPLATE macro too?

Member

Yea - let's try to limit our macro use.

Comment thread pandas/_libs/src/moments_simd.cpp Outdated

namespace pandas::moments {

#if defined(PANDAS_SIMD_AVX512CD)
Member

If we get rid of the macro wrapping a template as described in another comment then this file seems unnecessary - I think this should all be instantiated in the header

}

template <class Arch>
Moments accumulate_moments_simd::operator()(Arch /*unused*/,
Member

Suggested change
Moments accumulate_moments_simd::operator()(Arch /*unused*/,
Moments accumulate_moments_simd::operator()([[maybe_unused]] Arch,

Is this throwing a warning now?

Member Author

I added /*unused*/ to make clang-tidy happy.

It isn't possible to use [[maybe_unused]] there; AFAIK, this attribute is for functions and class members.

../../pandas/_libs/include/pandas/moments_simd.hpp|157 col 27| warning: attribute ignored [-Wattributes]
||   157 |   Moments operator()(Arch [[maybe_unused]], const double *values, std::size_t n,
||       |                           ^

Member

I think that's because you have it backwards - should be [[maybe_unused]] Arch not Arch [[maybe_unused]]

Member Author

Indeed, it was swapped, but it didn't fix the clang-tidy warning

All parameters should be named in a function [readability-named-parameter]

Anyway, guess it can be ignored.


struct accumulate_moments_simd {
template <class Arch>
Moments operator()(Arch /*unused*/, const double *values, std::size_t n,
Member

@WillAyd WillAyd May 5, 2026

Suggested change
Moments operator()(Arch /*unused*/, const double *values, std::size_t n,
Moments operator()(Arch /*unused*/, std::vector<double> values, std::size_t n,

Unless there's a need for extern linkage (which isn't possible with a templated function anyway) we should be using the standard C++ types.

Member Author

The dispatch function (reduce_moments) needs "C" linkage to be usable from Cython. AFAIK, a std::vector can't adopt an existing buffer without copying, and I don't think it's worth copying the data in values into a container.

Member

Where is reduce_moments defined?

Generally the strategy should be to reduce the size of the C interface when dealing with C++. So I'd be interested to see the call site in Cython and see what we can do to elide the extern "C" requirement, since AFAIU Cython can invoke C++ templates directly.

If it helps, we could also consider bumping to C++20 and taking advantage of std::span, which essentially declares this as requiring a non-owning view into a buffer of some size

Member

@WillAyd WillAyd May 5, 2026

Ah I see it in algos.pyx - would it make more sense to have an algos_cpp.pyx file that can be more C++ native?

Just curious as I have limited experience with Cython's C++ wrapper. If that's infeasible then I think we just need to re-evaluate the size of the extern declaration, and perhaps create a shim that will take something like the raw C storage and size and convert it to a std::span for the rest of the implementation to use

Member Author

Where is reduce_moments defined?

It's defined in moments.cpp.

Cython can invoke C++ templates directly

This is correct, but I think it requires transpiling algos.pyx to C++ instead of C.

Member Author

@Alvaro-Kothe Alvaro-Kothe left a comment

is there a way we can split this into smaller chunks?

Sure, will move SIMD detection to another PR.



@@ -0,0 +1,11 @@
project(
Member Author

I submitted it a few weeks ago: mesonbuild/wrapdb#2705

)

xsimd_inc = include_directories('include')

Member Author

Not for now, since we are not adding xtl to deal with complex numbers.

version: '14.1.0',
)

xsimd_inc = include_directories('include')
Member Author

The original goal of this meson.build file was basically to vendor xsimd. If I understand correctly, you are saying to add install_headers there, but IMO that would complicate installing pandas, requiring -Cinstall-args=--skip-subprojects to build the wheel and to install from source with pip install.

Comment thread on `meson.options` (outdated):

```diff
@@ -0,0 +1,5 @@
option(
'simd',
type: 'feature',
```

**Member Author:** `auto_features` is `auto` by default.

@Alvaro-Kothe force-pushed the perf/skew-kurt-omp-xsimd branch from 3256fd3 to 9cda6c0 on May 8, 2026, 19:05

```cpp
/// Pack bits from boolean mask.
///
/// Adapted from nanoarrow.
```

**Member:** I wasn't asking to vendor this, but rather just to add nanoarrow and call the function. There are likely many more uses for nanoarrow in the future anyway, if you plan on doing more high-performance calls like this in C++.

```cpp
///
/// Adapted from nanoarrow.
/// <https://github.com/apache/arrow-nanoarrow/blob/241764644f15f9d9a94754b9d28b556666385bd1/src/nanoarrow/common/inline_buffer.h#L356-L361>
template <std::size_t N>
```

**Member:** The proper way to call this would be to unpack 64 bits at a time and adjust the calling logic appropriately. Only unpacking 2 bits at a time is extremely slow.
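To illustrate the suggestion, a minimal scalar sketch (hypothetical code, not the PR's or nanoarrow's actual implementation) of unpacking a whole 64-bit word of the validity bitmask per step might look like:

```cpp
#include <array>
#include <cstdint>

// Hypothetical illustration: expand one 64-bit validity bitmask into 64
// bytes (each 0 or 1), so the caller consumes 64 mask bits per iteration
// instead of a couple of bits at a time.
std::array<std::uint8_t, 64> unpack_bitmask64(std::uint64_t bits) {
  std::array<std::uint8_t, 64> out{};
  for (int i = 0; i < 64; ++i) {
    out[i] = static_cast<std::uint8_t>((bits >> i) & 1U);
  }
  return out;
}
```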

**Member Author:** It was simpler to reason about processing the mask at the same rate as I process the values.

> The proper way to call this would be to unpack 64 bits

IMO, it would be more natural (and probably more performant) to use SIMD instructions to process the mask instead of unpacking it into a 64-bit bitmask. I'll check if it's feasible.

**Member Author:** It's a little rough around the edges, but it seems to be working. The core idea is to use `xsimd::widen` to distribute the batches of `uint8_t` into batches of `uint64_t`, which are then converted into packed doubles.

**Member** (@WillAyd, May 12, 2026): Hmm, why is a double required? That's a huge amount of memory overhead, particularly if starting with a bitmask.

If xsimd can't support a bit/byte mask conversion, then just stick with nanoarrow; it seems unlikely that widening further is worth it.

**Member Author:**

> Hmm why is a double required? That's a huge amount of memory overhead, particularly if starting with a bitmask

In SSE2, there aren't instructions that use a bitmask. Instead, masking is performed with vector masks, where each lane must match the width of the data being processed. Since we are computing on `double` values, any mask must be widened to 64 bits per element.

> If xsimd can't support a bit/byte mask conversion then just stick with nanoarrow - it seems unlikely that widening further is worth it

If we were to use a bitmask with `xsimd::from_mask`, xsimd's current AVX2 implementation constructs an array and performs a load (https://github.com/xtensor-stack/xsimd/blob/a9039449fdfd3cb4816c6c33c45deebf7183af29/include/xsimd/arch/xsimd_avx.hpp#L616-L639).

That seems worse than widening, IMO: it has to perform many more loads and comparisons to build the bitmask, then for each block it must do a bit shift plus an AND to generate the correct mask for the batch, and it still ends up constructing the vector mask manually.

> That's a huge amount of memory overhead

If memory pressure is a concern, it's probably better to write this function directly in `accumulate_moments_simd_masked_impl` to reduce register pressure, since apparently this function wasn't inlined.

Generated assembly:

```asm
_ZN6pandas7moments6detail17convert_u8_to_u64IN5xsimd4avx2EEESt5arrayINS3_5batchImT_EELm8EENS6_IhS7_EE:
.LFB16169:
	.cfi_startproc
	vmovdqa	%xmm0, %xmm1
	vextracti128	$0x1, %ymm0, %xmm0
	movq	%rdi, %rax
	vpmovzxbw	%xmm1, %ymm1
	vpmovzxbw	%xmm0, %ymm0
	vmovdqa	%xmm1, %xmm3
	vmovdqa	%xmm0, %xmm2
	vextracti128	$0x1, %ymm1, %xmm1
	vextracti128	$0x1, %ymm0, %xmm0
	vpmovzxwd	%xmm3, %ymm3
	vpmovzxwd	%xmm1, %ymm1
	vpmovzxwd	%xmm2, %ymm2
	vpmovzxwd	%xmm0, %ymm0
	vpmovzxdq	%xmm3, %ymm6
	vpmovzxdq	%xmm1, %ymm7
	vpmovzxdq	%xmm2, %ymm5
	vpmovzxdq	%xmm0, %ymm4
	vmovdqa	%ymm6, (%rdi)
	vextracti128	$0x1, %ymm3, %xmm3
	vextracti128	$0x1, %ymm1, %xmm1
	vextracti128	$0x1, %ymm2, %xmm2
	vmovdqa	%ymm7, 64(%rdi)
	vextracti128	$0x1, %ymm0, %xmm0
	vpmovzxdq	%xmm3, %ymm3
	vpmovzxdq	%xmm1, %ymm1
	vmovdqa	%ymm5, 128(%rdi)
	vpmovzxdq	%xmm2, %ymm2
	vpmovzxdq	%xmm0, %ymm0
	vmovdqa	%ymm3, 32(%rdi)
	vmovdqa	%ymm1, 96(%rdi)
	vmovdqa	%ymm2, 160(%rdi)
	vmovdqa	%ymm4, 192(%rdi)
	vmovdqa	%ymm0, 224(%rdi)
	ret
```

**Member:** I think we are making this very complicated; nanoarrow has a function for packing/unpacking as needed that solves the problem. Let's just start with that and leave it to a future enhancement to do something else.

**Member Author:** I am using the xsimd function `batch_bool.mask()` to pack the bits.

> nanoarrow has a function for packing/unpacking as needed that solves the problem. Let's just start with that and leave it to a future enhancement to do something else

The main problem is that this PR already introduces a lot of changes:

1. Adding a new C++ dependency
2. Adding SIMD
3. Raising the standard to C++20

Adding arrow/nanoarrow as a C++ dependency would ramp up complexity more than it should, IMO, for a PR whose goal is to start introducing SIMD in the codebase.

**Member:** I also am trying to minimize the amount of change. nanoarrow is installable via:

```shell
meson wrap install nanoarrow
```

can be added as a dependency via:

```meson
nanoarrow_dep = dependency('nanoarrow')
```

and can replace all of the bit packing/unpacking code that you have here.

Overall I feel like that is a much simpler solution, and it prevents us from maintaining bit-fiddling code ourselves.

**Member Author:**

> can replace all of the bit packing/unpacking code that you have here.

The bit unpacking here seems minimal and uses SIMD instructions specific to the architecture in use.

Load and compare several values of the mask at once:

```cpp
const mask_batch_type mask8 = xsimd::load_unaligned<A>(&mask[right]);
xsimd::batch_bool<uint8_t, A> isna_mask = mask8 != mask_batch_type(0U);
if (!xsimd::any(isna_mask)) {
  continue;
}
```

Pack it with `std::uint64_t isna_bitmask = isna_mask.mask();`

And create a mask with:

```cpp
auto isna_pd = xsimd::batch_bool<double, A>::from_mask(
    (isna_bitmask >> i) & ((step * 2) - 1));
```
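The same flow can be emulated in scalar code. The helpers below are an illustration only, not the PR's SIMD code, and the lane-mask expression uses `(1 << lanes) - 1` rather than the snippet's `(step * 2) - 1`: first pack a byte mask into a bitmask, then slice out a per-batch lane mask.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical scalar emulation of the SIMD flow above: pack a byte mask
// (one byte per element, nonzero = NA) into a 64-bit bitmask.
std::uint64_t pack_byte_mask(const std::uint8_t* mask, std::size_t n) {
  std::uint64_t bits = 0;
  for (std::size_t i = 0; i < n; ++i) {
    if (mask[i] != 0) {
      bits |= std::uint64_t{1} << i;
    }
  }
  return bits;
}

// Extract the `lanes`-bit mask for the batch of doubles starting at lane `i`.
std::uint64_t batch_mask(std::uint64_t bits, unsigned i, unsigned lanes) {
  return (bits >> i) & ((std::uint64_t{1} << lanes) - 1);
}
```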

**Member** (@jbrockmendel): I haven't followed too closely and am happy to defer to you two on most of this. I just want to note that our bus factor is higher in C than in C++ (and higher yet in Cython). That shouldn't be the main factor, but it can serve as a tie-breaker if one is needed.

@Alvaro-Kothe force-pushed the perf/skew-kurt-omp-xsimd branch from 8659a91 to 8ca9a2c on May 12, 2026, 02:01
```cpp
std::optional<std::span<const uint8_t>>, int) noexcept;
#endif

using arch_list = xsimd::arch_list<>
```
**Member:** I can see where that is confusing. It looks like xsimd offers an `xsimd::detail::supported` type which can be used to filter an `arch_list` from a set of types down to those that are actually supported. I'm not sure why that is in the `detail` namespace, but it appears to have the missing functionality.

You may want to ask upstream why that isn't public, and whether there's a desire to make it so in the future. In the meantime you could still just use it, as it's available in a header file.

Keep in mind that if you stick to that pattern, you can leverage other parts of xsimd's static type system, like `xsimd::arch_list::best`. Things like that are very difficult to represent with macros, so we want to implement design patterns that we will use consistently.
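The kind of compile-time filtering being described can be sketched with plain type-list machinery. This is a hypothetical hand-rolled filter over made-up arch tags, not xsimd's actual implementation:

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical arch tags; supported() stands in for a capability check.
struct sse2   { static constexpr bool supported() { return true; } };
struct avx2   { static constexpr bool supported() { return true; } };
struct avx512 { static constexpr bool supported() { return false; } };

template <class... Archs>
struct arch_list {
  static constexpr std::size_t size = sizeof...(Archs);
};

// Filter an arch_list down to the archs whose supported() is true.
template <class Kept, class... Rest>
struct filter_supported;

template <class... Kept>
struct filter_supported<arch_list<Kept...>> {
  using type = arch_list<Kept...>;
};

template <class... Kept, class Head, class... Tail>
struct filter_supported<arch_list<Kept...>, Head, Tail...> {
  using type = std::conditional_t<
      Head::supported(),
      typename filter_supported<arch_list<Kept..., Head>, Tail...>::type,
      typename filter_supported<arch_list<Kept...>, Tail...>::type>;
};

using supported_archs =
    filter_supported<arch_list<>, sse2, avx2, avx512>::type;
static_assert(supported_archs::size == 2);
```

The point of the pattern is that the filtered list stays a first-class type, so further selection (a `best`-style pick of the first entry, for example) composes in the type system rather than through macros.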

- Reduce code duplication
- Improve whitespace distribution
- Improve naming
@Alvaro-Kothe force-pushed the perf/skew-kurt-omp-xsimd branch from 3510676 to 381be60 on May 13, 2026, 00:24

**Labels:** Build (Library building on various platforms), Performance (Memory or execution speed performance)

**Projects:** None yet

4 participants