PERF: Use SIMD for read_csv C tokenizer#64515
Conversation
|
On today's dev call @jorisvandenbossche mentioned "highway" and "xsimd" as options to either depend on or vendor, rather than writing our own code. |
| #include <stdbool.h> | ||
| #include <stdlib.h> | ||
|
|
||
| #if defined(__ARM_NEON) || defined(__ARM_NEON__) |
There was a problem hiding this comment.
I gave @Alvaro-Kothe the same feedback on another PR; I'm hesitant to roll our own SIMD detection logic. I'm not an expert in the area, but the fact that there are so many libraries and tools that abstract SIMD makes me think its deceptively difficult to maintain
Meson has a module for detecting SIMD capabilities built into the tool; I'd prefer we understand the limitations of that before we try to roll our own
I mentioned similar reservations on a dev call a few weeks ago (Brock mentioned that briefly above). If we want to use SIMD, I think we should first investigate to use one of those libraries that provide an abstracted interface (as a starter, this also makes sure those functions are tested for all different architectures, which we (currently) don't do on our CI?). (I know numpy uses highway (https://numpy.org/neps/nep-0038-SIMD-optimizations.html, https://numpy.org/neps/nep-0054-simd-cpp-highway.html), and Arrow C++ is using xsimd) Since this has come up in multiple PRs now, @jbrockmendel or @Alvaro-Kothe, could one of you open an issue to discuss the use of SIMD in the pandas codebase? |
|
@Alvaro-Kothe can you take point on joris's discussion idea? If I do it, it'll end up being a very thin human wrapper over a copy-pasted claude text. @jorisvandenbossche @WillAyd while im on board with bigger-picture discussion about how to do simd, i don't think anyone is against doing it at all. In that case, can we move forward with this (and alvaro's work) and then if we decide to use xsimd/highway/etc we can do follow-up refactors? In both this and Alvaro's work, these are blockers to more perf-improvement work. |
fc4e670 to
8adf399
Compare
|
I also lean toward Joris' and Will's opinion about at least seeing/evaluating what all our SIMD options are before merging a PR enabling it. |
|
opened #64884 |
How is this PR a blocker for more perf-related work on csv? (unless it would be more simd improvements)
I think that can be part of the general discussion: we can discuss both what is feasible/practical on the short-term, as what we would want to investigate more for a better, longer term solution (e.g. I think this PR and Alvaro's are already using a different approach? we might want to stick to one on the short term) |
|
|
||
| #if defined(__ARM_NEON) || defined(__ARM_NEON__) | ||
| # include <arm_neon.h> | ||
| #elif defined(__SSE2__) || defined(__SSE2) || defined(_M_X64) || \ |
There was a problem hiding this comment.
What is __SSE2 for? (didn't directly find anything about it, only see __SSE2__ online)
There was a problem hiding this comment.
claude says its a non-standard variant some compilers define and is safe to remove. will do so.
| #if defined(__ARM_NEON) || defined(__ARM_NEON__) | ||
| # include <arm_neon.h> | ||
| #elif defined(__SSE2__) || defined(__SSE2) || defined(_M_X64) || \ | ||
| defined(_M_IX86) |
There was a problem hiding this comment.
I think this is not sufficient. Following https://learn.microsoft.com/en-us/cpp/preprocessor/predefined-macros?view=msvc-170, that defines the 32-bit x86 platform, but it is only _M_IX86_FP that defines the available instruction set, and that can theoretically be lower than SSE2.
I have seen defined(_M_IX86_FP) && (_M_IX86_FP >= 2) being used elsewhere
| # include <arm_neon.h> | ||
| #elif defined(__SSE2__) || defined(__SSE2) || defined(_M_X64) || \ | ||
| defined(_M_IX86) | ||
| # include <immintrin.h> |
There was a problem hiding this comment.
| # include <immintrin.h> | |
| # include <emmintrin.h> |
?
Since that gives the header specifically for SSE2, avoiding the risk to accidentally use a more recent instruction?
| #elif defined(__SSE2__) || defined(__SSE2) || defined(_M_X64) || \ | ||
| defined(_M_IX86) |
There was a problem hiding this comment.
Can we #define something like PANDAS_HAS_SSE2 when doing this check the first time above, so we don't have to repeat the full conditional every time?
|
|
||
| #if defined(__ARM_NEON) || defined(__ARM_NEON__) | ||
| if (!self->delim_whitespace) { | ||
| size_t remaining = self->datalen - (i + 1); |
There was a problem hiding this comment.
Just to better understand how the specific simd code is used (didn't look in detail at the surrounding code): this is adding some new logic, but doesn't seem to replace another line of code that would otherwise run now? (or put differently, there is no fallback?)
There was a problem hiding this comment.
claude: the preceding PUSH_CHAR(c) already handled the current byte. The SIMD block then scans ahead and bulk-copies any subsequent normal characters. If there are fewer than 16 bytes remaining or skip is 0, it does nothing and the byte-at-a-time loop continues normally. The fallback is the existing code
There was a problem hiding this comment.
Can you add some comments to clarify this for future readers? (like the existing comment for the "Scalar bulk scan fallback", but then for the simd code)
I'm holding back non-SIMD perf PRs to avoid having too many PRs a) touching nearby code and b) in general. |
8ccd75d to
9bfce01
Compare
| return i; | ||
| } | ||
|
|
||
| #elif defined(PANDAS_HAS_SSE2) |
There was a problem hiding this comment.
It is possible, but some profiling (on mac) suggests its slightly slower.
There was a problem hiding this comment.
Thanks for checking. Your CPU probably doesn't support SVE and the runtime check may have reduced performance a little bit.
b8eb550 to
cf43e61
Compare
| #include <stdbool.h> | ||
| #include <stdlib.h> | ||
|
|
||
| #if defined(__ARM_NEON) || defined(__ARM_NEON__) |
There was a problem hiding this comment.
| #if defined(__ARM_NEON) || defined(__ARM_NEON__) | |
| #if defined(__ARM_NEON) || defined(__ARM_NEON__) || defined(_M_ARM64) |
Windows ARM also supports NEON?
|
|
||
| #if defined(__ARM_NEON) || defined(__ARM_NEON__) | ||
| if (!self->delim_whitespace) { | ||
| size_t remaining = self->datalen - (i + 1); |
There was a problem hiding this comment.
Can you add some comments to clarify this for future readers? (like the existing comment for the "Scalar bulk scan fallback", but then for the simd code)
| #ifdef PANDAS_HAS_NEON | ||
| { | ||
| size_t remaining = self->datalen - (i + 1); | ||
| if (remaining >= 16) { | ||
| size_t skip = | ||
| fast_scan_quoted_neon(buf, remaining, vquote, vescape); | ||
| if (skip > 0) { | ||
| memcpy(stream, buf, skip); | ||
| stream += skip; | ||
| slen += skip; | ||
| buf += skip; | ||
| i += skip; | ||
| } | ||
| } | ||
| } | ||
| #elif defined(PANDAS_HAS_SSE2) | ||
| { | ||
| size_t remaining = self->datalen - (i + 1); | ||
| if (remaining >= 16) { | ||
| size_t skip = fast_scan_quoted_sse(buf, remaining, vquote, vescape); | ||
| if (skip > 0) { | ||
| memcpy(stream, buf, skip); |
There was a problem hiding this comment.
To avoid this duplicated code for both versions (in the end, only 1 line is different in both cases), we could also move the neon vs sse2 in a fast_scan_quoted_simd function that does that dispatch? And then here only need time the above block?
Co-authored-by: William Ayd <william.ayd@icloud.com>
|
@jbrockmendel on the dev call, I think you mentioned that using an abstraction layer (like xsimd or highway) would be overkill in this case or result in way more complex code (don't recall your exact phrasing, but something like this). |
For the last dev call I discussed the with claude and concluded it was overkill. This morning I asked it to imlpement a xsimd-based implementation, which I pushed. It isn't working because xsimd isn't wired up as a dependency, but can give us an idea of the tradeoffs involved. |
| for (; i + kStep <= len; i += kStep) { | ||
| const auto chunk = batch_u8::load_unaligned(p + i); | ||
| auto mask = (chunk == v[0]); | ||
| for (int j = 1; j < N; ++j) { |
There was a problem hiding this comment.
@Alvaro-Kothe is going through a similar exercise on his SIMD PR and I gave the feedback that this type of looping + bit fiddling is extremely inefficient. nanoarrow has built-in functionality to do this better and it sounds like @Alvaro-Kothe is researching something with xsimd (?) - let's keep tabs on that
There was a problem hiding this comment.
I don't think this was a problem with the non-xsimd variant? That felt lower-touch to me and I marginally prefer it.
Use NEON (ARM) and SSE2 (x86-64) intrinsics to accelerate the
tokenizer's inner loop. When processing normal characters in the
IN_FIELD and IN_QUOTED_FIELD states, scan 16 bytes at a time to
find the next special character (delimiter, line terminator, quote,
etc.) and bulk-copy the intervening bytes via memcpy instead of
one-at-a-time PUSH_CHAR.
Two scan variants:
- fast_scan_{neon,sse}: checks 6 special chars (delimiter, line
terminator, CR, quote, escape, comment) for unquoted fields
- fast_scan_quoted_{neon,sse}: checks only 2 chars (quote, escape)
for quoted fields, where delimiters are literal
Disabled for delim_whitespace=True (needs isblank() which checks
multiple characters). Falls back to scalar for fields shorter than
16 bytes and on architectures without NEON/SSE2.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace GCC/Clang-only __builtin_ctz with a portable _pandas_ctz wrapper that uses _BitScanForward on MSVC. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Define PANDAS_HAS_NEON / PANDAS_HAS_SSE2 macros to avoid repeating the full platform-detection conditional at every #ifdef site - Replace <immintrin.h> with <emmintrin.h> (SSE2-specific header) - Fix _M_IX86 check to require _M_IX86_FP >= 2 (actual SSE2 support) - Drop non-standard __SSE2 (without trailing underscores) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add _M_ARM64 to NEON detection for Windows ARM, deduplicate NEON/SSE2 call sites with unified dispatch macros, and add comments to the SIMD scan blocks. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…__builtin_ctzll Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the hand-rolled NEON and SSE2 intrinsics in the C tokenizer's
bulk-scan path with a small C++ shim that uses xsimd. The new
`simd_scan` interface lives in `src/parser/simd_scan.{h,cpp}` and
exposes a portable scanner API to `tokenizer.c`. xsimd selects the
best compile-time target and handles NEON-vs-SSE2 mask extraction
internally, so the platform `#ifdef` ladders, ctz portability
wrappers, and per-call broadcast setup in `tokenizer.c` all go away.
Behavior is unchanged: same 16-byte chunk size, same scalar fallback
for trailing bytes, same disable for `delim_whitespace=True`.
Depends on GH#65471, which adds the `xsimd_dep` meson dependency.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The tslibs/parsing extension links tokenizer.c but its meson entry was not updated alongside the other three tokenizer.c consumers (lib, pandas_parser, parsers), leaving pd_scanner_create / pd_scanner_destroy / pd_scanner_scan unresolved at import time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
a7beb5b to
615a2b4
Compare
Summary
Use NEON (ARM) and SSE2 (x86-64) SIMD intrinsics to accelerate the C tokenizer's inner loop. When the state machine is in
IN_FIELDorIN_QUOTED_FIELDprocessing a normal character, scan 16 bytes at a time to find the next special character andmemcpythe intervening bytes in bulk instead of one-at-a-timePUSH_CHAR.fast_scan_{neon,sse}: checks 6 special chars for unquoted fieldsfast_scan_quoted_{neon,sse}: checks only quote + escape for quoted fields (fewer comparisons)delim_whitespace=True; scalar fallback for fields <16 bytes and non-NEON/SSE2 architecturesPerformance
Manually verified benchmarks (median of 10, separate processes per branch):
Benchmarks that showed improvement in ASV but were manually verified as noise (no real change):
ReadCSVEngine.time_read_stringcsv('c'),ReadCSVCategorical.time_convert_direct('c'),ReadCSVDatePyarrowEngine— fields too short for SIMD or wrong engine.No regressions:
ReadCSVCParserLowMemory.peakmem_over_2gb_inputandReadCSVConcatDatetimeBadDateValuewere manually verified as measurement noise (allocation logic is unchanged).