PERF: Use SIMD for read_csv C tokenizer by jbrockmendel · Pull Request #64515 · pandas-dev/pandas

jbrockmendel · 2026-03-11T00:21:37Z

Summary

Use NEON (ARM) and SSE2 (x86-64) SIMD intrinsics to accelerate the C tokenizer's inner loop. When the state machine is in IN_FIELD or IN_QUOTED_FIELD processing a normal character, scan 16 bytes at a time to find the next special character and memcpy the intervening bytes in bulk instead of one-at-a-time PUSH_CHAR.

fast_scan_{neon,sse}: checks 6 special chars for unquoted fields
fast_scan_quoted_{neon,sse}: checks only quote + escape for quoted fields (fewer comparisons)
Disabled for delim_whitespace=True; scalar fallback for fields <16 bytes and non-NEON/SSE2 architectures
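
For readers new to the technique, the unquoted-field scan described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `sketch_fast_scan`, `sketch_is_special`, and `SKETCH_HAS_SSE2` are hypothetical names, only an SSE2 path is shown, and the special bytes are passed as a plain array. Unlike the PR, which appears to leave short tails to the existing byte-at-a-time loop, the sketch finishes the tail with a scalar loop so it returns the same answer on any architecture.

```c
#include <stddef.h>

#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && (_M_IX86_FP >= 2))
#  include <emmintrin.h>
#  define SKETCH_HAS_SSE2 1 /* hypothetical macro, for this sketch only */
#endif

/* Hypothetical helper: 1 if c equals one of the n special bytes. */
static int sketch_is_special(unsigned char c, const unsigned char *special,
                             size_t n) {
  for (size_t j = 0; j < n; ++j) {
    if (c == special[j]) return 1;
  }
  return 0;
}

/* Return how many leading bytes of buf[0..len) are "normal", i.e. the
 * offset of the first special byte, or len if there is none. */
static size_t sketch_fast_scan(const char *buf, size_t len,
                               const unsigned char *special, size_t n) {
  size_t i = 0;
#ifdef SKETCH_HAS_SSE2
  for (; i + 16 <= len; i += 16) {
    /* Load 16 bytes and compare them against every special byte. */
    __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
    __m128i hits = _mm_setzero_si128();
    for (size_t j = 0; j < n; ++j) {
      __m128i needle = _mm_set1_epi8((char)special[j]);
      hits = _mm_or_si128(hits, _mm_cmpeq_epi8(chunk, needle));
    }
    /* One bit per byte; the lowest set bit marks the first special byte. */
    int mask = _mm_movemask_epi8(hits);
    if (mask != 0) {
      size_t off = 0;
      while (((mask >> off) & 1) == 0) ++off;
      return i + off;
    }
  }
#endif
  /* Scalar path: short inputs, trailing bytes, and non-SSE2 builds. */
  for (; i < len; ++i) {
    if (sketch_is_special((unsigned char)buf[i], special, n)) break;
  }
  return i;
}
```

The caller can then copy the returned number of bytes in one `memcpy` rather than pushing them one at a time.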

Performance

Manually verified benchmarks (median of 10, separate processes per branch):

# --- Long quoted Unicode strings (MemMapUTF8) ---
# setup
import pandas as pd, numpy as np, tempfile, os
from pandas import DataFrame, concat, date_range, read_csv
lines = []
for lnum in range(ord(" "), ord("\U00010080"), 128):
    line = "".join([chr(c) for c in range(lnum, lnum + 0x80)]) + "\n"
    try:
        line.encode("utf-8")
    except UnicodeEncodeError:
        continue
    lines.append(line)
df = concat([DataFrame(lines) for _ in range(100)], ignore_index=True)
fname = os.path.join(tempfile.gettempdir(), "bench_utf8.csv")
df.to_csv(fname, index=False, header=False, encoding="utf-8")

%timeit read_csv(fname, header=None, memory_map=True, encoding="utf-8", engine="c")
30.9 ms ± 1.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  # <- PR
56.7 ms ± 2.82 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  # <- main

# --- Numeric data with pipe separator (Thousands) ---
# setup
np.random.seed(42)
data = np.random.randn(10000, 8) * np.random.randint(100, 10000, (10000, 8))
df = DataFrame(data)
fname = os.path.join(tempfile.gettempdir(), "bench_thou.csv")
df.to_csv(fname, sep="|")

%timeit read_csv(fname, sep="|", thousands=None, engine="c")
4.77 ms ± 198 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <- PR
5.95 ms ± 265 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <- main

# --- Wide string fields ---
# setup
np.random.seed(42)
data = np.random.choice(
    ["abcdefghijklmnopqrstuvwxyz0123456789" * 3,
     "the quick brown fox jumps over the lazy dog " * 2,
     "Lorem ipsum dolor sit amet consectetur adipiscing" * 2],
    size=(200_000, 20),
)
df = DataFrame(data, columns=[f"col_{i}" for i in range(20)])
fname = os.path.join(tempfile.gettempdir(), "bench_wide.csv")
df.to_csv(fname, index=False)

%timeit read_csv(fname, engine="c")
1.03 s ± 8.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # <- PR
1.61 s ± 23.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # <- main

Benchmarks that showed improvement in ASV but were manually verified as noise (no real change): ReadCSVEngine.time_read_stringcsv('c'), ReadCSVCategorical.time_convert_direct('c'), ReadCSVDatePyarrowEngine — fields too short for SIMD or wrong engine.

No regressions: ReadCSVCParserLowMemory.peakmem_over_2gb_input and ReadCSVConcatDatetimeBadDateValue were manually verified as measurement noise (allocation logic is unchanged).

jbrockmendel · 2026-03-11T21:47:58Z

On today's dev call @jorisvandenbossche mentioned "highway" and "xsimd" as options to either depend on or vendor, rather than writing our own code.

WillAyd · 2026-03-26T12:46:43Z

 #include <stdbool.h>
 #include <stdlib.h>

+#if defined(__ARM_NEON) || defined(__ARM_NEON__)


I gave @Alvaro-Kothe the same feedback on another PR; I'm hesitant to roll our own SIMD detection logic. I'm not an expert in the area, but the fact that there are so many libraries and tools that abstract SIMD makes me think it's deceptively difficult to maintain

Meson has a module for detecting SIMD capabilities built into the tool; I'd prefer we understand the limitations of that before we try to roll our own

jorisvandenbossche · 2026-03-26T14:58:22Z

I'm hesitant to roll our own SIMD detection logic. I'm not an expert in the area, but the fact that there are so many libraries and tools that abstract SIMD makes me think it's deceptively difficult to maintain

I mentioned similar reservations on a dev call a few weeks ago (Brock mentioned that briefly above). If we want to use SIMD, I think we should first investigate using one of those libraries that provide an abstracted interface (as a starter, this also makes sure those functions are tested on all the different architectures, which we (currently) don't do in our CI?).

(I know numpy uses highway (https://numpy.org/neps/nep-0038-SIMD-optimizations.html, https://numpy.org/neps/nep-0054-simd-cpp-highway.html), and Arrow C++ is using xsimd)

Since this has come up in multiple PRs now, @jbrockmendel or @Alvaro-Kothe, could one of you open an issue to discuss the use of SIMD in the pandas codebase?
(I am far from an expert, and I don't know all the pros/cons of different approaches (eg Alvaro mentioned relying on vector extensions from compilers, is that an alternative to one of those abstraction layers?), but I suppose there are questions around maintainability, whether to use runtime dispatch or prebuilt binaries, etc)

jbrockmendel · 2026-03-26T17:08:02Z

@Alvaro-Kothe can you take point on joris's discussion idea? If I do it, it'll end up being a very thin human wrapper over a copy-pasted claude text.

@jorisvandenbossche @WillAyd while I'm on board with bigger-picture discussion about how to do simd, I don't think anyone is against doing it at all. In that case, can we move forward with this (and alvaro's work) and then if we decide to use xsimd/highway/etc we can do follow-up refactors? In both this and Alvaro's work, these are blockers to more perf-improvement work.

mroeschke · 2026-03-26T17:56:25Z

I also lean toward Joris' and Will's opinion about at least seeing/evaluating what all our SIMD options are before merging a PR enabling it.

jbrockmendel · 2026-03-26T18:04:06Z

opened #64884

jorisvandenbossche · 2026-03-27T21:54:29Z

In both this and Alvaro's work, these are blockers to more perf-improvement work.

How is this PR a blocker for more perf-related work on csv? (unless it would be more simd improvements)

can we move forward with this (and alvaro's work) and then if we decide to use xsimd/highway/etc we can do follow-up refactors?

I think that can be part of the general discussion: we can discuss both what is feasible/practical in the short term and what we would want to investigate more for a better, longer-term solution (e.g. I think this PR and Alvaro's are already using a different approach? we might want to stick to one in the short term)

jorisvandenbossche · 2026-03-27T21:55:41Z


+#if defined(__ARM_NEON) || defined(__ARM_NEON__)
+#  include <arm_neon.h>
+#elif defined(__SSE2__) || defined(__SSE2) || defined(_M_X64) ||               \


What is __SSE2 for? (didn't directly find anything about it, only see __SSE2__ online)

claude says it's a non-standard variant some compilers define and is safe to remove. will do so.

jorisvandenbossche · 2026-03-27T21:59:59Z

+#if defined(__ARM_NEON) || defined(__ARM_NEON__)
+#  include <arm_neon.h>
+#elif defined(__SSE2__) || defined(__SSE2) || defined(_M_X64) ||               \
+    defined(_M_IX86)


I think this is not sufficient. Following https://learn.microsoft.com/en-us/cpp/preprocessor/predefined-macros?view=msvc-170, that defines the 32-bit x86 platform, but it is only _M_IX86_FP that defines the available instruction set, and that can theoretically be lower than SSE2.

I have seen defined(_M_IX86_FP) && (_M_IX86_FP >= 2) being used elsewhere

will update

jorisvandenbossche · 2026-03-27T22:01:46Z

+#  include <arm_neon.h>
+#elif defined(__SSE2__) || defined(__SSE2) || defined(_M_X64) ||               \
+    defined(_M_IX86)
+#  include <immintrin.h>


Suggested change

-#  include <immintrin.h>
+#  include <emmintrin.h>

?

Since that gives the header specifically for SSE2, avoiding the risk to accidentally use a more recent instruction?

will update

jorisvandenbossche · 2026-03-27T22:03:53Z

+#elif defined(__SSE2__) || defined(__SSE2) || defined(_M_X64) ||               \
+    defined(_M_IX86)


Can we #define something like PANDAS_HAS_SSE2 when doing this check the first time above, so we don't have to repeat the full conditional every time?

jorisvandenbossche · 2026-03-27T22:13:47Z

+
+#if defined(__ARM_NEON) || defined(__ARM_NEON__)
+        if (!self->delim_whitespace) {
+          size_t remaining = self->datalen - (i + 1);


Just to better understand how the specific simd code is used (didn't look in detail at the surrounding code): this is adding some new logic, but doesn't seem to replace another line of code that would otherwise run now? (or put differently, there is no fallback?)

claude: the preceding PUSH_CHAR(c) already handled the current byte. The SIMD block then scans ahead and bulk-copies any subsequent normal characters. If there are fewer than 16 bytes remaining or skip is 0, it does nothing and the byte-at-a-time loop continues normally. The fallback is the existing code

Can you add some comments to clarify this for future readers? (like the existing comment for the "Scalar bulk scan fallback", but then for the simd code)
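
To illustrate the control flow described in the reply above, here is a minimal sketch of how the bulk path sits on top of the byte-at-a-time loop. The names `sketch_scan_chunks` and `sketch_copy_field` are hypothetical, the scanner is a scalar stand-in for the SIMD function, and the real tokenizer is a state machine with far more cases; only the "push one byte, then try to skip ahead" shape is taken from the thread.

```c
#include <stddef.h>
#include <string.h>

/* Scalar stand-in for the SIMD scanner (hypothetical): counts leading
 * bytes that are not the delimiter, looking only at whole 16-byte chunks. */
static size_t sketch_scan_chunks(const char *buf, size_t len, char delim) {
  size_t i = 0;
  for (; i + 16 <= len; i += 16) {
    for (size_t k = 0; k < 16; ++k) {
      if (buf[i + k] == delim) return i + k;
    }
  }
  return i; /* trailing bytes (< 16) are left to the caller */
}

/* Copy the first delimiter-terminated field of src into dst and return its
 * length. dst must have room for len bytes. */
static size_t sketch_copy_field(const char *src, size_t len, char delim,
                                char *dst) {
  size_t out = 0;
  size_t i = 0;
  while (i < len) {
    char c = src[i];
    if (c == delim) break;
    dst[out++] = c; /* byte-at-a-time path, playing the role of PUSH_CHAR(c) */
    ++i;

    /* Bulk path: only attempted when a full 16-byte chunk is available. */
    size_t remaining = len - i;
    if (remaining >= 16) {
      size_t skip = sketch_scan_chunks(src + i, remaining, delim);
      if (skip > 0) {
        memcpy(dst + out, src + i, skip);
        out += skip;
        i += skip;
      }
    }
    /* If nothing was skipped, the next iteration handles the next byte,
     * so the existing scalar loop is the fallback. */
  }
  return out;
}
```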

jbrockmendel · 2026-03-27T22:49:39Z

How is this PR a blocker for more perf-related work on csv? (unless it would be more simd improvements)

I'm holding back non-SIMD perf PRs to avoid having too many PRs a) touching nearby code and b) in general.

Alvaro-Kothe · 2026-04-01T23:57:45Z

+  return i;
+}
+
+#elif defined(PANDAS_HAS_SSE2)


Is it possible to add a higher instruction set that processes 256/512 bits (like SVE or AVX2)? Mainly to compare against #64582 and #64905.

It is possible, but some profiling (on mac) suggests it's slightly slower.

Thanks for checking. Your CPU probably doesn't support SVE and the runtime check may have reduced performance a little bit.

jorisvandenbossche · 2026-04-07T12:59:50Z

 #include <stdbool.h>
 #include <stdlib.h>

+#if defined(__ARM_NEON) || defined(__ARM_NEON__)


Suggested change

-#if defined(__ARM_NEON) || defined(__ARM_NEON__)
+#if defined(__ARM_NEON) || defined(__ARM_NEON__) || defined(_M_ARM64)

Windows ARM also supports NEON?
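
Putting the detection-related review suggestions together (a single `PANDAS_HAS_*` macro per backend, `<emmintrin.h>` instead of `<immintrin.h>`, `_M_IX86_FP >= 2`, and `_M_ARM64`), the header preamble might look roughly like the sketch below. This is an assumption about the resulting shape, not the merged code; `sketch_simd_backend` is a hypothetical helper added only so the result can be inspected.

```c
/* Detect the SIMD backend once; later code tests only the short macros. */
#if defined(__ARM_NEON) || defined(__ARM_NEON__) || defined(_M_ARM64)
#  include <arm_neon.h>
#  define PANDAS_HAS_NEON 1
#elif defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && (_M_IX86_FP >= 2))
#  include <emmintrin.h> /* SSE2-only header, so newer intrinsics stay out */
#  define PANDAS_HAS_SSE2 1
#endif

/* Hypothetical helper: report which path this translation unit compiled. */
static const char *sketch_simd_backend(void) {
#if defined(PANDAS_HAS_NEON)
  return "neon";
#elif defined(PANDAS_HAS_SSE2)
  return "sse2";
#else
  return "scalar";
#endif
}
```

Every later `#ifdef` site can then test `PANDAS_HAS_NEON` or `PANDAS_HAS_SSE2` instead of repeating the full conditional.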

jorisvandenbossche · 2026-04-07T13:07:53Z

+
+#if defined(__ARM_NEON) || defined(__ARM_NEON__)
+        if (!self->delim_whitespace) {
+          size_t remaining = self->datalen - (i + 1);


Can you add some comments to clarify this for future readers? (like the existing comment for the "Scalar bulk scan fallback", but then for the simd code)

jorisvandenbossche · 2026-04-07T13:10:32Z

+#ifdef PANDAS_HAS_NEON
+        {
+          size_t remaining = self->datalen - (i + 1);
+          if (remaining >= 16) {
+            size_t skip =
+                fast_scan_quoted_neon(buf, remaining, vquote, vescape);
+            if (skip > 0) {
+              memcpy(stream, buf, skip);
+              stream += skip;
+              slen += skip;
+              buf += skip;
+              i += skip;
+            }
+          }
+        }
+#elif defined(PANDAS_HAS_SSE2)
+        {
+          size_t remaining = self->datalen - (i + 1);
+          if (remaining >= 16) {
+            size_t skip = fast_scan_quoted_sse(buf, remaining, vquote, vescape);
+            if (skip > 0) {
+              memcpy(stream, buf, skip);


To avoid this duplicated code for both versions (in the end, only 1 line is different in both cases), we could also move the neon vs sse2 choice into a fast_scan_quoted_simd function that does that dispatch? Then we would only need the above block once here?
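
A rough sketch of what such a unified entry point could look like is given below. It is hypothetical in several respects: the `sketch_` names and macros are invented for illustration, it takes plain `char` arguments rather than the pre-broadcast `vquote`/`vescape` vectors used in the diff, the NEON branch is restricted to AArch64 because it relies on `vmaxvq_u8`, and the exact position inside a matching chunk is found with a scalar loop for brevity.

```c
#include <stddef.h>
#include <stdint.h>

#if (defined(__ARM_NEON) && defined(__aarch64__)) || defined(_M_ARM64)
#  include <arm_neon.h>
#  define SKETCH_HAS_NEON 1
#elif defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && (_M_IX86_FP >= 2))
#  include <emmintrin.h>
#  define SKETCH_HAS_SSE2 1
#endif

/* 1 if any of the 16 bytes at p equals quote or escape. The platform
 * dispatch lives here, so call sites need no #ifdef. */
static int sketch_chunk_has_special(const char *p, char quote, char escape) {
#if defined(SKETCH_HAS_NEON)
  uint8x16_t chunk = vld1q_u8((const uint8_t *)p);
  uint8x16_t hits = vorrq_u8(vceqq_u8(chunk, vdupq_n_u8((uint8_t)quote)),
                             vceqq_u8(chunk, vdupq_n_u8((uint8_t)escape)));
  return vmaxvq_u8(hits) != 0;
#elif defined(SKETCH_HAS_SSE2)
  __m128i chunk = _mm_loadu_si128((const __m128i *)p);
  __m128i hits = _mm_or_si128(_mm_cmpeq_epi8(chunk, _mm_set1_epi8(quote)),
                              _mm_cmpeq_epi8(chunk, _mm_set1_epi8(escape)));
  return _mm_movemask_epi8(hits) != 0;
#else
  for (size_t k = 0; k < 16; ++k) {
    if (p[k] == quote || p[k] == escape) return 1;
  }
  return 0;
#endif
}

/* Offset of the first quote/escape byte in buf[0..len), or len if none. */
static size_t sketch_fast_scan_quoted_simd(const char *buf, size_t len,
                                           char quote, char escape) {
  size_t i = 0;
  /* Skip whole chunks that contain neither special byte. */
  while (i + 16 <= len && !sketch_chunk_has_special(buf + i, quote, escape)) {
    i += 16;
  }
  /* Pin down the exact byte in the matching chunk or the short tail. */
  while (i < len && buf[i] != quote && buf[i] != escape) {
    ++i;
  }
  return i;
}
```

With a wrapper like this, the quoted-field call site would contain a single copy of the `remaining >= 16` / `memcpy` block.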


jorisvandenbossche · 2026-05-11T15:07:42Z

@jbrockmendel on the dev call, I think you mentioned that using an abstraction layer (like xsimd or highway) would be overkill in this case or result in way more complex code (don't recall your exact phrasing, but something like this).
Did you actually generate a draft solution to come to that conclusion? If so, could you give some more details?

jbrockmendel · 2026-05-11T21:26:02Z

Did you actually generate a draft solution to come to that conclusion? If so, could you give some more details?

For the last dev call I discussed this with claude and concluded it was overkill. This morning I asked it to implement an xsimd-based implementation, which I pushed. It isn't working because xsimd isn't wired up as a dependency, but can give us an idea of the tradeoffs involved.

WillAyd · 2026-05-11T21:33:58Z

+  for (; i + kStep <= len; i += kStep) {
+    const auto chunk = batch_u8::load_unaligned(p + i);
+    auto mask = (chunk == v[0]);
+    for (int j = 1; j < N; ++j) {


@Alvaro-Kothe is going through a similar exercise on his SIMD PR and I gave the feedback that this type of looping + bit fiddling is extremely inefficient. nanoarrow has built-in functionality to do this better and it sounds like @Alvaro-Kothe is researching something with xsimd (?) - let's keep tabs on that

I don't think this was a problem with the non-xsimd variant? That felt lower-touch to me and I marginally prefer it.

Use NEON (ARM) and SSE2 (x86-64) intrinsics to accelerate the tokenizer's inner loop. When processing normal characters in the IN_FIELD and IN_QUOTED_FIELD states, scan 16 bytes at a time to find the next special character (delimiter, line terminator, quote, etc.) and bulk-copy the intervening bytes via memcpy instead of one-at-a-time PUSH_CHAR.

Two scan variants:
- fast_scan_{neon,sse}: checks 6 special chars (delimiter, line terminator, CR, quote, escape, comment) for unquoted fields
- fast_scan_quoted_{neon,sse}: checks only 2 chars (quote, escape) for quoted fields, where delimiters are literal

Disabled for delim_whitespace=True (needs isblank() which checks multiple characters). Falls back to scalar for fields shorter than 16 bytes and on architectures without NEON/SSE2.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Replace GCC/Clang-only __builtin_ctz with a portable _pandas_ctz wrapper that uses _BitScanForward on MSVC. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
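
The portability issue behind this commit can be sketched as follows. The function name `sketch_ctz32` is hypothetical (the commit's wrapper is called `_pandas_ctz`, and its actual body is not shown on this page); the sketch only illustrates the usual pattern of choosing between `_BitScanForward` and `__builtin_ctz` by compiler, with a plain loop as a last resort.

```c
#include <stdint.h>
#if defined(_MSC_VER)
#  include <intrin.h>
#endif

/* Count trailing zero bits of a NONZERO 32-bit mask, i.e. the index of the
 * lowest set bit. In a SIMD scan this turns a movemask result into the
 * offset of the first matching byte. */
static inline unsigned sketch_ctz32(uint32_t mask) {
#if defined(_MSC_VER)
  unsigned long idx;
  _BitScanForward(&idx, mask); /* MSVC intrinsic; result undefined for 0 */
  return (unsigned)idx;
#elif defined(__GNUC__) || defined(__clang__)
  return (unsigned)__builtin_ctz(mask); /* also undefined for 0 */
#else
  /* Portable fallback for other compilers. */
  unsigned n = 0;
  while ((mask & 1u) == 0) {
    mask >>= 1;
    ++n;
  }
  return n;
#endif
}
```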

- Define PANDAS_HAS_NEON / PANDAS_HAS_SSE2 macros to avoid repeating the full platform-detection conditional at every #ifdef site
- Replace <immintrin.h> with <emmintrin.h> (SSE2-specific header)
- Fix _M_IX86 check to require _M_IX86_FP >= 2 (actual SSE2 support)
- Drop non-standard __SSE2 (without trailing underscores)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add _M_ARM64 to NEON detection for Windows ARM, deduplicate NEON/SSE2 call sites with unified dispatch macros, and add comments to the SIMD scan blocks. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…__builtin_ctzll Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace the hand-rolled NEON and SSE2 intrinsics in the C tokenizer's bulk-scan path with a small C++ shim that uses xsimd. The new `simd_scan` interface lives in `src/parser/simd_scan.{h,cpp}` and exposes a portable scanner API to `tokenizer.c`.

xsimd selects the best compile-time target and handles NEON-vs-SSE2 mask extraction internally, so the platform `#ifdef` ladders, ctz portability wrappers, and per-call broadcast setup in `tokenizer.c` all go away.

Behavior is unchanged: same 16-byte chunk size, same scalar fallback for trailing bytes, same disable for `delim_whitespace=True`.

Depends on GH#65471, which adds the `xsimd_dep` meson dependency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The tslibs/parsing extension links tokenizer.c but its meson entry was not updated alongside the other three tokenizer.c consumers (lib, pandas_parser, parsers), leaving pd_scanner_create / pd_scanner_destroy / pd_scanner_scan unresolved at import time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jbrockmendel added the Performance (Memory or execution speed performance) and IO CSV (read_csv, to_csv) labels Mar 11, 2026

jbrockmendel requested a review from WillAyd March 25, 2026 21:33

WillAyd requested changes Mar 26, 2026


jbrockmendel force-pushed the perf-read_csv-simd branch from fc4e670 to 8adf399 Compare March 26, 2026 17:08

jbrockmendel mentioned this pull request Mar 26, 2026

Discussion: SIMD strategy for pandas C/C++ code #64884

Open

jorisvandenbossche reviewed Mar 27, 2026


jorisvandenbossche added the Build (Library building on various platforms) label Mar 27, 2026

jbrockmendel force-pushed the perf-read_csv-simd branch from 8ccd75d to 9bfce01 Compare April 1, 2026 20:02

Alvaro-Kothe reviewed Apr 1, 2026


jbrockmendel mentioned this pull request Apr 2, 2026

PERF: Optimize _categorical_convert CSV parser when categories are known ahead of time #17743

Closed

jbrockmendel force-pushed the perf-read_csv-simd branch 2 times, most recently from b8eb550 to cf43e61 Compare April 6, 2026 16:52

jorisvandenbossche reviewed Apr 7, 2026


Alvaro-Kothe and others added 7 commits May 7, 2026 16:03

build(simd): add xsimd dependency and simd verification

300e26c

Update pandas/_libs/meson.build

1022e10

Co-authored-by: William Ayd <william.ayd@icloud.com>

ci: remove no simd job

ed588ad

refactor: create arch specific loop

84a4982

refactor: move configuration set inside main verification

aaf025b

fix: bump xsimd to 14.2 for MSVC ARM support

0fefde7

build: remove simd option

effdd43

WillAyd reviewed May 11, 2026


jbrockmendel and others added 8 commits May 11, 2026 17:17

BLD: Fix MSVC build for SIMD tokenizer by replacing __builtin_ctz

44c3f86


BLD: Fix -Werror build failure from tautological char < 256 comparison

b097535

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

CLN: Address review comments on SIMD tokenizer

e384990


BLD: Fix MSVC build for SIMD tokenizer on Windows ARM64 by replacing …

ee47d11

…__builtin_ctzll Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

jbrockmendel force-pushed the perf-read_csv-simd branch from a7beb5b to 615a2b4 Compare May 12, 2026 00:19
