
PERF: Use SIMD for read_csv C tokenizer #64515

Open

jbrockmendel wants to merge 15 commits into pandas-dev:main from jbrockmendel:perf-read_csv-simd

Conversation

@jbrockmendel
Member

Summary

Use NEON (ARM) and SSE2 (x86-64) SIMD intrinsics to accelerate the C tokenizer's inner loop. When the state machine is in IN_FIELD or IN_QUOTED_FIELD processing a normal character, scan 16 bytes at a time to find the next special character and memcpy the intervening bytes in bulk instead of one-at-a-time PUSH_CHAR.

  • fast_scan_{neon,sse}: checks 6 special chars for unquoted fields
  • fast_scan_quoted_{neon,sse}: checks only quote + escape for quoted fields (fewer comparisons)
  • Disabled for delim_whitespace=True; scalar fallback for fields <16 bytes and non-NEON/SSE2 architectures

Performance

Manually verified benchmarks (median of 10, separate processes per branch):

# --- Long quoted Unicode strings (MemMapUTF8) ---
# setup
import pandas as pd, numpy as np, tempfile, os
from pandas import DataFrame, concat, date_range, read_csv
lines = []
for lnum in range(ord(" "), ord("\U00010080"), 128):
    line = "".join([chr(c) for c in range(lnum, lnum + 0x80)]) + "\n"
    try:
        line.encode("utf-8")
    except UnicodeEncodeError:
        continue
    lines.append(line)
df = concat([DataFrame(lines) for _ in range(100)], ignore_index=True)
fname = os.path.join(tempfile.gettempdir(), "bench_utf8.csv")
df.to_csv(fname, index=False, header=False, encoding="utf-8")

%timeit read_csv(fname, header=None, memory_map=True, encoding="utf-8", engine="c")
30.9 ms ± 1.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  # <- PR
56.7 ms ± 2.82 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  # <- main
# --- Numeric data with pipe separator (Thousands) ---
# setup
np.random.seed(42)
data = np.random.randn(10000, 8) * np.random.randint(100, 10000, (10000, 8))
df = DataFrame(data)
fname = os.path.join(tempfile.gettempdir(), "bench_thou.csv")
df.to_csv(fname, sep="|")

%timeit read_csv(fname, sep="|", thousands=None, engine="c")
4.77 ms ± 198 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <- PR
5.95 ms ± 265 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <- main
# --- Wide string fields ---
# setup
np.random.seed(42)
data = np.random.choice(
    ["abcdefghijklmnopqrstuvwxyz0123456789" * 3,
     "the quick brown fox jumps over the lazy dog " * 2,
     "Lorem ipsum dolor sit amet consectetur adipiscing" * 2],
    size=(200_000, 20),
)
df = DataFrame(data, columns=[f"col_{i}" for i in range(20)])
fname = os.path.join(tempfile.gettempdir(), "bench_wide.csv")
df.to_csv(fname, index=False)

%timeit read_csv(fname, engine="c")
1.03 s ± 8.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # <- PR
1.61 s ± 23.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # <- main

Benchmarks that showed improvement in ASV but were manually verified as noise (no real change): ReadCSVEngine.time_read_stringcsv('c'), ReadCSVCategorical.time_convert_direct('c'), ReadCSVDatePyarrowEngine — fields too short for SIMD or wrong engine.

No regressions: ReadCSVCParserLowMemory.peakmem_over_2gb_input and ReadCSVConcatDatetimeBadDateValue were manually verified as measurement noise (allocation logic is unchanged).

@jbrockmendel jbrockmendel added the Performance (Memory or execution speed performance) and IO CSV (read_csv, to_csv) labels Mar 11, 2026
@jbrockmendel
Member Author

On today's dev call @jorisvandenbossche mentioned "highway" and "xsimd" as options to either depend on or vendor, rather than writing our own code.

@jbrockmendel jbrockmendel requested a review from WillAyd March 25, 2026 21:33
Comment thread pandas/_libs/src/parser/tokenizer.c Outdated
#include <stdbool.h>
#include <stdlib.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
Member

I gave @Alvaro-Kothe the same feedback on another PR; I'm hesitant to roll our own SIMD detection logic. I'm not an expert in the area, but the fact that there are so many libraries and tools that abstract SIMD makes me think it's deceptively difficult to maintain

Meson has a module for detecting SIMD capabilities built into the tool; I'd prefer we understand the limitations of that before we try to roll our own

@jorisvandenbossche
Member

jorisvandenbossche commented Mar 26, 2026

I'm hesitant to roll our own SIMD detection logic. I'm not an expert in the area, but the fact that there are so many libraries and tools that abstract SIMD makes me think it's deceptively difficult to maintain

I mentioned similar reservations on a dev call a few weeks ago (Brock mentioned that briefly above). If we want to use SIMD, I think we should first investigate using one of those libraries that provide an abstracted interface (as a bonus, this would also ensure those functions are tested on all the different architectures, which we (currently) don't do on our CI?).

(I know numpy uses highway (https://numpy.org/neps/nep-0038-SIMD-optimizations.html, https://numpy.org/neps/nep-0054-simd-cpp-highway.html), and Arrow C++ is using xsimd)

Since this has come up in multiple PRs now, @jbrockmendel or @Alvaro-Kothe, could one of you open an issue to discuss the use of SIMD in the pandas codebase?
(I am far from an expert, and I don't know all the pros/cons of different approaches (eg Alvaro mentioned relying on vector extensions from compilers, is that an alternative to one of those abstraction layers?), but I suppose there are questions around maintainability, whether to use runtime dispatch or prebuilt binaries, etc)

@jbrockmendel
Member Author

@Alvaro-Kothe can you take point on Joris's discussion idea? If I do it, it'll end up being a very thin human wrapper over copy-pasted Claude text.

@jorisvandenbossche @WillAyd while I'm on board with a bigger-picture discussion about how to do SIMD, I don't think anyone is against doing it at all. In that case, can we move forward with this (and Alvaro's work) and then, if we decide to use xsimd/highway/etc, do follow-up refactors? Both this and Alvaro's work are blockers to more perf-improvement work.

@mroeschke
Member

I also lean toward Joris' and Will's opinion: we should at least see/evaluate what all our SIMD options are before merging a PR that enables one of them.

@jbrockmendel
Member Author

opened #64884

@jorisvandenbossche
Member

In both this and Alvaro's work, these are blockers to more perf-improvement work.

How is this PR a blocker for more perf-related work on csv? (unless it would be more simd improvements)

can we move forward with this (and alvaro's work) and then if we decide to use xsimd/highway/etc we can do follow-up refactors?

I think that can be part of the general discussion: we can discuss both what is feasible/practical in the short term and what we would want to investigate further for a better, longer-term solution (e.g. I think this PR and Alvaro's are already using different approaches? We might want to stick to one in the short term)

Comment thread pandas/_libs/src/parser/tokenizer.c Outdated

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
# include <arm_neon.h>
#elif defined(__SSE2__) || defined(__SSE2) || defined(_M_X64) || \
Member

What is __SSE2 for? (didn't directly find anything about it, only see __SSE2__ online)

Member Author

Claude says it's a non-standard variant that some compilers define, and that it's safe to remove. Will do so.

Comment thread pandas/_libs/src/parser/tokenizer.c Outdated
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
# include <arm_neon.h>
#elif defined(__SSE2__) || defined(__SSE2) || defined(_M_X64) || \
defined(_M_IX86)
Member

I think this is not sufficient. Following https://learn.microsoft.com/en-us/cpp/preprocessor/predefined-macros?view=msvc-170, that defines the 32-bit x86 platform, but it is only _M_IX86_FP that defines the available instruction set, and that can theoretically be lower than SSE2.

I have seen defined(_M_IX86_FP) && (_M_IX86_FP >= 2) being used elsewhere

Member Author

will update
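For reference, the consolidated detection block that the review comments in this thread (and the _M_ARM64 and <emmintrin.h> suggestions elsewhere in the review) converge on might look roughly like this. The PANDAS_HAS_* macro names come from a later review suggestion; the exact layout is illustrative, not the PR's final code:

```c
/* Hypothetical platform-detection block (sketch, not the merged code).
 * Note: _M_IX86 alone only identifies 32-bit x86; _M_IX86_FP >= 2 is
 * what actually guarantees the compiler may emit SSE2. */
#if defined(__ARM_NEON) || defined(__ARM_NEON__) || defined(_M_ARM64)
#define PANDAS_HAS_NEON 1
#include <arm_neon.h>
#elif defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#define PANDAS_HAS_SSE2 1
#include <emmintrin.h> /* SSE2-only header, per review suggestion */
#endif
```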

Comment thread pandas/_libs/src/parser/tokenizer.c Outdated
# include <arm_neon.h>
#elif defined(__SSE2__) || defined(__SSE2) || defined(_M_X64) || \
defined(_M_IX86)
# include <immintrin.h>
Member

Suggested change:
-# include <immintrin.h>
+# include <emmintrin.h>

?

Since that gives the header specifically for SSE2, avoiding the risk to accidentally use a more recent instruction?

Member Author

will update

Comment thread pandas/_libs/src/parser/tokenizer.c Outdated
Comment on lines +745 to +746
#elif defined(__SSE2__) || defined(__SSE2) || defined(_M_X64) || \
defined(_M_IX86)
Member

Can we #define something like PANDAS_HAS_SSE2 when doing this check the first time above, so we don't have to repeat the full conditional every time?

Member Author

sure

Comment on lines +1094 to +1097

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
if (!self->delim_whitespace) {
size_t remaining = self->datalen - (i + 1);
Member

Just to better understand how the specific simd code is used (didn't look in detail at the surrounding code): this is adding some new logic, but doesn't seem to replace another line of code that would otherwise run now? (or put differently, there is no fallback?)

Member Author

Claude: the preceding PUSH_CHAR(c) already handled the current byte. The SIMD block then scans ahead and bulk-copies any subsequent normal characters. If there are fewer than 16 bytes remaining or skip is 0, it does nothing and the byte-at-a-time loop continues normally. The fallback is the existing code.

Member

Can you add some comments to clarify this for future readers? (like the existing comment for the "Scalar bulk scan fallback", but then for the simd code)

@jorisvandenbossche jorisvandenbossche added the Build (Library building on various platforms) label Mar 27, 2026
@jbrockmendel
Member Author

How is this PR a blocker for more perf-related work on csv? (unless it would be more simd improvements)

I'm holding back non-SIMD perf PRs to avoid having too many PRs a) touching nearby code and b) in general.

Comment thread pandas/_libs/src/parser/tokenizer.c Outdated
return i;
}

#elif defined(PANDAS_HAS_SSE2)
Member

Is it possible to add a higher instruction set that processes 256/512 bits (like SVE or AVX2)? Mainly to compare against #64582 and #64905.

Member Author

It is possible, but some profiling (on Mac) suggests it's slightly slower.

Member

Thanks for checking. Your CPU probably doesn't support SVE and the runtime check may have reduced performance a little bit.

Comment thread pandas/_libs/src/parser/tokenizer.c Outdated
#include <stdbool.h>
#include <stdlib.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
Member

Suggested change:
-#if defined(__ARM_NEON) || defined(__ARM_NEON__)
+#if defined(__ARM_NEON) || defined(__ARM_NEON__) || defined(_M_ARM64)

Windows ARM also supports NEON?

Comment thread pandas/_libs/src/parser/tokenizer.c Outdated
Comment on lines +1129 to +1150
#ifdef PANDAS_HAS_NEON
{
size_t remaining = self->datalen - (i + 1);
if (remaining >= 16) {
size_t skip =
fast_scan_quoted_neon(buf, remaining, vquote, vescape);
if (skip > 0) {
memcpy(stream, buf, skip);
stream += skip;
slen += skip;
buf += skip;
i += skip;
}
}
}
#elif defined(PANDAS_HAS_SSE2)
{
size_t remaining = self->datalen - (i + 1);
if (remaining >= 16) {
size_t skip = fast_scan_quoted_sse(buf, remaining, vquote, vescape);
if (skip > 0) {
memcpy(stream, buf, skip);
Member

To avoid duplicating this code for both versions (in the end, only one line differs between the two), we could move the NEON-vs-SSE2 choice into a fast_scan_quoted_simd function that does the dispatch? Then only a single copy of the block above would be needed here.

@jorisvandenbossche
Member

@jbrockmendel on the dev call, I think you mentioned that using an abstraction layer (like xsimd or highway) would be overkill in this case or result in way more complex code (don't recall your exact phrasing, but something like this).
Did you actually generate a draft solution to come to that conclusion? If so, could you give some more details?

@jbrockmendel
Member Author

Did you actually generate a draft solution to come to that conclusion? If so, could you give some more details?

Before the last dev call I discussed this with Claude and concluded it was overkill. This morning I asked it to implement an xsimd-based version, which I pushed. It isn't working because xsimd isn't wired up as a dependency, but it can give us an idea of the tradeoffs involved.

for (; i + kStep <= len; i += kStep) {
const auto chunk = batch_u8::load_unaligned(p + i);
auto mask = (chunk == v[0]);
for (int j = 1; j < N; ++j) {
Member

@Alvaro-Kothe is going through a similar exercise on his SIMD PR and I gave the feedback that this type of looping + bit fiddling is extremely inefficient. nanoarrow has built-in functionality to do this better and it sounds like @Alvaro-Kothe is researching something with xsimd (?) - let's keep tabs on that

Member Author

I don't think this was a problem with the non-xsimd variant? That felt lower-touch to me and I marginally prefer it.

jbrockmendel and others added 8 commits May 11, 2026 17:17
Use NEON (ARM) and SSE2 (x86-64) intrinsics to accelerate the
tokenizer's inner loop. When processing normal characters in the
IN_FIELD and IN_QUOTED_FIELD states, scan 16 bytes at a time to
find the next special character (delimiter, line terminator, quote,
etc.) and bulk-copy the intervening bytes via memcpy instead of
one-at-a-time PUSH_CHAR.

Two scan variants:
- fast_scan_{neon,sse}: checks 6 special chars (delimiter, line
  terminator, CR, quote, escape, comment) for unquoted fields
- fast_scan_quoted_{neon,sse}: checks only 2 chars (quote, escape)
  for quoted fields, where delimiters are literal

Disabled for delim_whitespace=True (needs isblank() which checks
multiple characters). Falls back to scalar for fields shorter than
16 bytes and on architectures without NEON/SSE2.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace GCC/Clang-only __builtin_ctz with a portable _pandas_ctz
wrapper that uses _BitScanForward on MSVC.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Define PANDAS_HAS_NEON / PANDAS_HAS_SSE2 macros to avoid repeating
  the full platform-detection conditional at every #ifdef site
- Replace <immintrin.h> with <emmintrin.h> (SSE2-specific header)
- Fix _M_IX86 check to require _M_IX86_FP >= 2 (actual SSE2 support)
- Drop non-standard __SSE2 (without trailing underscores)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add _M_ARM64 to NEON detection for Windows ARM, deduplicate
NEON/SSE2 call sites with unified dispatch macros, and add
comments to the SIMD scan blocks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…__builtin_ctzll

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the hand-rolled NEON and SSE2 intrinsics in the C tokenizer's
bulk-scan path with a small C++ shim that uses xsimd. The new
`simd_scan` interface lives in `src/parser/simd_scan.{h,cpp}` and
exposes a portable scanner API to `tokenizer.c`. xsimd selects the
best compile-time target and handles NEON-vs-SSE2 mask extraction
internally, so the platform `#ifdef` ladders, ctz portability
wrappers, and per-call broadcast setup in `tokenizer.c` all go away.

Behavior is unchanged: same 16-byte chunk size, same scalar fallback
for trailing bytes, same disable for `delim_whitespace=True`.

Depends on GH#65471, which adds the `xsimd_dep` meson dependency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
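Based on the names in this and the following commit message, the simd_scan.h interface presumably looks something like the sketch below. Only the function names (pd_scanner_create / pd_scanner_scan / pd_scanner_destroy) come from the commit text; the signatures and semantics are illustrative guesses:

```c
/* Hypothetical shape of src/parser/simd_scan.h (signatures are guesses). */
#include <stddef.h>

#ifdef __cplusplus
extern "C" {
#endif

/* Opaque scanner state; the implementation lives in simd_scan.cpp. */
typedef struct pd_scanner pd_scanner;

/* Create a scanner for a set of special characters (e.g. delimiter,
 * line terminator, quote, escape); returns NULL on failure. */
pd_scanner *pd_scanner_create(const char *specials, int nspecials);

/* Return the number of leading bytes of buf (at most len) that match
 * none of the scanner's special characters. */
size_t pd_scanner_scan(const pd_scanner *scanner, const char *buf,
                       size_t len);

void pd_scanner_destroy(pd_scanner *scanner);

#ifdef __cplusplus
}
#endif
```

An extern "C" interface like this is what lets the C++/xsimd implementation be called from tokenizer.c, and it is why every extension that links tokenizer.c must also link the new object file (the bug fixed in the next commit).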
The tslibs/parsing extension links tokenizer.c but its meson entry was
not updated alongside the other three tokenizer.c consumers (lib,
pandas_parser, parsers), leaving pd_scanner_create / pd_scanner_destroy
/ pd_scanner_scan unresolved at import time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>