Use tweaked PSWF implementation by mreineck · Pull Request #854 · flatironinstitute/finufft

mreineck · 2026-05-04T12:32:44Z

This PR introduces an optimized implementation of the PSWF code.

Advantages:

Significantly faster function evaluation (initialization has not become much faster, though)
Lower memory consumption for a single class object
More compact code

Open issues:

Do we want to throw an error if the iteration in the initialization function does not converge? If yes, do we introduce a new error code? (The existing implementation silently ignores the error case, and to my knowledge, it never triggers.)
I did not make any changes to the thread-safe caching section. That may need some additional tweaking at another time.
At the moment, evaluation is only supported for one abscissa at a time. Performance can be improved by evaluating a whole vector of abscissas in a single call, but implementation depends on the choice of SIMD library etc., so I left this open for now.
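
To make the first open issue concrete, here is a minimal sketch of an iteration that reports non-convergence instead of silently returning. It uses a Newton square root as a stand-in; `newton_sqrt`, its iteration limit, and its tolerance are assumptions for illustration and are not taken from the PSWF code.

```cpp
#include <cmath>
#include <stdexcept>

// Hypothetical stand-in for an iterative setup step: Newton iteration for sqrt(a).
// Throws instead of silently returning an unconverged value.
inline double newton_sqrt(double a, int max_iter = 50, double tol = 1e-15) {
  if (a < 0.0) throw std::domain_error("newton_sqrt: negative input");
  if (a == 0.0) return 0.0;
  double x = a > 1.0 ? a : 1.0; // starting guess >= sqrt(a)
  for (int it = 0; it < max_iter; ++it) {
    const double xn = 0.5 * (x + a / x);
    if (std::abs(xn - x) <= tol * xn) return xn; // converged
    x = xn;
  }
  // Iteration budget exhausted: make the failure visible to the caller.
  throw std::runtime_error("newton_sqrt: iteration did not converge");
}
```

Whether the real setup code should throw a library-specific exception or return a status is exactly the design question raised above; the sketch only shows that the check itself is cheap.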

DiamonDinoia · 2026-05-04T16:06:31Z

I leave the decision to @ahbarnett and @lu1and10.

My suggestion would be to make all the related changes in this PR in one go instead of in a follow-up PR. As I want to use this code on the GPU next, I'd rather make all the changes now than make GPU changes against a moving target.

1. It is better to have an error. It never triggers but technically finufft can be compiled for half precision with minimal changes. It could trigger in that case. Adding an error costs nothing and might save debugging time in the future.
2. I would remove the cache entirely.
3. I would do the SIMD eval.

I would measure the performance of 2 and 3 and see where we end up.

@mreineck I do not want to task you with something that might be reverted, so I would wait for @lu1and10 and @ahbarnett to confirm what they like.

I think 2/3 can be a follow-up PR as @lu1and10 prefers, but to me we can just do three commits here in this PR so the history is meaningful.

lu1and10 · 2026-05-04T16:24:20Z

Thanks for the comments. I like this PR overall and would be in favor of moving it forward.

For (1), I agree that having an error path is better, even if it should basically never trigger.

For (2), I am OK with either keeping the current cache for now or revisiting it if benchmarks show it matters.

For (3), I would prefer to leave SIMD/batched evaluation for a separate PR. A follow-up PR would also let us benchmark the batched path properly and decide on the SIMD abstraction without expanding the scope here.

So my preference is: merge this PR once the error handling/details are settled, and track cache/SIMD batching separately if needed.

mreineck · 2026-05-04T18:17:11Z

Thanks for the comments! I'll look at the error handling tomorrow or perhaps later today!

Removing the cache will require structural changes in the rest of the code in order to keep efficiency, but perhaps they are minor (I haven't looked closely yet.)

mreineck · 2026-05-04T18:35:08Z

Just a small word of caution:

> It never triggers but technically finufft can be compiled for half precision with minimal changes. It could trigger in that case.

Using the PSWF setup code (not necessarily the function evaluation) with anything but double precision will require changes I don't feel competent to do, and the performance gains will probably not be worth the effort.

mreineck · 2026-05-04T19:34:27Z

Here is a version that avoids the caches and only builds one PSWF0 object for every transform.
The implementation feels somewhat indirect, since instead of calling kernel_definition repeatedly, we now have a function kernel_definition_lambda, which is called once and returns a lambda that performs the actual kernel evaluation.
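
The "call once, get back an evaluator" pattern described here can be sketched as follows. This is an illustration only: `make_kernel`, the `kind` values, and the two kernel shapes are hypothetical and are not the PR's `kernel_definition_lambda` or FINUFFT's actual formulas.

```cpp
#include <cmath>
#include <functional>
#include <stdexcept>

// Hypothetical factory: all per-kernel setup happens once, here.
// The returned closure only does the cheap per-point evaluation.
inline std::function<double(double)> make_kernel(int kind, double beta) {
  if (kind == 0) { // illustrative "exp-sqrt" shape
    const double inv_expbeta = std::exp(-beta); // precomputed once
    return [beta, inv_expbeta](double z) {
      if (std::abs(z) > 1.0) return 0.0; // support is [-1,1]
      return std::exp(beta * std::sqrt(1.0 - z * z)) * inv_expbeta;
    };
  }
  if (kind == 1) { // illustrative cosh-type shape
    const double coshbeta = std::cosh(beta); // precomputed once
    return [beta, coshbeta](double z) {
      if (std::abs(z) > 1.0) return 0.0;
      return std::cosh(beta * std::sqrt(1.0 - z * z)) / coshbeta;
    };
  }
  throw std::invalid_argument("make_kernel: unknown kind");
}
```

A caller would write `auto k = make_kernel(0, beta);` once per transform and then call `k(z)` for each abscissa, so any expensive state (such as a prolate object) is built exactly once and lives inside the closure.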

mreineck · 2026-05-05T05:47:08Z

Interesting ... gcc-13 doesn't seem to be available any more on some test platforms, causing CI runs to fail.
Perhaps this is related to gcc 16 having been released recently, and gcc 13 therefore being phased out(?)

ahbarnett · 2026-05-05T20:12:59Z

Thanks all,
As we just discussed, I would also prefer SIMD to be a separate PR, just for simplicity, and since speed is not the issue here. Remember type-3 would use the fast kernel Horner eval anyway.
So, Marco, if you want to tweak this, just confine it to non-SIMD for now.

Re half-prec:
The prolate construction (pswf.cpp) should always be double prec. kernel_definition() can cast that to single or whatever. There won't be any speed gain (this is only plan time, and once it's sub-100us, we're good).
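
A minimal sketch of that split, assuming a double-precision evaluator is already available as a callable. The helper name `eval_in_precision` is hypothetical and does not appear in the PR.

```cpp
#include <cmath>
#include <functional>

// Hypothetical wrapper: the kernel itself is always evaluated in double;
// only the result is narrowed to the caller's working precision T.
template <typename T>
T eval_in_precision(const std::function<double(double)> &kernel, T z) {
  return static_cast<T>(kernel(static_cast<double>(z)));
}
```

With this shape, the setup code never needs a single- or half-precision variant; only the final cast depends on `T`.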

Otherwise I'll bring it in...

DiamonDinoia · 2026-05-05T20:20:24Z

> Thanks all,
> As we just discussed, I would also prefer SIMD to be a separate PR, just for simplicity, and since speed is not the issue here. Remember type-3 would use the fast kernel Horner eval anyway.
> So, Marco, if you want to tweak this, just confine it to non-SIMD for now.

Okay.

> Re half-prec:
> The prolate construction (pswf.cpp) should always be double prec. kernel_definition() can cast that to single or whatever. There won't be any speed gain (this is only plan time, and once it's sub-100us, we're good).

Right, you are correct.

> Otherwise I'll bring it in...

I have no objections.

DiamonDinoia · 2026-05-05T20:49:41Z

> Interesting ... gcc-13 doesn't seem to be available any more on some test platforms, causing CI runs to fail.
> Perhaps this is related to gcc 16 having been released recently, and gcc 13 therefore being phased out(?)

I took the chance to update the compilers in CI. It should work now.

github-actions · 2026-05-12T16:44:11Z

FFT backend: DUCC

Numbers are advisory: GitHub-hosted runners have variable hardware. Treat <1.10× as noise.

CPU and compiler configuration

CPU name: AMD EPYC 7763 64-Core Processor.

Arch: X86_64.

Core count: 2.

ISA extensions: 3dnowext, 3dnowprefetch, abm, adx, aes, aperfmperf, apic, arat, avx, avx2, bmi1, bmi2, clflush, clflushopt, clwb, clzero, cmov, cmp_legacy, constant_tsc, cpuid, cr8_legacy, cx16, cx8, de, decodeassists, erms, extd_apicid, f16c, flushbyasid, fma, fpu, fsgsbase, fsrm, fxsr, fxsr_opt, ht, hypervisor, invpcid, lahf_lm, lm, mca, mce, misalignsse, mmx, mmxext, movbe, msr, mtrr, nonstop_tsc, nopl, npt, nrip_save, nx, osvw, osxsave, pae, pat, pausefilter, pcid, pclmulqdq, pdpe1gb, pfthreshold, pge, pni, popcnt, pse, pse36, rdpid, rdpru, rdrand, rdrnd, rdseed, rdtscp, rep_good, sep, sha, sha_ni, smap, smep, sse, sse2, sse4_1, sse4_2, sse4a, ssse3, svm, syscall, topoext, tsc, tsc_known_freq, tsc_reliable, tsc_scale, umip, user_shstk, v_vmsave_vmload, vaes, vmcb_clean, vme, vmmcall, vpclmulqdq, xgetbv1, xsave, xsavec, xsaveerptr, xsaveopt, xsaves.

Compiler version: c++ (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0.

Compiler flags: -march=native.

perftest commands

taskset -c 0 /home/runner/work/finufft/finufft/builds/master/perftest/perftest --arg --prec=f --N1=10000.0 --N2=1 --N3=1 --ntransf=1 --threads=1 --M=10000000.0 --tol=0.0001 --n_runs=15 --sort=1 --upsampfact=0 --kerevalmethod=1 --debug=0 --bandwidth=1.0 --type=1
taskset -c 0 /home/runner/work/finufft/finufft/builds/master/perftest/perftest --arg --prec=f --N1=10000.0 --N2=1 --N3=1 --ntransf=1 --threads=1 --M=10000000.0 --tol=0.0001 --n_runs=15 --sort=1 --upsampfact=0 --kerevalmethod=1 --debug=0 --bandwidth=1.0 --type=2
taskset -c 0 /home/runner/work/finufft/finufft/builds/master/perftest/perftest --arg --prec=f --N1=10000.0 --N2=1 --N3=1 --ntransf=1 --threads=1 --M=10000000.0 --tol=0.0001 --n_runs=15 --sort=1 --upsampfact=0 --kerevalmethod=1 --debug=0 --bandwidth=1.0 --type=3
taskset -c 0 /home/runner/work/finufft/finufft/builds/master/perftest/perftest --arg --prec=d --N1=10000.0 --N2=1 --N3=1 --ntransf=1 --threads=1 --M=10000000.0 --tol=1e-09 --n_runs=15 --sort=1 --upsampfact=0 --kerevalmethod=1 --debug=0 --bandwidth=1.0 --type=1
taskset -c 0 /home/runner/work/finufft/finufft/builds/master/perftest/perftest --arg --prec=d --N1=10000.0 --N2=1 --N3=1 --ntransf=1 --threads=1 --M=10000000.0 --tol=1e-09 --n_runs=15 --sort=1 --upsampfact=0 --kerevalmethod=1 --debug=0 --bandwidth=1.0 --type=2
taskset -c 0 /home/runner/work/finufft/finufft/builds/master/perftest/perftest --arg --prec=d --N1=10000.0 --N2=1 --N3=1 --ntransf=1 --threads=1 --M=10000000.0 --tol=1e-09 --n_runs=15 --sort=1 --upsampfact=0 --kerevalmethod=1 --debug=0 --bandwidth=1.0 --type=3
taskset -c 0 /home/runner/work/finufft/finufft/builds/master/perftest/perftest --arg --prec=f --N1=320 --N2=320 --N3=1 --ntransf=1 --threads=1 --M=10000000.0 --tol=1e-05 --n_runs=15 --sort=1 --upsampfact=0 --kerevalmethod=1 --debug=0 --bandwidth=1.0 --type=1
taskset -c 0 /home/runner/work/finufft/finufft/builds/master/perftest/perftest --arg --prec=f --N1=320 --N2=320 --N3=1 --ntransf=1 --threads=1 --M=10000000.0 --tol=1e-05 --n_runs=15 --sort=1 --upsampfact=0 --kerevalmethod=1 --debug=0 --bandwidth=1.0 --type=2
taskset -c 0 /home/runner/work/finufft/finufft/builds/master/perftest/perftest --arg --prec=f --N1=320 --N2=320 --N3=1 --ntransf=1 --threads=1 --M=10000000.0 --tol=1e-05 --n_runs=15 --sort=1 --upsampfact=0 --kerevalmethod=1 --debug=0 --bandwidth=1.0 --type=3
taskset -c 0 /home/runner/work/finufft/finufft/builds/master/perftest/perftest --arg --prec=d --N1=320 --N2=320 --N3=1 --ntransf=1 --threads=1 --M=10000000.0 --tol=1e-09 --n_runs=15 --sort=1 --upsampfact=0 --kerevalmethod=1 --debug=0 --bandwidth=1.0 --type=1
taskset -c 0 /home/runner/work/finufft/finufft/builds/master/perftest/perftest --arg --prec=d --N1=320 --N2=320 --N3=1 --ntransf=1 --threads=1 --M=10000000.0 --tol=1e-09 --n_runs=15 --sort=1 --upsampfact=0 --kerevalmethod=1 --debug=0 --bandwidth=1.0 --type=2
taskset -c 0 /home/runner/work/finufft/finufft/builds/master/perftest/perftest --arg --prec=d --N1=320 --N2=320 --N3=1 --ntransf=1 --threads=1 --M=10000000.0 --tol=1e-09 --n_runs=15 --sort=1 --upsampfact=0 --kerevalmethod=1 --debug=0 --bandwidth=1.0 --type=3
/home/runner/work/finufft/finufft/builds/master/perftest/perftest --prec=f --N1=320 --N2=320 --N3=1 --ntransf=1 --threads=0 --M=10000000.0 --tol=1e-05 --n_runs=15 --sort=1 --upsampfact=0 --kerevalmethod=1 --debug=0 --bandwidth=1.0 --type=1
/home/runner/work/finufft/finufft/builds/master/perftest/perftest --prec=f --N1=320 --N2=320 --N3=1 --ntransf=1 --threads=0 --M=10000000.0 --tol=1e-05 --n_runs=15 --sort=1 --upsampfact=0 --kerevalmethod=1 --debug=0 --bandwidth=1.0 --type=2
/home/runner/work/finufft/finufft/builds/master/perftest/perftest --prec=f --N1=320 --N2=320 --N3=1 --ntransf=1 --threads=0 --M=10000000.0 --tol=1e-05 --n_runs=15 --sort=1 --upsampfact=0 --kerevalmethod=1 --debug=0 --bandwidth=1.0 --type=3
/home/runner/work/finufft/finufft/builds/master/perftest/perftest --prec=d --N1=192 --N2=192 --N3=128 --ntransf=1 --threads=0 --M=10000000.0 --tol=1e-07 --n_runs=15 --sort=1 --upsampfact=0 --kerevalmethod=1 --debug=0 --bandwidth=1.0 --type=1
/home/runner/work/finufft/finufft/builds/master/perftest/perftest --prec=d --N1=192 --N2=192 --N3=128 --ntransf=1 --threads=0 --M=10000000.0 --tol=1e-07 --n_runs=15 --sort=1 --upsampfact=0 --kerevalmethod=1 --debug=0 --bandwidth=1.0 --type=2
/home/runner/work/finufft/finufft/builds/master/perftest/perftest --prec=d --N1=192 --N2=192 --N3=128 --ntransf=1 --threads=0 --M=10000000.0 --tol=1e-07 --n_runs=15 --sort=1 --upsampfact=0 --kerevalmethod=1 --debug=0 --bandwidth=1.0 --type=3

DiamonDinoia · 2026-05-15T17:44:33Z

Hi @mreineck,

src/common/kernel.cpp, kf=4: crashes. The lambda passes beta·√(1−z²) − 1.0 to cyl_bessel_i; the argument goes negative near |z|=1, which raises std::domain_error. Try tolsweep 4.
src/common/kernel.cpp, kf=6: wrong values. The same error (the −1.0 placed inside the argument) occurs in the cosh lambda. tolsweep 6 shows worstfac up to 4.4 (~2 digits lost).

Fix:

// kf==4
return (common::cyl_bessel_i(0, beta * std::sqrt(1.0 - z*z)) - 1.0) / besselbeta;
// kf==6
return (std::cosh(beta * std::sqrt(1.0 - z*z)) - 1.0) / coshbeta;

prolql1 throws a bare int, while every other path throws finufft::exception. It will surface as FINUFFT_ERR_UNKNOWN_EXCEPTION instead of FINUFFT_ERR_PSWF_SETUP.
Lambda captures PSWF0 by value (copies two vectors); move it in.
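
Why a bare `int` ends up as an "unknown" error can be sketched with a generic exception-to-code wrapper. All names below (`demo_exception`, `DEMO_ERR_SETUP`, `guarded`, and so on) are hypothetical stand-ins and not FINUFFT's actual definitions.

```cpp
#include <exception>

// All names below are hypothetical stand-ins, not FINUFFT's actual definitions.
enum demo_error { DEMO_OK = 0, DEMO_ERR_SETUP = 1, DEMO_ERR_UNKNOWN = 2 };

struct demo_exception : std::exception {
  int code;
  explicit demo_exception(int c) : code(c) {}
  const char *what() const noexcept override { return "demo_exception"; }
};

inline void setup_throwing_int() { throw 1; }
inline void setup_throwing_typed() { throw demo_exception(DEMO_ERR_SETUP); }

// API-boundary wrapper: translate exceptions into integer return codes.
template <typename F> int guarded(F &&f) {
  try {
    f();
    return DEMO_OK;
  } catch (const demo_exception &e) {
    return e.code; // the specific code is preserved
  } catch (...) {
    return DEMO_ERR_UNKNOWN; // a bare int lands here; its value is lost
  }
}
```

Here `guarded(setup_throwing_typed)` yields the setup-specific code, while `guarded(setup_throwing_int)` can only report the generic one.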

mreineck · 2026-05-15T19:27:46Z

Wow, thank you for spotting the parenthesis errors - these were embarrassing!
Can we do anything to the unit test suite to trigger them?
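
One possible kind of check, as a sketch: the kf=6 expression from the fix above is exactly zero at |z|=1, whereas the mis-parenthesized variant is not. The function names are hypothetical, and the kf=4 case is omitted because it needs a Bessel routine.

```cpp
#include <cmath>

// kf==6 form as given in the fix above: the "- 1.0" sits outside cosh().
inline double cosh_kernel_fixed(double z, double beta) {
  return (std::cosh(beta * std::sqrt(1.0 - z * z)) - 1.0) / std::cosh(beta);
}

// The porting error: "- 1.0" ended up inside the cosh() argument.
inline double cosh_kernel_buggy(double z, double beta) {
  return std::cosh(beta * std::sqrt(1.0 - z * z) - 1.0) / std::cosh(beta);
}

// Hypothetical check: a correct kernel of this form is exactly zero at z = +-1.
inline bool vanishes_at_edge(double (*k)(double, double), double beta) {
  return k(1.0, beta) == 0.0 && k(-1.0, beta) == 0.0;
}
```

An edge-value check like this is independent of tolerance sweeps, so it would flag the misplaced constant even for kernels that the regular accuracy tests never exercise.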

In pswf.cpp I threw int because my brain was still in cufinufft mode; fixed now!

> Lambda captures PSWF0 by value (copies two vectors); move it in.

I don't really understand that one. If I move the construction of the PSWF object into the lambda, the setup machinery will run every time the lambda is called, which we don't want. Copying the two vectors once is very cheap in comparison.

DiamonDinoia · 2026-05-16T00:09:22Z

> I don't really understand that one. If I move the construction of the PSWF object into the lambda, the setup machinery will run every time the lambda is called, which we don't want. Copying the two vectors once is very cheap in comparison.

Oh, my bad! The copy happens once, at lambda construction time.
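
For reference, the one-time copy could also be avoided: since C++14, an init-capture can move an already-constructed object into the closure. A minimal sketch, where `Tables` and `make_lookup` are hypothetical stand-ins and not names from the PR:

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for an object that owns precomputed tables.
struct Tables {
  std::vector<double> a, b;
};

inline std::function<double(std::size_t)> make_lookup(Tables t) {
  // C++14 init-capture: `t` is moved into the closure, not copied.
  // The object is still constructed only once, outside the lambda.
  return [t = std::move(t)](std::size_t i) { return t.a[i] + t.b[i]; };
}
```

Since the copy in question happens only once per plan, the gain from moving would likely be negligible here; the sketch only shows that the language supports it.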

DiamonDinoia · 2026-05-16T00:11:20Z

I guess only rebasing onto master is needed, and then I am happy, though @lu1and10 and @ahbarnett are the ones who understand this better than I do.

mreineck · 2026-05-16T07:25:31Z

Master is merged. Please feel free to squash the commits on the branch in any way you see fit - I don't have enough experience with this to do it myself.

I also re-added the "return 0 if z is outside [-1;1]" to the kernel evaluation; I had forgotten this when switching to the lambda.

mreineck · 2026-05-16T07:27:47Z

> Oh my bad! The copy happens once, at lambda construction time

"Move-capturing" would have been a nice feature here, but I suspect that this was left out of the standard for good reasons ...

lu1and10 · 2026-05-16T10:21:29Z

looks good, should we merge?

mreineck · 2026-05-19T11:38:44Z

According to the meeting minutes, @ahbarnett wanted to have a look, so perhaps we should wait.

ahbarnett

Hi Martin. Looks good.
Shame about the code duplication in the kernel defn func now :( (I liked my old short kernel func, but is there no way to write such a code?).

One issue is that we don't actually test all these other unused kernels, so a rewrite such as this can break them and we don't know. I guess they are listed in the docs as not for users anyway.

Could you comment if the PSWF code is capable of eval outside of [-1,1] ? My memory is that it isn't, and that we don't need that even if we use pswf for direct type-3 deconvolution weight calc.

Thanks, Alex

ahbarnett · 2026-05-20T21:22:50Z

-  else if (kf == 3)
+    const double expbeta = std::exp(beta);
+    return [beta, expbeta](double z) {
+      if (std::abs(z) > 1.0) return 0.0; // restrict support to [-1,1]


it's a bit sad how much code duplication this PR created. I guess there's no way to have the clean return of 0.0 as in my original l.43 ? Same for the arg which is now repeated several times...

mreineck · 2026-05-21

I think I can write a version with less code duplication, which is more similar in spirit to the current structure. That will have to wait till next week though.
My goal was to minimize the individual functions that are doing the actual evaluation; if we keep the current style, the evaluation function itself will contain the if(kf==<something>) sequence, which felt worse to me.
Some degree of duplication will be unavoidable, however, if we want to separate the initialization part (i.e. precomputing the expbeta or setting up the PSWF0) and the evaluation proper - and I think we really want to do that.
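
One way the repeated support check and argument could be factored out while keeping setup and evaluation separate is a small wrapper around the per-kernel closure. This is only a sketch of the idea, not the version announced above; `on_support` and `make_exp_kernel` are hypothetical names.

```cpp
#include <cmath>
#include <functional>
#include <utility>

// Hypothetical helper: wraps any inner evaluator so that the support check
// and the sqrt(1 - z^2) argument are written once.
template <typename Inner>
std::function<double(double)> on_support(Inner inner) {
  return [inner = std::move(inner)](double z) -> double {
    if (std::abs(z) > 1.0) return 0.0; // restrict support to [-1,1]
    return inner(std::sqrt(1.0 - z * z));
  };
}

// Illustrative use: per-kernel setup stays separate, evaluation stays tiny.
inline std::function<double(double)> make_exp_kernel(double beta) {
  const double inv_expbeta = std::exp(-beta); // precomputed once
  return on_support([beta, inv_expbeta](double s) {
    return std::exp(beta * s) * inv_expbeta;
  });
}
```

The trade-off is one extra layer of indirection; a kernel that does not depend on sqrt(1 - z²), such as a prolate evaluator taking z directly, would need a variant of the wrapper.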

ahbarnett · 2026-05-20T21:25:51Z

-*/
-double pswf(double c, double x);
+
+/* Class for evaluation of the prolate spheroidal wavefunction


I like this doc - thanks!

mreineck · 2026-05-21T06:05:40Z

> One issue is that we don't actually test all these other unused kernels, so a rewrite such as this can break them and we don't know. I guess they are listed in the docs as not for users anyway.

My personal feeling is that we are close to the point where all the other kernel shapes can be removed - then we don't have to worry about testing them and keeping them up to date. I don't see any reason why anyone should prefer non-PSWF kernels, except for accuracy comparisons of course, but that can always be done by running two different versions of the library against each other.

Once the CUDA part has switched to PSWF, I think this should be seriously considered.

mreineck · 2026-05-21T06:11:25Z

> Could you comment if the PSWF code is capable of eval outside of [-1,1] ? My memory is that it isn't, and that we don't need that even if we use pswf for direct type-3 deconvolution weight calc.

That's correct, the code can't produce anything meaningful outside of [-1;1]. Actually I'm not sure why that should even be a thing ... all definitions of the PSWF I've seen so far gave a definition range of [-1;1] or even (-1;1). Are the Rokhlin extensions for some specific use case?

If values outside this range are required for type-3 deconvolution (and I honestly don't know if they are), then this patch shouldn't go in. But please note that the current PSWF implementation doesn't support this range extension yet either - some of the ingredients are there, but I think the code needs quite some overhaul to activate them.

ahbarnett · 2026-05-21T15:10:35Z

Thanks for the responses, Martin. If you switch this PR from Draft, I'll bring it in.

mreineck · 2026-05-22T16:11:31Z

Done! Sorry, I was offline yesterday.

use tweaked PSWF implementation

9d9075d

mreineck added 2 commits May 4, 2026 20:42

introduce error code for non-convergence in PSWF setup

08a3533

introduce kernel evaluation lambdas to avoid PSWF caching

414818c

remove unused code

f38212f

mreineck added 3 commits May 6, 2026 08:15

tiny improvement

e07fdca

Merge branch 'master' into simplify_pswf

7661968

merge master

0f1c643

github-actions Bot added a commit that referenced this pull request May 12, 2026

Perftest image PR #854 @ 0f1c643 [no ci]

2923248

fix embarassing porting errors in kernel formula for kf=4/6; throw real finufft::exception in PSWF setup

19d5263

mreineck added 2 commits May 16, 2026 09:17

merge master

716000c

re-add [-1;1] support limitation to kernel evaluation

595fd2a

github-actions Bot added a commit that referenced this pull request May 16, 2026

Perftest image PR #854 @ 595fd2a [no ci]

60fdf7f

ahbarnett approved these changes May 20, 2026

mreineck marked this pull request as ready for review May 22, 2026 16:11
