Skip to content

Use tweaked PSWF implementation#854

Open
mreineck wants to merge 10 commits into
masterfrom
simplify_pswf
Open

Use tweaked PSWF implementation#854
mreineck wants to merge 10 commits into
masterfrom
simplify_pswf

Conversation

@mreineck
Copy link
Copy Markdown
Collaborator

@mreineck mreineck commented May 4, 2026

This PR introduces an optimized implementation of the PSWF code.

Advantages:

  • Significantly faster function evaluation (initialization has not become much faster though)
  • lower memory consumption for a single class object
  • more compact code

Open issues:

  • Do we want to throw an error if the iteration in the initialization function does not converge? If yes, do we introduce a new error code? (The existing implementation silently ignores the error case, and to my knowledge, it never triggers.)
  • I did not make any changes to the thread-safe caching section. That may need some additional tweaking at another time.
  • At the moment, evaluation is only supported for one abscissa at a time. Performance can be improved by evaluating a whole vector of abscissas in a single call, but implementation depends on the choice of SIMD library etc., so I left this open for now.

@DiamonDinoia
Copy link
Copy Markdown
Collaborator

DiamonDinoia commented May 4, 2026

I leave the decision to @ahbarnett and @lu1and10,

My suggestion would be to make all the related changes in this PR in one go instead of follow up PR. As I want to use this code in the GPU next, I'd rather make all the changes now than making GPU changes on a moving target.

  1. It is better to have an error. It never triggers but techically finufft can be compiled for half precision with minimal changes. It could trigger in that case. Adding an error costs nothing and might save debug time in the future.
  2. I would remove the cache entirely
  3. I would do the SIMD eval

I would measure the performance of 2 and 3 and see where we end up.

@mreineck I do not want you to taks with something that might be reverted so I would wait for @lu1and10 and @ahbarnett to confirm what they like.

I think 2/3 can be a follow up PR as @lu1and10 prefers but to me we can just do tree commits here in this PR so the history is meaningful.

@lu1and10
Copy link
Copy Markdown
Member

lu1and10 commented May 4, 2026

Thanks for the comments. I like this PR overall and would be in favor of moving it forward.

For (1), I agree that having an error path is better, even if it should basically never trigger.

For (2), I am OK with either keeping the current cache for now or revisiting it if benchmarks show it matters.

For (3), I would prefer to leave SIMD/batched evaluation for a separate PR. A follow-up PR would also let us benchmark the batched path properly and decide on the SIMD abstraction without expanding the scope here.

So my preference is: merge this PR once the error handling/details are settled, and track cache/SIMD batching separately if needed.

@mreineck
Copy link
Copy Markdown
Collaborator Author

mreineck commented May 4, 2026

Thanks for the comments! I'll look at the error handling tomorrow or perhaps later today!

Removing the cache will require structural changes in the rest of the code in order to keep efficiency, but perhaps they are minor (I haven't looked closely yet.)

@mreineck
Copy link
Copy Markdown
Collaborator Author

mreineck commented May 4, 2026

Just a small word of caution:

It never triggers but techically finufft can be compiled for half precision with minimal changes. It could trigger in that case.

Using the PSWF setup code (not necessarily the function evaluation) with anything but double precision will require changes I don't feel competent to do, and the performance gains will probably not be worth the effort.

@mreineck
Copy link
Copy Markdown
Collaborator Author

mreineck commented May 4, 2026

Here is a version that avoids the caches and only builds one PSWF0 object for every transform.
The implementation feels somewhat indirect, since instead of calling kernel_definition repeatedly, we now have a function kernel_definition_lambda, which is called once and returns a lambda that performs the actual kernel evaluation.

@mreineck
Copy link
Copy Markdown
Collaborator Author

mreineck commented May 5, 2026

Interesting ... gcc-13 doesn't seem to be available any more on some test platforms, causing CI runs to fail.
Perhaps this is related to gcc 16 having been released recently, and gcc 13 therefore being phased out(?)

@ahbarnett
Copy link
Copy Markdown
Collaborator

Thanks all,
As we just discussed, I would also prefer SIMD to be a separate PR, just for simplicity, and since speed is not the issue here. Remember type-3 would use the fast kernel Horner eval anyway.
So, Marco, if you want to tweak this, just confine it to non-SIMD for now.

Re half-prec:
The prolate construction (pswf.cpp) should always be double prec. kernel_definition() can caste that to single or whatever. There won't be any speed gain (this is only plan time, and once it's sub-100us, we're good).

Otherwise I'll bring it in...

@DiamonDinoia
Copy link
Copy Markdown
Collaborator

Thanks all, As we just discussed, I would also prefer SIMD to be a separate PR, just for simplicity, and since speed is not the issue here. Remember type-3 would use the fast kernel Horner eval anyway. So, Marco, if you want to tweak this, just confine it to non-SIMD for now.

Okay.

Re half-prec: The prolate construction (pswf.cpp) should always be double prec. kernel_definition() can caste that to single or whatever. There won't be any speed gain (this is only plan time, and once it's sub-100us, we're good).

Right you are correct.

Otherwise I'll bring it in...

I have no objections.

@DiamonDinoia
Copy link
Copy Markdown
Collaborator

Interesting ... gcc-13 doesn't seem to be available any more on some test platforms, causing CI runs to fail. Perhaps this is related to gcc 16 having been released recently, and gcc 13 therefore being phased out(?)

Took the chance to update the compilers in Ci. It should work now.

github-actions Bot added a commit that referenced this pull request May 12, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 12, 2026

Perftest plot

FFT backend: DUCC

Numbers are advisory: GitHub-hosted runners have variable hardware. Treat <1.10× as noise.

CPU and compiler configuration

CPU name: AMD EPYC 7763 64-Core Processor.

Arch: X86_64.

Core count: 2.

ISA extensions: 3dnowext, 3dnowprefetch, abm, adx, aes, aperfmperf, apic, arat, avx, avx2, bmi1, bmi2, clflush, clflushopt, clwb, clzero, cmov, cmp_legacy, constant_tsc, cpuid, cr8_legacy, cx16, cx8, de, decodeassists, erms, extd_apicid, f16c, flushbyasid, fma, fpu, fsgsbase, fsrm, fxsr, fxsr_opt, ht, hypervisor, invpcid, lahf_lm, lm, mca, mce, misalignsse, mmx, mmxext, movbe, msr, mtrr, nonstop_tsc, nopl, npt, nrip_save, nx, osvw, osxsave, pae, pat, pausefilter, pcid, pclmulqdq, pdpe1gb, pfthreshold, pge, pni, popcnt, pse, pse36, rdpid, rdpru, rdrand, rdrnd, rdseed, rdtscp, rep_good, sep, sha, sha_ni, smap, smep, sse, sse2, sse4_1, sse4_2, sse4a, ssse3, svm, syscall, topoext, tsc, tsc_known_freq, tsc_reliable, tsc_scale, umip, user_shstk, v_vmsave_vmload, vaes, vmcb_clean, vme, vmmcall, vpclmulqdq, xgetbv1, xsave, xsavec, xsaveerptr, xsaveopt, xsaves.

Compiler version: c++ (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0.

Compiler flags: -march=native.

perftest commands
taskset -c 0 /home/runner/work/finufft/finufft/builds/master/perftest/perftest --arg --prec=f --N1=10000.0 --N2=1 --N3=1 --ntransf=1 --threads=1 --M=10000000.0 --tol=0.0001 --n_runs=15 --sort=1 --upsampfact=0 --kerevalmethod=1 --debug=0 --bandwidth=1.0 --type=1
taskset -c 0 /home/runner/work/finufft/finufft/builds/master/perftest/perftest --arg --prec=f --N1=10000.0 --N2=1 --N3=1 --ntransf=1 --threads=1 --M=10000000.0 --tol=0.0001 --n_runs=15 --sort=1 --upsampfact=0 --kerevalmethod=1 --debug=0 --bandwidth=1.0 --type=2
taskset -c 0 /home/runner/work/finufft/finufft/builds/master/perftest/perftest --arg --prec=f --N1=10000.0 --N2=1 --N3=1 --ntransf=1 --threads=1 --M=10000000.0 --tol=0.0001 --n_runs=15 --sort=1 --upsampfact=0 --kerevalmethod=1 --debug=0 --bandwidth=1.0 --type=3
taskset -c 0 /home/runner/work/finufft/finufft/builds/master/perftest/perftest --arg --prec=d --N1=10000.0 --N2=1 --N3=1 --ntransf=1 --threads=1 --M=10000000.0 --tol=1e-09 --n_runs=15 --sort=1 --upsampfact=0 --kerevalmethod=1 --debug=0 --bandwidth=1.0 --type=1
taskset -c 0 /home/runner/work/finufft/finufft/builds/master/perftest/perftest --arg --prec=d --N1=10000.0 --N2=1 --N3=1 --ntransf=1 --threads=1 --M=10000000.0 --tol=1e-09 --n_runs=15 --sort=1 --upsampfact=0 --kerevalmethod=1 --debug=0 --bandwidth=1.0 --type=2
taskset -c 0 /home/runner/work/finufft/finufft/builds/master/perftest/perftest --arg --prec=d --N1=10000.0 --N2=1 --N3=1 --ntransf=1 --threads=1 --M=10000000.0 --tol=1e-09 --n_runs=15 --sort=1 --upsampfact=0 --kerevalmethod=1 --debug=0 --bandwidth=1.0 --type=3
taskset -c 0 /home/runner/work/finufft/finufft/builds/master/perftest/perftest --arg --prec=f --N1=320 --N2=320 --N3=1 --ntransf=1 --threads=1 --M=10000000.0 --tol=1e-05 --n_runs=15 --sort=1 --upsampfact=0 --kerevalmethod=1 --debug=0 --bandwidth=1.0 --type=1
taskset -c 0 /home/runner/work/finufft/finufft/builds/master/perftest/perftest --arg --prec=f --N1=320 --N2=320 --N3=1 --ntransf=1 --threads=1 --M=10000000.0 --tol=1e-05 --n_runs=15 --sort=1 --upsampfact=0 --kerevalmethod=1 --debug=0 --bandwidth=1.0 --type=2
taskset -c 0 /home/runner/work/finufft/finufft/builds/master/perftest/perftest --arg --prec=f --N1=320 --N2=320 --N3=1 --ntransf=1 --threads=1 --M=10000000.0 --tol=1e-05 --n_runs=15 --sort=1 --upsampfact=0 --kerevalmethod=1 --debug=0 --bandwidth=1.0 --type=3
taskset -c 0 /home/runner/work/finufft/finufft/builds/master/perftest/perftest --arg --prec=d --N1=320 --N2=320 --N3=1 --ntransf=1 --threads=1 --M=10000000.0 --tol=1e-09 --n_runs=15 --sort=1 --upsampfact=0 --kerevalmethod=1 --debug=0 --bandwidth=1.0 --type=1
taskset -c 0 /home/runner/work/finufft/finufft/builds/master/perftest/perftest --arg --prec=d --N1=320 --N2=320 --N3=1 --ntransf=1 --threads=1 --M=10000000.0 --tol=1e-09 --n_runs=15 --sort=1 --upsampfact=0 --kerevalmethod=1 --debug=0 --bandwidth=1.0 --type=2
taskset -c 0 /home/runner/work/finufft/finufft/builds/master/perftest/perftest --arg --prec=d --N1=320 --N2=320 --N3=1 --ntransf=1 --threads=1 --M=10000000.0 --tol=1e-09 --n_runs=15 --sort=1 --upsampfact=0 --kerevalmethod=1 --debug=0 --bandwidth=1.0 --type=3
/home/runner/work/finufft/finufft/builds/master/perftest/perftest --prec=f --N1=320 --N2=320 --N3=1 --ntransf=1 --threads=0 --M=10000000.0 --tol=1e-05 --n_runs=15 --sort=1 --upsampfact=0 --kerevalmethod=1 --debug=0 --bandwidth=1.0 --type=1
/home/runner/work/finufft/finufft/builds/master/perftest/perftest --prec=f --N1=320 --N2=320 --N3=1 --ntransf=1 --threads=0 --M=10000000.0 --tol=1e-05 --n_runs=15 --sort=1 --upsampfact=0 --kerevalmethod=1 --debug=0 --bandwidth=1.0 --type=2
/home/runner/work/finufft/finufft/builds/master/perftest/perftest --prec=f --N1=320 --N2=320 --N3=1 --ntransf=1 --threads=0 --M=10000000.0 --tol=1e-05 --n_runs=15 --sort=1 --upsampfact=0 --kerevalmethod=1 --debug=0 --bandwidth=1.0 --type=3
/home/runner/work/finufft/finufft/builds/master/perftest/perftest --prec=d --N1=192 --N2=192 --N3=128 --ntransf=1 --threads=0 --M=10000000.0 --tol=1e-07 --n_runs=15 --sort=1 --upsampfact=0 --kerevalmethod=1 --debug=0 --bandwidth=1.0 --type=1
/home/runner/work/finufft/finufft/builds/master/perftest/perftest --prec=d --N1=192 --N2=192 --N3=128 --ntransf=1 --threads=0 --M=10000000.0 --tol=1e-07 --n_runs=15 --sort=1 --upsampfact=0 --kerevalmethod=1 --debug=0 --bandwidth=1.0 --type=2
/home/runner/work/finufft/finufft/builds/master/perftest/perftest --prec=d --N1=192 --N2=192 --N3=128 --ntransf=1 --threads=0 --M=10000000.0 --tol=1e-07 --n_runs=15 --sort=1 --upsampfact=0 --kerevalmethod=1 --debug=0 --bandwidth=1.0 --type=3

@DiamonDinoia
Copy link
Copy Markdown
Collaborator

Hi @mreineck,

  • src/common/kernel.cpp kf=4 — crashes. Lambda passes beta·√(1−z²) − 1.0 to cyl_bessel_i;
    argument goes negative near |z|=1 → std::domain_error. try tolsweep 4.
  • src/common/kernel.cpp kf=6 — wrong values. Same -1.0-inside-arg error in the cosh lambda.
    tolsweep 6 shows worstfac up to 4.4 (~2 digits lost).

Fix:

// kf==4
return (common::cyl_bessel_i(0, beta * std::sqrt(1.0 - z*z)) - 1.0) / besselbeta;
// kf==6
return (std::cosh(beta * std::sqrt(1.0 - z*z)) - 1.0) / coshbeta;
  • prolql1 throws bare int — every other path throws finufft::exception. Will surface as
    FINUFFT_ERR_UNKNOWN_EXCEPTION instead of FINUFFT_ERR_PSWF_SETUP.
  • Lambda captures PSWF0 by value (copies two vectors) — move it in.

@mreineck
Copy link
Copy Markdown
Collaborator Author

Wow, thank you for spotting the parenthesis errors - these were embarassing!
Can we do anything to the unit test suite to trigger them?

In pswf.cpp I threw int because my brain was still in cufinufft mode; fixed now!

Lambda captures PSWF0 by value (copies two vectors) — move it in.

I don't really understand that one. If I move the construction of the PSWF object into the lambda, the setup machinery will run every time the lambda is called, which we don't want. Copying the two vectors once is very cheap in comparison.

@DiamonDinoia
Copy link
Copy Markdown
Collaborator

I don't really understand that one. If I move the construction of the PSWF object into the lambda, the setup machinery will run every time the lambda is called, which we don't want. Copying the two vectors once is very cheap in comparison.

Oh my bad! The copy happens once, at lambda construction time

@DiamonDinoia
Copy link
Copy Markdown
Collaborator

I guess only rebasing master is needed then I am happy. Though @lu1and10 and @ahbarnett are the ones that understand more about this than me.

@mreineck
Copy link
Copy Markdown
Collaborator Author

Master is merged. Please feel free to squash the commits on the branch in any way you see fit - I don't have enough experience with this to do it myself.

I also re-added the "return 0 if z is outside [-1;1]" to the kernel evaluation; I had forgotten this when switching to the lambda.

@mreineck
Copy link
Copy Markdown
Collaborator Author

Oh my bad! The copy happens once, at lambda construction time

"Move-capturing" would have been a nice feature here, but I suspect that this was left out of the standard for good reasons ...

github-actions Bot added a commit that referenced this pull request May 16, 2026
@lu1and10
Copy link
Copy Markdown
Member

looks good, should we merge?

@mreineck
Copy link
Copy Markdown
Collaborator Author

According to the meeting minutes, @ahbarnett wanted to have a look, so perhaps we should wait.

Copy link
Copy Markdown
Collaborator

@ahbarnett ahbarnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Martin. Looks good.
Shame about the code duplication in the kernel defn func now :( (I liked my old short kernel func, but is there no way to write such a code?).

One issue is that we don't actually test all these other unused kernels, so a rewrite such as this can break them and we don't know. I guess they are listed in the docs as not for users anyway.

Could you comment if the PSWF code is capable of eval outside of [-1,1] ? My memory is that it isn't, and that we don't need that even if we use pswf for direct type-3 deconvolution weight calc.

Thanks, Alex

Comment thread src/common/kernel.cpp
else if (kf == 3)
const double expbeta = std::exp(beta);
return [beta, expbeta](double z) {
if (std::abs(z) > 1.0) return 0.0; // restrict support to [-1,1]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a bit sad how much code duplication this PR created. I guess there's no way to have the clean return of 0.0 as in my original l.43 ? Same for the arg which is now repeated several times...

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I can write a version with less code duplication, which is more similar in spirit to the current structure. That will have to wait till next week though.
My goal was to minimize the individual functions that are doing the actual evaluation; if we keep the current style, the evaluation function itself will contain the if(kf==<something>) sequence, which felt worse to me.
Some degree of duplication will be unavoidable however, if we want to separate the initialization part (i.r. precomputing the expbeta or setting up the PSWF0) and the evaluation proper - and I think we really want to do that.

*/
double pswf(double c, double x);

/* Class for evaluation of the prolate spheroidal wavefunction
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this doc - thanks!

@mreineck
Copy link
Copy Markdown
Collaborator Author

One issue is that we don't actually test all these other unused kernels, so a rewrite such as this can break them and we don't know. I guess they are listed in the docs as not for users anyway.

My personal feeling is that we are close to the point where all the other kernel shapes can be removed - then we don't have to worry about testing them and keeping them up to date. I don't see any reason why anyone should prefer non-PSWF kernels, except for accuracy comparisons of course, but that can always be done by running two different versions of the library against each other.

Once the Cuda part has switched to PSWF, I think this should be seriously considered.

@mreineck
Copy link
Copy Markdown
Collaborator Author

Could you comment if the PSWF code is capable of eval outside of [-1,1] ? My memory is that it isn't, and that we don't need that even if we use pswf for direct type-3 deconvolution weight calc.

That's correct, the code can't produce anything meaningful outside of [-1;1]. Actually I'm not sure why that should even be a thing ... all definitions of the PSWF I've seen so far gave a definition range of [-1;1] or even (-1;1). Are the Roghlin extensions for some specific use case?

If values outside this range are required for type-3 deconvolution (and I honestly don't know if they are), then this patch shouldn't go in. But please note that the current PSWF implementation doesn't support this range extension either yet - some of the ingredients are there, but I think the code needs quite some overhaul to activate them,

@ahbarnett
Copy link
Copy Markdown
Collaborator

THanks for the responses, Martin. If you switch this PR from Draft, I'll bring it in.

@mreineck mreineck marked this pull request as ready for review May 22, 2026 16:11
@mreineck
Copy link
Copy Markdown
Collaborator Author

Done! Sorry, I was offline yesterday.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants