Conversation
To precompute SIMD constants at the start, the best solution I found was doing something like this:

```c
#ifdef __SSE2__
static __m128i _128_vec_ones;
#endif

CONSTRUCTOR void simd_init(void)
{
#ifdef __SSE2__
    _128_vec_ones = _mm_set1_epi8('1');
#endif
}
```

where …
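For context, here is a sketch of how such a `CONSTRUCTOR` macro is commonly defined; the comment above is truncated, so this is an assumption rather than the author's actual definition:

```c
/* Assumed definition of the CONSTRUCTOR macro used above. GCC and Clang
 * run functions marked __attribute__((constructor)) before main(), which
 * is what lets simd_init() populate the SIMD constants at startup. */
#if defined(__GNUC__) || defined(__clang__)
#define CONSTRUCTOR __attribute__((constructor))
#else
/* No constructor support: simd_init() must be called manually at startup. */
#define CONSTRUCTOR
#endif
```

On toolchains without constructor attributes (e.g. MSVC), the fallback leaves the macro empty, so the init function has to be invoked explicitly.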
I'm constantly getting these warnings. Apparently they're harmless, since I always use the unaligned load/store intrinsics anyway:

```
warning: cast increases required alignment of target type [-Wcast-align]
  653 |     _mm256_storeu_si256((__m256i *)r->v, out);
```

The only fixes I found are: …
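Two common workarounds for `-Wcast-align` on unaligned intrinsics are laundering the pointer through `void *` (the warning does not fire on casts from `void *`, and the `storeu`/`loadu` intrinsics permit unaligned addresses anyway) or falling back to `memcpy`. A sketch of both, using an SSE2 16-byte copy as a stand-in (the helper name is hypothetical, not from this PR):

```c
#include <stdint.h>
#include <string.h>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

/* Hypothetical helper: by taking void * parameters, the cast to
 * (__m128i *) starts from void *, which -Wcast-align accepts; the
 * unaligned loadu/storeu intrinsics make the access itself safe.
 * On targets without SSE2, memcpy is the fully portable fallback. */
static void copy16_unaligned(void *dst, const void *src)
{
#ifdef __SSE2__
    _mm_storeu_si128((__m128i *)dst, _mm_loadu_si128((const __m128i *)src));
#else
    memcpy(dst, src, 16);
#endif
}
```

The same pattern applies one-for-one to the 32-byte `_mm256_storeu_si256` case from the warning above.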
Force-pushed from f1524a9 to ba4646b
I built this and ran the benchmarks on my machine (…). Signing became faster, but verification became slower. After looking at the bench_internal results, I figured that the culprit is …. This currently saves ~0.5% in …. Some more comments:
Beware that benchmarks are completely unreliable as of right now; see #1701.
Force-pushed from b050834 to ac1cb71
Force-pushed from d04b9f4 to 1127dce
Force-pushed from 890b234 to d6b3f65
This adds AVX and AVX2 intrinsics support to the library in general, as discussed in #1700, wherever it yields an improvement as per the benchmarks.
Why not SSE and AVX-512?
ARM has a different SIMD instruction set; it would be nice to have a separate PR implementing that as well. Maybe after this is merged...
Tasks
- … `-mavx`, `-mavx2`, `-mno-avx`, `-mno-avx2` when building for amd64
- … (TODO: precompute)

Commits
I've split this PR into multiple commits with the following criteria:
Test & Benchmark
To reproduce the following results I temporarily added 3 scripts for building, testing, and benchmarking, as well as a Jupyter notebook to visualize the results. You can verify yourself by running:

```
./simd-build.sh && ./simd-test.sh && ./simd-bench.sh
```

and executing the notebook as is.

The baseline is compiled with `-O3 -mavx -mavx2 -U__AVX__ -U__AVX2__` so that spontaneous gcc vectorization is allowed, but my manual vectorization is not compiled.

Results
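The `-U__AVX2__` trick works because the hand-written paths are (presumably) guarded by the compiler-defined feature macros, so undefining the macro on the command line forces the scalar fallback even though `-mavx2` still lets the compiler auto-vectorize. A minimal illustration of that guard pattern (the function is hypothetical, not from this PR):

```c
#include <stdint.h>
#include <stddef.h>
#ifdef __AVX2__
#include <immintrin.h>
#endif

/* Illustrative guard pattern: the vectorized path only compiles when
 * __AVX2__ is defined. Building with -mavx2 -U__AVX2__ therefore keeps
 * spontaneous compiler vectorization but drops this manual path. */
static void xor_bytes32(uint8_t *r, const uint8_t *a, const uint8_t *b)
{
#ifdef __AVX2__
    __m256i va = _mm256_loadu_si256((const __m256i *)(const void *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)(const void *)b);
    _mm256_storeu_si256((__m256i *)(void *)r, _mm256_xor_si256(va, vb));
#else
    size_t i;
    for (i = 0; i < 32; i++) r[i] = a[i] ^ b[i];
#endif
}
```

Both paths compute the same result, which is what makes the baseline/manual comparison an apples-to-apples benchmark.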