optimize additive pubkey tweaking with vartime generator point multiplication (>80% speedup)#1843
theStack wants to merge 4 commits into bitcoin-core:master from
Conversation
Neat. My guess would be that even bigger SDMC tables would be even better, because the cost of constant-time table lookups (which need to read through the entire table) disappears.
Having this available as a global constant allows introducing an alternative `ecmult_gen` function that doesn't need access to a context; see the next commit. Note that the precomputed constant takes the name of the function that previously generated it at run-time (`secp256k1_ecmult_gen_scalar_diff`), while the function is now renamed to include the "compute" verb (`secp256k1_ecmult_gen_compute_scalar_diff`), to match the naming of the table generation function.
Add a faster variable-time variant for generator point multiplication. This is essentially `ecmult_gen` without side-channel mitigations and without requiring a context object. Intended for use cases where the scalar does not represent sensitive data (e.g. pubkey tweaking). On arm64, this is ~66% faster than the constant-time variant (with the default build table size, i.e. ECMULT_GEN_KB=86):

```
$ ./build/bin/bench_ecmult
Benchmark      , Min(us) , Avg(us) , Max(us)
ecmult_gen     , 9.94    , 10.0    , 10.1
ecmult_gen_var , 5.96    , 5.99    , 6.02
```
Rather than using the generic double multiply (with na=1), use the newly introduced fast variable-time generator point multiplication and add the result to the base public key manually. On arm64, this improves the performance of the `secp256k1_ec_pubkey_tweak_add` API function by about 80%.

Before (prior to this commit):
```
$ ./build/bin/bench tweak
Benchmark       , Min(us) , Avg(us) , Max(us)
ec_pk_tweak_add , 16.1    , 16.2    , 16.6
```

After (this commit):
```
$ ./build/bin/bench tweak
Benchmark       , Min(us) , Avg(us) , Max(us)
ec_pk_tweak_add , 8.94    , 8.98    , 9.29
```

Note that the following API functions also benefit from the improved code path:
- secp256k1_xonly_pubkey_tweak_add
- secp256k1_xonly_pubkey_tweak_add_check (this one is consensus-critical for P2TR script path spends, see BIP341)
- secp256k1_keypair_xonly_tweak_add
- secp256k1_musig_pubkey_ec_tweak_add
- secp256k1_musig_pubkey_xonly_tweak_add
Added a preparatory commit that moves the
Ah, indeed! I've tried with table sizes up to ~18 MB, and at least on my arm64 machine, the ~5MB one (GEN_KB=5120) yields the best results (see theStack@aa1a4e7 and
So we would need to embed two different tables to get the most out of it (without degrading the signing/nonce generation speed). I'm not sure how performance-critical pubkey tweaking is in practice, though; the only currently widespread scenario I can think of is verification of taproot commitments. For silentpayments I still have to check how these larger table values would affect overall performance.
Force-pushed from f64306a to 0ebb1aa
Concept ACK
@theStack Yeah, I don't think it's worth having two built-in tables just for this, but it's good to see my intuition confirmed.
The code can be deduplicated.
diff --git a/src/eckey_impl.h b/src/eckey_impl.h
index 9cee88d..566a318 100644
--- a/src/eckey_impl.h
+++ b/src/eckey_impl.h
@@ -61,6 +61,7 @@ static int secp256k1_eckey_privkey_tweak_add(secp256k1_scalar *key, const secp25
static int secp256k1_eckey_pubkey_tweak_add(secp256k1_ge *key, const secp256k1_scalar *tweak) {
secp256k1_gej pt;
+ /* `tweak` is public here, so the variable-time generator multiplication is safe. */
secp256k1_ecmult_gen_var(&pt, tweak);
secp256k1_gej_add_ge_var(&pt, &pt, key, NULL);
diff --git a/src/ecmult_gen.h b/src/ecmult_gen.h
index 74942ed..567cc48 100644
--- a/src/ecmult_gen.h
+++ b/src/ecmult_gen.h
@@ -138,6 +138,10 @@ static void secp256k1_ecmult_gen_context_clear(secp256k1_ecmult_gen_context* ctx
/** Multiply with the generator: R = a*G */
static void secp256k1_ecmult_gen(const secp256k1_ecmult_gen_context* ctx, secp256k1_gej *r, const secp256k1_scalar *a);
+/** Multiply with the generator: R = a*G.
+ *
+ * Variable-time implementation. Only call this for public/non-secret scalars.
+ */
static void secp256k1_ecmult_gen_var(secp256k1_gej *r, const secp256k1_scalar *a);
static void secp256k1_ecmult_gen_blind(secp256k1_ecmult_gen_context *ctx, const secp256k1_hash_ctx *hash_ctx, const unsigned char *seed32);
diff --git a/src/ecmult_gen_impl.h b/src/ecmult_gen_impl.h
index 8e4209f..3c4460e 100644
--- a/src/ecmult_gen_impl.h
+++ b/src/ecmult_gen_impl.h
@@ -30,17 +30,89 @@ static void secp256k1_ecmult_gen_context_clear(secp256k1_ecmult_gen_context *ctx
secp256k1_fe_clear(&ctx->proj_blind);
}
+/* Convert a scalar to a zero-padded array of 32-bit words. Only the first 8 words
+ * can contain non-zero data, but the padding avoids out-of-bounds reads from the
+ * scalar when COMB_BITS > 256. */
+static void secp256k1_ecmult_gen_scalar_to_recoded(uint32_t recoded[(COMB_BITS + 31) >> 5], const secp256k1_scalar *s) {
+ int i;
+
+ memset(recoded, 0, ((COMB_BITS + 31) >> 5) * sizeof(*recoded));
+ for (i = 0; i < 8 && i < ((COMB_BITS + 31) >> 5); ++i) {
+ recoded[i] = secp256k1_scalar_get_bits_limb32(s, 32 * i, 32);
+ }
+}
+
+/* Gather the mask(block)-selected bits of the recoded scalar into a packed value:
+ * bits[tooth] = d[(block*COMB_TEETH + tooth)*COMB_SPACING + comb_off].
+ *
+ * This constant-time variant mirrors the power side-channel hardening used by
+ * secp256k1_ecmult_gen. */
+static uint32_t secp256k1_ecmult_gen_lookup_bits(const uint32_t recoded[(COMB_BITS + 31) >> 5], uint32_t comb_off, uint32_t block) {
+ uint32_t bits = 0;
+ uint32_t bit_pos = comb_off + block * COMB_TEETH * COMB_SPACING;
+ uint32_t tooth;
+
+ /* Instead of reading individual bits here to construct the bits variable,
+ * build up the result by xoring rotated reads together. In every iteration,
+ * one additional bit is made correct, starting at the bottom. The bits
+ * above that contain junk. This reduces leakage by avoiding computations
+ * on variables that can have only a low number of possible values (e.g.,
+ * just two values when reading a single bit into a variable.) See:
+ * https://www.usenix.org/system/files/conference/usenixsecurity18/sec18-alam.pdf
+ */
+ for (tooth = 0; tooth < COMB_TEETH; ++tooth) {
+ /* Construct bitdata s.t. the bottom bit is the bit we'd like to read.
+ *
+ * We could just set bitdata = recoded[bit_pos >> 5] >> (bit_pos & 0x1f)
+ * but this would simply discard the bits that fall off at the bottom,
+ * and thus, for example, bitdata could still have only two values if we
+ * happen to shift by exactly 31 positions. We use a rotation instead,
+ * which ensures that bitdata doesn't lose entropy. This relies on the
+ * rotation being atomic, i.e., the compiler emitting an actual rot
+ * instruction. */
+ uint32_t bitdata = secp256k1_rotr32(recoded[bit_pos >> 5], bit_pos & 0x1f);
+
+ /* Clear the bit at position tooth, but sssh, don't tell clang. */
+ uint32_t volatile vmask = ~(1 << tooth);
+ bits &= vmask;
+
+ /* Write the bit into position tooth (and junk into higher bits). */
+ bits ^= bitdata << tooth;
+ bit_pos += COMB_SPACING;
+ }
+ return bits;
+}
+
+/* Same bit packing as secp256k1_ecmult_gen_lookup_bits(), but without the
+ * constant-time/power-analysis hardening as the scalar is public here. */
+static uint32_t secp256k1_ecmult_gen_lookup_bits_var(const uint32_t recoded[(COMB_BITS + 31) >> 5], uint32_t comb_off, uint32_t block) {
+ uint32_t bits = 0;
+ uint32_t bit_pos = comb_off + block * COMB_TEETH * COMB_SPACING;
+ uint32_t tooth;
+
+ for (tooth = 0; tooth < COMB_TEETH; ++tooth) {
+ uint32_t bit = (recoded[bit_pos >> 5] >> (bit_pos & 0x1f)) & 1;
+ bits |= bit << tooth;
+ bit_pos += COMB_SPACING;
+ }
+ return bits;
+}
+
+static void secp256k1_ecmult_gen_lookup_table_index(uint32_t bits, uint32_t *sign, uint32_t *abs) {
+ *sign = (bits >> (COMB_TEETH - 1)) & 1;
+ *abs = (bits ^ -*sign) & (COMB_POINTS - 1);
+ VERIFY_CHECK(*sign == 0 || *sign == 1);
+ VERIFY_CHECK(*abs < COMB_POINTS);
+}
+
static void secp256k1_ecmult_gen(const secp256k1_ecmult_gen_context *ctx, secp256k1_gej *r, const secp256k1_scalar *gn) {
uint32_t comb_off;
secp256k1_ge add;
secp256k1_fe neg;
secp256k1_ge_storage adds;
secp256k1_scalar d;
- /* Array of uint32_t values large enough to store COMB_BITS bits. Only the bottom
- * 8 are ever nonzero, but having the zero padding at the end if COMB_BITS>256
- * avoids the need to deal with out-of-bounds reads from a scalar. */
- uint32_t recoded[(COMB_BITS + 31) >> 5] = {0};
- int first = 1, i;
+ uint32_t recoded[(COMB_BITS + 31) >> 5];
+ int first = 1;
memset(&adds, 0, sizeof(adds));
@@ -88,9 +160,7 @@ static void secp256k1_ecmult_gen(const secp256k1_ecmult_gen_context *ctx, secp25
/* Compute the scalar d = (gn + ctx->scalar_offset). */
secp256k1_scalar_add(&d, &ctx->scalar_offset, gn);
/* Convert to recoded array. */
- for (i = 0; i < 8 && i < ((COMB_BITS + 31) >> 5); ++i) {
- recoded[i] = secp256k1_scalar_get_bits_limb32(&d, 32 * i, 32);
- }
+ secp256k1_ecmult_gen_scalar_to_recoded(recoded, &d);
secp256k1_scalar_clear(&d);
/* In secp256k1_ecmult_gen_prec_table we have precomputed sums of the
@@ -171,47 +241,11 @@ static void secp256k1_ecmult_gen(const secp256k1_ecmult_gen_context *ctx, secp25
comb_off = COMB_SPACING - 1;
while (1) {
uint32_t block;
- uint32_t bit_pos = comb_off;
/* Inner loop: for each block, add table entries to the result. */
for (block = 0; block < COMB_BLOCKS; ++block) {
- /* Gather the mask(block)-selected bits of d into bits. They're packed:
- * bits[tooth] = d[(block*COMB_TEETH + tooth)*COMB_SPACING + comb_off]. */
- uint32_t bits = 0, sign, abs, index, tooth;
- /* Instead of reading individual bits here to construct the bits variable,
- * build up the result by xoring rotated reads together. In every iteration,
- * one additional bit is made correct, starting at the bottom. The bits
- * above that contain junk. This reduces leakage by avoiding computations
- * on variables that can have only a low number of possible values (e.g.,
- * just two values when reading a single bit into a variable.) See:
- * https://www.usenix.org/system/files/conference/usenixsecurity18/sec18-alam.pdf
- */
- for (tooth = 0; tooth < COMB_TEETH; ++tooth) {
- /* Construct bitdata s.t. the bottom bit is the bit we'd like to read.
- *
- * We could just set bitdata = recoded[bit_pos >> 5] >> (bit_pos & 0x1f)
- * but this would simply discard the bits that fall off at the bottom,
- * and thus, for example, bitdata could still have only two values if we
- * happen to shift by exactly 31 positions. We use a rotation instead,
- * which ensures that bitdata doesn't lose entropy. This relies on the
- * rotation being atomic, i.e., the compiler emitting an actual rot
- * instruction. */
- uint32_t bitdata = secp256k1_rotr32(recoded[bit_pos >> 5], bit_pos & 0x1f);
-
- /* Clear the bit at position tooth, but sssh, don't tell clang. */
- uint32_t volatile vmask = ~(1 << tooth);
- bits &= vmask;
-
- /* Write the bit into position tooth (and junk into higher bits). */
- bits ^= bitdata << tooth;
- bit_pos += COMB_SPACING;
- }
-
- /* If the top bit of bits is 1, flip them all (corresponding to looking up
- * the negated table value), and remember to negate the result in sign. */
- sign = (bits >> (COMB_TEETH - 1)) & 1;
- abs = (bits ^ -sign) & (COMB_POINTS - 1);
- VERIFY_CHECK(sign == 0 || sign == 1);
- VERIFY_CHECK(abs < COMB_POINTS);
+ uint32_t bits, sign, abs, index;
+ bits = secp256k1_ecmult_gen_lookup_bits(recoded, comb_off, block);
+ secp256k1_ecmult_gen_lookup_table_index(bits, &sign, &abs);
/** This uses a conditional move to avoid any secret data in array indexes.
* _Any_ use of secret indexes has been demonstrated to result in timing
@@ -275,38 +309,22 @@ static void secp256k1_ecmult_gen_var(secp256k1_gej *r, const secp256k1_scalar *g
uint32_t comb_off;
secp256k1_ge add;
secp256k1_scalar d;
- uint32_t recoded[(COMB_BITS + 31) >> 5] = {0};
- int i;
+ uint32_t recoded[(COMB_BITS + 31) >> 5];
/* Adjust input scalar for difference and convert to recoded array. */
secp256k1_scalar_add(&d, &secp256k1_ecmult_gen_scalar_diff, gn);
- for (i = 0; i < 8 && i < ((COMB_BITS + 31) >> 5); ++i) {
- recoded[i] = secp256k1_scalar_get_bits_limb32(&d, 32 * i, 32);
- }
+ secp256k1_ecmult_gen_scalar_to_recoded(recoded, &d);
/* Outer loop: iterate over comb_off from COMB_SPACING - 1 down to 0. */
secp256k1_gej_set_infinity(r);
comb_off = COMB_SPACING - 1;
while (1) {
uint32_t block;
- uint32_t bit_pos = comb_off;
/* Inner loop: for each block, add table entries to the result. */
for (block = 0; block < COMB_BLOCKS; ++block) {
- /* Gather the mask(block)-selected bits of d into bits. They're packed:
- * bits[tooth] = d[(block*COMB_TEETH + tooth)*COMB_SPACING + comb_off]. */
- uint32_t bits = 0, sign, abs, tooth;
- for (tooth = 0; tooth < COMB_TEETH; ++tooth) {
- uint32_t bit = (recoded[bit_pos >> 5] >> (bit_pos & 0x1f)) & 1;
- bits |= bit << tooth;
- bit_pos += COMB_SPACING;
- }
-
- /* If the top bit of bits is 1, flip them all (corresponding to looking up
- * the negated table value), and remember to negate the result in sign. */
- sign = (bits >> (COMB_TEETH - 1)) & 1;
- abs = (bits ^ -sign) & (COMB_POINTS - 1);
- VERIFY_CHECK(sign == 0 || sign == 1);
- VERIFY_CHECK(abs < COMB_POINTS);
+ uint32_t bits, sign, abs;
+ bits = secp256k1_ecmult_gen_lookup_bits_var(recoded, comb_off, block);
+ secp256k1_ecmult_gen_lookup_table_index(bits, &sign, &abs);
/* Perform lookup, negate if necessary and add to r. */
secp256k1_ge_from_storage(&add, &secp256k1_ecmult_gen_prec_table[block][abs]);
The word "typically" is what worries me here. We don't really know what people use these functions for. What if they do have secret tweaks? And isn't the tweak secret even in known applications, e.g., Taproot tweaking, say $Q = P + t \cdot G$? One security feature of Taproot is that it hides the Merkle root until a script-path spend. This only works if the tweak $t$ is secret. If $t$ is public, then an attacker can compute $P = Q - t \cdot G$ and search for a Merkle root $m$ such that $t = \mathrm{hash}(P, m)$.
Hm good point, the "if something sounds too good to be true, it probably is" proverb comes to my mind 😅
Makes sense. For verifying taproot tweaks though, i.e.
Hmm, I would consider that a privacy feature, and not a security feature. Generally, taproot script trees will involve multiple distrusted participants, and the leaf/script structure will be shared with them. This doesn't work if the script tree is supposed to be secret from a security perspective. And I think that we're generally not aiming to use constant time operations to protect privacy. Otherwise, it literally applies to everything. You may well want to keep your public keys private, so all functions involving public keys should be constant time? It may make sense to have a mode/extension/module for "do everything constant time", but I don't think it aligns with how the library has operated so far.
…t-path spends

- fbffe8a bench: improve `VerifyNestedIfScript` benchmark precision (make stack clearing untimed) (David Gumberg)
- 616ee6f bench: add script verification benchmark for P2TR script-path spends (Sebastian Falbesoner)

Pull request description: Similarly as #34472 already did for key-path spends, this PR adds a benchmark for P2TR script-path spends. So far we don't have a benchmark on master yet that exercises the verification of taproot commitments ([`VerifyTaprootCommitment`](https://github.com/bitcoin/bitcoin/blob/141fbe4d530b51345e62dee1348e82d8a0406ffc/src/script/interpreter.cpp#L1903)). Note that the tapscript leaf intentionally includes a single OP_CHECKSIG, as it likely reflects the real world best. Spending tapscript leaves without any signature checks doesn't make much sense (they could be trivially tampered with and thus stolen by miners), and doing more than one signature check seems the exception rather than the rule. The primary motivation for this PR is to evaluate how potential secp256k1 changes in pubkey tweaking (e.g. [#1843](bitcoin-core/secp256k1#1843)) may impact script verification performance.

ACKs for top commit:
- davidgumberg: reACK fbffe8a
- l0rinc: ACK fbffe8a
- sedited: ACK fbffe8a

Tree-SHA512: 6fa1f2c336d6332b4f2d22173279ee29ad3ec5e5431109913a6978fef32e22a34d3247729ac9092bfdbbcd8f9dfcad8da50bc4ede2cb7efc1f93d6a744ddf41b
Additive public key tweaking (i.e. given a public key $P$ and a tweak scalar $t$, calculating $P' = P + t \cdot G$) is currently implemented using the "double multiply" algorithm (computing $R = na \cdot A + ng \cdot G$), with $na = 1$ and $ng = t$:
`secp256k1_ecmult` (secp256k1/src/ecmult.h, lines 43 to 47 in 7262adb)

secp256k1/src/eckey_impl.h, lines 62 to 65 in 7262adb
After some tinkering I found that this could be sped up quite a bit. To take full advantage of the fact that the tweak scalar is typically not considered sensitive data (as opposed to secret keys or nonces), this PR first introduces a variable-time generator point multiplication routine `secp256k1_ecmult_gen_var`, which is essentially a copy of `secp256k1_ecmult_gen` [1], but with all the side-channel resistance mitigations stripped out. This function is significantly faster than its constant-time original in all COMB table size configurations, over 66% in all but the smallest 2KB setting (benchmarked via `$ ./build/bin/bench_ecmult`; per-table-size timing comparison of `ecmult_gen` vs. `ecmult_gen_var` not reproduced here).

Plugging that new routine into the internal function `secp256k1_eckey_pubkey_tweak_add` (which is easy, as the vartime variant is stateless and thus doesn't need a context object anymore) yields significant speedups for the API function `secp256k1_ec_pubkey_tweak_add` (benchmarked via `$ ./build/bin/bench tweak` on the second vs. third commit; comparison of `ec_pubkey_tweak_add` using `ecmult` vs. using the new `ecmult_gen_var` not reproduced here).

Note that the improved code path (function `secp256k1_eckey_pubkey_tweak_add`) is also used in the following API functions, so they should all benefit from it:
- `secp256k1_xonly_pubkey_tweak_add`
- `secp256k1_xonly_pubkey_tweak_add_check` (this one is used for verifying P2TR script path spends in Bitcoin Core and thus consensus-critical, see BIP341)
- `secp256k1_keypair_xonly_tweak_add`
- `secp256k1_musig_pubkey_{ec,xonly}_tweak_add`
- `secp256k1_silentpayments_recipient_scan_outputs` in PR "Add "silentpayments" module implementing BIP352 (take 4, limited to full-node scanning)" #1765

Improving Silent Payments scanning performance was the original motivation to look into this (see also https://github.com/craigraw/bench_bip352/), and in fact the "common case" benchmarks show a 15-20% speedup if this branch is applied.
[1] I've also experimented with reintroducing the much simpler `ECMULT_GEN_PREC_BITS` precomputation table (which was used prior to SDMC, in versions earlier than 0.5.0), and that performs even better with PREC_BITS=8: theStack@8f7fc93 (needing only 31 point additions rather than 42). However, given that it's unclear where to store this table data (having the choice of either bloating up the library size further or again needing some form of context, if it's calculated at runtime), I decided to stick with a solution that uses the already available table data first.