
Fast modular inversion #172

Merged · 14 commits · Feb 10, 2022
Conversation

mratsim (Owner) commented Feb 8, 2022

This implements fast constant-time modular inversion.

Preliminary benchmarks, without Assembly

[benchmark screenshots omitted from this export]

On BLS12-381, this is almost 8x faster than Niels Möller's algorithm (the constant-time inversion used in GMP) and than Fermat's-little-theorem inversion with addition chains.
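For readers unfamiliar with the approach, below is a minimal, variable-time Python sketch of Bernstein-Yang inversion via the divstep iteration. The function name and structure are illustrative, not Constantine's code; a production implementation batches 62 divsteps into word-sized transition matrices and runs a fixed, precomputed number of iterations so the control flow is data-independent (constant-time).

```python
def safegcd_modinv(x, m):
    """Compute x^-1 mod m for odd m with gcd(x, m) == 1.

    Variable-time reference sketch of Bernstein-Yang (safegcd);
    NOT constant-time: the loop length depends on the input.
    """
    delta, f, g = 1, m, x % m
    # Invariants maintained below: f ≡ d*x (mod m), g ≡ e*x (mod m)
    d, e = 0, 1
    inv2 = pow(2, -1, m)  # halving mod m (m is odd), Python 3.8+

    while g != 0:
        if delta > 0 and g & 1:
            # Swap-and-subtract divstep
            delta, f, g = 1 - delta, g, (g - f) // 2
            d, e = e, ((e - d) * inv2) % m
        elif g & 1:
            # Add-and-halve divstep
            delta, f, g = 1 + delta, f, (g + f) // 2
            e = ((e + d) * inv2) % m
        else:
            # g even: just halve
            delta, f, g = 1 + delta, f, g // 2
            e = (e * inv2) % m

    # Now f == ±gcd(x, m) == ±1 and d*x ≡ f (mod m),
    # so (d*f)*x ≡ f² ≡ 1 (mod m).
    assert abs(f) == 1
    return (d * f) % m
```

For example, `safegcd_modinv(3, 7)` returns 5, since 3·5 = 15 ≡ 1 (mod 7).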

mratsim (Owner, Author) commented Feb 8, 2022

Discussion of chosen algorithm

There have been 3 papers on fast inversion in the past 3 years:

  • Bernstein-Yang inversion:

  • Pornin's inversion:
Discussion

This PR implements Bernstein-Yang inversion; there is a sketch of Pornin's inversion at:

Correctly and efficiently implementing Pornin's for generic primes is actually tricky:

  • L22: (u, v) ← (uf₀ + vg₀ mod m, uf₁ + vg₁ mod m)
    This requires efficient modular reduction. That holds for generalized Mersenne primes
    like secp256k1 or ED25519, but not for BLS12-381.
    Given that Pornin's approach uses 31-bit divsteps instead of Bernstein-Yang's 62-bit
    divsteps (on 64-bit hardware), a slow reduction has twice the impact.
  • BLST's authors delayed the modular reduction, but this triggered
    an edge case found by fuzzing: supranational/blst@fd45352#commitcomment-66068518
    In the past, another edge case was raised:
  • An efficient implementation requires:
    1. assembly for cmov in the inner loop and for leading-zero count
    2. fast or delayed/batched modular reduction
    3. an extra bit in the high word for negative integers, making it unsuitable for secp256k1 or P256
       when using a saturated representation.
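As a quick illustration of the saturated-representation point (my own check, not from the PR): secp256k1's prime fills all 256 bits of four 64-bit limbs, leaving no spare high bit for a sign, while BLS12-381's 381-bit prime leaves 3 spare bits in six 64-bit limbs.

```python
# Well-known published moduli for both curves' base fields
p_secp256k1 = 2**256 - 2**32 - 977
p_bls12_381 = int(
    "1a0111ea397fe69a4b1ba7b6434bacd764774b84f38512bf"
    "6730d2a0f6b0f6241eabfffeb153ffffb9feffffffffaaab", 16)

print(p_secp256k1.bit_length())  # 256 -> saturated: 4x64-bit limbs, no spare bit
print(p_bls12_381.bit_length())  # 381 -> 3 spare bits in 6x64-bit limbs
```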

In particular, the inner loop needs to be as streamlined as possible; cmov and leading-zero count are platform-dependent intrinsics, and their absence makes the inner loop slow in pure Nim/C.
Regarding point 2, delayed/batched modular reduction alone is doable, but Pornin's method also relies on an approximation of the inputs that must be corrected at regular intervals and at the end of the computation. Given the edge cases that surfaced in BLST, delaying modular reduction AND correcting the approximation AND doing both in constant time seems fraught with peril.
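To make the cmov/lzcount point concrete, here is an illustrative Python sketch of the branchless building blocks (names are mine; "constant-time" here means data-independent control flow — the PR's argument is precisely that without assembly these primitives are slow and compiler-dependent in Nim/C):

```python
MASK64 = (1 << 64) - 1  # simulate 64-bit words in Python's big ints

def ct_mask(bit):
    """All-ones 64-bit mask if bit == 1, zero if bit == 0. No branches."""
    return (-bit) & MASK64

def ct_cmov(cond_bit, a, b):
    """Return a if cond_bit == 1 else b, via masking instead of branching."""
    m = ct_mask(cond_bit)
    return (b ^ (m & (a ^ b))) & MASK64

def ct_lzcount(x, bits=64):
    """Leading-zero count that scans every bit unconditionally,
    so the operation count is independent of x."""
    n = 0
    seen_one = 0
    for i in range(bits - 1, -1, -1):
        seen_one |= (x >> i) & 1   # latches to 1 at the first set bit
        n += 1 - seen_one          # counts zeros before that bit
    return n
```

For example, `ct_cmov(1, 5, 9)` yields 5, `ct_cmov(0, 5, 9)` yields 9, and `ct_lzcount(1)` yields 63. On hardware, `cmov` and `lzcnt` do this in one instruction each, which is why the inner loop wants assembly.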

Successfully merging this pull request may close these issues.

Implement fast inversion for public data