gh-90716: add _pylong.py module #96673
Conversation
Some relevant discussion on the pros and cons of adding these "smarter" but more complicated algorithms: https://discuss.python.org/t/faster-large-integer-multiplication/13300 I think implementing them in Python lowers the bar quite a bit in terms of what is maintainable in the long term. We have a small number of core devs who are numeric experts, so we need to be cautious about adding a bunch of complex code. Maybe this PR already goes too far? It feels like pretty good "bang for buck" to me.
About long_to_decimal, here are some things I've noticed/thoughts:

```python
import decimal

D = decimal.Decimal
mem = {1: D(2)}

def w2pow(w):
    """Return D(2)**w and store the result."""
    if w not in mem:
        if w - 1 in mem:
            mem[w] = mem[w - 1] + mem[w - 1]
        else:
            w2 = w >> 1
            mem[w] = w2pow(w2) * w2pow(w - w2)
    return mem[w]

BITLIM = 32768

def inner(n, w):
    if w <= BITLIM:
        return D(n)
    w2 = w >> 1
    hi = n >> w2
    lo = n - (hi << w2)
    return inner(hi, w - w2) * w2pow(w2) + inner(lo, w2)

with decimal.localcontext() as ctx:
    ctx.prec = decimal.MAX_PREC
    ctx.Emax = decimal.MAX_EMAX
    ctx.Emin = decimal.MIN_EMIN
    ctx.traps[decimal.Inexact] = 1
    result = inner(n, n.bit_length())
```
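A side note on the context setup above: trapping decimal.Inexact turns any precision loss into an exception, so the conversion is guaranteed exact or it fails loudly. A minimal sketch of that behavior (my own illustration, not code from the thread):

```python
import decimal

# With a deliberately tiny precision, an exact arithmetic result passes,
# while a result that needs rounding raises decimal.Inexact when trapped.
with decimal.localcontext() as ctx:
    ctx.prec = 5
    ctx.traps[decimal.Inexact] = True
    exact = decimal.Decimal(2) ** 10          # 1024: 4 digits, exact
    assert exact == 1024
    try:
        decimal.Decimal(123456) * 1           # 6 digits: must round
        raised = False
    except decimal.Inexact:
        raised = True
    assert raised
```

The real code uses MAX_PREC, so the trap should never fire in practice; it is a correctness guard, not an expected path.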
I agree that persistent dictionaries are bad, but a persistent one was never my intention. About the number of keys used: I fixed my implementation to use fewer keys. Now I believe it uses the same number of keys, or sometimes 1 fewer key, than your implementation. For example, try it out with bit width 1,000,001:

```python
import decimal

D = decimal.Decimal
D2 = D(2)
mem = {}

def w2pow(w):
    """Return D(2)**w and store the result."""
    if w not in mem:
        if w - 1 in mem:
            mem[w] = mem[w - 1] + mem[w - 1]
        elif w <= BITLIM:
            mem[w] = D2 ** w
        else:
            w2 = w >> 1
            mem[w] = w2pow(w2) * w2pow(w - w2)
    return mem[w]

BITLIM = 32768  # 128

def inner(n, w):
    if w <= BITLIM:
        return D(n)
    w2 = w >> 1
    hi = n >> w2
    lo = n - (hi << w2)
    return inner(lo, w2) + inner(hi, w - w2) * w2pow(w2)

with decimal.localcontext() as ctx:
    ctx.prec = decimal.MAX_PREC
    ctx.Emax = decimal.MAX_EMAX
    ctx.Emin = decimal.MIN_EMIN
    ctx.traps[decimal.Inexact] = 1
    result = inner(n, n.bit_length())
```
I'm not going to argue about this. As I said, I didn't find the code hard to understand to begin with, so this is just random thrashing to me. I expect you find your code easier to understand because you wrote it.
That seems wrong. Trace through, e.g., `mem[w] = w2pow(w2) * w2pow(w - w2)`, which makes recursive calls with `w2` and `w - w2`.
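The evaluation-order point can be checked directly. Python evaluates the left operand first, so `w2pow(w2)` populates the cache before `w2pow(w - w2)` runs, and for odd `w` the larger half (`w2 + 1`) can hit the cheap `w - 1 in mem` branch. A small instrumented sketch (plain ints instead of Decimal, with a tiny made-up base-case limit):

```python
cheap_hits = 0
mem = {}

def w5pow(w):
    """Memoized 5**w, counting uses of the cheap `w - 1 in mem` branch."""
    global cheap_hits
    if w not in mem:
        if w - 1 in mem:
            cheap_hits += 1
            mem[w] = mem[w - 1] * 5
        elif w <= 8:
            mem[w] = 5 ** w
        else:
            w2 = w >> 1
            # Left operand is evaluated first: mem[w2] exists before the
            # right call, so an odd w - w2 == w2 + 1 takes the cheap branch.
            mem[w] = w5pow(w2) * w5pow(w - w2)
    return mem[w]

assert w5pow(1001) == 5 ** 1001
assert cheap_hits > 0
```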
Hi bjorn-martinsson, I appreciate the interest in improving these functions. I'm less confident about some of the details. Writing more extensive unit tests and scripts to produce timing/benchmark reports is perhaps the next step. Based on comments from njs, I wonder if _pylong should use a ContextVar.
Neil, I think the one basic thing this is missing is an asymptotically faster str->int function. Conversions in both directions were the real topic of the original issue report, although it focused on int->str (as "the harder" problem). Both are quadratic-time today. @bjorn-martinsson posted suitable code for str->int (in a comment) in gh-90716. It's much like the int->str here, but sticks to native Python int arithmetic so only gets "Karatsuba-like" speedup. Which is huge, but not as huge as we can get.(*)

(*) But could be faster after this PR lands.
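For readers unfamiliar with where the "Karatsuba-like" exponent comes from: CPython's big-int multiply does this in C, but a toy Python version (illustrative only, not the PR's code) shows the three-recursive-multiplies structure that yields O(n**log2(3)) ≈ O(n**1.585):

```python
def karatsuba(x, y, cutoff=64):
    """Toy Karatsuba multiply for nonnegative ints (illustration only)."""
    if x.bit_length() <= cutoff or y.bit_length() <= cutoff:
        return x * y
    m = max(x.bit_length(), y.bit_length()) >> 1
    mask = (1 << m) - 1
    xh, xl = x >> m, x & mask
    yh, yl = y >> m, y & mask
    a = karatsuba(xh, yh, cutoff)                      # high * high
    b = karatsuba(xl, yl, cutoff)                      # low * low
    c = karatsuba(xh + xl, yh + yl, cutoff) - a - b    # cross terms
    return (a << (2 * m)) + (c << m) + b

x, y = 3 ** 2000, 7 ** 1500
assert karatsuba(x, y) == x * y
```

Three half-size multiplies instead of four is what turns the quadratic schoolbook recurrence into the 1.585 exponent.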
Some quick timing scripts. Results on my machine:
Ah, okay. I extracted bjorn-martinsson's code from GH-90716. A little timing script is attached. Timing from my machine:

I'm working on a change to longobject.c to use it. The details are a bit tricky though (leading/trailing whitespace, etc.).
I just thought that the recursive formulation looked nicer and is more straightforward. But I agree that this is completely subjective.

I still believe I'm correct, but the reason why is a bit subtle.
I've compiled the three different styles for str->int and int->str. The first style is the one I used in the comment in gh-90716. It is very simple, but slightly slower than the other two styles. I think any of these styles would work; just be consistent and use the same style for both directions.

from_str_to_int.py:
```python
def str_to_int_only_pow2_powers(s):
    DIGLIM = 2048
    pow5 = [5]
    for _ in range((len(s) - 1).bit_length() - 1):
        pow5.append(pow5[-1] * pow5[-1])

    def _str_to_int(l, r):
        if r - l <= DIGLIM:
            return int(s[l:r])
        lg_split = (r - l - 1).bit_length() - 1
        split = 1 << lg_split
        return ((_str_to_int(l, r - split) * pow5[lg_split]) << split) + _str_to_int(r - split, r)

    return _str_to_int(0, len(s))


def str_to_int_recursive(s):
    DIGLIM = 2048
    mem = {}

    def w5pow(w):
        """Return 5**w and store the result."""
        if w not in mem:
            if w - 1 in mem:
                mem[w] = mem[w - 1] * 5
            elif w <= DIGLIM:
                mem[w] = 5 ** w
            else:
                w2 = w >> 1
                mem[w] = w5pow(w2) * w5pow(w - w2)
        return mem[w]

    def _str_to_int(l, r):
        if r - l <= DIGLIM:
            return int(s[l:r])
        split = (r - l) >> 1
        return _str_to_int(r - split, r) + ((_str_to_int(l, r - split) * w5pow(split)) << split)

    return _str_to_int(0, len(s))


def str_to_int_iterative(s):
    DIGLIM = 2048
    w5pow = {}
    w = len(s)
    while w >= DIGLIM:
        w2 = w >> 1
        if w & 1:
            w5pow[w2 + 1] = None
        w5pow[w2] = None
        w = w2
    if w5pow:
        it = reversed(w5pow.keys())
        w = next(it)
        w5pow[w] = 5 ** w
        for w in it:
            if w - 1 in w5pow:
                val = w5pow[w - 1] * 5
            else:
                w2 = w >> 1
                val = w5pow[w2] * w5pow[w - w2]
            w5pow[w] = val

    def _str_to_int(l, r):
        if r - l <= DIGLIM:
            return int(s[l:r])
        split = (r - l) >> 1
        return ((_str_to_int(l, r - split) * w5pow[split]) << split) + _str_to_int(r - split, r)

    return _str_to_int(0, len(s))
```

from_int_to_str.py:
```python
import decimal

def int_to_str_only_pow2_powers(n):
    BITLIM = 32768
    if n < 0:
        negate = True
        n = -n
    else:
        negate = False
    D = decimal.Decimal
    with decimal.localcontext() as ctx:
        ctx.prec = decimal.MAX_PREC
        ctx.Emax = decimal.MAX_EMAX
        ctx.Emin = decimal.MIN_EMIN
        ctx.traps[decimal.Inexact] = 1
        pow2 = [D(2)]
        for _ in range((n.bit_length() - 1).bit_length() - 1):
            pow2.append(pow2[-1] * pow2[-1])

        def inner(n, w):
            if w <= BITLIM:
                return D(n)
            lg_w2 = (w - 1).bit_length() - 1
            w2 = 1 << lg_w2
            hi = n >> w2
            lo = n - (hi << w2)
            return inner(hi, w - w2) * pow2[lg_w2] + inner(lo, w2)

        result = inner(n, n.bit_length())
        if negate:
            result = -result
        return str(result)


def int_to_str_recursive(n):
    BITLIM = 32768
    if n < 0:
        negate = True
        n = -n
    else:
        negate = False
    D = decimal.Decimal
    with decimal.localcontext() as ctx:
        ctx.prec = decimal.MAX_PREC
        ctx.Emax = decimal.MAX_EMAX
        ctx.Emin = decimal.MIN_EMIN
        ctx.traps[decimal.Inexact] = 1
        mem = {}

        def w2pow(w):
            """Return D(2)**w and store the result."""
            if w not in mem:
                if w - 1 in mem:
                    mem[w] = mem[w - 1] + mem[w - 1]
                elif w <= BITLIM:
                    mem[w] = D(2) ** w
                else:
                    w2 = w >> 1
                    mem[w] = w2pow(w2) * w2pow(w - w2)
            return mem[w]

        def inner(n, w):
            if w <= BITLIM:
                return D(n)
            w2 = w >> 1
            hi = n >> w2
            lo = n - (hi << w2)
            return inner(lo, w2) + inner(hi, w - w2) * w2pow(w2)

        result = inner(n, n.bit_length())
        if negate:
            result = -result
        return str(result)


def int_to_str_iterative(n):
    BITLIM = 32768
    if n < 0:
        negate = True
        n = -n
    else:
        negate = False
    D = decimal.Decimal
    with decimal.localcontext() as ctx:
        ctx.prec = decimal.MAX_PREC
        ctx.Emax = decimal.MAX_EMAX
        ctx.Emin = decimal.MIN_EMIN
        ctx.traps[decimal.Inexact] = 1
        w2pow = {}
        w = n.bit_length()
        while w >= BITLIM:
            w2 = w >> 1
            if w & 1:
                w2pow[w2 + 1] = None
            w2pow[w2] = None
            w = w2
        if w2pow:
            it = reversed(w2pow.keys())
            w = next(it)
            w2pow[w] = D(2)**w
            for w in it:
                if w - 1 in w2pow:
                    val = w2pow[w - 1] + w2pow[w - 1]
                else:
                    w2 = w >> 1
                    assert w2 in w2pow
                    assert w - w2 in w2pow
                    val = w2pow[w2] * w2pow[w - w2]
                w2pow[w] = val

        def inner(n, w):
            if w <= BITLIM:
                return D(n)
            w2 = w >> 1
            hi = n >> w2
            lo = n - (hi << w2)
            return inner(hi, w - w2) * w2pow[w2] + inner(lo, w2)

        result = inner(n, n.bit_length())
        if negate:
            result = -result
        return str(result)
```

Finally, about implementing fast big int multiplication: I have not yet tried coding the Schönhage–Strassen algorithm in Python, but I'm fairly familiar with FFT algorithms and I believe I could get a pretty clean looking implementation using about 10-20 lines of Python code.
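For what it's worth, here is one hedged sketch of the FFT idea (my own illustration, not code from the thread): an iterative number-theoretic transform over the common NTT-friendly prime 998244353, convolving base-10 digits. A production Schönhage–Strassen would differ substantially; this only shows the shape.

```python
P, G = 998244353, 3          # P - 1 = 2**23 * 119; 3 is a primitive root

def _ntt(a, invert):
    """In-place iterative NTT (Cooley-Tukey) over Z/P; len(a) a power of 2."""
    n = len(a)
    j = 0
    for i in range(1, n):                 # bit-reversal permutation
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:
        w = pow(G, (P - 1) // length, P)  # primitive length-th root of unity
        if invert:
            w = pow(w, P - 2, P)
        for i in range(0, n, length):
            wn = 1
            for k in range(i, i + length // 2):
                u, v = a[k], a[k + length // 2] * wn % P
                a[k] = (u + v) % P
                a[k + length // 2] = (u - v) % P
                wn = wn * w % P
        length <<= 1
    if invert:
        n_inv = pow(n, P - 2, P)
        for i in range(n):
            a[i] = a[i] * n_inv % P

def ntt_mul(x, y):
    """Multiply nonnegative ints by convolving their base-10 digits."""
    a = [int(d) for d in str(x)[::-1]]
    b = [int(d) for d in str(y)[::-1]]
    n = 1
    while n < len(a) + len(b):
        n <<= 1
    a += [0] * (n - len(a))
    b += [0] * (n - len(b))
    _ntt(a, False)
    _ntt(b, False)
    c = [u * v % P for u, v in zip(a, b)]
    _ntt(c, True)
    # Horner evaluation at 10 absorbs carries (coefficients can exceed 9).
    result = 0
    for coef in reversed(c):
        result = result * 10 + coef
    return result

x, y = 123456789 ** 3, 987654321 ** 2
assert ntt_mul(x, y) == x * y
```

The quadratic Horner step at the end is only for the sketch; a real implementation would handle carries in linear time.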
I agree these should be as similar as possible. I don't really care which style is picked. I have no interest in being as fast as possible here. "Major improvement" is the goal. "Just" Karatsuba-like speedup is, as said before, huge. Ambition has forever been the fatal enemy of "major improvement" here, and after 30 years of that I'm tired of it 😉. For a given value, int->str is currently way slower than str->int, which is why I focused on the former.
But improving the asymptotics of str->int is also of real value.
I've made some more updates; getting close to merge quality now. It still needs more unit tests and a good code review. My knowledge of the C-API is a bit out of date (e.g. PyUnicode_Writer, etc.). I've renamed Mark's divmod function to int_divmod().
Yeah, some improvement from the status quo is my main focus with this PR. I'm pretty sure people will come along and improve things further. Adding in a ContextVar based mechanism seems like a neat idea but falls under the ambition category. Also, I think it could be worth considering making a semi-public API so packages like gmpy2 could monkey-patch in their own implementations.
As I discussed before, I think it is important to be consistent with the style. Currently the two conversion functions use different styles. So therefore, I think you should switch one of them to match the other.
Okay. To my eye, the recursive version reads easier and seems to perform just as well. I made a gist with the iterative version. Would you be able to create a recursive version of str_to_int to match?
EDIT: oops! This originally forgot to add
He already did, but hiding in a file in a comment that doesn't show up unless you click on the right thing. I'll pull it out here, but renamed, and reworked some to be more of a straight drop-in replacement for what you already have, and with some comments. I agree the recursive form reads well.

```python
import decimal
import sys

_DEBUG = False

def int_to_decimal(n):
    """Asymptotically fast conversion of an 'int' to Decimal."""
    # Function due to Tim Peters. See GH issue #90716 for details.
    # https://github.com/python/cpython/issues/90716
    #
    # The implementation in longobject.c of base conversion algorithms
    # between power-of-2 and non-power-of-2 bases are quadratic time.
    # This function implements a divide-and-conquer algorithm that is
    # faster for large numbers. Builds an equal decimal.Decimal in a
    # "clever" recursive way. If we want a string representation, we
    # apply str to _that_.
    if _DEBUG:
        print('int_to_decimal', n.bit_length(), file=sys.stderr)
    D = decimal.Decimal
    D2 = D(2)
    BITLIM = 32768
    mem = {}

    def w2pow(w):
        """Return D(2)**w and store the result.

        Also possibly save some intermediate results. In context, these
        are likely to be reused across various levels of the conversion
        to Decimal.
        """
        if (result := mem.get(w)) is None:
            if w <= BITLIM:
                result = D2 ** w
            elif w - 1 in mem:
                result = (t := mem[w - 1]) + t
            else:
                w2 = w >> 1
                # If w happens to be odd, w - w2 is one larger than w2
                # now. Recurse on the smaller first (w2), so that it's
                # in the cache and the larger (w - w2) can be handled
                # by the cheaper `w - 1 in mem` branch instead.
                result = w2pow(w2) * w2pow(w - w2)
            mem[w] = result
        return result

    def inner(n, w):
        if w <= BITLIM:
            return D(n)
        w2 = w >> 1
        hi = n >> w2
        lo = n - (hi << w2)
        return inner(lo, w2) + inner(hi, w - w2) * w2pow(w2)

    with decimal.localcontext() as ctx:
        ctx.prec = decimal.MAX_PREC
        ctx.Emax = decimal.MAX_EMAX
        ctx.Emin = decimal.MIN_EMIN
        ctx.traps[decimal.Inexact] = 1
        if n < 0:
            negate = True
            n = -n
        else:
            negate = False
        result = inner(n, n.bit_length())
        if negate:
            result = -result
    return result
```
Here is a str->int function in the same style:

```python
def _str_to_long_inner(s):
    """Asymptotically fast conversion of a 'str' to an 'int'."""
    # Function due to Bjorn Martinsson. See GH issue #90716 for details.
    # https://github.com/python/cpython/issues/90716
    #
    # The implementation in longobject.c of base conversion algorithms
    # between power-of-2 and non-power-of-2 bases are quadratic time.
    # This function implements a divide-and-conquer algorithm making use
    # of Python's built-in big int multiplication. Since Python uses the
    # Karatsuba algorithm for multiplication, the time complexity
    # of this function is O(len(s)**1.58).
    DIGLIM = 2048
    mem = {}

    def w5pow(w):
        """Return 5**w and store the result.

        Also possibly save some intermediate results. In context, these
        are likely to be reused across various levels of the conversion
        to 'int'.
        """
        if (result := mem.get(w)) is None:
            if w <= DIGLIM:
                result = 5 ** w
            elif w - 1 in mem:
                result = mem[w - 1] * 5
            else:
                w2 = w >> 1
                # If w happens to be odd, w - w2 is one larger than w2
                # now. Recurse on the smaller first (w2), so that it's
                # in the cache and the larger (w - w2) can be handled
                # by the cheaper `w - 1 in mem` branch instead.
                result = w5pow(w2) * w5pow(w - w2)
            mem[w] = result
        return result

    def inner(a, b):
        if b - a <= DIGLIM:
            return int(s[a:b])
        mid = (a + b + 1) >> 1
        return inner(mid, b) + ((inner(a, mid) * w5pow(b - mid)) << (b - mid))

    return inner(0, len(s))
```
Is the goal here to make memory, not time, the limiting factor, with hopes that there would be no need for a limit on the number of digits?
That's not my goal, and I doubt it's Neil's either. The primary goal is to make decimal_str <-> int conversions asymptotically faster on large values, for the sake of speed. Their current quadratic-time behavior has been an ongoing background complaint for decades.
I feel this is ready for review now. Removing Mark as a requested reviewer since he mentioned he might not have time to do it, at least anytime soon. I don't think there is any huge rush to get this in. I was hoping it might help reduce the pain of CVE-2020-10735 mitigations. However, the threshold for kicking into the _pylong code is much higher than the limit imposed by those mitigations.
The int->str and str->int code are very similar in concept, and should be next to each other in the code (not with unrelated division between them). While the concepts are very similar, the code is currently quite different in essentially pointless ways. Within the last day, @bjorn-martinsson posted a nearly-identical-as-possible recoding of str->int above. IMO, that should be adopted.

I don't know about the 4300 default. Regardless, whether 4300 is bits or decimal digits, it appears to me to be absurdly low. By eyeball, conversions of values of that size appear instantaneous today. Indeed, same thing by eyeball at 10x that size (43_000 bits or digits). At yet another 10x larger, then I see a perceptible delay.
They seem quite similar to dynamically scoped variables as in Lisp, i.e. crawl up the stack to find the most recent binding. The mechanics are different but the behavior seems similar to me. We can forget about that stuff for this PR, I think.

It is 4300 decimal digits, and it does seem low as a default to me too.
This is why the comment at the top of Neil's new file discourages micro-optimizations 😉 - that is, they're a bottomless pit. The numbers I got running your code (on Windows 10, Python 3.10.6, on a quad-core Intel box) are very different from what you got. But I don't care. Next time the mix of target programs for PGO changes, that could reverse. Or next time Visual Studio has an update. Or ...

Windows output:
Code:

```python
def _int2digits_via_bytes(a, n):
    # Split nonnegative int a into little-endian base-2**n digits
    # by way of an intermediate bytes object.
    N = a.bit_length()
    b = a.to_bytes(-(-N // 8), 'little')
    mask = (1 << n) - 1
    return [
        (int.from_bytes(b[i//8:(i+n+7)//8], 'little') >> (i % 8)) & mask
        for i in range(0, N, n)
    ]
```
Nice, compact, linear time, and surprisingly easy to follow. I expect people will vary on this, but I find them easier to understand than the recursive versions.
That better indicates that all of them are "uninitialized" and will be overwritten. Otherwise one might think some of those zeros might remain in the result. Also, if the filling had a bug and didn't overwrite all of them, a None would cause a noisy failure instead of a silently wrong result.
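The failure-mode argument can be made concrete: with None placeholders a missed slot raises immediately, while zeros would flow through silently (toy sketch, not from the PR):

```python
# Simulate a buggy fill that forgets the last slot.
digits = [None] * 4
for i in range(3):        # bug: should be range(4)
    digits[i] = 7

try:
    total = sum(d << (8 * i) for i, d in enumerate(digits))
    failed_loudly = False
except TypeError:         # None << ... raises, exposing the bug
    failed_loudly = True
assert failed_loudly
```

Had the list been `[0] * 4`, the same bug would have produced a plausible-looking wrong number.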
Here is another implementation of digits2int via bytes. The idea is to merge the digits, i.e. switching from base 2**n to base 2**(2*n), until the exponent is divisible by 8:

```python
def _digits2int_via_bytes3(digits, n):
    # Change basis from 2**n to 2**(2*n) until the exponent is divisible
    # by 8, then join the digits as little-endian bytes.
    while n % 8:
        m = len(digits)
        digits = [
            (digits[i + 1] << n) | digits[i] if i + 1 < m else digits[i]
            for i in range(0, m, 2)
        ]
        n *= 2
    a_bytes = b''.join(digit.to_bytes(n // 8, 'little') for digit in digits)
    return int.from_bytes(a_bytes, 'little')
```

Speaking about whether or not to use bytes: I'm personally more of a fan of doing the recursive approach without using from/to bytes. One nice thing about the recursive algorithms is that they could be implemented in any language that supports big ints, so they are very natural algorithms. Also it is nice to not have to bother with 8 bits in a byte.
For me it doesn't really matter which one to use. I'm used to initializing with 0 for a couple of not-so-important reasons:
Ooh, nice one. But I agree that the recursive ones purely using ints are nicer. Still, I'll test more stuff later, also memory consumption, where in the other direction the recursive ones seem to take far more than my bytes versions.
I think these conversion micro-optimizations could be explored in a separate issue.
That might be good. I've already been feeling bad about cluttering this.
Works for me. There's an increasingly involved str->int exploiting decimal as well.
I'm sorry that I haven't had the bandwidth to keep up with this discussion (though I dearly would have liked to). It seems as though the PR contents have stabilised. If everyone's satisfied that the code is correct, I'd be strongly in favour of merging as-is. @nascheme Is there any reason that you're aware of not to merge at this point?
Add Python implementations of certain longobject.c functions. These use asymptotically faster algorithms that can be used for operations on integers with many digits. In those cases, the performance overhead of the Python implementation is not significant since the asymptotic behavior is what dominates runtime. Functions provided by this module should be considered private and not part of any public API. Co-author: Tim Peters <tim.peters@gmail.com> Co-author: Mark Dickinson <dickinsm@gmail.com> Co-author: Bjorn Martinsson
Code from Bjorn Martinsson.
Another optimization from Bjorn. Adjust threshold limits in longobject.c since the _pylong version is faster.
The buildbot failure is unrelated. See issue #98703.
Awesome! Thank you to everyone involved. 🎉 🎆
I think the CLA bot didn't notice the contribution by @bjorn-martinsson. If you haven't already, can you please sign the PSF CLA agreement: |
Yes, I can fill out the CLA. Is it possible for me to use the CLA bot or do I need to do it manually?
Online is fine, AFAIK. The link above has some kind of online PDF you can fill out. Thanks to everyone for your contributions to this change! It was fun to combine all the work.
Ok, I've signed it now. One thing that is a bit weird is that the PDF required me to give my "bugs.python.org username", which as I understand is no longer a thing. So I gave my GitHub username instead.
* Properly decref on _pylong import error. * Improve the error message on _pylong TypeError. * Fix the assertion error in pydebug builds to be a TypeError. * Tie the return value comments together. These are minor followups to issues not caught among the reviewers on #96673.
Add Python implementations of certain longobject.c functions. These use asymptotically faster algorithms that can be used for operations on integers with many digits. In those cases, the performance overhead of the Python implementation is not significant since the asymptotic behavior is what dominates runtime. Functions provided by this module should be considered private and not part of any public API.

Currently implemented are int_to_decimal_string() (based on code from Tim), int_divmod() (based on code from Mark), and str_to_int() (based on code from bjorn-martinsson). The longobject.c module has been changed to call into _pylong if the Python version is likely to be faster.

Examples of performance improvements:
- int->str
- str->int
- divmod

Co-author: Tim Peters tim.peters@gmail.com
Co-author: Mark Dickinson dickinsm@gmail.com
Co-author: Bjorn Martinsson