gh-90716: add _pylong.py module #96673

Merged (6 commits, Oct 26, 2022)
Conversation

@nascheme (Member) commented Sep 8, 2022

Add Python implementations of certain longobject.c functions. These use asymptotically faster algorithms for operations on integers with many digits. In those cases, the performance overhead of the Python implementation is not significant, since the asymptotic behavior is what dominates the runtime. Functions provided by this module should be considered private and not part of any public API.

Currently implemented are int_to_decimal_string() (based on code from Tim), int_divmod() (based on code from Mark), and str_to_int() (based on code from bjorn-martinsson). longobject.c has been changed to call into _pylong when the Python version is likely to be faster.

Examples of performance improvements:

int->str

python -m timeit -s "i=(1<<2_000_000) - 1" "str(i)" 
Before: 1 loop, best of 5: 3.58 sec per loop
After: 2 loops, best of 5: 168 msec per loop

str->int

python -m timeit -s "s='1'*1_000_000" "int(s)"
Before: 1 loop, best of 5: 3.65 sec per loop
After: 1 loop, best of 5: 470 msec per loop

divmod

python -m timeit -s "a=(1<<20_000_000)-1; b=(1<<100_000)" "divmod(a, b)"
Before: 1 loop, best of 5: 2.23 sec per loop
After: 2 loops, best of 5: 141 msec per loop

Co-author: Tim Peters tim.peters@gmail.com
Co-author: Mark Dickinson dickinsm@gmail.com
Co-author: Bjorn Martinsson

@nascheme added the type-feature (A feature request or enhancement) and DO-NOT-MERGE labels on Sep 8, 2022
@nascheme (Member, Author) commented Sep 8, 2022

Some relevant discussion on the pros and cons of adding these "smarter" but more complicated algorithms:

https://discuss.python.org/t/faster-large-integer-multiplication/13300

I think implementing them in Python lowers the bar quite a bit in terms of what is maintainable in the long term. We have a small number of core devs who are numeric experts, so we need to be cautious about adding a bunch of complex code. Maybe this PR already goes too far? It feels like pretty good "bang for buck" to me.

@bjorn-martinsson commented:

About long_to_decimal, here are some things I've noticed and some thoughts.

  1. Increasing BITLIM to something like 32768 gives a noticeable performance boost for me.
  2. I think it is more natural to add this algorithm to the decimal module. This code is essentially just a smart way of implementing the constructor of decimal.Decimal when called with an int.
  3. The code is really slow in PyPy. I'm guessing this is because decimal is slow in PyPy.
  4. I think the current implementation is a bit hard to understand. Maybe something like this would make it easier to follow:
import decimal

D = decimal.Decimal
mem = {1: D(2)}
def w2pow(w):
    """ return D(2)**w and store the result """
    if w not in mem:
        if w - 1 in mem:
            mem[w] = mem[w - 1] + mem[w - 1]
        else:
            w2 = w >> 1
            mem[w] = w2pow(w2) * w2pow(w - w2)
    return mem[w]

BITLIM = 32768
def inner(n, w):
    if w <= BITLIM:
        return D(n)
    w2 = w >> 1
    hi = n >> w2
    lo = n - (hi << w2)
    return inner(hi, w - w2) * w2pow(w2) + inner(lo, w2)

with decimal.localcontext() as ctx:
    ctx.prec = decimal.MAX_PREC
    ctx.Emax = decimal.MAX_EMAX
    ctx.Emin = decimal.MIN_EMIN
    ctx.traps[decimal.Inexact] = 1
    
    result = inner(n, n.bit_length())  # n: the int being converted

Btw there is an if 0: in divmod_fast which I'm assuming is not supposed to be there.

@tim-one (Member) commented Sep 9, 2022

  • No thought whatsoever, or timing, went into the current BITLIM value. I see that, on my disk, I last left it as 3 😉. So as to provoke the code into doing something on "almost all" inputs, for testing.

  • The code doesn't belong in decimal. Leaving aside that decimal is an external project (it's not "our" code), if this experiment is accepted it's intended to be a repository for all coded-in-Python bigint accelerators. If the Decimal() constructor wants to exploit it (which it should!), fine, CPython's decimal wrapper should import this module and call its long_to_decimal() function as needed.

  • The original code aimed to build an efficient power tree as efficiently as reasonably possible. Toward that end, it looked at each plausible entry only once, didn't use recursion, and aimed to minimize the number of dict entries. For example, for bit width 1,000,001 and BITLIM=128, it creates a dict with 16 keys, the smallest of which is 122. Your alternative creates a dict with 27 keys. At your BITLIM=32768, the same for your code's dict, but the original only constructs 6 dict entries. Is that worth a lot? Nope. But since I don't find it hard to understand, I'm not inclined to change it either. Neil can do whatever he likes - fine by me.

  • In any case, a persistent dict is a Bad Idea here. It will only grow over time, to no good end. Regardless of which dict-building code is used, it needs only O(log log n) entries, so there's scant overlap expected between different top-level arguments. In your code much more so, since it creates a chain of entries all the way down to key 1. The original code never creates more than 2 entries with key < BITLIM.

  • if 0: is a common way to "comment out" a block of executable code intended for, e.g., debugging. No code is generated for the block; the compiler throws away the if 0 test - and the block - entirely. Nothing wrong with leaving such code in. (A tiny illustration follows after this list.)

  • I was surprised to see that x + x really is generally faster than x*2 in decimal - good catch 😃
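
(An aside, not from the thread: a minimal illustration of the if 0: point. CPython's compiler constant-folds the test and emits no bytecode for the dead block.)

import dis

def f():
    if 0:
        print("debug scaffolding")  # dead: the compiler drops this block
    return 1

dis.dis(f)  # the disassembly shows no trace of the if 0: body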

@bjorn-martinsson commented Sep 9, 2022

I agree that persistent dictionaries are bad, but I never intended mem to be one.

About the number of keys used. I fixed my implementation to use fewer keys. Now I believe it uses the same number of keys, or sometimes 1 less key than your implementation. For example, try it out with bit width 1,000,001. For BITLIM=128 I use 15 keys, and your implementation uses 16.

import decimal

D = decimal.Decimal
D2 = D(2)

mem = {}
def w2pow(w):
    """ return D(2)**w and store the result """
    if w not in mem:
        if w - 1 in mem:
            mem[w] = mem[w - 1] + mem[w - 1]
        elif w <= BITLIM:
            mem[w] = D2 ** w
        else:
            w2 = w >> 1
            mem[w] = w2pow(w2) * w2pow(w - w2)
    return mem[w]

BITLIM = 32768 # 128
def inner(n, w):
    if w <= BITLIM:
        return D(n)
    w2 = w >> 1
    hi = n >> w2
    lo = n - (hi << w2)
    return inner(lo, w2) + inner(hi, w - w2) * w2pow(w2)

with decimal.localcontext() as ctx:
    ctx.prec = decimal.MAX_PREC
    ctx.Emax = decimal.MAX_EMAX
    ctx.Emin = decimal.MIN_EMIN
    ctx.traps[decimal.Inexact] = 1
    
    result = inner(n, n.bit_length())  # n: the int being converted

With this implementation, w2pow() is called in a smart order (it is called with small values of w first, then bigger values of w), giving it a recursion depth of at most 2.

@tim-one (Member) commented Sep 9, 2022

I'm not going to argue about this. As I said, I didn't find the code hard to understand to begin with, so this is just random thrashing to me. I expect you find your code easier to understand because you wrote it.

With this implementation, w2pow() is called in a smart order (it is called with small values of w first, then bigger values of w), giving it a recursion depth of at most 2.

That seems wrong. Trace through, e.g., w2pow(2**20). That invokes

mem[w] = w2pow(w2) * w2pow(w - w2)

which makes a recursive call with 2**19, which in turn makes a recursive call on the same line with 2**18, and so on, until the function is finally called with 2**15 (your BITLIM value). Only then does the recursion start to unwind.

@nascheme (Member, Author) commented Sep 9, 2022

Hi bjorn-martinsson, I appreciate the interest in improving the functions in _pylong.py but polishing the functions is not the primary focus of this PR. Perhaps we should create another issue to focus on that? If they produce correct results and perform reasonably well, I think that's good enough for now. So, I feel like long_to_decimal() is in pretty okay shape already.

I'm less confident about divmod_fast() since Mark has not looked at this PR yet, AFAIK. I extracted the code from an old issue and I believe it was some prototype code, not intended for serious use.

Writing more extensive unit tests and scripts to produce timing/benchmark reports is perhaps the next step. long_to_decimal doesn't handle the bytes writer case, need to fix that. Need to come up with better threshold tests of when to switch to the _pylong functions. I'm using Py_ABS(Py_SIZE(n)) > 1000 but that's almost certainly not the right thing.

Based on comments from njs, I wonder if _pylong should use ContextVar and then we could move the max_str_digits checking into there. Getting away from a global interpreter setting would be quite a nice improvement. Doing it inside _pylong should be a lot easier.

@nascheme requested a review from mdickinson on September 9, 2022 02:46
@tim-one (Member) commented Sep 9, 2022

Neil, I think the one basic thing this is missing is an asymptotically faster str->int function. Conversions in both directions were the real topic of the original issue report, although it focused on int->str (as "the harder" problem). Both are quadratic-time today.

@bjorn-martinsson posted suitable code for str->int (in a comment) in gh-90716. It's much like the int->str here, but sticks to native Python int arithmetic so only gets "Karatsuba-like" speedup. Which is huge, but not as huge as we can get from decimal's fancier *. But I don't see a sane way to apply decimal in the decimal string -> binary int direction. The result has to be a CPython bigint, and conversion from Decimal to int is also quadratic-time(*).

(*) But could be faster after this PR lands (convert the Decimal to a decimal string (linear time in # of digits), then apply the hypothetical new faster-than-quadratic str->int function).

@nascheme (Member, Author) commented Sep 9, 2022

Some quick timing scripts.
time_divmod_fast.py.txt
time_long_to_dec.py.txt

Results on my machine:

time_divmod_fast.py
      bits     old time     new time   faster  correct
------------------------------------------------------------------------------
        54        0.000        0.000     0.18     True
       107        0.000        0.000     0.17     True
       213        0.000        0.000     0.17     True
       426        0.000        0.000     0.26     True
       851        0.000        0.000     0.62     True
      1701        0.000        0.000     0.24     True
      3402        0.000        0.000     0.39     True
      6804        0.000        0.000     0.56     True
     13607        0.000        0.000     1.06     True
     27214        0.001        0.000     1.90     True
     54427        0.002        0.001     2.31     True
    108853        0.008        0.002     3.97     True
    217706        0.026        0.005     5.26     True
    435412        0.105        0.014     7.40     True

time_long_to_dec.py
      bits      old      new   ratio equal
      1025    0.000    0.000    0.1x True
      2049    0.000    0.000    0.2x True
      4097    0.000    0.000    0.4x True
      8193    0.000    0.000    0.7x True
     16385    0.000    0.000    1.0x True
     32769    0.001    0.001    1.5x True
     65537    0.005    0.003    2.0x True
    131073    0.016    0.005    3.0x True
    262145    0.062    0.012    5.2x True
    524289    0.246    0.026    9.5x True
   1048577    0.980    0.054   18.0x True
   2097153    3.926    0.114   34.5x True

@nascheme (Member, Author) commented Sep 9, 2022

Neil, I think the one basic thing this is missing is an asymptotically faster str->int function.

Ah, okay. I extracted bjorn-martinsson's code from GH-90716. A little timing script is attached. Timing from my machine:

time_str_to_long.py

    digits     old time     new time   faster  correct
------------------------------------------------------------------------------
      1024        0.000        0.000     1.50     True
      2048        0.000        0.000     0.98     True
      4096        0.000        0.000     0.93     True
      8192        0.001        0.001     1.21     True
     16384        0.002        0.001     2.47     True
     32768        0.005        0.002     2.43     True
     65536        0.019        0.005     3.48     True
    131072        0.061        0.016     3.78     True
    262144        0.263        0.049     5.39     True
    524288        0.983        0.147     6.68     True

I'm working on a change to longobject.c to use it. The details are a bit tricky though (leading/trailing whitespace, etc).

time_str_to_long.py.txt

@bjorn-martinsson commented Sep 9, 2022

I'm not going to argue about this. As I said, I didn't find the code hard to understand to begin with, so this is just random thrashing to me. I expect you find your code easier to understand because you wrote it.

I just thought that the recursive formulation looked nicer and is more straightforward. But I agree that this is completely subjective.

With this implementation, w2pow() is called in a smart order (it is called with small values of w first, then bigger values of w), giving it a recursion depth of at most 2.

That seems wrong. Trace through, e.g., w2pow(2**20). That invokes

mem[w] = w2pow(w2) * w2pow(w - w2)

which makes a recursive call with 2**19, which in turn makes a recursive call on the same line with 2**18, and so on, until the function is finally called with 2**15 (your BITLIM value). Only then does the recursion start to unwind.

I still believe I'm correct, but the reason why is a bit subtle. Notice that in my implementation, the first time inner() calls w2pow(w2) it does so with a small w2. So the first call will not lead to a 2^20 -> 2^19 -> 2^18 -> ... kind of recursion. In fact, the deepest it will ever go is depth 2.
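
(An aside, not from the thread: one hypothetical way to test the depth claim is to wrap the w2pow above in a depth counter. Recursive calls resolve the global name, so they go through the wrapper too; run this inside the localcontext() block, starting from a fresh mem.)

def track_depth(f):
    state = {"depth": 0, "max": 0}
    def wrapper(w):
        state["depth"] += 1
        state["max"] = max(state["max"], state["depth"])
        try:
            return f(w)
        finally:
            state["depth"] -= 1
    wrapper.state = state
    return wrapper

mem.clear()                   # start from a fresh cache
w2pow = track_depth(w2pow)
n = (1 << 2_000_000) - 1      # some large int
inner(n, n.bit_length())
print(w2pow.state["max"])     # the claim: prints at most 2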

@bjorn-martinsson commented Sep 9, 2022

I've compiled the three different styles for int -> str conversion and str -> int.

The first style is the one I used in the comment in gh-90716. It is very simple, but slightly slower than the other two styles.
The second style is the recursive style with mem that I posted above.
The third style is what Tim used.

I think any of these styles would work; just be consistent and use the same style for both str -> int and int -> str.

from_str_to_int.py

def str_to_int_only_pow2_powers(s):
    DIGLIM = 2048
    
    pow5 = [5]
    for _ in range((len(s) - 1).bit_length() - 1):
        pow5.append(pow5[-1] * pow5[-1])
    
    def _str_to_int(l, r):
        if r - l <= DIGLIM:
            return int(s[l:r])
        lg_split = (r - l - 1).bit_length() - 1
        split = 1 << lg_split
        return ((_str_to_int(l, r - split) * pow5[lg_split]) << split) + _str_to_int(r - split, r)
    return _str_to_int(0, len(s))


def str_to_int_recursive(s):
    DIGLIM = 2048
    
    mem = {}
    def w5pow(w):
        """ return 5**w and store the result """
        if w not in mem:
            if w - 1 in mem:
                mem[w] = mem[w - 1] * 5
            elif w <= DIGLIM:
                mem[w] = 5 ** w
            else:
                w2 = w >> 1
                mem[w] = w5pow(w2) * w5pow(w - w2)
        return mem[w]
    
    def _str_to_int(l, r):
        if r - l <= DIGLIM:
            return int(s[l:r])
        split = (r - l) >> 1
        return _str_to_int(r - split, r) + ((_str_to_int(l, r - split) * w5pow(split)) << split)
    return _str_to_int(0, len(s))

def str_to_int_iterative(s):
    DIGLIM = 2048

    w5pow = {}
    w = len(s)
    while w >= DIGLIM:
        w2 = w >> 1
        if w & 1:
            w5pow[w2 + 1] = None
        w5pow[w2] = None
        w = w2
    if w5pow:
        it = reversed(w5pow.keys())
        w = next(it)
        w5pow[w] = 5 ** w
        for w in it:
            if w - 1 in w5pow:
                val = w5pow[w - 1] * 5
            else:
                w2 = w >> 1
                val = w5pow[w2] * w5pow[w - w2]
            w5pow[w] = val   

    def _str_to_int(l, r):
        if r - l <= DIGLIM:
            return int(s[l:r])
        split = (r - l) >> 1
        return ((_str_to_int(l, r - split) * w5pow[split]) << split) + _str_to_int(r - split, r)
    return _str_to_int(0, len(s))
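
(An aside, not from the thread: a quick agreement check across the three variants above, as a hypothetical harness. On 3.11+ the int(s) reference needs the CVE-2020-10735 conversion length limit disabled.)

import sys
if hasattr(sys, "set_int_max_str_digits"):
    sys.set_int_max_str_digits(0)  # disable the conversion length limit

s = str(7 ** 40_000)  # ~33,800 digits, well past DIGLIM
assert (str_to_int_only_pow2_powers(s)
        == str_to_int_recursive(s)
        == str_to_int_iterative(s)
        == int(s))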

from_int_to_str.py

import decimal

def int_to_str_only_pow2_powers(n):
    BITLIM = 32768
    
    if n < 0:
        negate = True
        n = -n
    else:
        negate = False
    
    D = decimal.Decimal
    with decimal.localcontext() as ctx:
        ctx.prec = decimal.MAX_PREC
        ctx.Emax = decimal.MAX_EMAX
        ctx.Emin = decimal.MIN_EMIN
        ctx.traps[decimal.Inexact] = 1
        
        pow2 = [D(2)]
        for _ in range((n.bit_length() - 1).bit_length() - 1):
            pow2.append(pow2[-1] * pow2[-1])
        
        def inner(n, w):
            if w <= BITLIM:
                return D(n)
            lg_w2 = (w - 1).bit_length() - 1
            w2 = 1 << lg_w2
            hi = n >> w2
            lo = n - (hi << w2)
            return inner(hi, w - w2) * pow2[lg_w2] + inner(lo, w2)
        result = inner(n, n.bit_length())
    
    if negate:
        result = -result
    return str(result)


def int_to_str_recursive(n):
    BITLIM = 32768
    
    if n < 0:
        negate = True
        n = -n
    else:
        negate = False
    
    D = decimal.Decimal
    with decimal.localcontext() as ctx:
        ctx.prec = decimal.MAX_PREC
        ctx.Emax = decimal.MAX_EMAX
        ctx.Emin = decimal.MIN_EMIN
        ctx.traps[decimal.Inexact] = 1

        mem = {}
        def w2pow(w):
            """ return D(2)**w and store the result """
            if w not in mem:
                if w - 1 in mem:
                    mem[w] = mem[w - 1] + mem[w - 1]
                elif w <= BITLIM:
                    mem[w] = D(2) ** w
                else:
                    w2 = w >> 1
                    mem[w] = w2pow(w2) * w2pow(w - w2)
            return mem[w]

        def inner(n, w):
            if w <= BITLIM:
                return D(n)
            w2 = w >> 1
            hi = n >> w2
            lo = n - (hi << w2)
            return inner(lo, w2) + inner(hi, w - w2) * w2pow(w2)
        result = inner(n, n.bit_length())
    
    if negate:
        result = -result
    return str(result)


def int_to_str_iterative(n):
    BITLIM = 32768
    
    if n < 0:
        negate = True
        n = -n
    else:
        negate = False
    
    D = decimal.Decimal
    with decimal.localcontext() as ctx:
        ctx.prec = decimal.MAX_PREC
        ctx.Emax = decimal.MAX_EMAX
        ctx.Emin = decimal.MIN_EMIN
        ctx.traps[decimal.Inexact] = 1

        w2pow = {}
        w = n.bit_length()
        while w >= BITLIM:
            w2 = w >> 1
            if w & 1:
                w2pow[w2 + 1] = None
            w2pow[w2] = None
            w = w2
        if w2pow:
            it = reversed(w2pow.keys())
            w = next(it)
            w2pow[w] = D(2)**w
            for w in it:
                if w - 1 in w2pow:
                    val = w2pow[w - 1] + w2pow[w - 1]
                else:
                    w2 = w >> 1
                    assert w2 in w2pow
                    assert w - w2 in w2pow
                    val = w2pow[w2] * w2pow[w - w2]
                w2pow[w] = val

        def inner(n, w):
            if w <= BITLIM:
                return D(n)
            w2 = w >> 1
            hi = n >> w2
            lo = n - (hi << w2)
            return inner(hi, w - w2) * w2pow[w2] + inner(lo, w2)
        result = inner(n, n.bit_length())
    
    if negate:
        result = -result
    return str(result)
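
(Likewise for the three int -> str variants, as a hypothetical harness; same caveat about the str(n) reference on 3.11+.)

import sys
if hasattr(sys, "set_int_max_str_digits"):
    sys.set_int_max_str_digits(0)  # disable the conversion length limit

n = 7 ** 40_000  # ~112,000 bits, well past BITLIM
assert (int_to_str_only_pow2_powers(n)
        == int_to_str_recursive(n)
        == int_to_str_iterative(n)
        == str(n))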

About str_to_int needing a faster big int multiplication to be significantly faster than int(): I think it is worth adding str_to_int even if no super fast big int multiplication currently exists in Python. If in the future Python adds faster big int multiplication, then str_to_int will run fast without needing to be modified in any way.

Finally, about implementing fast big int multiplication: I have not yet tried coding the Schönhage–Strassen algorithm in Python, but I'm fairly familiar with FFT algorithms and I believe I could get a pretty clean-looking implementation in about 10-20 lines of Python code.

@tim-one (Member) commented Sep 9, 2022

I agree these should be as similar as possible. I don't really care which style is picked.

I have no interest in being as fast as possible here. "Major improvement" is the goal. "Just" Karatsuba-like speedup is, as said before, huge. Ambition has forever been the fatal enemy of "major improvement" here, and after 30 years of that I'm tired of it 😉.

For a given value, int->str is currently way slower than str->int, which is why I focused on the former.

$ timeit -s "i = 10**100000 - 1; s = str(i)" "str(i)"
2 loops, best of 5: 160 msec per loop

$ timeit -s "i = 10**100000 - 1; s = str(i)" "int(s)"
5 loops, best of 5: 58.2 msec per loop

But improving the asymptotics of str->int is also of real value.

@nascheme (Member, Author) commented Sep 9, 2022

I've made some more updates; getting close to merge quality now. I would still like some more unit tests and a good code review. My knowledge of the C-API is a bit out of date (e.g. PyUnicode_Writer, etc). I've renamed Mark's divmod_fast() to int_divmod(), to be more consistent. I hope he might have time to give feedback on it (e.g. is it fit for purpose).

Ambition has forever been the fatal enemy of "major improvement" here

Yeah, some improvement over the status quo is my main focus with this PR. I'm pretty sure people will come along and improve _pylong.py afterwards. Perhaps we will need to push back to prevent it from growing a lot of complexity. However, even with the simple and short version I have now, it seems we are getting some decent speedups if the integers are large enough.

Adding in a ContextVar based mechanism seems like neat idea but falls under the ambition category. Also, I think it could be worth considering making a semi-public API so packages like gmpy2 could monkey-patch _pylong. Probably they should be able to adjust the threshold levels for using _pylong as well, since those implementations have a different cutover performance point.

@bjorn-martinsson commented:

As I discussed before, I think it is important to be consistent with the style.

Currently int_to_decimal and _str_to_int_inner are coded in different ways even though they fundamentally do the exact same thing. So their implementations could and should look almost the same.

Therefore, I think you should switch _str_to_int_inner to the str_to_int_iterative() function from from_str_to_int.py that I posted above. Doing that will make int_to_decimal and _str_to_int_inner have the same style.

@nascheme (Member, Author) commented Sep 10, 2022

I think it is important to be consistent with the style.

Okay. To my eye, the recursive version reads easier and seems to perform just as well. I made a gist with the iterative version:
https://gist.github.com/nascheme/d04e6d63cfbd3e4614ca0e173b961de4
Is that correct? It is more lines of code but I do see how it is more similar to int_to_decimal.

Would you be able to create a recursive version of int_to_decimal? I'd like to see what it looks like. The pow5() seems quite nice to me. I'd assume you would need a similar pow2() function.

@tim-one (Member) commented Sep 10, 2022

EDIT: oops! This originally forgot to add mem[w] = result to cache the result. No effect on correctness, just speed. The problem with using @functools.cache is that then the cache is "invisible", so we can't test whether w-1 is in it.

Would you be able to create a recursive version of int_to_decimal? I'd like to see what it looks like.

He already did, but it's hiding in a file in a comment that doesn't show up unless you click on the right thing. I'll pull it out here, but renamed, and reworked some to be more of a straight drop-in replacement for what you already have, and with some comments. I agree the recursive form reads well.

def int_to_decimal(n):
    """Asymptotically fast conversion of an 'int' to Decimal."""

    # Function due to Tim Peters.  See GH issue #90716 for details.
    # https://github.com/python/cpython/issues/90716
    #
    # The implementation in longobject.c of base conversion algorithms
    # between power-of-2 and non-power-of-2 bases are quadratic time.
    # This function implements a divide-and-conquer algorithm that is
    # faster for large numbers.  Builds an equal decimal.Decimal in a
    # "clever" recursive way.  If we want a string representation, we
    # apply str to _that_.

    if _DEBUG:
        print('int_to_decimal', n.bit_length(), file=sys.stderr)

    D = decimal.Decimal
    D2 = D(2)

    BITLIM = 32768

    mem = {}
    def w2pow(w):
        """Return D(2)**w and store the result.
Also possibly save some intermediate results. In context, these
are likely to be reused across various levels of the conversion
to Decimal.
"""             
        if (result := mem.get(w)) is None:
            if w <= BITLIM:
                result = D2 ** w
            elif w - 1 in mem:
                result = (t := mem[w - 1]) + t
            else:
                w2 = w >> 1
                # If w happens to be odd, w-w2 is one
                # larger than w2 now. Recurse on the
                # smaller first (w2), so that it's in
                # the cache and the larger (w-w2) can
                # be handled by the cheaper `w-1 in mem`
                # branch instead.
                result = w2pow(w2) * w2pow(w - w2)
            mem[w] = result
        return result

    def inner(n, w):
        if w <= BITLIM:
            return D(n)
        w2 = w >> 1
        hi = n >> w2
        lo = n - (hi << w2)
        return inner(lo, w2) + inner(hi, w - w2) * w2pow(w2)

    with decimal.localcontext() as ctx:
        ctx.prec = decimal.MAX_PREC
        ctx.Emax = decimal.MAX_EMAX
        ctx.Emin = decimal.MIN_EMIN
        ctx.traps[decimal.Inexact] = 1

        if n < 0:
            negate = True
            n = -n
        else:
            negate = False
        result = inner(n, n.bit_length())
        if negate:
            result = -result

    return result
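
(A hypothetical quick check of the drop-in above; it assumes the module-level import decimal, import sys, and _DEBUG = False that _pylong.py provides. On 3.11+ the str(n) reference needs the conversion length limit disabled.)

if hasattr(sys, "set_int_max_str_digits"):
    sys.set_int_max_str_digits(0)

n = 123456789 ** 10_000  # ~269,000 bits
assert str(int_to_decimal(n)) == str(n)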

@bjorn-martinsson commented Sep 10, 2022

Here is a _str_to_long_inner implementation in the same exact style as the one Tim just posted.

def _str_to_long_inner(s):
    """Asymptotically fast conversion of a 'str' to an 'int'."""
    
    # Function due to Bjorn Martinsson.  See GH issue #90716 for details.
    # https://github.com/python/cpython/issues/90716
    #
    # The implementation in longobject.c of base conversion algorithms
    # between power-of-2 and non-power-of-2 bases are quadratic time. 
    # This function implements a divide-and-conquer algorithm making use 
    # of Python's built in big int multiplication. Since Python uses the
    # Karatsuba algorithm for multiplication, the time complexity
    # of this function is O(len(s)**1.58).
    
    DIGLIM = 2048

    mem = {}
    def w5pow(w):
        """Return 5**w and store the result.
Also possibly save some intermediate results. In context, these
are likely to be reused across various levels of the conversion
to 'int'.
"""             
        if (result := mem.get(w)) is None:
            if w <= DIGLIM:
                result = 5 ** w
            elif w - 1 in mem:
                result = mem[w - 1] * 5
            else:
                w2 = w >> 1
                # If w happens to be odd, w-w2 is one
                # larger than w2 now. Recurse on the
                # smaller first (w2), so that it's in
                # the cache and the larger (w-w2) can
                # be handled by the cheaper `w-1 in mem`
                # branch instead.
                result = w5pow(w2) * w5pow(w - w2)
            mem[w] = result
        return result
    
    def inner(a, b):
        if b - a <= DIGLIM:
            return int(s[a:b])
        mid = (a + b + 1) >> 1 
        return inner(mid, b) + ((inner(a, mid) * w5pow(b - mid)) << (b - mid))
    return inner(0, len(s))
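
(And a hypothetical round-trip check for the function above; same 3.11+ caveat for the int(s) reference.)

import sys
if hasattr(sys, "set_int_max_str_digits"):
    sys.set_int_max_str_digits(0)

s = "9" * 100_000
assert _str_to_long_inner(s) == int(s)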

@smichr commented Sep 10, 2022

Is the goal here to make memory, not time, the limiting factor, in the hope that there would be no need for a limit on str(int) conversion to protect against DoS attacks?

@tim-one (Member) commented Sep 10, 2022

Is the goal here to make memory, not time, the limiting factor with hopes that there would be no need for a limit on str(int) conversion to protect against DoS attack?

That's not my goal, and I doubt it's Neil's either. The primary goal is to make decimal_str <-> int conversions asymptotically faster on large values, for the sake of speed. Their current quadratic-time behavior has been an ongoing background complaint for decades.

@nascheme removed the request for review from mdickinson on September 10, 2022 22:12
@nascheme (Member, Author) commented:
I feel this is ready for review now. Removing Mark as a requested reviewer since he mentioned he might not have time to do it, at least anytime soon.

I don't think there is any huge rush to get this in. I was hoping it might help reduce the pain of CVE-2020-10735 mitigations. However, the threshold for kicking into the _pylong functions is quite a bit higher than the default 4300 limit. Maybe with some more work we can get there. Also, implementing a ContextVar system in _pylong and then putting the int_max_str_digits check in there could be a good idea. That can be done as a separate PR though.

@tim-one (Member) commented Sep 10, 2022

The int->str and str->int code are very similar in concept, and should be next to each other in the code (not with unrelated division between them).

While the concepts are very similar, the code is currently quite different in essentially pointless ways. Within the last day, @bjorn-martinsson posted a nearly-identical-as-possible recoding of str->int, in the comment starting with:

Here is a _str_to_long_inner implementation in the same exact style as the one Tim just posted.

IMO, that should be adopted.

I don't know about ContextVars, and a quick skim of the docs left me more confused about why they exist than before I looked ☹️.

Regardless, whether 4300 is bits or decimal digits, it appears to me to be absurdly low. By eyeball, conversions of values of that size appear instantaneous today. Indeed, same thing by eyeball at 10x that size (43_000 bits or digits). At yet another 10x larger, then I see a perceptible delay.

@nascheme (Member, Author) commented:

I don't know about ContextVars, and a quick skim of the docs left me more confused about why they exist than before I looked ☹️.

They seem quite similar to dynamic scoped variables as in Lisp. I.e. crawl up stack to find most recent binding. The mechanics are different but the behavior seems similar to me. We can forget about that stuff for this PR, I think.
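
For reference (not from the thread), a minimal contextvars sketch with hypothetical names, showing the dynamic-scope flavor:

import contextvars

max_str_digits = contextvars.ContextVar("max_str_digits", default=10)

def checked_int(s):
    if len(s) > max_str_digits.get():  # reads the innermost binding
        raise ValueError("too many digits")
    return int(s)

token = max_str_digits.set(100)       # rebind in the current context
try:
    print(checked_int("1" * 50))      # allowed under the new binding
finally:
    max_str_digits.reset(token)       # restore the previous binding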

Regardless, whether 4300 is bits or decimal digits, it appears to me to be absurdly low. By eyeball, conversions of values of that size appear instantaneous today. Indeed, same thing by eyeball at 10x that size (43_000 bits or digits). At yet another 10x larger, then I see a perceptible delay.

It is 4300 decimal digits and it does seem low as default to me too.

./python -m timeit -s "s='1'*4000" "int(s)"
5000 loops, best of 5: 78.2 usec per loop
./python -m timeit -s "s='1'*4000; i = int(s)" "str(i)"
2000 loops, best of 5: 162 usec per loop

@tim-one (Member) commented Sep 27, 2022

Benchmark for "concatenating" two numbers with + vs |, or clearing half the bits of a number with - vs ^:

This is why the comment at the top of Neil's new file discourages micro-optimizations 😉 - that is, they're a bottomless pit. The numbers I got running your code (on Windows 10 Python 3.10.6 on a quad-core Intel box) are very different from what you got, and in particular ^ was almost always slower than - for clearing high bits.

But I don't care. Next time the mix of target programs for PGO changes, that could reverse. Or next time Visual Studio has an update. Or ...

Windows output
         32   0.92   0.91   0.89
         64   1.01   0.93   0.57
        128   1.08   0.93   0.99
        256   1.07   0.92   1.04
        512   1.20   0.96   1.25
      1,024   1.41   0.79   1.57
      2,048   1.27   0.68   1.46
      4,096   1.55   0.97   1.87
      8,192   1.87   0.99   2.38
     16,384   2.04   1.01   2.79
     32,768   2.20   1.00   2.95
     65,536   1.91   0.96   2.71
    131,072   2.14   0.96   2.78
    262,144   2.10   0.98   2.83
    524,288   2.01   0.56   2.30
  1,048,576   1.53   0.79   2.00
  2,097,152   1.56   0.87   2.00
  4,194,304   1.12   0.48   1.32
  8,388,608   1.22   0.67   1.54
 16,777,216   1.26   0.67   1.43
 33,554,432   1.19   0.64   1.26
 67,108,864   1.18   0.62   1.23
134,217,728   1.00   0.62   1.48
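
(The benchmark script itself isn't included in this excerpt. A minimal sketch of that kind of comparison, with hypothetical sizes and repeat counts — the ratios in the table above come from the original script, not from this sketch:)

from timeit import timeit

for bits in (1 << k for k in range(5, 21)):
    hi = ((1 << bits) - 1) << bits
    lo = (1 << bits) - 1
    both = hi | lo
    t_add = timeit(lambda: hi + lo, number=20)    # "concatenate" with +
    t_or = timeit(lambda: hi | lo, number=20)     # "concatenate" with |
    t_sub = timeit(lambda: both - hi, number=20)  # clear high bits with -
    t_xor = timeit(lambda: both ^ hi, number=20)  # clear high bits with ^
    print(f"{bits:>11,} {t_add / t_or:6.2f} {t_sub / t_xor:6.2f}")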

@pochmann (Contributor) commented Sep 27, 2022

Did _int2digits via bytes as well now. I actually like it (and also my _digits2int_via_bytes2).

Code
def _int2digits_via_bytes(a, n):
    N = a.bit_length()
    b = a.to_bytes(-(-N // 8), 'little')
    mask = (1 << n) - 1
    return [
        (int.from_bytes(b[i//8:(i+n+7)//8], 'little') >> (i%8)) & mask
        for i in range(0, N, n)
    ]
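
(A hypothetical sanity check: reassemble the little-endian base-2**n digits produced above.)

a = 123456789 ** 50
n = 30
digits = _int2digits_via_bytes(a, n)
assert sum(d << (n * i) for i, d in enumerate(digits)) == a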

@tim-one (Member) commented Sep 27, 2022

Did _int2digits via bytes as well now. I actually like it (and also my _digits2int_via_bytes2).

Nice, compact, linear time, and surprisingly easy to follow. I expect people will vary on this, but I find them easier to understand than the recursive versions.

@pochmann (Contributor) commented Sep 27, 2022

In _int2digits, I suggest to initialize with None instead of 0:

a_digits = [None] * ((a.bit_length() + n - 1) // n)

That better indicates that all of them are "uninitialized" and will be overwritten. Otherwise one might think some of those zeros might remain in the result. Also, if the filling had a bug and didn't overwrite all of them, a None would likely cause an error later on and the bug would get noticed.

@bjorn-martinsson commented Sep 27, 2022

Here is another implementation of how to do digits2int via bytes. The idea is to merge the digits, i.e. switching from base 2**n to 2**(2*n), until n is divisible by 8 (taking at most 3 iterations). A nice thing about this approach is that there is almost no fiddling with % 8 at all in the code.

def _digits2int_via_bytes3(digits, n):
    # Change basis from 2**n to 2**(2*n), until exponent is divisible by 8
    while n % 8:
        m = len(digits)
        digits = [
            (digits[i + 1] << n) | digits[i] if i + 1 < m else digits[i] 
            for i in range(0, m, 2)
        ]
        n *= 2
    a_bytes = b''.join(digit.to_bytes(n // 8, 'little') for digit in digits)
    return int.from_bytes(a_bytes, 'little')
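
(Hypothetical round trip, pairing this with the _int2digits_via_bytes splitter from the earlier comment:)

a = 987654321 ** 40
assert _digits2int_via_bytes3(_int2digits_via_bytes(a, 30), 30) == a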

Speaking about whether or not to use bytes: I'm personally more of a fan of the recursive approach without using from_bytes and to_bytes. One nice thing about the recursive algorithms is that they could be implemented in any language that supports big ints, so they are very natural algorithms. It is also nice not to have to bother with 8 bits in a byte.

In _int2digits, I suggest to initialize with None instead of 0:

For me it doesn't really matter which one to use. I'm used to initializing with 0 for a couple of not-so-important reasons:

  1. In languages like C++, this is how you would initialize it.
  2. PyPy has a special "strategy" for optimizing lists which only contain (non-big) integers. So for any Python code intended to be run in PyPy I prefer using [0] * n over [None] * n.
  3. If someone reads the code, then it is clear that the list will contain integers and nothing else.

@pochmann (Contributor) commented:

Ooh, nice one. But I agree that the recursive ones purely using ints are nicer. Still, I'll test more stuff later, also memory consumption, where in the other direction the recursive ones seem to take far more than my bytes versions. And I realized that the a = 123456**12345670 I've used for benchmarking has 209 million bits but the last 74 million are all zeros, which I don't think is good. I have two versions with bytearray, but that's frustratingly slow (int.from_bytes accepts bytearray but apparently doesn't really like it).

@nascheme (Member, Author) commented:

I think the via_bytes versions read pretty well, but I slightly prefer the current (recursive) ones as well. It seems a shame to lose all this nice experimentation, though. Should I create a git repo where we can dump all of these experiments? Seems better than having them embedded inside GitHub comments.

@pochmann (Contributor) commented:

That might be good. I've already been feeling bad about cluttering this.

@tim-one (Member) commented Sep 28, 2022

Should I create a git repo where we can dump all of these experiments?

Works for me. There's an increasingly involved str->int exploiting decimal's better asymptotics buried in comments on the issue report too. Alas, while it's faster the longer the string, it's still slower than what the PR already has before reaching about 3 million digits.

@mdickinson (Member) commented:

I'm sorry that I haven't had the bandwidth to keep up with this discussion (though I dearly would have liked to).

It seems as though the PR contents have stabilised. If everyone's satisfied that the code is correct, I'd be strongly in favour of merging as-is. @nascheme Is there any reason that you're aware of not to merge at this point?

Add Python implementations of certain longobject.c functions. These use
asymptotically faster algorithms that can be used for operations on
integers with many digits. In those cases, the performance overhead of
the Python implementation is not significant since the asymptotic
behavior is what dominates runtime. Functions provided by this module
should be considered private and not part of any public API.

Co-author: Tim Peters <tim.peters@gmail.com>
Co-author: Mark Dickinson <dickinsm@gmail.com>
Co-author: Bjorn Martinsson
Another optimization from Bjorn.  Adjust threshold limits in
longobject.c since the _pylong version is faster.
@bedevere-bot commented:

⚠️⚠️⚠️ Buildbot failure ⚠️⚠️⚠️

Hi! The buildbot AMD64 Windows10 3.x has failed when building commit de69816.

What do you need to do:

  1. Don't panic.
  2. Check the buildbot page in the devguide if you don't know what the buildbots are or how they work.
  3. Go to the page of the buildbot that failed (https://buildbot.python.org/all/#builders/146/builds/3788) and take a look at the build logs.
  4. Check if the failure is related to this commit (de69816) or if it is a false positive.
  5. If the failure is related to this commit, please, reflect that on the issue and make a new Pull Request with a fix.

You can take a look at the buildbot page here:

https://buildbot.python.org/all/#builders/146/builds/3788

Failed tests:

  • test_asyncio

Failed subtests:

  • test_drain_raises - test.test_asyncio.test_streams.StreamTests.test_drain_raises

Summary of the results of the build (if available):

== Tests result: FAILURE then FAILURE ==

404 tests OK.

10 slowest tests:

  • test_multiprocessing_spawn: 4 min 11 sec
  • test_wmi: 2 min 46 sec
  • test_tokenize: 2 min 45 sec
  • test_capi: 2 min 38 sec
  • test_asyncio: 2 min 29 sec
  • test_mmap: 2 min 10 sec
  • test_lib2to3: 2 min 6 sec
  • test_peg_generator: 1 min 57 sec
  • test_io: 1 min 50 sec
  • test_unparse: 1 min 49 sec

1 test failed:
test_asyncio

32 tests skipped:
test_curses test_dbm_gnu test_dbm_ndbm test_devpoll test_epoll
test_fcntl test_fork1 test_gdb test_grp test_ioctl test_kqueue
test_multiprocessing_fork test_multiprocessing_forkserver test_nis
test_openpty test_ossaudiodev test_perf_profiler test_pipes
test_poll test_posix test_pty test_pwd test_readline test_resource
test_spwd test_syslog test_threadsignals test_wait3 test_wait4
test_xxlimited test_xxtestfuzz test_zipfile64

1 re-run test:
test_asyncio

Total duration: 16 min 55 sec

Traceback logs:
Traceback (most recent call last):
  File "D:\buildarea\3.x.bolen-windows10\build\Lib\test\test_asyncio\test_streams.py", line 933, in test_drain_raises
    self.assertEqual([], messages)
AssertionError: Lists differ: [] != [{'message': 'Exception in callback _Proac[248 chars]))>}]

@Fidget-Spinner (Member) commented:

The buildbot failure is unrelated. See issue #98703.

@mdickinson (Member) commented:

Awesome! Thank you to everyone involved. 🎉 🎆

@nascheme (Member, Author) commented:

I think the CLA bot didn't notice the contribution by @bjorn-martinsson. If you haven't already, can you please sign the PSF CLA agreement:
https://www.python.org/psf/contrib/contrib-form/

@bjorn-martinsson commented:

I think the CLA bot didn't notice the contribution by @bjorn-martinsson. If you haven't already, can you please sign the PSF CLA agreement: https://www.python.org/psf/contrib/contrib-form/

Yes, I can fill out the CLA. Is it possible for me to use the CLA bot or do I need to do it manually?

@nascheme (Member, Author) commented:

Online is fine, AFAIK. The link above has some kind of online PDF you can fill out.

Thanks to everyone for your contributions to this change! It was fun to combine all the work.

@bjorn-martinsson commented Oct 27, 2022

Online is fine, AFAIK. The link above has some kind of online PDF you can fill out.

Thanks to everyone for your contributions to this change! It was fun to combine all the work.

Ok, I've signed it now. One thing that is a bit weird is that the PDF required me to give my "bugs.python.org username", which as I understand it is no longer a thing. So I gave my GitHub username instead.

gpshead added a commit to gpshead/cpython that referenced this pull request Nov 3, 2022
* Properly decref on _pylong import error.
* Improve the error message on _pylong TypeError.
* Tie the return value comments together.

These are minor followups to issues not caught among the reviewers on
python#96673.
gpshead added a commit that referenced this pull request Nov 3, 2022
* Properly decref on _pylong import error.
* Improve the error message on _pylong TypeError.
* Fix the assertion error in pydebug builds to be a TypeError.
* Tie the return value comments together.

These are minor followups to issues not caught among the reviewers on
#96673.