Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster multiplication for multivariate power series #10532

Open
nilesjohnson mannequin opened this issue Dec 29, 2010 · 23 comments
Open

Faster multiplication for multivariate power series #10532

nilesjohnson mannequin opened this issue Dec 29, 2010 · 23 comments

Comments

@nilesjohnson
Copy link
Mannequin

nilesjohnson mannequin commented Dec 29, 2010

Multivariate power series are implemented in #1956. Ticket #10480 has a couple of different algorithms for improving PowerSeries_poly multiplication. MPowerSeries._mul_ should be modified to take advantage of the optimal algorithm.

CC: @sagetrac-mario @sagetrac-pernici @lftabera @nathanncohen

Component: commutative algebra

Keywords: multivariate power series multiplication Karatsuba

Author: pernici

Reviewer: Niles Johnson

Issue created by migration from https://trac.sagemath.org/ticket/10532

@nilesjohnson
Copy link
Mannequin Author

nilesjohnson mannequin commented Dec 29, 2010

depends on #1956 and trac_10480_fast_PowerSeries_poly_multiplication2.patch

@nilesjohnson
Copy link
Mannequin Author

nilesjohnson mannequin commented Dec 29, 2010

Author: pernici

@nilesjohnson
Copy link
Mannequin Author

nilesjohnson mannequin commented Dec 29, 2010

comment:1

Attachment: trac_10532_faster_MPowerSeries_mul.patch.gz

A patch posted by pernici at #1956: First apply #1956, then trac_10480_fast_PowerSeries_poly_multiplication2.patch from #10480, then attachment: trac_10532_faster_MPowerSeries_mul.patch. Timings before and after the patch are listed in #1956, comment 59, showing much improvement with this patch.

Comparisons with the Karatsuba algorithm of #10480 still need to be done.

@nilesjohnson
Copy link
Mannequin Author

nilesjohnson mannequin commented Dec 29, 2010

Reviewer: Niles Johnson

@nilesjohnson nilesjohnson mannequin added the s: needs info label Dec 29, 2010
@nathanncohen
Copy link
Mannequin

nathanncohen mannequin commented Dec 29, 2010

comment:2

(cc me)

@sagetrac-pernici
Copy link
Mannequin

sagetrac-pernici mannequin commented Jan 9, 2011

benchmark

@sagetrac-pernici
Copy link
Mannequin

sagetrac-pernici mannequin commented Jan 9, 2011

Attachment: mu2.sage.gz

benchmark

@sagetrac-pernici
Copy link
Mannequin

sagetrac-pernici mannequin commented Jan 9, 2011

comment:3

Attachment: bench1.sage.gz

Attachments: bench1.sage, mu2.sage

Here are some benchmarks comparing:

(1) with do_mul_trunc_generic
from trac_10480_fast_PowerSeries_poly_multiplication2.patch,
trac_1956_faster_MPowerSeries_mul.patch

(2,K) with do_trunc_karatsuba, where K is K_threshold
from trac_karatsuba_improvements.patch,
trac_10480_fast_PowerSeries_poly_multiplication-alternative.patch

(1) or (2) are applied after applying
trac_1956_multi_power_series_new_4.patch,
trac_1956_uni_multi_ps_2.patch,
trac_1956_multi_ps_cleanup.patch.

These tests are run with QQ and GF(7) as examples of fields
implemented in libsingular, RR as field not implemented there.

Various values of K are tried; (1) corresponds to the K=infinity case
for QQ and GF(7) (slightly slower because do_mul_trunc_generic has a
slightly slower implementation than the corresponding do_trunc_classical
by lftabera), but not for RR (and I guess for other fields not used in
libsingular), where (1) is slower.

The conclusion is that I think that (2) with K=infinity (or a large
integer to avoid using infinity) is the best choice.

bench1.sage are the benchmarks posted by Niles to compare with Magma,
with precision multiplied by N.

times are in seconds;
processor:4-core: Intel Core i7 CPU 860 @ 2.80GHz
on x86_64 GNU/Linux

prec = prec1*N
bench1.sage with GF(7)
test no. 1     2     3     4     5     6     7     8
prec1    16    15    50    30    51    30    30    30
N=3
(1)      0.45  0.03                    0.49  0.49  0.45
(2,8)    2.16  0.06                    1.21  1.25  1.10
(2,64)   0.45  0.03                    0.55  0.56  0.54
(2,128)  0.43  0.03                    0.44  0.45  0.44
N=4
(1)      1.80  0.06                    1.44  1.34  1.22
(2,8)    3.95  0.12                    3.89  3.79  3.25
(2,64)   1.88  0.06                    1.69  1.60  1.50
(2,128)  1.78  0.06                    1.45  1.34  1.22
N=5
(1)      8.96  0.21                    5.01  4.65  4.27
(2,8)    23.2  0.52                    14.8  14.3  12.3
(2,64)   11.2  0.23                    6.55  6.01  5.53
(2,128)  8.74  0.19                    5.50  5.11  4.69
N=6
(1)      27.0  0.67                    9.99  9.35  9.17
(2,8)    87.2  1.56                    30.4  31.2  26.4
(2,64)   34.5  0.74                    13.6  12.8  12.4
(2,128)  26.1  0.67                    10.9  10.1  9.79
N=7
(1)      40.8  1.46                    18.6  17.3  16.1
(2,8)    132   3.67                    57.6  60.4  51.4
(2,64)   51.1  1.55                    25.0  23.3  21.1
(2,128)  39.3  1.51                    20.4  19.2  17.6
(2,256)  38.3  1.32                    18.3  16.9  15.9


bench1.sage with RR
test no. 1     2     3     4     5     6     7     8
prec1    16    15    50    30    51    30    30    30
N=1
(1)      1.18  0.06  0.88  0.70  0.23  0.67  0.69  0.70
(2,8)    0.34  0.04  0.61  4.92  0.18  0.47  0.57  0.89
(2,64)   0.18  0.04  0.74  0.27  0.20  0.24  0.26  0.26
N=2
(1)      57    0.96  1.6   8.4   0.67  8.2   8.5   8.4
(2,8)    19.5  0.29  1.4   >540  0.44  9.9   32    183
(2,64)   6.06  0.16  1.2   2.65  0.43  2.55  2.63  2.68
(2,128)  6.07  0.16  1.3   2.64  0.48  2.57  2.63  2.63
N=3
(2,64)   58.1  0.93  1.39  79.4  0.72  18.5  28.8  48.8
(2,128)  58.9  0.94  1.59  11.8  0.78  11.6  12.0  11.7

mu2.sage is a modification of the benchmark mu.sage posted in ticket #1956
in which the random polynomials are evaluated in t_i + 1;
the various series are stored in dat/ to compare times with
the same series and different algorithms; this is a workabout for
the lack of something like random.seed for random_element.
Timings are only indicative.

mu2.sage  with QQ
        N=1   2     3     4     5     6     7     8
n=2
prec=20 bound=20
(1)     0.007 0.01  0.02  0.04  0.09  0.21  0.58  1.79
(2,8)   0.015 0.03  0.06  0.10  0.22  0.58  1.88  5.92
(2,32)  0.007 0.01  0.02  0.04  0.09  0.20  0.57  1.74
prec=20 bound=100
(1)     0.007 0.01  0.03  0.05  0.11  0.28  0.79  2.4
(2,8)   0.019 0.04  0.07  0.13  0.30  0.85  2.64  8.2
(2,32)  0.007 0.01  0.03  0.05  0.11  0.27  0.78  2.4
prec=100 bound=100
(1)     2.96  4.46  7.14  13.4  31.8
(2,8)   18.0  32.6  67.1  162   494
(2,32)  6.82  10.7  18.8  38.7  104
(2,128) 2.91  4.40  7.23  13.5  32.1

mu2.sage  with GF(7)
n=2
prec=100 bound=100
(1)     0.08  0.08  0.08  0.08  0.09  0.09  0.09  0.09
(2,8)   0.41  0.45  0.45  0.44  0.45  0.45  0.46  0.44
(2,128) 0.07  0.08  0.08  0.08  0.08  0.08  0.08  0.08

prec=200 bound=200
(1)     1.05  1.21  1.22  1.23  1.22
(2,8)   11.0  12.5  12.4  12.4  12.5
(2,128) 1.41  1.59  1.58  1.59  1.58
(2,256) 1.00  1.14  1.12  1.14  1.15

mu2.sage with RR
n=2
prec=20 bound=20
(1)     0.03  0.06  0.06  0.06  0.06  0.06  0.06  0.06
(2,8)   0.03  0.04  0.15  0.41  1.15  3.66  12.8  45.7
(2,128) 0.01  0.01  0.01  0.01  0.01  0.01  0.01  0.01

prec=20 bound=100
(1)     0.05  0.05  0.06  0.06  0.06  0.06  0.06  0.06
(2,8)   0.03  0.07  0.23  0.42  1.08  2.71  5.94  10.2
(2,128) 0.01  0.01  0.01  0.01  0.01  0.01  0.01  0.01

prec=100 bound=100
(1)     1.87  1.87  1.91  1.87  1.88
(2,8)   1.06  8.95  69.0  332   2050
(2,128) 0.31  0.31  0.31  0.31  0.31

n=4
prec=20 bound=20
(1)     4.05  68.8  69.5  71.1  76.1
(2,8)   3.91  66.0  615   >2000
(2,128) 1.16  2.00  2.02  2.05  2.03

@lftabera
Copy link
Contributor

lftabera commented Jan 9, 2011

comment:4

If K=infinity is needed to obtain good performance then either we are doing something wrong or truncated karatsuba should be eliminated in favor of the classical truncated algorithm. I will try to look at this, but probably not this week.

@sagetrac-pernici
Copy link
Mannequin

sagetrac-pernici mannequin commented Jan 10, 2011

comment:5

In the message above I claimed that patch (1) uses do_mul_trunc_generic;
this is not true for RR, in which case it falls back to _mul_generic,
the classical multiplication;
this is the reason for which (1) is much slower than the K=infinity
case for RR, calling do_trunc_classical.
Since the truncated classical multiplication
do_trunc_classical computes
the same terms as the classical multiplication, apart from those
which are then truncated away, it is fine for non-exact fields.

The conclusion of the previous message stands:
I think that (2) with K=infinity (or a large integer to avoid
using infinity) is the best choice.

lftabera wrote:

If K=infinity is needed to obtain good performance then either we are doing something wrong or truncated karatsuba should be eliminated in favor of the classical truncated algorithm.

Right, it can call directly the classical truncated algorithm, but the K=infinity case
calls the classical truncated algorithm almost immediately, so
I think that the overhead of calling it is negligible anyway.

@sagetrac-pernici
Copy link
Mannequin

sagetrac-pernici mannequin commented Jan 19, 2011

@sagetrac-pernici
Copy link
Mannequin

sagetrac-pernici mannequin commented Jan 19, 2011

comment:6

The attached patch trac_1956_faster_MPowerSeries_mul2.patch
is applied after
trac_1956_multi_power_series_new_4.patch,
trac_1956_uni_multi_ps_2.patch,
trac_1956_multi_ps_cleanup.patch

For multivariate series Karatsuba multiplication is slower than generic
multiplication also in the case of prec=infinity

sage: R.<x,y,z,w> = QQ[[]]
sage: p = 1 + x/2 + y/3 + z/4
sage: %timeit p^20

times in ms; K=K_threshold
                  p^5     p^10    p^20       p^40
generic           0.31    1.62    28.5       1070
Karatsuba K=8     0.56    5.7     172        1900
Karatsuba K=64    0.56    5.7     167        1080

The attached patch uses generic multiplication for any precision;
it has been made independent from ticket #10480, which deals mainly
with the truncated Karatsuba multiplication, which is fast only
for univariate series.

_mul_trunc_generic, coming from do_trunc_classical by lftabera,
has a lot of code in common with _mul_generic and maybe it should be merged
with it; also, it would be nice to do the _square_generic optimization
for finite prec.

@sagetrac-pernici
Copy link
Mannequin

sagetrac-pernici mannequin commented Feb 8, 2011

Attachment: trac_10532_invert.patch.gz

@sagetrac-pernici
Copy link
Mannequin

sagetrac-pernici mannequin commented Feb 8, 2011

comment:7

The attached trac_10532_invert.patch is applied after #1956 patches and
trac_1956_faster_MPowerSeries_mul2.patch

In this patch __invert__ is implemented using _mul_trunc_generic;
inversion of a generic multivariate series becomes much faster.

In the benchmarks (0) is without this patch, (1) with it

sage: N = 4
sage: R = PowerSeriesRing(QQ,N,'x')
sage: x = R.gens()
sage: sx = sum(x)
sage: prec = 15
sage: p = (1 + sx + R.O(prec))^-1 + (1 + 2*sx + R.O(prec))^-1
sage: %time p1 = p^-1
Wall time: 0.40 s

N  prec     (0)    (1)
2  30       1.18   0.08
2  50       30.3   0.87
4  15       42.0   0.40
4  20       662    3.2
sage: N = 4
sage: R = PowerSeriesRing(QQ,N,'x')
sage: x = R.gens()
sage: prec = 30
sage: p = sum([(i+1)*x[i]^(i+1) for i in range(N)]) + R.O(prec)
sage: p = 1 + sum([p^i/i for i in range(1,prec)])
sage: %time p1 = p^-1
Wall time: 0.28 s

N  prec     (0)    (1)
2  30       0.37   0.03
2  50       5.7    0.28
4  30       32.6   0.28
8  30       1566   1.0

A special kind of series inversion are series of degree much lower than prec, like

p = 1 + 2*x[0] + 3*x[1] + 4*x[2] + 5*x[3]  + R.O(30)

inversion is much faster because in the iteration in the Newton method
the expensive multiplication at order next_prec becomes trivial.
It remains a square of a polynomial of degree roughly next_prec/2 .
In this case this patch does not change speed; in this ticket there is a lot
of benchmarks for this case, see bench1.sage

Finally I point out that in ticket #10720 I wrote a similar but slower
version of __invert__ for series on rings of polynomials; I will improve
that version, but notice that the speedup of __invert__ in a generic
series on rings of polynomials is much smaller than here.

@sagetrac-pernici
Copy link
Mannequin

sagetrac-pernici mannequin commented Feb 9, 2011

Attachment: trac_10532_square_trunc.patch.gz

@sagetrac-pernici
Copy link
Mannequin

sagetrac-pernici mannequin commented Feb 9, 2011

comment:8

The attached trac_10532_square_trunc.patch is applied after #1956 patches,
trac_1956_faster_MPowerSeries_mul2.patch and trac_10532_invert.patch

With this patch truncated multiplication is optimized for squares.

Benchmark comparison with PARI/GP:

sage: N = 4
sage: R = PowerSeriesRing(QQ,N,'x')
sage: x = R.gens()
sage: sx = sum(x)
sage: prec = 20
sage: n = 1
sage: %time p = ((1 + sx + R.O(prec))^-1 + (1 + 2*sx + R.O(prec))^-1)^-n
Wall time: 4.06 s

to do the computation with gp allocate more memory than the one assigned
by default; here the computation
is done directly with the background field

sage: try:
....:     gp.allocatemem()
....: except:
....:     pass
....:
sage: N = 4
sage: prec = 20
sage: n = 1
sage: sx = '+'.join(['x%d' % i for i in range(N)])
sage: fmt = '((1 + t*(%s) + O(t^%s))^-1 + (1 + 2*t*(%s) + O(t^%s))^-1)^-%s'
sage: %time p1 = gp(fmt %(sx,prec,sx,prec,n))
Wall time: 1.56 s

(0) version of ticket #1956
(1) with trac_10532_invert.patch
(2) with trac_10532_square_trunc.patch
(3) with gp (PARI/GP)
N  prec  n   (0)    (1)    (2)    (3)
2  30    1   1.2    0.10   0.10   0.06
2  30    20  8.6    0.22   0.18   0.12
2  50    1   21     1.2    1.2    0.55
2  50    20  213    2.1    1.8    1.2
4  15    1   42     0.46   0.46   0.22
4  15    20  323    1.2    0.96   0.43
4  20    1   673    4.1    4.1    1.6
4  20    20  >7000  10     8.1    3.2

@nilesjohnson
Copy link
Mannequin Author

nilesjohnson mannequin commented Feb 9, 2011

comment:9

Replying to @sagetrac-pernici:

(0) version of ticket #1956
(1) with trac_10532_invert.patch
(2) with trac_10532_square_trunc.patch
(3) with gp (PARI/GP)

This looks really good -- but I'm a little confused by the comparison with gp . . . If that's always faster, should we just always do the multiplication there? There must be more subtleties involved, but I don't know anything about them :(

@sagetrac-pernici
Copy link
Mannequin

sagetrac-pernici mannequin commented Feb 9, 2011

comment:10

niles wrote:

This looks really good -- but I'm a little confused by the comparison

with gp . . . If that's always faster, should we just always do the
multiplication there? There must be more subtleties involved, but I don't
know anything about them :(

Well, I don't know either, I just looked at it today.
I think that if one would like to use PARI, one should avoid gp
as much as possible and call directly the PARI C library functions;
in fact gp has a big overhead

sage: g = gp('1 + (x+y)*t + O(t^6)')
sage: %timeit g^2
25 loops, best of 3: 16 ms per loop
sage: R.<x,y> = QQ[[]]
sage: p = 1 + x + y + R.O(6)
sage: %timeit p^2
625 loops, best of 3: 56.5 µs per loop

so that it starts to be convenient to use it when the computation is long
enough.

For this reason
in the benchmark in the previous mail gp is slower for small precision

N  prec  n   (0)    (2)    (3)
2  10    1   0.02   0.01   0.02
2  20    1   0.23   0.03   0.03

In the above benchmarks gp is at most 2.5x faster;
in those cases the overhead is negligible, so I guess that using directly
the PARI library the speedup would be the same.

@nilesjohnson
Copy link
Mannequin Author

nilesjohnson mannequin commented Feb 9, 2011

comment:11

Replying to @sagetrac-pernici:

In the above benchmarks gp is at most 2.5x faster;
in those cases the overhead is negligible, so I guess that using directly
the PARI library the speedup would be the same.

I see -- maybe -- can this patch be written to use PARI directly by default then? In the cases where it isn't as fast, the computation time should be small anyway, right?

@sagetrac-pernici
Copy link
Mannequin

sagetrac-pernici mannequin commented Feb 10, 2011

comment:12

Replying to @nilesjohnson:

Replying to @sagetrac-pernici:

In the above benchmarks gp is at most 2.5x faster;
in those cases the overhead is negligible, so I guess that using directly
the PARI library the speedup would be the same.

I see -- maybe -- can this patch be written to use PARI directly by default then? In the cases where it isn't as fast, the computation time should be small anyway, right?

It depends on the use cases; a user may need to do usually a lot of computations with low precision series; then it is important that they are done fast; in the rare case in which he needs a high precision computation, maybe he does not care that it takes 3x longer. Another user may care only for long computations.

Another observation: I suspect that total degree multivariate power series implemented with PARI would require changes in the implementation in ticket #1956, not only in this patch; the background power series ring would not be PowerSeriesRing(self.__poly_ring, T), but a PARI power series. For certain rings probably one cannot use PARI (e.g. symbolic rings?), then the current implementation should be used instead.

Maybe it would be better to make a specialized multivariate power series class using PARI; this would be the argument of another ticket, if someone is interested in working on it; ticket #1956 and this patch would deal with the generic implementation.

@nilesjohnson
Copy link
Mannequin Author

nilesjohnson mannequin commented Feb 10, 2011

comment:13

Replying to @sagetrac-pernici:

Another observation: I suspect that total degree multivariate power series implemented with PARI would require changes in the implementation in ticket #1956, not only in this patch; the background power series ring would not be PowerSeriesRing(self.__poly_ring, T), but a PARI power series. For certain rings probably one cannot use PARI (e.g. symbolic rings?), then the current implementation should be used instead.

Maybe it would be better to make a specialized multivariate power series class using PARI; this would be the argument of another ticket, if someone is interested in working on it; ticket #1956 and this patch would deal with the generic implementation.

This sounds like a good idea -- thanks for clarifying. The improvements you've made are certainly worth finishing and including in sage.

@sagetrac-pernici
Copy link
Mannequin

sagetrac-pernici mannequin commented Feb 15, 2011

Attachment: trac_10532_send_to_bg.patch.gz

@sagetrac-pernici
Copy link
Mannequin

sagetrac-pernici mannequin commented Feb 15, 2011

comment:14

The attached trac_10532_send_to_bg.patch fixes _send_to_bg in the case
of multivariate series in one variable, see comments 64-70 in ticket #1956

sage: R = PowerSeriesRing(QQ,1,'x')                                                
sage: f = R._poly_ring().gens()[0]
sage: R._send_to_bg(f + f^2)
x*T + x^2*T^2
sage: R.random_element(4)
-3/25*x^3 - 3*x^2 + x + O(x)^4

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants