Optimise multiplication #295
Conversation
```
test multiply_0 ... bench:         42 ns/iter (+/- 0)
test multiply_1 ... bench:      4,271 ns/iter (+/- 68)
test multiply_2 ... bench:    289,339 ns/iter (+/- 10,532)
test multiply_3 ... bench:    663,387 ns/iter (+/- 17,381)
test multiply_4 ... bench:      7,474 ns/iter (+/- 384)
test multiply_5 ... bench:     16,376 ns/iter (+/- 254)
```
Introduces a Half-Karatsuba step in the multiplication module for cases where there is a significant size disparity between the factors. Optimizes performance by evenly splitting the larger factor for more balanced multiplication calculations.

```diff
-test multiply_0 ... bench:         42 ns/iter (+/- 0)
+test multiply_0 ... bench:         42 ns/iter (+/- 0)
-test multiply_1 ... bench:      4,271 ns/iter (+/- 68)
+test multiply_1 ... bench:      4,242 ns/iter (+/- 164)
-test multiply_2 ... bench:    289,339 ns/iter (+/- 10,532)
+test multiply_2 ... bench:    288,857 ns/iter (+/- 8,715)
-test multiply_3 ... bench:    663,387 ns/iter (+/- 17,381)
+test multiply_3 ... bench:    583,635 ns/iter (+/- 14,997)
-test multiply_4 ... bench:      7,474 ns/iter (+/- 384)
+test multiply_4 ... bench:      6,649 ns/iter (+/- 164)
-test multiply_5 ... bench:     16,376 ns/iter (+/- 254)
+test multiply_5 ... bench:     13,922 ns/iter (+/- 323)
```
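The split-the-larger-factor idea can be sketched on little-endian base-2^32 limb vectors. This is a hedged toy model, not num-bigint's actual code: `mul_schoolbook`, `add2`, and `half_karatsuba` here are illustrative stand-ins (the crate's real `add2` and `mac3` have different signatures and dispatch to faster algorithms above size thresholds).

```rust
/// Schoolbook multiplication of little-endian base-2^32 limb slices.
fn mul_schoolbook(x: &[u32], y: &[u32]) -> Vec<u32> {
    let mut acc = vec![0u32; x.len() + y.len()];
    for (i, &xi) in x.iter().enumerate() {
        let mut carry = 0u64;
        for (j, &yj) in y.iter().enumerate() {
            let t = acc[i + j] as u64 + xi as u64 * yj as u64 + carry;
            acc[i + j] = t as u32;
            carry = t >> 32;
        }
        // The slot above this partial product is still zero, so a plain
        // store of the leftover carry is enough.
        acc[i + y.len()] = carry as u32;
    }
    acc
}

/// dst += src, propagating the final carry into dst's higher limbs.
/// The caller must guarantee the sum fits in dst.
fn add2(dst: &mut [u32], src: &[u32]) {
    let mut carry = 0u64;
    let mut i = 0;
    while i < src.len() || carry != 0 {
        let s = if i < src.len() { src[i] as u64 } else { 0 };
        let t = dst[i] as u64 + s + carry;
        dst[i] = t as u32;
        carry = t >> 32;
        i += 1;
    }
}

/// Half-Karatsuba step: x is the shorter factor. Split the longer factor y
/// at m2 = y.len() / 2; x has no "high" part relative to this split, so
/// x * y = x * low2 + (x * high2) shifted left by m2 limbs.
fn half_karatsuba(x: &[u32], y: &[u32]) -> Vec<u32> {
    debug_assert!(x.len() <= y.len());
    let m2 = y.len() / 2;
    let (low2, high2) = y.split_at(m2);
    let mut acc = vec![0u32; x.len() + y.len()];
    add2(&mut acc, &mul_schoolbook(x, low2));
    // Adding at index m2 is the same as multiplying the product by b^m2.
    add2(&mut acc[m2..], &mul_schoolbook(x, high2));
    acc
}
```

In the real code the two partial products would themselves recurse into the full multiplication dispatch rather than stay schoolbook; the point of the sketch is only the split-and-shift structure.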
Thanks! This does seem like a very straightforward improvement -- almost too obvious (like you said), but I can't find any fault with it. 😄
src/biguint/multiplication.rs (outdated)

```rust
// Add temp shifted by m2 to the accumulator.
// This simulates the effect of multiplying temp by b^m2.
// Add directly starting at index m2 in the accumulator.
add2(&mut acc[m2..], &p.data);
```
Since each product is immediately added to `acc`, I think we don't even need the `p` buffer at all:

```rust
mac3(acc, x, low2);
mac3(&mut acc[m2..], x, high2);
```

It doesn't change the benchmark times for me, but still, it's an easy allocation to avoid.
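A toy model of this suggestion, assuming a schoolbook `mac3` (the real `mac3` in num-bigint dispatches to Karatsuba and Toom-3 above size thresholds): the slice offset `m2` itself plays the role of the `b^m2` shift, so no temporary product buffer is needed.

```rust
/// Toy multiply-accumulate: acc += x * y on little-endian base-2^32 limbs.
fn mac3(acc: &mut [u32], x: &[u32], y: &[u32]) {
    for (i, &xi) in x.iter().enumerate() {
        let mut carry = 0u64;
        for (j, &yj) in y.iter().enumerate() {
            let t = acc[i + j] as u64 + xi as u64 * yj as u64 + carry;
            acc[i + j] = t as u32;
            carry = t >> 32;
        }
        // Propagate the leftover carry into higher limbs of acc.
        let mut k = i + y.len();
        while carry != 0 {
            let t = acc[k] as u64 + carry;
            acc[k] = t as u32;
            carry = t >> 32;
            k += 1;
        }
    }
}

/// x * y with y the longer factor, split at m2 = y.len() / 2.
/// Accumulating into `acc[m2..]` *is* the multiplication by b^m2,
/// so the intermediate product buffer disappears.
fn mul_half(x: &[u32], y: &[u32]) -> Vec<u32> {
    let m2 = y.len() / 2;
    let (low2, high2) = y.split_at(m2);
    let mut acc = vec![0u32; x.len() + y.len()];
    mac3(&mut acc, x, low2);
    mac3(&mut acc[m2..], x, high2);
    acc
}
```

Since the total product fits in `x.len() + y.len()` limbs, the carry loop in `mac3` never runs off the end of either slice.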
Right! Very nice. Thanks, I've pushed the simplification.
Sorry about the format error due to trailing newline, fixed and pushed.
Per suggestion from cuviper
I suspect the reason this isn't normally considered is that it seems to add a layer of recursive multiplications: this "half" step recurses to two "full" Karatsuba steps with the same smaller factor.

Whatever it is, the benchmarks do favor this change. I'm just trying to think if there are untested scenarios that this will make worse.
The pseudo-code in the Wikipedia article on Karatsuba for some reason splits on the longer factor.

I noticed this code:

```rust
// When x is smaller than y, it's significantly faster to pick b such that x is split in
// half, not y:
let b = x.len() / 2;
```

It seems that, with the Half-Karatsuba step, it doesn't matter whether we split on x or y in the Full-Karatsuba step. The reason why Full-Karatsuba, without the half-step, when splitting on […]. I think splitting the longer […].

I've done quite a lot of benchmarks on my PostgreSQL numeric mul_var() patch, but not for […].

In my PostgreSQL patch benchmark, I've measured the performance_ratio, defined as the execution time for the current implementation divided by the execution time for the patched version, by exposing both implementations in the same runtime. This allows increasing the number of executions exponentially until the performance_ratio converges.

I think it would be fun to work together on developing a very capable benchmark suite for […].

Important: the above plot is not for this patch; it's from my separate PostgreSQL patch, included here just for inspiration.
Hi @cuviper, I've now done benchmarks across varying sizes of BigUint. Here's what I found: when the factor sizes are identical, there's no noticeable difference in the performance ratio. It's when we introduce a length disparity between the factors that the performance ratio starts to climb, highlighting Half-Karatsuba's efficiency in handling size asymmetry.

Interestingly, keeping the larger factor constant and varying the smaller factor shows that the performance_ratio increases with the length disparity, but only to a certain point. Beyond this point it begins to taper off, approaching 1.0 as the factor sizes become equal. This seems encouraging. Eager to hear your thoughts.

The benchmark below was produced using https://github.com/joelonsql/rust-timeit/blob/master/examples/num_bigint_half_karatsuba.rs

As shown in the plot, […]
@cuviper wrote:
To address this, I've examined the data points with the lowest performance ratios from the last 49152 measurements:
I then reran these tests with higher precision and additional runs to ensure reliability. Here are the refined results:
The average of these is approx. 1.006, which suggests that the initial observations were statistical anomalies. My conclusion from all of this is that I now feel pretty convinced the Half-Karatsuba step doesn't cause any measurable performance regression for any combination of factor sizes, while giving significant performance gains across a substantial part of the total space of factor size combinations.
That's incredibly thorough, thank you! I am quite satisfied. 😄 Another thing about the anomalies -- one of the first things that […]
Hi maintainers of rust-num/num-bigint,
I've been hacking on PostgreSQL's numeric.c lately, implementing the Karatsuba algorithm [1].
During this work, I realised that when splitting at half the larger factor's size, if the smaller factor is shorter than that, its high part will be zero. A variable that is zero usually means a formula can be simplified, so I plugged the zero high1 into the formula, and voilà: that eliminated a multiplication and most of the additions/subtractions, needing only one multiplication and an addition. This simplification felt too obvious to be novel; I've looked around a bit but haven't found it anywhere so far. I think it should be mentioned in the Karatsuba Wikipedia article or somewhere similar.
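The simplification can be checked numerically with fixed-width integers. In Karatsuba terms, high1 = 0 makes z2 = high1·high2 vanish and collapses z1 = (low1 + high1)(low2 + high2) − z2 − z0 to low1·high2, leaving x·y = x·low2 + (x·high2)·b^m. A minimal sketch using a bit-level split (the function name and the u128 widths are illustrative, not anyone's actual implementation):

```rust
/// If x fits below the split point (so its high part high1 is zero), then
///     x * y = x * low2 + ((x * high2) << m)
/// -- two half-size multiplications and one shifted addition, instead of
/// Karatsuba's three multiplications plus the extra additions/subtractions.
fn half_karatsuba_bits(x: u128, y: u128, m: u32) -> u128 {
    debug_assert!(x < 1u128 << m, "x must have an empty high part at this split");
    let low2 = y & ((1u128 << m) - 1); // low m bits of y
    let high2 = y >> m;                // remaining high bits of y
    x * low2 + ((x * high2) << m)
}
```

The caller must pick operands small enough that the true product fits the integer width; with arbitrary-precision limbs the same identity holds with the shift replaced by a limb offset.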
I of course also checked Rust's implementation, and it seems to benefit from this trick as well, hence this pull request.
Would be fun to know what you think.
Benchmark:
Thanks for the great work!
/Joel
[1] https://www.postgresql.org/message-id/flat/7f95163f-2019-4416-a042-6e2141619e5d@app.fastmail.com