
Base: twiceprecision: optimize mul12 #49568

Merged (1 commit, May 4, 2023)

Conversation

@nsajko (Contributor) commented Apr 29, 2023

The algorithm behind Base.mul12 (sometimes known as "Fast2Mult") doesn't require a "Fast2Sum" (known in the Julia codebase as Base.canonicalize2) call at the end, so remove it. This follows from the fact that IEEE-754 floating-point multiplication is required to be correctly rounded, so Fast2Sum can't change the result.

See, for example, the beginning of https://doi.org/10.1145/3121432 by Joldes, Muller, and Popescu.

Furthermore, Base.Math.two_mul already exists, so use it as a kernel function. This required adding a general fallback method for Base.Math.two_mul, needed, for example, for BigFloat.

Also remove the iszero check, which is now unnecessary.
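
(Why the Fast2Sum is redundant: h = x*y is correctly rounded, so the residual l = fma(x, y, -h) satisfies h + l == x*y exactly whenever the fma doesn't underflow, and Base.canonicalize2(h, l) then returns (h, l) unchanged. A quick REPL check of this claim, not part of the PR:)

julia> x, y = 1.0 + rand(), 1.0 + rand();

julia> h = x * y; l = fma(x, y, -h);  # Fast2Mult: high part and exact residual

julia> abs(l) <= eps(h) / 2  # residual is at most half an ulp of h
true

julia> Base.canonicalize2(h, l) == (h, l)  # Fast2Sum is a no-op here
true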

@oscardssmith oscardssmith self-requested a review April 29, 2023 18:16
@oscardssmith (Member)

This is a good improvement. That said, it might make sense to do the deduplication now.

@oscardssmith oscardssmith added the performance Must go faster label Apr 29, 2023
@nsajko nsajko force-pushed the mul12 branch 2 times, most recently from 4c4b004 to 573cdc6 on April 29, 2023 21:24
@nsajko (Contributor, Author) commented May 2, 2023

ping

@nsajko (Contributor, Author) commented May 2, 2023

Actually, since mul12 is a very low-level function, it shouldn't have either the iszero check or the isfinite check; that should be handled at a higher level. I'll try to improve that in a follow-up PR.

@oscardssmith (Member)

Yeah, I think I agree with you that it might be worth just deleting mul12 and only using Base.two_mul.

nsajko added two commits to nsajko/julia that referenced this pull request May 4, 2023
@KristofferC KristofferC merged commit c90aa77 into JuliaLang:master May 4, 2023
@nsajko nsajko deleted the mul12 branch May 4, 2023 11:05
@LilithHafner (Member)

The new function differs from the old in about 1% of cases, but by at most one ulp, over all finite Float16s.

@nsajko (Contributor, Author) commented May 6, 2023

Do you have an example, please?

@oscardssmith (Member)

Also, for Float16, we should probably be doing this via Float32 mul anyway.
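
(One possible shape of that idea, as a hypothetical sketch rather than code from the PR: two Float16 significands have 11 bits each, so their exact product fits in Float32's 24-bit significand.)

julia> function mul12_via_f32(x::Float16, y::Float16)
           p = Float32(x) * Float32(y)   # exact: the 22-bit product fits in Float32
           h = Float16(p)                # correctly rounded high part
           l = Float16(p - Float32(h))   # the subtraction is exact in Float32 (Sterbenz)
           (h, l)
       end;

(The final rounding of l to Float16 can still lose bits for products near the bottom of the exponent range, the same caveat as for the fma-based version.)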

@LilithHafner (Member)

An example with Float64 is mul12(-2.609923178752538e-225, 4.707396092301997e-78)
An example with Float32 is mul12(-3.219508f-16, 7.302329f-23)
An example with Float16 is mul12(Float16(0.0003116), Float16(0.836))

I don't have any reason to believe that a 1-ulp discrepancy is a problem, I just noticed it and wanted to point it out.

I was using Float16 because I can check all pairs of Float16 values explicitly, while that is very hard to do for Float32 and impossible for Float64.

Code to generate examples

Here's code to generate all 37,671,208 examples in the Float16 case.

julia> function two_mul(x::T, y::T) where {T<:Number}
           xy = x*y
           xy, fma(x, y, -xy)  # product and its residual via fma
       end;

julia> function mul12_old(x::T, y::T) where {T<:AbstractFloat}
           h = x * y
           ifelse(iszero(h) | !isfinite(h), (h, h), Base.canonicalize2(h, fma(x, y, -h)))
       end;

julia> function mul12_new(x::T, y::T) where {T<:AbstractFloat}
           (h, l) = two_mul(x, y)
           ifelse(!isfinite(h), (h, h), (h, l))
       end;

julia> function get_errors()
           errors = NTuple{2, Float16}[]
           for i in 0x0:typemax(UInt16)
               for j in 0x0:typemax(UInt16)
                   a = reinterpret(Float16, i)
                   b = reinterpret(Float16, j)
                   isfinite(a) && isfinite(b) || continue
                   mul12_old(a, b) != mul12_new(a, b) && push!(errors, (a,b))
               end
           end
           errors
       end; @time errors = get_errors();
  9.056077 seconds (20 allocations: 161.329 MiB, 0.58% gc time)

Here's code to generate examples in the Float64 case

julia> using Base.Threads

julia> f() = while true
           a = reinterpret(Float64, rand(UInt64))
           isfinite(a) || continue
           x = rand(0x0:typemax(UInt64)-typemax(UInt32))
           for i in x:x+typemax(UInt32)
               b = reinterpret(Float64, i)
               if isfinite(b) && mul12_old(a, b) != mul12_new(a, b)
                   return "mul12($a,$b)"
               end
           end
       end; @sync for i in 1:Threads.nthreads() @spawn println(f()) end

@oscardssmith (Member)

Ah, right. The difference is just in the choice of high part, and hence the sign of the low piece, when the value is exactly halfway between two floats. This isn't a problem.
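
(For the Float16 example above, the two pairs are adjacent representations of the same value; a quick check, reusing the mul12_old/mul12_new definitions from the previous comment:)

julia> old = mul12_old(Float16(0.0003116), Float16(0.836));

julia> new = mul12_new(Float16(0.0003116), Float16(0.836));

julia> old == new  # the pairs differ...
false

julia> Float32(old[1]) + Float32(old[2]) == Float32(new[1]) + Float32(new[2])  # ...but sum to the same value
true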

@nsajko (Contributor, Author) commented May 7, 2023

Interesting examples.

This seems to be a case of the error-free transformation failing to be error-free as a result of the limited exponent range. Compensated floating-point calculations always rely on the exponent range being large enough; the surprising thing here is that the calculation fails even though there was no overflow or underflow.

julia> function two_mul(x::T, y::T) where {T<:Number}
           xy = x*y
           xy, fma(x, y, -xy)
       end
two_mul (generic function with 1 method)

julia> (a, b) = (Float16(0.0003116), Float16(0.836))
(Float16(0.0003116), Float16(0.836))

julia> a * b
Float16(0.0002606)

julia> m = two_mul(a, b)
(Float16(0.0002606), Float16(-1.0e-7))

julia> sum(m)
Float16(0.0002604)

Even though two_mul is obviously the algorithm that is proven correct, it fails in this case. I guess this is the reason:

[image: the theorem from the reference giving the condition under which two_mul (Fast2Mult) is error-free, in terms of e_min + p - 1]

For Float16 the precision p is 11 (mantissa length + 1), while the minimal exponent e_min is -14. So e_min + p - 1 ends up being the unexpectedly high value of -4. Conclusion: don't rely on error-free transformations more than the size of your exponent range allows.
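
(A quick REPL check of those numbers, assuming the theorem's condition is that the exponents satisfy e_x + e_y >= e_min + p - 1:)

julia> p, e_min = precision(Float16), exponent(floatmin(Float16))
(11, -14)

julia> e_min + p - 1
-4

julia> exponent(Float16(0.0003116)) + exponent(Float16(0.836))  # -13 < -4, so the guarantee doesn't apply
-13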

@LilithHafner LilithHafner added the maths Mathematical functions label Jun 4, 2023