
Faster sinh and cosh for Float32, Float64 #37426

Merged: 8 commits, Oct 30, 2020

Conversation

oscardssmith
Member

Use minimax polynomials to reduce the number of cases needed. These functions are only accurate to <1.5 ULPs, but the old functions were over 2x slower at 1.3 ULPs, so I view this as an acceptable trade-off. The biggest question I have about this is whether exthorner belongs here. It arguably should live somewhere where it gets exported, or at least is more widely available, since extended-precision polynomial evaluation makes special functions much easier to code. Thoughts?
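For readers who haven't opened the diff, here is a minimal sketch of a compensated ("extended-precision") Horner evaluation of the kind exthorner performs, assuming fma is available; the function name, tuple interface, and comments are illustrative, not the PR's exact code.

# Sketch of a compensated Horner scheme (illustrative, not the PR's exthorner).
# hi follows the ordinary Horner recurrence; lo accumulates the recovered
# rounding errors, so hi + lo approximates the polynomial to roughly twice
# the working precision.
function exthorner_sketch(x, p::Tuple)
    hi, lo = p[end], zero(x)
    for i in length(p)-1:-1:1
        prod = hi * x
        err  = fma(hi, x, -prod)      # exact rounding error of hi*x
        hi   = prod + p[i]
        err += p[i] - (hi - prod)     # error of the add (assumes |prod| >= |p[i]|)
        lo   = muladd(lo, x, err)     # carry the accumulated error through the recurrence
    end
    return hi, lo
end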

@oscardssmith
Member Author

I think this is good to go other than fixing the doctest.

@oscardssmith mentioned this pull request Sep 7, 2020
for i in length(p)-1:-1:1
    pi = p[i]
    prod = hi*x
    err = fma(hi, x, -prod)
Member

Should we condition use of this on FMA_NATIVE? (Although it's not currently accurate... #33011)

Member Author

I think this should be unconditional. If this is a muladd, LLVM will constant-fold err to 0. Also, are there any relevant platforms without a fast fma? As I understand it, it's required to implement IEEE 754 (2008).

Member

Makes sense. We should probably discuss to what extent we want to assume fast fma.

Contributor

@chriselrod commented Sep 9, 2020

I've gotten complaints about fma causing bad performance before, e.g.: JuliaSIMD/LoopVectorization.jl#134 (comment)
Some people are on old architectures.

I would personally prefer that Julia assume fast fma though, because:

julia> Base.Math.FMA_NATIVE
false

julia> run(pipeline(`cat /proc/cpuinfo`, `grep fma`, `head -n1`))
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear flush_l1d
Base.ProcessChain(Base.Process[Process(`cat /proc/cpuinfo`, ProcessSignaled(13)), Process(`grep fma`, ProcessSignaled(13)), Process(`head -n1`, ProcessExited(0))], Base.DevNull(), Base.DevNull(), Base.DevNull())

That is, Julia says FMA_NATIVE is false, but it is in fact true.
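As an aside, a runtime probe along these lines can show whether muladd actually fuses on the current machine; this is an illustrative sketch, not Base's FMA_NATIVE definition (whose brokenness is what #33011 tracks).

# Illustrative runtime probe, not Base's actual detection code. With a true
# fused multiply-add, the 2^-104 term of x*x survives the subtraction; with a
# separate multiply and add it is rounded away first.
fma_probe(x) = muladd(x, x, -nextfloat(one(x), 2)) != zero(x)
fma_probe(nextfloat(1.0))   # expected: true if muladd fuses, false if it lowers to mul + add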

@JeffBezanson JeffBezanson added the triage This should be discussed on a triage call label Sep 9, 2020
@JeffreySarnoff
Contributor

Most code I write assumes a fast fma. Absent fast fma, arithmetic slows and inaccuracies are more difficult to manage. I made peace with the perspective that fast fma is available before developing DoubleFloats.

@oscardssmith
Member Author

@JeffreySarnoff Thank you for writing DoubleFloats; I definitely stole a lot of code ideas from it for this.

@JeffreySarnoff
Contributor

JeffreySarnoff commented Sep 10, 2020

> Thank you for writing DoubleFloats

Elated; you are most welcome.

> I definitely stole a lot of code ideas from it for this.

My work, your work, we are Julia at work.

@musm
Contributor

musm commented Sep 10, 2020

One thing we should decide upon is whether fma is acceptable to use in the math library, as a general rule.

I recall we had discussions regarding its use with @simonbyrne and @pkofod a while back, regarding assumptions on whether machine fma is available or not. Back then it seemed unacceptable to have a slow fallback in case of non-machine fma for Base, but the situation may have changed as more modern CPUs include it natively. Currently the libm is accurate and relatively fast regardless of whether the machine has native fma or not, since the code doesn't make assumptions about fused multiply-add availability. As a result it's pretty difficult to improve upon the existing implementations. Now, if we can assume machine fma, it makes our lives a lot easier. I do think we should decide on an acceptable minimum standard for the math library.

I took another look (and I think this is why it was trickier a few years back): native fma support isn't as old as I initially assumed, but it is perhaps old enough that we can simply assume it (ref 1) (ref 2) (it looks like at least any CPU from the last 5 years should include it).

I guess it can be tricky to go from something that "always works" to something that "works in most places and is best in most places" (quoting @pkofod). OTOH, I feel OK about assuming it.

@oscardssmith
Member Author

Would it be possible to make fma use a more complicated fallback for machines that don't have it natively? I agree that we want this code to work everywhere, but IMO it's fine if our code doesn't have optimal performance on 5-year-old systems.

@musm
Contributor

musm commented Sep 10, 2020

fma falls back to a software implementation to retain exact rounding, which makes it super slow.

5-year-old CPUs, but the point stands. Back then the situation was a bit different, but yes, I agree that we could probably drop that. Best to discuss in triage?

@JeffBezanson
Member

Oscar mentioned in triage that there is a faster software fma implementation we should be using. If we can get that working, the decision is easy. Triage is mostly OK with this anyway, though.

@JeffBezanson JeffBezanson removed the triage This should be discussed on a triage call label Sep 10, 2020
@oscardssmith
Member Author

oscardssmith commented Sep 11, 2020

I'm not 100% sure this is correct, but it's passed the tests I've thrown at it. It is 22x slower for Float32 and 11x slower for Float64.


# Split x into high and low parts such that x == hi + lo and hi has enough
# trailing zero bits that products of hi terms are exact.
@inline function double(x::Float64)
    mask = -(UInt64(1)<<26)
    hi = reinterpret(Float64, reinterpret(UInt64, x) & mask)
    return hi, x-hi
end

@inline function double(x::Float32)
    mask = -(UInt32(1)<<12)
    hi = reinterpret(Float32, reinterpret(UInt32, x) & mask)
    return hi, x-hi
end

# Error-free sum, assuming |a| >= |b|.
@inline function two_hilo_sum(a, b)
    s = a + b
    return s, b - (s - a)
end

# Software emulation of fma via double-double arithmetic.
@inline function sfma(x, y, z)
    xhi, xlo = double(x)
    yhi, ylo = double(y)
    prod = two_hilo_sum(xhi*yhi, z)
    return prod[1] + (xlo*ylo + prod[2] + xlo*yhi + xhi*ylo)
end
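As a quick illustration of the limitation discussed further below (using the definitions above): the split product xhi*yhi can overflow even when the exact fused result is finite.

sfma(floatmax(Float64), 2.0, -floatmax(Float64))   # expected: NaN, since xhi*yhi overflows to Inf
fma(floatmax(Float64), 2.0, -floatmax(Float64))    # floatmax(Float64) == 1.7976931348623157e308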

@pkofod
Contributor

pkofod commented Sep 15, 2020

As @musm mentions above, we discussed this and were both surprised at how recent ubiquitous fma actually is; I had assumed it had been around for at least five years longer. We can assume that many users have hardware older than this, especially in economically less advanced areas of the world.

I'm not 100% against using fma, but I'm not 100% for it either, if it means someone with not-so-old hardware is excluded from using Julia with reasonable performance.

@JeffreySarnoff
Contributor

Not using fma does not make Julia more performant on machines that do not support hardware fma. Not using fma makes some software incorrect. That software will run slower on machines without hardware fma .. yet still work correctly using software fma.

@oscardssmith
Member Author

One thing I'm not sure about is how to test the performance of current Base without fma. The problem is that currently expm1 is a call to a C library, so even if I make Julia think I don't have fma, libc will still use it for expm1. This matters because I don't think a fast Float64 expm1 is possible without fma. I could reduce the impact for Float32 by making exthorner do the math in Float64, but that's far from a general solution.

@JeffreySarnoff
Contributor

> I don't think a fast Float64 expm1 is possible without fma.

What is made better by intentionally hobbling the performance or reducing the correctness of essential Float64 elementary functions?

On platforms where there is no hardware fma, one may either emulate it well [11x when called within the elementary function, so likely <11x for the whole function] or provide a less exact and less impactful pseudo-fma [perhaps within 0.5ulp of faithful and only 7x slower]. Of the two, it seems to me that preserving cross-platform ~faithful or better elementary functions is more important for Julia.

@oscardssmith
Member Author

I think you misunderstand me. What I'm saying is that testing FMA-less behavior is not trivial on a machine with FMA, since even if you disable it for Julia, it won't be disabled for C libraries.

@JeffreySarnoff
Contributor

> I think you misunderstand me.

I did understand -- my previous post was in support of your position.
(any way you slice it, the sandwich tastes best with fma .. imo).

@JeffreySarnoff
Contributor

From the AccurateAlgorithms.jl docs, The Fused-Multiply-Add is worth a look.

@chriselrod
Contributor

chriselrod commented Sep 15, 2020

> (any way you slice it, the sandwich tastes best with fma .. imo).

IMO as well.

But I wonder if @baggepinnen would mind sharing his thoughts, so that we get at least some feedback from someone who may be adversely affected.

Also, I would be a fan of better support for feature detection. Why can't we accurately detect FMA_NATIVE, e.g. FMA3 or FMA4 instruction sets on x86, and use these to decide between implementations?

@baggepinnen
Contributor

> But I wonder if @baggepinnen would mind sharing his thoughts, so that we get at least some feedback from someone who may be adversely affected.

I'm all for the speed. The biggest problem for me was that the slowdown was surprising and without your help I would have spent a lot of time trying to figure out what was wrong. Knowing what the problem was, I could simply avoid using the @avx macro in the problematic function.

Since the difference in performance between different CPUs was rather dramatic, I think it would be important to raise awareness of the issue if it is to be present more generally.

@oscardssmith
Member Author

At this point, I'm pretty convinced that fma can't be perfectly emulated at all efficiently. The thing that sinks all attempts I've made is fma(floatmax(Float64), 2.0, -floatmax(Float64)). Given this, is this PR mergeable with current FMA speed? I think the right balance might be to check whether FMA exists, and if not, use Float64 for the internals of the Float32 version. That said, I think the current strategy, while not perfect, has the advantage of being simple and is much better for at least 90% of people. Furthermore, I think it's probably fair to optimize performance for faster machines, since they are the ones likely doing the most number crunching.

@JeffreySarnoff
Contributor

Essentially this same conclusion has been reached more than once.
Prioritize using Julia's fma where fma instructions are helpful (as they are here). Improving the soft fma implementation belongs in a different PR.

@JeffreySarnoff
Contributor

On Slack #triage I wrote:

> It would be good to get "Faster sinh and cosh for Float32, Float64" buttoned into 1.6. It is ready, and now quite clean.

@StefanKarpinski StefanKarpinski added the triage This should be discussed on a triage call label Sep 30, 2020
@simonbyrne
Contributor

> At this point, I'm pretty convinced that fma can't be perfectly emulated at all efficiently.

For this sort of thing we don't need a perfect emulation, we just need a few bits of extra precision. To do that we can use the same double-double tricks we use elsewhere (either Veltkamp splitting or just zero out the trailing bits).

For these polynomial evaluations we have some advantage in that (a) you only need to split the variable x once, and (b) at each stage the order of the computation should be roughly the order of the leading coefficient, giving you a rough estimate of magnitude. Then by carefully doing Dekker-style double-double tricks, you should be able to get a reasonably efficient implementation.
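A minimal sketch of the Dekker/Veltkamp trick being described, for readers unfamiliar with it; the names and structure are illustrative, not code from this PR. In a Horner loop, only the variable x needs to be split, and only once, which is point (a) above.

# Veltkamp split: hi keeps the top ~26 significand bits, lo the rest, with x == hi + lo.
function veltkamp_split(x::Float64)
    t  = 134217729.0 * x          # 2^27 + 1; the scaling can overflow for huge x (ignored in this sketch)
    hi = t - (t - x)
    return hi, x - hi
end

# Error-free product without an fma instruction: p + err == a*b exactly (barring over/underflow).
function two_prod_nofma(a::Float64, b::Float64)
    p = a * b
    ahi, alo = veltkamp_split(a)
    bhi, blo = veltkamp_split(b)
    err = ((ahi*bhi - p) + ahi*blo + alo*bhi) + alo*blo
    return p, err
end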

@musm
Contributor

musm commented Oct 1, 2020

I don't agree that we should promote Float32 up and then back down. If a Float32 implementation is requested, the entire calculation should be done in Float32, especially for these 'core' elementary functions.

@StefanKarpinski
Member

> If a Float32 implementation is requested, the entire calculation should be done in Float32.

Why?

@oscardssmith
Member Author

oscardssmith commented Oct 1, 2020

Note that we already do this for other special functions; IIRC log and cbrt both use Float64 for their Float32 kernels. The reason is pretty simple: compensated arithmetic is roughly 4x slower than native, but Float64 is only about 2x slower than Float32. The conversion overhead is not negligible, but it only takes a couple of operations before the switch is worth it.

@chriselrod
Contributor

> but Float64 is only about 2x slower than Float32

Float64 isn't slower than Float32, but you'd have half the throughput (half as many elements per vector) if you're doing SIMD.

@JeffBezanson
Member

Triage is 👍, but we should leave a note on the issue about fixing fma detection, so that once that's fixed, this can be updated.

@simonbyrne
Contributor

Okay, I haven't had a chance to review the actual approach yet. Will try to do it this week.

@musm musm removed the triage This should be discussed on a triage call label Oct 2, 2020
@musm
Contributor

musm commented Oct 2, 2020

> Note that we already do this for other special functions; IIRC log and cbrt both use Float64 for their Float32 kernels.

No, there's a separate Float32 table for the log function in Base, and cbrt dispatches on custom 'kernels' for Float32.

I'm not aware of general elementary math function libraries that take the approach of promotion in the kernels, even though it is a valid one. In principle I don't think it is the correct way to go about things in Base. We sometimes do promotion for more generic special functions over at SpecialFunctions.jl, but this is more because we either simply don't have specialized code for that case, the functions are too specialized to bother with a more specific implementation, or we lack the developer resources for a more efficient Float32 method.

In any case, since we decided it's OK to 'neglect' the non-fma case in the triage call, I don't think the extra precision in the non-fma case is as relevant anymore.

@oscardssmith
Member Author

oscardssmith commented Oct 2, 2020

@musm What I'm doing is exactly the same as what cbrt does.

@inline function _improve_cbrt(x::Float32, t::Float32)
    xx = Float64(x)
    tt = Float64(t)
    tt3 = tt^3
    tt *= (2*xx + tt3)/(x + 2*tt3)
    tt3 = tt^3
    tt *= (2*xx + tt3)/(x + 2*tt3)
    return Float32(tt)
end

is the cbrt code I'm referring to (it starts on line 80). Like that, my sinh kernel has a specialized Float32 implementation that promotes to Float64 and does a much simpler computation than the Float64 version. Note that I am not doing this for accuracy in the non-FMA case, but mainly because it is 30% faster than my previous approach (for FMA architectures; even more for non-FMA).
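To make the pattern concrete, here is a hedged sketch of the promote-then-round approach being described; the function name, the truncated Taylor coefficients, and the small-argument restriction are illustrative, not the PR's actual minimax kernel.

# Illustrative Float32 kernel that does its arithmetic in Float64 and rounds once
# at the end. A truncated Taylor series for sinh(x)/x is used purely for
# demonstration; it is only adequate for small |x|, unlike the PR's minimax polynomial.
function sinh32_kernel_sketch(x::Float32)
    xd = Float64(x)
    x2 = xd * xd
    p  = evalpoly(x2, (1.0, 1/6, 1/120, 1/5040, 1/362880))
    return Float32(xd * p)   # e.g. sinh32_kernel_sketch(0.25f0) ≈ sinh(0.25f0)
end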

@oscardssmith
Member Author

If 1.6 is branching tomorrow, can we please get this in for that?

@oscardssmith
Member Author

@simonbyrne any chance you can review this soon? If not, is there anyone else that could do the review for this?

@oscardssmith
Member Author

Note that this is also a bug fix, since currently:

julia> Float32(sinh(0.45764074))-sinh(0.45764074f0)
5.9604645f-8

julia> eps(.45f0)
2.9802322f-8

This is a 2 ULP difference, which is greater than what we typically tolerate for special functions.

@JeffreySarnoff
Contributor

JeffreySarnoff commented Oct 24, 2020

The report of an unsuccessful check seems to be from an earlier build attempt. If this is just buildbot being cranky and the exception vanishes with a rebuild, or we understand why it appears and that it has no bearing on [does not implicate] the state of this PR, all is well ... wherewith @oscardssmith receives this tip'o 🎩. If the code is clear to merge, then to code clearly is to merge.

@JeffreySarnoff
Contributor

@oscardssmith Whatever bothers buildbot now .. has quintupled the error rate.

@oscardssmith
Member Author

@JeffreySarnoff It turns out that pushing code at 3am can lead to removing a decimal point and making your coefficients off by a factor of 10^13.

@oscardssmith
Member Author

At this point, the only error is InexactError: check_top_bit(UInt64, -1), which is pretty clearly unrelated (that method is in no way touched by this PR).

@oscardssmith
Member Author

@JeffBezanson can you re-run the package_freebsd64 build? It timed out.

@Keno
Member

Keno commented Oct 29, 2020

Rerunning CI, and then will merge once it passes.

@Keno Keno closed this Oct 29, 2020
@Keno Keno reopened this Oct 29, 2020
@JeffreySarnoff
Contributor

Do we know what to make of the tester_linux details, given that the macOS and Windows testers pass?

@Keno
Member

Keno commented Oct 29, 2020

Yeah, Distributed has been broken; those failures are fine.

@oscardssmith
Member Author

@Keno, given that, time to merge?
