
Faster sinh and cosh for Float32, Float64 #37426

Merged: 8 commits, Oct 30, 2020

Conversation

oscardssmith
Member

Use minimax polynomials to reduce the number of cases needed. These functions are only accurate to <1.5 ULPs, but the old functions were over 2x slower at 1.3 ULPs, so I view this as an acceptable trade-off. The biggest question I have about this is whether exthorner belongs here. It arguably should live somewhere where it gets exported, or at least is more widely available, since extended-precision polynomial evaluation makes special functions much easier to code. Thoughts?
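For readers who haven't opened the diff, here is a minimal sketch of a compensated ("extended-precision") Horner evaluation of the kind exthorner performs, assuming fma is available; the function name, tuple interface, and comments are illustrative, not the PR's exact code.

# Sketch of a compensated Horner scheme (illustrative, not the PR's exthorner).
# hi follows the ordinary Horner recurrence; lo accumulates the recovered
# rounding errors, so hi + lo approximates the polynomial to roughly twice
# the working precision.
function exthorner_sketch(x, p::Tuple)
    hi, lo = p[end], zero(x)
    for i in length(p)-1:-1:1
        prod = hi * x
        err  = fma(hi, x, -prod)      # exact rounding error of hi*x
        hi   = prod + p[i]
        err += p[i] - (hi - prod)     # error of the add (assumes |prod| >= |p[i]|)
        lo   = muladd(lo, x, err)     # carry the accumulated error through the recurrence
    end
    return hi, lo
end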

@oscardssmith
Member Author

I think this is good to go other than fixing the doctest.

@oscardssmith mentioned this pull request Sep 7, 2020
for i in length(p)-1:-1:1
    pi = p[i]
    prod = hi*x
    err = fma(hi, x, -prod)
Member

Should we condition use of this on FMA_NATIVE? (Although it's not currently accurate... #33011)

Member Author

I think this should be unconditional. If this is a muladd, LLVM will constant-fold err to 0. Also, are there any relevant platforms without a fast fma? As I understand it, it's required to implement IEEE 754 (2008).

Member

Makes sense. We should probably discuss to what extent we want to assume fast fma.

Contributor

@chriselrod commented Sep 9, 2020

I've gotten complaints about fma causing bad performance before, e.g.: JuliaSIMD/LoopVectorization.jl#134 (comment)
Some people are on old architectures.

I would personally prefer that Julia assume fast fma though, because:

julia> Base.Math.FMA_NATIVE
false

julia> run(pipeline(`cat /proc/cpuinfo`, `grep fma`, `head -n1`))
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear flush_l1d
Base.ProcessChain(Base.Process[Process(`cat /proc/cpuinfo`, ProcessSignaled(13)), Process(`grep fma`, ProcessSignaled(13)), Process(`head -n1`, ProcessExited(0))], Base.DevNull(), Base.DevNull(), Base.DevNull())

That is, Julia says FMA_NATIVE is false, but it is in fact true.
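As an aside, a runtime probe along these lines can show whether muladd actually fuses on the current machine; this is an illustrative sketch, not Base's FMA_NATIVE definition (whose brokenness is what #33011 tracks).

# Illustrative runtime probe, not Base's actual detection code. With a true
# fused multiply-add, the 2^-104 term of x*x survives the subtraction; with a
# separate multiply and add it is rounded away first.
fma_probe(x) = muladd(x, x, -nextfloat(one(x), 2)) != zero(x)
fma_probe(nextfloat(1.0))   # expected: true if muladd fuses, false if it lowers to mul + add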

@JeffBezanson JeffBezanson added the triage This should be discussed on a triage call label Sep 9, 2020
@JeffreySarnoff
Contributor

Most code I write assumes a fast fma. Absent fast fma, arithmetic slows and inaccuracies are more difficult to manage. I made peace with the perspective that fast fma is available before developing DoubleFloats.

@oscardssmith
Member Author

@JeffreySarnoff Thank you for writing DoubleFloats; I definitely stole a lot of code ideas from it for this.

@JeffreySarnoff
Contributor

JeffreySarnoff commented Sep 10, 2020

> Thank you for writing DoubleFloats

Elated; you are most welcome.

> I definitely stole a lot of code ideas from it for this.

My work, your work, we are Julia at work.

@musm
Contributor

musm commented Sep 10, 2020

One thing we should decide upon is whether fma is acceptable to use in the math library, as a general rule.

I recall we had discussions regarding its use with @simonbyrne and @pkofod a while back, regarding assumptions on whether machine fma is available or not. Back then it seemed unacceptable to have a slow fallback in case of non-machine fma for Base, but the situation may have changed as more modern CPUs include it natively. Currently the libm is accurate and relatively fast regardless of whether the machine has native fma or not, since the code doesn't make assumptions about fused multiply-add availability. As a result it's pretty difficult to improve upon the existing implementations. Now, if we can assume machine fma, it makes our lives a lot easier. I do think we should decide on an acceptable minimum standard for the math library.

I took another look (and I think this is why it was trickier a few years back): native fma support isn't as old as I initially assumed, but it is perhaps old enough that we can simply assume it (ref 1) (ref 2) (it looks like at least any CPU from the last 5 years should include it).

I guess it can be tricky to go from something that "always works" to something that "works in most places and is best in most places" (quoting @pkofod). OTOH, I feel OK about assuming it.

@oscardssmith
Member Author

Would it be possible to make fma use a more complicated fallback for machines that don't have it natively? I agree that we want this code to work everywhere, but IMO it's fine if our code doesn't have optimal performance on 5-year-old systems.

@musm
Contributor

musm commented Sep 10, 2020

fma falls back to a software implementation to retain exact rounding, which makes it super slow.

5-year-old CPUs, but the point stands. Back then the situation was a bit different, but yes, I agree that we could probably drop that. Best to discuss in triage?

@JeffBezanson
Member

Oscar mentioned in triage that there is a faster software fma implementation we should be using. If we can get that working, the decision is easy. Triage is mostly OK with this anyway, though.

@JeffBezanson JeffBezanson removed the triage This should be discussed on a triage call label Sep 10, 2020
@oscardssmith
Member Author

oscardssmith commented Sep 11, 2020

I'm not 100% sure this is correct, but it's passed the tests I've thrown at it. It is 22x slower for Float32 and 11x slower for Float64.


# Split x into high and low parts such that x == hi + lo and hi has enough
# trailing zero bits that products of hi terms are exact.
@inline function double(x::Float64)
    mask = -(UInt64(1)<<26)
    hi = reinterpret(Float64, reinterpret(UInt64, x) & mask)
    return hi, x-hi
end

@inline function double(x::Float32)
    mask = -(UInt32(1)<<12)
    hi = reinterpret(Float32, reinterpret(UInt32, x) & mask)
    return hi, x-hi
end

# Error-free sum, assuming |a| >= |b|.
@inline function two_hilo_sum(a, b)
    s = a + b
    return s, b - (s - a)
end

# Software emulation of fma via double-double arithmetic.
@inline function sfma(x, y, z)
    xhi, xlo = double(x)
    yhi, ylo = double(y)
    prod = two_hilo_sum(xhi*yhi, z)
    return prod[1] + (xlo*ylo + prod[2] + xlo*yhi + xhi*ylo)
end
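As a quick illustration of the limitation discussed further below (using the definitions above): the split product xhi*yhi can overflow even when the exact fused result is finite.

sfma(floatmax(Float64), 2.0, -floatmax(Float64))   # expected: NaN, since xhi*yhi overflows to Inf
fma(floatmax(Float64), 2.0, -floatmax(Float64))    # floatmax(Float64) == 1.7976931348623157e308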

@pkofod
Contributor

pkofod commented Sep 15, 2020

As @musm mentions above, we discussed this and were both surprised at how recent ubiquitous fma actually is; I had assumed it had been around for at least five years longer. We can assume that many users have hardware older than this, especially in economically less advanced areas of the world.

I'm not 100% against using fma, but I'm not 100% for it either, if it means someone with not-so-old hardware is excluded from using Julia with reasonable performance.

@JeffreySarnoff
Contributor

Not using fma does not make Julia more performant on machines that do not support hardware fma. Not using fma makes some software incorrect. That software will run slower on machines without hardware fma .. yet still work correctly using software fma.

@oscardssmith
Member Author

One thing I'm not sure about is how to test the performance of current Base without fma. The problem is that currently expm1 is a call to a C library, so even if I make Julia think I don't have fma, libc will still use it for expm1. This matters because I don't think a fast Float64 expm1 is possible without fma. I could reduce the impact for Float32 by making exthorner do the math in Float64, but that's far from a general solution.

@JeffreySarnoff
Contributor

> I don't think a fast Float64 expm1 is possible without fma.

What is made better by intentionally hobbling the performance or reducing the correctness of essential Float64 elementary functions?

On platforms where there is no hardware fma, one may either emulate it well [11x when called within the elementary function, so likely <11x for the whole function] or provide a less exact and less impactful pseudo-fma [perhaps within 0.5ulp of faithful and only 7x slower]. Of the two, it seems to me that preserving cross-platform ~faithful or better elementary functions is more important for Julia.

@oscardssmith
Member Author

I think you misunderstand me. What I'm saying is that testing FMA-less behavior is not trivial on a machine with FMA, since even if you disable it for Julia, it won't be disabled for C libraries.

@JeffreySarnoff
Contributor

> I think you misunderstand me.

I did understand -- my previous post was in support of your position.
(any way you slice it, the sandwich tastes best with fma .. imo).

@JeffreySarnoff
Contributor

From the AccurateAlgorithms.jl docs, The Fused-Multiply-Add is worth a look.

@chriselrod
Contributor

chriselrod commented Sep 15, 2020

> (any way you slice it, the sandwich tastes best with fma .. imo).

IMO as well.

But I wonder if @baggepinnen would mind sharing his thoughts, so that we get at least some feedback from someone who may be adversely affected.

Also, I would be a fan of better support for feature detection. Why can't we accurately detect FMA_NATIVE, e.g. FMA3 or FMA4 instruction sets on x86, and use these to decide between implementations?

@baggepinnen
Contributor

> But I wonder if @baggepinnen would mind sharing his thoughts, so that we get at least some feedback from someone who may be adversely affected.

I'm all for the speed. The biggest problem for me was that the slowdown was surprising and without your help I would have spent a lot of time trying to figure out what was wrong. Knowing what the problem was, I could simply avoid using the @avx macro in the problematic function.

Since the difference in performance between different CPUs was rather dramatic, I think it would be important to raise awareness of the issue if it is to be present more generally.

@oscardssmith
Member Author

At this point, I'm pretty convinced that fma can't be perfectly emulated at all efficiently. The thing that sinks all attempts I've made is fma(floatmax(Float64), 2.0, -floatmax(Float64)). Given this, is this PR mergeable with current FMA speed? I think the right balance might be to check whether FMA exists, and if not, use Float64 for the internals of the Float32 version. That said, I think the current strategy, while not perfect, has the advantage of being simple and is much better for at least 90% of people. Furthermore, I think it's probably fair to optimize performance for faster machines, since they are the ones likely doing the most number crunching.

@JeffreySarnoff
Contributor

Essentially this same conclusion has been reached more than once.
Prioritize using Julia's fma where fma instructions are helpful (as they are here). Improving the soft fma implementation belongs in a different PR.

@JeffreySarnoff
Contributor

On Slack #triage I wrote:

> It would be good to get "Faster sinh and cosh for Float32, Float64" buttoned into 1.6. It is ready, and now quite clean.

@StefanKarpinski StefanKarpinski added the triage This should be discussed on a triage call label Sep 30, 2020
@simonbyrne
Contributor

> At this point, I'm pretty convinced that fma can't be perfectly emulated at all efficiently.

For this sort of thing we don't need a perfect emulation, we just need a few bits of extra precision. To do that we can use the same double-double tricks we use elsewhere (either Veltkamp splitting or just zero out the trailing bits).

For these polynomial evaluations we have some advantage in that (a) you only need to split the variable x once, and (b) at each stage the order of the computation should be roughly the order of the leading coefficient, giving you a rough estimate of magnitude. Then by carefully doing Dekker-style double-double tricks, you should be able to get a reasonably efficient implementation.
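A minimal sketch of the Dekker/Veltkamp trick being described, for readers unfamiliar with it; the names and structure are illustrative, not code from this PR. In a Horner loop, only the variable x needs to be split, and only once, which is point (a) above.

# Veltkamp split: hi keeps the top ~26 significand bits, lo the rest, with x == hi + lo.
function veltkamp_split(x::Float64)
    t  = 134217729.0 * x          # 2^27 + 1; the scaling can overflow for huge x (ignored in this sketch)
    hi = t - (t - x)
    return hi, x - hi
end

# Error-free product without an fma instruction: p + err == a*b exactly (barring over/underflow).
function two_prod_nofma(a::Float64, b::Float64)
    p = a * b
    ahi, alo = veltkamp_split(a)
    bhi, blo = veltkamp_split(b)
    err = ((ahi*bhi - p) + ahi*blo + alo*bhi) + alo*blo
    return p, err
end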

@musm
Contributor

musm commented Oct 1, 2020

I don't agree that we should promote Float32 up and then back down. If a Float32 implementation is requested, the entire calculation should be done in Float32, especially for these 'core' elementary functions.

@StefanKarpinski
Member

> If a Float32 implementation is requested, the entire calculation should be done in Float32.

Why?

@oscardssmith
Member Author

oscardssmith commented Oct 1, 2020

Note that we already do this for other special functions; IIRC log and cbrt both use Float64 for their Float32 kernels. The reason is pretty simple: compensated arithmetic is roughly 4x slower than native, but Float64 is only about 2x slower than Float32. The conversion overhead is not negligible, but it only takes a couple of operations before the switch is worth it.

@chriselrod
Contributor

> but Float64 is only about 2x slower than Float32

Float64 isn't slower than Float32, but you'd have half the throughput (half as many elements per vector) if you're doing SIMD.

@JeffBezanson
Member

Triage is 👍, but we should leave a note on the issue about fixing fma detection, so that once that's fixed, this can be updated.

@simonbyrne
Contributor

Okay, I haven't had a chance to review the actual approach yet. Will try to do it this week.

@musm musm removed the triage This should be discussed on a triage call label Oct 2, 2020
@musm
Contributor

musm commented Oct 2, 2020

> Note that we already do this for other special functions; IIRC log and cbrt both use Float64 for their Float32 kernels.

No, there's a separate Float32 table for the log function in Base, and cbrt dispatches on custom 'kernels' for Float32.

I'm not aware of general elementary math function libraries that take the approach of promotion in the kernels, even though it is a valid one. In principle I don't think it is the correct way to go about things in Base. We sometimes do promotion for more generic special functions over at SpecialFunctions.jl, but this is more because we either simply don't have specialized code for that case, the functions are too specialized to bother with a more specific implementation, or we lack the developer resources for a more efficient Float32 method.

In any case, since we decided it's OK to 'neglect' the non-fma case in the triage call, I don't think the extra precision in the non-fma case is as relevant anymore.

@oscardssmith
Member Author

oscardssmith commented Oct 2, 2020

@musm What I'm doing is exactly the same as what cbrt does.

@inline function _improve_cbrt(x::Float32, t::Float32)
    xx = Float64(x)
    tt = Float64(t)
    tt3 = tt^3
    tt *= (2*xx + tt3)/(x + 2*tt3)
    tt3 = tt^3
    tt *= (2*xx + tt3)/(x + 2*tt3)
    return Float32(tt)
end

is the cbrt code I'm referring to (it starts on line 80). Like that, my sinh kernel has a specialized Float32 implementation that promotes to Float64 and does a much simpler computation than the Float64 version. Note that I am not doing this for accuracy in the non-FMA case, but mainly because it is 30% faster than my previous approach (for FMA architectures; even more for non-FMA).
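To make the pattern concrete, here is a hedged sketch of the promote-then-round approach being described; the function name, the truncated Taylor coefficients, and the small-argument restriction are illustrative, not the PR's actual minimax kernel.

# Illustrative Float32 kernel that does its arithmetic in Float64 and rounds once
# at the end. A truncated Taylor series for sinh(x)/x is used purely for
# demonstration; it is only adequate for small |x|, unlike the PR's minimax polynomial.
function sinh32_kernel_sketch(x::Float32)
    xd = Float64(x)
    x2 = xd * xd
    p  = evalpoly(x2, (1.0, 1/6, 1/120, 1/5040, 1/362880))
    return Float32(xd * p)   # e.g. sinh32_kernel_sketch(0.25f0) ≈ sinh(0.25f0)
end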

@oscardssmith
Member Author

If 1.6 is branching tomorrow, can we please get this in for that?

@oscardssmith
Member Author

@simonbyrne any chance you can review this soon? If not, is there anyone else that could do the review for this?

@oscardssmith
Member Author

Note that this is also a bug fix, since currently:

julia> Float32(sinh(0.45764074))-sinh(0.45764074f0)
5.9604645f-8

julia> eps(.45f0)
2.9802322f-8

This is a 2 ULP difference, which is greater than what we typically tolerate for special functions.

@JeffreySarnoff
Contributor

JeffreySarnoff commented Oct 24, 2020

The report of an unsuccessful check seems to be from an earlier build attempt. If this is just buildbot being cranky and the exception vanishes with a rebuild, or we understand why it appears and that it has no bearing on [does not implicate] the state of this PR, all is well ... wherewith @oscardssmith receives this tip'o 🎩. If the code is clear to merge, then to code clearly is to merge.

@JeffreySarnoff
Contributor

@oscardssmith Whatever bothers buildbot now .. has quintupled the error rate.

@oscardssmith
Member Author

@JeffreySarnoff It turns out that pushing code at 3am can lead to removing a decimal point and making your coefficients off by a factor of 10^13.

@oscardssmith
Member Author

At this point, the only error is InexactError: check_top_bit(UInt64, -1), which is pretty clearly unrelated (that method is in no way touched by this PR).

@oscardssmith
Member Author

@JeffBezanson can you re-run the package_freebsd64 build? It timed out.

@Keno
Member

Keno commented Oct 29, 2020

Rerunning CI, and then will merge once it passes.

@Keno Keno closed this Oct 29, 2020
@Keno Keno reopened this Oct 29, 2020
@JeffreySarnoff
Contributor

Do we know what to make of the tester_linux details, given that the macOS and Windows testers pass?

@Keno
Member

Keno commented Oct 29, 2020

Yeah, Distributed has been broken; those failures are fine.

@oscardssmith
Member Author

@Keno, given that, time to merge?
