Faster sinh and cosh for Float32, Float64 #37426
Conversation
Use minimax polynomials to reduce the number of cases needed. These functions are only accurate to <1.5 ULPs, but the old functions were over 2x slower at 1.3 ULPs, so I view this as an acceptable improvement. The biggest question I have about this is whether `exthorner` belongs here. It arguably should live somewhere where it would get exported, or at least be more widely available, as extended-precision polynomials make special functions much easier to code. Thoughts?
I think this is good to go other than fixing the doctest.
```julia
for i in length(p)-1:-1:1
    pi = p[i]
    prod = hi*x
    err = fma(hi, x, -prod)
```
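For readers following along, here is a minimal sketch of the kind of compensated Horner evaluation this diff excerpt comes from. The name `exthorner_sketch` and the exact update formulas are illustrative, not the PR's code; the key point is that `fma(hi, x, -prod)` recovers the exact rounding error of the product `hi*x`, so a separate low-order term can carry roughly an extra word of precision.

```julia
# Illustrative sketch of an extended-precision (compensated) Horner scheme.
# `p` holds coefficients with the constant term first (p[1]) and the leading
# coefficient last (p[end]); returns (hi, lo) with the value ≈ hi + lo.
function exthorner_sketch(x::T, p::Tuple) where {T<:AbstractFloat}
    hi, lo = p[end], zero(T)
    for i in length(p)-1:-1:1
        pi = p[i]
        prod = hi * x
        err  = fma(hi, x, -prod)                    # exact rounding error of hi*x
        hi   = pi + prod                            # rounded sum (assumes |pi| ≥ |prod|)
        lo   = fma(lo, x, (prod - (hi - pi)) + err) # carry the sum and product errors
    end
    return hi, lo
end
```

For example, `hi, lo = exthorner_sketch(0.1, (0.0, 1.0, 0.5))` evaluates `x + 0.5x^2`, and `hi + lo` is more accurate than plain Horner at the cost of one `fma` and a few extra additions per step.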
Should we condition use of this on FMA_NATIVE? (Although it's not currently accurate... #33011)
I think this should be unconditional. If this is a muladd, LLVM will constant-fold `err` to 0. Also, are there any relevant platforms without a fast fma? As I understand it, it's required to implement IEEE 754 (2008).
Makes sense. We should probably discuss to what extent we want to assume fast fma.
I've gotten complaints about fma causing bad performance before, e.g. JuliaSIMD/LoopVectorization.jl#134 (comment).
Some people are on old architectures.
I would personally prefer that Julia assume fast fma, though, because:

```julia
julia> Base.Math.FMA_NATIVE
false

julia> run(pipeline(`cat /proc/cpuinfo`, `grep fma`, `head -n1`))
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear flush_l1d
Base.ProcessChain(Base.Process[Process(`cat /proc/cpuinfo`, ProcessSignaled(13)), Process(`grep fma`, ProcessSignaled(13)), Process(`head -n1`, ProcessExited(0))], Base.DevNull(), Base.DevNull(), Base.DevNull())
```

That is, Julia says `FMA_NATIVE` is false, but it is in fact true (note `fma` among the cpuinfo flags).
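As a side note, a crude empirical check (a hypothetical helper, not anything in Base) is to time `fma` against `muladd` on the same recurrence; if the runtime is actually emitting a fused instruction for `fma`, the two loops should cost about the same, whereas a large ratio suggests software emulation:

```julia
# Hypothetical check: compare the cost of fma against muladd on the same
# loop-carried recurrence. A ratio near 1 suggests hardware fma is in use;
# a ratio of 10x or more suggests fma is being emulated in software.
function fma_speed_ratio(n::Int = 10^7)
    x = 1.0000001
    s1 = 0.0
    t_fma = @elapsed for _ in 1:n
        s1 = fma(s1, x, 1.0)
    end
    s2 = 0.0
    t_muladd = @elapsed for _ in 1:n
        s2 = muladd(s2, x, 1.0)
    end
    # `check` keeps the loops from being optimized away entirely
    return (ratio = t_fma / t_muladd, check = s1 - s2)
end
```

This is only a rough proxy; the proper fix is accurate CPU feature detection, as discussed below.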
Most code I write assumes a fast fma. Absent fast fma, arithmetic slows and inaccuracies are more difficult to manage. I made peace with the perspective that fast fma is available before developing
@JeffreySarnoff Thank you for writing
One thing we should decide upon is whether to assume machine fma. I recall we had discussions with @simonbyrne and @pkofod a while back regarding assumptions on whether machine fma is available or not. Back then it seemed not acceptable to have a slow fallback in case of non-machine fma.

I took another look (and I think this is why it was more tricky a few years back): native fma support isn't as old as I initially assumed, but it's perhaps old enough that we can simply assume it (ref 1) (ref 2) (looks like at least any CPU within the last 5 years should include it). I guess it can be tricky to go from something that "always works" to something that "works in most places and is best in most places" (quoting @pkofod). OTOH, I feel ok about assuming it.
Would it be possible to make
5-year-old CPUs, but the point stands. Back then the situation was a bit different, but yeah, I agree that we could probably drop that. Best to discuss in triage, though?
Oscar mentioned in triage that there is a faster software fma implementation we should be using. If we can get that working, the decision is easy. Triage is mostly OK with this anyway, though.
I'm not 100% sure this is correct, but it's passed the tests I've thrown at it. It is 22x slower for
As @musm mentions above, we discussed this and were both surprised at how recent ubiquitous fma actually is. I had assumed at least five more years. We can assume that many users have hardware that is older than this, especially in economically less advanced areas of the world. I'm not 100% against using fma, but I'm not 100% for it either if it means someone with not-so-old hardware is excluded from using Julia with reasonable performance.
Not using fma does not make Julia more performant on machines that do not support hardware fma. Not using fma makes some software incorrect. That software will run slower on machines without hardware fma .. yet still work correctly using software fma.
One thing I'm not sure about is how to test the performance of current Base without fma. The problem is that currently `expm1` is a call to a C library, so even if I make Julia think I don't have fma, libc will still use it for `expm1`. This matters because I don't think a fast Float64 `expm1` is possible without fma. I could reduce the impact for Float32 by making `exthorner` do the math in Float64, but that's far from a general solution.
What is made better by intentionally hobbling the performance or reducing the correctness of essential Float64 elementary functions? On platforms where there is no hardware fma, one may either emulate it well [11x when called within the elementary function, so likely <11x for the whole function] or provide a less exact and less impactful pseudo-fma [perhaps within 0.5 ulp of faithful and only 7x slower]. Of the two, it seems to me that preserving cross-platform ~faithful or better elementary functions is more important for Julia.
I think you misunderstand me. What I'm saying is that testing FMA-less behavior is not trivial on a machine with FMA, since even if you disable it for Julia, it won't be disabled for C libraries.
I did understand -- my previous post was in support of your position.
From the
IMO as well. But I wonder if @baggepinnen would mind sharing his thoughts, so that we get at least some feedback from someone who may be adversely affected. Also, I would be a fan of better support for feature detection. Why can't we accurately detect FMA_NATIVE, e.g. via the FMA3 or FMA4 instruction sets on x86, and use these to decide between implementations?
I'm all for the speed. The biggest problem for me was that the slowdown was surprising, and without your help I would have spent a lot of time trying to figure out what was wrong. Knowing what the problem was, I could simply avoid using the function in question.

Since the difference in performance between different CPUs was rather dramatic, I think it would be important to raise awareness of the issue if it is to be present more generally.
At this point, I'm pretty convinced that
Essentially this same conclusion has been reached more than once.
On Slack #triage I wrote: It would be good to get
For this sort of thing we don't need a perfect emulation, we just need a few bits of extra precision. To do that we can use the same double-double tricks we use elsewhere (either Veltkamp splitting or just zero out the trailing bits). For these polynomial evaluations we have some advantage in that (a) you only need to split the variable
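For reference, here is a sketch of the kind of splitting being described, under the assumption that no hardware fma is available. The splitting constant below is specific to Float64 (it would be 2^12 + 1 for Float32), and the function names are made up for illustration:

```julia
# Veltkamp splitting: write a Float64 `a` as hi + lo where hi and lo each fit
# in 26 significant bits, so products of the halves are exact in Float64.
function veltkamp_split(a::Float64)
    t  = 134217729.0 * a      # 134217729 = 2^27 + 1
    hi = t - (t - a)
    lo = a - hi
    return hi, lo
end

# Dekker-style exact product without fma: a*b == p + e (barring overflow).
function two_prod_no_fma(a::Float64, b::Float64)
    p = a * b
    ahi, alo = veltkamp_split(a)
    bhi, blo = veltkamp_split(b)
    e = ((ahi * bhi - p) + ahi * blo + alo * bhi) + alo * blo
    return p, e
end
```

In a Horner loop like the one above, `two_prod_no_fma(hi, x)` could stand in for the `prod`/`err` pair that `fma` currently provides, and presumably only `x` needs splitting at runtime since the coefficients can be pre-split.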
I don't agree we should promote Float32 up and then back down. If a Float32 implementation is requested, the entire calculation should be done in Float32, especially for these 'core' elementary functions.
Why?
Note that we already do this for other special functions. IIRC, log and cbrt both use Float64 for their Float32 kernels. The reason is pretty simple: compensated arithmetic is roughly 4x slower than native, but Float64 is only about 2x slower than Float32. The conversion overhead is not negligible, but it only takes a couple of operations before the switch is worth it.
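To make the tradeoff concrete, here is a hypothetical small-argument kernel that produces a Float32 result by doing the arithmetic in Float64 and rounding once at the end. The coefficients are simply the Taylor series for sinh, not the PR's minimax polynomial:

```julia
# Illustrative Float32 kernel evaluated in Float64: sinh(x) ≈ x + x^3/6 +
# x^5/120 + x^7/5040 for small |x|. One widening, one narrowing, and no
# compensated arithmetic needed for Float32 accuracy.
function sinh_kernel32_sketch(x::Float32)
    xd = Float64(x)
    x2 = xd * xd
    p  = evalpoly(x2, (1.0, 1/6, 1/120, 1/5040))
    return Float32(xd * p)
end
```

Because Float64 carries 29 more significand bits than Float32, the single final rounding keeps the result close to correctly rounded for sufficiently small arguments.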
Triage is 👍, but we should leave a note on the issue about fixing fma detection, so that once that's fixed, this can be updated.
Okay, I haven't had a chance to review the actual approach yet. Will try to do it this week.
No, there's a separate

I'm not aware of general elementary math function libraries that take the approach of promotion in the kernels, even though it is a valid one. In principle I don't think it is the correct way to go about it for things in Base. We sometimes do promotion for more generic special functions over at

In any case, since we decided it's ok to 'neglect' the non-fma case in the triage call, I don't think the extra precision in the non-fma case is as relevant anymore.
@musm What I'm doing is exactly the same as what
is the
If 1.6 is branching tomorrow, can we please get this in for that?
@simonbyrne any chance you can review this soon? If not, is there anyone else that could do the review for this?
Note that this is also a bug fix, since currently
This is a 2 ULP difference, which is greater than what we typically tolerate for special functions.
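For readers wondering how such ULP figures are obtained, one simple way is to compare against a BigFloat reference; this is just a sketch, not how the Base test suite measures it:

```julia
# Rough ULP error of a Float64 function value against a BigFloat reference.
ulp_error(f, x::Float64) = abs(Float64((big(f(x)) - f(big(x))) / eps(f(x))))

# Example: maximum ULP error of sinh over a grid of moderate arguments.
maximum(ulp_error(sinh, x) for x in range(0.25, 1.0, length = 10_001))
```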
The report of an unsuccessful check seems to be from an earlier build attempt. If this is just buildbot being cranky and the exceptions vanish with a rebuild, or if we understand why it appears and that it has no bearing on the state of this PR, all is well ... wherewith @oscardssmith receives this tip o' 🎩. If the code is clear to merge, then to code clearly is to merge.
@oscardssmith Whatever bothers buildbot now .. has quintupled the error rate.
@JeffreySarnoff It turns out that pushing code at 3am can lead to removing a decimal place that makes your coefficients off by a factor of 10^13.
At this point, the only error is
@JeffBezanson can you re-run the package_freebsd64 build? It timed out.
Rerunning CI, and then will merge once it passes.
Do we know what to make of the tester_linux details, given that the macOS and Windows testers pass?
Yeah, Distributed has been broken. Those failures are fine.
@Keno, given that, time to merge?