Use different methods depending on presence of FMA

pkofod · June 29, 2017, 5:01am

Say you’re doing argument reduction for trig functions for double floats (wonder where this comes from…), then you should use different algorithms depending on the magnitude of your argument. Therefore, you end up with some branching and special cases depending on the required precision (say, around multiples of pi/2 or for numbers larger than 2.0^20).

Now, let’s add FMA to the mix. Due to the fact that FMAs are calculated with just one final rounding, we can in some sense calculate the above to higher precision for a given algorithm. However, given this increased precision, it no longer makes sense to switch to the more expensive versions of the algorithms near pi/2 multiples, and we can wait longer to switch to the expensive Payne Hanek scheme. The reduction scheme changes quite a lot.

In https://github.com/JuliaLang/julia/pull/22603 I do not use FMAs, but many systems have these instruction sets available these days, and it would be stupid not to take advantage of this. However, I am unsure how to best go about this. Do you check for the presence of FMA instructions when makeing and include either of the two depending on a boolean flag, or?

Per · June 29, 2017, 6:09am

https://github.com/JuliaLang/julia/issues/9855

Tamas_Papp · June 29, 2017, 6:12am

Perhaps I am missing something, but I thought that Base.fma always does, well, FMA, just that it is done in software when not supported by the CPU. AFAIK the latter is available on all “recent” (< 5 yo) Intel and AMD CPUs, and quite a few ARM ones as well (Raspberry Pi is perhaps an exception).

If your concern is speed on platforms that don’t support native FMA, you could check Sys.ARCH and use the FMA version of your algorithm only when it is :x86_64. You would miss a few cases though, but this would cover a lot.

pkofod · June 29, 2017, 7:09am

The problem is not speed, it’s the fact that if you have FMA you can use a better approach that doesn’t have to jump through as many hoops. For example, you can avoid using the 3 split version of Cody Waite for the medium case, and you can use Cody Waite for larger values than 2.0^20. So, as I tried to explain above, I want to distinct methods to be used depending on whether Julia was built on a system with or without fma.

Tamas_Papp · June 29, 2017, 7:18am

Then I am confused — I thought that FMA was always available, via Base.fma. At least that’s what the documentation says.

pkofod · June 29, 2017, 7:25am

Alright, then let me be 110% explicit. If you have hardware FMA support, that’s why I used the word “instruction” above.

If you don’t have hardware fma support then fma(x,y,z) can be rather slow, so you want to use muladd(x,y,z), but again, this is not what I am after. If I write muladd on a system without hardware FMA and use the approach I asked for above, then I will get incorrect results. If I use fma on a system without hardware FMA I can use the approach above, but there’s a good chance it will be slower, and then I should have used the original reduction scheme. This is why I want to use two different implementations based on hardware FMA being available or not.

I know most modern systems should have it, but I don’t think it makes sense to have old systems have a) an incorrect reduction scheme or b) an unnecessarily slow reduction scheme.

Edit: basically, we are repeating the discussion in the link Per posted…

Per · June 29, 2017, 7:49am

I’m facing the same problem. At the moment, I’m simply assuming that Base.fma is fast, and don’t bother about the fact that a better algorithm should be used when hardware fma is not available. This approach works well on 100% of the systems that I’ve tested my code on.

It’d be really neat to be able to do @static if has_fast_fma(T) alg1() else alg2() end, but so far I’ve been too lazy to write that function myself.

tkelman · June 29, 2017, 12:24pm

You can inquire about what cpu type llvm and openblas think your processor / system image are. There’s also a jl_cpuid for the raw data. Yichao has a PR open that adds a lot more of this functionality.

yuyichao · June 29, 2017, 8:13pm

Note that it is never a good idea to query feature support from could at runtime, it will be extremely slow.

And yes, I do want to add user API for CPU dispatch too, not sure what it will look like and it’ll probably not be in the current pr.

pkofod · June 29, 2017, 8:39pm

I’m not sure if I didn’t really explain myself that well, but this is exactly the thing I want to avoid. I want this to be settled at build time.

yuyichao · June 30, 2017, 6:15am

Well, that’s replying to Tony’s comment.

Per · June 30, 2017, 12:07pm

How about something like this?

"Predicate for testing if fast fused-multiply-add is available for the given type on this system."
has_fast_fma(::Type) = false
for T in [Float32, Float64]
  if contains(sprint(code_native, fma, (T, T, T)),"vfmadd")
    @eval has_fast_fma(::Type{$T}) = true
  end
end

yuyichao · June 30, 2017, 12:46pm

What does it supposed to do?

Per · June 30, 2017, 12:58pm

It creates the function has_fast_fma that I referred to earlier. (Obviously this is a quick example, and not intended to be comprehensive.)

The idea is to be able to write code like:

@static if has_fast_fma(Float64)
  algorithm 1, which uses fma
else
  algorithm 2, which does does the same thing, without using fma
  and which is faster than algorithm 1 would be with software fma.
end

yuyichao · June 30, 2017, 1:26pm

I mean what’s the semantics you want?
If you just want a flag that works on only one host, just checking the result of muladd should work. Checking the assembly is much less relyable.
If you want a way to make the compiled code portable, you can use a non-const Bool flag if you can afford a branch. If you don’t want the branch, it’s not possible ATM.

Per · June 30, 2017, 2:02pm

OK, so something like this?

has_fast_fma(::Type{T}) where {T} = muladd(one(T)+eps(T),-one(T)+eps(T),one(T)) == eps(T)*eps(T)

I don’t need to make the compiled code portable. (I didn’t even know that Julia supported creating compiled libraries.) I intend to use this in a module which is distributed as source code and (pre)compiled on the host where the code runs.

yuyichao · June 30, 2017, 2:10pm

Better make it a constant. LLVM constant folds muladd.

Per · June 30, 2017, 2:15pm

Would this work?

@generated has_fast_fma(::Type{T}) where {T} = muladd(one(T)+eps(T),-one(T)+eps(T),one(T)) == eps(T)*eps(T)

yuyichao · June 30, 2017, 2:26pm

That’s worse.

Per · June 30, 2017, 3:20pm

Why is that worse? I thought the @generated macro would call the muladd function (rather than inlining and constant-folding it) and the macro would then return a constant Bool as the method body. Maybe I’ve misunderstood how @generated works.

I tried running the code on both a computer with fma and one that doesn’t have it. The results form the @generated function seem correct on both systems. (Without @generated, the function returns true even on the non-fma system.)

Topic		Replies	Views
X * y + z does not automatically use FMA instruction Performance fast-math	31	2499	July 27, 2023
Why doesn’t Julia fma automatically? General Usage	43	2375	October 18, 2023
Floating point optimizations General Usage question , float	5	463	November 1, 2022
"Best practice" for constant expressions (1/2 vs. 1.0/2.0)? New to Julia	10	1175	November 15, 2018
Feedback on benchmark General Usage	0	241	January 10, 2020

Use different methods depending on presence of FMA

Related topics