Use different methods depending on presence of FMA


Say you’re doing argument reduction for trig functions for double floats (wonder where this comes from…), then you should use different algorithms depending on the magnitude of your argument. Therefore, you end up with some branching and special cases depending on the required precision (say, around multiples of pi/2 or for numbers larger than 2.0^20).

Now, let’s add FMA to the mix. Due to the fact that FMAs are calculated with just one final rounding, we can in some sense calculate the above to higher precision for a given algorithm. However, given this increased precision, it no longer makes sense to switch to the more expensive versions of the algorithms near pi/2 multiples, and we can wait longer to switch to the expensive Payne Hanek scheme. The reduction scheme changes quite a lot.

In I do not use FMAs, but many systems have these instruction sets available these days, and it would be stupid not to take advantage of this. However, I am unsure how to best go about this. Do you check for the presence of FMA instructions when makeing and include either of the two depending on a boolean flag, or?



Perhaps I am missing something, but I thought that Base.fma always does, well, FMA, just that it is done in software when not supported by the CPU. AFAIK the latter is available on all “recent” (< 5 yo) Intel and AMD CPUs, and quite a few ARM ones as well (Raspberry Pi is perhaps an exception).

If your concern is speed on platforms that don’t support native FMA, you could check Sys.ARCH and use the FMA version of your algorithm only when it is :x86_64. You would miss a few cases though, but this would cover a lot.


The problem is not speed, it’s the fact that if you have FMA you can use a better approach that doesn’t have to jump through as many hoops. For example, you can avoid using the 3 split version of Cody Waite for the medium case, and you can use Cody Waite for larger values than 2.0^20. So, as I tried to explain above, I want to distinct methods to be used depending on whether Julia was built on a system with or without fma.


Then I am confused — I thought that FMA was always available, via Base.fma. At least that’s what the documentation says.


Alright, then let me be 110% explicit. If you have hardware FMA support, that’s why I used the word “instruction” above.

If you don’t have hardware fma support then fma(x,y,z) can be rather slow, so you want to use muladd(x,y,z), but again, this is not what I am after. If I write muladd on a system without hardware FMA and use the approach I asked for above, then I will get incorrect results. If I use fma on a system without hardware FMA I can use the approach above, but there’s a good chance it will be slower, and then I should have used the original reduction scheme. This is why I want to use two different implementations based on hardware FMA being available or not.

I know most modern systems should have it, but I don’t think it makes sense to have old systems have a) an incorrect reduction scheme or b) an unnecessarily slow reduction scheme.

Edit: basically, we are repeating the discussion in the link Per posted…


I’m facing the same problem. At the moment, I’m simply assuming that Base.fma is fast, and don’t bother about the fact that a better algorithm should be used when hardware fma is not available. This approach works well on 100% of the systems that I’ve tested my code on. :wink:

It’d be really neat to be able to do @static if has_fast_fma(T) alg1() else alg2() end, but so far I’ve been too lazy to write that function myself.


You can inquire about what cpu type llvm and openblas think your processor / system image are. There’s also a jl_cpuid for the raw data. Yichao has a PR open that adds a lot more of this functionality.


Note that it is never a good idea to query feature support from could at runtime, it will be extremely slow.

And yes, I do want to add user API for CPU dispatch too, not sure what it will look like and it’ll probably not be in the current pr.


I’m not sure if I didn’t really explain myself that well, but this is exactly the thing I want to avoid. I want this to be settled at build time.


Well, that’s replying to Tony’s comment.


How about something like this?

"Predicate for testing if fast fused-multiply-add is available for the given type on this system."
has_fast_fma(::Type) = false
for T in [Float32, Float64]
  if contains(sprint(code_native, fma, (T, T, T)),"vfmadd")
    @eval has_fast_fma(::Type{$T}) = true


What does it supposed to do?


It creates the function has_fast_fma that I referred to earlier. (Obviously this is a quick example, and not intended to be comprehensive.)

The idea is to be able to write code like:

@static if has_fast_fma(Float64)
  algorithm 1, which uses fma
  algorithm 2, which does does the same thing, without using fma
  and which is faster than algorithm 1 would be with software fma.


I mean what’s the semantics you want?
If you just want a flag that works on only one host, just checking the result of muladd should work. Checking the assembly is much less relyable.
If you want a way to make the compiled code portable, you can use a non-const Bool flag if you can afford a branch. If you don’t want the branch, it’s not possible ATM.


OK, so something like this?

has_fast_fma(::Type{T}) where {T} = muladd(one(T)+eps(T),-one(T)+eps(T),one(T)) == eps(T)*eps(T)

I don’t need to make the compiled code portable. (I didn’t even know that Julia supported creating compiled libraries.) I intend to use this in a module which is distributed as source code and (pre)compiled on the host where the code runs.


Better make it a constant. LLVM constant folds muladd.


Would this work?

@generated has_fast_fma(::Type{T}) where {T} = muladd(one(T)+eps(T),-one(T)+eps(T),one(T)) == eps(T)*eps(T)


That’s worse.


Why is that worse? I thought the @generated macro would call the muladd function (rather than inlining and constant-folding it) and the macro would then return a constant Bool as the method body. Maybe I’ve misunderstood how @generated works.

I tried running the code on both a computer with fma and one that doesn’t have it. The results form the @generated function seem correct on both systems. (Without @generated, the function returns true even on the non-fma system.)