[EDIT: FMA is (still) worrying, see my next post.]
That seemed to be very worrying. And there seemed no limit to the error. I changed the code slightly to see if I can get larger err:

err = 8.731107228996319e43
a = 6.208364721873767e29
b = 8.740381590017811e29

So what does that mean? I was starting to think we might have discovered a hardware bug, but possibly this is the only “valid” calculated value (based on FMA’s definition). Then users of fma just need to have that in mind (but will not), and indirect users, e.g. BLAS users (they will not). Since this is data dependent, is this just going to be rare (and accepted) and should fma not be used or only with care? Is there a way to fix BLAS to avoid FMA when dangerous? I would think it might not be possible, since such conditional code would slow it down enough to FMA not being effective (otherwise).

I don’t think that the conclusion here is that FMA is “dangerous”. Sometimes it will give more accurate results than non-fused operations, sometimes (probably less often) it will happen to give less accurate results, but in all cases the results are within the tolerance you would expect from floating-point calculations.

I think you are being confused here by the concept of relative error. Is 8.5e41 a “big” error? You have to ask big compared to what? In this case, the summands are ±1.9e58, and 8.5e41 / 1.9e58 ≈ 4.4e-17, so it’s pretty small compared to the inputs.

Compared to the exact output of 0.0, of course, the relative forward error is infinite (you have no correct significant digits), but that would be true of any nonzero answer, and is to be expected because the sum is ill-conditioned (a sum of nonzero inputs with an exact cancellation has an infinite relative condition number).

In the case of fma(-x,y,x*y), the result is the roundoff error of x*y. That is to say, it’ll never be zero unless x*y requires no rounding of the result.

FMA is doing exactly what it’s supposed to. One can avoid this behavior via sum(a.*b) or sum(prod,zip(a,b)) if necessary, however.

Every rounding step introduces some error, so in “normal” cases the result will be likely to be more accurate with FMA rather than less. Also, it has the advantage of being a single instruction rather than separate multiply and add instructions. For things like linear algebra, this can be a huge speedup (close to a factor of two).

Assuming your real use-case is ill-conditioned (*), but much larger than the toy example here, you could perhaps benefit from the compensated dot product implementation in AccurateArithmetic.jl:

julia> using AccurateArithmetic
julia> a = [4.738285026766486e28, 4.738285026766486e28]
2-element Vector{Float64}:
4.738285026766486e28
4.738285026766486e28
julia> b = [-4.0686784105686624e29, 4.0686784105686624e29]
2-element Vector{Float64}:
-4.0686784105686624e29
4.0686784105686624e29
julia> dot_oro(a, b)
0.0

It does not provide matrix-vector multiplication, though…

(*) note that this particular example is “infinitely” ill-conditioned in that the relative error can not be bounded w.r.t the exact result (which is 0 in this case). The error w.r.t the operands magnitude may however be drastically reduced with compensated algorithms.

You may see it by running version -blas on MATLAB:

>> version -blas
ans =
'Intel(R) Math Kernel Library Version 2019.0.3 Product Build 20190125 for Intel(R) 64 architecture applications, CNR branch AVX2'

It comes with the price of performance (Though when comparing to Julia it seems to be negligible if at all), but it means you get the same result every time on MATLAB.
I actually think Julia should have a switch to chose this mode on MKL.jl.

Remark
It seems that MATLAB is using relatively old version of MKL. My guess they keep it in order to keep the hack of forcing code path for AMD CPU’s.

That’s not a very satisfying answer (though I upvoted), i.e. doesn’t put me at ease. If it were say only -3.583133801120832e-16 from OP, rather than the correct exact 0.0, I wouldn’t worry much.

And it’s not simply that you get two different answers, I can understand that, now.

I still think FMA is dangerous (if you don’t know what you’re doing, as with all floating-point, which is a wicked can-of worms), while useful. But let’s just concentrate first on BLAS only. An easy apparent “fix” would be to just not use FMA there, then this wouldn’t have happened, slight or big differences, nor two different answers. Getting two different can actually be a good thing. Then on FMA hardware, you could run with and without and compare, at least occasionally, for a sanity check (is there an easy way to disable it on such hardware? julia --cpu-target=sandybridge seems to do it but likely only for everything Julia, except for BLAS; “-fma” and “-fma4” options for --cpu-target didn’t work while seemingly should have).

I can expect and am willing to accept slightly different answers (and thus not all results accurate) across systems, as long as they do not diverge a lot, for some definition of “a lot”. If I could run some kind of isapprox, with some fixed (maybe not the default) parameters and get true. Some people would want bit-identical answer (see Java historically, an abandoned plan).

Well, compared to the correct value , there 0.0. By atol appropriate? But it seem like you can get arbitrarily large absolute difference, so wouldn’t be useful to check and besides you wouldn’t know the correct answer beforehand. Aren’t matrix multiply algorithms used (without FMA) considered “stable”, i.e. with low relative tolerance (and absolute tol)? I suppose with FMA, you much more often get a relatively correct answer.

Good to know, but I suppose you get wrong answers there too, now just the same wrong answer, while always the “bitwise identical results”, so you wouldn’t notice.

You should be able to use (strict) CNR in Julia with MKL.jl, so someone could check, with:

julia> ENV["MKL_CBWR"] = "AVX2,STRICT"

Strict CNR Mode

Intel® MKL 2019 Update 3 introduces a new strict CNR mode which provides stronger reproducibility guarantees for a limited set of routines and conditions. When strict CNR is enabled, MKL will produce bitwise identical results, regardless of the number of threads, for the following functions:

?gemm, ?trsm, ?symm, and their CBLAS equivalents.

Additionally, strict CNR is only supported under the following conditions:

the CNR code-path must be set to AVX2 or later, or the AUTO code-path is used and the processor is an Intel processor supporting AVX2 or later instructions;

before you open another can of worms (you already did…;). The godfathers thought about this. But it certainly would be good to know how to navigate the minefield of speed vs. accuracy (in Julia?) most of the time.

fma is a red herring here. Any way of computing a matmul will have cases with arbitrarily high forward error. On average, fma lowers forward error, but not always.
For example, in this case

These two statements are contradictory. Compared to 0.0, any nonzero answer (even -3e-16) is an infinite relative forward error — -3e-16 has no correct significant digits.

It’s simply a mistake to always think of 1e-16 as a “small” number in floating-point arithmetic, because it implicitly assumes your relevant scale is \sim 1. (As a physical example, imagine simulating electromagnetism in SI units at optical wavelengths, where the typical timescale is \lesssim 10^{-15} seconds.) Floating-point arithmetic is all about relative errors.

They are backwards stable (with or without FMA), which essentially means they have a “small” backwards relative error. (More precisely, the backwards relative error decreases as O(ε), whether or not the coefficient is small.) This does not guarantee a small forwards error — the forwards relative error can grow arbitrarily large as the condition number diverges.

Sorry, you cannot guarantee small relative forwards error (the error in the “answer”) for an ill-conditioned problem in finite precision.

These are basic considerations that apply to any floating-point algorithm, regardless of whether FMA is used. The fact that FMA gave a less accurate result in this particular case is just “luck” — there’s no reason to expect that to happen consistently.

After searching a bit I don’t think so. I saw some mention of XBLAS being used internally in MKL, but other than that we probably would need to interface the reference implementation via ccall.

Note that you can already use Julia’s generic linear-algebra backends with extended-precision floating-point types (double-double etc.). See also the discussion at: MPLAPACK (multiprecision blas+lapack)