[EDIT: FMA is (still) worrying, see my next post.]
That seemed very worrying, and there seemed to be no limit to the error. I changed the code slightly to see if I could get a larger err:
err = 8.731107228996319e43
a = 6.208364721873767e29
b = 8.740381590017811e29
So what does that mean? I was starting to think we might have discovered a hardware bug, but possibly this is the only "valid" calculated value (based on FMA's definition). Then direct users of fma just need to keep that in mind (but will not), and so do indirect users, e.g. BLAS users (they will not either). Since this is data-dependent, is this just going to be rare (and accepted), and should fma not be used, or only with care? Is there a way to fix BLAS to avoid FMA when it is dangerous? I would think it might not be possible, since such conditional code would slow things down enough that FMA would no longer be effective.
I don't think that the conclusion here is that FMA is "dangerous". Sometimes it will give more accurate results than non-fused operations, sometimes (probably less often) it will happen to give less accurate results, but in all cases the results are within the tolerance you would expect from floating-point calculations.
I think you are being confused here by the concept of relative error. Is 8.5e41 a "big" error? You have to ask: big compared to what? In this case, the summands are ±1.9e58, and 8.5e41 / 1.9e58 ≈ 4.4e-17, so it's pretty small compared to the inputs.
Compared to the exact output of 0.0, of course, the relative forward error is infinite (you have no correct significant digits), but that would be true of any nonzero answer, and is to be expected because the sum is ill-conditioned (a sum of nonzero inputs with an exact cancellation has an infinite relative condition number).
In the case of fma(-x,y,x*y), the result is the roundoff error of x*y. That is to say, it'll never be zero unless x*y requires no rounding of the result.
FMA is doing exactly what it's supposed to. One can avoid this behavior via sum(a.*b) or sum(prod, zip(a,b)) if necessary, however.
Every rounding step introduces some error, so in "normal" cases the result is likely to be more accurate with FMA rather than less. Also, it has the advantage of being a single instruction rather than separate multiply and add instructions. For things like linear algebra, this can be a huge speedup (close to a factor of two).
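To make that concrete with the inputs from this thread, here is a quick check (plain Float64, nothing beyond Base):

x = 4.738285026766486e28
y = 4.0686784105686624e29
err = fma(-x, y, x*y)        # exactly the rounding error of x*y; nonzero unless x*y needs no rounding
abs(err) <= 0.5 * eps(x*y)   # true: the error is at most half an ulp of x*y
a = [x, x]; b = [-y, y]
sum(a .* b)                  # 0.0: each product is rounded first, and the two round to exact negatives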
Assuming your real use-case is ill-conditioned (*), but much larger than the toy example here, you could perhaps benefit from the compensated dot product implementation in AccurateArithmetic.jl:
julia> using AccurateArithmetic
julia> a = [4.738285026766486e28, 4.738285026766486e28]
2-element Vector{Float64}:
4.738285026766486e28
4.738285026766486e28
julia> b = [-4.0686784105686624e29, 4.0686784105686624e29]
2-element Vector{Float64}:
-4.0686784105686624e29
4.0686784105686624e29
julia> dot_oro(a, b)
0.0
It does not provide matrix-vector multiplication, though…
(*) Note that this particular example is "infinitely" ill-conditioned, in that the relative error cannot be bounded w.r.t. the exact result (which is 0 in this case). The error w.r.t. the operands' magnitude may, however, be drastically reduced with compensated algorithms.
You can see it by running version -blas in MATLAB:
>> version -blas
ans =
'Intel(R) Math Kernel Library Version 2019.0.3 Product Build 20190125 for Intel(R) 64 architecture applications, CNR branch AVX2'
It comes at the price of performance (though compared to Julia the difference seems negligible, if it exists at all), but it means you get the same result every time in MATLAB.
I actually think Julia should have a switch to choose this mode in MKL.jl.
Remark
It seems that MATLAB is using a relatively old version of MKL. My guess is that they keep it in order to preserve the hack of forcing the code path for AMD CPUs.
That's not a very satisfying answer (though I upvoted), i.e. it doesn't put me at ease. If it were, say, only the -3.583133801120832e-16 from the OP, rather than the exactly correct 0.0, I wouldn't worry much.
And it's not simply that you get two different answers; I can understand that now.
I still think FMA is dangerous (if you don't know what you're doing, as with all floating point, which is a wicked can of worms), while useful. But let's concentrate first on BLAS only. An easy apparent "fix" would be to just not use FMA there; then this wouldn't have happened, neither slight nor big differences, nor two different answers. Getting two different answers can actually be a good thing: on FMA hardware, you could run with and without it and compare, at least occasionally, as a sanity check (is there an easy way to disable it on such hardware? julia --cpu-target=sandybridge seems to do it, but likely only for Julia code, not for BLAS; the "-fma" and "-fma4" options for --cpu-target didn't work, while it seems they should have).
I can expect, and am willing to accept, slightly different answers (and thus not all results accurate) across systems, as long as they do not diverge a lot, for some definition of "a lot". I would be satisfied if I could run some kind of isapprox, with some fixed (maybe not the default) parameters, and get true. Some people would want bit-identical answers (see Java historically; a plan since abandoned).
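For a single dot product, one way to "run with and without and compare" in user code, without touching BLAS, is a sketch like the following (the helper names are made up for illustration):

x = [4.738285026766486e28, 4.738285026766486e28]
y = [-4.0686784105686624e29, 4.0686784105686624e29]

dot_unfused(a, b) = sum(a .* b)   # each product rounded separately, then summed
dot_fused(a, b) = foldl((s, i) -> fma(a[i], b[i], s), eachindex(a); init = 0.0)  # fma works even without hardware support (software fallback)

dot_unfused(x, y)   # 0.0 for these inputs
dot_fused(x, y)     # nonzero: exactly the rounding error of x[1]*y[1]

This only checks the accumulation order written here, not whatever blocking and ordering BLAS actually uses, so it is a sanity check rather than a reproduction of the BLAS result.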
Well, compared to the correct value, here 0.0. Is atol appropriate? But it seems like you can get an arbitrarily large absolute difference, so it wouldn't be useful to check, and besides you wouldn't know the correct answer beforehand. Aren't the matrix-multiply algorithms used (without FMA) considered "stable", i.e. with a low relative error (and absolute error)? I suppose with FMA you much more often get a relatively correct answer.
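If one does want an isapprox-style check, the tolerance has to scale with the operands rather than being a fixed constant. A sketch using the standard forward-error bound for a dot product (nothing below comes from BLAS itself):

using LinearAlgebra
a = [4.738285026766486e28, 4.738285026766486e28]
b = [-4.0686784105686624e29, 4.0686784105686624e29]
tol = length(a) * eps() * sum(abs.(a .* b))   # n*eps*sum(|a_i*b_i|), roughly 1.7e43 for these inputs
isapprox(a' * b, 0.0; atol = tol)             # true here, with or without FMA in the dot product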
Good to know, but I suppose you get wrong answers there too, only now it's the same wrong answer every time ("bitwise identical results"), so you wouldn't notice.
You should be able to use (strict) CNR in Julia with MKL.jl, so someone could check, with:
julia> ENV["MKL_CBWR"] = "AVX2,STRICT"
Strict CNR Mode
Intel® MKL 2019 Update 3 introduces a new strict CNR mode which provides stronger reproducibility guarantees for a limited set of routines and conditions. When strict CNR is enabled, MKL will produce bitwise identical results, regardless of the number of threads, for the following functions:
?gemm, ?trsm, ?symm, and their CBLAS equivalents.
Additionally, strict CNR is only supported under the following conditions:
the CNR code-path must be set to AVX2 or later, or the AUTO code-path is used and the processor is an Intel processor supporting AVX2 or later instructions;
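A sketch of what that might look like end to end with MKL.jl (untested; this assumes MKL reads MKL_CBWR when it first initializes, so the variable needs to be set before MKL is loaded):

ENV["MKL_CBWR"] = "AVX2,STRICT"   # set before MKL is initialized
using MKL                         # make LinearAlgebra use MKL as the BLAS backend
using LinearAlgebra
A = randn(100, 100); B = randn(100, 100)
A * B                             # dgemm should now take the reproducible strict-CNR AVX2 code path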
before you open another can of worms (you already did… ;). The godfathers thought about this. But it certainly would be good to know how to navigate the minefield of speed vs. accuracy (in Julia?) most of the time.
fma is a red herring here. Any way of computing a matmul will have cases with arbitrarily high forward error. On average, fma lowers forward error, but not always.
For example, in this case
These two statements are contradictory. Compared to 0.0, any nonzero answer (even -3e-16) is an infinite relative forward error; -3e-16 has no correct significant digits.
It's simply a mistake to always think of 1e-16 as a "small" number in floating-point arithmetic, because it implicitly assumes your relevant scale is $\sim 1$. (As a physical example, imagine simulating electromagnetism in SI units at optical wavelengths, where the typical timescale is $\lesssim 10^{-15}$ seconds.) Floating-point arithmetic is all about relative errors.
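A concrete way to see this for the case at hand (approximate values in the comments):

eps(1.0)      # ≈ 2.2e-16: the spacing of Float64 values near 1
eps(1.9e58)   # ≈ 2.8e42: the spacing of Float64 values near the summands above

So an absolute error of ~8.5e41 is smaller than the gap between adjacent Float64 values at the scale of these summands, i.e. less than one ulp.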
They are backwards stable (with or without FMA), which essentially means they have a "small" backwards relative error. (More precisely, the backwards relative error decreases as O(ε), whether or not the coefficient is small.) This does not guarantee a small forwards error: the forwards relative error can grow arbitrarily large as the condition number diverges.
Sorry, you cannot guarantee a small relative forwards error (the error in the "answer") for an ill-conditioned problem in finite precision.
These are basic considerations that apply to any floating-point algorithm, regardless of whether FMA is used. The fact that FMA gave a less accurate result in this particular case is just "luck": there's no reason to expect that to happen consistently.
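A familiar example of the same phenomenon, with no FMA involved at all:

julia> 0.1 + 0.2 - 0.3   # exact decimal answer is 0, but each constant is rounded to binary first
5.551115123125783e-17

Relative to the inputs this is about one part in 1e16, i.e. roughly an ulp; relative to the exact answer 0 it is, again, an infinite relative error.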
After searching a bit, I don't think so. I saw some mention of XBLAS being used internally in MKL, but other than that we would probably need to interface the reference implementation via ccall.
Note that you can already use Julia's generic linear-algebra backends with extended-precision floating-point types (double-double etc.). See also the discussion at: MPLAPACK (multiprecision blas+lapack)
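For example, a minimal sketch using the built-in BigFloat type (256-bit precision by default) with the generic fallback, rather than a dedicated double-double package:

using LinearAlgebra
a = [4.738285026766486e28, 4.738285026766486e28]
b = [-4.0686784105686624e29, 4.0686784105686624e29]
dot(big.(a), big.(b))   # generic dot in 256-bit precision; both products are exact, so this is exactly 0.0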