Inaccurate Matrix Multiplication

We experienced an issue with the matrix-vector product (here a row vector times a column vector). In the following example the result should be exactly zero:

a = [1.3190624946305824, 1.3190624946305824]
b = [-3.74 3.74]

# Matrix-Vector-Product
b * a        # not always zero
sum(b .* a)  # zero
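
For reference, here is a self-contained version of the reproducer (same numbers as above; only the b * a result is machine-dependent):

```julia
# Minimal reproducer. `b` is a 1×2 matrix (row), `a` a length-2 vector,
# so `b * a` is a length-1 vector computed by BLAS gemv.
a = [1.3190624946305824, 1.3190624946305824]
b = [-3.74 3.74]

r = (b * a)[1]    # 0.0 or ≈ -3.6e-16, depending on the machine/BLAS
s = sum(b .* a)   # the broadcasted products cancel pairwise: always 0.0

println((r, s))
```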

On Machine 1, only the second calculation returns zero:

julia> b * a
1-element Vector{Float64}:
-3.583133801120832e-16

julia> sum(b .* a)
0.0

On Machine 2 both are zero:

julia> b * a
1-element Vector{Float64}:
 0.0

julia> sum(b .* a)
0.0

However, the same calculation in MATLAB returns zero on Machine 1 as well. Since MATLAB uses MKL, we suspected the BLAS library, but switching to MKL.jl does not change the results.

Is the difference the inaccuracy to be expected from floating-point arithmetic?

Machine 1:

julia> versioninfo()
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)

julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
└ [ILP64] libopenblas64_.dll

Machine 2:

julia> versioninfo()
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, tigerlake)

julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
└ [ILP64] libopenblas64_.dll

I find the deviation surprisingly large. Could you check with @code_native b * a whether the same code is generated on both machines? If yes, then maybe it is time to study CPU errata…


I think MKL.jl should give the same result as MKL in MATLAB (assuming MATLAB does use MKL, the exact same version, on the same processor). I think you’re not actually getting 0.0 there; MATLAB is just displaying the value as such. Some languages, MATLAB included, show a more user-friendly (well, that’s debatable) version of your number by default.

The answer you get is because of floating point (and more…), but the difference isn’t strictly because of it. A plain “yes” would be a simplification.

As you can see at PSA: floating-point arithmetic - #8 by simonbyrne

Across machines, you “might get non-deterministic results” for “BLAS or LAPACK operations” (also for e.g. “muladd”), and not just with the default BLAS, i.e. OpenBLAS, but also with MKL. And not just in Julia: that’s to be expected across languages using these dependencies, such as Python and MATLAB.

The reason is e.g. “skylake” vs “tigerlake”. Even though floating-point math is deterministic and, I believe, the same across those (and all x86) architectures, that only applies to individual scalar operations (or to the exact same SIMD instructions); it does not apply to full code built from them when that code is not the same, or is not run in the same sequence. OpenBLAS simply ships separate, hand-written assembly kernels for each architecture. You could use @Elrod’s pure-Julia code, but I’m not sure it is NOT special-cased per architecture too (and if not yet, it might be later). Another thing I can think of: many threads might, or likely will, add non-determinism, but I doubt that matters for such small matrices. You could try disabling threading.
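
The order-of-operations point is easy to demonstrate without any BLAS at all; a tiny sketch of my own (not from this thread): summing the very same three numbers in two different orders rounds differently:

```julia
# Floating-point addition is not associative: the same terms evaluated
# in a different order can produce a different rounded result.
left  = (0.1 + 0.2) + 0.3   # 0.6000000000000001
right = 0.1 + (0.2 + 0.3)   # 0.6
@show left == right         # false
```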

So in short, this is to be expected, and isn’t strictly considered a bug. But:

If you need your answer to contain the right value, you could look into interval arithmetic (Julia has packages for it) and use it with Chris Elrod’s code; I’m not sure the interval end points will be deterministic, actually. Another thing to look into is posits (unums) instead of traditional floating point.

Note that while -3.583133801120832e-16 isn’t exactly 0.0, it is close enough for most purposes, yet not approximately equal to zero with the default tolerances:

julia> -3.583133801120832e-16 ≈ 0.0
false

That operator, an alias for isapprox with its default tolerances, has been enough for me (when porting from MATLAB).

Note, however:

julia> isapprox(-3.583133801120832e-16, 0.0; atol=0.05)
true

I’ve never had to use absolute tolerance (atol) before, or relative (rtol), so I’m not sure which values are most appropriate, I just took that one from the docs.

But

julia> 1 + -3.583133801120832e-16
0.9999999999999997
julia> nextfloat(nextfloat(nextfloat(1 + -3.583133801120832e-16)))
1.0

i.e. 3 ulp?


From the isapprox documentation:

Note that x ≈ 0 (i.e., comparing to zero with the default tolerances) is equivalent to x == 0 since the default atol is 0 . In such cases, you should either supply an appropriate atol (or use norm(x) ≤ atol ) or rearrange your code (e.g. use x ≈ y rather than x - y ≈ 0 ). It is not possible to pick a nonzero atol automatically because it depends on the overall scaling (the “units”) of your problem: for example, in x - y ≈ 0 , atol=1e-9 is an absurdly small tolerance if x is the radius of the Earth in meters, but an absurdly large tolerance if x is the radius of a Hydrogen atom in meters.
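
One way to apply that advice to the product in this thread is to derive atol from the scale of the summed terms; the factor of 10 ulps below is my own arbitrary choice, just a sketch:

```julia
# Scale-aware zero test: compare |result| against a few ulps of the
# magnitude of the terms that were summed, not against a fixed constant.
a = [1.3190624946305824, 1.3190624946305824]
b = [-3.74 3.74]
r = (b * a)[1]                          # 0.0 or ≈ -3.6e-16, machine-dependent
scale = sum(abs.(vec(b)) .* abs.(a))    # Σ|bᵢ||aᵢ| ≈ 9.87
@show isapprox(r, 0.0; atol = 10 * eps() * scale)   # true either way
```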


I’m not sure that’s a reliable test. Wouldn’t you get the exact same code (or very similar) on the Julia side when the multiplication is dispatched to BLAS, with the differences hidden inside the BLAS library and not showing up in @code_native?

If the deviation is actually considered too large (3 ulps), then could it rather be a software bug? I doubt errata, though that is also a possibility. I believe all simple operations should have at most 1 ulp of error, but the errors add up, even with the few operations here (though clearly they need not).

As far as I understand, the cancellation error could be larger than 3 ulps. But that would mean the products should differ somewhere, which I don’t see:

julia> reinterpret(UInt64, 1.3190624946305824 * -3.74)
0xc013bbb159fe3ec4

julia> reinterpret(UInt64, 1.3190624946305824 * 3.74)
0x4013bbb159fe3ec4

To OP: could you check these, please?

Architectures with FMA will use it, those without will not.
Block sizes differ, which will also result in different orderings for larger matrices.


And that’s what’s happening here, it seems; I at least get the exact value with:

julia> fma(1.3190624946305824, -3.74, 1.3190624946305824 * 3.74)
3.583133801120832e-16

So to be “expected”?


I agree with the sentiment that roundoff errors are expected when doing sums, but in this particular case I admit I’m mystified.

x + (-x) is always exactly == 0 in floating-point arithmetic (for finite numbers), and similarly I think you should always have x * (-y) == -(x * y) exactly in the default rounding mode (round-to-nearest, ties-to-even). The combination of these two rules seems like it should mean that [x -x] * [y, y] == 0 exactly.

I’m not sure how you could implement the dot product to get a nonzero result here, short of changing the rounding mode? It’s a puzzle.

Mind you, I’m not saying it’s a bug — the -3.6e-16 answer is well within the tolerance I would generically expect for floating-point calculations on numbers of magnitude ≈ 4 — just that I’m curious how the BLAS dot could be implemented differently for length-2 vectors here. (I’m guessing it’s some SIMD thing, but I’m still not sure how SIMD would change the rounding in this case.)


As @Palli indicated above, using muladd reproduces:

julia> function mydot(a,b)
           s = zero(Base.promote_eltype(a,b))
           for i in eachindex(a,b)
               s = muladd(a[i],b[i],s)
           end
           s
       end
mydot (generic function with 1 method)

julia> mydot(fill(1.3190624946305824, 2), [-3.74, 3.74])
-3.583133801120832e-16

I’m guessing TigerLake gets the correct answer because CPU feature detection doesn’t work for it on older versions of OpenBLAS (though I would’ve thought Julia 1.7.2 comes with a new enough version).

gemv will normally SIMD across rows of b in b * a while muladd-ing across the inner axis.
Hence, the mydot above is representative of what gemv should be doing.
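
For contrast, the unfused counterpart (mydot_nofma is a hypothetical name): with * and + as separate, individually rounded operations, the two symmetric terms cancel exactly:

```julia
# Same loop as mydot above, but without muladd; Julia will not fuse a
# plain `s + a[i] * b[i]` into an fma, so each product is rounded first.
function mydot_nofma(a, b)
    s = zero(Base.promote_eltype(a, b))
    for i in eachindex(a, b)
        s = s + a[i] * b[i]
    end
    s
end

mydot_nofma(fill(1.3190624946305824, 2), [-3.74, 3.74])  # 0.0
```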


No need to be mystified anymore: that’s not true when the add is part of an fma. The docs:

Computes xy+z without rounding the intermediate result xy. […] fma is used to improve accuracy in certain algorithms.

Well, here it improved neither the accuracy (nor really the speed), but that’s no fault of Julia (this happens inside BLAS; it should also occur with Elrod’s software).

Usually when you do x*y + z you round first for * and then for +, and as far as I understand, getting rid of that double rounding with fma is not just for speed but also (usually) for accuracy. I’ve yet to do the fma calculation on paper, but I expect this is the correctly rounded answer, and it would fall under “catastrophic cancellation”. fma is defined in IEEE 754 and must give this answer across platforms. (It’s not usually considered SIMD, while it is in a sense.) Can anyone check the fma calculation across the relevant platforms, and also with muladd?
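
The calculation on paper is short, since fma(x, y, -(x*y)) is exactly the rounding error of the product (the classical “TwoProd” trick); a sketch with this thread’s numbers:

```julia
# With a = [x, x] and b = [-y, y] the fused dot computes
#   s₁ = round(x · (-y)) = -p          (rounded product)
#   s₂ = fma(x, y, s₁)   = x·y − p     (computed exactly, rounded once)
# i.e. exactly the rounding error of p = round(x·y), which is not zero.
x, y = 1.3190624946305824, 3.74
p = x * y            # rounded product
e = fma(x, y, -p)    # exact error term: x·y (real) == p + e
@show e              # -3.583133801120832e-16, the value seen in b * a
```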


Sure

julia> function mydot(a,b)
           s = zero(Base.promote_eltype(a,b))
           for i in eachindex(a,b)
               s = muladd(a[i],b[i],s)
           end
           s
       end
mydot (generic function with 1 method)

julia> mydot(fill(1.3190624946305824, 2), [-3.74, 3.74])
-3.583133801120832e-16

julia> fma(1.3190624946305824, -3.74, 1.3190624946305824 * 3.74)
3.583133801120832e-16

julia> versioninfo()
Julia Version 1.7.3-pre.0
Commit 2ca8b0cf3d* (2022-02-07 18:56 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, tigerlake)

As I said, the OpenBLAS version Julia 1.7 uses by default just doesn’t know tigerlake can use fma instructions.

Edit: now checking relative error…

I tried

let
    err = 0.0
    for _ in 1:1000
        a = 1e16 * rand()
        b = 1e16 * rand()
        res = fma(a, b, -a * b)               # exact rounding error of a * b
        err = max(abs(res) / abs(a * b), err)
    end
    @show err
end

yielding for example

err = 1.0668502600919318e-16

So, what we gathered from the discussion so far:

  • MATLAB is just rounding away the last digits for display and thus shows zero
  • The difference is probably due to fma being used on one machine’s BLAS kernel but not the other’s

Here the @code_native for Machine 1:

julia> @code_native b*a
        .text
; ┌ @ matmul.jl:44 within `*`
        pushq   %rbp
        movq    %rsp, %rbp
        pushq   %rsi
        pushq   %rdi
        pushq   %rbx
        subq    $88, %rsp
        movq    %rdx, %rsi
        vxorps  %xmm0, %xmm0, %xmm0
        vmovaps %xmm0, -48(%rbp)
        movq    $0, -32(%rbp)
        movq    %rsi, -56(%rbp)
        movl    $jl_get_pgcstack, %eax
        callq   *%rax
        movq    %rax, %rdi
        movq    $4, -48(%rbp)
        movq    (%rdi), %rax
        movq    %rax, -40(%rbp)
        leaq    -48(%rbp), %rax
        movq    %rax, (%rdi)
        movq    (%rsi), %rbx
        movq    8(%rsi), %rsi
; │ @ matmul.jl:47 within `*`
; │┌ @ array.jl:150 within `size`
        movq    24(%rbx), %rdx
        movl    $jl_alloc_array_1d, %eax
; │└
; │┌ @ abstractarray.jl:785 within `similar` @ array.jl:378
; ││┌ @ boot.jl:466 within `Array` @ boot.jl:457
        movl    $284725168, %ecx                # imm = 0x10F88FB0
        callq   *%rax
        movq    %rax, -32(%rbp)
; │└└
; │┌ @ matmul.jl:275 within `mul!` @ matmul.jl:66
        movb    $0, 40(%rsp)
        movb    $1, 32(%rsp)
        movabsq $"gemv!", %r10
        movq    %rax, %rcx
        movl    $1308622848, %edx               # imm = 0x4E000000
        movq    %rbx, %r8
        movq    %rsi, %r9
        callq   *%r10
        movq    -40(%rbp), %rcx
        movq    %rcx, (%rdi)
; │└
        addq    $88, %rsp
        popq    %rbx
        popq    %rdi
        popq    %rsi
        popq    %rbp
        retq
        nop
; └

and Machine 2:

julia> @code_native b*a
        .text
; ┌ @ matmul.jl:44 within `*`
        pushq   %rbp
        movq    %rsp, %rbp
        pushq   %rsi
        pushq   %rdi
        pushq   %rbx
        subq    $88, %rsp
        movq    %rdx, %rsi
        vxorps  %xmm0, %xmm0, %xmm0
        vmovaps %xmm0, -48(%rbp)
        movq    $0, -32(%rbp)
        movq    %rsi, -56(%rbp)
        movl    $jl_get_pgcstack, %eax
        callq   *%rax
        movq    %rax, %rdi
        movq    $4, -48(%rbp)
        movq    (%rdi), %rax
        movq    %rax, -40(%rbp)
        leaq    -48(%rbp), %rax
        movq    %rax, (%rdi)
        movq    (%rsi), %rbx
        movq    8(%rsi), %rsi
; │ @ matmul.jl:47 within `*`
; │┌ @ array.jl:150 within `size`
        movq    24(%rbx), %rdx
        movl    $jl_alloc_array_1d, %eax
; │└
; │┌ @ abstractarray.jl:785 within `similar` @ array.jl:378
; ││┌ @ boot.jl:466 within `Array` @ boot.jl:457
        movl    $284790704, %ecx                # imm = 0x10F98FB0
        callq   *%rax
        movq    %rax, -32(%rbp)
; │└└
; │┌ @ matmul.jl:275 within `mul!` @ matmul.jl:66
        movb    $0, 40(%rsp)
        movb    $1, 32(%rsp)
        movabsq $"gemv!", %r10
        movq    %rax, %rcx
        movl    $1308622848, %edx               # imm = 0x4E000000
        movq    %rbx, %r8
        movq    %rsi, %r9
        callq   *%r10
        movq    -40(%rbp), %rcx
        movq    %rcx, (%rdi)
; │└
        addq    $88, %rsp
        popq    %rbx
        popq    %rdi
        popq    %rsi
        popq    %rbp
        retq
        nop
; └

reinterpret(UInt64, 1.3190624946305824 * -3.74) returns 0xc013bbb159fe3ec4 on all machines. And reinterpret(UInt64, 1.3190624946305824 * 3.74) returns 0x4013bbb159fe3ec4.

An additional note: MATLAB on Machine 1 returns zero even when asked to show 16 digits.

>> a = [1.3190624946305824; 1.3190624946305824]

a =

    1.3191
    1.3191

>> b = [-3.74 3.74]

b =

   -3.7400    3.7400

>> sprintf('%.16f',b*a)

ans =

    '0.0000000000000000'

>>


Just to share my results:

Julia 1.7.0:     -3.583133801120832e-16 (both OpenBLAS and MKL.jl)
Julia 1.8.0-DEV:  0.0 (OpenBLAS) 
                 -3.583133801120832e-16 (MKL.jl)
MATLAB:           0.0

Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm

I tried MKL on Machine 2 and there I get the inaccuracy too:

julia> b * a
1-element Vector{Float64}:
 -3.583133801120832e-16

julia> sum(b .* a) 
0.0

julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
└ [ILP64] mkl_rt.2.dll

Yes; it’s an ill-conditioned sum.
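
To quantify “ill-conditioned”: the condition number of summing terms tᵢ is Σ|tᵢ| / |Σtᵢ|, and here the true sum is zero, so the condition number is infinite and any rounding at all can dominate the result. A sketch:

```julia
# Condition number of the sum underlying this dot product.
terms = [1.3190624946305824 * -3.74, 1.3190624946305824 * 3.74]
κ = sum(abs, terms) / abs(sum(terms))   # Inf: the true sum is exactly 0
@show κ
```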

Thanks, I forgot that fma would change the results here.