Worse performance of LU-decomposition for Float32 than for Float64

For some applications one might gain performance by using 32-bit arithmetic on a 64-bit system because this could improve memory-bandwidth as well as (possible) SIMD operations.

I noticed however that some functions seem to actually suffer from using Float32, which seems a bit surprising:

julia> using LinearAlgebra, BenchmarkTools

julia> const A_64 = rand(100,100);

julia> const A_32 = Float32.(A_64);

julia> @btime lu($A_64);
  1.241 ms (4 allocations: 79.11 KiB)

julia> @btime lu($A_32);
  3.248 ms (4 allocations: 40.05 KiB)

julia> versioninfo()
Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, haswell)

Does anyone have an idea what is causing this?

Does this persist as you increase the size of the problem (200, 400, 1000)?
What about lu!

Yes, it does not seem to be dependend on the matrix size:

julia> const A_64 = rand(1000,1000);

julia> const A_32 = Float32.(A_64);

julia> @btime lu($A_64);
  20.049 ms (4 allocations: 7.64 MiB)

julia> @btime lu($A_32);
  35.531 ms (4 allocations: 3.82 MiB)

For lu! I changed the code a little bit (of course there is now some overhead of rand and non-identical matrices, but I think their impact should be neglectable).

julia> @btime lu!(rand(100,100));
  1.546 ms (4 allocations: 79.11 KiB)

julia> @btime lu!(rand(Float32,100,100));
  2.847 ms (4 allocations: 40.05 KiB)

I think the issue might be related to Windows:
Running the same example on Linux, I get:

julia> using LinearAlgebra, BenchmarkTools

julia> const A_64 = rand(100,100);

julia> const A_32 = Float32.(A_64);

julia> @btime lu($A_64);
  757.133 μs (3 allocations: 79.08 KiB)

julia> @btime lu($A_32);
  738.773 μs (3 allocations: 40.02 KiB)

julia> versioninfo()
Julia Version 1.5.2
Commit 539f3ce943 (2020-09-23 23:17 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, westmere)

Note that the performance is also much better, although the CPU should be slower.

Tried on a Mac (1.5.2 and latest OS). I got what I’d expect.

julia> n=1000; A=rand(n,n); A32=Float32.(A);

julia> @btime lu($A);
  8.922 ms (3 allocations: 7.64 MiB)

julia> @btime lu($A32);
  6.611 ms (3 allocations: 3.82 MiB)

julia> @btime lu!($A);
  7.637 ms (1 allocation: 7.94 KiB)

julia> @btime lu!($A32);
  4.943 ms (1 allocation: 7.94 KiB)

These results indeed seem reasonable! I don’t know what is going on in the windows world…

using MKL.jl:

julia> n=1000; A=rand(n,n); A32=Float32.(A); Ac = similar(A); A32c = similar(A32);

julia> using LinearAlgebra, BenchmarkTools

julia> @benchmark lu!(copyto!($Ac, $A))
BenchmarkTools.Trial:
  memory estimate:  4.06 KiB
  allocs estimate:  1
  --------------
  minimum time:     3.067 ms (0.00% GC)
  median time:      3.189 ms (0.00% GC)
  mean time:        3.219 ms (0.00% GC)
  maximum time:     3.616 ms (0.00% GC)
  --------------
  samples:          1553
  evals/sample:     1

julia> @benchmark lu!(copyto!($A32c, $A32))
BenchmarkTools.Trial:
  memory estimate:  4.06 KiB
  allocs estimate:  1
  --------------
  minimum time:     2.005 ms (0.00% GC)
  median time:      2.073 ms (0.00% GC)
  mean time:        2.086 ms (0.00% GC)
  maximum time:     2.679 ms (0.00% GC)
  --------------
  samples:          2397
  evals/sample:     1

This is a great idea!

Using MKL I indeed see an improvement in performance (also on windows):

julia> @btime lu(A);
  6.667 ms (4 allocations: 7.64 MiB)

julia> @btime lu(A32);
  3.543 ms (4 allocations: 3.82 MiB)