Worse performance of LU-decomposition for Float32 than for Float64

For some applications one might gain performance by using 32-bit arithmetic on a 64-bit system because this could improve memory-bandwidth as well as (possible) SIMD operations.

I noticed however that some functions seem to actually suffer from using Float32, which seems a bit surprising:

julia> using LinearAlgebra, BenchmarkTools

julia> const A_64 = rand(100,100);

julia> const A_32 = Float32.(A_64);

julia> @btime lu($A_64);
  1.241 ms (4 allocations: 79.11 KiB)

julia> @btime lu($A_32);
  3.248 ms (4 allocations: 40.05 KiB)

julia> versioninfo()
Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, haswell)

Does anyone have an idea what is causing this?

1 Like

Does this persist as you increase the size of the problem (200, 400, 1000)?
What about lu!

Yes, it does not seem to be dependend on the matrix size:

julia> const A_64 = rand(1000,1000);

julia> const A_32 = Float32.(A_64);

julia> @btime lu($A_64);
  20.049 ms (4 allocations: 7.64 MiB)

julia> @btime lu($A_32);
  35.531 ms (4 allocations: 3.82 MiB)

For lu! I changed the code a little bit (of course there is now some overhead of rand and non-identical matrices, but I think their impact should be neglectable).

julia> @btime lu!(rand(100,100));
  1.546 ms (4 allocations: 79.11 KiB)

julia> @btime lu!(rand(Float32,100,100));
  2.847 ms (4 allocations: 40.05 KiB)

I think the issue might be related to Windows:
Running the same example on Linux, I get:

julia> using LinearAlgebra, BenchmarkTools

julia> const A_64 = rand(100,100);

julia> const A_32 = Float32.(A_64);

julia> @btime lu($A_64);
  757.133 μs (3 allocations: 79.08 KiB)

julia> @btime lu($A_32);
  738.773 μs (3 allocations: 40.02 KiB)

julia> versioninfo()
Julia Version 1.5.2
Commit 539f3ce943 (2020-09-23 23:17 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, westmere)

Note that the performance is also much better, although the CPU should be slower.

Tried on a Mac (1.5.2 and latest OS). I got what I’d expect.

julia> n=1000; A=rand(n,n); A32=Float32.(A);

julia> @btime lu($A);
  8.922 ms (3 allocations: 7.64 MiB)

julia> @btime lu($A32);
  6.611 ms (3 allocations: 3.82 MiB)

julia> @btime lu!($A);
  7.637 ms (1 allocation: 7.94 KiB)

julia> @btime lu!($A32);
  4.943 ms (1 allocation: 7.94 KiB)

These results indeed seem reasonable! I don’t know what is going on in the windows world…

using MKL.jl:

julia> n=1000; A=rand(n,n); A32=Float32.(A); Ac = similar(A); A32c = similar(A32);

julia> using LinearAlgebra, BenchmarkTools

julia> @benchmark lu!(copyto!($Ac, $A))
BenchmarkTools.Trial:
  memory estimate:  4.06 KiB
  allocs estimate:  1
  --------------
  minimum time:     3.067 ms (0.00% GC)
  median time:      3.189 ms (0.00% GC)
  mean time:        3.219 ms (0.00% GC)
  maximum time:     3.616 ms (0.00% GC)
  --------------
  samples:          1553
  evals/sample:     1

julia> @benchmark lu!(copyto!($A32c, $A32))
BenchmarkTools.Trial:
  memory estimate:  4.06 KiB
  allocs estimate:  1
  --------------
  minimum time:     2.005 ms (0.00% GC)
  median time:      2.073 ms (0.00% GC)
  mean time:        2.086 ms (0.00% GC)
  maximum time:     2.679 ms (0.00% GC)
  --------------
  samples:          2397
  evals/sample:     1
2 Likes

This is a great idea!

Using MKL I indeed see an improvement in performance (also on windows):

julia> @btime lu(A);
  6.667 ms (4 allocations: 7.64 MiB)

julia> @btime lu(A32);
  3.543 ms (4 allocations: 3.82 MiB)
1 Like