Worse performance of LU-decomposition for Float32 than for Float64

DavidBerghaus · November 5, 2020, 4:22pm

For some applications one might gain performance by using 32-bit arithmetic on a 64-bit system because this could improve memory-bandwidth as well as (possible) SIMD operations.

I noticed however that some functions seem to actually suffer from using Float32, which seems a bit surprising:

julia> using LinearAlgebra, BenchmarkTools

julia> const A_64 = rand(100,100);

julia> const A_32 = Float32.(A_64);

julia> @btime lu($A_64);
  1.241 ms (4 allocations: 79.11 KiB)

julia> @btime lu($A_32);
  3.248 ms (4 allocations: 40.05 KiB)

julia> versioninfo()
Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, haswell)

Does anyone have an idea what is causing this?

ctkelley · November 5, 2020, 4:28pm

Does this persist as you increase the size of the problem (200, 400, 1000)?
What about lu!

DavidBerghaus · November 5, 2020, 4:36pm

Yes, it does not seem to be dependend on the matrix size:

julia> const A_64 = rand(1000,1000);

julia> const A_32 = Float32.(A_64);

julia> @btime lu($A_64);
  20.049 ms (4 allocations: 7.64 MiB)

julia> @btime lu($A_32);
  35.531 ms (4 allocations: 3.82 MiB)

For lu! I changed the code a little bit (of course there is now some overhead of rand and non-identical matrices, but I think their impact should be neglectable).

julia> @btime lu!(rand(100,100));
  1.546 ms (4 allocations: 79.11 KiB)

julia> @btime lu!(rand(Float32,100,100));
  2.847 ms (4 allocations: 40.05 KiB)

DavidBerghaus · November 5, 2020, 4:44pm

I think the issue might be related to Windows:
Running the same example on Linux, I get:

julia> using LinearAlgebra, BenchmarkTools

julia> const A_64 = rand(100,100);

julia> const A_32 = Float32.(A_64);

julia> @btime lu($A_64);
  757.133 μs (3 allocations: 79.08 KiB)

julia> @btime lu($A_32);
  738.773 μs (3 allocations: 40.02 KiB)

julia> versioninfo()
Julia Version 1.5.2
Commit 539f3ce943 (2020-09-23 23:17 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, westmere)

Note that the performance is also much better, although the CPU should be slower.

ctkelley · November 5, 2020, 4:50pm

Tried on a Mac (1.5.2 and latest OS). I got what I’d expect.

julia> n=1000; A=rand(n,n); A32=Float32.(A);

julia> @btime lu($A);
  8.922 ms (3 allocations: 7.64 MiB)

julia> @btime lu($A32);
  6.611 ms (3 allocations: 3.82 MiB)

julia> @btime lu!($A);
  7.637 ms (1 allocation: 7.94 KiB)

julia> @btime lu!($A32);
  4.943 ms (1 allocation: 7.94 KiB)

DavidBerghaus · November 5, 2020, 5:01pm

These results indeed seem reasonable! I don’t know what is going on in the windows world…

Elrod · November 5, 2020, 5:03pm

using MKL.jl:

julia> n=1000; A=rand(n,n); A32=Float32.(A); Ac = similar(A); A32c = similar(A32);

julia> using LinearAlgebra, BenchmarkTools

julia> @benchmark lu!(copyto!($Ac, $A))
BenchmarkTools.Trial:
  memory estimate:  4.06 KiB
  allocs estimate:  1
  --------------
  minimum time:     3.067 ms (0.00% GC)
  median time:      3.189 ms (0.00% GC)
  mean time:        3.219 ms (0.00% GC)
  maximum time:     3.616 ms (0.00% GC)
  --------------
  samples:          1553
  evals/sample:     1

julia> @benchmark lu!(copyto!($A32c, $A32))
BenchmarkTools.Trial:
  memory estimate:  4.06 KiB
  allocs estimate:  1
  --------------
  minimum time:     2.005 ms (0.00% GC)
  median time:      2.073 ms (0.00% GC)
  mean time:        2.086 ms (0.00% GC)
  maximum time:     2.679 ms (0.00% GC)
  --------------
  samples:          2397
  evals/sample:     1

DavidBerghaus · November 5, 2020, 5:26pm

This is a great idea!

Using MKL I indeed see an improvement in performance (also on windows):

julia> @btime lu(A);
  6.667 ms (4 allocations: 7.64 MiB)

julia> @btime lu(A32);
  3.543 ms (4 allocations: 3.82 MiB)

Topic		Replies	Views
Performance gotcha in linear algebra lu() General Usage performance , linearalgebra	33	3616	February 11, 2020
Linear solve in mixed precision koan Performance question	4	454	November 5, 2020
Apples to apples comparison of A\b with Float64 and Float16 on A64FX Performance question , linearalgebra	12	788	May 2, 2022
LU factorization performance issue New to Julia linearalgebra	30	716	June 6, 2022
Why is the matrix multiplication with integer matrices much slower than with float ones? Performance	6	407	June 27, 2023

Worse performance of LU-decomposition for Float32 than for Float64

Related topics