Apples-to-apples comparison of A\b with Float64 and Float16 on A64FX

Hi all, sorry if this is a weird newbie question, but I’ll go ahead and ask in hopes of learning something. I recently followed the advice in this GitHub issue to build Julia so that it does not promote Float16 to Float32 for arithmetic on the A64FX, and I am trying out different functions to see how the different precisions compare in performance.

The computation I am interested in right now is a linear solve, so I was looking at the following for various n, where T is either Float64 or Float16:

a = rand(T, n, n);
b = rand(T, n);
@code_llvm a\b

I realize that Julia uses OpenBLAS as a backend for many of its operations, so for Float16 it needs to use something else. In the output of @code_llvm a\b I see that Float16 uses j_#generic_lufact! while Float64 uses j_getrf!, and I’m guessing this is why Float64 is faster than Float16 (right now I’m only interested in performance, not accuracy). This leads to the questions that I have. If Float64 also used j_#generic_lufact!, might I see Float16 come out faster on the A64FX, which has hardware half-precision arithmetic? And how would I tell Julia not to use the BLAS backend for Float64, and to instead use the same generic LU factorization, so I can test this?

Thanks in advance for indulging me and my weird question, and I would appreciate any insights.


Have you made any progress? After the new build, are scalar products faster in Float16?

i.e., if you do

using LinearAlgebra, BenchmarkTools
x=rand(100000); x16=Float16.(x);
@btime dot($x,$x)
@btime dot($x16,$x16)

is the half precision dot product any faster?

I have the standard .dmg M1 version, and lu is over 40 times slower in F16 than F32. The dot product is 3 times slower in F16. Does the custom build do any better?

I haven’t made much progress on this unfortunately. I believe the dot product runs into the same issue I was seeing with a linear solve. Running the test you outline gives:

julia> @btime dot($x,$x)
  250.302 μs (0 allocations: 0 bytes)

julia> @btime dot($x16,$x16)
  590.756 μs (0 allocations: 0 bytes)

It’s not quite as bad as the 40 times slower that you observed on the M1, so that’s a plus, but half precision is still outperformed by standard Float64. Looking at @code_llvm dot(x,x) shows that double precision uses a routine in LinearAlgebra/src/matmul.jl, while @code_llvm dot(x16,x16) shows that half precision uses LinearAlgebra/src/generic.jl. I think it would still be helpful to get a fair comparison of double to half precision by having both use the generic.jl linear algebra routines.

You should be able to invoke the method of your choice.
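For example, something along these lines (a sketch, untested on A64FX; note that generic_lufact! is internal to LinearAlgebra, so its name and signature could change between Julia versions):

```julia
using LinearAlgebra

n = 1000
a = rand(n, n); b = rand(n); x = rand(n)

# Force the generic (non-BLAS) dot method from generic.jl, even for Float64:
invoke(dot, Tuple{AbstractArray,AbstractArray}, x, x)

# Call the generic LU factorization directly, bypassing the BLAS getrf! path.
# generic_lufact! overwrites its argument, hence the copy:
F = LinearAlgebra.generic_lufact!(copy(a))
c = F \ b   # solve using the generic factorization
```

You can wrap either call in @btime from BenchmarkTools to compare against the BLAS-backed dot(x,x) and a\b.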

What do you see for lu and lu!?

@ctkelley For a test

n = 1000;
a = rand(Float64, n, n);
b = rand(Float64, n);
a16 = Float16.(a);
b16 = Float16.(b);
@btime a\b
@btime a16\b16

I see

julia> @btime a\b
  245.372 ms (4 allocations: 7.64 MiB)
1000-element Vector{Float64}:

julia> @btime a16\b16
  775.579 ms (4 allocations: 1.92 MiB)
1000-element Vector{Float16}:

Again the time difference seems to be due to Float64 using the LinearAlgebra BLAS backend while Float16 uses the generic.jl backend.

@carstenbauer I think I am misunderstanding what you are suggesting, so please correct me, but I’m not sure how to use Core.invoke to tell Julia to use the generic.jl routines when I perform a\b or call another method provided by LinearAlgebra.jl. I’m still relatively new to Julia and don’t quite know how to take what I see in the output of @code_llvm a\b and invoke a lower-level function. Could you provide an example of how to do that? I would really appreciate any help you can provide with this.

That factor of only three is pretty convincing. Does it scale with dimension? What happens if n=10^4?

This is fast enough for me to do the work I want to do, but not fast enough for half precision to really be useful.

It’s looking like compiling from source and changing those two lines pays off. On a 2021 MacBook Pro with an M1…

For 1.7.2

julia> n=10^4;A=rand(n,n); A32=Float32.(A); A16=Float16.(A);

julia> @btime lu!($A);
  3.446 s (2 allocations: 78.17 KiB)

julia> @btime lu!($A32);
  1.687 s (2 allocations: 78.17 KiB)

julia> @btime lu!($A16);
  235.080 s (2 allocations: 78.17 KiB)

Float16 time is 67X double and 138X single. Far worse than a factor of 3.

and for 1.8 beta 3, it’s the same.

julia> @btime lu!($A);
  3.476 s (2 allocations: 78.17 KiB)

julia> @btime lu!($A32);
  1.679 s (2 allocations: 78.17 KiB)

julia> @btime lu!($A16);
  231.528 s (2 allocations: 78.17 KiB)

and for a smaller problem with n=1000, it’s not as bad, but still bad. This is for 1.8 beta 3.

julia> n=10^3;

julia> A=rand(n,n); A32=Float32.(A); A16=Float16.(A);

julia> @btime lu!($A);
  7.952 ms (1 allocation: 7.94 KiB)

julia> @btime lu!($A32);
  4.987 ms (1 allocation: 7.94 KiB)

julia> @btime lu!($A16);
  220.491 ms (1 allocation: 7.94 KiB)

I finally got around to trying the recompile and build from source with 1.8 beta 3.

The performance of Float16 lu seems to be 35X worse than double even with the modification to the Julia source and the recompile, though that is still a bit better than without it. It looks like we will have to wait for the Apple version of MKL.

Herewith the details. I have a script to dump the timings and the ratio of half to double as a function of dimension.

I ran it on an Intel Mac and a MacBook Pro with an M1, both on macOS 12.3.1.

julia> using LinearAlgebra, BenchmarkTools, Printf

julia> function halftime()
           for n in [128, 256, 512, 1024, 2048, 4096, 8192]
               A = rand(n, n); A16 = Float16.(A);
               t64 = @belapsed lu($A);
               t16 = @belapsed lu($A16);
               @printf("dim = %5d    t64 = %5.2e    t16 = %5.2e   ratio = %5.2e \n",
                       n, t64, t16, t16 / t64)
           end
       end
halftime (generic function with 1 method)

First, the results from a 2019 8-core iMac, Julia 1.7.2, MKL BLAS. The half-precision computation seems to settle at roughly 300X the cost of the double-precision one.

julia> halftime()
dim =   128    t64 = 6.12e-05    t16 = 1.44e-03   ratio = 2.35e+01 
dim =   256    t64 = 1.85e-04    t16 = 1.14e-02   ratio = 6.18e+01 
dim =   512    t64 = 7.03e-04    t16 = 9.18e-02   ratio = 1.31e+02 
dim =  1024    t64 = 4.32e-03    t16 = 7.39e-01   ratio = 1.71e+02 
dim =  2048    t64 = 3.24e-02    t16 = 5.89e+00   ratio = 1.82e+02 
dim =  4096    t64 = 1.53e-01    t16 = 4.74e+01   ratio = 3.11e+02 
dim =  8192    t64 = 1.12e+00    t16 = 3.81e+02   ratio = 3.39e+02

Now the same computation from a 2020 M1 MacBook Pro, Julia 1.8 beta 3. This Mac is about 2X slower than the Intel for double, but nearly 3X as fast for half. The half precision sits at 60X slower than Float64, and the slowdown is still increasing with dimension.

dim =   128    t64 = 2.01e-04    t16 = 4.92e-04   ratio = 2.45e+00 
dim =   256    t64 = 5.08e-04    t16 = 3.78e-03   ratio = 7.44e+00 
dim =   512    t64 = 1.68e-03    t16 = 2.98e-02   ratio = 1.77e+01 
dim =  1024    t64 = 9.68e-03    t16 = 2.38e-01   ratio = 2.46e+01 
dim =  2048    t64 = 6.30e-02    t16 = 1.98e+00   ratio = 3.15e+01 
dim =  4096    t64 = 3.27e-01    t16 = 1.59e+01   ratio = 4.87e+01 
dim =  8192    t64 = 2.09e+00    t16 = 1.27e+02   ratio = 6.07e+01

Finally, the computation from a 2020 M1 MacBook Pro, Julia 1.8 beta 3 with the modified source. Half seems to be about 1.5X faster than with the stock build. Better than nothing, but hard to get excited about.

dim =   128    t64 = 1.89e-04    t16 = 2.94e-04   ratio = 1.56e+00 
dim =   256    t64 = 4.97e-04    t16 = 2.26e-03   ratio = 4.56e+00 
dim =   512    t64 = 1.69e-03    t16 = 1.76e-02   ratio = 1.04e+01 
dim =  1024    t64 = 9.46e-03    t16 = 1.41e-01   ratio = 1.49e+01 
dim =  2048    t64 = 5.83e-02    t16 = 1.16e+00   ratio = 1.98e+01 
dim =  4096    t64 = 3.24e-01    t16 = 9.10e+00   ratio = 2.81e+01 
dim =  8192    t64 = 2.12e+00    t16 = 7.36e+01   ratio = 3.48e+01 

I’m not sure why you’re surprised: Hardware Float16 on A64fx · Issue #40216 · JuliaLang/julia · GitHub (already mentioned above, and you also commented in that issue) is still open; no one has fixed it yet.

Don’t think I said I was surprised. Just reporting my findings.


Perhaps RecursiveFactorization.jl gives you better results for half precision lu? I can’t test it, because I am on my phone.
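Untested (again, phone), but I believe its lu! mirrors LinearAlgebra.lu!, so the comparison would look something like the following. Whether its vectorized kernels handle Float16 well, I don’t know:

```julia
using LinearAlgebra, BenchmarkTools
import RecursiveFactorization

n = 1000
A = rand(n, n); A16 = Float16.(A)

# Pure-Julia LU, so Float64 and Float16 should go through the same code path:
@btime RecursiveFactorization.lu!(copy($A));
@btime RecursiveFactorization.lu!(copy($A16));
```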

Interesting idea. It’ll take me a while to try this. My use cases are lu and lu! and I’m working on a paper and a book right now.

If you can make the comparison, please report back.

I see a Turing award nomination if you can do it on a phone.