Massive performance penalty for Float16 compared to Float32

performance

#1

Hello,

I’m currently working through Andrew Ng’s Coursera lectures on Deep Learning and want to implement what I learned in Julia (the lectures use Python) so I can check my understanding. There is a small neural net, it takes some time to train, and when I changed the numerical type from Float64 to Float32 I got roughly a 30% speedup.

Great, I thought, and tried Float16. That didn’t go so well: time went up by a factor of 100 compared to Float64.
Time. Not speed.

I tried to isolate the issue and found a factor-of-10 penalty for pointwise multiplication of matrices.

This is the code in my test file:

numType = Float16

A = rand(numType, 10000, 10000)
B = rand(numType, 10000, 10000)
C = Array{numType, 2}(undef, 10000, 10000)

@time C .= A .* B

and the result is as follows (without any warm-up, just starting the file, so this is a very crude test):

The first two runs with Float32:

  0.235396 seconds (46.17 k allocations: 2.444 MiB)

  0.209013 seconds (46.17 k allocations: 2.444 MiB)

And two runs with Float16:

  1.848572 seconds (46.60 k allocations: 2.468 MiB)

  1.847379 seconds (46.60 k allocations: 2.468 MiB)

I found in the docs that Float16 is “implemented in software”, but that’s less performant than I expected.

Am I doing something wrong?

Thank you, and kind regards, z.


#2

By the way, why is it allocating memory in the first place?


#3

Don’t benchmark in global scope. Put the code in a function and run it twice.
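For example, a minimal sketch assuming the BenchmarkTools package is available:

using BenchmarkTools

# Wrapping the work in a function moves it out of global scope,
# so the compiler can specialize on the argument types.
elementwise_product(A, B) = A .* B

A = rand(Float32, 10000, 10000)
B = rand(Float32, 10000, 10000)

# @btime runs the call repeatedly and reports the minimum time,
# so first-call compilation overhead is excluded.
@btime elementwise_product($A, $B);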


#4

Float16 is in an odd place right now. For every calculation, the Float16 values are first converted to Float32, the computation is performed in Float32, and the result is converted back to Float16. This is necessary since most CPUs have no native support for Float16 arithmetic. The only hardware where Float16 is really relevant is GPUs; otherwise it is primarily a storage type.
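One practical consequence of it being a storage type (a sketch, not from the thread): keep the data in Float16 to save memory, but widen to Float32 for the actual computation and narrow the result back.

# Store in Float16 for memory savings, compute in Float32.
A16 = rand(Float16, 10000, 10000)
B16 = rand(Float16, 10000, 10000)

A32 = Float32.(A16)     # widen once
B32 = Float32.(B16)
C32 = A32 .* B32        # runs at native Float32 speed
C16 = Float16.(C32)     # narrow back for storage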


#5

Check out how Float16 multiplication is defined:

julia> x, y = rand(Float16), rand(Float16)
(Float16(0.3818), Float16(0.825))

julia> @which x*y
*(a::Float16, b::Float16) in Base at float.jl:372

julia> @edit x*y

You’ll see that it first converts to Float32, performs the multiplication, and then converts back to Float16. Hence it cannot be as fast as Float32: FPUs support Float32 natively but not Float16.
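Roughly, the pattern is the following (f16_mul is just an illustrative name, not the actual Base definition):

# Widen to Float32, multiply, narrow the result back to Float16.
f16_mul(a::Float16, b::Float16) = Float16(Float32(a) * Float32(b))

x, y = rand(Float16), rand(Float16)
f16_mul(x, y) == x * y   # should match what Base's * returns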

Even more significantly, Julia uses parallelized BLAS routines (written in Fortran) for Float32. For Float16 the fallback is not much better than naive multiplication (it’s a little more cache-friendly, but neither parallelized nor as optimized as the Float32 routines).


#6

@ChrisRackauckas Ah, right. Should have known that.

> t32();
  0.158636 seconds

> t32();
  0.134481 seconds

> t16();
  1.785036 seconds

> t16();
  1.782366 seconds

for

function t16()

    numType = Float16

    A = rand(numType, 10000, 10000)
    B = rand(numType, 10000, 10000)
    C = Array{numType, 2}(undef, 10000, 10000)

    @time C .= A .* B
end


function t32()

    numType = Float32

    A = rand(numType, 10000, 10000)
    B = rand(numType, 10000, 10000)
    C = Array{numType, 2}(undef, 10000, 10000)

    @time C .= A .* B
end

#7

Thank you all for the explanations!


#8

Intel added vector instructions for converting to/from 16-bit floats many years ago and, in fact, showed that (because of using half the memory and getting better cache utilization) 16-bit could be faster than 32-bit for larger operations, and not much slower for smaller vectors.

https://software.intel.com/en-us/articles/performance-benefits-of-half-precision-floats

It seems that making sure Julia can use these SIMD instructions when doing vector operations on 16-bit floats could achieve some nice performance benefits.
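As an illustration (a sketch only; whether the compiler actually emits the F16C conversion instructions depends on the Julia/LLVM version and the target CPU), the elementwise product could be written as an explicit loop that widens and narrows per element, giving the vectorizer a chance to use those conversions:

# Elementwise product of Float16 arrays, computed in Float32 per element.
function mul16!(C::Array{Float16}, A::Array{Float16}, B::Array{Float16})
    @inbounds @simd for i in eachindex(A, B, C)
        C[i] = Float16(Float32(A[i]) * Float32(B[i]))
    end
    return C
end

A = rand(Float16, 10000, 10000)
B = rand(Float16, 10000, 10000)
C = similar(A)
mul16!(C, A, B)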