Massive performance penalty for Float16 compared to Float32

Hello,

I'm currently working through the Coursera lectures on Deep Learning by Andrew Ng and want to implement what I learned in Julia (the lectures use Python) so I can check my understanding. There is a small neural net that takes a while to train, and when I changed the numerical type from Float64 to Float32 I got roughly a 30% speed-up.

Great, I thought, and tried Float16. That didn't go so well: time went up by a factor of 100 compared to Float64.
Time. Not speed.

I tried to isolate the issue and found a penalty of a factor of 10 for element-wise multiplication of matrices.

This is the code in my test file:

numType = Float16

A = rand(numType, 10000, 10000)
B = rand(numType, 10000, 10000)
C = Array{numType, 2}(undef, 10000, 10000)

@time C .= A .* B

and the result is (no warm-up, just starting the file, so this is a very crude test):

The first two runs are with Float32:

  0.235396 seconds (46.17 k allocations: 2.444 MiB)

  0.209013 seconds (46.17 k allocations: 2.444 MiB)

And these two are with Float16:

  1.848572 seconds (46.60 k allocations: 2.468 MiB)

  1.847379 seconds (46.60 k allocations: 2.468 MiB)

I found in the docs that Float16 is “implemented in software”, but that’s less performant than I expected.

Am I doing something wrong?

Thank you, and kind regards, z.

Btw. Why is it allocating memory in the first place?

Don’t benchmark in the global scope. Put that in a function and run twice.
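
For what it's worth, the BenchmarkTools.jl package is a common way to get the warm-up and repeated runs handled automatically; here is a minimal sketch of that approach (array sizes copied from the post above):

using BenchmarkTools

A = rand(Float32, 10_000, 10_000)
B = rand(Float32, 10_000, 10_000)
C = similar(A)

# Interpolating with `$` avoids timing the lookup of untyped globals.
@btime $C .= $A .* $B;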

4 Likes

Float16 is in an odd place right now. For every calculation, the Float16 values are first converted to Float32, the computation is performed in Float32, and the result is converted back to Float16. This is necessary because most hardware has no native support for Float16. The only hardware where Float16 is really relevant is GPUs; otherwise it is primarily a storage type.

6 Likes

Check out how Float16 multiplication is defined:

julia> x, y = rand(Float16), rand(Float16)
(Float16(0.3818), Float16(0.825))

julia> @which x*y
*(a::Float16, b::Float16) in Base at float.jl:372

julia> @edit x*y

You’ll see it first converts to Float32, performs the multiplication, and then converts back to Float16. Hence it cannot be as fast as Float32. FPUs support Float32 natively but not Float16.
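
In other words, the definition is roughly equivalent to the following (a paraphrase for illustration; `mul16` is just a throwaway name, not anything in Base):

mul16(x::Float16, y::Float16) = Float16(Float32(x) * Float32(y))

x, y = Float16(0.3818), Float16(0.825)
mul16(x, y) == x * y    # should agree with Base's Float16 multiplication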

Even more significantly, for matrix multiplication Julia uses parallelized BLAS routines for Float32. For Float16 it falls back to a generic implementation that is not much better than naive multiplication (it's a little more cache-friendly, but neither parallelized nor as optimized as the Float32 path).
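
If you want to see that gap for matrix multiplication itself (as opposed to the element-wise broadcast above), a quick check along these lines works; the sizes are arbitrary and the exact timings depend on your machine and BLAS build:

A32 = rand(Float32, 2000, 2000); B32 = rand(Float32, 2000, 2000)
A16 = Float16.(A32);             B16 = Float16.(B32)

A32 * B32; A16 * B16;    # run once first to compile both paths

@time A32 * B32;         # dispatches to the multithreaded BLAS sgemm
@time A16 * B16;         # falls back to the generic Julia matmul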

6 Likes

@ChrisRackauckas Ah, right. Should have known that.

> t32();
  0.158636 seconds

> t32();
  0.134481 seconds

> t16();
  1.785036 seconds

> t16();
  1.782366 seconds

for

function t16()

    numType = Float16

    A = rand(numType, 10000, 10000)
    B = rand(numType, 10000, 10000)
    C = Array{numType, 2}(undef, 10000, 10000)

    @time C .= A .* B
end


function t32()

    numType = Float32

    A = rand(numType, 10000, 10000)
    B = rand(numType, 10000, 10000)
    C = Array{numType, 2}(undef, 10000, 10000)

    @time C .= A .* B
end

Thank you all for the explanations!

Intel added vector instructions for converting to and from 16-bit floats many years ago, and in fact showed that (because of using half the memory and getting better cache utilization) 16-bit could be faster than 32-bit for larger operations, and not much slower for smaller vectors.

https://software.intel.com/en-us/articles/performance-benefits-of-half-precision-floats

It seems that making sure Julia can use the SIMD instructions when doing vector operations on 16-bit floats could achieve some nice performance benefits.
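
One way to check what your own CPU/Julia combination actually does is to look at the generated code for a small Float16 kernel; on x86 CPUs with the F16C extension you would hope to see vcvtph2ps/vcvtps2ph conversion instructions rather than calls into a software conversion routine (the kernel below is just an illustrative example):

using InteractiveUtils

# A tiny kernel so the native code stays readable.
scale16!(y::Vector{Float16}, a::Float16, x::Vector{Float16}) = (y .= a .* x; y)

x = rand(Float16, 1024); y = similar(x); a = Float16(0.5)
@code_native scale16!(y, a, x)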

8 Likes

Is it still the case now, in June 2022, that Float16 is first converted to Float32 before any calculation and converted back to Float16 after the calculation is done?

There is not yet widespread support for Float16 in floating-point hardware, so yes.
However, see the next note by @giordano.

Apple Silicon CPUs such as the M1 have hardware support for Float16.

2 Likes

Is the Float16 support fully integrated with whatever SIMD instructions they support?

also informative is this from 2020

I ran this benchmark on A64FX, another CPU with hardware support for Float16, and SIMD scaling was pretty good. I seem to recall I tried the same on the M1, with comparable results.

@JeffreySarnoff @giordano Thanks a lot! I don't work on an M1 or a Fujitsu A64FX. Does this mean I basically stand no chance at the moment?

A dumb question: who should I expect to solve the Float16 problem? The CPU manufacturers, the Julia community, or both?

On the Julia side, we could make Float16 matmul pretty fast (@celrod), but for general use we need CPU support.

Thanks, buddy. By 'pretty fast', do you mean that we can achieve something faster than fp32, so that there is a clear advantage to using fp16 if only half precision is needed?

For small matrices, Float32 will be faster without hardware support, but for matrices bigger than the cache you might be able to go faster than Float32 by converting as you pack sub-blocks.
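
A minimal sketch of that "convert while packing" idea, assuming C = A*B with Float16 storage. `blocked_f16_mul` and the block size `bs` are names made up here for illustration; a real kernel would tune the blocking to the cache hierarchy and avoid the remaining temporaries:

using LinearAlgebra

function blocked_f16_mul(A::Matrix{Float16}, B::Matrix{Float16}; bs::Int = 256)
    m, k = size(A)
    k2, n = size(B)
    k == k2 || throw(DimensionMismatch("inner dimensions must match"))
    C32  = zeros(Float32, m, n)               # accumulate in Float32
    Ablk = Matrix{Float32}(undef, bs, bs)     # reusable Float32 packing buffers
    Bblk = Matrix{Float32}(undef, bs, bs)
    for jj in 1:bs:n, kk in 1:bs:k, ii in 1:bs:m
        i1, j1, k1 = min(ii + bs - 1, m), min(jj + bs - 1, n), min(kk + bs - 1, k)
        ai, bj, ck = i1 - ii + 1, j1 - jj + 1, k1 - kk + 1
        # pack: copy the Float16 sub-blocks into Float32 scratch space,
        # converting element by element as we go
        @views Ablk[1:ai, 1:ck] .= A[ii:i1, kk:k1]
        @views Bblk[1:ck, 1:bj] .= B[kk:k1, jj:j1]
        # multiply the packed blocks in Float32 (the optimized path)
        @views mul!(C32[ii:i1, jj:j1], Ablk[1:ai, 1:ck], Bblk[1:ck, 1:bj], true, true)
    end
    return Float16.(C32)                      # narrow back to Float16 once at the end
end

C = blocked_f16_mul(rand(Float16, 1000, 1200), rand(Float16, 1200, 800))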

Thanks for the hint! @Oscar_Smith