Massive performance penalty for Float16 compared to Float32


I am working through the Coursera lectures on Deep Learning by Andrew Ng and want to implement what I learned in Julia (the lectures use Python) so I can check my understanding. I have a small neural net that takes a while to run, so I changed the numeric type from Float64 to Float32 and got about a 30% speedup.

Great, I think, and try Float16. That didn’t go so well. Time went up by a factor of 100 compared to Float64.
Time. Not speed.

I tried to isolate the issue and found a penalty of a factor of 10 for element-wise multiplication of matrices.

This is the code in my test file:

numType = Float16

A = rand(numType, 10000, 10000)
B = rand(numType, 10000, 10000)
C = Array{numType, 2}(undef, 10000, 10000)  # Julia 1.0+ requires `undef` for an uninitialized array

@time C .= A .* B

and the result is (without any warm-up; I just ran the file, so this is a very crude test):

First, two runs with Float32:

  0.235396 seconds (46.17 k allocations: 2.444 MiB)

  0.209013 seconds (46.17 k allocations: 2.444 MiB)

And two runs with Float16:

  1.848572 seconds (46.60 k allocations: 2.468 MiB)

  1.847379 seconds (46.60 k allocations: 2.468 MiB)

I found in the docs that Float16 is “implemented in software”, but that’s less performant than I expected.

Am I doing something wrong?

Thank you, and kind regards, z.

Btw. Why is it allocating memory in the first place?

Don’t benchmark in the global scope. Put that in a function and run twice.
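A minimal sketch of that advice, assuming a hypothetical helper named time_mul (not from the original post). The first call includes compilation, so only the second measurement is meaningful. The allocations reported by @time in the global-scope test most likely come from compilation and from working with non-constant globals, not from the broadcast itself.

```julia
# Hypothetical helper: times the in-place element-wise product for element type T.
function time_mul(::Type{T}, n::Integer) where {T}
    A = rand(T, n, n)
    B = rand(T, n, n)
    C = similar(A)                 # preallocated output, same type and size as A
    return @elapsed C .= A .* B    # seconds spent in the fused broadcast
end

time_mul(Float32, 1000)            # first call: also pays for compilation
t32 = time_mul(Float32, 1000)      # second call: measures the broadcast itself
t16 = time_mul(Float16, 1000)      # still includes compilation for Float16
t16 = time_mul(Float16, 1000)      # rerun to get the steady-state number
```

For more careful measurements, the BenchmarkTools package and its @btime macro are the usual choice.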


Float16 is in an odd place right now. For every calculation, the Float16 value is first converted to Float32, the computation is performed in Float32, and the result is converted back to Float16. This is necessary since most hardware has no native support for Float16. The only hardware where Float16 is really relevant is GPUs; otherwise it is primarily a storage type.
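To illustrate what "primarily a storage type" can mean in practice, here is a small sketch (the variable names and the arithmetic are made up for illustration): keep the data in Float16 to save memory, widen it once to Float32 for the computation, and narrow it again afterwards.

```julia
stored = rand(Float16, 1000, 1000)   # compact representation: 2 bytes per element
work   = Float32.(stored)            # widen once before the heavy computation
work  .= work .* 2f0 .+ 1f0          # arithmetic runs natively in Float32
result = Float16.(work)              # narrow again for storage

sizeof(stored), sizeof(work)         # (2000000, 4000000) bytes
```

This pays for the conversion twice in total, not once per arithmetic operation.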


Check out how Float16 multiplication is defined:

julia> x, y = rand(Float16), rand(Float16)
(Float16(0.3818), Float16(0.825))

julia> @which x*y
*(a::Float16, b::Float16) in Base at float.jl:372

julia> @edit x*y

You’ll see it first converts to Float32, performs the multiplication, and then converts back to Float16. Hence it cannot be as fast as Float32. FPUs support Float32 natively but not Float16.
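As a rough illustration (this is a hypothetical stand-in, not the actual source in float.jl), the software path behaves like the following function. Since the product of two Float16 values is exactly representable in Float32, the result should agree with the built-in multiplication.

```julia
# Hypothetical stand-in for the convert, compute, convert-back pattern.
widen_mul(a::Float16, b::Float16) = Float16(Float32(a) * Float32(b))

x, y = rand(Float16), rand(Float16)
widen_mul(x, y) == x * y   # the two results are expected to match
```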

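Even more significantly, for matrix multiplication (A * B, as opposed to the element-wise A .* B in your test) Julia uses parallelized, heavily optimized BLAS routines for Float32. For Float16 it falls back to a generic implementation that is not much better than naive multiplication (it’s a little more cache friendly, but neither parallelized nor as optimized as the Float32 path).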
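A quick way to see this on your own machine is sketched below, assuming a hypothetical helper named time_matmul. The numbers depend heavily on the CPU, the BLAS library, and the Julia version, so no expected timings are given.

```julia
using LinearAlgebra

# Hypothetical helper: times one n-by-n matrix product for element type T.
function time_matmul(::Type{T}, n::Integer) where {T}
    A = rand(T, n, n)
    B = rand(T, n, n)
    A * B                    # warm-up call so compilation is not timed
    return @elapsed A * B    # seconds for the second product
end

t32 = time_matmul(Float32, 512)   # dispatches to BLAS
t16 = time_matmul(Float16, 512)   # uses the generic fallback
```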


@ChrisRackauckas Ah, right. Should have known that.

> t32();
  0.158636 seconds

> t32();
  0.134481 seconds

> t16();
  1.785036 seconds

> t16();
  1.782366 seconds


function t16()
    numType = Float16

    A = rand(numType, 10000, 10000)
    B = rand(numType, 10000, 10000)
    C = Array{numType, 2}(undef, 10000, 10000)  # uninitialized output buffer

    @time C .= A .* B
end

function t32()
    numType = Float32

    A = rand(numType, 10000, 10000)
    B = rand(numType, 10000, 10000)
    C = Array{numType, 2}(undef, 10000, 10000)  # uninitialized output buffer

    @time C .= A .* B
end

Thank you all for the explanations!

Intel added vector instructions for converting to and from 16-bit floats many years ago, and in fact showed that, because 16-bit data uses half the memory and makes better use of the cache, it could be faster than 32-bit for larger operations and not much slower for smaller vectors.

It seems that making sure Julia can use these SIMD instructions for vector operations on 16-bit floats could achieve some nice performance benefits.
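A sketch of the kind of loop this refers to, assuming a hypothetical kernel named mul16!. The conversions are written out explicitly and @simd gives the compiler permission to vectorize; whether it actually emits the hardware conversion instructions depends on the CPU and the Julia/LLVM version, so treat this as an experiment and not a guaranteed speedup.

```julia
# Hypothetical kernel: load Float16, multiply in Float32, store Float16.
function mul16!(C::AbstractArray{Float16}, A::AbstractArray{Float16}, B::AbstractArray{Float16})
    # eachindex with several arrays checks that their indices match,
    # which is what makes @inbounds safe here.
    @inbounds @simd for i in eachindex(C, A, B)
        C[i] = Float16(Float32(A[i]) * Float32(B[i]))
    end
    return C
end

A = rand(Float16, 1000, 1000)
B = rand(Float16, 1000, 1000)
C = similar(A)
mul16!(C, A, B)
```

Inspecting the generated code with @code_native mul16!(C, A, B) shows whether vector conversion instructions were used on a given machine.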


Is this still the case now in June 2022 that Float16 is first converted to Float32 before any calculation is invoked and converted back to Float16 after the calculation is done?

There is not yet widespread support for Float16 in floating-point hardware. So yes.
However, see the next note by @giordano.

The Apple Silicon CPUs such as the M1 have hardware support for float16


Is the float16 support fully interwoven into whatever SIMD they support?

also informative is this from 2020

I ran this benchmark on a64fx, another CPU with hardware support for float16, and simd scaling was pretty good. I seem to recall I tried the same on M1, with comparable results.

@JeffreySarnoff @giordano Thanks a lot! I don’t work on M1 or Fujitsu a64fx. Does this mean that I basically stand no chance at the moment?

A dumb question - who should I expect to have the Float16 problem solved? The CPU manufacturer or Julia community? Or both?

On the Julia side, we could make float16 matmul pretty fast (@celrod), but for general use, we need CPU support.

Thanks, buddy. By ‘pretty fast’, do you mean that we can achieve something faster than fp32, so that there is a clear advantage to using fp16 if only half precision is needed?

For small matrices, float32 will be faster without hardware support, but for matrices bigger than cache, you might be able to go faster than float32 by converting as you pack sub-blocks.
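A rough sketch of that idea, assuming a hypothetical function named blocked_mul16 and an arbitrary block size: the matrices stay in Float16 in memory, and each sub-block is converted to Float32 as it is packed, so the inner product runs through the Float32 BLAS path. This allocates a fresh block on every iteration and is nowhere near a tuned implementation; it only shows the structure.

```julia
using LinearAlgebra

# Hypothetical blocked product: Float16 inputs, Float32 accumulation.
function blocked_mul16(A::Matrix{Float16}, B::Matrix{Float16}; bs::Int = 256)
    m, k = size(A)
    k2, n = size(B)
    k == k2 || throw(DimensionMismatch("inner dimensions must match"))
    C = zeros(Float32, m, n)
    for j0 in 1:bs:n, p0 in 1:bs:k
        jr = j0:min(j0 + bs - 1, n)
        pr = p0:min(p0 + bs - 1, k)
        Bblk = Float32.(view(B, pr, jr))        # convert while packing the block
        for i0 in 1:bs:m
            ir = i0:min(i0 + bs - 1, m)
            Ablk = Float32.(view(A, ir, pr))    # convert while packing the block
            # Five-argument mul! accumulates: C[ir, jr] += Ablk * Bblk
            mul!(view(C, ir, jr), Ablk, Bblk, 1f0, 1f0)
        end
    end
    return C    # Float32 result; use Float16.(C) to store it in half precision
end

A = rand(Float16, 70, 50)
B = rand(Float16, 50, 30)
C = blocked_mul16(A, B; bs = 16)
```

Accumulating in Float32 also avoids the rounding error that would build up if every partial sum were rounded to Float16.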

Thanks for the hint! @Oscar_Smith