# Massive performance penalty for Float16 compared to Float32

#1

Hello,

I'm currently working through the Coursera lectures on Deep Learning by Andrew Ng, and I want to implement what I learned in Julia (the lectures use Python) so I can check my understanding. There is a small neural net, it takes a while to run, and changing the numerical type from Float64 to Float32 gave me about a 30% speedup.

Great, I thought, and tried Float16. That didn't go so well: the time went up by a factor of 100 compared to Float64.
Time. Not speed.

I tried to isolate the issue and found a factor-of-10 penalty for pointwise multiplication of matrices.

This is the code in my test file:

```julia
numType = Float16

A = rand(numType, 10000, 10000)
B = rand(numType, 10000, 10000)
C = Array{numType, 2}(undef, 10000, 10000)

@time C .= A .* B
```

and the result is as follows (without any warm-up, just running the file, so it's a very crude test):

The first two runs are with Float32:

```
  0.235396 seconds (46.17 k allocations: 2.444 MiB)
  0.209013 seconds (46.17 k allocations: 2.444 MiB)
```

And two runs with Float16:

```
  1.848572 seconds (46.60 k allocations: 2.468 MiB)
  1.847379 seconds (46.60 k allocations: 2.468 MiB)
```

I found in the docs that Float16 is “implemented in software”, but the penalty is bigger than I expected.

Am I doing something wrong?

Thank you, and kind regards, z.

#2

Btw: why is it allocating memory in the first place?

#3

Don't benchmark in global scope. Put the code in a function and run it twice (the first run includes compilation).
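
A minimal sketch with the BenchmarkTools package (the `$` interpolation keeps the untyped globals out of the measured code):

```julia
using BenchmarkTools

A = rand(Float16, 10_000, 10_000)
B = rand(Float16, 10_000, 10_000)
C = similar(A)

# @btime runs the expression many times and reports the minimum time
@btime $C .= $A .* $B;
```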

#4

`Float16` is in an odd place right now. For every calculation, the `Float16` is first converted to a `Float32`, the computation is performed in `Float32`, and the result is converted back to `Float16`. This is necessary since most hardware has no native support for `Float16`. The only hardware where `Float16` is really relevant is GPUs; otherwise it is primarily a storage type.
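
In practice that means treating `Float16` as a storage format and widening to `Float32` for the actual arithmetic, along these lines (a sketch, not benchmarked here):

```julia
A16 = rand(Float16, 10_000, 10_000)
B16 = rand(Float16, 10_000, 10_000)

A32 = Float32.(A16)             # widen once
B32 = Float32.(B16)
C16 = Float16.(A32 .* B32)      # compute in Float32, narrow back for storage
```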

#5

Check out how `Float16` multiplication is defined:

```julia
julia> x, y = rand(Float16), rand(Float16)
(Float16(0.3818), Float16(0.825))

julia> @which x*y
*(a::Float16, b::Float16) in Base at float.jl:372

julia> @edit x*y
```

You’ll see it first converts to `Float32`, performs the multiplication, and then converts back to `Float16`. Hence it cannot be as fast as `Float32`. FPUs support `Float32` natively but not `Float16`.
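
The definition amounts to roughly this (the exact code in `float.jl` differs between Julia versions):

```julia
# Float16 arithmetic round-trips through Float32
*(a::Float16, b::Float16) = Float16(Float32(a) * Float32(b))
```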

Even more significantly, Julia uses parallelized BLAS routines (written in Fortran) for `Float32` matrix multiplication. For `Float16` it falls back to a generic implementation that is not much better than naive multiplication (a little more cache friendly, but neither parallelized nor as optimized as the `Float32` path).
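
You can see that gap directly with a quick matrix-multiplication timing (a sketch; exact numbers depend on your machine):

```julia
using LinearAlgebra, BenchmarkTools

A32 = rand(Float32, 2000, 2000)
A16 = rand(Float16, 2000, 2000)

@btime $A32 * $A32;   # BLAS-backed, multithreaded
@btime $A16 * $A16;   # generic fallback, expect a large slowdown
```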

#6

@ChrisRackauckas Ah, right. Should have known that.

```
> t32();
  0.158636 seconds

> t32();
  0.134481 seconds

> t16();
  1.785036 seconds

> t16();
  1.782366 seconds
```

for

```julia
function t16()
    numType = Float16

    A = rand(numType, 10000, 10000)
    B = rand(numType, 10000, 10000)
    C = Array{numType, 2}(undef, 10000, 10000)

    @time C .= A .* B
end

function t32()
    numType = Float32

    A = rand(numType, 10000, 10000)
    B = rand(numType, 10000, 10000)
    C = Array{numType, 2}(undef, 10000, 10000)

    @time C .= A .* B
end
```

#7

Thank you all for the explanations!

#8

Intel added vector instructions for converting to/from 16-bit floats many years ago, and in fact showed that, because 16-bit floats use half the memory and make better use of the cache, they could be faster than 32-bit for larger operations, and not much slower for smaller vectors.

https://software.intel.com/en-us/articles/performance-benefits-of-half-precision-floats

It seems that making sure Julia can use these SIMD conversion instructions when doing vector operations on 16-bit floats could achieve some nice performance benefits.
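
One way to check whether the compiler already emits the F16C conversion instructions on x86 is to inspect the generated code and look for `vcvtph2ps`/`vcvtps2ph` (a sketch; the result depends on your CPU and Julia version):

```julia
using InteractiveUtils  # for @code_native (loaded by default in the REPL)

# Sum a Float16 vector by widening each element to Float32
sum32(x::Vector{Float16}) = sum(Float32, x)

# Look for vcvtph2ps in the output on F16C-capable hardware
@code_native sum32(rand(Float16, 1024))
```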