Float64 is typecast to Float16

I am trying to calculate the mean of some values of type Float16. When performing the mean operation, I see the value is Inf16. I tried assigning the result to a variable of type Float64 as well, but it did not work out and the type of the final mean is still Float16. Can you suggest how to handle the result going out of the range of Float16 here?

You need to convert before the calculation happens; converting the result won’t work because the ‘true value’ is already lost by then.
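
For example (the data here is made up; any Float16 array whose true sum exceeds floatmax(Float16) ≈ 6.55e4 behaves the same way on the Julia version discussed in this thread):

julia> using Statistics

julia> a = fill(Float16(60.0), 10_000);   # hypothetical data: the true sum (600000) is far above floatmax(Float16)

julia> mean(a)                            # the intermediate sum is accumulated in Float16 and overflows
Inf16

julia> mean(Float64.(a))                  # converting the elements before the calculation keeps everything in Float64
60.0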

Or, if you do not want to convert the whole array to Float64, and you do not mind losing some of sum’s performance, you can just define your “own sum” that uses a Float64 as the accumulator:

julia> a = rand(Float16, 1000);

julia> r = sum(a)
Float16(496.8)

julia> r2 = foldl(+, a; init = zero(Float64))
497.798828125

sum(Float64, a) also works.

5 Likes

Oh, ok. I did not know that sum could take a function as its first parameter; that makes sense.

If the smaller type is always promoted before the addition, then sum(Float64, a) seems the way to go.
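
(For reference, mixed-precision addition in Julia does promote to the wider type; a quick REPL check, using nothing beyond Base:)

julia> promote_type(Float16, Float64)
Float64

julia> Float16(1.5) + 1.0   # the Float16 operand is promoted, so the addition happens in Float64
2.5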

I think there’s some misunderstanding here. Variables don’t have a type, so you cannot assign anything to a “variable of type Float64”. If you assign a value of type Float16 to a variable, that will not change the value’s type.

It is possible to declare a type for a variable, in which case any assignment to that variable converts the right-hand side to that type.
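
A minimal sketch of that (the function name total is made up for illustration): declaring the accumulator as Float64 means every assignment to it converts the right-hand side, so the sum never overflows in Float16.

julia> function total(a)
           s::Float64 = 0    # s is declared Float64; every assignment converts the RHS to Float64
           for x in a
               s += x        # the Float16 element is promoted before the addition
           end
           return s
       end
total (generic function with 1 method)

julia> s = total(rand(Float16, 1000));   # s is a Float64, no Inf16 even for large arrays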

2 Likes

It looks like using sum as a higher-order function allocates in this case. If performance matters, a variant of this approach is to use reduce, taking care to initialize the accumulator to a Float64 zero. This does not allocate and should be faster for not-too-large arrays:

julia> x = rand(Float16, 10_000);

# standard use of sum, everything in Float16
julia> @btime sum($x)
  198.709 μs (0 allocations: 0 bytes)
Float16(4.972e3)

# using sum as a higher-order function to convert each element to Float64
julia> s1(x) = sum(Float64, x)
julia> @btime s1($x)
  930.281 μs (29999 allocations: 468.73 KiB)
4976.8818359375

# using reduce with the accumulator initialized to a Float64 zero
julia> s2(x) = reduce(+, x, init=0.)
julia> @btime s2($x)
  32.444 μs (0 allocations: 0 bytes)
4976.8818359375

2 Likes

It seems like this is an inference bug of some kind. If I pass an anonymous function instead of the Float64 constructor, I get no allocations:

julia> s1(x) = sum(y -> Float64(y), x)
s1 (generic function with 1 method)

julia> @btime s1($x);
  41.182 μs (0 allocations: 0 bytes)

Filed as julia#36783.

5 Likes