Float64 is typecast to Float16

I am trying to calculate the mean of some values of type Float16. When performing the mean operation, I see the value is Inf16. I tried assigning the result to a variable of type Float64 as well, but it did not work out and the type of the final mean is still Float16. Can you suggest how to handle the result going out of the range of Float16 here?

You need to convert before the calculation happens; converting the result won’t work because the ‘true value’ is already lost by then.
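
For example (the data here is made up; any Float16 array whose true sum exceeds floatmax(Float16) ≈ 6.55e4 behaves the same way on the Julia version discussed in this thread):

julia> using Statistics

julia> a = fill(Float16(60.0), 10_000);   # hypothetical data: the true sum (600000) is far above floatmax(Float16)

julia> mean(a)                            # the intermediate sum is accumulated in Float16 and overflows
Inf16

julia> mean(Float64.(a))                  # converting the elements before the calculation keeps everything in Float64
60.0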

Or, if you do not want to convert the whole array to Float64, and you do not mind losing some of sum’s performance, you can just define your “own sum” that uses a Float64 as the accumulator:

julia> a = rand(Float16, 1000);

julia> r = sum(a)
Float16(496.8)

julia> r2 = foldl(+, a; init = zero(Float64))
497.798828125

sum(Float64, a) also works.

5 Likes

Oh, ok. I did not know that sum could take a function as its first parameter; that makes sense.

If the smaller type is always promoted before the addition, then sum(Float64, a) seems the way to go.
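
(For reference, mixed-precision addition in Julia does promote to the wider type; a quick REPL check, using nothing beyond Base:)

julia> promote_type(Float16, Float64)
Float64

julia> Float16(1.5) + 1.0   # the Float16 operand is promoted, so the addition happens in Float64
2.5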

I think there’s some misunderstanding here. Variables don’t have a type, so you cannot assign anything to a “variable of type Float64”. If you assign a value of type Float16 to a variable, that will not change the value’s type.

It is possible to declare a type for a variable, in which case any assignment to that variable converts the right-hand side to that type.
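
A minimal sketch of that (the function name total is made up for illustration): declaring the accumulator as Float64 means every assignment to it converts the right-hand side, so the sum never overflows in Float16.

julia> function total(a)
           s::Float64 = 0    # s is declared Float64; every assignment converts the RHS to Float64
           for x in a
               s += x        # the Float16 element is promoted before the addition
           end
           return s
       end
total (generic function with 1 method)

julia> s = total(rand(Float16, 1000));   # s is a Float64, no Inf16 even for large arrays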

2 Likes

It looks like using sum as a higher-order function allocates in this case. If performance matters, a variant of this approach is to use reduce, taking care to initialize the accumulator to a Float64 zero. This does not allocate and should be faster for not-too-large arrays:

julia> x = rand(Float16, 10_000);

# standard use of sum, everything in Float16
julia> @btime sum($x)
  198.709 μs (0 allocations: 0 bytes)
Float16(4.972e3)

# using sum as a higher-order function to convert each element to Float64
julia> s1(x) = sum(Float64, x)
julia> @btime s1($x)
  930.281 μs (29999 allocations: 468.73 KiB)
4976.8818359375

# using reduce with the accumulator initialized to a Float64 zero
julia> s2(x) = reduce(+, x, init=0.)
julia> @btime s2($x)
  32.444 μs (0 allocations: 0 bytes)
4976.8818359375

2 Likes

It seems like this is an inference bug of some kind. If I pass an anonymous function instead of the Float64 constructor, I get no allocations:

julia> s1(x) = sum(y -> Float64(y), x)
s1 (generic function with 1 method)

julia> @btime s1($x);
  41.182 μs (0 allocations: 0 bytes)

Filed as julia#36783.

5 Likes